Various Performance Measures in Binary Classification: An Overview of ROC Study


Suresh Babu Nellore
Department of Statistics, S.V. University, Tirupati, India
E-mail: sureshbabu.nellore@gmail.com

Abstract

The Receiver Operating Characteristic (ROC) curve is used to assess the accuracy of a diagnostic test, especially in medical fields. Construction of an ROC curve is based on four possible counts or states. The ROC curve is a plot of Sensitivity vs (1 - Specificity) and is drawn between the coordinates (0,0) and (1,1). The accuracy of a diagnostic test can be measured by the Area Under the ROC Curve (AUC), which always lies between 0 and 1. Other performance measures obtained from the same four counts, such as the positive and negative likelihood ratios, Precision, Recall, Youden's index, the Matthews correlation, and the Discriminant Power (DP), are briefly discussed in this paper.

Key words: ROC, AUC, Precision, Recall, Youden's index, Matthews correlation, Discriminant Power.

1. Introduction

As in many other areas of science and technology, important problems in the medical field rest on the proper development and assessment of binary classifiers. A general assessment of the performance of a binary classifier is typically carried out through the analysis of its Receiver Operating Characteristic (ROC) curve. The phrase Receiver Operating Characteristic curve had its origin in statistical decision theory and signal detection theory (SDT), and the technique was used during World War II for the analysis of radar images. In medical diagnosis one needs to discriminate between a Healthy (H) and a Diseased (D) subject using a procedure such as an ECG, an echocardiogram, or lipid values. Very often there exists a decision rule, called the Gold Standard, that leads to the correct classification of a subject into the Healthy (H) or Diseased (D) group. Such a rule can, however, be expensive or

invasive, or it may not be available at a given location or time point. In such situations we seek alternative, user-friendly discriminating procedures with reasonably good accuracy. Such procedures are often called markers (in the biological sciences, biomarkers). When several markers are available, we need to choose the marker, or combination of markers, that most correctly classifies a new patient into the D or H group. The percentage of correct classifications made by the marker is a measure of the test performance.

The distinction between the two groups is normally made based on a threshold or cutoff value. Given a cutoff value on the biomarker, it is possible to classify a new subject with some reasonable level of accuracy. Let X denote the test result (continuous or discrete) and C be a cutoff value, so that a subject is classified as positive if X > C and negative otherwise. Each subject belongs to one of the disjoint states D or H, denoting the Diseased and Healthy states respectively. Comparing the test result with the actual diagnosis gives the following counts or states:

1) True Positives (TP): sick people correctly diagnosed as sick
2) False Positives (FP): healthy people wrongly diagnosed as sick
3) True Negatives (TN): healthy people correctly diagnosed as healthy
4) False Negatives (FN): sick people wrongly diagnosed as healthy

These four states can be laid out in the following 2x2 contingency table, often known as the confusion matrix, in which the entries are non-negative integers:

                         Test Result
Diagnosis        Positive      Negative
Positive         TP            FN
Negative         FP            TN

Two other important measures of performance popularly used in medical diagnosis are
(i) Sensitivity (Sn), or TPF = TP/(TP+FN)
(ii) Specificity (Sp), or TNF = TN/(TN+FP)
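As a concrete illustration of the classification rule X > C and the four counts above, here is a minimal Python sketch (the function name `confusion_counts` and the 0/1 label coding are illustrative choices, not from the paper):

```python
def confusion_counts(scores, labels, cutoff):
    """Tally TP, FP, TN, FN under the rule: positive if X > C.
    labels: 1 = Diseased (D), 0 = Healthy (H)."""
    tp = fp = tn = fn = 0
    for x, y in zip(scores, labels):
        if x > cutoff:
            if y == 1:
                tp += 1   # sick, correctly called sick
            else:
                fp += 1   # healthy, wrongly called sick
        else:
            if y == 0:
                tn += 1   # healthy, correctly called healthy
            else:
                fn += 1   # sick, wrongly called healthy
    return tp, fp, tn, fn
```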

Based on these four counts, several performance measures are discussed in the following sections.

2. Commonly accepted performance evaluation measures

In medical diagnosis, two popular measures that separately estimate a classifier's performance on the two classes are sensitivity and specificity (often employed in biomedical and medical applications and in studies involving image and visual data):

a) Sensitivity = P[X > C | D] = TPF = TP/(TP+FN). It is the probability that a randomly selected person from the D group will have a measurement X > C.

b) Specificity = P[X <= C | H] = TNF = TN/(TN+FP). It is the probability that a randomly selected person from the H group will have a measurement X <= C.

Both sensitivity and specificity lie between 0 and 1. As the value of C increases, Sn decreases while Sp increases. A good diagnostic test is supposed to have high sensitivity together with correspondingly high specificity.

c) Recall: a function of the correctly classified positive examples (true positives) and the positives misclassified as negatives (false negatives); it is identical to sensitivity.
Recall = TP/(TP+FN) = Sensitivity

d) Precision: a function of the true positives and the examples misclassified as positive (false positives).
Precision = TP/(TP+FP)
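Definitions (a)-(d) above translate directly into code; a short sketch of these ratios computed from the confusion-matrix counts (function names are illustrative):

```python
def sensitivity(tp, fn):
    # P[X > C | D]: proportion of diseased subjects testing positive (= recall).
    return tp / (tp + fn)

def specificity(tn, fp):
    # P[X <= C | H]: proportion of healthy subjects testing negative.
    return tn / (tn + fp)

def precision(tp, fp):
    # Proportion of positive test results that are truly diseased.
    return tp / (tp + fp)
```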

e) Accuracy: in binary classification, the most commonly accepted empirical performance measure is accuracy. It does not distinguish between correct labels of the different classes.
Accuracy = (TP+TN)/(TP+FP+FN+TN)

f) ROC: the Receiver Operating Characteristic (ROC) curve is a plot of Sn versus (1 - Sp). It is an effective method of evaluating the quality or performance of a diagnostic test, and it is widely used in medical fields such as radiology and epidemiology. In this paper we discuss the advantages of the ROC curve as a means of describing the accuracy of a test, the construction of the ROC curve and its Area Under the Curve (AUC), and several further summary measures of test accuracy: Precision, Recall, the positive and negative likelihood ratios, the Matthews correlation, Youden's index, and the Discriminant Power.

g) Area Under the Curve (AUC): one of the problems in ROC analysis is that of evaluating the Area Under the Curve (AUC). A standard and well-known procedure is the non-parametric method, which uses the trapezoidal rule to find the AUC. The AUC can be used to summarize the overall accuracy of the diagnostic test. The following section briefly discusses the other summary measures of the test.

3. Other summary performance evaluation measures

In this study we concentrate on the choice of comparison measures. In particular, we suggest evaluating the performance of classifiers using measures other than accuracy and the ROC. In the light of Bayes' theorem, the measures listed above have the following interpretations:

Sensitivity and specificity approximate the probability of the positive or negative label being true (they assess the effectiveness of the test on a single class).

Precision estimates the predictive value of the test, either positive or negative, depending on the class for which it is calculated (it assesses the predictive power of the test).

Accuracy approximates how effective the test is overall, by estimating the probability that the predicted class label is true (it assesses the overall effectiveness of the test).

On these considerations alone, we cannot establish the superiority of one test over another; indeed, the superiority of a test with respect to another largely depends on the evaluation measures applied. Our main requirement for additional measures is that they bring out new characteristics of the test's performance, and we also want the measures to be easily comparable. We are interested in two characteristics of a test: its confirmation capability with respect to the classes, that is, the estimated probability of correct predictions of positive and negative cases; and its ability to avoid misclassification, that is, the estimated components of the probability of misclassification.

Measures with these properties have long been used in medical diagnosis to analyze tests: Youden's index, the likelihood ratios, the Matthews correlation, and the Discriminant Power. They combine sensitivity and specificity and their complements.

Youden's index: evaluates the test's ability to avoid misclassification, weighting its performance on positive and negative examples equally.

Youden's index = Sensitivity - (1 - Specificity) = Sensitivity + Specificity - 1

Youden's index has traditionally been used to compare the diagnostic abilities of two tests.
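Youden's index follows in one line from sensitivity and specificity; a minimal sketch (the function name is illustrative):

```python
def youden_index(sens, spec):
    # J = sensitivity + specificity - 1.
    # J = 0: no better than chance; J = 1: perfect separation.
    return sens + spec - 1
```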

Likelihood ratios: if a measure accommodates both sensitivity and specificity but treats them separately, we can evaluate the classifier's performance with respect to both classes to a finer degree. The following pair of measures, the positive and negative likelihood ratios, allows us to do just that:

Positive likelihood ratio = sensitivity / (1 - specificity)
Negative likelihood ratio = (1 - sensitivity) / specificity

A higher positive and a lower negative likelihood ratio mean better performance on the positive and negative classes respectively.

Matthews correlation: the Matthews correlation coefficient (MCC) is used in machine learning and medical diagnosis as a measure of the quality of binary (two-class) classifications. The MCC is in essence a correlation coefficient between the observed and predicted binary classifications. While there is no perfect way of describing the confusion matrix of true and false positives and negatives by a single number, the MCC is generally regarded as one of the best such measures. It can be calculated directly from the confusion matrix using the formula

MCC = (TP x TN - FP x FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))

If any of the four sums in the denominator is zero, the denominator can be arbitrarily set to one; the numerator is then also zero, so the MCC is zero.

Discriminant Power: another measure that summarizes sensitivity and specificity is the discriminant power (DP),

DP = (sqrt(3)/pi) * (log x + log y)

where x = sensitivity/(1 - sensitivity) and y = specificity/(1 - specificity). DP evaluates how well a test distinguishes between positive and negative examples: the test is a poor discriminant if DP < 1, limited if DP < 2, fair if DP < 3, and good otherwise.
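These combined measures can be sketched in a few lines of Python. The zero-denominator convention for the MCC is the one stated in the text, and the logarithm in DP is taken as base 10, which is an assumption chosen to match the DP value reported in Section 4 (function names are illustrative):

```python
import math

def likelihood_ratios(sens, spec):
    # LR+ = sens/(1 - spec); LR- = (1 - sens)/spec.
    return sens / (1 - spec), (1 - sens) / spec

def mcc(tp, fp, tn, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        denom = 1.0  # convention: any zero marginal sum -> denominator set to 1
    return (tp * tn - fp * fn) / denom

def discriminant_power(sens, spec):
    # DP = (sqrt(3)/pi) * (log10(x) + log10(y))
    x = sens / (1 - sens)
    y = spec / (1 - spec)
    return (math.sqrt(3) / math.pi) * (math.log10(x) + math.log10(y))
```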

4. Numerical results with MS-Excel

We consider data from a hospital study in which the average diameter measured in patients suffering from CAD (Coronary Artery Disease) is the criterion for classifying patients into the case (D) and control (H) groups. A sample of 32 cases and 32 controls was studied, and the variable average diameter was taken as the classifier/marker. Using these data, the various performance measures were evaluated with Excel functions. A sample data set (Excel template) is shown in Fig-1.

Fig-1: Excel template of the data set
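Given a data set like the one in Fig-1 (a column of marker values and a column of case/control labels), the empirical ROC curve and the trapezoidal AUC described in Section 2 can be computed with a short sketch like the following (function and variable names are illustrative, not taken from the paper's Excel template):

```python
def roc_points(scores, labels):
    """Empirical ROC points (1 - specificity, sensitivity), obtained by
    sweeping the cutoff C over the observed scores (positive if X > C).
    labels: 1 = case (D), 0 = control (H)."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = {(0.0, 0.0), (1.0, 1.0)}   # endpoints of every ROC curve
    for c in set(scores):
        tp = sum(1 for x, y in zip(scores, labels) if x > c and y == 1)
        fp = sum(1 for x, y in zip(scores, labels) if x > c and y == 0)
        pts.add((fp / neg, tp / pos))
    return sorted(pts)

def auc_trapezoid(pts):
    # Trapezoidal rule over consecutive ROC points.
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```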

The various performance measures evaluated at cutoff = 0.50 are shown in Table-1.

Confusion matrix:

                    Diagnosis (+)    Diagnosis (-)    TOTAL
Test 0 (+)          14 (TP)          18 (FP)          32
Test 1 (-)          7 (FN)           25 (TN)          32
TOTAL               21               43               64

Performance measure          Formula                                                  Value
Sensitivity                  TPF = TP/(TP+FN)                                         0.6667
Specificity                  TNF = TN/(TN+FP)                                         0.5814
Positive Predictive Value    TP/(TP+FP)                                               0.4375
Negative Predictive Value    TN/(TN+FN)                                               0.7813
Precision                    TP/(TP+FP)                                               0.4375
Recall                       TP/(TP+FN)                                               0.6667
Positive likelihood ratio    sensitivity/(1-specificity)                              1.5926
Negative likelihood ratio    (1-sensitivity)/specificity                              0.5733
Youden's index               Sensitivity+Specificity-1                                0.2481
Matthews correlation         (TP x TN - FP x FN)/sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))   0.2329
Discriminant Power (DP)      (sqrt(3)/pi)(log x + log y)                              0.2446

Table-1: Various performance evaluation measures

Conclusions

The ROC curve is a graphical representation of Sensitivity vs (1 - Specificity) in the XY plane. The various performance measures arising in ROC curve analysis have been illustrated with an empirical example based on Coronary Artery Disease (CAD) data. In this example Sensitivity (which equals Recall) is greater than Specificity, and the Discriminant Power satisfies DP < 1, indicating a poor discriminant.
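As a check, the measures reported in Table-1 can be reproduced directly from the four counts at cutoff = 0.50 (TP = 14, FP = 18, FN = 7, TN = 25), assuming the standard MCC denominator sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)) and base-10 logarithms in DP:

```python
import math

tp, fp, fn, tn = 14, 18, 7, 25           # Table-1 counts at cutoff = 0.50

sens = tp / (tp + fn)                     # sensitivity (= recall)
spec = tn / (tn + fp)                     # specificity
ppv = tp / (tp + fp)                      # positive predictive value (= precision)
npv = tn / (tn + fn)                      # negative predictive value
lr_pos = sens / (1 - spec)                # positive likelihood ratio
lr_neg = (1 - sens) / spec                # negative likelihood ratio
youden = sens + spec - 1                  # Youden's index
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
dp = (math.sqrt(3) / math.pi) * (
    math.log10(sens / (1 - sens)) + math.log10(spec / (1 - spec)))
```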
