Introduction to ROC Analysis
Andriy I. Bandos
Department of Biostatistics, University of Pittsburgh

Acknowledgements: Many thanks to Sam Wieand, Nancy Obuchowski, Brenda Kurland, and Todd Alonzo for previous versions of this lecture.
Outline
1. Basics of diagnostic accuracy evaluation
2. Why do we need ROC analysis?
3. How to construct and use the ROC curve
* Focus on the structure and interpretation of ROC tools, setting aside formal statistical analysis
Basic (Classic) Set-up
- A number of subjects with and without the condition of interest (e.g., patients with/without distant metastases)
- For every subject, the presence/absence of the condition of interest, or true status ("normal"/"abnormal"), is known from a Gold (Reference) Standard
- A diagnostic test result is obtained for every subject (e.g., radiologist's scores, 1-6, based on PET-CT interpretation)
- Goal of evaluating diagnostic accuracy: assess the agreement between the test result and the true status
- Possible positive outcomes:
  - agreement is high overall → the test can be used to diagnose presence or absence of the condition of interest
  - agreement is high only for some results → the test can be used to diagnose either presence or absence of the condition of interest
Illustrative Example
FDG PET-CT identification of distant metastatic disease in uterine cervical and endometrial cancers: analysis from ACRIN 6671/GOG0233. Gee et al., RSNA 2016
- Goal: evaluate the accuracy of staging PET-CT for detecting distant metastases (for the target population)
- Condition of interest: presence/absence of distant metastatic disease
- Reference standard: pathology and follow-up radiology reports
- Test result: radiologist interpretation of PET-CT for presence of distant metastases (scores on a 1-6 scale, with >3 being "positive")
Simplest Scenario: Binary Test Result (+/-)

                          TEST RESULT
TRUE STATUS   Negative (-)             Positive (+)            Total
Normal        #True Negatives = 353    #False Positives = 5    #Normal = 358
Abnormal      #False Negatives = 17    #True Positives = 31    #Abnormal = 48
Total         #Negatives = 370         #Positives = 36         Total = 406

(In counting notation, with T = test result and D = true status, 1 = positive/abnormal: #False Negatives = #(T=0, D=1) = 17, #True Positives = #(T=1, D=1) = 31.)

- Most diagnostic tests are imperfect: errors are possible
  - False Positive error: positive result for a normal subject
  - False Negative error: negative result for an abnormal subject
- The two types of error have fundamentally different consequences (e.g., falsely suspecting versus missing the presence of distant metastases) and must be kept separate
- Intuitively, we would like to see more correct classifications, or, equivalently, fewer errors. But how few is few enough?
Characterizing Diagnostic Accuracy

                          TEST RESULT
TRUE STATUS   Negative (-)             Positive (+)            Total
Normal        #True Negatives = 353    #False Positives = 5    #Normal = 358
Abnormal      #False Negatives = 17    #True Positives = 31    #Abnormal = 48
Total         #Negatives = 370         #Positives = 36         Total = 406

How frequent are the correct classifications?
- 31 True Positives:
  - 31 out of 406 ≈ 0.08: probability of true positives
  - 31 out of 48 ≈ 0.65: Sensitivity, Se (or True Positive Fraction, TPF)
  - 31 out of 36 ≈ 0.86: Positive Predictive Value (PPV)
- 353 True Negatives:
  - 353 out of 406 ≈ 0.87: probability of true negatives
  - 353 out of 358 ≈ 0.99: Specificity, Sp (or True Negative Fraction)
  - 353 out of 370 ≈ 0.95: Negative Predictive Value (NPV)
- All measures can be useful under specific scenarios
- Se and Sp are usually used because they:
  - are typically robust (e.g., do not depend on prevalence)
  - have fixed benchmarks of what is large (1) and what is small (0)
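The measures on this slide follow directly from the four cells of the 2x2 table; a minimal sketch (Python is used here purely for illustration), with the counts taken from the PET-CT example above:

```python
# Counts from the 2x2 table: TP=31, FN=17, FP=5, TN=353
TP, FN, FP, TN = 31, 17, 5, 353

se  = TP / (TP + FN)   # sensitivity (TPF): 31/48   ≈ 0.65
sp  = TN / (TN + FP)   # specificity (TNF): 353/358 ≈ 0.99
ppv = TP / (TP + FP)   # positive predictive value: 31/36  ≈ 0.86
npv = TN / (TN + FN)   # negative predictive value: 353/370 ≈ 0.95

print(f"Se={se:.2f} Sp={sp:.2f} PPV={ppv:.2f} NPV={npv:.2f}")
```

Note that Se and Sp condition on the true status (row totals), while PPV and NPV condition on the test result (column totals), which is why only the latter pair depends on prevalence.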
Graphical Representation: ROC Space
- ROC coordinates: Se (or TPF) as the vertical axis and 1-Sp (or FPF) as the horizontal axis
- Characteristics of benchmark tests:
  - perfect (no errors): Sp=1, Se=1
  - most liberal (all labeled "positive"): Sp=0, Se=1
  - most strict (all labeled "negative"): Sp=1, Se=0
  - guess (flip of a coin): 1-Sp=Se
- There is always a test with better Se or better Sp → MUST consider both Se and Sp
- Simultaneous interpretation of the values of Se and Sp:
  - Bad: comparable to the performance of a guess (1-Sp ≈ Se, i.e., close to the diagonal)
  - Good: close to perfect (Se ≈ 1, Sp ≈ 1; or 1-Sp ≈ 0)
- A test with worse Se and worse Sp is objectively worse
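A quick simulation illustrates why a coin-flip test lands on the 1-Sp = Se diagonal: a guess that ignores the true status is "positive" at the same rate among normals and abnormals. A hypothetical sketch (the guessing probability p = 0.3 and the sample size are arbitrary choices for illustration):

```python
import random

random.seed(0)  # reproducible illustration

p = 0.3        # "guessing" test: positive with probability p, regardless of truth
n = 100_000

# 1 = test positive; the guess ignores the true status entirely
false_pos = [1 if random.random() < p else 0 for _ in range(n)]  # among normals
true_pos  = [1 if random.random() < p else 0 for _ in range(n)]  # among abnormals

fpf = sum(false_pos) / n  # 1 - Sp
se  = sum(true_pos) / n   # Se
print(f"1-Sp = {fpf:.3f}, Se = {se:.3f}")  # both close to p = 0.3
```

Varying p from 0 to 1 traces out the whole diagonal, from the "most strict" corner (0, 0) to the "most liberal" corner (1, 1).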
More on Comparison of Diagnostic Tests
- Higher Se and Sp → better test → higher PPV and NPV in the same population (or better LR+ and LR-)
- But what if one test has higher Se and the other higher Sp, with a higher PPV on one side? E.g., PET-CT for distant metastases:
  - Central interpretations: Se=65%, Sp=99%; PPV=86%, NPV=95%
  - Institutional interpretations: Se=67%, Sp=94%; PPV=62%, NPV=95%
- Often, this problem can be objectively resolved by constructing ROC curves
(MarginProbe 2.0 PMA, FDA materials)
ROC Curve
- Many binary tests are actually dichotomized versions of a continuous test (e.g., "positive for distant metastases" = (score > 3))
- If we change the diagnostic threshold, both Se and Sp change, e.g.:

Positive if T>3:          TEST
TRUTH           -      +
Normal         353     5    358
Abnormal        17    31     48
Se ≈ 0.65, Sp ≈ 0.99 (1-Sp ≈ 0.014)

Positive if T>2:          TEST
TRUTH           -      +
Normal         344    14    358
Abnormal        14    34     48
Se ≈ 0.71, Sp ≈ 0.96 (1-Sp ≈ 0.039)

- The ROC curve summarizes Se and 1-Sp over all thresholds
- The ROC curve characterizes overall diagnostic accuracy
- Applicable to a medical test as well as to an artificial predictive tool (e.g., a statistical model for a binary "truth")
ROC Curve Construction (Make-up Example)
[Dot plot: test scores 1-4 for 10 normal (O) and 10 abnormal (X) subjects; a threshold c slides along the score axis, and each position yields one 2x2 table.]

Threshold c between 3 and 4:   TEST
TRUTH           -      +
Normal          9      1    10
Abnormal        6      4    10
Se(c)=0.4, 1-Sp(c)=0.1

Threshold c between 2 and 3:   TEST
TRUTH           -      +
Normal          7      3    10
Abnormal        3      7    10
Se(c)=0.7, 1-Sp(c)=0.3

Threshold c between 1 and 2:   TEST
TRUTH           -      +
Normal          4      6    10
Abnormal        1      9    10
Se(c)=0.9, 1-Sp(c)=0.6
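The construction above can be sketched programmatically: sweep the threshold over the observed score values and record (1-Sp, Se) at each step. A minimal Python illustration, with per-score counts reconstructed so as to reproduce the three tables of this make-up example:

```python
# Ratings 1-4 for 10 normal and 10 abnormal subjects (make-up example);
# e.g., 4 normals scored 1, 3 scored 2, 2 scored 3, 1 scored 4.
normal   = [1]*4 + [2]*3 + [3]*2 + [4]*1
abnormal = [1]*1 + [2]*2 + [3]*3 + [4]*4

def roc_points(normal, abnormal):
    """Empirical (1-Sp, Se) pairs for all rules 'positive if score >= c'."""
    pts = [(0.0, 0.0)]  # strictest rule: everything labeled negative
    for c in sorted(set(normal + abnormal), reverse=True):
        fpf = sum(s >= c for s in normal) / len(normal)      # 1 - Sp(c)
        tpf = sum(s >= c for s in abnormal) / len(abnormal)  # Se(c)
        pts.append((round(fpf, 2), round(tpf, 2)))
    return pts

print(roc_points(normal, abnormal))
# → [(0.0, 0.0), (0.1, 0.4), (0.3, 0.7), (0.6, 0.9), (1.0, 1.0)]
```

The three interior points match the three tables above; connecting them (plus the trivial corner points) gives the empirical ROC curve.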
Use of ROC: Threshold Bias Removal
- The ROC curve describes all Se-Sp values that can be obtained by changing the diagnostic threshold
- One of the classic uses of the ROC curve: can a new test achieve higher Se and Sp than the existing (conventional) test at some threshold?
- A more informative comparison is usually based on comparing the two ROC curves (more on that later)
(Gee et al., RSNA 2016)
Use of ROC: Comparison of Tests
- Sometimes one diagnostic test is clearly better than the other (Zuley ML et al., Radiology 2013; Gee M et al., RSNA 2016; AUCs: 0.84 vs 0.89)
- Sometimes the improvement is not uniform, i.e., one ROC curve is higher in some ranges and lower in others
- The practical/clinical purpose or objectives of the study should then drive the decision as to which test is better for the considered application
Use of ROC: Evaluating a Diagnostic Test
- Benchmark ROC curves:
  - Perfect: two segments connecting at the corner (Se=1, Sp=1)
  - Guessing: the diagonal (random choice), 1-Sp(c)=Se(c)
- Tests with imperfect ROC curves are still useful:
  - Tests with high ROC curves often lead to operating points with good Se and Sp (i.e., useful for both types of decisions)
  - Tests with relatively low ROC curves could still be useful for targeted decisions, e.g.:
    - for identifying a subset of the diseased: Se>>0, Sp ≈ 1
    - for identifying a subset of the disease-free: Se ≈ 1, Sp>>0
- It is not always straightforward to quantify how good an ROC curve is; multiple summary indices are available (discussed later)
Most Typical ROC Summary Index: Area Under the ROC Curve (AUC)
- Single-value summary index of the entire ROC curve
  - perfect ROC → AUC=1; guessing ROC → AUC=0.5
- Quantifies the difference between the distributions of test results for diseased and non-diseased subjects (~ the well-known Wilcoxon statistic for comparing two groups)
- Interpretation: the probability that a randomly selected diseased subject tests more "suspicious" than a randomly selected non-diseased subject
- Technical advantages: well-known, objective, easy to use
- Limitations:
  - not very clinically relevant
  - can be misleading (as can any scalar index for the entire curve); e.g., a non-guessing ROC curve can still have AUC=0.5
  - summarizes over operating points outside of practical interest (e.g., Sp<0.5)
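The pairwise (Wilcoxon/Mann-Whitney) interpretation of the AUC can be computed directly, without integrating the curve: count, over all diseased/non-diseased pairs, how often the diseased subject scores higher, crediting ties with 1/2. A minimal sketch on the make-up data from the construction example (the per-score counts are reconstructed from those tables):

```python
# Same make-up ratings as in the ROC construction example
normal   = [1]*4 + [2]*3 + [3]*2 + [4]*1
abnormal = [1]*1 + [2]*2 + [3]*3 + [4]*4

def auc(normal, abnormal):
    """AUC = P(diseased score > non-diseased score) + 0.5 * P(tie)."""
    wins = sum((d > n) + 0.5 * (d == n) for d in abnormal for n in normal)
    return wins / (len(abnormal) * len(normal))

print(auc(normal, abnormal))  # → 0.75
```

The result, 0.75, equals the trapezoidal area under the empirical ROC points of the construction example, illustrating the equivalence of the two views of the AUC.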
Summary Indices for the ROC Curve
- Area under the ROC curve (AUC)
  - objective, i.e., does not require subjective specifications
  - summarizes over operating points outside of practical interest (e.g., Sp<0.5)
  - one of the more precise summary indices
- Partial AUC (pAUC) for s1<Sp<s2
  - perfect: pAUC = s2-s1; guessing: pAUC = (s2-s1)(1-(s1+s2)/2)
  - focuses on the operating points in the range of interest
  - uses more than one point from the curve (but is typically less precise than AUC)
- Sensitivity corresponding to a given specificity
  - perfect: Se(Sp=s)=1; guessing: Se(Sp=s)=1-s
  - can be too subjective, since the needed Sp must be specified
  - relatively imprecise (typically requires larger sample sizes)
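A partial AUC can be computed by trapezoidal integration over the FPF (1-Sp) window of interest. A minimal sketch using the empirical ROC points from the make-up example (the window Sp > 0.7, i.e. FPF in (0, 0.3), is an arbitrary choice for illustration):

```python
# Empirical ROC points (1-Sp, Se) from the make-up construction example
pts = [(0.0, 0.0), (0.1, 0.4), (0.3, 0.7), (0.6, 0.9), (1.0, 1.0)]

def partial_auc(pts, fpf_lo, fpf_hi):
    """Trapezoidal area under the empirical ROC for fpf_lo <= 1-Sp <= fpf_hi."""
    def se_at(x):  # linear interpolation of Se at an arbitrary FPF value
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= x <= x1:
                return y0 if x1 == x0 else y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    # integrate over the window endpoints plus any ROC vertices inside it
    xs = sorted({fpf_lo, fpf_hi} | {x for x, _ in pts if fpf_lo < x < fpf_hi})
    return sum((b - a) * (se_at(a) + se_at(b)) / 2 for a, b in zip(xs, xs[1:]))

print(partial_auc(pts, 0.0, 0.3))  # ≈ 0.13
# For this window a perfect test would give 0.3 and guessing 0.3**2/2 = 0.045,
# so 0.13 sits between the two benchmarks.
```

Comparing the value against the perfect and guessing benchmarks for the same window, as on this slide, is what makes the raw number interpretable.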
Problems with Scalar ROC Indices
- Scalar indices lose some of the information stored in the ROC curve
- Different indices can contradict each other, e.g.:
  - ROC curves with the same AUCs can differ at almost all points
  - an ROC curve with a higher overall AUC can be lower in the range of interest (e.g., Sp>0.9)
- Thus, it is very important to look at the ROC curve in addition to computing summary indices
ROCs and Binary Tests
- When one binary test is clearly better than the other (higher Se and Sp), does it also have a better ROC curve?
  - Yes, at least between the two known operating points (because the ROC curve is always non-decreasing), but not necessarily for all thresholds
- When a binary test has better PPV and NPV (in the same sample), does it also have a better ROC curve in the range between the two operating points?
  - Yes, if the test can be assumed "reasonable" (or locally better than chance), e.g., a probability-of-malignancy score from a trained radiologist
  - but this is not necessarily true for non-optimized tests (e.g., a raw biomarker measured from tissue), which may have lower Sensitivity and Specificity
Limitations
- Typical limitations:
  - An entire curve can be difficult to interpret
  - A single-number (scalar) index of the ROC curve can be misleading
  - ROC curves for human observers might be difficult to interpret (it is good to support ROC-based findings with the accuracy of actual decisions)
- Do you always need to use the ROC curve?
  - No; e.g., in some cases, consideration of a single point (Sp, Se) can provide sufficient information (Gee et al., RSNA 2016)
- Does ROC analysis always provide a definite answer?
  - No; e.g., not when the ROC curves cross in the region of interest
Overall Recommendations
- If a reliable estimate of the ROC curve is available:
  - use the ROC curve to visually evaluate or compare diagnostic systems
  - use appropriate summary indices to quantify the results
- If a reliable estimate of the ROC curve is not available, but binary test results ("positive"/"negative") are known:
  - try using a pair of intrinsic characteristics (Se, Sp, or functions thereof) to infer diagnostic accuracy
  - if only prevalence-dependent characteristics are available (e.g., predictive values), remember their dependence on the prevalence in the sample
- When using a scalar summary index:
  - remember that no scalar summary index is better than the others under all circumstances; study objectives should drive the choice
  - interpret the numeric value relative to the values attained by the benchmark tests (perfect, guessing) under the same conditions
ROC Analysis: Further Topics
Many more complicated ROC techniques exist, e.g.:
- Multiple estimation approaches: parametric (e.g., binormal ROC), non-parametric (empirical), semi-parametric
- Incorporation of information from other covariates: modeling the ROC curve or its indices (e.g., ROC-GLM)
- Application to time-to-event data: time-dependent ROC
- Extensions to more than two classes: multi-class ROC analysis
- Extensions to multiple targets per subject: free-response ROC (FROC), regions-of-interest (ROI) approach

Most standard software packages offer some basic types of ROC analysis (SPSS, SAS, Stata, R, ...). Use of many advanced ROC techniques requires the help of a statistician experienced in the area.

A couple of great textbooks on ROC analysis and related topics:
- Zhou XH, Obuchowski NA, McClish DK (2011). Statistical Methods in Diagnostic Medicine. 2nd edition. New York: Wiley & Sons Inc.
- Pepe MS (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford: Oxford University Press.
THANK YOU!