Introduction to ROC analysis


Andriy I. Bandos
Department of Biostatistics, University of Pittsburgh

Acknowledgements: Many thanks to Sam Wieand, Nancy Obuchowski, Brenda Kurland, and Todd Alonzo for previous versions of this lecture.

Outline

1. Basics of diagnostic accuracy evaluation
2. Why do we need ROC analysis?
3. How to construct and use the ROC curve

* Focus is on the structure and interpretation of ROC tools, setting aside formal statistical analysis.

Basic (Classic) Set-up

- A number of subjects with and without the condition of interest (e.g., patients with/without distant metastases).
- For every subject, the presence/absence of the condition of interest, or true status ("normal"/"abnormal"), is known from a Gold (Reference) Standard.
- A diagnostic test result is obtained for every subject (e.g., radiologist's scores, 1-6, based on PET-CT interpretation).

Goal of evaluating diagnostic accuracy: assess the agreement between the test result and the true status.

Possible positive outcomes:
- Agreement is high overall: the test can be used to diagnose both presence and absence of the condition of interest.
- Agreement is high only for some results: the test can be used to diagnose either presence or absence of the condition of interest.

Illustrative Example

"FDG PET-CT identification of distant metastatic disease in uterine cervical and endometrial cancers: Analysis from ACRIN 6671/GOG0233." Gee et al., RSNA 2016.

- Goal: evaluate the accuracy of staging PET-CT for detecting distant metastases (for the target population).
- Condition of interest: presence/absence of distant metastatic disease.
- Reference standard: pathology and follow-up radiology reports.
- Test result: radiologist interpretation of PET-CT for presence of distant metastases (scores on a 1-6 scale, with >3 being "positive").

Simplest scenario: Binary test result (+/-)

                                TEST RESULT
TRUE STATUS    Negative (-)               Positive (+)               Total
Normal         #TN = 353                  #FP = 5                    358
Abnormal       #FN = #(T=0,D=1) = 17      #TP = #(T=1,D=1) = 31       48
Total          370                        36                         406

Most diagnostic tests are imperfect: errors are possible.
- False Positive error: a positive result for a normal subject.
- False Negative error: a negative result for an abnormal subject.

The two types of error have fundamentally different consequences (e.g., falsely suspecting versus missing the presence of distant metastases) and must be kept separate.

Intuitively, we would like to see more correct classifications or, equivalently, fewer errors. But how few is few enough?

Characterizing Diagnostic Accuracy

(TN = 353, FP = 5, FN = 17, TP = 31; 358 normal, 48 abnormal; 370 negatives, 36 positives; 406 total)

How frequent are the correct classifications?

31 True Positives:
- 31 out of 406 ≈ 0.08: probability of a true positive
- 31 out of 48 ≈ 0.65: Sensitivity, Se (or True Positive Fraction, TPF)
- 31 out of 36 ≈ 0.86: Positive Predictive Value (PPV)

353 True Negatives:
- 353 out of 406 ≈ 0.87: probability of a true negative
- 353 out of 358 ≈ 0.99: Specificity, Sp (or True Negative Fraction)
- 353 out of 370 ≈ 0.95: Negative Predictive Value (NPV)

All measures can be useful under specific scenarios. Se and Sp are most commonly used because they:
- are typically robust (e.g., do not depend on prevalence);
- have fixed benchmarks of what is large (1) and what is small (0).
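The calculations above are easy to script; a minimal sketch in Python using the slide's 2x2 counts (variable names are ours):

```python
# Accuracy measures for the PET-CT 2x2 table (TN=353, FP=5, FN=17, TP=31)
TN, FP, FN, TP = 353, 5, 17, 31

se  = TP / (TP + FN)   # sensitivity (TPF): 31/48
sp  = TN / (TN + FP)   # specificity (TNF): 353/358
ppv = TP / (TP + FP)   # positive predictive value: 31/36
npv = TN / (TN + FN)   # negative predictive value: 353/370

print(f"Se={se:.2f} Sp={sp:.2f} PPV={ppv:.2f} NPV={npv:.2f}")
# Se=0.65 Sp=0.99 PPV=0.86 NPV=0.95
```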

Graphical representation: ROC space

ROC coordinates: Se (or TPF) on the vertical axis and 1-Sp (or FPF) on the horizontal axis.

Characteristics of benchmark tests:
- perfect (no errors): Sp=1, Se=1
- most liberal (all labeled "positive"): Sp=0, Se=1
- most strict (all labeled "negative"): Sp=1, Se=0
- guess (flip of a coin): 1-Sp=Se

There is always a test with better Se or better Sp, so one MUST consider both Se and Sp simultaneously.

Interpretation of values of Se and Sp:
- Bad: comparable to the performance of a guess (1-Sp ≈ Se, i.e., close to the diagonal).
- Good: close to perfect (Se ≈ 1, Sp ≈ 1; or 1-Sp ≈ 0).
- A test with worse Se and worse Sp is objectively worse.

More on Comparison of Diagnostic Tests

Higher Se and Sp → better test → higher PPV and NPV in the same population (or better LR+ and LR-).

But what if one test has higher Se and lower Sp? Then the comparison is ambiguous. E.g., PET-CT for distant metastases:
- Central interpretations: Se=65%, Sp=99% → PPV=86%, NPV=95%
- Institutional interpretations: Se=67%, Sp=94% → PPV=62%, NPV=95%

Often this problem can be objectively solved by constructing ROC curves.

(MarginProbe 2.0 PMA, FDA materials)
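The prevalence dependence of the predictive values can be made concrete with Bayes' rule. A sketch using the central-read Se and Sp as exact fractions, with a hypothetical low-prevalence (screening-like) population added for contrast:

```python
# Predictive values from Se, Sp and prevalence via Bayes' rule.
def ppv(se, sp, prev):
    # P(diseased | positive)
    return se * prev / (se * prev + (1 - sp) * (1 - prev))

def npv(se, sp, prev):
    # P(disease-free | negative)
    return sp * (1 - prev) / (sp * (1 - prev) + (1 - se) * prev)

se, sp = 31 / 48, 353 / 358          # central PET-CT reads, exact fractions

# Same Se and Sp, two prevalences: the study sample vs. a hypothetical 1%.
for prev in (48 / 406, 0.010):
    print(f"prev={prev:.3f}: PPV={ppv(se, sp, prev):.2f}, "
          f"NPV={npv(se, sp, prev):.2f}")
# prev=0.118: PPV=0.86, NPV=0.95
# prev=0.010: PPV=0.32, NPV=1.00
```

At the sample prevalence the formula reproduces the counted PPV of 31/36; at 1% prevalence the same Se and Sp yield a far lower PPV, which is why Se and Sp are preferred as "intrinsic" characteristics.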

ROC curve

Many binary tests are actually dichotomized versions of a continuous test (e.g., "positive for distant metastases" = (score > 3)).

If we change the diagnostic threshold ("tune" the test), both Se and Sp change, e.g.:

Positive if T>3:
TRUTH        -      +
Normal      353     5     358
Abnormal     17    31      48
Se ≈ 0.65, Sp ≈ 0.99 (1-Sp ≈ 0.014)

Positive if T>2:
TRUTH        -      +
Normal      344    14     358
Abnormal     14    34      48
Se ≈ 0.71, Sp ≈ 0.96 (1-Sp ≈ 0.039)

The ROC curve summarizes Se and 1-Sp over all thresholds. It characterizes overall diagnostic accuracy and is applicable to a medical test as well as to an artificial predictive tool (e.g., a statistical model for a binary "truth").

ROC curve construction (make-up example)

[Figure: 10 normal (O) and 10 abnormal (X) subjects plotted along the test-result scale, with candidate thresholds c = 1, 2, 3, 4.]

A strict threshold c:
TRUTH        -     +
Normal       9     1     10
Abnormal     6     4     10
Se(c)=0.4, 1-Sp(c)=0.1

A lower threshold c:
TRUTH        -     +
Normal       7     3     10
Abnormal     3     7     10
Se(c)=0.7, 1-Sp(c)=0.3

An even lower threshold c:
TRUTH        -     +
Normal       4     6     10
Abnormal     1     9     10
Se(c)=0.9, 1-Sp(c)=0.6
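The construction above can be sketched in a few lines of Python. The score values below are hypothetical, chosen only so that thresholds 3, 2 and 1 reproduce the three 2x2 tables (10 O's and 10 X's):

```python
# Empirical ROC points for the make-up example.
normal   = [0.5]*4 + [1.5]*3 + [2.5]*2 + [3.5]*1   # O's (hypothetical scores)
abnormal = [0.5]*1 + [1.5]*2 + [2.5]*3 + [3.5]*4   # X's (hypothetical scores)

def roc_point(c):
    """Operating point when 'score > c' is called positive."""
    se  = sum(x > c for x in abnormal) / len(abnormal)   # TPF
    fpf = sum(x > c for x in normal) / len(normal)       # 1 - Sp
    return fpf, se

for c in (3, 2, 1):
    fpf, se = roc_point(c)
    print(f"c={c}: Se={se:.1f}, 1-Sp={fpf:.1f}")
# c=3: Se=0.4, 1-Sp=0.1
# c=2: Se=0.7, 1-Sp=0.3
# c=1: Se=0.9, 1-Sp=0.6
```

Sweeping c over all distinct score values (plus the two extremes) and connecting the resulting (1-Sp, Se) points traces the empirical ROC curve.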

Use of ROC: threshold bias removal

The ROC curve describes all Se-Sp pairs that can be obtained by changing the diagnostic threshold.

One of the classic uses of the ROC curve: can a new test achieve higher Se and Sp than the existing (conventional) test at some threshold?

A more informative comparison is usually based on comparing the two ROC curves (more on that later). (Gee et al., RSNA 2016)

Use of ROC: comparison of tests

Sometimes one diagnostic test is clearly better than the other (Zuley ML et al., Radiology 2013; Gee M et al., RSNA 2016; AUCs: 0.84 vs 0.89).

Sometimes the improvement is not uniform, i.e., one ROC curve is higher in some ranges and lower in others. The practical/clinical purpose or objectives of the study should then drive the decision as to which test is better for the considered application.

Use of ROC: evaluating a diagnostic test

Benchmark ROC curves:
- Perfect: two segments connecting at the corner (Se=1, Sp=1).
- Guessing (random choice): the diagonal, 1-Sp(c)=Se(c).

Tests with imperfect ROC curves are still useful:
- Tests with high ROC curves often lead to operating points with good Se and Sp (i.e., useful for both types of decisions).
- Tests with relatively low ROC curves can still be useful for targeted decisions, e.g.:
  - identifying a subset of "diseased": Se>>0, Sp≈1
  - identifying a subset of "disease-free": Se≈1, Sp>>0

It is not always straightforward to quantify how good an ROC curve is; multiple summary indices are available (discussed later).

Most typical ROC summary index

Area Under the ROC curve (AUC): a single-value summary of the entire ROC curve.
- Perfect ROC → AUC=1; guessing ROC → AUC=0.5.
- Quantifies the separation between the distributions of test results for diseased and non-diseased subjects (related to the well-known Wilcoxon statistic for comparing two groups).
- Interpretation: the probability that a randomly selected diseased subject tests "more suspicious" than a randomly selected non-diseased subject.

Technical advantages: well-known, objective, easy to use.

Limitations:
- not very clinically relevant;
- can be misleading (as can any scalar index for the entire curve), e.g., a non-guessing ROC can still have AUC=0.5;
- summarizes over operating points outside of practical interest (e.g., Sp<0.5).
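The Wilcoxon/Mann-Whitney interpretation can be checked directly by averaging over all diseased/non-diseased pairs, counting ties as 1/2. A sketch reusing the hypothetical make-up scores (10 normal, 10 abnormal):

```python
# Empirical AUC as the Mann-Whitney probability:
# P(diseased score > non-diseased score) + 0.5 * P(tie).
normal   = [0.5]*4 + [1.5]*3 + [2.5]*2 + [3.5]*1   # hypothetical scores
abnormal = [0.5]*1 + [1.5]*2 + [2.5]*3 + [3.5]*4   # hypothetical scores

auc = sum((x > y) + 0.5 * (x == y)
          for x in abnormal for y in normal) / (len(abnormal) * len(normal))
print(f"AUC = {auc:.3f}")
# AUC = 0.750
```

This pairwise average equals the trapezoidal area under the empirical ROC curve for the same data, which is why AUC inherits the "randomly selected diseased subject tests more suspicious" interpretation.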

Summary Indices for the ROC curve

Area under the ROC curve (AUC):
- objective, i.e., does not require subjective specifications;
- summarizes over operating points outside of practical interest (e.g., Sp<0.5);
- one of the more precise summary indices.

Partial AUC (pAUC) for s1 < Sp < s2:
- perfect pAUC = s2-s1; guessing pAUC = (s2-s1)(1-(s1+s2)/2);
- focuses on the operating points in the range of interest;
- uses more than one point from the curve (but is typically less precise than AUC).

Sensitivity corresponding to a given specificity:
- perfect Se(Sp=s) = 1; guessing Se(Sp=s) = 1-s;
- can be too subjective, due to the required knowledge of the needed Sp;
- relatively imprecise (typically requires larger sample sizes).
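The benchmark pAUC values can be verified by numeric integration of the two benchmark curves (perfect: Se = 1 everywhere; guessing: Se = 1-Sp). A sketch for a hypothetical range 0.80 < Sp < 0.95:

```python
# Benchmark partial AUCs over s1 < Sp < s2, by midpoint-rule integration.
s1, s2 = 0.80, 0.95          # hypothetical range of clinically relevant Sp

n = 10_000
ds = (s2 - s1) / n
grid = [s1 + (i + 0.5) * ds for i in range(n)]

pauc_perfect = sum(1.0 * ds for _ in grid)          # integral of 1 over (s1, s2)
pauc_guess   = sum((1.0 - s) * ds for s in grid)    # integral of (1 - Sp)

print(f"perfect pAUC = {pauc_perfect:.4f}")   # s2 - s1 = 0.1500
print(f"guess   pAUC = {pauc_guess:.4f}")
# closed form: (s2-s1)*(1-(s1+s2)/2) = 0.15 * 0.125 = 0.01875
```

The small guessing benchmark shows why a raw pAUC value is hard to read on its own; it is often rescaled against these benchmarks.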

Problems with scalar ROC indices

Scalar indices lose some of the information stored in the ROC curve, so different indices can contradict each other, e.g.:
- ROC curves with the same AUC can differ at almost all points;
- the ROC curve with the higher overall AUC can be lower in the range of interest (e.g., Sp>0.9).

Thus, it is very important to look at the ROC curve in addition to computing summary indices.

ROCs and binary tests

When one binary test is clearly better than the other (higher Se and Sp), does that mean it also has a better ROC curve?
- Yes, at least between the two known operating points (because an ROC curve is always non-decreasing), but not necessarily at all thresholds.

When a binary test has better PPV and NPV (in the same sample), does it also have a better ROC curve in the range between the two operating points?
- Yes, if the test can be assumed "reasonable" (or locally better than chance), e.g., a probability-of-malignancy score from a trained radiologist.
- This is not necessarily true for non-optimized tests, e.g., a raw biomarker measured in tissue, or a test with lower Sensitivity and Specificity.

Limitations

Typical limitations:
- An entire curve can be difficult to interpret.
- Single-number (scalar) indices of the ROC curve can be misleading.
- ROC curves for human observers might be difficult to interpret (it is good to support ROC-based findings with the accuracy of actual decisions).

Do you always need to use the ROC curve? No; in some cases consideration of a single operating point (Sp, Se) can provide sufficient information (Gee et al., RSNA 2016).

Does ROC analysis always provide a definite answer? No; e.g., when the ROC curves cross in the region of interest.

Overall recommendations

If a reliable estimate of the ROC curve is available:
- use the ROC curve to visually evaluate or compare diagnostic systems;
- use appropriate summary indices to quantify the results.

If a reliable estimate of the ROC curve is not available, but binary test results ("positive"/"negative") are known:
- try using a pair of intrinsic characteristics (Se, Sp, or functions thereof) to infer diagnostic accuracy;
- if only prevalence-dependent characteristics are available (e.g., predictive values), remember their dependence on the prevalence in the sample.

When using a scalar summary index:
- remember that no scalar summary index is better than the others under all circumstances; the study objectives should drive the choice;
- interpret the numeric value based on the values attained by the benchmark tests (perfect, guessing) under the same conditions.

ROC analysis

Many more complicated ROC techniques exist, e.g.:
- Multiple estimation approaches: parametric (e.g., "binormal" ROC), non-parametric (empirical), semi-parametric.
- Incorporation of information from other covariates: modeling the ROC curve or its indices (e.g., ROC-GLM).
- Application to time-to-event data: time-dependent ROC.
- Extensions to more than two classes: multi-class ROC analysis.
- Extensions to multiple targets per subject: free-response ROC (FROC), regions-of-interest (ROI) approach.

Most standard software packages offer some basic types of ROC analysis (SPSS, SAS, Stata, R, ...). Use of many advanced ROC techniques requires the help of a statistician experienced in the area.

A couple of great textbooks on ROC analysis and related topics:
- Zhou, X.H., Obuchowski, N.A., McClish, D.K. (2011). Statistical Methods in Diagnostic Medicine, 2nd edition. New York: Wiley & Sons Inc.
- Pepe, M.S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford: Oxford University Press.

THANK YOU!