Critical reading of diagnostic imaging studies
Constantine Gatsonis, PhD, Brown University


Critical reading of diagnostic imaging studies
Constantine Gatsonis
Center for Statistical Sciences, Brown University
Annual Meeting

Lecture Goals
1. Review diagnostic imaging evaluation goals and endpoints.
2. Identify the building blocks of study design.
3. Describe common study designs, including studies of accuracy in diagnosis and prediction and studies of outcomes.
4. Discuss structured reporting of studies.

Topics
1. Diagnostic imaging evaluation: overview; measures of performance; reporting checklists
2. Elements of study design: studies of accuracy in detection; studies of accuracy in prediction; studies of outcomes
3. Generalizability and bias considerations
4. Reporting revisited

Topics
1. Diagnostic imaging evaluation: overview; reporting checklists; measures of performance

Imaging is integral to all key aspects of health care. The cancer paradigm: screening, diagnosis, staging, treatment response marker.

Diagnosis vs. therapy
The dominant paradigm for evaluating medical care is provided by the evaluation of therapy.
Diagnostic tests generate information, which is only one of the inputs into the decision-making process.
Most effects of the diagnostic information are mediated by therapeutic decisions.
Diagnostic technology evolves very rapidly (the "moving target" problem).

Patient presentation -> Diagnostic workup -> Treatment decisions -> Patient outcomes

Endpoints for diagnostic test evaluation
Accurate? Diagnostic performance: measures of accuracy, measures of predictive value.
Affects care? Intermediate processes of care: diagnostic thinking/decision making, therapeutic thinking/decision making.
Affects outcomes? Patient outcomes: quality of life, satisfaction, cost, mortality, morbidity.

STARD checklist (Standards for Reporting of Diagnostic Accuracy)
Sections: Title/Abstract; Introduction; Methods (Participants, Test methods, Statistical methods); Results (Participants, Test results, Estimates); Discussion.
Bossuyt P, Reitsma J, Bruns D, Gatsonis C, et al. Acad Radiology 2003;10:664-669.

Measures of accuracy
Test results can be binary (e.g. yes/no for presence of the target condition), ordinal categorical (e.g. degree of suspicion on a 5-point scale), or continuous (e.g. SUV for PET). Analyses of diagnostic performance typically assume a binary truth (e.g. disease present vs. disease absent).

The well-known 2x2 table for a binary test and binary truth:

Truth                      Test -    Test +    Total
Target condition absent    TN        FP        N-
Target condition present   FN        TP        N+
Total                      TN+FN     FP+TP     N

Measures of accuracy for binary tests
Starting with disease status:
Sensitivity: probability that the test result will be positive given that the target condition is present.
Sens = TPF = Pr(T+ | D+);  1 - Sens = FNF = Pr(T- | D+)
Specificity: probability that the test result will be negative given that the target condition is absent.
Spec = TNF = Pr(T- | D-);  1 - Spec = FPF = Pr(T+ | D-)

Starting with the test result:
Positive predictive value: PPV = Pr(D+ | T+)
Negative predictive value: NPV = Pr(D- | T-)
Positive likelihood ratio: LR+ = Pr(T+ | D+) / Pr(T+ | D-) = Sens / (1 - Spec)
Post-test odds = pre-test odds x LR

Fundamental conceptualization: the threshold for test positivity. (Figure: overlapping distributions of test results for diseased and non-diseased patients, with a vertical line marking the positivity threshold.)
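These quantities are simple functions of the 2x2 counts. Below is a minimal Python sketch; the counts and the 30% pre-test probability are illustrative assumptions, not data from any study cited in this lecture.

```python
# Accuracy measures from a 2x2 table of test result vs. truth.
# Counts below are illustrative placeholders, not data from the lecture.
TP, FP, FN, TN = 90, 40, 10, 360

sens = TP / (TP + FN)          # Pr(T+ | D+), true positive fraction
spec = TN / (TN + FP)          # Pr(T- | D-), true negative fraction
ppv  = TP / (TP + FP)          # Pr(D+ | T+)
npv  = TN / (TN + FN)          # Pr(D- | T-)
lr_pos = sens / (1 - spec)     # positive likelihood ratio

# Post-test odds = pre-test odds x LR+ (pre-test probability assumed 0.30)
pretest_prob = 0.30
pretest_odds = pretest_prob / (1 - pretest_prob)
posttest_odds = pretest_odds * lr_pos
posttest_prob = posttest_odds / (1 + posttest_odds)

print(f"Sens {sens:.2f}  Spec {spec:.2f}  PPV {ppv:.2f}  NPV {npv:.2f}")
print(f"LR+ {lr_pos:.2f}  post-test probability {posttest_prob:.2f}")
```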

ROC curves
An ROC curve is the plot of all pairs (1 - specificity, sensitivity) obtained as the positivity threshold varies (a short computational sketch follows this block). (Figure: ROC curve with sensitivity on the vertical axis and 1 - specificity on the horizontal axis, both ranging from 0.0 to 1.0.)

Topics
2. Elements of study design: studies of accuracy in detection; studies of accuracy in prediction; studies of outcomes

Fundamental building blocks of clinical study design
Clinical question.
Patient population: protocol inclusion/exclusion criteria.
Imaging intervention:
  Technique: technical characteristics of the modalities to be studied.
  Test interpretation: setting and reader population, blinding, experience, training, learning potential.
Measures of endpoints.
Reference information ("gold standard").
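As noted above, the empirical ROC curve and its area (AUC) can be computed directly from continuous test scores. A minimal Python sketch using simulated scores (the normal distributions and sample sizes are assumptions for illustration only):

```python
import numpy as np

# Empirical ROC curve: sweep the positivity threshold over the observed
# scores and record (1 - specificity, sensitivity) pairs. Scores are simulated.
rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, 200)    # test scores, target condition present
neg = rng.normal(0.0, 1.0, 400)    # test scores, target condition absent

thresholds = np.sort(np.concatenate([pos, neg]))[::-1]
sens = [(pos >= t).mean() for t in thresholds]             # TPF at each threshold
one_minus_spec = [(neg >= t).mean() for t in thresholds]   # FPF at each threshold

# AUC via its probability interpretation:
# Pr(score among diseased > score among non-diseased), ties counted as 1/2.
auc = (pos[:, None] > neg[None, :]).mean() + 0.5 * (pos[:, None] == neg[None, :]).mean()
print(f"Empirical AUC = {auc:.3f}")
# Plotting one_minus_spec vs. sens (e.g. with matplotlib) draws the ROC curve.
```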

Design template for a prospective study of accuracy
Enroll a predefined participant population.
Perform and interpret the imaging studies.
Evaluate the study endpoints.
Collect information on the reference standard (e.g. biopsy, follow-up).

Example: CT and MRI in staging cervical cancer (ACRIN 6651)
208 women with documented cervical cancer, scheduled to undergo surgery.
Participants undergo CT and MR to determine cancer stage prior to surgery; studies are interpreted locally and centrally.
Participants then undergo surgery; stage is determined by surgical and pathology data.
Diagnostic accuracy measures: predictive value, sensitivity, specificity, ROC.
Hricak, Gatsonis et al., JCO, 2005.

(Slide: Predictive value of CT and MRI for higher stage in cervical cancer. Results from ACRIN 6651, JCO 2005.)

(Slide: Sensitivity and specificity of CT and MRI for detecting advanced stage in cervical cancer. ACRIN 6651, JCO 2005.)

Digital vs. film mammography (ACRIN 6652, DMIST)
49,500 women undergoing routine mammography.
Participants undergo both digital and plain-film mammography (paired design).
Primary reading at participating institutions; secondary readings (of subsets) by selected readers.
Endpoints: difference in AUC (primary); sensitivity, specificity, predictive value; quality of life, cost, and cost-effectiveness.
Work-up of positive mammograms and 1-year follow-up information.
Pisano, Gatsonis et al., NEJM, 2005.

(Slide: Comparison of ROC curves for digital and film mammography. DMIST results, NEJM, Oct 2005: age <50, age 50+, full cohort.)
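DMIST's primary endpoint was the difference in AUC between the two modalities in a paired design. The trial itself used specialized ROC methodology; as a rough, hedged illustration of the idea only, the sketch below compares AUCs for two modalities read on the same simulated subjects with a paired bootstrap.

```python
import numpy as np

def auc(scores, truth):
    """Empirical AUC via the Mann-Whitney probability interpretation."""
    pos, neg = scores[truth == 1], scores[truth == 0]
    return (pos[:, None] > neg[None, :]).mean() + 0.5 * (pos[:, None] == neg[None, :]).mean()

# Paired design: every subject has a score from both modalities plus a
# reference-standard label. Data below are simulated for illustration only.
rng = np.random.default_rng(1)
n = 500
truth = rng.binomial(1, 0.3, n)
score_a = truth * 1.2 + rng.normal(0, 1, n)   # hypothetical "modality A"
score_b = truth * 0.9 + rng.normal(0, 1, n)   # hypothetical "modality B"

diffs = []
for _ in range(1000):                          # paired bootstrap over subjects
    idx = rng.integers(0, n, n)
    t = truth[idx]
    if t.min() == t.max():                     # resample must contain both classes
        continue
    diffs.append(auc(score_a[idx], t) - auc(score_b[idx], t))

lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"AUC difference (A - B): {auc(score_a, truth) - auc(score_b, truth):.3f}, "
      f"95% bootstrap CI ({lo:.3f}, {hi:.3f})")
```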

National CT Colonography study (ACRIN 6664)
2,600 participants scheduled to undergo routine colonoscopy.
Participants undergo CT colonography, interpreted at participating institutions, and then undergo colonoscopy.
Estimate sensitivity, specificity, PPV, and NPV of CT colonography with colonoscopy as the reference standard (a sketch of simple confidence intervals for such proportions follows this block).
Johnson et al., NEJM, Sept 2008.

Predictive value of imaging
Imaging as a biomarker: predicting the course of disease with or without therapy; serving as a surrogate marker for therapy study endpoints.
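Accuracy studies of this kind report each estimated proportion with a confidence interval. As a hedged sketch, not the trial's actual method, the Wilson score interval for a binomial proportion can be computed as follows; the counts are purely illustrative.

```python
from math import sqrt

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (approx. 95% by default)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Illustrative counts only: 90 of 100 diseased cases test positive,
# 450 of 500 non-diseased cases test negative.
print("Sensitivity 95% CI:", wilson_ci(90, 100))
print("Specificity 95% CI:", wilson_ci(450, 500))
```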

(Slide: ACRIN 6665 / RTOG S0132. Slide courtesy of T. Shields.)

Types of biomarkers
I. Biomarkers for disease onset or presence.
II. Biomarkers for the course of disease:
Prognostic biomarkers: used to predict disease outcome at the time of diagnosis, usually without reference to a particular therapy (e.g. tumor size or stage).
Predictive biomarkers: used to predict the outcome of a particular therapy (e.g. PET SUV change from baseline).
Monitoring biomarkers: used to measure response to treatment and to achieve early detection of disease progression or relapse (e.g. FDG-PET or DCE-MRI).

MRI as a predictor of response to therapy for breast cancer (ACRIN 6657, ISPY-1)
237 women receiving neoadjuvant breast cancer therapy.
Participants undergo breast MRI at pre-specified time points before, during, and after therapy.
Clinical information on response to therapy and survival is obtained.
Estimate the ability of MRI to predict response and long-term outcome.
Ongoing study.

FDG-PET as a predictor of survival after therapy for lung cancer (ACRIN 6668)
NSCLC patients undergoing chemoradiotherapy.
Participants undergo FDG-PET at baseline and after therapy.
Clinical information on response to therapy and survival is obtained.
Estimate the ability of FDG-PET to predict survival and response.
Ongoing study.

Research questions and statistical formulation
Example: Can the SUV change from baseline to week 2 of therapy predict survival?
Two approaches:
A. Compare survival curves among patient groups defined by a fixed cutoff value of the marker.
B. Compute sensitivity/specificity/ROC curves for a reference standard defined by the occurrence of a future event.
Note that A and B do not answer the same question. A asks: do metabolic responders live longer? B asks: can PET identify early those patients who will respond or survive past certain times?

A. Comparison of survival curves
Divide patients into two groups using a specific threshold of the marker (e.g. the median value, or another value obtained from previous studies), and compare the survival curves; or use regression analysis.
Potential problems: choice of threshold; choice of the specific form of the regression model.
This is a predictive-value type of analysis.
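To make approach A concrete, here is a minimal Python sketch that splits patients at the median marker value and computes a Kaplan-Meier survival estimate for each group. The data are simulated, the median cutoff is an illustrative choice, and a real analysis would normally use a dedicated survival package and a formal comparison such as a log-rank test.

```python
import numpy as np

def kaplan_meier(time, event):
    """Kaplan-Meier survival estimate S(t) at each distinct observed event time."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    surv, s = [], 1.0
    for t in np.unique(time[event == 1]):
        at_risk = np.sum(time >= t)
        deaths = np.sum((time == t) & (event == 1))
        s *= 1 - deaths / at_risk
        surv.append((t, s))
    return surv

# Approach A: split patients at the median marker value (e.g. % SUV change)
# and estimate a survival curve per group. All data are simulated.
rng = np.random.default_rng(2)
n = 120
marker = rng.normal(0, 1, n)                       # e.g. % SUV change
time = rng.exponential(1 / np.exp(0.7 * marker))   # larger marker -> shorter survival
event = rng.binomial(1, 0.8, n)                    # 1 = death observed, 0 = censored

high = marker >= np.median(marker)
for label, mask in [("high marker", high), ("low marker", ~high)]:
    curve = kaplan_meier(time[mask], event[mask])
    print(label, "estimated S(t) at last event time:", round(curve[-1][1], 3))
```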

B. Sensitivity/specificity/ROC analysis
The research question is viewed as detection of a future event. For a future time t:
Test result: SUV change.
Gold standard: dead or alive at time t.
Using a threshold on the SUV change, sensitivity and specificity are estimated. Using the continuous test result, an ROC curve can be derived (time-dependent ROC; a rough computational sketch follows this block).
Technical issue: censored observations present a technical challenge.
(Figure: ROC curves for two types of %S-phase measurements as predictors of survival after breast cancer diagnosis. Heagerty et al., Biometrics, 2000.)

Reliability and standardization
Reliability of measurements: e.g. are SUV measurements reliable for a single patient? Are SUV measurements comparable across institutions?
Assessment of reliability is an aim in current clinical trials, such as ACRIN 6678 (FDG-PET in advanced lung cancer patients).
Is there a required level of reliability for marker validation?
Standardization of measurements across institutions and machine types is challenging but absolutely essential.
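The sketch below illustrates approach B at a single landmark time, using simulated data. It handles censoring in the crudest possible way, by dropping patients censored before the landmark time; proper time-dependent ROC estimators (e.g. Heagerty et al.) account for censoring formally, so treat this only as a rough illustration.

```python
import numpy as np

def auc(scores, truth):
    """Empirical AUC: Pr(marker among 'cases' exceeds marker among 'controls')."""
    pos, neg = scores[truth == 1], scores[truth == 0]
    return (pos[:, None] > neg[None, :]).mean() + 0.5 * (pos[:, None] == neg[None, :]).mean()

# Simulated marker and survival data, for illustration only.
rng = np.random.default_rng(3)
n = 300
marker = rng.normal(0, 1, n)                      # e.g. % SUV change from baseline
death = rng.exponential(2 / np.exp(0.8 * marker)) # true survival time
censor = rng.uniform(0.5, 4.0, n)                 # censoring time
obs_time = np.minimum(death, censor)
event = (death <= censor).astype(int)             # 1 = death observed

t_star = 1.0                                      # landmark time
case = (event == 1) & (obs_time <= t_star)        # known dead by t*
control = obs_time > t_star                       # known alive at t*
known = case | control                            # drop those censored before t* (naive)

print(f"AUC for predicting death by t = {t_star}:",
      round(auc(marker[known], case[known].astype(int)), 3))
```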

Studies of patient outcomes
Patient presentation -> Diagnostic workup -> Treatment decisions -> Patient outcomes

National Lung Screening Trial (ACRIN 6654)
52,000 participants at high risk for lung cancer, randomized to:
CXR screening: baseline, plus two annual screens; or
Helical CT screening: baseline, plus two annual screens.
Participants are followed up in subsequent years.
Primary endpoint: comparison of lung cancer mortality between the randomization arms.

ACRIN 4005
Low- to intermediate-risk patients: people 30 and older who present to the ED with symptoms consistent with potential ACS. Randomization (N = 1,365):
Group A: traditional rule-out arm.
Group B: CT coronary angiography rule-out arm (positive test vs. normal test).
Follow-up: 30 days and 1 year.

Another example of a randomized study of outcomes
Patients with lower back pain; baseline functional status and HRQL are measured. Randomize to fast MRI vs. plain film, then compare process of care, patient outcomes, and health care utilization.
Jarvik et al., Radiology 1997; JAMA 2003.

Topics
3. Generalizability and bias considerations

Building blocks of study design (revisited)
Clinical question; patient population (protocol inclusion/exclusion criteria); imaging intervention (technique and test interpretation); measures of endpoints; reference information ("gold standard").

Clinical question / patient population / case mix / spectrum
Start by defining the clinical setting.
Recruit a representative sample (a consecutive clinical series usually works but is difficult to achieve).
Ensure that patients with all forms of the disease are in the sample.
Sample prevalence may influence interpretation, because of limited spectrum or, even with a representative spectrum, because of factors such as reader vigilance (a short sketch after this block shows how predictive values shift with prevalence).

Imaging intervention: technique
Technical characteristics of the imaging process:
Precise description of techniques, reproducible at other clinics.
Should reflect expected clinical practice, but this often varies across institutions.
Set minimum acceptable techniques, or an allowable range.
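Because PPV = Pr(D+ | T+) depends on prevalence through Bayes' theorem, the same sensitivity and specificity can yield very different predictive values in different settings. A small Python sketch with illustrative operating characteristics (not taken from any study above):

```python
# PPV and NPV as a function of disease prevalence, via Bayes' theorem.
# Sensitivity and specificity below are illustrative, not from any cited study.
sens, spec = 0.90, 0.90

for prev in [0.01, 0.05, 0.20, 0.50]:
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    print(f"prevalence {prev:4.2f}:  PPV {ppv:.2f}   NPV {npv:.2f}")
```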

Imaging intervention: reader population
Expert readers or professionals at large? There is variation across readers and institutions, and in the extent of reader experience.
We want to generalize beyond the sample of readers on the study, but do not want to bias against a new technology if readers have little experience with it.

The moving target problem within studies
Most changes occurring during a study are unlikely to affect endpoints appreciably (the "hottest thing since sliced bread" syndrome).
Changes need to be anticipated and implemented in a controlled way during the study.
Statistical methods for adaptive study design and analysis, usually based on Bayesian approaches, can be used.

Reference information ("gold standard")
The procedure for determining the presence of the target condition usually (but not always) involves pathology (or cytology) examination.
Pathology may need to be supplemented with follow-up for subjects who do not have biopsies.
The reference standard needs an unambiguous definition and uniform assessment across centers; sometimes a "truth committee" is used.

Reference information ("gold standard"), continued
Variability among pathologists often persists even when common protocols are in place.
Do all studies need central pathology interpretation?
The reference ("gold standard") information may itself be in error.

Bias: sources and control
What is bias? A systematic deviation of study-based inferences (e.g. estimates of accuracy) from the true values of the parameters describing the phenomenon under study (e.g. the true accuracy of the modality).

Biases in studies of diagnostic tests
Although some statistical methods for bias correction are available, the potential for bias is best addressed at the design stage.
Common types of bias: verification bias (work-up bias); interpretation bias; uninterpretable tests; selection/referral bias; temporal effects (the moving target problem, learning effects).

Verification bias: example
Suppose the full results of a study would be:

Test result   Disease +   Disease -   Total
Positive      160         30          190
Negative      120         510         630
Total         280         540         820

Then we would estimate: sensitivity = 160/280 = 57%, specificity = 510/540 = 94%.

However, the investigators had the reference standard only on a subset of cases:

Test result   Disease +   Disease -   Total
Positive      80          15          95
Negative      20          85          105
Total         100         100         200

New estimates: sensitivity = 80%, specificity = 85%.
Note: 50% of the cases testing positive were verified, compared to 17% of the cases testing negative.
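The arithmetic above, and one standard correction when the verification fractions are known, can be written out directly. The weighting used below (each verified case weighted by the inverse of its probability of verification given the test result) is a Begg-and-Greenes-style correction; it is offered as an illustrative sketch, not as an analysis described on the slide.

```python
# Verification bias: reproduce the estimates from the slide and correct the
# verified-subset estimates by weighting each verified case by the inverse
# of its verification probability (which here depends only on the test result).

# Verified subset (reference standard obtained):   (D+, D-)
verified = {"T+": (80, 15),        # 95 of 190 test-positives verified
            "T-": (20, 85)}        # 105 of 630 test-negatives verified
n_tested = {"T+": 190, "T-": 630}  # all tested cases, verified or not

def sens_spec(tp, fp, fn, tn):
    return tp / (tp + fn), tn / (tn + fp)

# Naive estimates from verified cases only (biased).
tp, fp = verified["T+"]
fn, tn = verified["T-"]
print("Verified subset only:  sens %.2f, spec %.2f" % sens_spec(tp, fp, fn, tn))

# Inverse-probability-of-verification weighting.
w_pos = n_tested["T+"] / sum(verified["T+"])   # 190 / 95  = 2
w_neg = n_tested["T-"] / sum(verified["T-"])   # 630 / 105 = 6
corrected = sens_spec(tp * w_pos, fp * w_pos, fn * w_neg, tn * w_neg)
print("Bias-corrected:        sens %.2f, spec %.2f" % corrected)
```

Running this recovers the slide's numbers: the verified subset gives 0.80 and 0.85, while the weighted correction returns the full-data values of 0.57 and 0.94.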

Verification bias
May occur if definitive information is selectively available on a subset of cases, especially when selection for disease verification depends on the test result.
Assuming that all participants without verified disease are disease-free leads to bias.
Avoid it by verifying all cases; it may suffice to verify a randomly selected subset. Statistical correction is possible.

Interpretation bias
Occurs when the test interpreter knows the results of other tests for the disease or (even worse) the reference standard.

Possible effect of interpretation bias:

          Test +   Test -
Truth +   a        b
Truth -   c        d

If the test is interpreted with knowledge of the reference standard, both sensitivity and specificity could be overestimated.

Interpretation bias (continued)
Blinding to the reference standard is absolutely essential. However, the broader question is: what information should be available to the reader? The answer depends on what is actually being assessed. Some questions: Is total blinding to other information necessary? Will such blinding enhance the validity and generalizability of the study?

What information should be available to the reader?
The evaluation of a diagnostic test needs to take into account the context in which the test will be used. The study design needs to specify the types and amount of information available to the reader, and the information available to the test interpreter should correspond to the context to which the study results would generalize.
Example: in studies of contrast agents, if the reading of contrast-enhanced scans will always be done clinically with access to the baseline (non-contrast-enhanced) scan, will the results be biased if contrast-enhanced images are interpreted with access to the baseline images?

Uninterpretable test results
A common problem. What is meant by an uninterpretable result: technically unacceptable? equivocal? Equivocal/intermediate results are not uninterpretable.
Excluding uninterpretable test outcomes may result in bias, and often leads to over-estimates of test accuracy.
If at all possible, participants should be followed to determine the reference standard information.
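As a toy illustration of how the handling of uninterpretable results moves the estimates, the sketch below uses entirely hypothetical counts (not from any study in this lecture) and compares excluding them with counting them as test-negative or test-positive.

```python
# Hypothetical counts, purely for illustration (not from any cited study):
# among diseased:     80 T+, 10 T-, 10 uninterpretable
# among non-diseased: 20 T+, 170 T-, 10 uninterpretable

def sens_spec(tp, fp, fn, tn):
    return tp / (tp + fn), tn / (tn + fp)

print("Uninterpretables excluded:         sens %.2f  spec %.2f" % sens_spec(80, 20, 10, 170))
print("Uninterpretables as test-negative: sens %.2f  spec %.2f" % sens_spec(80, 20, 20, 180))
print("Uninterpretables as test-positive: sens %.2f  spec %.2f" % sens_spec(90, 30, 10, 170))
```

In this made-up example, excluding the uninterpretable results yields the most favorable-looking sensitivity, which is the over-estimation the slide warns about; the honest options are to report their frequency and analyze them explicitly.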

Investigating uninterpretable results
If the test were to be repeated, could the problem be resolved?
No: a new test-result category is needed.
Yes: the test may be repeated, but the frequency of uninterpretable results still needs to be reported.
Are uninterpretable results more common in diseased cases? If yes, an uninterpretable result may itself be of value in diagnosis/prediction.

Topics
4. Reporting revisited

STARD checklist
(Slides: the full STARD checklist, shown in detail.)

REMARK checklist
Reporting Recommendations for Tumor Marker Prognostic Studies. McShane et al., JNCI 2005;97:1180-4.

Review of Topics
1. Diagnostic imaging evaluation: overview; measures of performance; reporting checklists
2. Elements of study design: studies of accuracy in detection; studies of accuracy in prediction; studies of outcomes
3. Generalizability and bias considerations
4. Reporting revisited