Evidence-based Imaging: Critically Appraising Studies of Diagnostic Tests
Aine Marie Kelly, MD (B.A., M.B. B.Ch. B.A.O., M.S., M.R.C.P.I., F.R.C.R.)
No Financial Disclosures

Evidence Based Radiology
Integrates the best available research evidence with clinical expertise and patient values.
Sackett et al. Evidence Based Medicine: How to Practice and Teach EBM. Elsevier Churchill Livingstone, 2005.

How do we practice Evidence Based Medicine?
- Formulate a clinical question
- Identify the medical literature
- Critically appraise the medical literature
- Summarize the evidence
- Apply the evidence to derive the appropriate clinical action
Sackett et al. Evidence Based Medicine: How to Practice and Teach EBM. Elsevier Churchill Livingstone, 2005.

Critical Appraisal of the Diagnostic Literature
- Grade the literature
- Technology assessment in radiology
- Assess the materials and methods for validity and bias
- Assess the statistical strength of the results
- Ask the specific questions that apply to imaging tests

Levels of Evidence for Diagnostic Tests
- Level 1 (ideal): systematic review (SR) of RCTs; an RCT of appropriate size; or a validating cohort study with a uniformly good reference standard in an appropriate population
- Level 2 (strong): SR of cohort studies; or an exploratory cohort study with a good reference standard in a selected population
- Level 3 (moderate): SR of case-control studies; outcomes research; or case-control studies with a non-consistent reference standard
- Level 4 (weak): case series; or a poor or non-independent reference standard
- Level 5 (very weak): clinical evidence, descriptive studies, or reports of expert consensus committees
The Centre for Evidence Based Medicine, Oxford University, England. http://cebm.net/levels of evidence.asp#levels
Grading Evidence for Technology Assessment
- A: consistent level 1 studies
- B: consistent level 2 or 3 studies
- C: level 4 studies
- D: level 5 studies, or inconsistent or inconclusive studies of any level
Remedios D, McCoubrie P. Making the best use of clinical radiology services: a new approach to referral guidelines. Clin Radiol 2007;62(10):919-920.

Hierarchy of Imaging Efficacy
- Level 1 = Technical efficacy
- Level 2 = Diagnostic accuracy efficacy
- Level 3 = Diagnostic thinking efficacy
- Level 4 = Therapeutic efficacy
- Level 5 = Patient outcome efficacy
- Level 6 = Societal efficacy
Thornbury JR. Acad Radiol 1999;6:S58-S65. Fryback and Thornbury. Med Decis Making 1991;11:88-94. Mackenzie and Dixon. Clin Radiol 1995;50:513-518.

Imaging Effectiveness Hierarchy
- Level 1 (technical): Can the modality produce the image? Measures: noise, resolution (line pairs), MTF, grey scale, sharpness
- Level 2 (diagnostic accuracy): yield of abnormal or normal cases in a series. Measures: sensitivity, specificity, PPV, NPV, ROC height and area
- Level 3 (diagnostic thinking): number of cases in which the modality was useful in making a diagnosis. Measures: pre- and post-test probability, likelihood ratios
- Level 4 (therapeutic): number of times the modality was helpful in clinical decision making. Measures: altered or avoided treatments
- Level 5 (patient outcome): percentage of patients improved with the test, or change in quality-adjusted life expectancy. Measures: expected value of information, cost-effectiveness per QALY
- Level 6 (societal): cost-benefit analysis and cost-effectiveness analysis from the societal viewpoint
Thornbury JR. Acad Radiol 1999;6:S58-S65. Fryback and Thornbury. Med Decis Making 1991;11:88-94. Mackenzie and Dixon. Clin Radiol 1995;50:513-518.

Materials and Methods: Assess for Validity
- Was there an independent, blind comparison with a reference (gold) standard of diagnosis?
- Was the diagnostic test evaluated in an appropriate spectrum of patients (like those in whom it would be used in practice)?
- Was the reference standard applied regardless of the diagnostic test result?
- Was the test (or cluster of tests) validated in a second, independent group of patients?
Dodd JD et al.
Evidence-based radiology: how to quickly assess the validity and strength of publications in the diagnostic radiology literature. Eur Radiol 2004;14(5):915-922.

Materials and Methods: Assess for Bias
- Is the study original?
- How were the patients recruited? Was recruitment bias avoided?
- What were the inclusion and exclusion criteria?
- Were the subjects studied in real-life circumstances?
- What was the sample size?

Materials and Methods: Assess for Bias (continued)
- What exactly did they do? What outcome is measured, and why?
- Was systematic bias avoided or minimized?
- Was assessment blind?
- Was test review/diagnosis review bias avoided?
- Has workup/verification bias been avoided?
- Has spectrum bias been avoided?
Additional Issues for Diagnostic Tests
- Is this test potentially relevant to my practice?
- Did this validation study include an appropriate spectrum of participants?
- Was the test shown to be reproducible both within and between observers?
- Has a sensible normal range been derived from these results?
- Has the test been placed in the context of other potential tests in the diagnostic sequence for the condition?

Specific Questions that Apply to Imaging Tests
- Has the imaging modality been described in sufficient detail to reproduce it in your department?
- Have the imaging tests being evaluated and the gold standard been performed to the same standard of excellence?
- Have generations of technology development within the same modality been adequately considered in the study design and discussion?
- Has radiation exposure been considered?
- Were images reviewed on hard copy or on a monitor?
- Were images reviewed by a radiologist of sufficient experience?
Dodd JD et al. Evidence-based radiology: how to quickly assess the validity and strength of publications in the diagnostic radiology literature. Eur Radiol 2004;14(5):915-922.

Results: Assessment of Statistical Strength
- Sensitivity
- Specificity
- Confidence intervals
- Positive predictive value
- Negative predictive value
- Likelihood ratios

Sensitivity and Specificity
- Sensitivity = the proportion of patients with the diagnosis who have a positive test
- Specificity = the proportion of patients without the diagnosis who have a negative test
- Both are independent of disease prevalence
- Diagnostic threshold = the level of abnormality above which the test is considered positive and below which it is considered negative
[Figure: distributions of test results for disease-absent and disease-present patients along an axis of increasing abnormality. In the ideal situation the two distributions do not overlap and no analysis is required; in reality they overlap, so any choice of threshold trades sensitivity against specificity.]
Sensitivity and Specificity: 2 x 2 table

                        Reference (gold) standard
  Index test            Disease present       Disease absent        Total
  Positive              True Positive (A)     False Positive (B)    A+B
  Negative              False Negative (C)    True Negative (D)     C+D
  Total                 A+C                   B+D

Sensitivity = true positives / (true positives + false negatives) = A / (A+C)
Specificity = true negatives / (true negatives + false positives) = D / (B+D)

SpPIn and SnNOut
- If a test has high sensitivity (Sn), a negative (N) result effectively rules out (Out) the diagnosis: SnNOut
- If a test has high specificity (Sp), a positive (P) result effectively rules in (In) the diagnosis: SpPIn

Confidence Intervals
- Sensitivity and specificity are point estimates
- Confidence intervals provide a measure of how closely these estimates approximate the truth
- Sample size influences the width of the confidence interval
Dodd JD et al. Evidence-based radiology: how to quickly assess the validity and strength of publications in the diagnostic radiology literature. Eur Radiol 2004;14(5):915-922.

Positive Predictive Value
- Positive predictive value = the proportion of patients with a positive test who have the disease
- PPV = (sensitivity x prevalence) / (sensitivity x prevalence + (1 - specificity) x (1 - prevalence))
- PPV depends on the prevalence of disease
- In the 2 x 2 table: PPV = true positives / all positives = A / (A+B)
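As a worked illustration (mine, not from the original talk), the 2 x 2 definitions above can be computed in a few lines of Python. The counts are hypothetical, and the Wilson score interval is one reasonable choice for the confidence limits mentioned above:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score confidence interval for a proportion."""
    if n == 0:
        return (0.0, 0.0)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - half, centre + half)

def sensitivity_specificity(tp, fp, fn, tn):
    """Sensitivity = A/(A+C); specificity = D/(B+D); each with a 95% CI."""
    return {
        "sensitivity": tp / (tp + fn),
        "sensitivity_95ci": wilson_ci(tp, tp + fn),
        "specificity": tn / (fp + tn),
        "specificity_95ci": wilson_ci(tn, fp + tn),
    }

# Hypothetical 2 x 2 counts: A=90 TP, B=10 FP, C=10 FN, D=90 TN
result = sensitivity_specificity(tp=90, fp=10, fn=10, tn=90)
print(result["sensitivity"], result["specificity"])  # 0.9 0.9
```

With only 100 diseased and 100 non-diseased patients, the 95% interval around a sensitivity of 0.90 is still roughly 0.83 to 0.94, which is the point of reporting confidence intervals rather than the point estimate alone.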
Negative Predictive Value
- Negative predictive value = the proportion of patients with a negative test who do not have the disease
- NPV = (specificity x (1 - prevalence)) / (specificity x (1 - prevalence) + (1 - sensitivity) x prevalence)
- NPV depends on the prevalence of the disease
- In the 2 x 2 table: NPV = true negatives / all negatives = D / (C+D)

Predictive Values
Stein PD et al. MDCT for acute PE. NEJM 2006;354(22):2717-2727.

Likelihood Ratios
- The ratio of two probabilities: the probability of a positive result in patients with disease divided by the probability of a positive result in patients without disease; OR the probability of a negative result in patients with disease divided by the probability of a negative result in patients without disease
- Calculated from sensitivity and specificity:
  LR for a positive result = sensitivity / (1 - specificity)
  LR for a negative result = (1 - sensitivity) / specificity
- LRs range from 0 to infinity
- An LR of 0 excludes disease; an LR of infinity confirms disease; an LR of 1 means the test has no discriminating power
- LR > 10 = strongly positive; LR < 0.1 = strongly negative
Fagan's Nomogram: CTPA in PIOPED II (Likelihood Ratio Nomogram)
- Prevalence (pre-test probability) = 23.3%
- LR+ = 19.6
- LR- = 0.18
Stein PD et al. MDCT for acute PE. NEJM 2006;354(22):2717-2727.

How do we estimate the pre-test probability for a population or an individual?
[Figure: probability scale from 0% to 100%, with an exclusion threshold near the low end and an action threshold near the high end. 0% = very unlikely to have the disease; 25% = probably does not have it; 50% = don't know; 75% = probably does have it; 100% = very likely to have the disease.]
In practice, pre-test probability does not have to be expressed numerically.

Graph of Conditional Probabilities
- GCP = graph of pre-test versus post-test probability
- Both axes range from 0 to 1 (or 100%); separate curves are plotted for the positive and negative LR of the test
- Web-based programs: input the prevalence (pre-test probability), sensitivity and specificity
[Figure: GCP for a weak test — the test-positive and test-negative curves lie close to the diagonal, so the post-test probability differs little from the pre-test probability.]
[Figure: GCP for a strong test — the test-positive curve rises steeply above the diagonal and the test-negative curve stays well below it.]
MacEneaney PM, Malone DE. The meaning of diagnostic test results: a spreadsheet for swift data analysis. Clin Radiol 2000;55:227-235. Also at www.evidencebasedradiology.net
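The Fagan nomogram is Bayes' theorem worked on the odds scale, and the same arithmetic generates a graph of conditional probabilities if the pre-test probability is swept from 0 to 1. A small Python sketch (mine, not from the talk) reproduces the PIOPED II CTPA figures quoted above:

```python
def post_test_probability(pre_prob, lr):
    """Bayes on the odds scale: odds = p / (1 - p);
    post_odds = pre_odds * LR; probability = odds / (1 + odds)."""
    pre_odds = pre_prob / (1 - pre_prob)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

# PIOPED II CTPA figures from the slide: prevalence 23.3%, LR+ 19.6, LR- 0.18
pre = 0.233
print(round(post_test_probability(pre, 19.6), 3))  # 0.856: positive CTPA
print(round(post_test_probability(pre, 0.18), 3))  # 0.052: negative CTPA

# Sweeping pre-test probability traces the two curves of a GCP:
curve_pos = [post_test_probability(p / 100, 19.6) for p in range(1, 100)]
curve_neg = [post_test_probability(p / 100, 0.18) for p in range(1, 100)]
```

So at the PIOPED II prevalence, a positive CTPA raises the probability of PE from 23% to roughly 86%, and a negative CTPA lowers it to roughly 5%, matching what the nomogram shows graphically.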
Graph of Conditional Probabilities: MDCT in PIOPED II
[Figure: GCP plotting post-test probability of disease against pre-test probability, with separate curves for positive and negative CTPA results.]
Stein PD et al. MDCT for acute PE. NEJM 2006;354(22):2717-2727.

Critical Appraisal Guides
- STAndards for the Reporting of Diagnostic accuracy studies (STARD)
- CONsolidated Standards Of Reporting Trials (CONSORT)
- STrengthening the Reporting of OBservational studies in Epidemiology (STROBE)
- Standards for QUality Improvement Reporting Excellence (SQUIRE)
- Transparent Reporting of Evaluations with Non-randomized Designs (TREND)

STARD Guidelines
- Identify the article as a study of diagnostic accuracy (recommend MeSH heading 'sensitivity and specificity').
- State the research questions or study aims, such as estimating diagnostic accuracy or comparing accuracy between tests or across participant groups.
- The study population: the inclusion and exclusion criteria, and the settings and locations where the data were collected.
- Participant recruitment: was recruitment based on presenting symptoms, results from previous tests, or the fact that the participants had received the index tests or the reference standard?
- Participant sampling: was the study population a consecutive series of participants defined by the selection criteria (items 3 and 4)? If not, specify how participants were further selected.
- Data collection: was data collection planned before the index test and reference standard were performed (prospective study) or after (retrospective study)?
Bossuyt PM et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Radiology 2003;226(1):24-28.

STARD Guidelines (continued)
- The reference standard and its rationale.
- Technical specifications of the material and methods involved, including how and when measurements were taken; and/or cite references for the index tests and reference standard.
- Definition of and rationale for the units, cut-offs and/or categories of the results of the index tests and the reference standard.
- The number, training and expertise of the persons executing and reading the index tests and the reference standard.
- Whether or not the readers of the index tests and reference standard were blind (masked) to the results of the other test, and a description of any other clinical information available to the readers.
- Methods for calculating or comparing measures of diagnostic accuracy, and the statistical methods used to quantify uncertainty (e.g. 95% confidence intervals).
- Methods for calculating test reproducibility, if done.
Bossuyt PM et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Radiology 2003;226(1):24-28.

STARD Guidelines (continued)
- When the study was performed, including beginning and end dates of recruitment.
- Clinical and demographic characteristics of the study population (at least information on age, gender and spectrum of presenting symptoms).
- The number of participants satisfying the criteria for inclusion who did or did not undergo the index tests and/or the reference standard; describe why participants failed to undergo either (a flow diagram is strongly recommended).
- The time interval between the index tests and the reference standard, and any treatment administered in between.
- Distribution of severity of disease (define criteria) in those with the target condition; other diagnoses in participants without the target condition.
- A cross-tabulation of the results of the index tests (including indeterminate and missing results) by the results of the reference standard; for continuous results, the distribution of the test results by the results of the reference standard.
- Any adverse events from performing the index tests or the reference standard.
Bossuyt PM et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Radiology 2003;226(1):24-28.
- Estimates of diagnostic accuracy and measures of statistical uncertainty (e.g. 95% confidence intervals).
- How indeterminate results, missing data and outliers of the index tests were handled.
- Estimates of variability of diagnostic accuracy between subgroups of participants, readers or centers, if done.
- Estimates of test reproducibility, if done.
- Discuss the clinical applicability of the study findings.
Bossuyt PM et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Radiology 2003;226(1):24-28.
References
- Evidence Based Medicine: How to Practice and Teach EBM. Third Edition. Sharon E. Straus, W. Scott Richardson, Paul Glasziou and R. Brian Haynes. Elsevier Churchill Livingstone, 2005.
- Users' Guides to the Medical Literature: Essentials of Evidence-Based Clinical Practice. Gordon Guyatt MD and Drummond Rennie MD. AMA Press, 2002.
- How to Read a Paper: The Basics of Evidence Based Medicine. Third Edition. Trisha Greenhalgh. Blackwell Publishing, 2006.
- Evidence-Based Imaging. L. Santiago Medina and C. Craig Blackmore. Springer, 2006.