Biostatistics and Research Design in Dentistry

Reading Assignment
Measuring the accuracy of diagnostic procedures and Using sensitivity and specificity to revise probabilities, in Chapter 12 of Dawson & Trapp, Basic & Clinical Biostatistics.

Objectives of this chapter
The objective is to understand the diagnostic uses of a 2x2 table and how to interpret the terms associated with the table.

                        (true) Disease Condition
Test            +                    -                    Total
+               a                    b                    a + b
                True Positive        False Positive
-               c                    d                    c + d
                False Negative       True Negative
Total           a + c                b + d                a + b + c + d

Prevalence: proportion of the population affected with the disease = (a + c) / (a + b + c + d).

Sensitivity: proportion of true positives among those with the disease = a / (a + c). A sensitive test is a good screening test because it identifies most of the people who have the disease, and perhaps a few who do not.

Specificity: proportion of true negatives among those without the disease = d / (b + d). A specific test is a good diagnostic test because it identifies most of the people who do not have the disease, and maybe a few who do.

False positive rate: proportion of false positives among those without the disease = b / (b + d) = 1 - specificity.

False negative rate: proportion of false negatives among those with the disease = c / (a + c) = 1 - sensitivity.

Positive predictive value (PPV): proportion of subjects with a positive test result who have the disease. In terms of prevalence, PPV = (prevalence)(sensitivity) / [(prevalence)(sensitivity) + (1 - prevalence)(1 - specificity)]. If the table reflects the true prevalence, then PPV = a / (a + b).

Negative predictive value (NPV): proportion of subjects with a negative test result who do not have the disease. In terms of prevalence, NPV = (1 - prevalence)(specificity) / [(1 - prevalence)(specificity) + (prevalence)(1 - sensitivity)]. If the table reflects the true prevalence, then NPV = d / (c + d).

Accuracy: proportion of correct results, that is, the probability that the test classifies a subject correctly = (a + d) / (a + b + c + d) = (prevalence)(sensitivity) + (1 - prevalence)(specificity).

Likelihood ratio for a positive test: how much more likely a positive result is in a subject who has the disease than in one who does not = [a / (a + c)] / [b / (b + d)] = sensitivity / (1 - specificity).

Likelihood ratio for a negative test: how much more likely a negative result is in a subject who has the disease than in one who does not = [c / (a + c)] / [d / (b + d)] = (1 - sensitivity) / specificity.
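These quantities are simple functions of the four cell counts. The sketch below is a minimal illustration, not part of Dawson & Trapp, and the function name is invented for demonstration; as a check it uses the ST-elevation counts from Table 12-3 on the next page.

```python
# Minimal sketch: diagnostic-test measures from the four cells of a 2x2 table.
# a = true positives, b = false positives, c = false negatives, d = true negatives.

def diagnostic_measures(a, b, c, d):
    n = a + b + c + d
    sens = a / (a + c)
    spec = d / (b + d)
    return {
        "prevalence": (a + c) / n,
        "sensitivity": sens,
        "specificity": spec,
        "false positive rate": b / (b + d),   # = 1 - specificity
        "false negative rate": c / (a + c),   # = 1 - sensitivity
        "PPV": a / (a + b),
        "NPV": d / (c + d),
        "accuracy": (a + d) / n,
        "LR+": sens / (1 - spec),
        "LR-": (1 - sens) / spec,
    }

# Check against Table 12-3 (ST elevation) on the next page: a = 6, b = 13, c = 25, d = 59.
for name, value in diagnostic_measures(6, 13, 25, 59).items():
    print(f"{name}: {value:.3f}")
```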

Estimating Sensitivity & Specificity

Table 12-3: ST elevation

                Disease +      Disease -      Total
Test +              6              13            19
Test -             25              59            84
Total              31              72           103

Sensitivity = 6/31 = 19.4%
Specificity = 59/72 = 81.9%
False positive rate = 13/72 = 18.1%
False negative rate = 25/31 = 80.6%
PPV = 6/19 = 31.6%
NPV = 59/84 = 70.2%
Accuracy = (6 + 59)/103 = 63.1%
Prevalence = 31/103 = 30.1%

Dawson gives this bottom line:
1. To rule out a disease, we want to be sure that a negative result is really negative; therefore, not very many false negatives should occur. A sensitive test is the best choice for obtaining as few false negatives as possible if factors such as cost and risk are similar; that is, high sensitivity helps rule out disease when the test is negative. As a handy acronym, if we abbreviate sensitivity by SN and use a sensitive test to rule OUT, we have SNOUT.
2. To find evidence of a disease, we want a positive result to indicate a high probability that the patient has the disease; that is, a positive test result should really indicate disease. We therefore want few false positives. The best way to achieve this is a highly specific test; that is, high specificity helps rule in disease when the test is positive. Again, if we abbreviate specificity by SP and use a specific test to rule IN, we have SPIN.
3. To make accurate diagnoses, we must understand the role of the prior probability of disease. If the prior probability of disease is extremely small, a positive result does not mean very much and should be followed by a test that is highly specific. The usefulness of a negative result depends on the sensitivity of the test.

Effect of Prevalence

Assume a prevalence of 20% (sensitivity 78%, specificity 67%):

                Disease +      Disease -      Total
Test +             156            264           420
Test -              44            536           580
Total              200            800          1000

Sensitivity = 78.0%
Specificity = 67.0%
False positive rate = 33.0%
False negative rate = 22.0%
PPV = 156/420 = 37.1%
NPV = 536/580 = 92.4%
Accuracy = 69.2%
Prevalence (assumed) = 20.0%
Likelihood ratio for a positive test = 2.4
Likelihood ratio for a negative test = 0.3
Odds ratio = 7.20

Now change the prevalence to 50% (same sensitivity and specificity):

                Disease +      Disease -      Total
Test +             390            165           555
Test -             110            335           445
Total              500            500          1000

Sensitivity = 78.0%
Specificity = 67.0%
False positive rate = 33.0%
False negative rate = 22.0%
PPV = 390/555 = 70.3%
NPV = 335/445 = 75.3%
Accuracy = 72.5%
Prevalence (assumed) = 50.0%
Likelihood ratio for a positive test = 2.4
Likelihood ratio for a negative test = 0.3
Odds ratio = 7.20
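The effect of prevalence can also be seen directly from the prevalence-based formulas on page 1, without building the full table. The short sketch below is illustrative only (the function name is not from the textbook); it reproduces the predictive values in the two tables above.

```python
# Minimal sketch: predictive values from sensitivity, specificity, and prevalence,
# using PPV = p*sens / (p*sens + (1-p)*(1-spec)) and
#       NPV = (1-p)*spec / ((1-p)*spec + p*(1-sens)).

def predictive_values(sensitivity, specificity, prevalence):
    p = prevalence
    ppv = p * sensitivity / (p * sensitivity + (1 - p) * (1 - specificity))
    npv = (1 - p) * specificity / ((1 - p) * specificity + p * (1 - sensitivity))
    return ppv, npv

for prev in (0.20, 0.50):
    ppv, npv = predictive_values(0.78, 0.67, prev)
    print(f"prevalence {prev:.0%}: PPV = {ppv:.1%}, NPV = {npv:.1%}")
# prevalence 20%: PPV = 37.1%, NPV = 92.4%
# prevalence 50%: PPV = 70.3%, NPV = 75.3%
```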

Effect of Accuracy

A more accurate test (sensitivity 95%, specificity 95%) at 20% prevalence:

                Disease +      Disease -      Total
Test +             190             40           230
Test -              10            760           770
Total              200            800          1000

Sensitivity = 95.0%
Specificity = 95.0%
False positive rate = 5.0%
False negative rate = 5.0%
PPV = 190/230 = 82.6%
NPV = 760/770 = 98.7%
Accuracy = 95.0%
Prevalence (assumed) = 20.0%
Likelihood ratio for a positive test = 19.0
Likelihood ratio for a negative test = 0.1
Odds ratio = 361.00

The same test at 5% prevalence:

                Disease +      Disease -      Total
Test +             47.5           47.5           95
Test -              2.5          902.5          905
Total              50             950          1000

Sensitivity = 95.0%
Specificity = 95.0%
False positive rate = 5.0%
False negative rate = 5.0%
PPV = 47.5/95 = 50.0%
NPV = 902.5/905 = 99.7%
Accuracy = 95.0%
Prevalence (assumed) = 5.0%
Likelihood ratio for a positive test = 19.0
Likelihood ratio for a negative test = 0.1
Odds ratio = 361.00
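Note that the likelihood ratios and the odds ratio depend only on sensitivity and specificity, which is why they are identical in the 20% and 5% prevalence tables above. A brief sketch (illustrative function name, not from the textbook):

```python
# Minimal sketch: likelihood ratios and the diagnostic odds ratio depend only on
# sensitivity and specificity, so they are unchanged when prevalence changes.

def likelihood_ratios(sensitivity, specificity):
    lr_pos = sensitivity / (1 - specificity)   # LR+
    lr_neg = (1 - sensitivity) / specificity   # LR-
    odds_ratio = lr_pos / lr_neg               # equals (a*d)/(b*c) in the 2x2 table
    return lr_pos, lr_neg, odds_ratio

lr_pos, lr_neg, odds_ratio = likelihood_ratios(0.95, 0.95)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.3f}, odds ratio = {odds_ratio:.0f}")
# LR+ = 19.0, LR- = 0.053 (rounded to 0.1 in the tables above), odds ratio = 361
```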

Agreement

Reading Assignment
Measuring agreement, in Chapter 5 of Dawson & Trapp, Basic & Clinical Biostatistics.

Objectives of this section
The objective is to understand how to describe agreement between two imperfect measures, to understand how chance can affect apparent agreement, and to interpret kappa.

Variability is pervasive
All measurements vary, depending on:
- actual changes in the characteristic being measured
- variation introduced by the examiner
- variation of the measurement method

In addition, measurements by one individual may be affected by the measurements obtained by another. The process of measuring may change the characteristic. Expectancy (what you expect to see influences what you see) introduces bias unless each examiner is blinded to all other examiners' results.

Reliability
Reliability is reproducibility. Whether or not there is a true gold standard, do different measurements agree with each other? Reliability has nothing to do with agreement with what is true, only with how close two (error-prone) measures are to each other.

Intrarater reliability is the reproducibility of measures by the same examiner (one examiner measures the same characteristic twice). Sometimes called within-examiner reliability.

Interrater reliability is the agreement of measures by different examiners (two examiners measure the same characteristic). Sometimes called between-examiner reliability.

Test-retest reliability is used in the context of questionnaires. It is like intrarater reliability, but since questionnaires often have multiple items (supposedly) getting at the same construct, we can also obtain a measure of internal consistency.

[Reliability is not quantified by measures such as sensitivity, specificity, false positive rate, and false negative rate; these all assume you know the true value.]

It is often of interest to assess how well different classification methods agree. The different classification methods may be multiple raters making a clinical diagnosis, multiple software algorithms classifying digitized images, scores from different rating scales determining probable etiology, or any two methods that yield classifications of individuals. In these situations there is no true or known classification, so assessing reliability (repeatability, reproducibility, agreement) is of interest. This is in contrast to being interested in the validity of a classification scheme.

In the most typical case, each of N subjects is classified into one of R categories by two classification methods, and the observations may be summarized in an R x R contingency table where rows describe classification by one method and columns describe classification by the other method. If n_ij is the number of subjects classified into row category i and column category j, then one natural index of raw agreement is the proportion of the N subjects on which the two classification methods agree:

    p_o = (n_11 + n_22 + ... + n_RR) / N

The problem with p_o is that it reflects both chance agreement and agreement beyond chance. That it reflects chance agreement is easily seen in the following example:

Assume the prevalence of characteristic A in a population of interest is 0.95. Further, assume that one rater uses information to classify subjects as A or not-A. Note that if the other rater simply diagnoses every patient as A, the two raters will agree on 95% of subjects, giving p_o = 0.95. Thus, a simple proportion-agreement score is insufficient to assess reliability.

The proportion of agreement expected by chance, p_e, is easily calculated from the marginal proportions of the two raters, exactly as in the chi-square test of independence. To obtain a chance-corrected index of agreement, Cohen [1] defined the kappa index:

    κ = (p_o - p_e) / (1 - p_e)

He describes this as the proportion of agreement after chance agreement is removed from consideration. Landis and Koch [1977, The measurement of observer agreement for categorical data, Biometrics 33, 159-174] suggest that κ < 0.40 reflects poor agreement, 0.40 ≤ κ < 0.75 reflects fair to good agreement, and κ > 0.75 reflects excellent agreement (also see the table on the bottom of page 119).

Example: A study was done to compare different methods for packing filling material. Two methods were used to assess voids in the fillings. A portion of the report stated: "Across all of the assessments, the agreement between the two methods (radiograph and microscope) was good, with over 80% of the assessments in complete agreement (see the table below). The largest disagreement occurred where no voids were evident with the microscope but the radiographic method indicated a void (n = 25 + 29 cases). There were also n = 18 cases where the radiograph indicated no void but the microscope indicated that more than half of the area was incomplete."

Observed counts (percent of the 500 assessments in parentheses):

                            Microscope
Radiograph           no voids    <50% incomplete    >50% incomplete    Total
no voids               330             9                  18            357
                      (66.0)         (1.8)              (3.6)         (71.4)
<50% incomplete         25             5                   9             39
                       (5.0)         (1.0)              (1.8)          (7.8)
>50% incomplete         29             8                  67            104
                       (5.8)         (1.6)             (13.4)         (20.8)
Total                  384            22                  94            500
(Percent)             (76.8)         (4.4)             (18.8)        (100.0)

Observed agreement: p_o = (330 + 5 + 67) / 500 = 402 / 500 = 0.804
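The chance-agreement problem in the 0.95-prevalence example, and Cohen's correction for it, can be reproduced in a few lines. This is a minimal sketch (the function name and the 100-subject cross-tabulation are illustrative, not from the textbook): one rater labels 95 of 100 subjects A, the other labels everyone A, so raw agreement is 0.95 while κ is 0.

```python
import numpy as np

def cohen_kappa(table):
    """Cohen's kappa from an R x R cross-tabulation of two raters' classifications."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    p_o = np.trace(table) / n          # observed agreement
    row = table.sum(axis=1) / n        # marginal proportions for rater 1
    col = table.sum(axis=0) / n        # marginal proportions for rater 2
    p_e = np.sum(row * col)            # agreement expected by chance
    return p_o, p_e, (p_o - p_e) / (1 - p_e)

# Rater 1 classifies 95 of 100 subjects as A; rater 2 simply calls everyone A.
#                 rater 2: A  not-A
table = [[95, 0],    # rater 1: A
         [ 5, 0]]    # rater 1: not-A
p_o, p_e, kappa = cohen_kappa(table)
print(p_o, p_e, kappa)   # 0.95 0.95 0.0
```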

To compute kappa we also need p_e, the amount of agreement we would expect by chance alone. The expected counts, computed from the marginal totals (row total x column total / 500), are:

Expected counts (percent of the 500 assessments in parentheses):

                            Microscope
Radiograph           no voids    <50% incomplete    >50% incomplete    Total
no voids              274.176        15.708             67.116          357
                      (54.8)          (3.1)             (13.4)        (71.4)
<50% incomplete        29.952         1.716              7.332           39
                       (6.0)          (0.3)              (1.5)         (7.8)
>50% incomplete        79.872         4.576             19.552          104
                      (16.0)          (0.9)              (3.9)        (20.8)
Total                  384             22                 94            500
(Percent)             (76.8)          (4.4)             (18.8)       (100.0)

Expected agreement: p_e = (274.2 + 1.7 + 19.6) / 500 = 295.4 / 500 = 0.591

So the chance-corrected measure of agreement is:

    Kappa = (0.804 - 0.591) / (1 - 0.591) = 0.213 / 0.409 = 0.521   (SE = 0.039)

Question: What is your conclusion now? Do the microscope method and the radiograph method agree in their assessment of filling voids?
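As an arithmetic check on the example above, the short sketch below (illustrative, not part of the handout) builds the expected counts from the marginal totals and recomputes the observed agreement, the chance-expected agreement, and kappa for the radiograph/microscope table.

```python
import numpy as np

# Observed radiograph (rows) x microscope (columns) counts from the voids example.
observed = np.array([[330,  9, 18],    # radiograph: no voids
                     [ 25,  5,  9],    # radiograph: <50% incomplete
                     [ 29,  8, 67]],   # radiograph: >50% incomplete
                    dtype=float)

n = observed.sum()                                               # 500
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n

p_o = np.trace(observed) / n                                     # 0.804
p_e = np.trace(expected) / n                                     # 0.591
kappa = (p_o - p_e) / (1 - p_e)                                  # 0.521

print(np.round(expected, 3))
print(f"p_o = {p_o:.3f}, p_e = {p_e:.3f}, kappa = {kappa:.3f}")
```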