Biostatistics and Research Design in Dentistry

Reading Assignment
"Measuring the accuracy of diagnostic procedures" and "Using sensitivity and specificity to revise probabilities," in Chapter 12 of Dawson & Trapp, Basic & Clinical Biostatistics.

Objectives of this chapter
The objective is to understand the diagnostic uses of a 2x2 table, and how to interpret the terms associated with the table.

                      Disease Condition (true)
    Test          +                  -                  Total
     +            a                  b                  a + b
                  (True Positive)    (False Positive)
     -            c                  d                  c + d
                  (False Negative)   (True Negative)
    Total         a + c              b + d              a + b + c + d

Prevalence
Proportion of the population affected with the disease.
= (a + c) / (a + b + c + d)

Sensitivity
Proportion of true positives.
= a / (a + c)
A sensitive test is a good screening test because it identifies most of the people who have the disease, and perhaps a few who do not.

Specificity
Proportion of true negatives.
= d / (b + d)
A specific test is a good diagnostic test because it identifies most of the people who do not have the disease, and maybe a few who do.

False Positive Rate
Proportion of false positives.
= b / (b + d) = 1 - specificity

False Negative Rate
Proportion of false negatives.
= c / (a + c) = 1 - sensitivity

Positive Predictive Value (PPV)
Proportion of subjects with a positive test result who have the disease.
= (prevalence)(sensitivity) / [(prevalence)(sensitivity) + (1 - prevalence)(1 - specificity)]
If the table reflects prevalence, then PPV = a / (a + b).

Negative Predictive Value (NPV)
Proportion of subjects with a negative test result who do not have the disease.
= (1 - prevalence)(specificity) / [(1 - prevalence)(specificity) + (prevalence)(1 - sensitivity)]
If the table reflects prevalence, then NPV = d / (c + d).

Accuracy
Proportion of correct results, or the probability that the test will detect true findings.
= (prevalence)(sensitivity) + (1 - prevalence)(specificity)
If the table reflects prevalence, then accuracy = (a + d) / (a + b + c + d).

Likelihood Ratio for a Positive Test
Odds that a positive test result would occur in a subject who has the disease versus one who does not.
= (a / (a + c)) / (b / (b + d)) = sensitivity / (1 - specificity)

Likelihood Ratio for a Negative Test
Odds that a negative test result would occur in a subject who has the disease versus one who does not.
= (c / (a + c)) / (d / (b + d)) = (1 - sensitivity) / specificity
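To make these definitions concrete, here is a minimal Python sketch (the function name and structure are my own, not from Dawson & Trapp) that computes each index from the four cell counts of a 2x2 table that reflects prevalence:

    # Diagnostic indices from a 2x2 table (a = TP, b = FP, c = FN, d = TN).
    # Assumes the table reflects prevalence, so PPV = a/(a+b) and NPV = d/(c+d).
    def diagnostic_indices(a, b, c, d):
        n = a + b + c + d
        sens = a / (a + c)          # sensitivity
        spec = d / (b + d)          # specificity
        return {
            "prevalence": (a + c) / n,
            "sensitivity": sens,
            "specificity": spec,
            "false positive rate": b / (b + d),   # 1 - specificity
            "false negative rate": c / (a + c),   # 1 - sensitivity
            "PPV": a / (a + b),
            "NPV": d / (c + d),
            "accuracy": (a + d) / n,
            "LR+": sens / (1 - spec),
            "LR-": (1 - sens) / spec,
        }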
Estimating Sensitivity & Specificity

Table 12-3: ST elevation

                      Disease
    Test          +        -        Total
     +            6       13         19
     -           25       59         84
    Total        31       72        103

    Sensitivity     = 19.4%        PPV        = 31.6%
    Specificity     = 81.9%        NPV        = 70.2%
    False Positive  = 18.1%        Accuracy   = 63.1%
    False Negative  = 80.6%        Prevalence = 30.1%

Dawson gives this bottom line:

1. To rule out a disease, we want to be sure that a negative result is really negative; therefore, not very many false negatives should occur. A sensitive test is the best choice to obtain as few false negatives as possible if factors such as cost and risk are similar; that is, high sensitivity helps rule out disease if the test is negative. As a handy acronym, if we abbreviate sensitivity by SN, and use a sensitive test to rule OUT, we have SNOUT.

2. To find evidence of a disease, we want a positive result to indicate a high probability that the patient has the disease; that is, a positive test result should really indicate disease. We therefore want few false positives. The best method for achieving this is a highly specific test; that is, high specificity helps rule in disease if the test is positive. Again, if we abbreviate specificity by SP, and use a specific test to rule IN, we have SPIN.

3. To make accurate diagnoses, we must understand the role of the prior probability of disease. If the prior probability of disease is extremely small, a positive result does not mean very much and should be followed by a test that is highly specific. The usefulness of a negative result depends on the sensitivity of the test.
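As a check, the Table 12-3 figures can be reproduced with the diagnostic_indices sketch above (cell labels follow the 2x2 table on page 1):

    # Table 12-3: a = 6 (TP), b = 13 (FP), c = 25 (FN), d = 59 (TN).
    for name, value in diagnostic_indices(a=6, b=13, c=25, d=59).items():
        print(f"{name:>20s}: {value:.3f}")
    # sensitivity: 0.194, specificity: 0.819, PPV: 0.316,
    # NPV: 0.702, accuracy: 0.631, prevalence: 0.301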
Effect of Prevalence

Assume a prevalence of 20%:

                      Disease
    Test          +        -        Total
     +          156      264        420
     -           44      536        580
    Total       200      800       1000

    Sensitivity     = 78.0%        Assumed prevalence = 20.0%
    Specificity     = 67.0%        PPV                = 37.1%
    False Positive  = 33.0%        NPV                = 92.4%
    False Negative  = 22.0%        Accuracy           = 69.2%
    Likelihood ratio for positive test = 2.4
    Likelihood ratio for negative test = 0.3
    Odds ratio = 7.20

Now change the prevalence to 50%:

                      Disease
    Test          +        -        Total
     +          390      165        555
     -          110      335        445
    Total       500      500       1000

    Sensitivity     = 78.0%        Assumed prevalence = 50.0%
    Specificity     = 67.0%        PPV                = 70.3%
    False Positive  = 33.0%        NPV                = 75.3%
    False Negative  = 22.0%        Accuracy           = 72.5%
    Likelihood ratio for positive test = 2.4
    Likelihood ratio for negative test = 0.3
    Odds ratio = 7.20
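The prevalence dependence of PPV and NPV can be computed directly from the Bayes-form expressions on page 1, without building the full table; here is a short sketch (my own, not from the text):

    # PPV and NPV from prevalence, sensitivity, and specificity (page 1 formulas).
    def predictive_values(prev, sens, spec):
        ppv = (prev * sens) / (prev * sens + (1 - prev) * (1 - spec))
        npv = ((1 - prev) * spec) / ((1 - prev) * spec + prev * (1 - sens))
        return ppv, npv

    for prev in (0.20, 0.50):
        ppv, npv = predictive_values(prev, sens=0.78, spec=0.67)
        print(f"prevalence {prev:.0%}: PPV = {ppv:.1%}, NPV = {npv:.1%}")
    # prevalence 20%: PPV = 37.1%, NPV = 92.4%
    # prevalence 50%: PPV = 70.3%, NPV = 75.3%

Note that sensitivity and specificity stay fixed while PPV rises and NPV falls as prevalence increases, exactly as in the two tables above.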
Effect of Accuracy

A more accurate test, 20% prevalence:

                      Disease
    Test          +        -        Total
     +          190       40        230
     -           10      760        770
    Total       200      800       1000

    Sensitivity     = 95.0%        Assumed prevalence = 20.0%
    Specificity     = 95.0%        PPV                = 82.6%
    False Positive  =  5.0%        NPV                = 98.7%
    False Negative  =  5.0%        Accuracy           = 95.0%
    Likelihood ratio for positive test = 19.0
    Likelihood ratio for negative test = 0.1
    Odds ratio = 361.00

The same more accurate test, 5% prevalence:

                      Disease
    Test          +        -        Total
     +          47.5     47.5        95
     -           2.5    902.5       905
    Total        50      950       1000

    Sensitivity     = 95.0%        Assumed prevalence = 5.0%
    Specificity     = 95.0%        PPV                = 50.0%
    False Positive  =  5.0%        NPV                = 99.7%
    False Negative  =  5.0%        Accuracy           = 95.0%
    Likelihood ratio for positive test = 19.0
    Likelihood ratio for negative test = 0.1
    Odds ratio = 361.00
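One way to see why the same 95%/95% test yields PPV = 82.6% at 20% prevalence but only 50.0% at 5% prevalence is the likelihood-ratio form of Bayes' theorem, post-test odds = pre-test odds x LR. The text does not state this identity here, so the sketch below is an added illustration:

    # Post-test probability from pre-test odds and a likelihood ratio.
    def posttest_probability(prev, lr):
        posttest_odds = (prev / (1 - prev)) * lr
        return posttest_odds / (1 + posttest_odds)

    lr_pos = 0.95 / (1 - 0.95)                  # LR+ = 19 for the 95%/95% test
    print(posttest_probability(0.20, lr_pos))   # 0.826 = PPV at 20% prevalence
    print(posttest_probability(0.05, lr_pos))   # 0.500 = PPV at 5% prevalence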
Agreement

Reading Assignment
"Measuring agreement," in Chapter 5 of Dawson & Trapp, Basic & Clinical Biostatistics.

Objectives of this section
The objective is to understand how to describe agreement between two imperfect measures, to understand how chance can affect apparent agreement, and to interpret kappa.

Variability is pervasive
All measurements will vary depending upon:
- actual changes in the characteristic being measured;
- variation introduced by the examiner;
- variation of the measurement method.
In addition, measurements by one individual may be affected by the measurements obtained by another. The process of measuring may change the characteristic. Expectancy matters: what you expect to see influences what you see. Bias will occur unless each examiner is blinded to all other examiners' results.

Reliability
Reliability is reproducibility. Whether or not there is a true gold standard, the question is whether different measurements agree with each other. Reliability has nothing to do with agreement with what is true, only with how close two (error-prone) measures are.

Intrarater reliability is the reproducibility of measures by the same examiner (one examiner measures the same characteristic twice). Sometimes called within-examiner reliability.

Interrater reliability is the agreement of measures by different examiners (two examiners measure the same characteristic). Sometimes called between-examiner reliability.

Test-retest reliability is used in the context of questionnaires. It is like intrarater reliability, but since questionnaires often have multiple items (supposedly) getting at the same construct, we can also get a measure of internal consistency.

[Reliability is not quantified by things like sensitivity, specificity, false positive rate, and false negative rate; these all assume you know the true value.]

It is often of interest to assess how well different classification methods agree. The different classification methods may refer to multiple raters making a clinical diagnosis, to multiple software algorithms classifying digitized images, to the scores from different rating scales determining probable etiology, or to comparing any two methods that yield classifications of individuals. In these situations there is no true or known classification, so assessing reliability (repeatability, reproducibility, agreement) is of interest. This is in contrast to being interested in the validity of a classification scheme.

In the most typical case, where each of N subjects is classified into one of R categories by two classification methods, the observations may be summarized in an R x R contingency table whose rows describe classification by one method and whose columns describe classification by the other. If n_ij is the number of subjects classified into row category i and column category j, then one natural index of raw agreement is the proportion of subjects on which the two methods agree:

    p_o = (n_11 + n_22 + ... + n_RR) / N

The problem with p_o is that it reflects both chance agreement and agreement beyond chance. That it reflects chance agreement is easily seen in the following example: assume the prevalence in a
population of interest of characteristic A is 0.95. Further, assume that one rater uses information to classify subjects as A or not-A. Note that if the other rater simply diagnoses every patient as A, the two will agree with p_o = 0.95. Thus, a simple proportion-agreement score is insufficient to assess reliability.

The proportion of agreement expected by chance, p_e, is easily calculated from the marginal proportions of the two raters, exactly as in the chi-square test of independence. So, to calculate a chance-corrected index of agreement, Cohen [1] defined the kappa index:

    κ = (p_o - p_e) / (1 - p_e)

He describes this as the proportion of agreement after chance agreement is removed from consideration. Landis and Koch [1977, "The measurement of observer agreement for categorical data," Biometrics 33, 159-174] suggest that κ < 0.40 reflects poor agreement, 0.40 <= κ < 0.75 reflects fair to good agreement, and κ > 0.75 reflects excellent agreement (also see the table at the bottom of page 119).

Example: A study was done to compare different methods for packing filling material. Two methods were used to assess voids in the fillings. A portion of the report stated, "Across all of the assessments, the agreement between the two methods (radiograph and microscope) was good, with over 80% of the assessments in complete agreement (see the table below). The largest disagreement occurred where no voids were evident with the microscope but the radiograph indicated a void (n = 25 + 29 cases). There were also n = 18 cases where the radiograph indicated no void but the microscope indicated that more than half of the area was incomplete."

    Observed                         Microscope
    Radiograph         no voids   <50% incomplete   >50% incomplete    Total
    no voids             330            9                18             357
                        (66.0)        (1.8)             (3.6)          (71.4)
    <50% incomplete       25            5                 9              39
                         (5.0)        (1.0)             (1.8)           (7.8)
    >50% incomplete       29            8                67             104
                         (5.8)        (1.6)            (13.4)          (20.8)
    Total                384           22                94             500
    (Percent)           (76.8)        (4.4)            (18.8)         (100.0)

    Observed agreement = (330 + 5 + 67) / 500 = 402 / 500 = 0.804
The expected agreement, p_e, measures the amount of agreement we would expect by chance alone. The expected counts, calculated from the marginal totals exactly as in the chi-square test of independence, are below:

    Expected                         Microscope
    Radiograph         no voids   <50% incomplete   >50% incomplete    Total
    no voids           274.176        15.708            67.116          357
                        (54.8)        (3.1)             (13.4)         (71.4)
    <50% incomplete     29.952         1.716             7.332           39
                         (6.0)        (0.3)             (1.5)           (7.8)
    >50% incomplete     79.872         4.576            19.552          104
                        (16.0)        (0.9)             (3.9)          (20.8)
    Total                384           22                94             500
    (Percent)           (76.8)        (4.4)            (18.8)         (100.0)

    Expected agreement = (274.176 + 1.716 + 19.552) / 500 = 295.4 / 500 = 0.591

So, the chance-corrected measure of agreement is:

    Kappa = (0.804 - 0.591) / (1 - 0.591) = 0.213 / 0.409 = 0.521   (SE = 0.039)

Question: What is your conclusion now? Do the microscope method and the radiograph method agree in their assessment of filling voids?
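For readers who want to reproduce the arithmetic, here is a minimal Python sketch (my own, not from the text) that computes p_o, p_e, and kappa from the observed table; the chance-expected diagonal uses the row-total x column-total / N calculation shown above:

    # Cohen's kappa from the observed 3x3 radiograph-vs-microscope table.
    table = [
        [330,  9, 18],   # radiograph: no voids
        [ 25,  5,  9],   # radiograph: <50% incomplete
        [ 29,  8, 67],   # radiograph: >50% incomplete
    ]
    n = sum(sum(row) for row in table)              # 500
    row_totals = [sum(row) for row in table]        # 357, 39, 104
    col_totals = [sum(col) for col in zip(*table)]  # 384, 22, 94

    p_o = sum(table[i][i] for i in range(3)) / n    # 402/500 = 0.804
    p_e = sum(row_totals[i] * col_totals[i] for i in range(3)) / n**2  # 0.591
    kappa = (p_o - p_e) / (1 - p_e)                 # about 0.521
    print(f"p_o = {p_o:.3f}, p_e = {p_e:.3f}, kappa = {kappa:.3f}")

The result can then be checked against the Landis and Koch ranges given earlier.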