4 Diagnostic Tests and Measures of Agreement


Margaret McCoy

Diagnostic tests may be used for diagnosis of disease or for screening purposes. Some tests are more effective than others, so we need to be able to measure how useful a test is in a given set of circumstances. In practice, of course, we rarely know the true state of the individual, and hence we evaluate the test in comparison with some other, more accurate classification. To simplify terminology here, we shall assume that the reference procedure (often called the "gold standard") indicates the true status of the subject.

4.1 Sensitivity and specificity

To measure the effectiveness of a test, we need to consider two measures:

sensitivity (Se): the probability that the test is positive if the disease is present
specificity (Sp): the probability that the test is negative if the disease is absent

Sensitivity is a measure of how good a test is at correctly identifying those who have the condition. If the test is not sensitive to the condition of interest then we would observe many false negatives. Specificity is a measure of how good a test is at correctly identifying those who do not have the condition. If the test is not specific to the condition of interest then we would observe many false positives. Sometimes the false negative rate ($1 - \mathrm{Se}$) and the false positive rate ($1 - \mathrm{Sp}$) are given instead.

Setting aside wider issues, we can look at a simple measure of the efficiency of a screening test by comparing the prevalence in the whole population with the prevalence in the screen positive group. If the costs of the gold standard are high it may not be economically viable to apply the gold standard to the entire population, but it might be cost effective to apply it to a screen positive group.

4.1.1 Confidence Intervals

Sensitivity and specificity are both estimates, so we can find confidence intervals for them. They are both binomial proportions; however, since they are often close to 1, using the normal approximation may not always be appropriate.
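As a rough illustration of these estimates, the sketch below computes sensitivity, specificity, and the two prevalences from a 2x2 table, together with exact binomial intervals. The exact interval is taken here to be the Clopper-Pearson interval, found by bisection on the binomial distribution function; that choice, the function names, and the counts are assumptions made for illustration only.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); adequate for moderate n."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def exact_ci(x, n, alpha=0.05, tol=1e-10):
    """Clopper-Pearson (exact) interval for x successes out of n trials."""

    def solve(f, target):
        # f is decreasing in p on [0, 1]; bisect for the p with f(p) == target.
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower limit solves P(X >= x | p) = alpha / 2.
    lower = 0.0 if x == 0 else solve(lambda p: binom_cdf(x - 1, n, p), 1 - alpha / 2)
    # Upper limit solves P(X <= x | p) = alpha / 2.
    upper = 1.0 if x == n else solve(lambda p: binom_cdf(x, n, p), alpha / 2)
    return lower, upper


def summarise(tp, fn, fp, tn):
    """Summaries of a test against the gold standard (names are illustrative)."""
    n = tp + fn + fp + tn
    return {
        "Se": tp / (tp + fn),
        "Se_ci": exact_ci(tp, tp + fn),
        "Sp": tn / (tn + fp),
        "Sp_ci": exact_ci(tn, tn + fp),
        "prevalence": (tp + fn) / n,
        "prevalence_screen_positive": tp / (tp + fp),
    }


# Hypothetical counts: 50 diseased and 450 healthy subjects.
result = summarise(tp=45, fn=5, fp=30, tn=420)
for name, value in result.items():
    print(name, value)
```

With these made-up counts the prevalence rises from 0.1 in the whole sample to 0.6 in the screen positive group, which is the comparison described in Section 4.1.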
When a proportion is close to 1, exact binomial methods should be used instead of the normal approximation.

4.2 Tests on a continuous scale

When a test result is expressed on a continuous scale, as in most haematological and biochemical tests, it is often convenient to think in terms of a cut-off point (or frequently both upper and lower cut-off points) beyond which the result will be regarded as abnormal. This simplifies the test to a binary (positive/negative) result. We need to define a critical value C as the cut-off point beyond which an individual would be referred for further investigation.

Clearly the position of C is crucial. For example, suppose the cut-off point is such that most healthy individuals have values less than C and most diseased individuals have values greater than C; diseased individuals with values less than C are false negatives, missed by the test. Reducing C moves it closer to the mean of the healthy individuals and will reduce the number of false negatives. This improves the sensitivity of the test, but at the expense of its specificity (and hence increases the number of false positives). The converse happens if the cut-off point is moved the other way.

A receiver operating characteristic (ROC) curve can be useful to examine the trade-off between sensitivity and specificity. We choose a number of different cut-off points, calculate the sensitivity and specificity for each cut-off point and then plot sensitivity against the false positive rate ($1 - \mathrm{Sp}$). Tests with ROC curves which go furthest into the top left corner are usually best. The area under the ROC curve estimates the probability that a member of one population chosen at random will have a value greater than a member of the other population (similar to the Mann-Whitney U test). It can be useful in comparing different tests.

Table 4: Radiation of pain and diagnosis of gallstones

                               Gallstones   Not gallstones   Total
Pain radiates to shoulder
Pain radiates to other site
Pain does not radiate
Total

4.3 Positive and negative predictive value

Sensitivity and specificity give only part of the picture. In evaluating a test that might be used for screening purposes, we need a measure of the predictive power of either a positive or a negative test result. The predictive power of a positive test result, the positive predictive value (PPV), is the proportion of those with positive test results who turn out eventually to have the condition. The predictive value of a negative test result, the negative predictive value (NPV), is the proportion of those with negative test results who eventually turn out not to have the condition.

\[
\mathrm{PPV} = \frac{\mathrm{Se}\,p}{\mathrm{Se}\,p + (1 - \mathrm{Sp})(1 - p)}
\qquad
\mathrm{NPV} = \frac{\mathrm{Sp}\,(1 - p)}{\mathrm{Sp}\,(1 - p) + (1 - \mathrm{Se})\,p}
\]

where p is the prevalence.

4.3.1 Positive and negative evidence

When deciding between alternative diagnoses, different items of information contribute more or less weight of evidence for or against particular diagnoses. The presence of guarding in a patient with acute abdominal pain, for example, carries considerable weight in favour of the diagnosis of acute appendicitis.
The items of information that help to exclude a diagnosis may, however, be different from those that help to establish it. These considerations may be important when deciding which of several possible subsequent investigations are likely to be helpful.

4.4 Comparing Two Methods

There is a general class of problems relating to how one device which measures some continuous variable compares with a second device. The particular problem which occurs most frequently in medicine, and will be discussed here, is whether a (usually cheaper) device can satisfactorily substitute for a device which measures with no appreciable error; this is method comparison. An apparently slightly different problem is whether one method which measures with error can be substituted for another method which also measures with error. This has been dubbed the method conversion problem; it is quite different from the method comparison problem and will not be discussed here.

A common mistake is to calculate Pearson's correlation coefficient on the data, get a result which is very highly statistically significant and hence declare good agreement. However, this is not suitable because the null hypothesis, that the two measurement scales are unrelated, is not plausible; so showing the results were unlikely to occur by chance under the null hypothesis of no agreement is not useful. Rather we need a method which shows how much the results deviate from total agreement.

Plot the data as a scatterplot and add the line of equality (y = x). This will give a quick visualisation of the association between the data.

Perform a paired sample t-test on the data against the null hypothesis of no difference in the pairs of results. The mean difference is an estimate of the bias; a confidence interval will quantify the extent of the plausible bias, while the p value will show the weight of evidence in favour of a true difference existing.

Plot the difference between the methods against the average. This gives an estimate of the size of the bias against the true value. A 95% range, based on the mean and standard deviation of the difference (assuming normality), is often added to the plot; these lines are sometimes called limits of agreement. Pearson's correlation coefficient can be calculated on these data to test the null hypothesis that the difference and mean are unrelated; that is, that the size of bias is unrelated to the true value.
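The numerical side of these steps can be sketched as follows. This is a minimal illustration under stated assumptions, not a full analysis: the plots themselves are omitted, the 95% range uses the normal multiplier 1.96, the paired t-test is reported only as its test statistic (on n - 1 degrees of freedom), and the function names and measurements are made up.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def method_comparison(reference, new):
    """Bias, paired t statistic and limits of agreement for paired readings."""
    diffs = [b - a for a, b in zip(reference, new)]      # new minus reference
    avgs = [(a + b) / 2 for a, b in zip(reference, new)]  # x-axis of the plot
    n = len(diffs)
    bias = mean(diffs)
    sd = stdev(diffs)
    return {
        "bias": bias,
        "t_statistic": bias / (sd / sqrt(n)),
        "limits_of_agreement": (bias - 1.96 * sd, bias + 1.96 * sd),
        # Correlation of difference with average: does bias change with size?
        "r_diff_vs_mean": pearson_r(diffs, avgs),
    }


# Hypothetical paired measurements from a reference device and a new device.
reference = [10.0, 12.0, 14.0, 16.0, 18.0]
new = [10.5, 12.3, 14.6, 16.2, 18.9]

summary = method_comparison(reference, new)
for name, value in summary.items():
    print(name, value)
```

Plotting `diffs` against `avgs` with horizontal lines at the bias and at the two limits of agreement gives the difference-against-average plot described above.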
4.5 Measures of Agreement

Suppose two observers are asked to rate the same subjects for the presence or absence of a disease. Cohen's kappa coefficient can be used to assess the agreement between the two raters.

                      Rater 2
Rater 1      Present     Absent      Total
Present      $n_{11}$    $n_{10}$    $n_{1+}$
Absent       $n_{01}$    $n_{00}$    $n_{0+}$
Total        $n_{+1}$    $n_{+0}$    $n_{++}$

Define $I_o$ as the observed proportion of agreement and $I_e$ as the proportion of expected agreement due to chance:

\[
I_o = \frac{n_{11} + n_{00}}{n_{++}}
\qquad
I_e = \frac{n_{+1} n_{1+} + n_{+0} n_{0+}}{n_{++}^2}
\]

Then kappa, κ, is the excess agreement expressed as a fraction of the maximum possible excess:

\[
\kappa = \frac{I_o - I_e}{1 - I_e}
\]
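A short worked sketch of this calculation, using the cell labels from the table above (first index for Rater 1, second for Rater 2). The function name and the counts are hypothetical.

```python
def cohen_kappa(n11, n10, n01, n00):
    """Cohen's kappa for a 2x2 table of two raters' present/absent ratings."""
    n = n11 + n10 + n01 + n00
    n1p, n0p = n11 + n10, n01 + n00  # row totals (Rater 1)
    np1, np0 = n11 + n01, n10 + n00  # column totals (Rater 2)
    i_o = (n11 + n00) / n                      # observed agreement
    i_e = (np1 * n1p + np0 * n0p) / n**2       # agreement expected by chance
    return (i_o - i_e) / (1 - i_e)


# Hypothetical ratings of 100 subjects: I_o = 0.85 and I_e = 0.5,
# so kappa is approximately 0.7.
kappa = cohen_kappa(n11=40, n10=10, n01=5, n00=45)
print(round(kappa, 3))
```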

If there is complete agreement, κ = 1; if observed agreement is equal to chance, κ = 0; if observed agreement is greater than chance, κ > 0.

An important assumption underlying the use of the kappa coefficient is that errors associated with the two sets of ratings are independent. This requires the subjects to be independent and Rater 1's ratings to be independent of Rater 2's. The kappa coefficient, therefore, is not appropriate for a situation in which one observer is required to either confirm or disconfirm a known previous rating from another observer.

When margin totals are not the same we may use $I_{\max} - I_e$ as the denominator, where $I_{\max}$ is the maximum possible agreement, keeping the margins fixed. Another alternative is a weighted version of kappa, which attaches greater emphasis to large differences between ratings than to small differences.

4.6 Measurement Scales

4.6.1 Validity

A valid scale measures what it intends to measure. Validity can be judged in several ways: the scale should look as if it makes sense (face validity); all the items should be relevant and all aspects of the concept being measured should be included (content validity); the scale should be able to predict outcome (predictive validity); the scale should produce similar results to an established scale measuring the same concept and dissimilar results to one measuring a different concept (convergent and divergent validity); finally, a scale should be able to distinguish groups of patients who, a priori, are deemed to be different (discriminant validity).

4.6.2 Sensitivity and specificity

When a scale is used to categorise people, it should be capable of categorising them accurately. For example, it would be most useful to detect patients with previously unrecognised problems or those with problems that are amenable to intervention.
When screening, sensitivity may be more important than specificity; opportunities for clarifying the status of false positive patients will arise, but the false negative patient is lost to further scrutiny.

4.6.3 Reliability

A reliable scale produces results which can be replicated with different observers (inter-observer reliability), when repeated (test-retest reliability), when using different sources of information and when administered by different means. Simple correlation between repeat tests is not adequate for the assessment of reliability; it is more appropriate to analyse the differences between scores to see if they are larger than might be expected by chance.

4.6.4 Responsiveness to change

A scale should be capable of detecting change due to interventions or over time at all levels of the scale. Floor and ceiling effects present particular difficulties; a scale may not be able to detect meaningful differences between subjects who score respectively at the bottom or the top of a scale.

4.6.5 Format and language

A scale should be well-designed, and in an appropriate format and language for the subjects and users of the scale, who may have differing knowledge and skills.
