Chapter III

METHODS FOR DETECTING CERVICAL CANCER

3.1 INTRODUCTION

The successful detection of cervical cancer in a variety of tissues has been reported by many researchers, and baseline figures for the sensitivity and specificity of their methods can be derived from their publications. In order to establish whether our laboratory could produce results similar to those reported in the literature, a series of pilot studies was undertaken. Several practical considerations are common to all hands-on cytological studies. This chapter discusses these practical issues and explains the reasoning behind the implementation decisions we made. The following chapter then describes the pilot studies and discusses the implications of their results in the context of previous findings in this field.

3.2 REPORTING DIAGNOSTIC TEST RESULTS

The results of diagnostic tests are often reported as a single figure, usually the percentage of cases classified correctly. This approach has the advantage of being simple to comprehend, and it makes comparisons between results easy to perform; but this single figure is influenced by a multitude of underlying factors, of which the most obvious is the cutoff threshold used for classification. A diagnostic test returns a continuously distributed measurement, or score, for each case. In order to actually classify the case, a threshold value of the score must be established, above (or below) which a case is considered
positive. The error proportion of the test depends upon the classification threshold chosen, and can be adjusted by moving this threshold.

Figure 3.1 Dependence of summary measures of classifier performance on classification threshold selected

To fully assess the performance of a diagnostic test it is necessary to understand its performance over a range of threshold values. Undoubtedly the most widely used technique for this purpose is Receiver Operating Characteristic (ROC) curve analysis (Henderson, 1993; Dwyer, 1996). ROC curves were developed for use in signal detection in radar returns in the 1950s. Swets (1986), in an excellent overview article, mentions that they were invented by Theodore G. Birdsall, of the Electrical Engineering Department of the University of Michigan, who taught the technique to him. The use of ROC curves has since been generalized to many problem domains, and is particularly widespread in medical decision making; Swets (1988) mentions at least 100 studies in the field of medical imaging which use ROC curve analysis. Example ROC curves are shown in Figure 3.2.
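An ROC curve of this kind can be traced empirically by sweeping the classification threshold across the observed scores and recording the true positive and false positive proportions at each cut-off. The following Python sketch illustrates the idea; the function names and toy data are our own invention, not code from any of the studies cited.

```python
def roc_points(scores, labels):
    """Empirical ROC curve: sweep the decision threshold over every
    observed score and record the (FPR, TPR) pair at each cut-off.
    labels: 1 = diseased (positive), 0 = disease-free (negative)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing called positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the empirical curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A classifier that ranks every diseased case above every disease-free case traces a curve through the top left corner and yields an area of 1.0; a classifier with no discriminating power follows the diagonal and yields an area near 0.5.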
3.3 SCREENING TESTS

Figure 3.2 Examples of ROC curves

There are three possible purposes for a clinical diagnostic test: discovery, confirmation and exclusion. A discovery test aims to detect the presence of a disease in a population; screening tests such as Pap smear screening are discovery tests. Confirmation tests are used to confirm the presence of disease in an individual with other symptoms, and exclusion tests allow the presence of disease in an individual to be ruled out. The three types of test have different purposes, and a different decision threshold may be appropriate for each purpose. A rule-in threshold is the threshold used to confirm the presence of disease, while a rule-out threshold is used to exclude it. The two thresholds need not be the same.

The performance of a diagnostic test is assessed in comparison with a gold standard indicator for the disease in question. Henderson (1993) defines a gold standard as "a test constituting definitive diagnostic evidence [which] uniquely defines the disease in the presence of specific symptomatology". For cervical neoplasia, the gold standard is biopsy of the lesion; however, in many studies biopsy is not available for all patients. Individuals diagnosed as normal are not usually biopsied, and therefore may include missed positives, while some women diagnosed as having neoplasia refuse treatment for personal reasons, and therefore cannot be biopsy-confirmed. A surrogate test, such as examination of a Pap smear by several cytologists, must be used instead in these cases. The possibility of errors in the gold standard must be kept in mind when assessing test performance.

With respect to the gold standard diagnosis, a given test may provide one of four outcomes (Table 3.1). A positive test result in an individual with the disease is known as a true positive (TP); a positive result in a disease-free individual is a false positive (FP); a negative result in a disease-free individual is a true negative (TN); and a negative result in a patient with the disease is a false negative (FN). A table such as Table 3.1 is often referred to as the confusion matrix for a classifier.

Table 3.1 The table of classification outcomes

True Diagnosis   Classified Positive        Classified Negative        Total
Positive         True Positive (TP)         False Negative (FN)        Positive Population (PP)
Negative         False Positive (FP)        True Negative (TN)         Negative Population (NP)
Total            Classified Positive (CP)   Classified Negative (CN)   Total Population
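The bookkeeping behind Table 3.1 can be made concrete with a short Python sketch; the function names and the counts used below are illustrative only, not part of any published method.

```python
def confusion_matrix(predictions, truths):
    """Tally the four outcomes of Table 3.1.
    predictions, truths: 1 = positive, 0 = negative."""
    tp = sum(p == 1 and t == 1 for p, t in zip(predictions, truths))
    fp = sum(p == 1 and t == 0 for p, t in zip(predictions, truths))
    tn = sum(p == 0 and t == 0 for p, t in zip(predictions, truths))
    fn = sum(p == 0 and t == 1 for p, t in zip(predictions, truths))
    return tp, fp, tn, fn

def summary_measures(tp, fp, tn, fn):
    """Standard summary measures derived from the confusion matrix."""
    return {
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # TP / positive population
        "specificity": tn / (tn + fp),   # TN / negative population
        "ppv":         tp / (tp + fp),   # positive predictive value
    }
```

For example, a hypothetical test yielding 8 true positives, 2 false positives, 88 true negatives and 2 false negatives has an accuracy of 0.96 but a sensitivity of only 0.80, illustrating why a single summary figure can mislead.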
From these outcomes, several related measures of test performance can be calculated (Bradley, 1997):

Accuracy = (1 - Error) = (TP + TN) / (PP + NP) = Pr(C), the probability of a correct classification.

Sensitivity = TP / (TP + FN) = TP / PP = Pr(disease detected | disease present), the ability of the test to detect disease in a population of diseased individuals.

Specificity = TN / (TN + FP) = TN / NP = Pr(negative result | no disease present), the ability of the test to correctly rule out the disease in a disease-free population.

3.4 PREDICTIVE VALUE OF A TEST

Measures of sensitivity and specificity do not indicate how a test will perform in clinical practice. To do this, it is necessary to calculate the predictive values of that test. The positive predictive value (PPV) of the test is the proportion of all positive tests which correctly indicate the presence of disease; that is

PPV = TP / (TP + FP) = Pr(disease present | positive test)

3.5 RECEIVER OPERATING CHARACTERISTIC CURVE

An ROC curve is constructed by first classifying the data set of interest. The classification result can be a single real number or an ordinal number, permitting a sensible ranking of cases (Dwyer, 1996; van Erkel & Pattynama, 1998). The true positive and false positive rates will depend upon the classification threshold chosen; to plot a ROC curve this threshold is varied over all possible output values and the true positive proportion is plotted against the false positive proportion for each threshold. The resulting curve will follow the diagonal from (0,0) to (1,1) if the classifier has no power (area
under the curve is 0.5), and will hug the top left corner of the plot for a perfect classifier (area under the curve is 1.0). The prevalence of the disease, the proportion of individuals with the disease in a given population at a specified point in time, will affect these values. The predictive value of a given test decreases with decreasing prevalence of the disease, so that even a relatively good test will perform poorly for a disease of very low prevalence.

3.6 ROC CURVE ANALYSIS

ROC curves are usually plotted assuming a binormal distribution of the data; that is, it is assumed that data from both the diseased and the non-diseased groups are distributed normally, and the mean and variance of each distribution are estimated separately (Swets, 1986; Metz, Herman & Roe, 1998). These distributions are then used to find the area under the curve. A number of nonparametric methods have also been developed for finding the area under the curve, to handle data which is highly non-Gaussian, although Hajian-Tilaki, Hanley, Joseph & Collet (1997) demonstrate that both parametric and non-parametric methods yield very similar estimates of the area under the curve for the same data, and these authors conclude that for a wide range of distributions ROC curves and their associated methodology are relatively robust. For statistical details of these approaches see Swets (1986).

The binormal assumption is widely used because much of the original work on ROC curves was performed by humans ranking images (radar or radiological). A human can only handle a limited number of categories; five or seven are the numbers frequently cited in the literature (Hanley & McNeil, 1982; Swets, 1986). The area under a ROC curve empirically plotted from five categories can be estimated using the trapezoidal rule, but this approach is
known to underestimate the area. The binormal assumption allowed a smooth curve to be drawn, and the area underneath it to be calculated accurately (Hanley & McNeil, 1982). Using more recent computer techniques and a continuous decision variable, the binormal assumption is unnecessary and an empirical approach can be taken.

3.7 MEASURES FROM ROC CURVES

The area under the ROC curve is commonly used as a measure of the overall performance of the classifier. Bamber (1975) demonstrated that the area under the ROC curve is equivalent to the value of the non-parametric Mann-Whitney U statistic, which in turn is equivalent to the Wilcoxon statistic. The Wilcoxon statistic is also a nonparametric statistic, usually used to test the hypothesis that the distribution of a variable, x, from one population, p, is equal to that from a second population, n. If the null hypothesis, H0: xp = xn, is rejected, the probability, p, can be calculated such that either xp > xn, xp < xn or xp ≠ xn. The Wilcoxon test makes no assumptions about the distributions of the underlying variables. The area under the ROC curve effectively measures P(xp > xn), and so represents the probability that a randomly chosen positive example from a data set of interest is correctly ranked with respect to a randomly chosen negative example (Hanley & McNeil, 1982).

In order to compare classifiers, it is necessary to estimate the standard error of the area under the curve, SE(AUC). The method for doing this varies with the method used to estimate the AUC, but one easily applied method, which is applicable to an empirically derived curve, is to use the standard error of the Wilcoxon statistic, SE(W):
SE(W) = sqrt{ [θ(1 - θ) + (Cp - 1)(Q1 - θ²) + (Cn - 1)(Q2 - θ²)] / (Cp Cn) }    3.1

where θ is the area under the curve, Cp and Cn are the number of positive and negative examples respectively, and

Q1 = θ / (2 - θ)    3.2

Q2 = 2θ² / (1 + θ)    3.3

Q1 is the probability that two randomly chosen abnormal images will both be ranked with greater suspicion than a randomly chosen normal image, and Q2 is the probability that one randomly chosen abnormal image will be ranked with greater suspicion than two randomly chosen normal images (Hanley & McNeil, 1982; Henderson, 1993). SE(W) decreases as the number of samples on which it is estimated, N, increases, approximately in proportion to 1/√N. SE(W) also decreases as the area under the curve increases, approaching 0 as the area under the curve approaches 1. To overcome these problems, Dwyer (1996) has suggested analysis of the area of that portion of the ROC curve which corresponds to a desirable range of false positive values. This allows the comparison of only those parts of the curve which are in the area of clinical interest.

Another use of the ROC curve is to select a single threshold for optimum discrimination. Intuitively, this point is the point closest to the upper left corner of the graph; however, cost/benefit considerations may alter this optimum. Van Erkel &
Pattynama (1998) recommend combining ROC analysis with formal cost-effectiveness analysis in order to determine the optimal threshold. Returning to the use of a single point would appear to invalidate the use of ROC curves, since the same point could potentially be selected on the basis of the sensitivity/specificity values on which the curve is based. This is not, in fact, the case. A single pair of values may represent a point where sensitivity may be increased with little loss of specificity and vice versa, or it may not. Inspection of the ROC curve allows clarification of such issues (Bradley, 1997).

3.8 SLOPE OF THE ROC CURVE

The area under the ROC curve provides an overall measure of the behaviour of the classifier, but does not measure performance at specific points. The diagnostic utility of a particular point on the ROC curve can be calculated using the odds-likelihood ratio form of Bayes' Rule (Henderson, 1993). In its simplest form, Bayes' Rule is usually written as

Pr(D|R) = Pr(R|D) Pr(D) / Pr(R)    3.4

where Pr(D|R) is the probability of disease (D) given the test result (R), Pr(R|D) is the probability of the result given that the disease is present, Pr(D) is the probability of disease (D) in the population, and Pr(R) is the probability of the test result in the population as a whole. This equation can also be expressed in terms of the probability of non-disease (D̄), and the two expressions combined as a ratio:

Pr(D|R) / Pr(D̄|R) = [Pr(R|D) / Pr(R|D̄)] × [Pr(D) / Pr(D̄)]    3.5
Or:

(posterior odds of disease) = (positive likelihood ratio) × (prior odds of disease)

The positive likelihood ratio (LR+) is the ratio between the probability of a positive test result given the presence of a disease and the probability of the same test result given the absence of a disease (Choi, 1998). The negative likelihood ratio (LR-) can be defined analogously. These likelihood ratios can be calculated from the slope of the ROC curve. For a continuous test, the tangent to the curve at the point representing the threshold of interest corresponds to the likelihood ratio for the single test value at that point (Henderson, 1993; Dwyer, 1996; Choi, 1998). For a dichotomous test, the likelihood ratio of a positive test result is given by

LR+ = TPR / FPR    3.6

where TPR is the True Positive Proportion and FPR is the False Positive Proportion, both of which lie between 0 and 1. This likelihood ratio is equivalent to the tangent of the angle formed at (0,0) between the horizontal axis and the line from (0,0) to that point. The likelihood ratio of a negative test result is given by

LR- = FNR / TNR    3.7

which is equivalent to the slope of a line from that point to (1,1) (Choi, 1998). Before performing any tests at all, the prior probability of a patient having the disease is equal to the prevalence of the disease in the population of interest (Moons et al., 1997). The posterior odds required for treatment depend upon clinical issues such as the cost/benefit tradeoff discussed above, and will vary greatly from test to test and disease to disease. Further discussion of these issues can be found in Moons et al. (1997).
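The likelihood ratio calculations above can be combined into a small worked example. The sketch below is illustrative only: the function names are ours, and the test characteristics (sensitivity 0.90, specificity 0.80) and the 1% prevalence are invented numbers, not taken from any study cited here.

```python
def likelihood_ratios(tpr, fpr):
    """LR+ = TPR / FPR and LR- = FNR / TNR,
    with all rates expressed as proportions in [0, 1]."""
    return tpr / fpr, (1.0 - tpr) / (1.0 - fpr)

def posterior_probability(prevalence, lr):
    """Odds-likelihood form of Bayes' Rule:
    posterior odds = likelihood ratio x prior odds."""
    prior_odds = prevalence / (1.0 - prevalence)
    posterior_odds = lr * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

# Hypothetical screening test: sensitivity 0.90, specificity 0.80
lr_pos, lr_neg = likelihood_ratios(tpr=0.90, fpr=0.20)
p_disease = posterior_probability(prevalence=0.01, lr=lr_pos)
```

Here LR+ is 4.5, yet the posterior probability of disease after a positive test is only about 4.3%, a concrete illustration of the point made earlier that predictive value collapses at very low prevalence.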
3.9 DISCUSSION OF ROC CURVE ANALYSIS

ROC curve analysis has rarely been used in the investigation of cervical cancer. Early studies tended to be subjective and qualitative, and did not yield the type of data suitable for ROC curve generation. The first workers to use ROC analysis in this context were Burger et al. (1981), who present ROC curves for their cervical cell classifiers, but do not extract any summary measures, such as the area under the curve. Haroske et al. (1990) also take this approach in demonstrating the performance of their hierarchical classifier, as do Garner et al. (1994b), who merely mention that "the closer the curves are to the axes, the better the system performance" (Garner et al., 1994b, p. 8). The same is true for Palcic & MacAulay (1994). Payne et al. (1997) do, however, present a ROC curve, together with a discussion of the implications of such a curve for the underlying classifier, and of the sensitivity, specificity and positive predictive value of the classifier at selected operating points. Given the amount of work which has gone into the development of ROC curve analysis for the characterization, optimization and comparison of classifiers, it would appear that these techniques have to date been underutilized by cervical cancer researchers.
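Finally, the standard-error formulas of Section 3.7 (equations 3.1 to 3.3) are straightforward to implement when comparing classifiers of the kind reviewed above. The sketch below follows Hanley & McNeil (1982); it is our own illustration, not code from any of the systems discussed.

```python
import math

def se_auc(theta, n_pos, n_neg):
    """Hanley-McNeil standard error of the area under the ROC curve.
    theta: area under the curve; n_pos, n_neg: numbers of positive
    and negative examples (Cp and Cn in equation 3.1)."""
    q1 = theta / (2.0 - theta)             # equation 3.2
    q2 = 2.0 * theta ** 2 / (1.0 + theta)  # equation 3.3
    variance = (theta * (1.0 - theta)
                + (n_pos - 1) * (q1 - theta ** 2)
                + (n_neg - 1) * (q2 - theta ** 2)) / (n_pos * n_neg)
    return math.sqrt(variance)
```

The behaviour noted in Section 3.7 falls out directly: the standard error shrinks as the sample sizes grow, and vanishes as the area under the curve approaches 1.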