Classification in Karyometry: Performance Testing and Prediction Error
Analytical and Quantitative Cytopathology and Histopathology, Tutorial Article


ANALYTICAL AND QUANTITATIVE CYTOPATHOLOGY AND HISTOPATHOLOGY
An Official Periodical of The International Academy of Cytology and the Italian Group of Uropathology

TUTORIAL ARTICLE

Classification in Karyometry: Performance Testing and Prediction Error

Peter H. Bartels, Ph.D., and Hubert G. Bartels, M.S.I.E.

Classification plays a central role in quantitative histopathology. Success is expressed in terms of the accuracy of prediction for the classification of future data points and an estimate of the prediction error. The prediction error is affected by the chosen procedure, e.g., the use of a training set of data points, a validation set, an independent test set, the sample size and the learning curve of the classification algorithm. For small samples, procedures such as the jackknife, the leave-one-out and the bootstrap are recommended in order to arrive at an unbiased estimate of the true prediction error. All of these procedures rest on the assumption that the data set used to derive a classification rule is representative of the diagnostic categories involved. It is this assumption that, in quantitative histopathology, has to be carefully verified before a clinically generally valid classification procedure can be claimed. (Anal Quant Cytopathol Histopathol 2013;35:181-188)

Keywords: classification, histopathology, karyometry.

From the College of Optical Sciences and Arizona Cancer Center, University of Arizona, Tucson, Arizona, U.S.A. Dr. P. Bartels is Professor Emeritus. Mr. H. Bartels is Applications Programmer, Senior. This work was supported in part by grant PO1 CA from the National Institutes of Health, Bethesda, Maryland, and a gift from Michael Lewis, Los Angeles, California. Address correspondence to: Peter H. Bartels, Ph.D., Arizona Cancer Center, University of Arizona, 1515 North Campbell Avenue, P.O. Box, Tucson, Arizona, U.S.A. (hubertbartels@msn.com). Financial Disclosure: The authors have no connection to any companies or products mentioned in this article.

Classification of nuclei, lesions, and patients plays a central role in quantitative histopathologic studies. There is a rich literature on classification procedures, on the training of classification algorithms, and on the testing of their performance. There is the comprehensive collection of seminal studies in the engineering field edited by Agrawala.1 There are the classical texts by Fukunaga2 and by Duda and Hart.3 There have been extensive studies of the behavior of classification algorithms. Computer simulations and Monte Carlo studies have led to a thorough understanding of the underlying processes. Most authors in the field agree, though, that there is no general theory guiding method development. The existing procedures and recommendations are essentially based on heuristics. Much of the literature, particularly on error estimation, requires a rather advanced background in mathematics and statistics for a reader to appreciate the recommendations for practical applications.4-6 However, the correct practical use of classification algorithms offered in software packages is fairly straightforward.7,8

Karyometry presents some particular challenges to the development, evaluation and application of classification procedures. In many instances the validity of certain assumptions underlying classification procedures is questionable: clinical materials by their very nature rarely offer entirely homogeneous populations. The idea of a cohort of patients, even when matched for certain anamnestic variables, as offering samples from a single stochastic source, in the jargon of the pertinent literature, is at best an approximation. It is difficult to state with confidence a priori at what size a truly representative sample for a diagnostic category has been attained. An analysis of nuclear populations usually involves thousands of nuclei. Here again, though, the presence of subpopulations of different phenotypes, with often subtle differences in karyometric characteristics, raises questions concerning the homogeneity of the clinical samples.

There are 3 assumptions underlying any discussion of classifier development and performance: first, that the sample used for the training of a classification algorithm is truly representative of its class. Second, the data points, i.e., nuclei, lesions or patients, are assumed to be true random samples. The researcher assembling the data sets must under no circumstances exert any judgment, or preselect nuclei or patients. Third, it is assumed that the data for each category are drawn from a single distribution, as mentioned above, from a single stochastic source. This assumption implies that the population from which the objects are taken is homogeneous.

For the elements of the training or test sets to be random samples from a homogeneous population, though, is by itself not yet enough. The elements in a training set should also be fully representative of the population in general. The literature calls this the requirement that they fully cover the problem space. This is a condition that in karyometry is often hard to attain or even to verify a priori. Given the notable variability in any biologic entity, one might have to assemble a data set of considerable size in order to make it fairly representative. Unless one has a fully representative data set for the classifier, the system is not really trained to finality and the true prediction error is hard to approximate. Exactly what sample size is required to achieve full representation depends on the task at hand. Practical experience suggests that equality of the apparent error from the training set and the prediction error from the test set indicates that full representation has been reached. It is not unusual in karyometry to find this to be true at sample sizes of a few hundred nuclei.

It is the objective of this article to provide guidance for the practical use of classification procedures and to explain the underlying rationale.

Basic Concepts

The basic process begins with the assembly of 2 data sets, representing the diagnostic categories to be distinguished by a decision rule. A search is conducted for characteristics, or features, which differ in value between the 2 diagnostic categories. The set of selected features is called a feature vector. Each feature vector in karyometry represents a nucleus, a lesion, or a patient. In the following these shall be referred to as objects or as data points. The feature vectors for the 2 samples to be discriminated are submitted to a classification algorithm.
The algorithm derives a decision rule. That rule typically is a linear combination of feature values. Computed for a single data point, it results in a score. The score value is compared to a threshold: if it exceeds the threshold, the data point is assigned to one diagnostic category; if it is less than the threshold value, it is assigned to the other diagnostic category. The decision rule is applied to the data sets, and the proportion of correctly assigned objects is determined. The result is presented as a classification matrix. The rows represent the true diagnostic category for data points of known label. The columns present the assignments made by the classification algorithm. Table I shows an example. There are objects assigned to their correct category and there are objects that have been misclassified, i.e., assigned to the incorrect category. The correct recognition rate, or overall accuracy (the number of correctly assigned objects divided by the total sample size), here would be 81.4%. The estimated classification error would be 18.6%. In this example the distinguishing features did not completely separate objects from the two diagnostic categories.

Table I  Classification Matrix for Nuclei

True diagnostic category    Assigned Class A    Assigned Class B    Sample size
Class A                     (84%)               62 (16%)
Class B                     (23%)               185 (77%)
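As a minimal illustration of the procedure just described, the following Python sketch computes a linear score for each object, assigns each object to a category by comparing the score to a threshold, and tabulates the resulting classification matrix and overall accuracy. The data, feature weights and threshold are made-up placeholders, not values from this article:

    import numpy as np

    # Hypothetical data: two samples of feature vectors (rows = objects, columns =
    # karyometric features); values, weights and threshold are illustrative only.
    rng = np.random.default_rng(0)
    class_a = rng.normal(0.0, 1.0, size=(50, 3))   # objects of known Class A
    class_b = rng.normal(1.0, 1.0, size=(50, 3))   # objects of known Class B

    weights = np.array([0.8, 0.5, 0.3])            # linear combination of feature values
    threshold = 1.0                                # assumed decision threshold

    def assign(objects):
        # Compute the score for each object and assign Class B if it exceeds the threshold.
        return np.where(objects @ weights > threshold, "B", "A")

    # Classification matrix: rows = true category, columns = assignment by the rule.
    matrix = {(true, pred): int(np.sum(assign(objs) == pred))
              for true, objs in (("A", class_a), ("B", class_b))
              for pred in ("A", "B")}
    correct = matrix[("A", "A")] + matrix[("B", "B")]
    total = len(class_a) + len(class_b)
    print(matrix)
    print("overall accuracy: %.1f%%" % (100.0 * correct / total))
    print("estimated classification error: %.1f%%" % (100.0 - 100.0 * correct / total))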

The misclassified objects are referred to as classification errors. At this point one might add a feature, or delete a feature that does not carry a notable weight. Then one would run the classification algorithm again to see whether a better distinction, with a lower error rate, could be attained. This brings us to an important concept in classification methodology: the estimation of an error rate.

Classifier Performance: Estimation of Error Rates

In the basic procedure shown above, objects used in the development of the decision rule were also involved in the estimation of the decision rule's performance. However, the rule may have been fitted specifically to the data set from which it was derived. The result might be optimistic. It could not be expected to be as favorable when the rule is applied to new, independent objects not involved in the rule's derivation. The result of the procedure may, in principle, be biased. This error rate is known as the apparent error rate (E_app). The bias resulting in an optimistic outcome causes the error rate to be lower than the rate that would be expected in the application of the rule to unknown, new objects from the same diagnostic categories.8 The rate at which the decision rule would classify any new, independent objects is called the generalization error rate, or the true prediction error rate (E_true). The reason for the bias in the apparent error rate is that the samples used to derive the result may not have been fully representative of their categories. If they had been, the application to new, independent objects would yield the same misclassification rate as for the original data sets. This, however, is rarely the case for biologic materials. The conventional wisdom, according to which the procedure leads to bias, is generally accepted. Using the data sets from the formulation of the decision rule to estimate the classifier error rate is known as resubstitution, and the classification error E_app as the resubstitution or reclassification error.

The Training Set/Test Set Procedure

A common method to avoid the resubstitution error and to obtain an unbiased estimate of the true prediction error is the training set/test set procedure. The original data sets are partitioned. Often, 50% of the objects from category A and 50% of the objects from category B are used to derive a decision rule, as a training set. The other 50% of each category are used as an independent test set. The decision rule is then applied to the test set. The classification result from the test set is free of bias. The recommendation is to report only the result from the test set.

In karyometry the clinical materials representing diagnostic categories usually are a set of nuclei from each case and a set of cases from each diagnostic category. Typical values would be 100 nuclei per case, and from 10 to 50 cases per diagnostic category. This would allow training sets of a minimum of 500 nuclei from 5 cases or 2,500 nuclei from 25 cases. The partition into training and test sets should be made at the case level. One should not randomly select, e.g., every other nucleus to be assigned to the training or test set, as this would not result in an independent sample for the test set. The results are now presented by two classification matrices: one for the training set, and one for the test set. It is to be expected that the overall accuracy 1 - E_test is somewhat lower than that from the training set.
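A brief sketch of the case-level partitioning recommended above, assuming hypothetical case identifiers and a 50/50 split; the identifiers and the split fraction are illustrative only:

    import random

    rng = random.Random(1)

    # Hypothetical study: 20 cases per diagnostic category, 100 nuclei per case.
    cases_a = ["A%02d" % i for i in range(20)]
    cases_b = ["B%02d" % i for i in range(20)]

    def split_cases(case_ids, train_fraction=0.5):
        # Assign whole cases (never individual nuclei) to the training or the test set,
        # so that the test set remains independent of the training set.
        ids = list(case_ids)
        rng.shuffle(ids)
        n_train = int(round(train_fraction * len(ids)))
        return ids[:n_train], ids[n_train:]

    train_a, test_a = split_cases(cases_a)
    train_b, test_b = split_cases(cases_b)
    # All 100 nuclei of a case follow their case into the training or the test set.
    print("training cases:", sorted(train_a + train_b))
    print("test cases:", sorted(test_a + test_b))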
Practical experience from the pattern recognition literature suggests a decrease of < 15%. If the classification error increases by more than that, one might want to reexamine the selected feature set.

It has been customary, finally, to apply the decision rule to the combined training set and test set, thus getting an estimate of the classification error on a larger sample size. This, of course, also involves resubstitution and may introduce bias and provide too optimistic a result. Any resubstitution has been criticized and its use discouraged. The categorical rejection of a classification procedure involving resubstitution is, though, not always justified. The resubstitution bias decreases with increasing sample size. The apparent error induced by the training increases and asymptotically approximates the true prediction error. The test set error decreases in a similar manner, approximating the true prediction error with increasing sample size, as reflected in the classifier's learning curve. This is seen in Figure 1. The relationship between the apparent prediction error and the generalized, true prediction error as a function of sample size is demonstrated by the learning curve of a classifier. The learning curve of a classifier usually takes the form of a power law function.9 The estimated apparent prediction error becomes monotonically less optimistic with increasing sample size. For large samples the training error becomes equal to the true prediction error because the samples for both diagnostic categories have become fully representative of their populations.

Figure 1  Apparent error and test set error of a classifier as a function of sample size.

For the test set error the opposite trend is true. It decreases with increasing sample size and asymptotically approximates the true prediction error. For large samples both the apparent error and the test set error leave only a negligible bias. The distinction between apparent error and true prediction error is dropped altogether for large samples in the so-called one-shot approach.10

Sample size thus plays an important role in assessing classification errors. The literature on classification methodology considers samples of 10,000 as very large, and samples in the range of 1,000s as intermediate. Samples of < 500 in size are generally considered small in the engineering literature. The question then becomes, what is a large sample in karyometry? The heuristic rule here, for a multivariate analysis, is that a sample comprising 10 times as many objects as there are variables is accepted as a large sample, for which resubstitution would not be optimistically biased.

The asymptotic approximation of the test set error to the true prediction error as a function of sample size is closely related to the learning curve of the classifier. The learning curve follows a power law and has the form

    E_test = E_true + C / n^x

The constant C and the exponent x are task specific. They reflect the dispersion of the test set data and how many data points would be needed to have a representative sample. With increasing sample size n the second term in the sum goes to zero and the true prediction error remains. The exponent x affects the sample size at which a certain difference from the true prediction error, say 1% or so, is reached. The value of x is slightly larger than 1.00, but it has a notable influence on the effective sample size: for n = 200 and x = 1, n^x = 200, but for x = 1.05, n^x = 260, and for x = 1.10, n^x = 339.

In karyometry the classification of nuclear populations practically always involves several hundred, and often thousands of, nuclei. The remaining resubstitution error then is very small. Assessing the overall efficacy of a chemopreventive agent on a treated and a control cohort, even in an exploratory study, may involve about 20 patients per diagnostic category, i.e., there would be 2,000 nuclei per diagnostic category, and typically from 4 to 8 variables. In the development of a criterion indicating risk for the development of an aggressive type of lesion, one could expect a larger number of patients, i.e., up to 10,000 nuclei recorded and evaluated. The decision rule typically involves from 3 to 8 variables at most. In both instances, the resubstitution error may be negligible.
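To make the influence of the exponent x concrete, the following short sketch evaluates the power-law learning curve for a few sample sizes; the values of E_true, C and x are assumed for illustration and are not fitted values from any study:

    # Power-law learning curve E_test(n) = E_true + C / n**x, with illustrative constants.
    E_true = 0.15    # assumed asymptotic (true) prediction error
    C = 4.0          # assumed task-specific constant
    for x in (1.00, 1.05, 1.10):
        for n in (50, 200, 1000, 5000):
            e_test = E_true + C / n**x
            print("x=%.2f  n=%5d  n**x=%7.0f  E_test=%.3f" % (x, n, n**x, e_test))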

The assessment of nuclei from a single case or a small number of cases invariably provides only small samples. This is certainly also so when classification involves nuclear subpopulations of different phenotype, as they occur in single cases. Attention then needs to be paid to possible bias. There is always some uncertainty as to what sample size would be representative of a diagnostic category. Weiss and Kulikowski7 point out that the sample size ensuring full representation may not be unreasonably high and may, in fact, sometimes be surprisingly small.

One knows the size of the test set. For any classifier the quality of an error estimate depends directly on the number of objects in the test set, and the accuracy of the estimate, on randomly drawn, independent test objects, follows a binomial distribution. This means we know not only the error rate estimated from the test set but also how far off it can be: the highest error rate to be expected is given by the confidence limit of the binomial distribution, and there is only a low percentage chance that the error rate is higher. Thus, for example, in a situation where an error rate of 32% had been estimated on a nuclear population of 2,000 nuclei, the true error rate is likely not higher than 32% + 1.04% = 33.04%. The standard error is defined as

    Standard Error = {E * (1 - E) / n}^1/2 = {0.32 * (1 - 0.32) / 2,000}^1/2 = {1.088 * 10^-4}^1/2 = 0.0104, or 1.04%

i.e., the sample size is quite adequate for an estimate of the true prediction error. For an estimate of the percentage of nuclei of a certain phenotype in a single case, with n = 100 and the same estimated error rate of 32%, the result would be 32% + 4.7% = 36.7%.

Even for samples of intermediate size the difference between apparent error and true prediction error may not be substantial, though. If the classification error from the training set matches the classification error from the test set, it is an indication that the decision rule has not been bent to fit the training set. It indicates that both the training set and the test set are fully representative of the diagnostic categories at hand and that the apparent error has become practically equal to the true prediction error.

In the classification of cases, sample sizes tend to be small. To obtain an unbiased estimate of the true prediction error, the partitioning of the data sets into 50% training and 50% test sets is common. The recommendations in the literature tend toward partitionings of 2/3 training set objects and 1/3 test set objects, or even 90% versus 10%. The reasoning is that this makes more information available for the definition of the decision rule.

Use of a Validation Set

The concern with optimistic bias in the apparent error is justified in the classification of cases. It has been extended to the test set used in the training set/test set procedure described above, where the result from the test set is generally accepted as unbiased. The argument here is, though, that the training of the system may involve observing the result obtained from the test set. Making adjustments to the decision rule therefore actually involves the test set in the process and impairs complete independence. In response, a procedure is recommended in which the data sets are partitioned into 3 components: a training set and a validation set for the development of the decision rule, and then application of that rule to a truly independent test set (Figure 2).4,6

Figure 2  Partitioning of data into a training set, a validation set and a test set.
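Before turning to small-sample methods, here is a short sketch of the standard-error calculation worked above, showing how the binomial standard error of a test set error estimate shrinks with the number of independent test objects; the 32% error rate and the two sample sizes are taken from the example in the text:

    import math

    def error_rate_standard_error(error_rate, n):
        # Binomial standard error of an error rate estimated from n independent,
        # randomly drawn test objects.
        return math.sqrt(error_rate * (1.0 - error_rate) / n)

    estimated_error = 0.32                      # error rate estimated from the test set
    for n in (2000, 100):                       # nuclear population vs. a single case
        se = error_rate_standard_error(estimated_error, n)
        print("n=%4d  standard error=%.4f  estimate + 1 SE = %.1f%%"
              % (n, se, 100.0 * (estimated_error + se)))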
Classification of Intermediate and/or Small Samples

The estimate of the generalization, or true prediction, error is a function of the sample size of the training set. In many studies one might expect that the size of a fully representative sample would have to be prohibitively large or might just not be available. The classifier, therefore, would have to be tested on a sample of smaller size for an estimate of the true prediction error. For the classification of medium and small size data sets, a number of methods are recommended. The training set/test set sequence, with a partitioning into just 2 data sets, is expanded. In the jackknife procedure the preferred choice is a 5-fold to 10-fold cross-validation.11 In the leave-one-out procedure a partitioning of a sample of n data points into training sets of size n - 1 is set up. The bootstrap method is recommended especially for small samples, which are resampled with replacement up to several hundred times, followed by the same number of training set/test set procedures.

The Jackknife Procedure

In this procedure the data sets are divided into a number of subsets. All but 1 are used to derive decision rules versus the left-out subset. Since there are several subsets, there are several decision rules and several estimates of an error rate. The true error rate is estimated as the average of them. Its reliability is ascertained by the standard deviation of this set of estimates. The partitioning is shown in Figure 3. For a sample of 300 objects and a partitioning into 5 subsets, the training set thus has 240 entries.

Figure 3  Jackknife procedure with a 5-fold cross-validation.

The risk that one encounters with any decrease in the size of the training set is that one may end up at a portion of the learning curve of the classifier well below the asymptotic approach to the true prediction error. This would result in an overestimate of the true prediction error, as shown in Figure 4. One needs to consider the trade-off between the number of partitions and the effect of working with a smaller sample size for the training set. For the example above, the 5-fold partition would result in a training set of 240 objects. This would provide an acceptable approximation to the true prediction error. But if for the same task one had only 80 samples to begin with, the training set would have only 64 objects. This may very well place the problem in a range of the learning curve where the slope still keeps it well below the accuracy given by the true prediction error.4 This can be seen when drawing a line vertically from the abscissa at the sample size of 64 to the learning curve. If one chose to employ a 10-fold cross-validation, this would further reduce the size of the effective training set, and it might lead to an overestimate of the true prediction error. Just how much the true prediction error would be overestimated depends not only on the available sample size but also on the slope of the learning curve. This situation may not become a problem in the processing of nuclear populations. However, when the data points represent cases, it is a very relevant consideration.

Figure 4  Overestimation of the true prediction error resulting from a sample size so small that the learning curve of the classifier is not yet approximating the true prediction error.

The Leave-One-Out Procedure

In this procedure, with a sample size of n, training is done on n - 1 data points versus the 1 data point left out. This process is then repeated until every data point has been left out once, i.e., one develops n classification rules. This is a very labor-intensive procedure. Its single advantage is that for a small sample the leave-one-out method is the only one to provide an unbiased estimate of the true prediction error. For this estimate one uses the average error over the n rules. There are n such estimates, so one obtains an estimate of the variance of the true prediction error as well. The procedure has a number of disadvantages. There is the need to develop n decision rules, and the finally resulting estimate of the true prediction error is based on an average over the n classifiers, so which rule does it refer to?
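A compact sketch of the k-fold (jackknife-style) partitioning described above, with leave-one-out obtained as the special case k = n. The one-dimensional data and the simple threshold rule are placeholders chosen only to make the example self-contained; they are not the authors' classifier:

    import random

    rng = random.Random(2)

    # Placeholder data: one feature value and a class label per object, 150 per category.
    data = [(rng.gauss(0.0, 1.0), "A") for _ in range(150)] + \
           [(rng.gauss(1.5, 1.0), "B") for _ in range(150)]

    def train_and_test(train, test):
        # Toy decision rule: threshold halfway between the two class means of the
        # training subset; returns the error rate on the left-out test subset.
        mean_a = sum(x for x, y in train if y == "A") / sum(1 for _, y in train if y == "A")
        mean_b = sum(x for x, y in train if y == "B") / sum(1 for _, y in train if y == "B")
        threshold = 0.5 * (mean_a + mean_b)
        return sum(1 for x, y in test if ("B" if x > threshold else "A") != y) / len(test)

    def cross_validated_error(objects, k):
        # Partition the shuffled data into k subsets; each subset is left out once
        # as the test set while the other k - 1 subsets form the training set.
        shuffled = list(objects)
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]
        errors = [train_and_test([obj for j, fold in enumerate(folds) if j != i for obj in fold],
                                 folds[i])
                  for i in range(k)]
        mean = sum(errors) / k
        sd = (sum((e - mean) ** 2 for e in errors) / (k - 1)) ** 0.5
        return mean, sd       # averaged error estimate and its standard deviation

    print("5-fold estimate (mean, sd):", cross_validated_error(data, k=5))
    print("leave-one-out estimate (mean, sd):", cross_validated_error(data, k=len(data)))

In an actual study the toy threshold rule would be replaced by the training of the classification algorithm on the k - 1 retained subsets and the application of its decision rule to the left-out subset.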
The Bootstrap Procedure

For very small samples, say of 30 objects or so, finding the best estimate of the prediction error may be difficult. Traditionally for such samples the leave-one-out method has been used. It is unbiased, but the variance of the prediction error estimate is quite high for small samples. In such small samples the variance has a dominating influence on the result. Thus, if one had a low variance procedure, for small samples even some bias might be accepted. The bootstrap method offers such a procedure. It was introduced in 1983 by Efron.12 Bootstrapping is a resampling method. If one has a sample of n cases, one resamples it by drawing n objects, with replacement. In sampling with replacement, an object may be drawn twice or even multiple times for a resample, while other objects are not drawn at all. Sampling theory shows that in such a procedure, on average, 63.2% of the original objects are drawn for a resample and 36.8% are not drawn; the objects not drawn are used as the test set. The resampling may be done a very large number of times, such as 200 times. The resamples are treated as independent data sets in the subsequent (200 or so) training set/test set procedures. The so-called .632 procedure results in a low variance estimate for the prediction error, but it has an optimistic bias.
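A sketch of the resampling step just described, again with placeholder data and a toy decision rule. It draws bootstrap resamples with replacement, tests each derived rule on the objects not drawn, and averages the resulting error estimates; it illustrates the mechanics only and is not the specific .632 estimator referred to above:

    import random

    rng = random.Random(3)

    # Placeholder small sample: 30 objects with one feature value and a class label each.
    sample = [(rng.gauss(0.0, 1.0), "A") for _ in range(15)] + \
             [(rng.gauss(1.5, 1.0), "B") for _ in range(15)]

    def error_rate(train, test):
        # Toy decision rule: threshold halfway between the training class means
        # (the max(1, ...) guards against a resample that misses one class entirely).
        n_a = max(1, sum(1 for _, y in train if y == "A"))
        n_b = max(1, sum(1 for _, y in train if y == "B"))
        mean_a = sum(x for x, y in train if y == "A") / n_a
        mean_b = sum(x for x, y in train if y == "B") / n_b
        threshold = 0.5 * (mean_a + mean_b)
        return sum(1 for x, y in test if ("B" if x > threshold else "A") != y) / len(test)

    n = len(sample)
    estimates = []
    for _ in range(200):                                    # 200 bootstrap resamples
        drawn = [rng.randrange(n) for _ in range(n)]        # draw n indices with replacement
        resample = [sample[i] for i in drawn]               # ~63.2% of distinct objects
        held_out = [sample[i] for i in range(n) if i not in set(drawn)]   # ~36.8% not drawn
        if held_out:                                        # objects not drawn form the test set
            estimates.append(error_rate(resample, held_out))

    print("bootstrap estimate of the prediction error: %.3f" % (sum(estimates) / len(estimates)))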

Conclusions

The engineering literature emphasizes that it is useful to have rules discriminating between objects from different classes, but that the real challenge is to have rules that allow an accurate prediction for new objects in the future. This is certainly true; generally valid prediction rules are more difficult to derive. In karyometry, though, even the ability to distinguish accurately between objects in 2 data sets plays an important role, e.g., in the assessment of grade or, in general, in an accurate quantitative assessment of a lesion. And it is by no means evident that even such a classification rule is simple and straightforward to derive. In karyometry, the resubstitution error might be the least of the problems to be worried about, but sample inhomogeneity and inadequate representation can pose big problems.

The literature on automated pattern recognition, machine vision and classification lists the representative sample as a prime requirement for the development of a classification rule. In technology applications this requirement is readily satisfied, but in karyometry it remains a major problem. In prospective karyometric studies in which material from one and the same institution is used, careful control of processing, i.e., of fixation, sectioning, and staining, is possible. When materials collected at different institutions are used, or even when prepared from archival material at the same institution, the assumption of having a representative sample may need to be examined. Karyometric characteristics generally do not have a clearly perceived visual appearance. Histopathologic preparations looking convincingly the same as others may, in their digital representation, be distinctly different. Consequently, one may well find agreement between the classification success from a training set and a test set for the clinical materials in a given study. But one may also find that the classification rule fails when applied to a set of histopathologic slides from a different institution, even when those were prepared according to a well-defined protocol. The differences may be subtle. Training on the new material may show the same karyometric features as effective, but the coefficients in the discriminant function may be a little different.

A representative sample for a classification algorithm therefore may not evolve until material subject to all small differences in preparation has been included in the training. A clinically generally valid classification rule must be expected to emerge from an iterative process.

The problem of a representative sample becomes particularly relevant when the original set of cases is small. One has to remember here that nuclei from a given diagnostic category, and especially from a given grade of a lesion, form not crisp but fuzzy sets. What one has as a representative sample is a small number of members of a fuzzy set. Classification methodologies for small samples have been developed to allow estimates of the prediction error that balance the variance of the estimate against its bias. Bootstrapping is a good example of this. A small sample is resampled with replacement, possibly hundreds of times, until a large number of these small data sets has been generated. They allow a precise estimate of the prediction error. The generated bootstrapped data sets have the exact stochastic properties of the original small sample. The derived classification rule is generally valid, but only for additional materials with the same stochastic properties as the original data set. Since the original material is a small sample of a fuzzy set, it is unlikely that the originally included cases represent the diagnostic category in its entirety. The problem of representation, therefore, is doubly relevant when the original material is but a small sample. Again, this is rarely a problem in technology applications, but in histopathologic materials it is to be taken into serious consideration.

References

1. Agrawala AK (editor): Machine Recognition of Patterns. New York, IEEE Press
2. Fukunaga K: Introduction to Statistical Pattern Recognition. New York, Academic Press
3. Duda R, Hart P: Pattern Classification and Scene Analysis. New York, Wiley
4. Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning. New York, Springer, 2001
5. Michie D, Spiegelhalter DJ, Taylor CC: Machine Learning, Neural and Statistical Classification. New York, Ellis Horwood, 1994
6. Schuermann J: Pattern Classification. New York, John Wiley, 1996
7. Weiss SM, Kulikowski CA: Computer Systems That Learn: Classification and Prediction Methods. San Mateo, California, Morgan Kaufmann, 1990
8. James M: Classification Algorithms. New York, Wiley, 1985
9. Duda RO, Hart PE, Stork DG: Pattern Classification. Second edition. New York, John Wiley, 2000
10. Henery RJ: Methods for comparison: Train and test. In Machine Learning, Neural and Statistical Classification. Edited by D Michie, DJ Spiegelhalter, CC Taylor. New York, Ellis Horwood, 1994
11. McLachlan GJ: Discriminant Analysis and Statistical Pattern Recognition. New York, John Wiley
12. Efron B: Estimating the error rate of a prediction rule: Some improvements on cross-validation. J Amer Statist Assoc 1983;78


More information

Chapter 21 Multilevel Propensity Score Methods for Estimating Causal Effects: A Latent Class Modeling Strategy

Chapter 21 Multilevel Propensity Score Methods for Estimating Causal Effects: A Latent Class Modeling Strategy Chapter 21 Multilevel Propensity Score Methods for Estimating Causal Effects: A Latent Class Modeling Strategy Jee-Seon Kim and Peter M. Steiner Abstract Despite their appeal, randomized experiments cannot

More information

Chapter 17 Sensitivity Analysis and Model Validation

Chapter 17 Sensitivity Analysis and Model Validation Chapter 17 Sensitivity Analysis and Model Validation Justin D. Salciccioli, Yves Crutain, Matthieu Komorowski and Dominic C. Marshall Learning Objectives Appreciate that all models possess inherent limitations

More information

Supplementary materials for: Executive control processes underlying multi- item working memory

Supplementary materials for: Executive control processes underlying multi- item working memory Supplementary materials for: Executive control processes underlying multi- item working memory Antonio H. Lara & Jonathan D. Wallis Supplementary Figure 1 Supplementary Figure 1. Behavioral measures of

More information

Asignificant amount of information systems (IS) research involves hypothesizing and testing for interaction

Asignificant amount of information systems (IS) research involves hypothesizing and testing for interaction Information Systems Research Vol. 18, No. 2, June 2007, pp. 211 227 issn 1047-7047 eissn 1526-5536 07 1802 0211 informs doi 10.1287/isre.1070.0123 2007 INFORMS Research Note Statistical Power in Analyzing

More information

Results & Statistics: Description and Correlation. I. Scales of Measurement A Review

Results & Statistics: Description and Correlation. I. Scales of Measurement A Review Results & Statistics: Description and Correlation The description and presentation of results involves a number of topics. These include scales of measurement, descriptive statistics used to summarize

More information

The Classification Accuracy of Measurement Decision Theory. Lawrence Rudner University of Maryland

The Classification Accuracy of Measurement Decision Theory. Lawrence Rudner University of Maryland Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, April 23-25, 2003 The Classification Accuracy of Measurement Decision Theory Lawrence Rudner University

More information

Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD

Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD Psy 427 Cal State Northridge Andrew Ainsworth, PhD Contents Item Analysis in General Classical Test Theory Item Response Theory Basics Item Response Functions Item Information Functions Invariance IRT

More information

Application of Artificial Neural Networks in Classification of Autism Diagnosis Based on Gene Expression Signatures

Application of Artificial Neural Networks in Classification of Autism Diagnosis Based on Gene Expression Signatures Application of Artificial Neural Networks in Classification of Autism Diagnosis Based on Gene Expression Signatures 1 2 3 4 5 Kathleen T Quach Department of Neuroscience University of California, San Diego

More information

SLAUGHTER PIG MARKETING MANAGEMENT: UTILIZATION OF HIGHLY BIASED HERD SPECIFIC DATA. Henrik Kure

SLAUGHTER PIG MARKETING MANAGEMENT: UTILIZATION OF HIGHLY BIASED HERD SPECIFIC DATA. Henrik Kure SLAUGHTER PIG MARKETING MANAGEMENT: UTILIZATION OF HIGHLY BIASED HERD SPECIFIC DATA Henrik Kure Dina, The Royal Veterinary and Agricuural University Bülowsvej 48 DK 1870 Frederiksberg C. kure@dina.kvl.dk

More information

Neuropsychology, in press. (Neuropsychology journal home page) American Psychological Association

Neuropsychology, in press. (Neuropsychology journal home page) American Psychological Association Abnormality of test scores 1 Running head: Abnormality of Differences Neuropsychology, in press (Neuropsychology journal home page) American Psychological Association This article may not exactly replicate

More information

The Myers Briggs Type Inventory

The Myers Briggs Type Inventory The Myers Briggs Type Inventory Charles C. Healy Professor of Education, UCLA In press with Kapes, J.T. et. al. (2001) A counselor s guide to Career Assessment Instruments. (4th Ed.) Alexandria, VA: National

More information

CHAPTER - 6 STATISTICAL ANALYSIS. This chapter discusses inferential statistics, which use sample data to

CHAPTER - 6 STATISTICAL ANALYSIS. This chapter discusses inferential statistics, which use sample data to CHAPTER - 6 STATISTICAL ANALYSIS 6.1 Introduction This chapter discusses inferential statistics, which use sample data to make decisions or inferences about population. Populations are group of interest

More information

MODEL-BASED CLUSTERING IN GENE EXPRESSION MICROARRAYS: AN APPLICATION TO BREAST CANCER DATA

MODEL-BASED CLUSTERING IN GENE EXPRESSION MICROARRAYS: AN APPLICATION TO BREAST CANCER DATA International Journal of Software Engineering and Knowledge Engineering Vol. 13, No. 6 (2003) 579 592 c World Scientific Publishing Company MODEL-BASED CLUSTERING IN GENE EXPRESSION MICROARRAYS: AN APPLICATION

More information

Goodness of Pattern and Pattern Uncertainty 1

Goodness of Pattern and Pattern Uncertainty 1 J'OURNAL OF VERBAL LEARNING AND VERBAL BEHAVIOR 2, 446-452 (1963) Goodness of Pattern and Pattern Uncertainty 1 A visual configuration, or pattern, has qualities over and above those which can be specified

More information

COMPARING PLS TO REGRESSION AND LISREL: A RESPONSE TO MARCOULIDES, CHIN, AND SAUNDERS 1

COMPARING PLS TO REGRESSION AND LISREL: A RESPONSE TO MARCOULIDES, CHIN, AND SAUNDERS 1 ISSUES AND OPINIONS COMPARING PLS TO REGRESSION AND LISREL: A RESPONSE TO MARCOULIDES, CHIN, AND SAUNDERS 1 Dale L. Goodhue Terry College of Business, MIS Department, University of Georgia, Athens, GA

More information

Regression Discontinuity Analysis

Regression Discontinuity Analysis Regression Discontinuity Analysis A researcher wants to determine whether tutoring underachieving middle school students improves their math grades. Another wonders whether providing financial aid to low-income

More information