Chapter III

METHODS FOR DETECTING CERVICAL CANCER

3.1 INTRODUCTION

The successful detection of cervical cancer in a variety of tissues has been reported by many researchers, and baseline figures for the sensitivity and specificity of their methods can be derived from their publications. In order to establish whether our laboratory could produce results similar to those reported in the literature, a series of pilot studies was undertaken. Several practical considerations are common to all hands-on cytological studies. This chapter discusses these practical issues and explains the reasoning behind the implementation decisions we made. The following chapter then describes the pilot studies and discusses the implications of their results in the context of previous findings in this field.

3.2 REPORTING DIAGNOSTIC TEST RESULTS

The results of diagnostic tests are often reported as a single figure, usually the percentage of correct classifications. This approach has the advantage of being simple to comprehend, and it makes comparisons between results easy to perform; but this single figure is influenced by a multitude of underlying factors, of which the most obvious is the cutoff threshold used for classification. A diagnostic test returns a continuously distributed measurement, or score, for each case. In order to actually classify the case, a threshold value of the score must be established, above (or below) which a case is considered positive.

The error proportion of the test depends upon the classification threshold chosen, and can be adjusted by moving this threshold.

Figure 3.1: Dependence of summary measures of classifier performance on the classification threshold selected.

To fully assess the performance of a diagnostic test it is necessary to understand its performance over a range of threshold values. Undoubtedly the most widely used technique for this purpose is Receiver Operating Characteristic (ROC) curve analysis (Henderson, 1993; Dwyer, 1996). ROC curves were developed for use in signal detection in radar returns in the 1950s. Swets (1986), in an excellent overview article, mentions that they were invented by Theodore G. Birdsall, of the Electrical Engineering Department of the University of Michigan, who taught the technique to him. The use of ROC curves has since been generalized to many problem domains, and is particularly widespread in medical decision making; Swets (1988) mentions at least 100 studies in the field of medical imaging which use ROC curve analysis. Example ROC curves are shown in Figure 3.2.
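To make the threshold dependence illustrated in Figure 3.1 concrete, the short Python sketch below (not part of the original study; the scores and disease labels are invented) computes the error proportion at several candidate thresholds:

```python
# Minimal sketch: error proportion as a function of the classification
# threshold. Scores and disease labels are invented illustrative values.

scores = [0.20, 0.40, 0.45, 0.60, 0.70, 0.80, 0.90]  # one score per case
labels = [0, 0, 1, 0, 1, 1, 1]                        # 1 = diseased, 0 = disease-free

def error_proportion(threshold):
    """Fraction of cases misclassified when score > threshold means 'positive'."""
    wrong = sum((s > threshold) != bool(y) for s, y in zip(scores, labels))
    return wrong / len(scores)

for t in (0.30, 0.50, 0.75):
    print(f"threshold {t:.2f} -> error proportion {error_proportion(t):.2f}")
```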

Figure 3.2: Examples of ROC curves.

3.3 SCREENING TESTS

There are three possible purposes for a clinical diagnostic test: discovery, confirmation and exclusion. A discovery test aims to detect the presence of a disease in a population; screening tests such as Pap smear screening are discovery tests. Confirmation tests are used to confirm the presence of disease in an individual with other symptoms, and exclusion tests allow the presence of disease in an individual to be ruled out. The three types of test have different purposes, and a different decision threshold may be appropriate for each purpose. A rule-in threshold is the threshold used to confirm the presence of disease, while a rule-out threshold is used to exclude it. The two thresholds need not be the same.

The performance of a diagnostic test is assessed in comparison with a gold standard indicator for the disease in question. Henderson (1993) defines a gold standard as "a test constituting definitive diagnostic evidence [which] uniquely defines the disease in the presence of specific symptomatology."

For cervical neoplasia, the gold standard is biopsy of the lesion; however, in many studies biopsy is not available for all patients. Individuals diagnosed as normal are not usually biopsied, and therefore may include missed positives, while some women diagnosed as having neoplasia refuse treatment for personal reasons, and therefore cannot be biopsy-confirmed. A surrogate test, such as examination of a Pap smear by several cytologists, must be used instead in these cases. The possibility of errors in the gold standard must be kept in mind when assessing test performance.

With respect to the gold standard diagnosis, a given test may provide one of four outcomes (Table 3.1). A positive test result in an individual with the disease is known as a true positive (TP); a positive result in a disease-free individual is a false positive (FP); a negative result in a disease-free individual is a true negative (TN); and a negative result in a patient with the disease is a false negative (FN). A table such as Table 3.1 is often referred to as the confusion matrix for a classifier.

Table 3.1: The table of classification outcomes

True Diagnosis   Classified Positive     Classified Negative     Total
Positive         True Positive (TP)      False Negative (FN)     Positive Population (PP)
Negative         False Positive (FP)     True Negative (TN)      Negative Population (NP)
Total            Classed Positive (CP)   Classed Negative (CN)   Total Population

From these outcomes, several related measures of test performance can be calculated (Bradley, 1997):

Accuracy = (1 - Error) = (TP + TN)/(PP + NP) = Pr(C), the probability of a correct classification.

Sensitivity = TP/(TP + FN) = TP/PP = Pr(disease detected | disease present), the ability of the test to detect disease in a population of diseased individuals.

Specificity = TN/(TN + FP) = TN/NP = Pr(negative result | no disease present), the ability of the test to correctly rule out the disease in a disease-free population.

3.4 PREDICTIVE VALUE OF A TEST

Measures of sensitivity and specificity do not indicate how a test will perform in clinical practice. To do this, it is necessary to calculate the predictive values of that test. The positive predictive value (PPV) of the test is the proportion of all positive tests which correctly indicate the presence of disease; that is,

PPV = TP/(TP + FP) = Pr(disease present | positive test)
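As a purely illustrative example, the following Python sketch computes these measures from hypothetical counts for the four cells of Table 3.1 (the counts are invented, not data from this study):

```python
# Minimal sketch: performance measures from the confusion matrix of Table 3.1.
# The four counts are hypothetical.

TP, FP, FN, TN = 45, 10, 5, 40

PP = TP + FN          # positive population (truly diseased)
NP = TN + FP          # negative population (truly disease-free)

accuracy = (TP + TN) / (PP + NP)      # Pr(correct classification)
sensitivity = TP / PP                 # Pr(disease detected | disease present)
specificity = TN / NP                 # Pr(negative result | no disease present)
ppv = TP / (TP + FP)                  # Pr(disease present | positive test)

print(f"accuracy={accuracy:.3f}  sensitivity={sensitivity:.3f}  "
      f"specificity={specificity:.3f}  PPV={ppv:.3f}")
```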

3.5 RECEIVER OPERATING CHARACTERISTIC CURVE

An ROC curve is constructed by first classifying the data set of interest. The classification result can be a single real number or an ordinal number, permitting a sensible ranking of cases (Dwyer, 1996; van Erkel & Pattynama, 1998). The true positive and false positive rates will depend upon the classification threshold chosen; to plot an ROC curve this threshold is varied over all possible output values and the true positive proportion is plotted against the false positive proportion for each threshold. The resulting curve will follow the diagonal from (0,0) to (1,1) if the classifier has no power (area under the curve is 0.5), and will hug the top left corner of the plot for a perfect classifier (area under the curve is 1.0).
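A minimal Python sketch of this construction (again with invented scores and labels, not data from this study) sweeps the threshold over the observed score values and records one (false positive proportion, true positive proportion) pair per threshold:

```python
# Minimal sketch: empirical ROC points from a threshold sweep.
# Scores and labels are invented; 1 = diseased, 0 = disease-free.

scores = [0.20, 0.40, 0.45, 0.60, 0.70, 0.80, 0.90]
labels = [0, 0, 1, 0, 1, 1, 1]

n_pos = sum(labels)
n_neg = len(labels) - n_pos

roc = [(0.0, 0.0)]  # threshold above all scores: nothing called positive
for t in sorted(set(scores), reverse=True):
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
    roc.append((fp / n_neg, tp / n_pos))  # (FPR, TPR)

print(roc)  # runs from (0, 0) up toward (1, 1)
```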

The prevalence of the disease, the proportion of individuals with the disease in a given population at a specified point in time, will affect these predictive values. The predictive value of a given test decreases with decreasing prevalence of the disease, so that even a relatively good test will perform poorly for a disease of very low prevalence.

3.6 ROC CURVE ANALYSIS

ROC curves are usually plotted assuming a binormal distribution of the data; that is, it is assumed that the data from both the diseased and the non-diseased groups are distributed normally, and the mean and variance of each distribution are estimated separately (Swets, 1986; Metz, Herman & Roe, 1998). These distributions are then used to find the area under the curve. A number of nonparametric methods have also been developed for finding the area under the curve, to handle data which are highly non-Gaussian, although Hajian-Tilaki, Hanley, Joseph & Collet (1997) demonstrate that both parametric and non-parametric methods yield very similar estimates of the area under the curve for the same data; these authors conclude that for a wide range of distributions ROC curves and their associated methodology are relatively robust. For statistical details of these approaches see Swets (1986). The binormal assumption is widely used because much of the original work on ROC curves was performed by humans ranking images (radar or radiological). A human can only handle a limited number of categories; five or seven are the numbers frequently cited in the literature (Hanley & McNeil, 1982; Swets, 1986). The area under an ROC curve empirically plotted from five categories can be estimated using the trapezoidal rule, but this approach is known to underestimate the area.
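The trapezoidal estimate is straightforward to compute. The sketch below (with invented points, not values from the original text) applies it to a short list of empirical (FPR, TPR) pairs such as those produced by the threshold sweep of Section 3.5:

```python
# Minimal sketch: area under an empirical ROC curve by the trapezoidal rule.
# With only a few points the result tends to underestimate the true area.

roc = [(0.0, 0.0), (0.1, 0.5), (0.3, 0.8), (0.6, 0.95), (1.0, 1.0)]  # (FPR, TPR)

auc = sum((x1 - x0) * (y0 + y1) / 2                 # one trapezoid per segment
          for (x0, y0), (x1, y1) in zip(roc, roc[1:]))

print(f"trapezoidal AUC = {auc:.3f}")
```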

The binormal assumption allowed a smooth curve to be drawn, and the area underneath it to be calculated accurately (Hanley & McNeil, 1982). Using more recent computer techniques and a continuous decision variable, the binormal assumption is unnecessary and an empirical approach can be taken.

3.7 MEASURES FROM ROC CURVES

The area under the ROC curve is commonly used as a measure of the overall performance of the classifier. Bamber (1975) demonstrated that the area under the ROC curve is equivalent to the value of the non-parametric Mann-Whitney U statistic, which in turn is equivalent to the Wilcoxon statistic. The Wilcoxon statistic is also a nonparametric statistic, usually used to test the hypothesis that the distribution of a variable, x, from one population, p, is equal to that from a second population, n. If the null hypothesis, H0: xp = xn, is rejected, the probability p can be calculated that xp > xn, xp < xn or xp ≠ xn. The Wilcoxon test makes no assumptions about the distributions of the underlying variables. The area under the ROC curve effectively measures P(xp > xn), and so represents the probability that a randomly chosen positive example from a data set of interest is correctly ranked with respect to a randomly chosen negative example (Hanley & McNeil, 1982).
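This ranking interpretation can be checked directly. The sketch below (invented scores) estimates the AUC as the proportion of positive/negative pairs in which the positive case scores higher, counting ties as one half; this is exactly the Mann-Whitney U statistic divided by Cp · Cn:

```python
# Minimal sketch: AUC as P(x_p > x_n), i.e. U / (Cp * Cn).
# Positive and negative scores are invented illustrative values.

pos = [0.45, 0.70, 0.80, 0.90]   # scores of diseased cases
neg = [0.20, 0.40, 0.60]         # scores of disease-free cases

pairs = [(p, n) for p in pos for n in neg]
auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)

print(f"AUC = P(x_p > x_n) = {auc:.3f}")
```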

In order to compare classifiers, it is necessary to estimate the standard error of the area under the curve, SE(AUC). The method for doing this varies with the method used to estimate the AUC, but one easily applied method, which is applicable to an empirically derived curve, is to use the standard error of the Wilcoxon statistic, SE(W):

SE(W) = √{ [θ(1 − θ) + (Cp − 1)(Q1 − θ²) + (Cn − 1)(Q2 − θ²)] / (Cp Cn) }   (3.1)

where θ is the area under the curve, Cp and Cn are the number of positive and negative examples respectively, and

Q1 = θ / (2 − θ)   (3.2)

Q2 = 2θ² / (1 + θ)   (3.3)

Q1 is the probability that two randomly chosen abnormal images will both be ranked with greater suspicion than a randomly chosen normal image, and Q2 is the probability that one randomly chosen abnormal image will be ranked with greater suspicion than two randomly chosen normal images (Hanley & McNeil, 1982; Henderson, 1993). SE(W) decreases as the number of samples on which it is estimated, N, increases, the decrease being roughly proportional to 1/√N. SE(W) also decreases as the area under the curve increases, approaching 0 as the area under the curve approaches 1.
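Equations 3.1 to 3.3 translate directly into code. The sketch below (with illustrative inputs, not values from this study) computes SE(W) from an AUC estimate θ and the class sizes Cp and Cn:

```python
# Minimal sketch: Hanley-McNeil standard error of the AUC (equations 3.1-3.3).

import math

def se_w(theta, c_p, c_n):
    """SE of the Wilcoxon-based AUC for c_p positive and c_n negative cases."""
    q1 = theta / (2 - theta)              # equation 3.2
    q2 = 2 * theta ** 2 / (1 + theta)     # equation 3.3
    var = (theta * (1 - theta)
           + (c_p - 1) * (q1 - theta ** 2)
           + (c_n - 1) * (q2 - theta ** 2)) / (c_p * c_n)
    return math.sqrt(var)                 # equation 3.1

print(f"SE(W) = {se_w(0.85, 50, 50):.4f}")   # illustrative values
```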

To overcome these problems, Dwyer (1996) has suggested analysing the area of that portion of the ROC curve which corresponds to a desirable range of false positive values. This allows the comparison of only those parts of the curve which are in the area of clinical interest.

Another use of the ROC curve is to select a single threshold for optimum discrimination. Intuitively, this point is the point closest to the upper left corner of the graph; however, cost/benefit considerations may alter this optimum. Van Erkel & Pattynama (1998) recommend combining ROC analysis with formal cost-effectiveness analysis in order to determine the optimal threshold. Returning to the use of a single point would appear to invalidate the use of ROC curves, since the same point could potentially be selected on the basis of the sensitivity/specificity values on which the curve is based. This is not, in fact, the case. A single pair of values may represent a point where sensitivity may be increased with little loss of specificity and vice versa, or it may not. Inspection of the ROC curve allows clarification of such issues (Bradley, 1997).

3.8 SLOPE OF THE ROC CURVE

The area under the ROC curve provides an overall measure of the behaviour of the classifier, but does not measure performance at specific points. The diagnostic utility of a particular point on the ROC curve can be calculated using the odds-likelihood ratio form of Bayes' Rule (Henderson, 1993). In its simplest form, Bayes' Rule is usually written as

Pr(D|R) = Pr(R|D) Pr(D) / Pr(R)   (3.4)

where Pr(D|R) is the probability of disease (D) given the test result (R), Pr(R|D) is the probability of the result given that the disease is present, Pr(D) is the probability of disease (D) in the population, and Pr(R) is the probability of the test result in the population as a whole. This equation can also be expressed in terms of the probability of non-disease (D̄), and the two expressions combined as a ratio:

Pr(D|R) / Pr(D̄|R) = [Pr(R|D) / Pr(R|D̄)] × [Pr(D) / Pr(D̄)]   (3.5)

or, equivalently, (posterior odds of disease) = (positive likelihood ratio) × (prior odds of disease).

The positive likelihood ratio (LR+) is the ratio between the probability of a positive test result given the presence of a disease and the probability of the same test result given the absence of the disease (Choi, 1998). The negative likelihood ratio (LR-) can be defined analogously. These likelihood ratios can be calculated from the slope of the ROC curve. For a continuous test, the tangent to the curve at the point representing the threshold of interest corresponds to the likelihood ratio for a single test value at that point (Henderson, 1993; Dwyer, 1996; Choi, 1998). The likelihood ratio of a positive test result for a dichotomous test is given by

LR+ = TPR / FPR   (3.6)

where TPR is the true positive proportion and FPR is the false positive proportion, both of which lie between 0.0% and 100.0%. This likelihood ratio is equivalent to the tangent of the angle formed at (0,0) between the line to that point and the line to (1,0), that is, the slope of the line from the origin to the point. The likelihood ratio of a negative test result is given by

LR- = FNR / TNR   (3.7)

which is equivalent to the slope of a line from that point to (1,1) (Choi, 1998).

Before performing any tests at all, the prior probability of a patient having the disease is equal to the prevalence of the disease in the population of interest (Moons et al., 1997). The posterior odds required for treatment depend upon clinical issues such as the cost/benefit tradeoff discussed above, and will vary greatly from test to test and disease to disease. Further discussion of these issues can be found in Moons et al. (1997).
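The sketch below (with invented sensitivity, specificity and prevalence values) puts equations 3.4 to 3.7 together: it computes LR+ and LR- for a dichotomous test and applies the odds-likelihood form of Bayes' Rule to a positive result:

```python
# Minimal sketch: likelihood ratios and posterior odds (equations 3.4-3.7).
# Sensitivity, specificity and prevalence are invented illustrative values.

sensitivity = 0.90           # TPR
specificity = 0.80           # TNR
prevalence = 0.05            # prior probability of disease

lr_pos = sensitivity / (1 - specificity)    # LR+ = TPR / FPR (equation 3.6)
lr_neg = (1 - sensitivity) / specificity    # LR- = FNR / TNR (equation 3.7)

prior_odds = prevalence / (1 - prevalence)
post_odds = lr_pos * prior_odds             # posterior odds after a positive test
post_prob = post_odds / (1 + post_odds)

print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
print(f"Pr(disease | positive test) = {post_prob:.3f}")
```

Note how strongly prevalence dominates: even with LR+ = 4.5, a positive result at 5% prevalence yields a posterior probability of disease of only about 0.19, which is the point made in Section 3.5 about tests for low-prevalence diseases.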

3.9 DISCUSSION OF ROC CURVE ANALYSIS

ROC curve analysis has not been widely used in the investigation of cervical cancer. Early studies tended to be subjective and qualitative, and did not yield the type of data suitable for ROC curve generation. The first workers to use ROC analysis in this context were Burger et al. (1981), who present ROC curves for their cervical cell classifiers, but do not extract any summary measures, such as the area under the curve. Haroske et al. (1990) also take this approach in demonstrating the performance of their hierarchical classifier, as do Garner et al. (1994b), who merely mention that "the closer the curves are to the axes, the better the system performance" (Garner et al., 1994b, p. 8). The same is true of Palcic & MacAulay (1994). Payne et al. (1997) do, however, present an ROC curve, together with a discussion of the implications of such a curve for the underlying classifier, and of the sensitivity, specificity and positive predictive value of the classifier at selected operating points. Given the amount of work which has gone into the development of ROC curve analysis for the characterization, optimization and comparison of classifiers, it would appear that these techniques have to date been underutilized by cervical cancer researchers.