Knowledge Discovery and Data Mining
Lecture 15 - ROC, AUC & Lift

Tom Kelsey
School of Computer Science, University of St Andrews
http://tom.home.cs.st-andrews.ac.uk
twk@st-andrews.ac.uk
ID5059-17-AUC, 13 March 2015

Testing

A useful tool for investigating model performance is the confusion matrix:

            y = 0   y = 1
    ŷ = 0     a       b
    ŷ = 1     c       d

It contains quantities for the correct prediction of class 0 (a), the correct
prediction of class 1 (d), and the two ways you may have made incorrect
predictions (b and c).

Performance Measures

    Accuracy                        (a + d) / (a + b + c + d)
    Precision                       d / (c + d)
    Recall (TP rate), Sensitivity   d / (b + d)
    Specificity (TN rate)           a / (a + c)
    FP rate                         c / (a + c)
    FN rate                         b / (b + d)
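The measures above can be computed directly from the four confusion-matrix counts. A minimal sketch, using the slide's layout (a = true negatives, b = false negatives, c = false positives, d = true positives); the counts passed in are made-up illustrative values:

```python
# Performance measures from the confusion matrix counts a, b, c, d.
# Layout as on the slide: a = TN, b = FN, c = FP, d = TP.

def confusion_measures(a, b, c, d):
    return {
        "accuracy":    (a + d) / (a + b + c + d),
        "precision":   d / (c + d),
        "recall":      d / (b + d),   # TP rate / sensitivity
        "specificity": a / (a + c),   # TN rate
        "fp_rate":     c / (a + c),
        "fn_rate":     b / (b + d),
    }

# Hypothetical test-set counts:
m = confusion_measures(a=50, b=10, c=5, d=35)
print(m["accuracy"])  # 0.85
```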
Receiver Operating Characteristic (ROC) Curves

For continuous data with variable cutoff points for the classification:
- Obese Y/N, based on BMI, age, etc.
- Cancerous, based on percent of abnormal tissue in a slide

Given a tree, some test data and a confusion matrix, it's easy to generate a
point on a ROC chart: the x-axis is the FP rate, the y-axis is the TP rate.
This point depends on a probability threshold for the classification.
Varying this threshold will change the confusion matrix, giving more points
on the chart. Use this to tune the model w.r.t. FP and TP rates.

Example

Goldstein and Mushlin, J. Gen. Intern. Med. (1987) 2, 20-24
[Example figures omitted]
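The threshold-sweeping procedure described above can be sketched as follows. This is a minimal illustration, not a library implementation; `scores` (predicted probabilities of class 1) and `labels` are hypothetical values:

```python
# Sketch: generate ROC points by sweeping the classification threshold.
# Each distinct score is used as a threshold t; observations with
# score >= t are predicted positive.

def roc_points(scores, labels):
    P = sum(labels)            # actual positives
    N = len(labels) - P        # actual negatives
    points = [(0.0, 0.0)]      # threshold above every score
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / N, tp / P))   # (FP rate, TP rate)
    return points

pts = roc_points([0.9, 0.8, 0.6, 0.4, 0.3], [1, 1, 0, 1, 0])
```

Each lowering of the threshold moves the point up (another TP) or right (another FP), tracing the curve from (0, 0) to (1, 1).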
Example

[Example figures omitted]

Effect of Thresholding

How the balance between TP, TN, FP and FN changes:

[Figure omitted]
Area Under Curve

The area measures discrimination: the ability of the test to classify
correctly. Useful for comparing ROC curves. Standard academic banding:

    0.90-1.00 = excellent
    0.80-0.90 = good (0.86 for the example)
    0.70-0.80 = fair
    0.60-0.70 = poor
    0.50-0.60 = fail

Computed by trapezoidal estimates (or the curve can be smoothed, then
integrated).

Kelsey et al.

[Figures omitted]
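The trapezoidal estimate mentioned above is a short computation: sum the area of the trapezoid between each pair of consecutive ROC points. A minimal sketch, assuming the points are sorted by FP rate:

```python
# Trapezoidal AUC estimate from ROC points (FP rate, TP rate),
# assumed sorted by increasing FP rate.

def auc_trapezoid(points):
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0   # trapezoid between adjacent points
    return area

# A perfect classifier's ROC passes through (0, 1):
print(auc_trapezoid([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]))  # 1.0
```

A random classifier's ROC is the diagonal from (0, 0) to (1, 1), giving an AUC of 0.5, which is why the banding above bottoms out near 0.5.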
The Case For

S. Ma & J. Huang, "Regularized ROC method for disease classification and
biomarker selection with microarray data", Bioinformatics (2005) 21(24):

"An important application of microarrays is to discover genomic biomarkers,
among tens of thousands of genes assayed, for disease classification. Thus
there is a need for developing statistical methods that can efficiently use
such high-throughput genomic data, select biomarkers with discriminant power
and construct classification rules. The ROC technique has been widely used
in disease classification with low-dimensional biomarkers because (1) it
does not assume a parametric form of the class probability as required for
example in the logistic regression method; (2) it accommodates case-control
designs and (3) it allows treating false positives and false negatives
differently. However, due to computational difficulties, the ROC-based
classification has not been used with microarray data."

The Case Against

J.M. Lobo et al., "AUC: a misleading measure of the performance of
predictive distribution models", Global Ecol. and Biogeog. (2008) 17(2):

"The... AUC, is currently considered to be the standard method to assess the
accuracy of predictive distribution models. It avoids the supposed
subjectivity in the threshold selection process, when continuous probability
derived scores are converted to a binary presence-absence variable, by
summarizing overall model performance over all possible thresholds...
We do not recommend using AUC for five reasons: (1) it ignores the predicted
probability values and the goodness-of-fit of the model; (2) it summarises
the test performance over regions of the ROC space in which one would rarely
operate; (3) it weights omission and commission errors equally; (4) it does
not give information about the spatial distribution of model errors; and,
most importantly, (5) the total extent to which models are carried out
highly influences the rate of well-predicted absences and the AUC scores."

Lift

Measures the degree to which the predictions of a classification model are
better than random predictions. In simple terms, lift is the ratio of the
correct positive classifications made by the model to the actual positive
classifications in the test data.

For example, if 40% of patients have been diagnosed (the positive
classification) in the past, and the model accurately predicts 75% of them,
the lift would be 0.75 / 0.4 = 1.875.
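The worked figure above is a one-line ratio; a minimal sketch, with the slide's numbers passed in as arguments:

```python
# Lift as defined on the slide: the model's rate of correct positive
# classifications divided by the baseline (actual) positive rate.

def lift(model_rate, baseline_rate):
    return model_rate / baseline_rate

print(round(lift(0.75, 0.40), 3))  # 1.875
```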
Lift

Lift charts for a model can be obtained in a similar manner to ROC charts.
For threshold value t:

    x = (TP(t) + FP(t)) / (P + N)
    y = TP(t)

The AUC of a lift chart is no smaller than the AUC of the ROC curve for the
same model. As before, we can compare lift charts for competing models, and
investigate optimal threshold values.

Lift Example

Suppose we have a mailing list of former students, and we want to raise
money by mailing an elaborate brochure. We have demographic information that
we can relate to the response rate. Also, from similar mail-out campaigns,
we have estimated the baseline response rate at 8%. Sending to everyone
would result in a net loss.

We build a model based on the data collected and select the 10% most likely
to respond. If among these the response rate is 16%, then the lift value due
to using the predictive model is 16% / 8% = 2. Analogous lift values can be
computed for each percentile of the population. From this we work out the
best trade-off between expense and anticipated response.

General chart structure

You can think of this as a customer database ordered by predicted
probability: as we move from left to right we penetrate deeper into the
database, from high-p̂ observations to low-p̂ observations.

[Figure omitted]
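The (x, y) coordinates above can be generated by the same threshold sweep used for ROC charts. A minimal sketch with hypothetical scores and labels:

```python
# Lift-chart coordinates as defined on the slide: for each threshold t,
# x = (TP(t) + FP(t)) / (P + N)  (fraction of the population targeted),
# y = TP(t)                      (positives captured at that depth).

def lift_chart_points(scores, labels):
    n = len(labels)  # P + N
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append(((tp + fp) / n, tp))
    return points

pts = lift_chart_points([0.9, 0.8, 0.6, 0.4, 0.3], [1, 1, 0, 1, 0])
```

Reading left to right reproduces the "penetrating deeper into the database" picture: each step targets more of the population, ordered from high p̂ to low p̂.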
Lift

Closely associated with the Pareto Principle: 80% of profit comes from 20%
of customers. A good model and a lift chart help identify those customers.

Why use these plots?

The utility of these charts is hopefully clear:
- if we had a limited budget, we can see what level of response this would
  buy by targeting the (modelled) most likely responders
- we can see how much value our model has brought to the problem (compared
  to a random sample of customers) - in direct monetary terms if costs are
  included
- perhaps we can do a smaller campaign, as the returns diminish beyond some
  percentage of customers targeted
- we can see where a level of customer targeting becomes unprofitable, if
  the costs are known.

Summary

Medics and management use ROC, AUC & Lift whenever possible:
- Easy to compute
- Easy to understand
- Simple 2D graphical expression of how Model A compares to Model B
- Plus useful threshold cutoff information
- Plus important cost-benefit information

You are expected to be able to produce ROC curves. You are not expected to
be able to produce lift charts, but you should be able to explain their
design and use.