An Introduction to ROC curves. Mark Whitehorn

1 An Introduction to ROC curves Mark Whitehorn

2 It's all about me Prof. Mark Whitehorn, Emeritus Professor of Analytics, Computing, University of Dundee. Consultant. Writer (author). m.a.f.whitehorn@dundee.ac.uk

3 It's all about me Teach Masters in: Data Science; part time; distance learning, aimed at existing data professionals; Data Engineering

4 Scope Note the word "introduction" in the title. We are simply going to do a little gentle roc climbing to try to flatten out the learning curve.

5 ROC Curves Receiver Operating Characteristic (ROC) Curves ROC curves have a long and glorious history and are very broadly applicable. Their history explains the name and we will come to that but since this conference is about Machine Learning (ML) we will start there and then go forward to the past.

6 Definitions Augusta Ada King-Noel, Countess of Lovelace Image: Alfred Edward Chalon [Public domain], via Wikimedia Commons

7 Definitions Ada Lovelace: "The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform. It can follow analysis; but it has no power of anticipating any analytical relations or truths. Its province is to assist us in making available what we are already acquainted with."

8 Definitions Arthur Samuel* (1959) built a computer program that could play draughts better than he could. He defined ML as a "Field of study that gives computers the ability to learn without being explicitly programmed." *Worked with Donald Knuth.

9 Where do ROC curves fit in ML? ML is often split into: Unsupervised Learning, which doesn't concern us; Supervised Learning, which does.

10 A very brief overview of Machine Learning Supervised learning uses data which can be classified: Good/Bad customer, Male/Female. ROC curves are used with classification algorithms. Not all classification is binary of course, but ROC curves are typically used with binary classifications.* (You can use a one-vs-all classifier to reduce many classes down to two.) * Volume Under the ROC Surface for Multi-class Problems. Exact Computation and Evaluation of Approximations. Ferri et al., 2003

11 Data and classification
Columns: Features, Predictors, Attributes, Dimensions. Final column: Output, Response. Rows: Cases, Examples, Instances, Observations.

CaseID  Date of Incident  Time of Incident  PostCode  Value of Items  Number of Items  Etc  Fraud
1       23/01/            :30               DD1 4HN                                         No
2       23/01/            :50               HR2 5ES                                         No
3       23/01/            :45               RD2 5VG   1,230           2                     Yes
4       23/01/            :40               DF4 2WS                                         No
5       23/01/            :50               TH7 4RD                                         No
6       23/01/            :20               WE3 5Rf                                         No
etc     etc               etc               etc       etc             etc              etc  etc

12 Data modelling Training data + algorithm = data model

13 Data modelling Training data + algorithm = data model. Data model + test data = evaluation. ROC curves are associated with the testing/evaluation phase of this process.

14 Data modelling Training data + algorithm = data model. Data model + test data = evaluation. New data + data model = new information.

15 Classifiers The model that we build in this instance is a classifier, so we could say that: Classifiers consistently classify cases into classes and it would certainly ROC if we did so. So, for example, classifying new insurance claims as fraudulent. Note that if a claim is not classified as fraudulent then the assumption is that it is not fraudulent. (This being the nature of binary classification.)

16 Classifiers We usually try many different classification algorithms, end up with multiple models and need to choose the best one. Deciding which is best is harder than it sounds; this is where we use ROC curves. (As an aside, we may also want to combine several models into an ensemble model; but either way we want to be able to estimate the efficiency of each of our models.)

17 Classifier evaluation We are performing binary classification (can be male/female, good/bad) so we need generic terms: Positive and Negative. In turn this means that we have to make a decision: we are trying to find Good customers. Having made that decision, a Good customer is positive; anything that is not a good customer is negative.

18 Classifier evaluation Each incoming case is either Positive or Negative and our ML algorithm will apply a Positive or Negative classification. The classification can be right or wrong, so we can have four states. (Grid: Actual Female or Male against Classification Female or Not Female.)

19 Classifier evaluation True Positive: actually Female, classified Female. (A bit like the True True in Cloud Atlas. But not true much.)

20 Classifier evaluation False Positive: actually Male, classified Female.

21 Classifier evaluation False Negative: actually Female, classified Not Female.

22 Classifier evaluation True Negative: actually Male, classified Not Female.

23 Classifier evaluation Four possible states: Blue is good

                      Classified Positive (CP)   Classified Negative (CN)
Actual Positive (P)   True Positive (TP)         False Negative (FN)
Actual Negative (N)   False Positive (FP)        True Negative (TN)
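As a minimal sketch (not from the slides), the four states can be counted from paired lists of actual and classified labels. The function name `count_outcomes` and the sample labels are illustrative assumptions.

```python
from collections import Counter

def count_outcomes(actual, classified, positive):
    """Count TP, FN, FP and TN for one chosen positive class."""
    counts = Counter()
    for a, c in zip(actual, classified):
        if a == positive:
            # Actually positive: right is TP, wrong is FN.
            counts["TP" if c == positive else "FN"] += 1
        else:
            # Actually negative: wrong is FP, right is TN.
            counts["FP" if c == positive else "TN"] += 1
    return counts

# Hypothetical sample: actual labels and what a classifier said.
actual     = ["Female", "Female", "Male",   "Male", "Female"]
classified = ["Female", "Male",   "Female", "Male", "Female"]
outcomes = count_outcomes(actual, classified, positive="Female")
print(dict(outcomes))
```

Choosing a different `positive` class swaps the roles of the four cells, which is why the decision about what counts as Positive has to be made first.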

24 Classification models This kind of grid (Confusion Matrix) is used very frequently, for example:

                    Classified Female   Classified Not Female   Total
Actual Female (P)   1,231 (TP)          45 (FN)                 1,276
Actual Male (N)     34 (FP)             704 (TN)                738
Total               1,265               749                     2,014

25 Model evaluation So, you have a model and it will classify with four results. How good is it? You are going to have to evaluate it. In fact, you may have several, so you need to evaluate them all to find out which is the best.

26 Model evaluation You have one or several models so you need to evaluate them all to find out which is the best. One option. Is each model: A. slightly good? B. very good? C. Whizzo? We need a more precise measure.

27 Accuracy Accuracy (ACC) can be estimated as:

ACC = (TP + TN) / Total = (1,231 + 704) / 2,014 = 0.96

using the Female / Not Female example: P = 1,276 (TP = 1,231, FN = 45) and N = 738 (FP = 34, TN = 704).

28 ACC I train my model using training data. I test my model using, unsurprisingly, testing data. I calculate how many of the cases in the test data were correctly identified. My model scores 99%. So, is ACC a good metric?

29 ACC Suppose 1% are Male. And suppose that my model assumes all are Female. My model scores an impressive 99%.

                    Classified Female   Classified Not Female   Total
Actual Female (P)   99 (TP)             0 (FN)                  99
Actual Male (N)     1 (FP)              0 (TN)                  1
Total               100                 0                       100

30 ACC Hmmmm. We need another metric. Happily the good thing about metrics is that they are like standards, and "the wonderful thing about standards is that there are so many from which to choose."* *Attributed to Grace Hopper but also to others.

31 There are many such metrics:

                      Classified Positive (CP)   Classified Negative (CN)
Actual Positive (P)   True Positive (TP)         False Negative (FN)
Actual Negative (N)   False Positive (FP)        True Negative (TN)

Test                                               Formula
Accuracy (ACC)                                     (TP + TN)/Total
True Negative Rate, Specificity (SPC)              TN/N = TN/(FP + TN)
True Positive Rate (TPR), Sensitivity or Recall    TP/P = TP/(TP + FN)
False Positive Rate (FPR), Fall-out                FP/N = FP/(FP + TN)
False Negative Rate (FNR), Miss rate               FN/P = FN/(TP + FN) = 1 - TPR
Positive Predictive Value (PPV), Precision         TP/(TP + FP)
Negative Predictive Value (NPV)                    TN/(TN + FN)
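The formulas in the table translate directly into code. The sketch below is illustrative only: the function name `metrics` is an assumption, and the counts are the Female / Not Female example, where FN = 45 and TN = 704 are taken from the row totals (1,276 - 1,231 and 738 - 34).

```python
def metrics(tp, fn, fp, tn):
    """Return the metrics from the table for one confusion matrix."""
    p, n = tp + fn, fp + tn          # actual positives and actual negatives
    return {
        "ACC": (tp + tn) / (p + n),
        "SPC": tn / n,               # True Negative Rate
        "TPR": tp / p,               # Sensitivity or Recall
        "FPR": fp / n,               # Fall-out
        "FNR": fn / p,               # Miss rate, equal to 1 - TPR
        "PPV": tp / (tp + fp),       # Precision
        "NPV": tn / (tn + fn),
    }

m = metrics(tp=1231, fn=45, fp=34, tn=704)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

This version does no guarding against empty classes, so it assumes both P and N are non-zero.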

32 And we can start building more complex ones:

Matthews Correlation Coefficient (MCC) = (TP * TN - FP * FN) / Sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
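A short sketch of the MCC calculation follows. Returning 0.0 when the denominator is zero is a common convention but is an assumption here, not something the slides specify; the function name `mcc` and the "perfect" counts are also illustrative.

```python
import math

def mcc(tp, fn, fp, tn):
    """Matthews Correlation Coefficient for one confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column is empty
    return (tp * tn - fp * fn) / denom

# The "everything is Female" model from the ACC example: 99% accurate.
always_female = mcc(tp=99, fn=0, fp=1, tn=0)
# A hypothetical perfect classifier on a balanced test set.
perfect = mcc(tp=50, fn=0, fp=0, tn=50)
print(always_female, perfect)
```

Under that convention the model that scores 99% on ACC scores nothing on MCC, which is one reason the more complex metrics exist.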

33 Good news everybody!! ROC curves only use two.

True Positive Rate (TPR), Sensitivity or Recall: TP/P = TP/(TP + FN). The fraction of positive examples that are correctly classified. How many of the positives do we get right?

False Positive Rate (FPR), Fall-out: FP/N = FP/(FP + TN). The fraction of negative examples that are incorrectly classified. How many of the negatives do we get wrong?

* Note that it is possible to get division-by-zero errors if P or N is zero.
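One way to handle the division-by-zero note is to treat a rate as undefined when its class is empty. This is a sketch under that assumption; the names `rate` and `roc_point` are hypothetical, and the second call uses made-up counts with no actual negatives.

```python
def rate(numerator, denominator):
    """Return numerator/denominator, or None when the rate is undefined."""
    if denominator == 0:
        return None
    return numerator / denominator

def roc_point(tp, fn, fp, tn):
    """Return (FPR, TPR) for one confusion matrix."""
    return rate(fp, fp + tn), rate(tp, tp + fn)

typical = roc_point(tp=1231, fn=45, fp=34, tn=704)
no_negatives = roc_point(tp=10, fn=5, fp=0, tn=0)   # N is zero, so FPR is undefined
print(typical)
print(no_negatives)
```

Returning None rather than 0 keeps an undefined rate from being plotted as if it were a real point in ROC space.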

34 Base Line We plot the TPR (vertical axis) against the FPR (horizontal axis) and produce ROC space. It has a base line as shown: the diagonal where TPR equals FPR. All classifiers should be above this line. Why?

35 Base Line Well, what happens if we simply guess? (Table: random Positive and Negative predictions against Actual Positive and Actual Negative, with totals.) FPR = 50%, TPR = 50%

42 Base Line What happens if we simply guess? (Table: random Positive and Negative predictions against Actual Positive and Actual Negative, with totals.) FPR = 80%, TPR = 80%
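The guessing argument can be sketched with expected counts. If every case is called Positive with the same probability q, regardless of what it really is, the expected TPR and FPR are both q, whatever the class sizes. The function name `guess_point` and the class sizes below are illustrative assumptions.

```python
def guess_point(p, n, q):
    """Expected ROC point when each case is called positive with probability q."""
    tp = q * p              # expected actual positives guessed positive
    fp = q * n              # expected actual negatives guessed positive
    return fp / n, tp / p   # (FPR, TPR)

for q in (0.5, 0.8):
    for p, n in ((100, 100), (990, 10)):
        fpr, tpr = guess_point(p, n, q)
        print(f"q={q} P={p} N={n} -> FPR={fpr:.2f} TPR={tpr:.2f}")
```

Every guessing strategy therefore lands on the diagonal, which is why that diagonal serves as the base line.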

43 Base Line So anything above the line is doing better than random guessing.

44 Base Line Anything below the line?

45 Base Line We simply reverse the classification.
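Reversing the classification turns every TP into an FN and every FP into a TN, so the point (FPR, TPR) moves to (1 - FPR, 1 - TPR). A small sketch, with a hypothetical point below the base line:

```python
def reverse(fpr, tpr):
    """ROC point obtained by swapping every Positive and Negative classification."""
    return 1 - fpr, 1 - tpr

below = (0.7, 0.2)        # hypothetical (FPR, TPR) below the base line
above = reverse(*below)   # the same model with its answers flipped
print(below, "->", above)
```

A point below the line is reflected to a point above it, so a consistently wrong model is as useful as a consistently right one once you flip its answers.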

46 Base Line Earlier we had a problem using ACC. Reality was producing Positive 99%, Negative 1%. We had an algorithm that predicted all Positive, which scores 99% on ACC. How does ROC space cope?

47 Base Line Well, here it is. (Table: Positive and Negative predictions against Actual Positive and Actual Negative, with totals.) FPR = TPR = 100.0
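As a quick check, here is the all-Positive model worked through with the counts from the earlier ACC example (99 actual positives, 1 actual negative, everything classified Positive). The variable names are illustrative.

```python
# All-Positive model: TP = 99, FN = 0, FP = 1, TN = 0.
tp, fn, fp, tn = 99, 0, 1, 0

acc = (tp + tn) / (tp + fn + fp + tn)
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)
on_base_line = (tpr == fpr)

print(f"ACC={acc:.2f} FPR={fpr:.0%} TPR={tpr:.0%} on base line: {on_base_line}")
```

ACC rewards the model for the class imbalance; ROC space places it on the base line, no better than guessing.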

48 Base Line But change it by the merest smidgen: (Table: Positive and Negative predictions against Actual Positive and Actual Negative, with totals.) FPR = 0.0, TPR = 100.0

49 Base Line Suppose you try five different models, each of which produces a single point. (Table: Positive and Negative predictions against Actual Positive and Actual Negative, with totals.) FPR = 4.6, TPR = 96.5

50 Base Line You can then assess them. Nominally the best one is closest to the top left-hand corner. This is not yet a classic ROC curve. For that we need a model that has a moveable threshold.
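"Closest to the top left-hand corner" can be read as the smallest straight-line distance to the point FPR = 0, TPR = 1. That reading is an assumption, and of the five points below only model_a (FPR 4.6%, TPR 96.5%) comes from the slides; the others are made up for illustration.

```python
import math

# (FPR, TPR) as fractions. Only model_a's point appears in the slides.
models = {
    "model_a": (0.046, 0.965),
    "model_b": (0.20, 0.90),
    "model_c": (0.10, 0.70),
    "model_d": (0.50, 0.99),
    "model_e": (0.01, 0.60),
}

def distance_to_corner(point):
    """Euclidean distance from a ROC point to the ideal corner (0, 1)."""
    fpr, tpr = point
    return math.hypot(fpr, 1 - tpr)

best = min(models, key=lambda name: distance_to_corner(models[name]))
print(best, round(distance_to_corner(models[best]), 3))
```

Distance to the corner treats a false positive and a false negative as equally bad, which is why the choice is only nominal.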

51 Zombie Apocalypse Zombies are rife in the city. After infection, people don't show symptoms for about 48 hours. We have people in quarantine. We discover that proto-zombies have a higher level of a given antibody than the uninfected, so we measure the level of antibodies in the blood samples of these people.

52 Efficiency We measure the antibody level (Ab) for 4,000 people for whom we know the diagnosis. (If the numbers look very convenient, that is because they are.)

Ab     Healthy   Infected
n      2,000     2,000
mean
SD     10        10

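53 Very approximately:
70% of all values fall within 1 SD of the mean
95% within 2 SD
99% within 3 SD

(Figure: normal curve, Values against SD, split into bands: 0.5% below -3 SD, 2% from -3 to -2 SD, 12.5% from -2 to -1 SD, 35% from -1 SD to the mean, 35% from the mean to +1 SD, 12.5% from +1 to +2 SD, 2% from +2 to +3 SD, 0.5% above +3 SD.)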

54 Efficiency So the distributions overlap; the means differ by ten (one SD) and the SD is ten. (Sometimes the world is just too kind.) (Figure: Healthy and Infected distributions against Ab level.)

55 Efficiency As a side issue, do bear in mind that, if we look at the population of untested patients, we don't see two overlapping normal distributions because they are additive. (Figure: Healthy and Infected distributions against Ab level.)

56 Efficiency Mixed patients, equal numbers of healthy and infected. So what we see looks like a single, approximately normal distribution with a greater standard deviation. (Figure: combined distribution against Ab level.)

57 Efficiency But if we know who is healthy or infected (as we do in our training data) then we do see two overlapping normal distributions. (Figure: Healthy and Infected distributions against Ab level.)

58 Efficiency So, if we were trying to classify unknown patients as Healthy (positive) we choose a threshold Ab value. Below that threshold we say the person is healthy, above it we say they are not healthy (a zombie). (Figure: Healthy and Infected distributions against Ab level; below the threshold "Here be living", above it "Here be living dead".)

59 Efficiency We can set the threshold wherever we want, but every time we move it we change the numbers of TP, FP, TN and FN. (Figure: Healthy and Infected distributions against Ab level; "Here be living" below the threshold, "Here be not living" above it.)

60 Efficiency Assume we are actively trying to identify the healthy. Healthy is therefore supposed to be classified as positive. Infected is supposed to be classified as negative. (Figure: Healthy distribution against Ab level, split by the threshold into TP and FN.)

61 Efficiency These people are all healthy (Positive). The majority are correctly classified (True Positive). Some are falsely classified as negative (False Negative). (Figure: Healthy distribution against Ab level, split by the threshold into TP and FN.)

62 Efficiency These people are infected (Negative). Most are correctly classified (True Negative). But some are incorrectly classified as healthy (False Positive). (Figure: Infected distribution against Ab level, split by the threshold into FP and TN.)

63 Efficiency If we decide to set the threshold at 70, clearly almost everyone is classified as not healthy (infected). So, massive numbers of True and False Negatives. But what of FPR and TPR? (Figure: Healthy and Infected distributions against Ab level.)

64 Efficiency The TPR is the fraction of positive examples that are correctly classified. 0.5% of the healthy people are correctly classified, so TPR = 0.5%. The FPR is the fraction of negative examples that are incorrectly classified. 0% of the infected people are incorrectly classified, so FPR = 0%.

65 Efficiency 2.5% of the healthy are correctly classified, so TPR = 2.5%. 0.5% of the infected are incorrectly classified, so FPR = 0.5%. (Figure: Healthy and Infected distributions against Ab level.)

66 Efficiency TPR = 15%, FPR = 2.5%. (Figure: Healthy and Infected distributions against Ab level.)

67 Efficiency TPR = 50%, FPR = 15%. (Figure: Healthy and Infected distributions against Ab level.)

68 Efficiency Carrying on with collecting TPR and FPR: TPR = 85%, FPR = 50%. (Figure: Healthy and Infected distributions against Ab level.)

69 Efficiency TPR = 97.5%, FPR = 85%. (Figure: Healthy and Infected distributions against Ab level.)

70 Efficiency TPR = 99.5%, FPR = 97.5%. (Figure: Healthy and Infected distributions against Ab level.)

71 Efficiency TPR = 100%, FPR = 99.5%. (Figure: Healthy and Infected distributions against Ab level.)

72 Efficiency TPR = 100%, FPR = 100%. (Figure: Healthy and Infected distributions against Ab level.)
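The walk through the thresholds can be sketched with two normal distributions. The slides give an SD of 10 for both groups; the means of 100 (Healthy) and 110 (Infected) used below are assumptions, chosen because they fit the TPR and FPR figures quoted for a threshold of 70 and each step after it. Exact normal tail values also differ slightly from the "very approximate" percentages on the slides.

```python
from statistics import NormalDist

# Assumed means (not stated in the slides); SD = 10 is from the slides.
healthy = NormalDist(mu=100, sigma=10)
infected = NormalDist(mu=110, sigma=10)

def roc_point(threshold):
    """Below the threshold is classified Healthy (positive)."""
    tpr = healthy.cdf(threshold)    # healthy correctly called healthy
    fpr = infected.cdf(threshold)   # infected wrongly called healthy
    return fpr, tpr

thresholds = range(70, 160, 10)     # 70, 80, ..., 150
points = [roc_point(t) for t in thresholds]
for t, (fpr, tpr) in zip(thresholds, points):
    print(f"threshold={t:3d}  TPR={tpr:6.1%}  FPR={fpr:6.1%}")
```

Each threshold yields one (FPR, TPR) pair; joining the pairs gives the ROC curve.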

73 Efficiency Given these figures, we can plot a ROC curve which shows the efficiency of the model. (Figure: ROC curve for our model, with the random base line.)

74 ROC Curves These are very important estimators of the efficiency of a machine learning algorithm. However it is worth noting what ROC curves do and do not measure. They are great, but they have limitations.

75 ROC Curves Note that the different thresholds do not appear on the ROC curve and, from the curve alone, you cannot work out where they were.

76 Efficiency If the curves are further apart, what happens to the ROC curve? (Figure: Healthy and Infected distributions against Ab level.)
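One way to explore the question is to hold the FPR fixed and see what TPR is available as the two means move apart. This is a sketch under assumptions: equal SDs of 10, a Healthy mean of 100 (not stated in the slides), a fixed FPR of 15%, and a hypothetical function name `tpr_at_fpr`.

```python
from statistics import NormalDist

def tpr_at_fpr(separation, fpr=0.15, sd=10, healthy_mean=100):
    """TPR at the threshold that lets the given share of infected through."""
    healthy = NormalDist(healthy_mean, sd)
    infected = NormalDist(healthy_mean + separation, sd)
    threshold = infected.inv_cdf(fpr)   # Ab level below which `fpr` of infected fall
    return healthy.cdf(threshold)       # share of healthy below the same level

results = {sep: tpr_at_fpr(sep) for sep in (0, 10, 20, 30)}
for sep, tpr in results.items():
    print(f"means differ by {sep:2d}: TPR at FPR 15% = {tpr:.1%}")
```

With no separation the TPR equals the FPR (the base line); as the separation grows, the TPR at the same FPR climbs towards 100%, so the ROC curve bows towards the top left-hand corner.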

79 AUC (Area Under the Curve) Clearly the area under the curve is another useful metric.
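A minimal sketch of estimating the area with the trapezoid rule, using the (FPR, TPR) pairs collected in the threshold walk-through, with the first TPR read as 0.5%. The starting corner (0, 0) is added, and the function name `trapezoid_auc` is an assumption.

```python
# (FPR, TPR) as fractions, one pair per threshold, plus the (0, 0) corner.
points = [(0.0, 0.0), (0.0, 0.005), (0.005, 0.025), (0.025, 0.15), (0.15, 0.50),
          (0.50, 0.85), (0.85, 0.975), (0.975, 0.995), (0.995, 1.0), (1.0, 1.0)]

def trapezoid_auc(points):
    """Area under a ROC curve given as (FPR, TPR) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2   # one trapezoid per segment
    return area

auc = trapezoid_auc(points)
base_line = trapezoid_auc([(0.0, 0.0), (1.0, 1.0)])
print(f"model AUC = {auc:.3f}, base line AUC = {base_line:.1f}")
```

The base line always has an area of 0.5, so an AUC is usually judged by how far it sits above that.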

80 Does the ROC curve tell us where to put the threshold? It helps a great deal (in other words, No.) Why not? Well, the efficiency of classification is not the only consideration. For example, what about the cost (to the business, to humans) of an FP and an FN?

81 Does it tell us where to put the threshold? In our example, assume that all people judged negative (infected) are caged together and those judged positive (healthy) are released. (Both of these are very bad options but we are in the middle of an apocalypse.) The former decision condemns the FN in the cage to living death, the latter condemns the entire population to the same.

82 Does it tell us where to put the threshold? Clearly I do not envisage a real zombie apocalypse (I wouldn't joke about it if I did) but even in very simple classifications, the impact of an FP can be very different from that of an FN. Take identifying a fraudulent transaction (positive): an FN may cost us the value of the transaction; an FP may cost us a valued customer.

83 Does it tell us where to put the threshold? Diagnosing a disease (positive). A false positive may cost a 5 retest. A false negative may kill the patient. So where do we set the threshold?
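One hedged way to approach the question is to attach a cost to each kind of error and pick the operating point with the lowest expected total cost. Everything numeric below is a made-up illustration: the candidate (FPR, TPR) points, the class sizes and the costs are assumptions, as are the function names.

```python
def expected_cost(fpr, tpr, p, n, cost_fp, cost_fn):
    """Total cost of the errors made at one ROC operating point."""
    fn = (1 - tpr) * p      # positives we miss
    fp = fpr * n            # negatives we wrongly flag
    return fp * cost_fp + fn * cost_fn

def cheapest(points, p, n, cost_fp, cost_fn):
    """Return the (FPR, TPR) point with the lowest expected cost."""
    return min(points, key=lambda pt: expected_cost(pt[0], pt[1], p, n, cost_fp, cost_fn))

# Hypothetical operating points from one model at five thresholds.
points = [(0.005, 0.025), (0.025, 0.15), (0.15, 0.50), (0.50, 0.85), (0.85, 0.975)]

# A cheap retest against a very costly miss.
miss_is_costly = cheapest(points, p=100, n=900, cost_fp=5, cost_fn=10_000)
# Equal costs for both kinds of error.
equal_costs = cheapest(points, p=100, n=900, cost_fp=1, cost_fn=1)
print(miss_is_costly, equal_costs)
```

The same ROC curve gives different answers for different costs, which is why the curve alone cannot tell us where to put the threshold.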

84 This has been merely an introduction. My advice would be not to forget those other metrics. Oh, but what about the origins?

Test                                               Formula
Accuracy (ACC)                                     (TP + TN)/Total
True Negative Rate, Specificity (SPC)              TN/N = TN/(FP + TN)
True Positive Rate (TPR), Sensitivity or Recall    TP/P = TP/(TP + FN)
False Positive Rate (FPR), Fall-out                FP/N = FP/(FP + TN)
False Negative Rate (FNR), Miss rate               FN/P = FN/(TP + FN) = 1 - TPR
Positive Predictive Value (PPV), Precision         TP/(TP + FP)
Negative Predictive Value (NPV)                    TN/(TN + FN)

85 Origins of ROC Curves Today the BBC has an article by Tim Harford on the origins of radar. ROC curves have their origins in Signal Detection Theory which is essentially about trying to distinguish between noise and not-noise (in other words, signal).

86 Rather confusingly the ROC was also a mythical bird. The ROC was VERY big but this illustration somewhat fails to show the scale of the problem. René Bull ( )

87 The ROC was VERY big

88 Now, suppose you wanted some warning that a ROC was coming. You'd probably invent radar.

89 (Figure: Count of Pixels against Light Intensity.)

90 To detect a signal, you simply look for anything to the right of the threshold. (Figure: Count of Pixels against Light Intensity, with "Not Blip" to the left of the threshold and "Blip" to the right.)

91 But in the early days of radar it wasn't like that. The image was much less clear.

92 But in the early days of radar it wasn't like that. The image was much less clear.

93 Does the question "Where are we going to put the threshold?" sound familiar? (Figure: Count of Pixels against Light Intensity.)

94 ROC curves Thank you for ROCing up, any questions?
