INTRODUCTION TO MACHINE LEARNING. Decision tree learning

Size: px

Start display at page:

Download "INTRODUCTION TO MACHINE LEARNING. Decision tree learning"

Ella Marshall
6 years ago
Views:

1 INTRODUCTION TO MACHINE LEARNING Decision tree learning

2 Task of classification Automatically assign class to observations with features Observation: vector of features, with a class Automatically assign class to new observation with features, using previous observations Binary classification: two classes Multiclass classification: more than two classes

3 Example A dataset consisting of persons Features: age, weight and income Class: binary: happy or not happy multiclass: happy, satisfied or not happy

4 Examples of features Features can be numerical age: 23, 25, 75, height: 175.3, 179.5, Features can be categorical travel_class: first class, business class, coach class smokes?: yes, no

5 The decision tree Suppose you re classifying patients as sick or not sick Intuitive way of classifying: ask questions Is the patient young or old?

6 The decision tree Suppose you re classifying patients as sick or not sick Intuitive way of classifying: ask questions Is the patient young or old? Old

7 The decision tree Suppose you re classifying patients as sick or not sick Intuitive way of classifying: ask questions Is the patient young or old? Old Smoked for more than 10 years?

8 The decision tree Suppose you re classifying patients as sick or not sick Intuitive way of classifying: ask questions Is the patient young or old? Young Old Vaccinated against the measles? Smoked for more than 10 years?

9 The decision tree Suppose you re classifying patients as sick or not sick Intuitive way of classifying: ask questions Is the patient young or old? Young Old Vaccinated against the measles? Smoked for more than 10 years? Yes No Yes No

10 The decision tree Suppose you re classifying patients as sick or not sick Intuitive way of classifying: ask questions Is the patient young or old? Young Old Vaccinated against the measles? Smoked for more than 10 years? Yes No Yes No It s a decision tree!!!

11 Define the tree A B C D E F G

12 Define the tree A Nodes B C D E F G

13 Define the tree A Edges B C D E F G

14 Define the tree Root A B C D E F G

15 Define the tree Root A B C D E F G Leafs

16 Define the tree Root A Children of A B C Children of B, C Grandchildren of A D E F G

17 Define the tree Root A Children of A B C Children of B, C Grandchildren of A D E F G Leafs

18 Questions to ask age <= 18 yes no vaccinated smoked yes no yes no not sick sick sick not sick

19 Categorical feature Can be a feature test on itself travel_class: coach, business or first travel_class coach first business

20 Classifying with the tree Observation: patient of 40 years, vaccinated and didn t smoke age <= 18 yes no vaccinated smoked yes no yes no not sick sick sick not sick

21 Classifying with the tree Observation: patient of 40 years, vaccinated and didn t smoke age <= 18 yes no vaccinated smoked yes no yes no not sick sick sick not sick

22 Classifying with the tree Observation: patient of 40 years, vaccinated and didn t smoke age <= 18 yes no vaccinated smoked yes no yes no not sick sick sick not sick

23 Classifying with the tree Observation: patient of 40 years, vaccinated and didn t smoke age <= 18 yes no vaccinated smoked yes no yes no not sick sick sick not sick

24 Classifying with the tree Observation: patient of 40 years, vaccinated and didn t smoke age <= 18 yes no vaccinated smoked yes no yes no not sick sick sick not sick

25 Classifying with the tree Observation: patient of 40 years, vaccinated and didn t smoke age <= 18 yes no vaccinated smoked yes no yes no not sick sick sick not sick Prediction: not sick

26 Learn a tree Use training set Come up with queries (feature tests) at each node

27 training set Split into parts 2 parts for binary test age <= 18 yes no part of training set part of training set TRUE FALSE

28 part of training set part of training set feature test feature test yes no yes no part of training set part of training set part of training set part of training set

29 part of training set part of training set part of training set part of training set keep splitting until leafs contain small portion of training set

30 Learn the tree Goal: end up with pure leafs leafs that contain observations of one particular class leaf part of training set class 1 class class 2

31 Learn the tree Goal: end up with pure leafs leafs that contain observations of one particular class In practice: almost never the case noise leaf When classifying new instances end up in leaf part of training set class 1 class class 2

32 Learn the tree Goal: end up with pure leafs leafs that contain observations of one particular class In practice: almost never the case noise leaf When classifying new instances end up in leaf part of training set class 1 class 2 assign class of majority of training instances

33 Learn the tree At each node Iterate over different feature tests Choose the best one Comes down to two parts Make list of feature tests Choose test with best split

34 Construct list of tests Categorical features Parents/grandparents/ didn t use the test yet Numerical features Choose feature Choose threshold

35 Choose best feature test More complex Use splitting criteria to decide which test to use Information gain ~ entropy

36 Information gain Information gained from split based on feature test Test leads to nicely divided classes -> high information gain Test leads to scrambled classes -> low information gain Test with highest information gain will be chosen

37 Pruning Number of nodes influences chance on overfit Restrict size higher bias Decrease chance on overfit Pruning the tree

38 INTRODUCTION TO MACHINE LEARNING Let s practice!

39 INTRODUCTION TO MACHINE LEARNING k-nearest Neighbors

40 Instance-based learning Save training set in memory No real model like decision tree Compare unseen instances to training set Predict using the comparison of unseen data and the training set

41 k-nearest Neighbor Form of instance-based learning Simplest form: 1-Nearest Neighbor or Nearest Neighbor

42 Nearest Neighbor - example 2 features: X1 and X2 Class: red or blue Binary classification

43 Nearest Neighbor - example Introduction to Machine Learning

44 Nearest Neighbor - example Save complete training set

45 Nearest Neighbor - example Save complete training set Given: unseen observation with features X = (1.3, -2)

46 Nearest Neighbor - example Save complete training set Given: unseen observation with features X = (1.3, -2) Compare training set with new observation

47 Nearest Neighbor - example Save complete training set Given: unseen observation with features X = (1.3, -2) Compare training set with new observation Find closest observation nearest neighbor and assign same class just Euclidean distance, nothing fancy

48 k-nearest Neighbors k is the amount of neighbors If k = 5 Use 5 most similar observations (neighbors) Assigned class will be the most represented class within the 5 neighbors

49 Distance metric Important aspect of k-nn

50 Distance metric Important aspect of k-nn Euclidian distance:

51 Distance metric Important aspect of k-nn Euclidian distance: Manhattan distance:

52 Scaling - example Dataset with 2 features: weight and height 3 observations height (m) weight (kg) distance: 0.5 distance: 0.13

53 Scaling - example Dataset with 2 features: weight and height 3 observations Scale influences distance! height (cm) weight (kg) distance: 0.5 distance: 13

54 Scaling Normalize all features e.g. rescale values between 0 and 1 Gives better measure of real distance Don t forget to scale new observations

55 Categorical features How to use in distance metric? Dummy variables 1 categorical features with N possible outcomes to N binary features (2 outcomes)

56 Dummy variables Example mother tongue: Spanish, Italian or French mother_tongue Spanish Italian Italian Spanish French French French spanish italian french

57 INTRODUCTION TO MACHINE LEARNING Let s practice!

58 INTRODUCTION TO MACHINE LEARNING Introducing: The ROC curve

59 Introducing Very powerful performance measure For binary classification Reiceiver Operator Characteristic Curve (ROC Curve)

60 Probabilities as output Used decision trees and k-nn to predict class They can also output probability that instance belongs to class

61 Probabilities as output - example Binary classification Decide whether patient is sick or not sick decision function! Define probability threshold from which you decide patient to be sick Avoid sending sick patient home: lower threshold to 30% Decision tree: New patient: 70% 30% More patients classified as higher than 50% classify as More but also patients classified as

62 Confusion matrix Other performance measure for classification Important to construct the ROC curve

63 Confusion matrix Binary classifier: positive or negative (1 or 0) Prediction P N Truth p TP FN n FP TN

64 Confusion matrix Binary classifier: positive or negative (1 or 0) Prediction True Positives Prediction: P Truth: P Truth P N p TP FN n FP TN

65 Confusion matrix Binary classifier: positive or negative (1 or 0) Prediction False Negatives Prediction: N Truth: P Truth P N p TP FN n FP TN

66 Confusion matrix Binary classifier: positive or negative (1 or 0) Prediction False Positives Prediction: P Truth: N Truth P N p TP FN n FP TN

67 Confusion matrix Binary classifier: positive or negative (1 or 0) Prediction True Negatives Prediction: N Truth: N Truth P N p TP FN n FP TN

68 Ratios in the confusion matrix True positive rate (TPR) = recall False positive rate (FPR) Prediction TPR TP/(TP+FN) P N Truth p TP FN n FP TN

69 Ratios in the confusion matrix True positive rate (TPR) = recall False positive rate (FPR) Prediction Truly TPR TP/(TP+FN) P N p TP FN + Truth n FP TN Truly Falsely

70 Ratios in the confusion matrix True positive rate (TPR) = recall False positive rate (FPR) Prediction FPR FP/(FP+TN) P N Truth p TP FN n FP TN

71 Ratios in the confusion matrix True positive rate (TPR) = recall False positive rate (FPR) Prediction Falsely FPR FP/(FP+TN) P N p TP FN + Truth n FP TN Falsely Truly

72 ROC curve Horizontal axis: FPR Vertical axis: TPR How to draw the curve? True positive rate False positive rate

73 Draw the curve Need classifier which outputs probabilities The decision function probability decide to diagnose threshold by decision function probability

74 Draw the curve Need classifier which outputs probabilities The decision function probability decide to diagnose threshold by decision function probability

75 True positive rate >=50%: sick < 50%: healthy False positive rate 50% probability

76 True positive rate all sick False positive rate 0% probability

77 True positive rate all healthy False positive rate 100% probability

78 Interpreting the curve Is it a good curve? Closer to left upper corner = better Good classifiers have big area under the curve True positive rate False positive rate

79 > 0.9 = very good True positive rate AUC = False positive rate Area under the curve (AUC)

80 INTRODUCTION TO MACHINE LEARNING Let s practice!

An Introduction to ROC curves. Mark Whitehorn. Mark Whitehorn

An Introduction to ROC curves. Mark Whitehorn. Mark Whitehorn An Introduction to ROC curves Mark Whitehorn Mark Whitehorn It s all about me Prof. Mark Whitehorn Emeritus Professor of Analytics Computing University of Dundee Consultant Writer (author) m.a.f.whitehorn@dundee.ac.uk