INTRODUCTION TO MACHINE LEARNING Decision tree learning

Task of classification Automatically assign a class to a new observation, described by a vector of features, using previously seen observations whose classes are known. Binary classification: two classes. Multiclass classification: more than two classes.

Example A dataset consisting of persons. Features: age, weight and income. Class: binary (happy or not happy) or multiclass (happy, satisfied or not happy).

Examples of features Features can be numerical, e.g. age: 23, 25, 75, ... or height: 175.3, 179.5, ... Features can be categorical, e.g. travel_class: first class, business class, coach class, or smokes?: yes, no.

The decision tree Suppose you're classifying patients as sick or not sick. An intuitive way of classifying is to ask questions: Is the patient young or old? If old: smoked for more than 10 years? If young: vaccinated against the measles? Each answer (yes or no) leads to the next question or to a class. It's a decision tree!

Define the tree A is the root node. B and C are the children of A; they are connected to A by edges. D, E, F and G are the children of B and C (and grandchildren of A). Since they have no children of their own, D, E, F and G are the leafs of the tree.
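
As a side note, here is a minimal sketch of how such a tree could be represented in Python; this is an illustrative assumption, not code from the course:

    # Minimal tree representation: internal nodes hold a feature test, leafs hold a class label.
    class Node:
        def __init__(self, test=None, yes=None, no=None, label=None):
            self.test = test    # function(observation) -> bool, for internal nodes
            self.yes = yes      # child followed when the test is True
            self.no = no        # child followed when the test is False
            self.label = label  # class label, for leaf nodes only

        def is_leaf(self):
            return self.label is not None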

Questions to ask age <= 18? If yes (young): vaccinated? yes -> not sick, no -> sick. If no (old): smoked? yes -> sick, no -> not sick.

Categorical feature A categorical feature can be a feature test in itself: travel_class (coach, business or first) splits a node into three branches, one per value.

Classifying with the tree Observation: a patient of 40 years, vaccinated and who didn't smoke. Follow the tree from the root: age <= 18? No, so ask: smoked? No, so the leaf says not sick. Prediction: not sick.
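
A small sketch of walking that tree for this observation, with the tests hard-coded as an illustrative assumption (not course code):

    # The tree from the slides, written out as nested feature tests (illustrative).
    def classify(patient):
        if patient["age"] <= 18:
            return "not sick" if patient["vaccinated"] else "sick"
        else:
            return "sick" if patient["smoked"] else "not sick"

    # The observation from the slide: 40 years old, vaccinated, didn't smoke.
    print(classify({"age": 40, "vaccinated": True, "smoked": False}))  # -> not sick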

Learn a tree Use training set Come up with queries (feature tests) at each node

Start from the full training set and split it into parts: a binary test such as age <= 18 splits it into two parts, one where the test is TRUE and one where it is FALSE. Each part is then split again by its own feature test, giving smaller and smaller parts of the training set. Keep splitting until the leafs contain only a small portion of the training set.
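
A brief sketch of one such binary split, under the assumption that observations are stored as Python dicts (illustrative, not course code):

    # Split a set of observations into the part where the test holds and the part where it doesn't.
    def split(observations, test):
        true_part = [obs for obs in observations if test(obs)]
        false_part = [obs for obs in observations if not test(obs)]
        return true_part, false_part

    patients = [{"age": 12, "class": "not sick"}, {"age": 40, "class": "sick"}]
    young, old = split(patients, lambda obs: obs["age"] <= 18)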

Learn the tree Goal: end up with pure leafs, i.e. leafs that contain observations of one particular class. In practice this is almost never the case because of noise: a leaf holds a part of the training set with mostly class 1 but also some class 2. When a new instance to classify ends up in such a leaf, assign it the class of the majority of the training instances in that leaf.
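
That majority vote is a one-liner; a tiny illustration (an assumption, not course code):

    from collections import Counter

    # Classes of the training instances that ended up in an impure leaf (example data).
    leaf_classes = ["class 1", "class 1", "class 2", "class 1"]
    prediction = Counter(leaf_classes).most_common(1)[0][0]  # -> "class 1"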

Learn the tree At each node, iterate over different feature tests and choose the best one. This comes down to two parts: make a list of candidate feature tests, then choose the test with the best split.

Construct list of tests Categorical features: include a test only if the parents/grandparents/... of the node didn't use it yet. Numerical features: choose a feature and choose a threshold.

Choose best feature test This is more complex: use a splitting criterion to decide which test to use, for example information gain, which is based on entropy.

Information gain Information gained from split based on feature test Test leads to nicely divided classes -> high information gain Test leads to scrambled classes -> low information gain Test with highest information gain will be chosen
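
A sketch of how information gain could be computed from class labels, assuming the standard entropy-based definition (the slides don't spell out the formula):

    import math
    from collections import Counter

    def entropy(labels):
        # Shannon entropy of the class distribution in a list of labels.
        counts = Counter(labels)
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def information_gain(parent_labels, split_parts):
        # Entropy of the parent minus the weighted entropy of the parts after the split.
        total = len(parent_labels)
        weighted = sum(len(part) / total * entropy(part) for part in split_parts)
        return entropy(parent_labels) - weighted

    parent = ["sick", "sick", "not sick", "not sick"]
    print(information_gain(parent, [["sick", "sick"], ["not sick", "not sick"]]))  # -> 1.0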

Pruning The number of nodes influences the chance of overfitting. Restricting the size of the tree gives a higher bias but decreases the chance of overfitting: this is pruning the tree.

INTRODUCTION TO MACHINE LEARNING Let s practice!

INTRODUCTION TO MACHINE LEARNING k-nearest Neighbors

Instance-based learning Save the training set in memory; no real model is built, unlike a decision tree. Unseen instances are compared to the training set, and the prediction is based on that comparison.

k-nearest Neighbor Form of instance-based learning Simplest form: 1-Nearest Neighbor or Nearest Neighbor

Nearest Neighbor - example 2 features (X1 and X2), class red or blue: binary classification. Save the complete training set. Given an unseen observation with features X = (1.3, -2), compare the training set with the new observation, find the closest observation (the nearest neighbor) and assign it the same class. The comparison is just Euclidean distance, nothing fancy.

k-nearest Neighbors k is the amount of neighbors If k = 5 Use 5 most similar observations (neighbors) Assigned class will be the most represented class within the 5 neighbors

Distance metric An important aspect of k-NN. Euclidean distance: d(x, y) = sqrt((x1 - y1)^2 + ... + (xn - yn)^2). Manhattan distance: d(x, y) = |x1 - y1| + ... + |xn - yn|.

Scaling - example Dataset with 2 features, height and weight, and 3 observations.

With height in meters:
observation 1: height 1.83, weight 80
observation 2: height 1.83, weight 80.5
observation 3: height 1.70, weight 80
distance(1, 2) = 0.5 and distance(1, 3) = 0.13, so observation 3 is the closest to observation 1.

With height in centimeters:
observation 1: height 183, weight 80
observation 2: height 183, weight 80.5
observation 3: height 170, weight 80
distance(1, 2) = 0.5 and distance(1, 3) = 13, so now observation 2 is the closest. Scale influences distance!

Scaling Normalize all features, e.g. rescale values to between 0 and 1. This gives a better measure of the real distance. Don't forget to scale new observations as well.
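
A minimal sketch of such min-max rescaling, as an illustrative assumption (not course code):

    # Rescale a list of feature values to the [0, 1] range.
    def rescale(values):
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]

    heights = [183, 183, 170]
    print(rescale(heights))  # -> [1.0, 1.0, 0.0]
    # New observations must be rescaled with the same lo and hi, taken from the training set.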

Categorical features How to use them in a distance metric? Dummy variables: turn 1 categorical feature with N possible outcomes into N binary features (2 outcomes each).

Dummy variables Example: mother tongue is Spanish, Italian or French, giving three binary features.

mother_tongue  spanish  italian  french
Spanish        1        0        0
Italian        0        1        0
Italian        0        1        0
Spanish        1        0        0
French         0        0        1
French         0        0        1
French         0        0        1
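
A small sketch of building such dummy variables by hand; an illustrative assumption, not course code:

    # Turn one categorical feature into N binary dummy features.
    def to_dummies(values):
        categories = sorted(set(values))
        return [[1 if value == cat else 0 for cat in categories] for value in values]

    tongues = ["Spanish", "Italian", "Italian", "Spanish", "French", "French", "French"]
    print(to_dummies(tongues))  # columns in alphabetical order: French, Italian, Spanish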

INTRODUCTION TO MACHINE LEARNING Let s practice!

INTRODUCTION TO MACHINE LEARNING Introducing: The ROC curve

Introducing A very powerful performance measure for binary classification: the Receiver Operating Characteristic curve (ROC curve).

Probabilities as output We used decision trees and k-NN to predict a class; they can also output the probability that an instance belongs to a class.

Probabilities as output - example Binary classification: decide whether a patient is sick or not sick. The decision function: define a probability threshold above which you decide the patient is sick. A decision tree might, for instance, give a new patient a 70% probability of being sick and 30% of not being sick. By default, probabilities higher than 50% are classified as sick. To avoid sending a sick patient home, lower the threshold to 30%: more sick patients are classified as sick, but also more healthy patients are classified as sick.
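
A tiny illustration of such a decision function applied to predicted probabilities (an assumption for illustration, not course code):

    # Classify patients as sick when their predicted probability reaches the threshold.
    def diagnose(probabilities, threshold=0.5):
        return ["sick" if p >= threshold else "not sick" for p in probabilities]

    probs = [0.70, 0.40, 0.20]
    print(diagnose(probs, threshold=0.5))  # -> ['sick', 'not sick', 'not sick']
    print(diagnose(probs, threshold=0.3))  # -> ['sick', 'sick', 'not sick']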

Confusion matrix Another performance measure for classification, and important for constructing the ROC curve.

Confusion matrix Binary classifier: positive or negative (1 or 0).

           Prediction P   Prediction N
Truth p    TP             FN
Truth n    FP             TN

True Positives (TP): prediction P, truth p. False Negatives (FN): prediction N, truth p. False Positives (FP): prediction P, truth n. True Negatives (TN): prediction N, truth n.

Ratios in the confusion matrix True positive rate (TPR), also called recall: TPR = TP / (TP + FN), the fraction of truly positive instances that are also predicted positive. False positive rate (FPR): FPR = FP / (FP + TN), the fraction of truly negative instances that are falsely predicted positive.
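
A short sketch of computing the confusion-matrix counts and these two ratios; the data are illustrative assumptions, not course code:

    # Count TP, FN, FP, TN from 0/1 truths and predictions, then derive TPR and FPR.
    def confusion_counts(truths, predictions):
        tp = sum(1 for t, p in zip(truths, predictions) if t == 1 and p == 1)
        fn = sum(1 for t, p in zip(truths, predictions) if t == 1 and p == 0)
        fp = sum(1 for t, p in zip(truths, predictions) if t == 0 and p == 1)
        tn = sum(1 for t, p in zip(truths, predictions) if t == 0 and p == 0)
        return tp, fn, fp, tn

    truths      = [1, 1, 0, 0, 1, 0]
    predictions = [1, 0, 1, 0, 1, 0]
    tp, fn, fp, tn = confusion_counts(truths, predictions)
    tpr = tp / (tp + fn)  # recall
    fpr = fp / (fp + tn)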

ROC curve Horizontal axis: FPR (false positive rate). Vertical axis: TPR (true positive rate). How to draw the curve?

Draw the curve You need a classifier that outputs probabilities. The decision function turns a probability into a diagnosis by comparing it with a threshold; each threshold gives one point (FPR, TPR) on the curve.

With a 50% threshold (probability >= 50%: sick, < 50%: healthy) you get one point on the curve. With a 0% threshold every patient is classified as sick, giving TPR = FPR = 1 (the top-right corner of the plot). With a 100% threshold every patient is classified as healthy, giving TPR = FPR = 0 (the bottom-left corner). Sweeping the threshold between these extremes traces out the ROC curve.
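
A rough sketch of that threshold sweep; data and names are illustrative assumptions, not course code:

    # Sweep the probability threshold and collect (FPR, TPR) points of the ROC curve.
    def roc_points(truths, probabilities, thresholds):
        points = []
        for thr in thresholds:
            preds = [1 if p >= thr else 0 for p in probabilities]
            tp = sum(1 for t, q in zip(truths, preds) if t == 1 and q == 1)
            fn = sum(1 for t, q in zip(truths, preds) if t == 1 and q == 0)
            fp = sum(1 for t, q in zip(truths, preds) if t == 0 and q == 1)
            tn = sum(1 for t, q in zip(truths, preds) if t == 0 and q == 0)
            points.append((fp / (fp + tn), tp / (tp + fn)))
        return points

    truths = [1, 1, 0, 1, 0, 0]
    probs  = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
    print(roc_points(truths, probs, thresholds=[1.0, 0.8, 0.5, 0.2, 0.0]))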

Interpreting the curve Is it a good curve? The closer it comes to the upper-left corner, the better. Good classifiers have a big area under the curve.

Area under the curve (AUC) An AUC greater than 0.9 is very good; the example curve shown in the course has AUC = 0.905.
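
The AUC can be approximated from the (FPR, TPR) points with the trapezoidal rule; a minimal self-contained sketch (an assumption, not course code):

    # Approximate the area under a ROC curve given its (FPR, TPR) points.
    def auc(points):
        pts = sorted(points)  # sort by FPR
        return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

    example_curve = [(0.0, 0.0), (0.0, 0.5), (0.33, 0.75), (0.67, 1.0), (1.0, 1.0)]
    print(auc(example_curve))  # the closer to 1.0, the better the classifier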

INTRODUCTION TO MACHINE LEARNING Let s practice!