Machine learning II. Juhan Ernits ITI8600


Example 1: Handwritten digit recognition

Example 2: Face recognition. Classification, regression or unsupervised? How many classes?

Example 2: Face recognition. Classification, regression or unsupervised? Classification. How many classes? 3: no face, front view, profile view.

A classifier is learnt from labelled data.

Example 3: Spam detection. A classification problem: the goal is to distinguish spam from non-spam. The data x_i are word/phrase counts, e.g. "you won a $1M prize in lottery" is more suspicious than "you got a 5 in ITI8600". Spammers constantly modify their kit, so detection has to keep up, i.e. the classifier needs to be retrained over time (see the sketch below).
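To make the feature representation concrete, here is a minimal sketch (illustration only, not from the slides) of turning a message into the word-count vector x_i; the vocabulary is an assumed example:

```python
# Minimal sketch (illustration only, not from the slides): a message becomes a
# word-count feature vector x_i over a hypothetical vocabulary.
from collections import Counter

def word_counts(message, vocabulary):
    counts = Counter(message.lower().split())
    return [counts[w] for w in vocabulary]

vocab = ["won", "prize", "lottery", "iti8600"]                 # assumed vocabulary
print(word_counts("You won a $1M prize in lottery", vocab))    # [1, 1, 1, 0]
print(word_counts("You got a 5 in ITI8600", vocab))            # [0, 0, 0, 1]
```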

Example 4: Stock price prediction

Example 5: Computational biology

Example 6: Machine translation

Three canonical learning problems:

The main steps of evaluation

What these steps depend on
These steps depend on the purpose of the evaluation:
- Comparison of a new algorithm to other (generic or application-specific) classifiers on a specific domain (e.g., when proposing a novel learning algorithm)
- Comparison of a new generic algorithm to other generic ones on a set of benchmark domains
- Characterization of generic classifiers on benchmark domains
- Comparison of multiple classifiers on a specific domain (e.g. to find the best algorithm for a given application task)

Which Classifier is better? Almost as many answers as there are performance measures! (e.g., UCI Breast Cancer)

Algo    Acc     RMSE    TPR   FPR   Prec   Rec   F     AUC   Info S
NB      71.7    .4534   .44   .16   .53    .44   .48   .7    48.11
C4.5    75.5    .4324   .27   .04   .74    .27   .4    .59   34.28
3NN     72.4    .5101   .32   .1    .56    .32   .41   .63   43.37
Ripp    71      .4494   .37   .14   .52    .37   .43   .6    22.34
SVM     69.6    .5515   .33   .15   .48    .33   .39   .59   54.89
Bagg    67.8    .4518   .17   .1    .4     .17   .23   .63   11.30
Boost   70.3    .4329   .42   .18   .5     .42   .46   .7    34.48
RanF    69.23   .47     .33   .15   .48    .33   .39   .63   20.78

Which Classifier is better? Ranking the results

Algo    Acc   RMSE   TPR   FPR   Prec   Rec   F   AUC   Info S
NB      3     5      1     7     3      1     1   1     2
C4.5    1     1      7     1     1      7     5   7     5
3NN     2     7      6     2     2      6     4   3     3
Ripp    4     3      3     4     4      3     3   6     6
SVM     6     8      4     5     5      4     6   7     1
Bagg    8     4      8     2     8      8     8   3     8
Boost   5     2      2     8     7      2     2   1     4
RanF    7     6      4     5     5      4     7   3     7
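For illustration, the sketch below shows how such a per-measure ranking can be derived from the raw scores table. It assumes higher is better for every measure except RMSE and FPR (lower is better) and uses competition ranking for ties; the slide does not state these conventions, so treat them as assumptions.

```python
# Hypothetical sketch: derive a per-measure ranking like the table above from
# the raw scores. Assumption: higher is better for every measure except RMSE
# and FPR, where lower is better; ties share the best rank (competition ranking).

scores = {
    #        Acc    RMSE   TPR  FPR  Prec Rec  F    AUC  InfoS
    "NB":    [71.7, .4534, .44, .16, .53, .44, .48, .70, 48.11],
    "C4.5":  [75.5, .4324, .27, .04, .74, .27, .40, .59, 34.28],
    "3NN":   [72.4, .5101, .32, .10, .56, .32, .41, .63, 43.37],
    "Ripp":  [71.0, .4494, .37, .14, .52, .37, .43, .60, 22.34],
    "SVM":   [69.6, .5515, .33, .15, .48, .33, .39, .59, 54.89],
    "Bagg":  [67.8, .4518, .17, .10, .40, .17, .23, .63, 11.30],
    "Boost": [70.3, .4329, .42, .18, .50, .42, .46, .70, 34.48],
    "RanF":  [69.23, .47,  .33, .15, .48, .33, .39, .63, 20.78],
}
measures = ["Acc", "RMSE", "TPR", "FPR", "Prec", "Rec", "F", "AUC", "InfoS"]
lower_is_better = {"RMSE", "FPR"}

def ranks(values, lower_better):
    """Competition ranking: 1 + number of strictly better values."""
    def better(a, b):
        return a < b if lower_better else a > b
    return [1 + sum(better(other, v) for other in values) for v in values]

algos = list(scores)
for j, m in enumerate(measures):
    column = [scores[a][j] for a in algos]
    print(m, dict(zip(algos, ranks(column, m in lower_is_better))))
```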

What should we make of that? Well, for certain pairs of measures this makes sense, since each measure focuses on a different aspect of learning. For example, the TPR and the FPR are quite different, and often good results on one yield bad results on the other. Precision and Recall also seem to trade off against each other. How about the global measures (Acc, RMSE, the F-measure, AUC, the Information Score)? They too disagree, as they each measure different aspects of learning (aspects that are harder to pinpoint, since these are composite measures).

Overview of Performance Measures

A Few Confusion Matrix-Based Performance Measures

A confusion matrix:

                          True class
                          Pos        Neg
Hypothesized   Yes        TP         FP
class          No         FN         TN
                          P = TP+FN  N = FP+TN

Accuracy  = (TP+TN)/(P+N)
Precision = TP/(TP+FP)
Recall (TP rate) = TP/(TP+FN) = TP/P
FP rate   = FP/N

ROC analysis moves the threshold between the positive and negative class from a small FP rate to a large one. It plots the value of the Recall against that of the FP rate at each FP rate considered.
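A minimal sketch of the measures above, computed from four hypothetical counts (illustration, not part of the original slides):

```python
# Minimal sketch (illustration, not from the slides): the confusion-matrix
# measures listed above, computed from four hypothetical counts.

def confusion_measures(tp, fp, fn, tn):
    p = tp + fn                      # actual positives
    n = fp + tn                      # actual negatives
    return {
        "accuracy":  (tp + tn) / (p + n),
        "precision": tp / (tp + fp),
        "recall":    tp / p,         # TP rate = TP/(TP+FN)
        "fp_rate":   fp / n,
    }

# Example counts: the first matrix of the next slide ("Issues with Accuracy")
print(confusion_measures(tp=200, fp=100, fn=300, tn=400))
# {'accuracy': 0.6, 'precision': 0.666..., 'recall': 0.4, 'fp_rate': 0.2}
```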

Issues with Accuracy

Classifier 1:
        True class
        Pos    Neg
Yes     200    100
No      300    400
        P=500  N=500

Classifier 2:
        True class
        Pos    Neg
Yes     400    300
No      100    200
        P=500  N=500

Both classifiers obtain 60% accuracy, yet they exhibit very different behaviours:
- Classifier 1: weak positive recognition rate / strong negative recognition rate
- Classifier 2: strong positive recognition rate / weak negative recognition rate

Issues with Precision/Recall

Classifier 1:
        True class
        Pos    Neg
Yes     200    100
No      300    400
        P=500  N=500

Classifier 2:
        True class
        Pos    Neg
Yes     200    100
No      300    0
        P=500  N=100

Both classifiers obtain the same precision (66.7%) and recall (40%) (note: the data sets are different). They exhibit very different behaviours:
- Same positive recognition rate
- Extremely different negative recognition rates: strong for Classifier 1 / nil for Classifier 2
Note: accuracy has no problem catching this!
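The same blind spot can be checked directly; this sketch recomputes the measures for the two matrices above and shows that precision and recall coincide while accuracy and the true-negative rate differ:

```python
# Sketch using the two matrices above: precision and recall are identical,
# while accuracy and the true-negative rate expose the difference.

def rates(tp, fp, fn, tn):
    return {
        "precision": tp / (tp + fp),
        "recall":    tp / (tp + fn),
        "accuracy":  (tp + tn) / (tp + fp + fn + tn),
        "tnr":       tn / (tn + fp),      # negative recognition rate
    }

clf1 = rates(tp=200, fp=100, fn=300, tn=400)   # Classifier 1
clf2 = rates(tp=200, fp=100, fn=300, tn=0)     # Classifier 2
print(clf1)   # precision 0.667, recall 0.4, accuracy 0.60, tnr 0.8
print(clf2)   # precision 0.667, recall 0.4, accuracy 0.33, tnr 0.0
```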

Is the AUC the answer? Many researchers have now adopted the AUC (the area under the ROC curve). The principal advantage of the AUC is that it is more robust than accuracy in class-imbalanced situations. Indeed, given a 95% imbalance (in favour of the negative class, say), the accuracy of the default classifier that issues "negative" all the time will be 95%, whereas a more interesting classifier that actually deals with the issue is likely to obtain a worse score. The AUC takes the class distribution into consideration.
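For reference, here is a minimal sketch (not from the slides) of computing the AUC without drawing the ROC curve, via its rank-based (Mann-Whitney) interpretation: the probability that a randomly drawn positive is scored above a randomly drawn negative. The toy scores reuse the probabilistic outputs from the RMSE example further below; everything else is an assumption.

```python
# Sketch (not from the slides): the AUC via its rank-based (Mann-Whitney)
# interpretation, i.e. the probability that a randomly drawn positive gets a
# higher score than a randomly drawn negative; ties count for half.

def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (the probabilistic outputs used in the RMSE example below)
print(auc([.95, .6, .8, .75, .9], [1, 0, 1, 0, 1]))
# -> 1.0: every positive is scored above every negative here
```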

Is the AUC the Answer? (cont'd) While the AUC has been generally adopted as a replacement for accuracy, it has met with a couple of criticisms:
- The ROC curves on which the AUCs of different classifiers are based may cross, so the AUC does not give an accurate picture of what is really happening.
- The misclassification cost distributions (and hence the skew-ratio distributions) used by the AUC are different for different classifiers. Therefore, we may be comparing apples and oranges, as the AUC may give more weight to misclassifying a point by classifier A than by classifier B (Hand, 2009).
Answer: the H-measure, but it has been criticized too!

Cost Curves Cost curves are more practical than ROC curves because they tell us for what class probabilities one classifier is preferable over the other. ROC curves only tell us that sometimes one classifier is preferable over the other.
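A simplified sketch of that idea, assuming equal misclassification costs so that each classifier's expected error is a straight line in the positive-class probability p(+); the FPR/TPR values are made up for illustration:

```python
# Simplified sketch of the cost-curve idea, assuming equal misclassification
# costs: each classifier's expected error is FNR*p(+) + FPR*(1-p(+)), a straight
# line in p(+), so the crossover probability can be solved for directly.
# The FPR/TPR values below are made up for illustration.

a = {"fpr": 0.10, "tpr": 0.60}   # conservative: few false alarms, misses more
b = {"fpr": 0.30, "tpr": 0.90}   # liberal: catches more positives, more alarms

def expected_error(clf, p_pos):
    fnr = 1.0 - clf["tpr"]
    return fnr * p_pos + clf["fpr"] * (1.0 - p_pos)

# Crossover: solve expected_error(a, p) == expected_error(b, p) for p
p_cross = (b["fpr"] - a["fpr"]) / ((b["tpr"] - a["tpr"]) + (b["fpr"] - a["fpr"]))
print(p_cross)                                           # 0.4 for these numbers
print(expected_error(a, 0.6), expected_error(b, 0.6))    # 0.28 vs 0.18: b wins
```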

RMSE

The Root Mean Squared Error (RMSE) is usually used for regression, but can also be used with probabilistic classifiers. The formula for the RMSE is:

RMSE(f) = sqrt( (1/m) * Σ_{i=1..m} (f(x_i) - y_i)^2 )

where m is the number of test examples, f(x_i) is the classifier's probabilistic output on x_i, and y_i is the actual label.

ID    f(x_i)   y_i   (f(x_i) - y_i)^2
1     .95      1     .0025
2     .6       0     .36
3     .8       1     .04
4     .75      0     .5625
5     .9       1     .01

RMSE(f) = sqrt((1/5) * (.0025 + .36 + .04 + .5625 + .01)) = sqrt(0.975/5) = 0.4416
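A short sketch reproducing the worked RMSE example above (values taken directly from the table):

```python
# Sketch: RMSE of a probabilistic classifier, reproducing the worked example
# above (values taken from the table).
from math import sqrt

def rmse(preds, labels):
    m = len(preds)
    return sqrt(sum((f - y) ** 2 for f, y in zip(preds, labels)) / m)

preds  = [.95, .6, .8, .75, .9]
labels = [1, 0, 1, 0, 1]
print(round(rmse(preds, labels), 4))   # 0.4416
```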

Information Score

Kononenko and Bratko's Information Score assumes a prior P(y) on the labels; this can be estimated from the class distribution of the training data. The output (posterior probability) of a probabilistic classifier is P(y|f), where f is the classifier. I(a) is the indicator function, and all logarithms are base 2.

IS(x) = I(P(y|f) ≥ P(y)) * (-log P(y) + log P(y|f))
      + I(P(y|f) < P(y)) * (-log(1-P(y)) + log(1-P(y|f)))

IS_avg = (1/m) * Σ_{i=1..m} IS(x_i)

x    P(y_i|f)   y_i   IS(x)
1    .95        1     0.66
2    .6         0     0
3    .8         1     0.42
4    .75        0     0.32
5    .9         1     0.59

P(y=1) = 3/5 = 0.6,  P(y=0) = 2/5 = 0.4

IS(x_1) = 1 * (-log(.6) + log(.95)) + 0 * (-log(.4) + log(.05)) = 0.66
IS_avg = (1/5) * (0.66 + 0 + 0.42 + 0.32 + 0.59) = 0.40
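A sketch of the per-instance Information Score as written above, with base-2 logarithms. It reproduces the positive-class rows of the table; how the slide derives the prior for the negative-class rows is not fully explicit in the transcription, so those are left out here.

```python
# Sketch of the per-instance Information Score above (base-2 logs). Reproduces
# the positive-class rows of the table; the slide's handling of the prior for
# the negative-class rows is not explicit in the transcription, so those are
# omitted here.
from math import log2

def information_score(prior, posterior):
    """prior = P(y), posterior = P(y|f), both for the instance's true class."""
    if posterior >= prior:
        return -log2(prior) + log2(posterior)
    return -log2(1 - prior) + log2(1 - posterior)

for post in (.95, .8, .9):            # positive examples, prior P(y=1) = 0.6
    print(round(information_score(0.6, post), 2))
    # 0.66, 0.42, 0.58 (the slide rounds the last value to 0.59)
```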

Slide credits:
A. Zisserman: http://www.robots.ox.ac.uk/~az/lectures/ml/lect1.pdf
Nathalie Japkowicz: http://www.site.uottawa.ca/~nat/talks/icmla-Tutorial.pptx