Learning Decision Trees Using the Area Under the ROC Curve


Learning Decision Trees Using the Area Under the ROC Curve. Cèsar Ferri (1), Peter Flach (2), José Hernández-Orallo (1). (1) Dep. de Sist. Informàtics i Computació, Universitat Politècnica de València, Spain. (2) Department of Computer Science, University of Bristol, UK. The 19th International Conference on Machine Learning (ICML 2002), Sydney, Australia, 8-12 July 2002.

Evaluating classifiers. Accuracy/error is not a good evaluation measure of the quality of classifiers when: the proportion of examples of one class is much greater than that of the other class(es), so that a trivial classifier that always predicts the majority class may come out on top; or not every misclassification has the same consequences (cost matrices), so that the most accurate classifier may not be the one that minimises costs. Conclusion: accuracy is only a good measure if the class distribution of the evaluation dataset is meaningful and the cost matrix is uniform.

Evaluating classifiers. Problem: we usually don't know a priori the proportion of examples of each class at application time, nor the cost matrix. ROC analysis can be applied in these situations. It provides tools to: distinguish classifiers that can be discarded under any circumstance (class distribution or cost matrix), and select the optimal classifier once the cost matrix is known.

Evaluating classifiers. ROC Analysis. Given a confusion matrix (predicted class in rows, real class in columns):

              Real Yes   Real No
  Pred. Yes      30         20
  Pred. No       10         40

we can normalise each column: the real-Yes column gives 0.75 / 0.25 and the real-No column gives 0.33 / 0.67, i.e. a true positive rate TPR = 0.75 and a false positive rate FPR = 0.33. The classifier is plotted as the point (FPR, TPR) in the ROC diagram.
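As a minimal sketch (in Python, not part of the talk), the ROC point of a classifier can be read off its confusion matrix exactly as on this slide; the helper name roc_point is mine:

```python
def roc_point(tp, fn, fp, tn):
    """Return (FPR, TPR) from confusion-matrix counts."""
    tpr = tp / (tp + fn)   # fraction of real positives predicted positive
    fpr = fp / (fp + tn)   # fraction of real negatives predicted positive
    return fpr, tpr

# The slide's example: 30 TP, 10 FN, 20 FP, 40 TN -> approximately (0.33, 0.75)
print(roc_point(tp=30, fn=10, fp=20, tn=40))
```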

Evaluating classifiers. ROC Analysis. Given several classifiers, we can construct the convex hull of their points (FPR, TPR) together with the trivial classifiers (0,0) and (1,1). The classifiers falling under the ROC convex hull can be discarded; the best of the remaining classifiers can be chosen at application time. [ROC diagram: classifier points in FPR-TPR space and their convex hull.]
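A sketch of this hull construction, assuming classifiers are given as (FPR, TPR) points; the helper name roc_convex_hull and the sample points are mine:

```python
def roc_convex_hull(points):
    """Upper convex hull of ROC points, including the trivial classifiers."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:                              # monotone-chain sweep, left to right
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # pop the last hull point if it lies on or below the chord hull[-2] -> p
            if (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1) >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull                                # classifiers below this hull can be discarded

print(roc_convex_hull([(0.33, 0.75), (0.10, 0.40), (0.60, 0.70)]))
# -> [(0.0, 0.0), (0.1, 0.4), (0.33, 0.75), (1.0, 1.0)]; (0.6, 0.7) is discarded
```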

Choosing a classifier. ROC Analysis. [ROC plot: true positive rate vs. false positive rate, both axes 0%-100%.] Example: Pcost/Ncost = 2 and Neg/Pos = 4 give slope = 4/2 = 2.

Choosing a classifier. ROC Analysis. [Same ROC plot.] Example: Pcost/Ncost = 8 and Neg/Pos = 4 give slope = 4/8 = 0.5.
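A hedged sketch of the selection rule on these two slides, assuming Pcost and Ncost denote the costs of misclassifying a positive and a negative example, and reusing the hull from the previous step: the classifier touched first by an iso-performance line of the given slope is the hull point maximising TPR - slope * FPR. The function name is mine:

```python
def best_operating_point(hull_points, neg, pos, ncost, pcost):
    """Hull point touched by the iso-performance line of slope (Neg/Pos) * (Ncost/Pcost)."""
    slope = (neg / pos) * (ncost / pcost)
    return max(hull_points, key=lambda pt: pt[1] - slope * pt[0])

hull = [(0.0, 0.0), (0.1, 0.4), (0.33, 0.75), (1.0, 1.0)]
print(best_operating_point(hull, neg=4, pos=1, ncost=1, pcost=2))  # slope 2   -> (0.1, 0.4)
print(best_operating_point(hull, neg=4, pos=1, ncost=1, pcost=8))  # slope 0.5 -> (0.33, 0.75)
```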

Choosing a classifier. ROC Analysis. If we don't know the slope (expected class distribution), we choose the classifier with the greatest AUC. [ROC diagram with the area under the curve shaded.] The Area Under the Curve (AUC) can be used as a metric for comparing classifiers.

ROC Decision Trees. A decision tree can be seen as an unlabelled decision tree (a clustering tree). Given n leaves and 2 classes, there are 2^n possible labellings, and each of the 2^n possible labellings of the n leaves of a given decision tree represents a classifier. We can use ROC analysis to discard some of them! [Slide table: the training distribution (positives and negatives) of three leaves and the 2^3 = 8 possible labellings, numbered 1-8.]
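As a hedged sketch (my own helper and toy counts, not the slide's), each labelling turns the tree into a classifier whose ROC point follows directly from the leaves' training counts:

```python
from itertools import product

def labelling_roc_points(leaves):
    """ROC point (FPR, TPR) of each of the 2**n labellings of n leaves.

    leaves: list of (p_j, n_j) training counts per leaf."""
    P = sum(p for p, _ in leaves)
    N = sum(n for _, n in leaves)
    points = {}
    for labels in product('+-', repeat=len(leaves)):   # one classifier per labelling
        tp = sum(p for (p, _), lab in zip(leaves, labels) if lab == '+')
        fp = sum(n for (_, n), lab in zip(leaves, labels) if lab == '+')
        points[labels] = (fp / N, tp / P)
    return points

# Toy tree with three leaves holding (positives, negatives):
for labels, pt in labelling_roc_points([(4, 2), (1, 5), (3, 5)]).items():
    print(labels, pt)                                   # 8 labellings for 3 leaves
```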

ROC Decision Trees. [ROC plot: the ROC points of all labellings, true positive ratio vs. false positive ratio.] Many labellings are under the convex hull. There is a special symmetry around (0.5, 0.5). This set of classifiers has special properties which could allow a more direct computation of the optimal labellings.

ROC Decision Trees. Optimal Labellings. Given a decision tree for a problem with 2 classes formed by n leaves {l_1, l_2, ..., l_n} ordered by local positive accuracy, i.e. r_1 >= r_2 >= ... >= r_{n-1} >= r_n, we define the set of optimal labellings Γ = {S_0, S_1, ..., S_n}, where each labelling S_i, 0 <= i <= n, is defined as S_i = {A_1^i, A_2^i, ..., A_n^i} with A_j^i = (j, +) if j <= i and A_j^i = (j, -) if j > i. Theorem: the convex hull corresponding to the 2^n possible labellings is formed by, and only by, the ROC points corresponding to the set of optimal labellings Γ, removing repeated leaves with the same local positive accuracy.
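A minimal sketch of this construction (helper name and toy counts are mine): order the leaves by local positive accuracy r_j = p_j / (p_j + n_j) and emit the n+1 optimal labellings S_0, ..., S_n, where S_i labels the first i leaves '+' and the rest '-':

```python
def optimal_labellings(leaves):
    """leaves: list of (p_j, n_j) training counts; returns the n+1 optimal labellings."""
    order = sorted(range(len(leaves)),
                   key=lambda j: leaves[j][0] / sum(leaves[j]),
                   reverse=True)                       # decreasing local positive accuracy
    labellings = []
    for i in range(len(leaves) + 1):
        s_i = {leaf: ('+' if rank < i else '-') for rank, leaf in enumerate(order)}
        labellings.append(s_i)
    return labellings

# Same toy counts as above: only 4 labellings instead of 2**3 = 8.
for s in optimal_labellings([(4, 2), (1, 5), (3, 5)]):
    print(s)
```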

Example. We first order the leaves and then use only the optimal labellings. [Slide table: the three leaves' (positive, negative) training counts after ordering by local positive accuracy.] That matches exactly with the convex hull: [ROC curve: the optimal labellings' points, true positive ratio vs. false positive ratio, lying on the convex hull.]

ROC Decision Trees. Optimal Labellings. Advantages: only n+1 labellings must be considered (instead of 2^n); the convex hull need not be computed; the AUC is much easier to compute: O(n log n). The AUC measure can thus be easily computed for unlabelled decision trees, and decision trees can be compared using it instead of accuracy. Why don't we use this measure during decision tree learning?
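The claim that the AUC of an unlabelled tree is easy to compute can be illustrated with a short sketch (my own helper, using the same (p_j, n_j) leaf counts as above): sweep the leaves in decreasing order of local positive accuracy and sum the trapezoids under the resulting ROC curve; the cost is O(n log n) because of the sort:

```python
def tree_auc(leaves):
    """AUC of an unlabelled tree from its leaves' (p_j, n_j) training counts."""
    P = sum(p for p, _ in leaves)
    N = sum(n for _, n in leaves)
    ordered = sorted(leaves, key=lambda pn: pn[0] / sum(pn), reverse=True)
    auc = tpr = fpr = 0.0
    for p, n in ordered:                     # each leaf adds one trapezoid
        new_tpr, new_fpr = tpr + p / P, fpr + n / N
        auc += (new_fpr - fpr) * (tpr + new_tpr) / 2.0
        tpr, fpr = new_tpr, new_fpr
    return auc

print(tree_auc([(4, 2), (1, 5), (3, 5)]))    # same toy leaves as above, about 0.72
```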

AUC Splitting Criterion. AUCSplit: given a candidate split s when growing the tree, we can order the children it produces by local positive accuracy and calculate the corresponding ROC curve. The area under this curve can be compared with the areas of other splits in order to select the best split.

AUC Splitting Criterion. AUCSplit vs. standard splitting criteria: standard splitting criteria compare the impurity of the parent with the weighted average impurity of the children. For a split s of a parent with class counts [p, n] into children [p_1, n_1], ..., [p_k, n_k]:

    I(s) = sum over j = 1..k of ((p_j + n_j) / (p + n)) * f(p_j, n_j)

AUC is an alternative not based on impurity. Example for 2 children, ordered so that child 1 has the higher local positive accuracy:

    AUCsplit = (p_1*n_1 + 2*p_1*n_2 + p_2*n_2) / (2*p*n)
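A small sketch of the two-children case as reconstructed above (the function name is mine; the caller must pass the child with the higher local positive accuracy first):

```python
def auc_split_2(child1, child2):
    """AUCsplit for a binary split; child1 = (p1, n1) has the higher p/(p+n)."""
    (p1, n1), (p2, n2) = child1, child2
    p, n = p1 + p2, n1 + n2
    return (p1 * n1 + 2 * p1 * n2 + p2 * n2) / (2.0 * p * n)

# Split producing children [4+, 2-] and [1+, 5-]:
print(auc_split_2((4, 2), (1, 5)))   # 53/70, about 0.757
```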

Experiments. Methodology: 25 binary UCI datasets; PEP pruning; 10-fold cross-validation. First we examine which is the best classical splitting criterion with respect to the AUC measure. [Table: AUC (mean ± standard deviation) per dataset for Gain Ratio, Gini, DKM and Expected Error (EErr). Gain Ratio obtains the highest mean AUC (85.53), ahead of DKM, Gini and Expected Error.]

Experiments. Methodology: 25 binary UCI datasets; PEP pruning; 10x10-fold cross-validation; differences are marked when significant with a t-test at 0.1. Next we compare the best classical splitting criterion (Gain Ratio) with AUCsplit. [Table: accuracy and AUC (mean ± standard deviation) per dataset for Gain Ratio and AUCsplit, with marks indicating which is significantly better. On average, AUCsplit is at least as good in accuracy (87.55 vs. 87.49) and better in AUC (86.24 vs. 85.78).]

Experiments. Methodology: the 6 of the 25 binary UCI datasets with a minority-class percentage below 15%; PEP pruning; 10x10-fold cross-validation. Finally we compare the results when the class distribution changes. AUC of Gain Ratio (GR) vs. AUCsplit (AUCs.) under the original, balanced (50%-50%) and swapped class distributions:

  #   Original Dist.    50%-50%          Swapped Dist.    %min class
      GR      AUCs.     GR      AUCs.    GR      AUCs.
  6   98.0    98.6      88.3    93.5     78.6    88.3      6.06
  7   95.2    96.7      88.6    92.6     81.9    88.4      1.83
  2   99.6    99.6      99.0    98.7     98.4    97.8      0.4
 22   96.8    96.8      89.8    89.7     82.9    82.7      0.23
 24   99.5    99.5      96.0    96.6     92.5    93.6      3.95
 25   98.9    99.5      95.8    98.4     92.7    97.3      9.86
  M   98.0    98.5      92.9    94.9     87.8    91.4

Conclusions and Future Work. Labelling classifiers: one classifier can be many classifiers! The optimal labelling set has been identified (order the leaves by local positive accuracy), which gives an efficient way to compute the AUC of a set of rules. AUCsplit criterion: better results for the AUC measure. Future work: extension of the AUC measure and AUCsplit for c > 2 classes; a global AUC splitting criterion; pre-pruning and post-pruning methods based on AUC.