Derivative-Free Optimization for Hyper-Parameter Tuning in Machine Learning Problems


Derivative-Free Optimization for Hyper-Parameter Tuning in Machine Learning Problems
Hiva Ghanbari, joint work with Prof. Katya Scheinberg
Industrial and Systems Engineering Department, Lehigh University
ICCOPT Conference, Tokyo, Japan, August 2016

Outline
Introduction
AUC Optimization via DFO-TR
Parameter Tuning of Cost-Sensitive LR
Summary


Learning From Imbalanced Data Sets
Many real-world machine learning problems deal with imbalanced training data. [Figure: (a) a balanced data set vs. (b) an imbalanced data set]


Learning From Imbalanced Data Sets
There are rare positive examples (the minority class) but numerous negative ones (the majority class). In many applications the minority class is the more interesting and important one, and the rare class carries a much higher misclassification cost than the majority class.

Learning From Imbalanced Data Sets
The most common methods for dealing with imbalanced data sets:

Undersampling. Advantage: independent of the underlying classifier. Drawback: may remove significant patterns.
Oversampling. Advantages: independent of the underlying classifier; can be easily implemented. Drawbacks: time consuming; may lead to over-fitting.
Cost sensitive. Advantage: minimizes the cost of misclassification. Drawback: the misclassification costs are unknown.
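The two resampling strategies above can be sketched in a couple of lines each; the majority/minority sets below are made-up illustration data, not from the talk.

```python
import random

# Sketch of the two resampling strategies in the table (illustrative data).
random.seed(1)
majority = list(range(100))      # 100 "negative" examples
minority = ['p1', 'p2', 'p3']    # 3 "positive" examples

# Undersampling: drop majority examples (may remove significant patterns).
under = random.sample(majority, len(minority))

# Oversampling: duplicate minority examples (may lead to over-fitting).
over = [random.choice(minority) for _ in range(len(majority))]
```

Both methods leave the downstream classifier untouched, which is why the table lists classifier-independence as their shared advantage.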

Ranking Quality of Classifier f(x)
Consider the linear classifier f(x) = w^T x + b, where y_i = +1 if f(x_i) > 0 and y_i = -1 otherwise. Consider the sigmoid function f_s(x) = 1 / (1 + e^{-f(x)}), which maps f(x) = w^T x + b into (0, 1). In perfect classification, all positive examples are ranked higher than the negative ones.

Receiver Operating Characteristic (ROC) Curve
Sort the outputs by descending value of f_s(x). The confusion matrix of a binary classification problem, for a specific threshold deciding when a negative example turns into a positive one:

                   +ve (Predicted)        -ve (Predicted)
+ve (Actual)       True Positive (TP)     False Negative (FN)
-ve (Actual)       False Positive (FP)    True Negative (TN)

Varying the threshold results in different values of
True Positive Rate = TP / (TP + FN) and False Positive Rate = FP / (FP + TN).

Definition 1. The ROC curve presents the tradeoff between the true positive rate and the false positive rate, over all possible thresholds.
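The threshold sweep behind Definition 1 can be sketched as follows; the scores and labels are made-up illustration data, not from the talk.

```python
# Sketch: tracing the ROC curve by lowering the decision threshold
# one sorted example at a time.
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per threshold between sorted scores."""
    pairs = sorted(zip(scores, labels), reverse=True)   # descending f_s(x)
    P = sum(1 for _, y in pairs if y == 1)              # actual positives
    N = len(pairs) - P                                  # actual negatives
    tp = fp = 0
    points = [(0.0, 0.0)]                               # strictest threshold
    for _, y in pairs:                                  # admit one more example
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    return points

pts = roc_points([0.9, 0.8, 0.7, 0.6, 0.55, 0.4], [1, 1, -1, 1, -1, -1])
```

Each admitted example moves the curve up (a true positive) or right (a false positive), which is exactly the TPR/FPR tradeoff of Definition 1.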

Area Under ROC Curve (AUC)
How can we compare ROC curves? Higher AUC = better classifier.

Measuring AUC
How can we measure the AUC of a classifier?

Lemma 2. Consider a linear classifier f(x) = w^T x + b that separates the positive set S^+ := {x_i^+ : i = 1, ..., N^+} from the negative set S^- := {x_j^- : j = 1, ..., N^-}. Then the associated AUC is given by the Wilcoxon-Mann-Whitney (WMW) statistic

F_AUC(w) = (1 / (N^+ N^-)) sum_{i=1}^{N^+} sum_{j=1}^{N^-} I_w[w^T x_i^+ > w^T x_j^-],

where I_w[w^T x_i^+ > w^T x_j^-] = 1 if w^T x_i^+ > w^T x_j^-, and 0 otherwise.
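Lemma 2 can be sketched directly as a pairwise count; the vector w and the two sample sets below are illustrative assumptions, not data from the talk.

```python
# Sketch of Lemma 2: AUC as the WMW statistic over all positive/negative pairs.
def auc_wmw(w, pos, neg):
    """Fraction of (x+, x-) pairs that w^T x ranks correctly."""
    score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
    wins = sum(1 for xp in pos for xn in neg if score(xp) > score(xn))
    return wins / (len(pos) * len(neg))

pos = [(2.0, 1.0), (1.5, 0.5)]               # S+  (scores 3.0, 2.0 for w below)
neg = [(0.5, 0.2), (2.5, 0.5), (0.0, 0.0)]   # S-  (scores 0.7, 3.0, 0.0)
auc = auc_wmw((1.0, 1.0), pos, neg)          # 4 of the 6 pairs are ordered correctly
```

The double loop makes the O(N^+ N^-) cost of the exact statistic explicit; sorting-based implementations reduce this to O(N log N).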

Probabilistic Interpretation of AUC
We can measure the ranking quality of the classifier f(x) through a probability value:

F_AUC(w) = P(w^T X^+ > w^T X^-),

where X^+ and X^- are randomly sampled from S^+ and S^-. If the two sets S^+ and S^- are themselves randomly drawn from distributions D_1 and D_2, then

E(F_AUC(w)) = P(w^T X̂^+ > w^T X̂^-),    (1)

where X̂^+ and X̂^- are randomly chosen from distributions D_1 and D_2, respectively.

Ranking Loss
Maximizing the AUC function F_AUC(w) is equivalent to minimizing the ranking loss

1 - F_AUC(w) = (1 / (N^+ N^-)) sum_{i=1}^{N^+} sum_{j=1}^{N^-} I[w^T x_i^+ - w^T x_j^- <= 0],

where I[w^T x_i^+ - w^T x_j^- <= 0] = 1 if w^T x_i^+ - w^T x_j^- <= 0, and 0 otherwise: a step function of the pair margin w^T (x_i^+ - x_j^-).

Surrogate Convex Losses
The main challenge of minimizing the ranking loss 1 - F_AUC(w) is its non-smoothness and discontinuity, inherited from the indicator function I(·). Common convex surrogates, as functions of the pair margin w^T (x_i^+ - x_j^-):

Exponential loss: I_exp(w, x_i^+ - x_j^-) = e^{-w^T (x_i^+ - x_j^-)}
Logistic loss: I_log(w, x_i^+ - x_j^-) = log(1 + e^{-w^T (x_i^+ - x_j^-)})
Hinge loss: I_hinge(w, x_i^+ - x_j^-) = max{0, 1 - w^T (x_i^+ - x_j^-)}
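The three surrogates can be written as functions of the pair margin m = w^T (x_i^+ - x_j^-); the margin value used below is a made-up illustration.

```python
import math

# The three convex surrogates of the 0/1 ranking indicator,
# evaluated on the pair margin m = w^T (x_i^+ - x_j^-).
def exp_loss(m):   return math.exp(-m)
def log_loss(m):   return math.log(1.0 + math.exp(-m))
def hinge_loss(m): return max(0.0, 1.0 - m)

m = 0.5  # a correctly ordered pair, but with a small margin
losses = (exp_loss(m), log_loss(m), hinge_loss(m))
```

All three are positive at m = 0.5 even though the indicator I[m <= 0] is zero there: as upper bounds on the step function, they keep pushing correctly ordered pairs toward larger margins.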


Algorithmic Framework of DFO-TR
Algorithm 1: Trust-Region Based Derivative-Free Optimization
0. Initialization. Define the interpolation set X and compute the corresponding function values F_X.
1. Build the model. Build Q_k(x) as a quadratic approximation of the function f.
2. Minimize the model within the trust region. Find x̂_k such that Q_k(x̂_k) := min_{x in B_k} Q_k(x), where B_k = {x : ||x - x_k|| <= Δ_k}. Compute f(x̂_k) and the ratio ρ_k = (f(x_k) - f(x̂_k)) / (Q_k(x_k) - Q_k(x̂_k)).
3. Update the interpolation set.
4. Update the trust-region radius.
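A minimal one-dimensional sketch of the loop in Algorithm 1, using quadratic interpolation through three points as the model Q_k; this is an illustration of the framework (test function, acceptance thresholds, and radius updates are assumptions), not the authors' implementation.

```python
# 1-D trust-region DFO sketch: interpolate, minimize the model, accept/reject via rho.
def dfo_tr_1d(f, x0, delta=1.0, iters=30):
    xk, fk = x0, f(x0)
    for _ in range(iters):
        # Step 1: build Q_k through the interpolation set {xk - delta, xk, xk + delta}.
        fl, fr = f(xk - delta), f(xk + delta)
        g = (fr - fl) / (2 * delta)            # model gradient at xk
        h = (fr - 2 * fk + fl) / delta**2      # model curvature
        # Step 2: minimize Q_k(xk + s) = fk + g*s + 0.5*h*s^2 over |s| <= delta.
        if h > 0 and abs(g / h) <= delta:
            s = -g / h                          # interior minimizer
        else:
            s = -delta if g > 0 else delta      # step to the trust-region boundary
        pred = -(g * s + 0.5 * h * s**2)        # predicted decrease
        fnew = f(xk + s)
        rho = (fk - fnew) / pred if pred > 0 else -1.0
        # Steps 3-4: accept/reject the step and update the radius.
        if rho > 0.1:
            xk, fk = xk + s, fnew
            delta *= 1.5 if rho > 0.75 else 1.0
        else:
            delta *= 0.5
    return xk

xmin = dfo_tr_1d(lambda x: (x - 3.0) ** 2 + 1.0, x0=0.0)
```

Only function values are used, never derivatives, which is what makes the framework applicable to the non-smooth AUC objective.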

AUC, a Smooth Function of w in Expectation
Theorem 3. If two d-dimensional random vectors X̂_1 and X̂_2 have a joint multivariate normal distribution, (X̂_1, X̂_2) ~ N(µ, Σ), where

µ = (µ_1, µ_2),  Σ = [Σ_11  Σ_12; Σ_21  Σ_22],

then the marginal distributions of X̂_1 and X̂_2 are normal: X̂_1 ~ N(µ_1, Σ_11) and X̂_2 ~ N(µ_2, Σ_22).

AUC, a Smooth Function of w in Expectation
Theorem 4. Given the two random vectors X̂_1 and X̂_2 above, for any vector w in R^d we have

Z = w^T (X̂_1 - X̂_2) ~ N(µ_Z, σ_Z^2),  where µ_Z = w^T (µ_1 - µ_2) and σ_Z^2 = w^T (Σ_11 + Σ_22 - Σ_12 - Σ_21) w.

Theorem 5. If two random vectors X̂_1 and X̂_2, from the positive and the negative classes respectively, have a joint normal distribution, then the expected value of the AUC function is

E(F_AUC(w)) = Φ(µ_Z / σ_Z),

where Φ is the cumulative distribution function of the standard normal distribution, Φ(x) = (1/√(2π)) ∫_{-∞}^{x} e^{-t²/2} dt for x in R.
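Theorem 5 can be checked numerically; the one-dimensional Gaussian classes, w = 1, and sample sizes below are illustrative assumptions.

```python
import math, random

# Sketch of Theorem 5: for Gaussian classes, E(F_AUC) = Phi(mu_Z / sigma_Z).
random.seed(0)
mu_pos, mu_neg, sd = 1.0, -1.0, 1.0   # X+ ~ N(1,1), X- ~ N(-1,1), independent
mu_z = mu_pos - mu_neg                # w^T (mu_1 - mu_2) with w = 1
sigma_z = math.sqrt(sd**2 + sd**2)    # Sigma_12 = Sigma_21 = 0 here
# Standard normal CDF via the error function.
expected = 0.5 * (1.0 + math.erf((mu_z / sigma_z) / math.sqrt(2.0)))

# Empirical WMW estimate of AUC from samples of the two classes.
pos = [random.gauss(mu_pos, sd) for _ in range(1000)]
neg = [random.gauss(mu_neg, sd) for _ in range(1000)]
empirical = sum(1 for p in pos for n in neg if p > n) / (len(pos) * len(neg))
```

The empirical pairwise count and the closed form Φ(µ_Z/σ_Z) agree to sampling error, which is the smoothness-in-expectation property that justifies fitting a smooth quadratic model to AUC.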

Computational Result (DFO-TR vs. Hinge)
Performance of the algorithms based on the average AUC value and number of function evaluations:

Dataset (d, N)           Hinge: AUC (fevals)            DFO-TR: AUC (fevals)
sonar (60, 208)          0.849057 ± 0.003061 (274)      0.840323 ± 0.002453 (254)
fourclass (2, 862)       0.836226 ± 0.000923 (480)      0.836311 ± 0.000921 (254)
svmguide1 (4, 3089)      0.989319 ± 0.000008 (334)      0.989132 ± 0.000007 (254)
magic04 (10, 4020)       0.843085 ± 0.000208 (417)      0.843511 ± 0.000213 (254)
svmguide3 (22, 1243)     0.793116 ± 0.001284 (368)      0.775246 ± 0.002083 (254)
shuttle (9, 3500)        0.988625 ± 0.000021 (266)      0.987531 ± 0.000035 (254)
segment (19, 2310)       0.993134 ± 0.000023 (753)      0.99567 ± 0.00071 (254)
ijcnn1 (22, 4691)        0.930685 ± 0.000204 (413)      0.910897 ± 0.000264 (254)
letter (16, 5000)        0.986699 ± 0.000037 (517)      0.985119 ± 0.000042 (254)
poker (10, 5010)         0.519942 ± 0.001549 (553)      0.520517 ± 0.001618 (254)

Computational Result (Stochastic DFO-TR)
[Figure: stochastic vs. deterministic DFO-TR]


Cost-Sensitive Logistic Regression Problem
Bias the classifier toward the minority class by minimizing

F(w) := (1/N) sum_{i=1}^{N} C(y_i) log(1 + exp(-y_i w^T x_i)) + (λ/2) ||w||²,

where C(y_i) = C_+ if y_i > 0, and C_- otherwise.

GOAL: Find the best values of the costs C_+ and C_- and the regularization parameter λ, to achieve a linear classifier f(x) = w^T x + b with maximum AUC value.
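The inner problem, minimizing F(w) for fixed costs, can be sketched with plain gradient descent; the toy data, step size, and cost values below are illustrative assumptions, not the tuning procedure from the talk.

```python
import math

# Gradient-descent sketch of the cost-sensitive logistic objective
# F(w) = (1/N) sum C(y_i) log(1 + exp(-y_i w^T x_i)) + (lambda/2) ||w||^2.
def fit_cslr(data, c_pos, c_neg, lam, lr=0.1, steps=500):
    d = len(data[0][0])
    w = [0.0] * d
    n = len(data)
    for _ in range(steps):
        grad = [lam * wj for wj in w]                 # gradient of the ridge term
        for x, y in data:
            c = c_pos if y > 0 else c_neg             # class-dependent cost C(y_i)
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            coef = -c * y / (1.0 + math.exp(margin)) / n   # d/dw of the loss term
            for j in range(d):
                grad[j] += coef * x[j]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# Imbalanced toy set: one positive, nine negatives. Up-weighting the rare
# positive class (C+ = 9, C- = 1) biases the boundary toward it.
data = [((1.0, 1.0), 1)] + [((-1.0, -0.5), -1)] * 9
w = fit_cslr(data, c_pos=9.0, c_neg=1.0, lam=0.01)
```

The outer problem, choosing {C_+, C_-, λ} to maximize AUC, is exactly where the derivative-free trust-region method is applied, since AUC is not a smooth function of these parameters.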

AUC as a Function of {C_+, C_-, λ}
AUC is a function of w:

F_AUC(w) = (1 / (N^+ N^-)) sum_{i=1}^{N^+} sum_{j=1}^{N^-} I_w[w^T x_i^+ > w^T x_j^-].

The vector w, as the minimizer of

F(w) := (1/N) sum_{i=1}^{N} C(y_i) log(1 + exp(-y_i w^T x_i)) + (λ/2) ||w||²,

is a function of the parameters {C_+, C_-, λ}. Hence F_AUC(w) is a non-smooth function of the parameters {C_+, C_-, λ}.

w, a Smooth Function of {C_+, C_-, λ}
Theorem 6. In cost-sensitive logistic regression, the vector w is a smooth function of the nonzero parameters C_+, C_-, and λ if the matrix M and the augmented matrix [M v] have the same rank (one special case satisfying this condition is when M has full rank), where

M = sum_{i=1}^{N^+} exp(y_i w^T x_i) / (1 + exp(y_i w^T x_i))² · x_i x_i^T,
v = (1/C_+) sum_{i=1}^{N^+} y_i x_i / (1 + exp(y_i w^T x_i)).    (2)

A similar condition is required when M and v are constructed from the negative class.

Computational Result
In this section we compare cost-sensitive logistic regression based on the class distribution ratio (CSLR-CDR) against cost-sensitive logistic regression based on the optimized ratio (CSLR-OR). Performance of the algorithms based on the average AUC value:

Dataset      CSLR-CDR    CSLR-OR
sonar        0.837656    0.84183
fourclass    0.835398    0.835387
svmguide1    0.981900    0.988035
magic04      0.841500    0.843396
diabetes     0.827639    0.822690
german       0.788780    0.784636
a9a          0.902774    0.902868
svmguide3    0.774338    0.797397
connect-4    0.899115    0.898861
shuttle      0.988924    0.991161
HAPT         0.999992    0.999994
segment      0.999791    0.999841
mnist        0.994864    0.995341
ijcnn1       0.934975    0.934986
satimage     0.761618    0.760733
vowel        0.978475    0.975405
poker        0.502508    0.521330
letter       0.987558    0.988293
w1a          0.960954    0.962368
w8a          0.960331    0.961645


Summary and Future Study
The main challenge of optimizing AUC is its non-smoothness and discontinuity. We directly optimize the AUC function via trust-region based derivative-free optimization, and we tune the parameters of cost-sensitive logistic regression to obtain a classifier with maximum AUC value.

References
A. R. Conn, K. Scheinberg, and L. N. Vicente, Introduction to Derivative-Free Optimization, SIAM, 2009.
R. Chen, M. Menickelly, and K. Scheinberg, Stochastic optimization using a trust-region method and random models.
C. X. Ling, J. Huang, and H. Zhang, AUC: a statistically consistent and more discriminating measure than accuracy.
J. A. Hanley and B. J. McNeil, The meaning and use of the area under a receiver operating characteristic (ROC) curve, Radiology, 143:29-36, 1982.