Data complexity measures for analyzing the effect of SMOTE over microarrays


ESANN 2016 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges (Belgium), 27-29 April 2016, i6doc.com publ., ISBN 978-287587027-8.

L. Morán-Fernández, V. Bolón-Canedo and A. Alonso-Betanzos

Laboratory for Research and Development in Artificial Intelligence (LIDIA), Computer Science Dept., University of A Coruña, 15071 A Coruña, Spain

Abstract. Microarray classification is a challenging issue for machine learning researchers, mainly because of the mismatch between gene dimension and sample size. Besides, this type of data has other properties that can complicate the classification task, such as class imbalance. A common approach to dealing with imbalanced datasets is a preprocessing step that tries to cope with the imbalance. In this work we analyze the usefulness of data complexity measures for evaluating the behavior of the SMOTE algorithm before and after applying feature (gene) selection.

1 Introduction

During the last two decades, advances in molecular genetics technologies such as DNA microarrays have allowed the expression levels of thousands of genes to be measured simultaneously, stimulating a new line of research both in bioinformatics and in machine learning (ML). Microarray technology is used to collect information from tissue and cell samples regarding gene expression differences that could be useful for diagnosing diseases, as it enables distinct kinds or subtypes of tumors to be classified according to expression patterns (profiles). The classification of this type of data has been viewed as a particular challenge for ML researchers, mainly because of its extremely high dimensionality (from 2,000 to 25,000 features) in contrast to small sample sizes (often fewer than 100 patients) [1]. The existence of many features relative to few samples means that false-positive findings due to chance are very likely, in terms of both identifying relevant genes and building predictive models. Moreover, several studies have demonstrated that most of the genes measured in a DNA microarray experiment do not actually contribute to efficient sample classification. To avoid this curse of dimensionality, feature selection (FS), defined as the process of identifying and removing irrelevant features from the training data, is advisable so as to identify the specific genes that enhance classification accuracy.

Apart from the obvious problem of extremely high dimensionality, microarray datasets present other properties that can complicate the classification task, such as class imbalance. The class imbalance problem occurs when a dataset is dominated by a major class, or classes, that have significantly more instances than the rare/minority classes in the data. For example, in the domain at hand, the cancer class tends to be rarer than the non-cancer class because there are usually more healthy patients in a real situation.

This research has been financially supported in part by the Spanish Ministerio de Economía y Competitividad (research projects TIN2012-37954 and TIN2015-65069-C2-1-R), by European Union FEDER funds and by the Consellería de Industria of the Xunta de Galicia (research project GRC2014/035). V. Bolón-Canedo acknowledges Xunta de Galicia postdoctoral funding (ED481B 2014/164-0).

However, it is important for practitioners to predict and prevent the appearance of cancer. In these cases, standard classifier learning algorithms are biased toward the classes with a greater number of samples, since the rules that correctly predict those samples are positively weighted in favor of the accuracy metric, whereas specific rules that predict instances from the minority class are usually ignored (treated as noise), because more general rules are preferred. Therefore, minority class samples are misclassified more often than those from the other classes.

At the same time, it is widely acknowledged that the prediction capacity of classifiers also depends greatly on the characteristics of the data. Data complexity analysis is a relatively recent proposal by Ho and Basu [2] to identify data particularities that imply some difficulty for the classification task. Several studies have explored the use of data complexity analysis to characterize data and to relate data characteristics to classifier performance over microarray data [3, 4].

The aim of this work is to show that data complexity measures are adequate for analyzing the effect of preprocessing unbalanced data for classification. Specifically, we will consider the Synthetic Minority Oversampling Technique (SMOTE) [5] over five microarray datasets, before and after applying FS, obtaining promising results.

2 Oversampling approach: the SMOTE algorithm

The traditional preprocessing techniques used to overcome the class imbalance problem are undersampling and oversampling methods. Undersampling creates a subset of the original dataset by eliminating samples, aiming to reduce the majority class to the same number of samples as the minority class. As mentioned above, in microarray data only a very small number of samples are available in the whole dataset; consequently, eliminating observations is not a good option for this type of dataset. In contrast, oversampling methods create a superset of the original dataset by replicating some instances or creating new instances from existing ones. Specifically, in this research we have chosen a popular oversampling method, the SMOTE algorithm.

When applying SMOTE, the minority class is oversampled by taking each minority class sample and introducing synthetic examples along the line segments joining any/all of its k minority-class nearest neighbors. Depending upon the amount of oversampling required, neighbors from the k nearest neighbors (k-NN) are randomly chosen. Synthetic samples are generated as follows: (1) take the difference between the feature vector (sample) under consideration and its nearest neighbor, (2) multiply this difference by a random number between 0 and 1, and (3) add it to the feature vector under consideration. This causes the selection of a random point along the line segment between two specific samples, and effectively forces the decision region of the minority class to become more general. In short, the main idea is to form new minority class samples by interpolating between several minority class samples that lie close together. Thus, the overfitting problem is avoided and the decision boundaries for the minority class spread further into the majority class space.
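To make steps (1)-(3) concrete, here is a minimal sketch of the SMOTE interpolation in Python/NumPy; the function name and the brute-force neighbor search are our own illustration under those assumptions, not the paper's code or the original SMOTE implementation:

```python
import numpy as np

def smote(X_min, n_synthetic, k=5, seed=None):
    """Generate synthetic minority samples by interpolating between each
    minority sample and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # Brute-force pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)              # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]        # k nearest neighbors per sample
    synthetic = np.empty((n_synthetic, X_min.shape[1]))
    for i in range(n_synthetic):
        j = rng.integers(n)                  # pick a minority sample...
        neighbor = X_min[rng.choice(nn[j])]  # ...and one of its k neighbors
        gap = rng.random()                   # step (2): random factor in [0, 1)
        synthetic[i] = X_min[j] + gap * (neighbor - X_min[j])  # steps (1), (3)
    return synthetic

# Example: grow a 3-sample minority class by 6 synthetic samples.
X_min = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2]])
X_new = smote(X_min, n_synthetic=6, k=2, seed=0)
```

Because neighbors are searched only within the minority class, every synthetic point lies on a segment between two existing minority samples, which is what makes the minority decision region more general.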

3 Data complexity measures

To analyze the theoretical complexity of the microarray datasets chosen for this research, we have used some of the measures proposed in [2]. As these measures were designed for two-class problems, we converted each original multiclass problem into several two-class problems using the one-versus-rest strategy.

F1: Maximum Fisher's discriminant ratio. This measure computes the maximum discriminative power over the individual features. For a given feature it is defined as

f = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}

where \mu_1, \mu_2 are the means and \sigma_1^2, \sigma_2^2 the variances of the two classes in that feature dimension. We compute f for each feature and take the maximum as the F1 measure. Higher values indicate simpler classification problems.

F3: Maximum (individual) feature efficiency. In a procedure that progressively removes unambiguous points falling outside the overlapping region in each chosen dimension, the efficiency of each feature is defined as the fraction of all remaining points separable by that feature. To represent the contribution of the most useful feature, we use the maximum feature efficiency as measure F3.

L1: Minimized sum of the error distance by linear programming. This measure evaluates to what extent the training data are linearly separable. It returns the sum of the differences between a linear classifier's predicted value and the actual class value. Unlike Ho and Basu, we use an SVM with a linear kernel. A zero value for L1 indicates that the problem is linearly separable.

N1: Fraction of points on the class boundary. This measure constructs a class-blind minimum spanning tree over the entire dataset and counts the points incident to an edge going across the two classes; the index is the fraction of such points over all points in the dataset. High values indicate smaller separation between the class distributions and a more difficult classification task.

4 Experimental results

The experiments were carried out on five microarray datasets. Table 1 profiles the main characteristics of these datasets in terms of number of features, number of samples, number of classes and imbalance ratio (IR). The imbalance ratio is defined as the number of samples in the majority class divided by the number of samples in the minority class; a high score indicates a highly unbalanced dataset. As classifier we have chosen k-NN [8], executed with the Weka tool [9] using k = 1. A 5-fold cross-validation is performed. In order to reduce the number of features, Correlation-based Feature Selection (CFS) [10] is used in this work. This simple multivariate filter algorithm ranks feature subsets according to a correlation-based heuristic evaluation function. The evaluation function is biased towards subsets containing features that are highly correlated with the class and uncorrelated with each other. Irrelevant features, which have low correlation with the class, are ignored. Redundant features are screened out, as they would be highly correlated with one or more of the remaining features. The acceptance of a feature depends on the extent to which it predicts classes in areas of the instance space not already predicted by other features (a sketch of this heuristic appears after Table 2 below).
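Before turning to the datasets, here is a minimal sketch of two of the complexity measures defined above and of the imbalance ratio. This is our own illustrative implementation (NumPy/SciPy), not the tooling used in the paper:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def fisher_f1(X, y):
    """F1: maximum Fisher's discriminant ratio over all features.
    Assumes a two-class problem with labels 0 and 1."""
    a, b = X[y == 0], X[y == 1]
    f = (a.mean(axis=0) - b.mean(axis=0)) ** 2 \
        / (a.var(axis=0) + b.var(axis=0) + 1e-12)
    return float(f.max())

def boundary_n1(X, y):
    """N1: fraction of points incident to a minimum-spanning-tree edge
    joining the two classes. Assumes all pairwise distances are positive
    (SciPy treats zeros in a dense graph as missing edges)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    mst = minimum_spanning_tree(d).tocoo()   # class-blind MST
    cross = y[mst.row] != y[mst.col]         # edges across the classes
    boundary = np.union1d(mst.row[cross], mst.col[cross])
    return len(boundary) / len(y)

def imbalance_ratio(y):
    """IR: majority-class size over minority-class size (labels 0..c-1)."""
    counts = np.bincount(y)
    return counts.max() / counts.min()

# Toy example: 100 features, 40 samples, 8 of them in the minority class.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 100))
y = np.array([0] * 32 + [1] * 8)
print(fisher_f1(X, y), boundary_n1(X, y), imbalance_ratio(y))
```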

Table 1: Characteristics of the five microarray datasets.

Dataset         # Classes   # Features   # Samples     IR     Download
CLL-SUB-111         3          11340        111        4.63     [6]
Leukemia-1          3           5327         72        4.22     [7]
BrainTumor-2        4          10367         50        2.14     [7]
BrainTumor-1        5           5920         90       15.00     [7]
Lung Cancer         5          12600        203       23.16     [7]

Table 2 shows the true positive rate for the four approaches: (1) classification before applying FS and oversampling (only classifier), (2) classification after applying FS (CFS), (3) classification after applying oversampling (SMOTE), and (4) classification after applying oversampling and FS (CFS+SMOTE).

Table 2: True positive rates before and after applying feature selection and oversampling on five multiclass microarray datasets, per class, for the four approaches (only classifier, CFS, SMOTE, CFS+SMOTE). Class distributions (%): CLL-SUB-111: 9.91 / 44.14 / 45.95; Leukemia-1: 52.78 / 12.50 / 34.72; BrainTumor-2: 28.00 / 14.00 / 28.00 / 30.00; BrainTumor-1: 66.67 / 11.11 / 11.11 / 4.44 / 6.67; Lung Cancer: 68.47 / 8.37 / 10.34 / 9.85 / 2.96. [The individual per-class rates are garbled in this transcription and are not reproduced.]
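The CFS evaluation function described before Table 1 is Hall's merit heuristic. A minimal sketch follows, using Pearson correlation as a stand-in for the symmetrical uncertainty of Hall's thesis, so it approximates the idea rather than reproducing the Weka implementation:

```python
import numpy as np

def cfs_merit(X, y, subset):
    """Hall's CFS merit for a list of feature (column) indices:
    merit = k * r_cf / sqrt(k + k*(k-1) * r_ff), favoring subsets whose
    features correlate with the class but not with each other."""
    k = len(subset)
    # Average absolute feature-class correlation.
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    # Average absolute feature-feature correlation over all pairs.
    r_ff = np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                    for a, i in enumerate(subset) for j in subset[a + 1:]])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

# Example: a redundant pair scores lower than a complementary pair would.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 200)
X = np.column_stack([y + rng.normal(0, 0.5, 200),   # informative feature
                     y + rng.normal(0, 0.5, 200),   # redundant copy
                     rng.normal(size=200)])         # irrelevant feature
print(cfs_merit(X, y, [0, 1]), cfs_merit(X, y, [0, 2]))
```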

The first conclusion that can be drawn is that classification results are better after applying FS (both before and after applying the SMOTE method). In order to shed light on this issue, we applied the data complexity measures to the microarray datasets after redundant and/or irrelevant genes had been removed. Figure 1 illustrates the behavior of these measures with feature selection (CFS) and without it, where the value of each complexity measure is the average over the binary sub-problems into which each multiclass dataset was decomposed (this decomposition step is sketched at the end of this section); standard deviations are also shown. Whilst F3 and L1 maintained their values and F1 diminished slightly, the N1 measure decreased considerably. Since higher N1 values indicate smaller separation between the class distributions and a more difficult classification task, it seems logical, given the nearest-neighbor-based definition of the measure, that the lower values obtained after applying the filter go hand in hand with improved classification performance.

Fig. 1: Data complexity measures with feature selection (CFS) and without feature selection on five microarray datasets: (a) F1, (b) F3, (c) L1, (d) N1. [Bar charts; numeric axis values are not reproduced.]

To explain in detail the effect of SMOTE on the minority classes, Table 3 briefly summarizes the results obtained by the four data complexity measures for the five microarray datasets. Indicated for each data property (overlap between classes, non-linearity and closeness to the class boundary) are the affected classes and the related complexity measures. As can be seen in Table 2, the minority classes of CLL-SUB-111 and Leukemia-1 achieved a true positive rate of 100% (even before applying oversampling), whilst the results for the majority classes are not as good (especially for CLL-SUB-111). This happens because of the complexity problems present in the classes with more samples, which is why the SMOTE algorithm cannot help to improve classification performance on these datasets. In contrast, the effect of the SMOTE algorithm is remarkable on the minority class of BrainTumor-2 (class 1), which does not show any complexity problem: its true positive rate improves substantially after applying SMOTE along with the CFS filter. The analysis followed for these datasets can be extended to BrainTumor-1 and Lung Cancer.
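The per-dataset complexity values discussed above (and plotted in Figure 1) are averages over one-versus-rest binary sub-problems. A minimal sketch of that decomposition, reusing a two-class measure such as the fisher_f1 function sketched earlier (our naming, not the paper's):

```python
import numpy as np

def ovr_average(measure, X, y):
    """Average a two-class complexity measure over the one-versus-rest
    sub-problems of a multiclass dataset; the standard deviation is what
    the error bars in Fig. 1 show."""
    values = [measure(X, (y == c).astype(int)) for c in np.unique(y)]
    return float(np.mean(values)), float(np.std(values))
```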

Table 3: Theoretical complexity of the five microarray datasets, indicating for each dataset the classes affected by each data property: overlap between classes (F1, F3), non-linearity (L1) and closeness to the class boundary (N1). [The per-dataset class entries are garbled in this transcription; among the recoverable entries are classes 1, 2 and 2, 3, 4 for overlap and non-linearity, and classes 1, 2 for the boundary measure.]

5 Conclusions

In this work we have analyzed a common problem in microarray data, the so-called class imbalance problem, using an oversampling method that is a reference in this area, the SMOTE algorithm, before and after applying feature selection. We have observed that the imbalance ratio alone is not enough to predict the performance of the classifier. As an alternative approach, we have computed several data complexity measures over the imbalanced datasets in order to support the decision of whether or not to apply an oversampling method. In view of the experimental results over five microarray datasets, we recommend analyzing the theoretical complexity of the datasets through the data complexity measures before applying the SMOTE algorithm. The rationale is that the classifier is more affected by the complexity of the data itself than by the imbalance problem. We also suggest using a feature selection method before applying SMOTE. As future research, we plan to extend this study to other oversampling methods.

References

[1] P. W. Novianti, V. L. Jong, K. C. B. Roes, and M. J. C. Eijkemans. Factors affecting the accuracy of a class prediction model in gene expression data. BMC Bioinformatics, 16(1):199, 2015.

[2] T. K. Ho and M. Basu. Data complexity in pattern recognition. Springer, 2006.

[3] A. C. Lorena, I. G. Costa, N. Spolaôr, and M. C. de Souto. Analysis of complexity indices for classification problems: cancer gene expression data. Neurocomputing, 75(1):33-42, 2012.

[4] O. Okun and H. Priisalu. Dataset complexity in gene expression based cancer classification using ensembles of k-nearest neighbors. Artificial Intelligence in Medicine, 45(2):151-162, 2009.

[5] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, pages 321-357, 2002.

[6] Arizona State University. Feature selection datasets. http://featureselection.asu.edu/datasets.php [Last access: October 2015].

[7] A. Statnikov, C. Aliferis, and I. Tsardinos. GEMS: Gene Expression Model Selector. http://www.gems-system.org/ [Last access: October 2015].

[8] D. W. Aha, D. Kibler, and M. K. Albert. Instance-based learning algorithms. Machine Learning, 6(1):37-66, 1991.

[9] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10-18, 2009.

[10] M. A. Hall. Correlation-based feature selection for machine learning. PhD thesis, The University of Waikato, 1999.