Evaluation of Gene Selection Using Support Vector Machine Recursive Feature Elimination


Evaluation of Gene Selection Using Support Vector Machine Recursive Feature Elimination Committee: Advisor: Dr. Rosemary Renaut Dr. Adrienne C. Scheck Dr. Kenneth Hoober Dr. Bradford Kirkman-Liff John Huynh E-mail: jahuynh@yahoo.com

Contents
Introduction
Data
Support Vector Machine
Feature Selection
Hypothesis & Experimental Design
Results
Conclusion
Future Work
Experience
References

Terminology
A sample is a data set in which:
Gene = feature = attribute = column
Example = data point = slide = array = row

      x1     x2     ...  xj     ...  xn     Class
r_1   x_11   x_12   ...  x_1j   ...  x_1n   c_1
r_2   x_21   x_22   ...  x_2j   ...  x_2n   c_2
r_i   x_i1   x_i2   ...  x_ij   ...  x_in   c_i
r_m   x_m1   x_m2   ...  x_mj   ...  x_mn   c_m

Meningioma Dr. Adrienne C. Scheck's Lab, BNI (Barrow Neurological Institute) Meningioma: 20% of primary intracranial tumors Mortality/Morbidity: in one series by Coke et al., the overall survival rates for all patients at 5 and 10 years were 87% and 58%, respectively. Medial Sphenoid Wing Meningioma

Meningioma Correlating clinical course, microarray, NMR, and FISH with WHO classification grades I, II, and III. Tuberculum sellae meningioma

Anatomy Meningioma is a tumor of the arachnoid.

Histology
Neuron & Purkinje cell (cerebellum)
Neuroglial cells:
Astrocytes: nurture, support
Protoplasmic astrocytes (gray matter)
Fibrous astrocytes (white matter)
Oligodendrocytes: myelin, support
Microglia: immune system in the brain
Ependymal cells: epithelium
Blood vessels

Meningioma - Histopathology Meningioma: whorl-like structure + psammoma bodies WHO grade I: benign WHO grade II: (atypical) A meningioma with increased mitotic activity or three or more of the following features: increased cellularity, small cells with high nucleus: cytoplasm ratio, prominent nucleoli, uninterrupted patternless or sheet-like growth, and foci of spontaneous or geographic necrosis. WHO grade III: (anaplastic) A meningioma exhibiting histological features of frank malignancy far in excess of the abnormalities present in atypical meningioma.

BNI Meningioma Data
Affymetrix HG-U133 Plus 2.0 with 54,675 genes. Small data set with many genes.

Grade  Primary  Recurrence  Total
I      15       3           18
II     7        0           7
III    0        1           1
Total  22       4           26

BNI Meningioma Data
Plan A: treat the data as a large data set. Plan B: treat the data as a small data set.

Plan A:
Grade  Train  Test
I      11     4
II     5      2
Total  16     6

Plan B: 15 train, 7 test, 22 total.

BNI Meningioma Data High quality

Microarray Gene expression microarray: pattern of gene expression for each tissue. Oligo-microarray vs cDNA: high density, fixed probe length (25), in-situ synthesis.

Microarray Microarray explores gene expression on a global scale. PM & MM (perfect match and mismatch probes)

Lymphoma Data
Amersham cDNA microarray with 7,129 genes. Tissue = bone marrow, blood.
ALL: acute lymphocytic leukemia. AML: acute myelogenous leukemia.
Incidence: peak at 2-3 years old: 80/1,000,000; 2,400 new cases/year in the USA, 31% of all cancers.

       ALL  AML  Total
Train  27   11   38
Test   20   14   34
Total  47   25   72

Lymphoma Data Good quality Large sample size, smaller feature dimension

Inducer Problem The purpose of a learning machine is to find the most accurate classifier by learning on the training set and testing on the testing set. Mathematically, this is the problem of minimizing the error function E. Let f be the learning algorithm, with data points X = {x1, x2, ..., xi, ..., xm} in R^n and targets {y1, y2, ..., yi, ..., ym} in Y = {-1, +1}: f: X -> Y, xi -> f(xi), and E = sum_i (yi - f(xi))^2. Testing set requirement: the testing set must never be seen during the training process; otherwise the accuracy measured in the testing phase is unexpectedly high.
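The protocol above can be sketched as follows; the linear threshold rule and the toy data are illustrative stand-ins, not the thesis's classifier or data.

```python
# Sketch of the inducer problem: f maps points in R^n to {-1, +1},
# and E counts squared errors. The held-out test set is never used
# while choosing the weights.

def f(x, w, b):
    """Linear decision rule: sign(x . w + b)."""
    s = sum(xi * wi for xi, wi in zip(x, w)) + b
    return 1 if s > 0 else -1

def error(X, y, w, b):
    """E = sum_i (y_i - f(x_i))^2 over a data set."""
    return sum((yi - f(xi, w, b)) ** 2 for xi, yi in zip(X, y))

# Toy data: two features, separable by the first coordinate.
X_train = [(2.0, 1.0), (1.5, -0.5), (-1.0, 0.3), (-2.2, 1.1)]
y_train = [1, 1, -1, -1]
X_test  = [(3.0, 0.0), (-0.5, 2.0)]   # never seen during training
y_test  = [1, -1]

w, b = (1.0, 0.0), 0.0                # weights chosen from X_train only
train_E = error(X_train, y_train, w, b)   # 0: rule fits the training set
test_E  = error(X_test, y_test, w, b)     # 0: rule also generalizes here
```

Measuring E on X_test is only honest because those points played no role in picking w and b.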

Support Vector Machine Map the data into the feature space. Learn in the feature space. Return the result to the output space. Learning function: f(xi) = xi . w + b, with f(xi) > 0 for yi = +1, f(xi) < 0 for yi = -1, and f(xi) = 0 on the decision boundary. (Diagram: input space -> feature space -> output space.)

SVM Characteristics Maximum margin. Low computational cost: the kernel function costs O(n). Training cost: the worst case costs O(n_sv^3 + n_sv^2 m + n_sv m n); the best case costs O(n m^2). Testing cost: O(n).

Linear SVM - Separable Case No kernel = scalar dot product. Margin = 2/||w||, so maximizing the margin means minimizing ||w||^2. Constraints: yi(xi . w + b) >= 1.

Linear SVM - Non-Separable Case Introduce slack variables xi_i to adjust the choice of support vectors when needed. This amounts to adding an upper bound C on the Lagrange multipliers. C = 100 in our experiment.
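Written out, the soft-margin primal problem the slide describes is the standard one; C appears in the dual as an upper bound on the Lagrange multipliers:

```latex
\min_{w,\,b,\,\xi}\; \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{m}\xi_i
\quad \text{s.t.} \quad y_i\,(x_i \cdot w + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0,
\qquad \text{dual: } 0 \le \alpha_i \le C.
```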

Non-Linear SVM There is no linear decision boundary in the input space

Non-Linear Support Vector Machine Introduce a kernel function to map the data into a high-dimensional Euclidean (dot-product) space.

Non-Linear Support Vector Machine Now the data and weights live in the feature space; training and testing are carried out in this high-dimensional space.

Problem of Microarray Data Instance space F1 x F2 x ... x Fi x ... x Fn. The training set must be a large enough subset of the instance space. Over-fitting problem of a small data set: the inducer performs well on the training set but acts poorly on the test set. The computational cost of high-dimensional data is very high (n = 54,675). Multiple testing correction: FDR, SAM, ... Classical analysis methods are not suitable.

Feature Selection The benefits of feature selection are reduced computational cost and reduced over-fitting. Feature selection is actually a search algorithm in the feature space for the optimal feature subset. Given an inducer I and a data set D with features X1, ..., Xi, ..., Xn from a distribution D over the labeled instance space, an optimal feature subset, Xopt, is a subset of the features such that the accuracy of the induced classifier C = I(D) is maximal (Kohavi97).

Feature Selection: How? Filter method vs wrapper method Feature ranking criteria Correlation coefficient Weight

Recursive Feature Elimination RFE is a top-down (backward) wrapper using the weight as the feature ranking criterion. Eliminate either one feature in every loop (slow) or a subset in every loop (fast). Do both yield the same optimal subsets? Are the feature ranking criteria the same?
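A minimal sketch of the weight-based RFE loop described above, on synthetic data. Guyon02 ranks features by the squared SVM weight; here a ridge-regularized least-squares fit stands in for the SVM, purely for brevity.

```python
import numpy as np

def rfe(X, y, n_keep, step=1, ridge=1e-3):
    """Backward elimination: fit a linear model, drop the feature(s)
    with the smallest squared weight, refit on the survivors, repeat.
    (Guyon02 uses the squared SVM weight; a ridge least-squares fit
    stands in here.)"""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        Xs = X[:, remaining]
        w = np.linalg.solve(Xs.T @ Xs + ridge * np.eye(len(remaining)),
                            Xs.T @ y)                 # linear fit
        n_drop = min(step, len(remaining) - n_keep)   # rate of elimination
        worst = np.argsort(w ** 2)[:n_drop]           # smallest |w_j| first
        for idx in sorted(worst, reverse=True):
            del remaining[idx]
    return remaining

# Synthetic check: only features 2 and 7 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = np.sign(X[:, 2] + 0.8 * X[:, 7])
survivors = sorted(rfe(X, y, n_keep=2))   # should recover features 2 and 7
```

Eliminating a whole subset per loop corresponds to step > 1; the experiment described in these slides varies exactly this rate of elimination against the size of the surviving subset.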

Feature Selection Meaning Create nested subsets. Let us define the rate of elimination and the surviving subset. Note that the feature selection module includes an inducer, so the test set must never be seen by either the feature selection module or the evaluation module (Kohavi97).

Full Two-Factorial Experiment Design The evaluation metric is accuracy. The evaluation methods are an independent test and cross-validation. The inducer is an SVM for both feature selection and evaluation (Guyon02). Factor A (row) is the rate of elimination. Factor B (column) is the surviving subset.

Software Design Preprocessing data: linear normalization + log2 transformation (prep.java). SVM, feature selection, and evaluation: Matlab 6.5 (R13).
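A Python sketch of the preprocessing step (the original is prep.java; the per-array target mean of 500 and the clamp at 1.0 before the log are assumptions, since the slide does not give the constants):

```python
import math

def preprocess(arrays, target_mean=500.0):
    """Linear normalization (scale each array to a common mean
    intensity) followed by a log2 transform."""
    out = []
    for row in arrays:
        scale = target_mean / (sum(row) / len(row))        # linear normalization
        out.append([math.log2(max(v * scale, 1.0))         # clamp, then log2
                    for v in row])
    return out

raw = [[100.0, 900.0], [10.0, 90.0]]   # second array is 10x dimmer overall
norm = preprocess(raw)                 # after scaling, both arrays coincide
```

The point of the linear step is that arrays hybridized at different overall intensities become comparable before the variance-stabilizing log2.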

Result: Lymphoma The optimal subset is 32 genes.

Result: Lymphoma Box Plots

Result: Lymphoma ANOVA (panels: Tsuc, Vsuc)

Result: Meningioma The optimal subset is 4 genes.

Result: Meningioma Box Plots Small plan: 4; large plan: 2.

Result: Meningioma ANOVA The correct choice is 4 genes.

Index  Probe
37881  238018_at
22501  222608_s_at
50198  1564431_a_at
16979  21552_x_at

Conclusion No interaction between the rate of elimination and the optimal feature subset. For a small data set, rely on cross-validation.

Future Work More published data sets: large + small, difficult + easy. How small is small? Evaluation methods for small data sets: master gene lists + LOOCV. Over-fitting and cross-validation.

Experience
Not every data mining task will be a success.
Business focus: communication, learning, negotiation, teamwork, leadership, ...
Understand and live with the data: a high-dimensional small data set.
Never alter the data in the preprocessing process (time cost).
Experimental design: good planning.
Observation + Thinking + Reaction = Strategy loop; deal with the facts, not with people.
Repeatable: document experiment results and analysis.
Welcome new ideas, good and bad; read, read, and read.
The never-seen rule for test data, the evaluation algorithm, over-fitting.
Feature selection, SVM, software.

References
(Blum) Avrim L. Blum and Pat Langley, Selection of Relevant Features and Examples in Machine Learning, http://citeseer.ist.psu.edu/blum97selection.html.
(Burges98) Christopher J. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, (1998), Web-print: http://citeseer.ist.psu.edu/397919.html.
(Golub99) Golub et al., Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science 286 (1999), 531-537, http://www.broad.mit.edu/mpr/publications/projects/leukemia/golub et al 1999.pdf.
(Guyon02) Isabelle Guyon et al., Gene Selection for Cancer Classification using Support Vector Machines, Machine Learning 46 (2002), no. 1-3, 389-422, Web-print: http://citeseer.ist.psu.edu/guyon00gene.html.
(Gunn98) Steve R. Gunn, Support Vector Machines for Classification and Regression, (1998), http://www.ecs.soton.ac.uk/~srg/publications/pdf/svm.pdf.
(Kohavi97) Ron Kohavi and George H. John, Wrappers for Feature Subset Selection, Artificial Intelligence 97 (1997), 273-324.
(Sorin03) Sorin Draghici, Data Analysis Tools for DNA Microarrays, Chapman and Hall/CRC, 2003.
WHO Classification: http://neurosurgery.mgh.harvard.edu/newwhobt.htm

Questions?