Comparison of discrimination methods for the classification of tumors using gene expression data

Comparison of discrimination methods for the classification of tumors using gene expression data. Sandrine Dudoit (1), Jane Fridlyand (2) and Terry Speed (2,3). 1. Mathematical Sciences Research Institute, Berkeley; 2. Department of Statistics, UC Berkeley; 3. Walter and Eliza Hall Institute, Melbourne.

Tumor classification A reliable and precise classification of tumors is essential for successful treatment of cancer. Current methods for classifying human malignancies rely on a variety of morphological, clinical, and molecular variables. In spite of recent progress, there are still uncertainties in diagnosis. Also, it is likely that the existing classes are heterogeneous. DNA microarrays may be used to characterize the molecular variations among tumors by monitoring gene expression profiles on a genomic scale. This may lead to a more reliable classification of tumors.

Tumor classification, cont'd There are three main types of statistical problems associated with tumor classification: 1. the identification of new/unknown tumor classes using gene expression profiles - cluster analysis / unsupervised learning; 2. the classification of malignancies into known classes - discriminant analysis / supervised learning; 3. the identification of marker genes that characterize the different tumor classes - variable selection.

Gene expression data Gene expression data on p genes (variables) for n mRNA samples (observations) form an n x p matrix X = (x_ij), with rows indexed by mRNA samples and columns by genes. Here x_ij = expression level of gene j in mRNA sample i, e.g. log2(Red intensity / Green intensity) for two-colour cDNA arrays, or log(Avg. PM - Avg. MM) for Affymetrix arrays.
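
As a concrete illustration (not the authors' code), a minimal numpy sketch of assembling such an n x p matrix from hypothetical two-colour intensities; all names and values here are made up:

    import numpy as np

    # Hypothetical red/green intensities for n = 4 arrays and p = 6 genes.
    rng = np.random.default_rng(0)
    red = rng.lognormal(mean=7.0, sigma=1.0, size=(4, 6))
    green = rng.lognormal(mean=7.0, sigma=1.0, size=(4, 6))

    # x_ij = log2(red/green): the n x p matrix of samples x genes.
    X = np.log2(red / green)
    print(X.shape)  # (4, 6)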

Gene expression data, cont'd In some situations, the mRNA samples are known to belong to certain classes (e.g. follicular lymphoma). Label the classes by {1, 2, ..., K}. Then the data for each observation consist of: x_i = (x_i1, x_i2, ..., x_ip), the gene expression profile / predictor variables, and y_i, the tumor class / response.

The prediction problem Want to predict a response y given predictor variables x. Task: construct a prediction function f, such that f(x) is an accurate predictor of y. E.g. digit recognition for zipcodes; prediction of binding peptide sequences; prediction of tumor class from gene expression data.

Predictors A predictor or classifier for K tumor classes partitions the space X of gene expression profiles into K disjoint subsets, A_1, ..., A_K, such that for a sample with expression profile x = (x_1, ..., x_p) in A_k the predicted class is k. Predictors are built from past experience, i.e. from observations which are known to belong to certain classes. Such observations comprise the learning set (LS) L = {(x_1, y_1), ..., (x_n, y_n)}. Classifier built from a learning set L: C(., L): X -> {1, 2, ..., K}. Predicted class for an observation x: C(x, L).

Linear discriminant analysis Suggested in 1936 by R. A. Fisher, linear discriminant analysis (LDA) consists of: 1. finding linear combinations xa of the gene expression profiles x = (x_1, ..., x_p) with large ratios of between-groups to within-groups sums of squares - the discriminant variables; 2. predicting the class of an observation x by the class whose mean vector is closest to x in terms of the discriminant variables.
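
A short sketch of this two-step rule using scikit-learn's LinearDiscriminantAnalysis on hypothetical data (the projection onto at most K - 1 discriminant variables and the nearest-class-mean rule happen inside fit/predict); this is an illustration, not the authors' implementation:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(1)
    X = rng.normal(size=(30, 10))       # hypothetical 30 samples x 10 genes
    y = rng.integers(1, 4, size=30)     # K = 3 classes labelled 1, 2, 3

    # Fisher LDA: project onto the discriminant variables and classify by
    # the nearest class mean in that projected space.
    lda = LinearDiscriminantAnalysis().fit(X, y)
    print(lda.predict(X[:5]))           # predicted classes
    print(lda.transform(X).shape)       # (30, 2): K - 1 = 2 discriminant variables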

Linear discriminant analysis, cont'd [Figure: scatterplot of the first vs. second discriminant variables for the lymphoma samples, labelled by class (B-CLL, FL, DLBCL).] Lymphoma data: linear discriminant analysis, p = 50 genes.

Nearest neighbor classifier These methods are based on a measure of distance between observations, such as the Euclidean distance or one minus the correlation between two gene expression profiles. The k-nearest neighbor rule, due to Fix and Hodges (1951), classifies an observation x as follows: 1. find the k observations in the learning set that are closest to x; 2. predict the class of x by majority vote, i.e. choose the class that is most common among those k observations. The number of neighbors k is chosen by cross-validation.
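
A hedged sketch of the rule with scikit-learn on hypothetical data, using Euclidean distance and choosing k by cross-validation as described (a correlation-based distance would need a different metric setting):

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(2)
    X = rng.normal(size=(60, 40))       # hypothetical 60 samples x 40 genes
    y = rng.integers(1, 3, size=60)

    # Choose k by cross-validation, then classify by majority vote
    # among the k nearest learning-set observations.
    cv = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=5)
    cv.fit(X, y)
    print(cv.best_params_)
    print(cv.predict(X[:5]))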

Nearest neighbor classifier, cont'd [Figure: scatterplot of the lymphoma samples on the two genes with largest BSS/WSS, labelled by class.] Lymphoma data: two genes with largest BSS/WSS.

Classification trees Binary tree structured classifiers are constructed by repeated splits of subsets (nodes) of the measurement space X into two descendant subsets, starting with X itself. Each terminal subset is assigned a class label and the resulting partition of X corresponds to the classifier. Three main aspects of tree construction: (i) the selection of the splits; (ii) the decision to declare a node terminal or to continue splitting; (iii) the assignment of each terminal node to a class. Different tree classifiers use different approaches to deal with these three issues. Here, we use CART - Classification And Regression Trees - of Breiman et al. (1984).
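
For illustration, a CART-style tree via scikit-learn's DecisionTreeClassifier, which makes the same three design choices (impurity-based split selection, a stopping rule, majority-class labels at terminal nodes); data hypothetical, and sklearn's tree is a close cousin of CART rather than CART itself:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.default_rng(3)
    X = rng.normal(size=(60, 10))
    y = rng.integers(1, 3, size=60)

    # Binary splits chosen to reduce Gini impurity; splitting stops when
    # leaves are pure or too small; leaves labelled by the majority class.
    tree = DecisionTreeClassifier(criterion="gini", min_samples_leaf=5).fit(X, y)
    print(export_text(tree))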

Classification trees, cont'd [Figure: an example classification tree for predicting the class of a new observation x = (x_1, ..., x_p).]

Aggregating predictors Breiman (1996, 1998) found that gains in accuracy could be obtained by aggregating predictors built from perturbed versions of the learning set. In classification, the multiple versions of the predictor are aggregated by voting. Let C(., L_b) denote the classifier built from the bth perturbed learning set L_b, and let w_b denote the weight given to predictions made by this classifier. The predicted class for an observation x is given by argmax_k sum_b w_b I(C(x, L_b) = k).
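
The weighted vote is straightforward to compute; a small numpy sketch with hypothetical votes and weights:

    import numpy as np

    # predictions[b] = class predicted for x by the classifier built on the
    # bth perturbed learning set; w[b] = weight of that classifier.
    predictions = np.array([1, 2, 2, 1, 2])      # hypothetical, K = 2 classes
    w = np.array([1.0, 0.5, 0.8, 1.0, 0.7])

    classes = np.unique(predictions)
    totals = np.array([w[predictions == k].sum() for k in classes])
    print(classes[np.argmax(totals)])            # argmax_k sum_b w_b I(...)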

Bagging Breiman (1996). In the simplest form of bagging - bootstrap aggregating - perturbed learning sets of the same size as the original learning set are formed as non-parametric bootstrap replicates of the learning set, i.e. by drawing at random with replacement from the learning set. Predictors are built for each perturbed dataset and aggregated by plurality voting (w_b = 1).
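
A minimal sketch of bagging trees by hand on hypothetical data (scikit-learn's BaggingClassifier packages the same idea):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(4)
    X = rng.normal(size=(60, 10))
    y = rng.integers(1, 3, size=60)
    x_new = rng.normal(size=(1, 10))

    # B bootstrap replicates of the learning set, one tree per replicate,
    # aggregation by plurality voting (w_b = 1).
    B, votes = 50, []
    for b in range(B):
        idx = rng.integers(0, len(y), size=len(y))   # draw with replacement
        votes.append(DecisionTreeClassifier().fit(X[idx], y[idx]).predict(x_new)[0])
    print(np.bincount(votes).argmax())               # plurality vote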

Variants on bagging Parametric bootstrap: perturbed learning sets are generated according to a mixture of multivariate normal (MVN) distributions. Convex pseudo-data, Breiman (1996): each perturbed learning set is generated by repeating the following n times: 1. select two instances (x, y) and (x', y') at random from the learning set; 2. select at random a number v from the interval [0, d], 0 <= d <= 1, and let u = 1 - v; 3. define a new instance (x'', y'') by y'' = y and x'' = ux + vx'.
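
A numpy sketch of generating one convex pseudo-data learning set under the recipe above (function name and data hypothetical):

    import numpy as np

    def convex_pseudo_data(X, y, d=0.5, rng=None):
        """One perturbed learning set by convex pseudo-data (sketch)."""
        rng = rng or np.random.default_rng()
        n = len(y)
        i = rng.integers(0, n, size=n)         # first instance (x, y)
        j = rng.integers(0, n, size=n)         # second instance (x', y')
        v = rng.uniform(0.0, d, size=n)        # v in [0, d]
        u = 1.0 - v                            # u = 1 - v
        X_new = u[:, None] * X[i] + v[:, None] * X[j]
        return X_new, y[i]                     # new instance keeps the first label

    rng = np.random.default_rng(5)
    X = rng.normal(size=(60, 10))
    y = rng.integers(1, 3, size=60)
    Xp, yp = convex_pseudo_data(X, y, d=0.5, rng=rng)
    print(Xp.shape, yp.shape)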

Boosting Freund and Schapire (1997), Breiman (1998). The data are re-sampled adaptively so that the weights in the re-sampling are increased for those cases most often misclassified. The aggregation of predictors is done by weighted voting. For a learning set L = {(x_1, y_1), ..., (x_n, y_n)}, let {p_1, ..., p_n} denote the re-sampling probabilities, initialized to be equal. The bth step of the boosting algorithm (an adaptation of AdaBoost) is as follows:

Boosting, cont'd 1. Generate a perturbed learning set L_b of size n by sampling with replacement from L using {p_1, ..., p_n}; 2. build a classifier C(., L_b) based on L_b; 3. run the learning set L through the classifier C(., L_b) and let d_i = 1 if the ith case is classified incorrectly and d_i = 0 otherwise; 4. define eps_b = sum_i p_i d_i and beta_b = (1 - eps_b) / eps_b, set w_b = log(beta_b), and update the re-sampling probabilities for the (b+1)st step by p_i <- p_i beta_b^(d_i) / sum_i p_i beta_b^(d_i).
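
A compact sketch of this re-sampling loop on hypothetical data, using tree stumps as the base classifier (a convenience choice, not from the slides) and no guard for classifiers with error above 1/2:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(6)
    X = rng.normal(size=(80, 10))              # hypothetical learning set
    y = rng.integers(0, 2, size=80)

    n, B = len(y), 25
    p = np.full(n, 1.0 / n)                    # re-sampling probabilities
    classifiers, weights = [], []
    for b in range(B):
        idx = rng.choice(n, size=n, replace=True, p=p)                 # step 1
        clf = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])  # step 2
        d = (clf.predict(X) != y).astype(float)                        # step 3
        eps = np.clip((p * d).sum(), 1e-10, 1 - 1e-10)                 # step 4
        beta = (1.0 - eps) / eps
        classifiers.append(clf)
        weights.append(np.log(beta))
        p = p * beta ** d
        p = p / p.sum()                        # normalized update

    # Weighted vote for the first observation:
    scores = {k: sum(w for c, w in zip(classifiers, weights)
                     if c.predict(X[:1])[0] == k) for k in (0, 1)}
    print(max(scores, key=scores.get))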

Vote margins For aggregate classifiers, vote margins assessing the strength of a prediction may be defined for each observation. The vote margin for an observation x is defined to be M(x) = max_k sum_b w_b I(C(x, L_b) = k) / sum_b w_b. When the perturbed learning sets are given equal weights, i.e. w_b = 1, the vote margin is simply the proportion of votes for the winning class, regardless of whether it is correct or not. Margins belong to [0, 1].
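
Computing M(x) from a set of votes takes only a few lines of numpy; a hypothetical example:

    import numpy as np

    def vote_margin(predictions, w):
        """M(x) = max_k sum_b w_b I(C(x, L_b) = k) / sum_b w_b."""
        classes = np.unique(predictions)
        totals = np.array([w[predictions == k].sum() for k in classes])
        return totals.max() / w.sum()

    predictions = np.array([1, 1, 2, 1, 1])   # hypothetical votes for one x
    w = np.ones(5)                            # equal weights: margin = vote share
    print(vote_margin(predictions, w))        # 0.8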

Gene voting, Golub et al. (1999) For binary classification, each gene casts a vote for class 1 or 2, and the votes are aggregated over genes. Gene j's vote for a test set observation x = (x_1, ..., x_p) is v_j = a_j (x_j - b_j), where a_j = (xbar_j^(1) - xbar_j^(2)) / (sd_j^(1) + sd_j^(2)) and b_j = (xbar_j^(1) + xbar_j^(2)) / 2, with xbar_j^(k) and sd_j^(k) the sample mean and standard deviation of gene j in class k. Let V_1 and V_2 denote the sums of the positive and negative votes, respectively. The predicted class is 1 if V_1 >= V_2 and 2 otherwise. This is a minor variant of the sample ML discriminant rule for multivariate normal class densities with constant diagonal covariance matrices.
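
A direct numpy transcription of these formulas on hypothetical two-class data:

    import numpy as np

    rng = np.random.default_rng(7)
    X = rng.normal(size=(40, 20))                  # hypothetical learning set
    y = rng.integers(1, 3, size=40)                # classes 1 and 2
    x_new = rng.normal(size=20)

    m1, m2 = X[y == 1].mean(0), X[y == 2].mean(0)  # class means per gene
    s1, s2 = X[y == 1].std(0, ddof=1), X[y == 2].std(0, ddof=1)
    a = (m1 - m2) / (s1 + s2)                      # a_j
    b = (m1 + m2) / 2.0                            # b_j
    v = a * (x_new - b)                            # v_j = a_j (x_j - b_j)
    V1, V2 = v[v > 0].sum(), -v[v < 0].sum()       # positive/negative vote totals
    print(1 if V1 >= V2 else 2)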

Datasets - Lymphoma Study of gene expression in the three most prevalent adult lymphoid malignancies using a specialized cDNA microarray, the Lymphochip (Alizadeh et al., 2000). n = 81 mRNA samples, three classes: B-cell chronic lymphocytic leukemia (B-CLL), 29 cases; follicular lymphoma (FL), 9 cases; diffuse large B-cell lymphoma (DLBCL), 43 cases. p = 4,682 genes.

Datasets - Leukemia Study of gene expression in two types of acute leukemias using Affymetrix high-density oligonucleotide arrays (Golub et al., 1999). n = 72 mRNA samples, three classes: B-cell acute lymphoblastic leukemia (B-cell ALL), 38 cases; T-cell acute lymphoblastic leukemia (T-cell ALL), 9 cases; acute myeloid leukemia (AML), 25 cases. p = 6,817 genes.

Data pre-processing Imputation: k-nearest neighbor imputation, where genes are the neighbors and the similarity measure between two genes is the correlation of their expression profiles. Standardization: standardize observations (arrays) to have mean 0 and variance 1 across variables (genes).
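
A rough Python sketch of both steps on hypothetical data; note that scikit-learn's KNNImputer uses Euclidean rather than correlation distance between gene profiles, so it only approximates the scheme described here:

    import numpy as np
    from sklearn.impute import KNNImputer

    rng = np.random.default_rng(8)
    X = rng.normal(size=(30, 100))                # 30 arrays x 100 genes
    X[rng.random(X.shape) < 0.02] = np.nan        # sprinkle missing values

    # Imputation with genes as the neighbors: transpose so rows are genes.
    X_imp = KNNImputer(n_neighbors=5).fit_transform(X.T).T

    # Standardize each array (row) to mean 0, variance 1 across genes.
    X_std = (X_imp - X_imp.mean(1, keepdims=True)) / X_imp.std(1, keepdims=True)
    print(np.allclose(X_std.mean(1), 0))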

Study design The original datasets are repeatedly and randomly divided into a learning set and a test set, comprising respectively 2/3 and 1/3 of the data. For each of N = 150 runs: select a subset of p genes from the learning set based on their ratio of between-groups to within-groups sums of squares, BSS/WSS (p = 50 for lymphoma, p = 40 for leukemia); build the different predictors using the learning set restricted to these p genes; apply the predictors to the observations in the test set to obtain test set error rates.
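
The per-gene BSS/WSS ratio used for this screening can be computed as below (numpy sketch, hypothetical data; the function name is ours):

    import numpy as np

    def bss_wss(X, y):
        """Per-gene ratio of between- to within-group sums of squares."""
        overall = X.mean(0)
        bss = np.zeros(X.shape[1])
        wss = np.zeros(X.shape[1])
        for k in np.unique(y):
            Xk = X[y == k]
            bss += len(Xk) * (Xk.mean(0) - overall) ** 2   # between-groups
            wss += ((Xk - Xk.mean(0)) ** 2).sum(0)         # within-groups
        return bss / wss

    rng = np.random.default_rng(9)
    X = rng.normal(size=(54, 500))                 # hypothetical learning set
    y = rng.integers(1, 4, size=54)
    top = np.argsort(bss_wss(X, y))[::-1][:50]     # keep the 50 top-ranked genes
    print(top[:10])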

Lymphoma: p=50 genes [Figure: boxplots of test set error rates for NN, FLDA, DLDA, DQDA, CART, Bag CART, CPD CART, MVN CART and Boost CART.] Lymphoma. Boxplots of test set error rates, p = 50 genes.

Leukemia: p=40 genes [Figure: test set error rates for NN, FLDA, DLDA, DQDA, Golub, CART, Bag CART, CPD CART, MVN CART and Boost CART.] Leukemia, two classes. Test set error rates, p = 40 genes.

Leukemia, k=3: p=40 genes [Figure: test set error rates for NN, FLDA, DLDA, DQDA, CART, Bag CART, CPD CART, MVN CART and Boost CART.] Leukemia, three classes. Test set error rates, p = 40 genes.

[Figure: images of the correlation matrix between the 81 mRNA samples (B-CLL, FL and DLBCL cases), computed from all 4,682 genes (left) and from the top 50 genes (right).] Lymphoma. Images of correlation matrix between 81 mRNA samples.

Bagging [Figure: for each observation, the percentage of correct predictions (upper panel) and quartiles of vote margins (lower panel).] Lymphoma. Proportion of correct predictions (upper panel) and quartiles of vote margins (lower panel) for bagged CART predictors, p = 50 genes.

[Figure: boxplots of log(R/G, 2) by class for the 5 genes with largest BSS/WSS, and the same with the two largest classes (B-CLL and DLBCL) pooled.] Lymphoma, three classes. Within-class IQRs for 5 genes.

[Figure: test set error rates for NN, LDA, CART, bagging and boosting, each under three gene-selection schemes.] Lymphoma. Test set error rates for different gene sets.

Results In the main comparison, the nearest neighbor predictor has the smallest error rates, while LDA has the highest. Aggregating predictors improves performance, the largest gains coming from boosting and from bagging with convex pseudo-data. For the two-class leukemia data, diagonal LDA as in Golub et al. performs similarly to nearest neighbors, boosting, and bagging with convex pseudo-data.

For the lymphoma and leukemia datasets, increasing the number of selected genes to p = 200 does not much affect the performance of the various predictors. A more careful selection of a small number of genes (p = 10) improves the performance of LDA dramatically.

Discussion Diagonal LDA vs. correlated LDA: ignoring correlation between genes helps here. Unlike classification trees and nearest neighbors, diagonal or correlated LDA is unable to take into account gene interactions. Although nearest neighbors are simple and intuitive classifiers, their main limitation is that they give very little insight into mechanisms underlying the class distinctions. Classification trees are capable of handling and revealing interactions between variables.

Useful by-product of aggregated predictors: vote margins. The relative performance of the different predictors may vary with variable selection.

Open questions Variable selection. A crude criterion such as BSS/WSS may not identify the genes that discriminate between all the classes and may not reveal interactions between genes. Statistical vs. biological significance. Cluster analysis: identification and validation of new tumor classes.

Acknowledgments Biologists: Ash Alizadeh, Pat Brown, Mike Eisen. Statisticians: Leo Breiman, Sam Buttrey, David Nelson, Mark van der Laan, Jean Yang.