Introduction to Discrimination in Microarray Data Analysis


Jane Fridlyand, CBMB, University of California, San Francisco
Genentech Hall Auditorium, Mission Bay, UCSF, October 23, 2004

Case study: van 't Veer breast cancer study

Investigate whether a tumor's ability to metastasize is acquired later in development or is inherent in the initial gene expression signature.
Retrospective sampling of node-negative women: 44 non-recurrences within 5 years of surgery and 34 recurrences. Additionally, 19 test samples (12 recurrences and 7 non-recurrences).
Goal: demonstrate that the gene expression profile is independently predictive of recurrence.
Reference: L. van 't Veer et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan 2002.

Data

Array: two-color Agilent oligo arrays, ~24,000 genes.
Pre-processing: image quantification, background subtraction and normalization.
Filtering: at least 2-fold differential expression relative to the reference and a p-value for being expressed < 0.01 in at least 5 samples; 4,348 such genes remain.
Imputation: k-nearest-neighbors imputation procedure (Troyanskaya et al.):
- For each gene with at least one missing value, find the 5 genes most highly correlated with it.
- Replace each missing value with the average of those 5 genes' values in the sample carrying the missing value.

[Flowchart: biological question → experimental design → microarray experiment → quality measurement (failed: redo; pass: continue) → image analysis → normalization → preprocessing (estimation, testing) → analysis (clustering, discrimination, structure discovery) → biological verification and interpretation; illustrated with a small sample-by-gene expression matrix.]
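The k-nearest-neighbors imputation described above can be sketched in plain Python. This is a minimal illustration, not the authors' code: function names are hypothetical, `None` marks a missing value, and Pearson correlation is computed over the positions where both genes are observed.

```python
import math

def pearson(x, y):
    """Pearson correlation over positions where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sx = math.sqrt(sum((a - mx) ** 2 for a, _ in pairs))
    sy = math.sqrt(sum((b - my) ** 2 for _, b in pairs))
    return sxy / (sx * sy) if sx > 0 and sy > 0 else 0.0

def knn_impute(expr, k=5):
    """Impute missing values (None) in a genes-by-samples matrix.

    For each gene with missing values, find the k genes most highly
    correlated with it, then fill each gap with the average of those
    genes' values in the affected sample (as described on the slide).
    """
    filled = [row[:] for row in expr]
    for g, row in enumerate(expr):
        if None not in row:
            continue
        # rank the other genes by |correlation| with this gene
        neighbors = sorted(
            (h for h in range(len(expr)) if h != g),
            key=lambda h: abs(pearson(row, expr[h])),
            reverse=True,
        )[:k]
        for s, val in enumerate(row):
            if val is None:
                donors = [expr[h][s] for h in neighbors if expr[h][s] is not None]
                filled[g][s] = sum(donors) / len(donors) if donors else 0.0
    return filled
```

On a toy matrix where genes 1 and 2 are perfectly correlated with gene 0, the missing entry of gene 0 is filled with the average of their values in that sample.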

Tumor classification using gene expression data

Three main types of statistical problems are associated with tumor classification:
- identification of new/unknown tumor classes using gene expression profiles (unsupervised learning: clustering);
- classification of malignancies into known classes (supervised learning: discrimination);
- identification of marker genes that characterize the different tumor classes (feature or variable selection).

Basic principles of discrimination

Each object is associated with a class label (or response) Y ∈ {1, 2, …, K} and a feature vector (vector of predictor variables) of G measurements: X = (X_1, …, X_G).
Aim: predict Y from X.
[Illustration: objects with predefined classes {1, 2, …, K}; e.g. feature vector X = {colour, shape}, observed X = {red, square}; the classification rule must assign Y.]

Discrimination and allocation

Learning: data with known classes → classification technique → classification rule (discrimination).
Prediction: data with unknown classes → classification rule → class assignment (allocation).

In the case study, classes are predefined by clinical outcome: bad prognosis (recurrence < 5 years) vs. good prognosis (no recurrence within 5 years). The objects are arrays and the feature vectors are gene expression profiles; a new array is assigned a prognosis (metastasis-free beyond 5 years?) by the classification rule.
Reference: L. van 't Veer et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan 2002.

Classification rule

Choices to specify: the classification procedure, feature selection, parameters (pre-determined or estimated), distance measure, and aggregation method, together with a performance-assessment scheme, e.g. cross-validation.
One can think of the classification rule as a black box; some methods provide more insight into the box than others. Performance needs to be assessed for every classification rule.

Why select features?

- It can lead to better classification performance by removing variables that are noise with respect to the outcome.
- It may provide useful insights into the etiology of a disease.
- It can eventually lead to diagnostic tests (e.g., a "breast cancer chip").

Note how selecting genes based on their association with recurrence biases the results towards clear grouping.

Approaches to feature selection

Methods fall into three basic categories:
- filter methods,
- wrapper methods,
- embedded methods.
The simplest and most frequently used are the filter methods. (Adapted from A. Hartemink.)

Filter methods

Features in R^p → feature selection → R^s with s ≪ p → classifier design.
Features are scored independently, and the top s are used by the classifier.
Scores: correlation, mutual information, t-statistic, F-statistic, p-value, tree importance statistic, etc.
Filter methods are easy to interpret and can provide some insight into the disease markers. (Adapted from A. Hartemink.)

Maximum likelihood discriminant rule

A maximum likelihood estimator (MLE) chooses the parameter value that makes the observations most probable. For known class-conditional densities p_k(X), the maximum likelihood (ML) discriminant rule predicts the class of an observation X by
C(X) = argmax_k p_k(X)
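A filter method of the kind described above fits in a few lines. The sketch below (illustrative names, not from the lecture) scores each gene independently with a pooled two-sample t-statistic and keeps the indices of the top s genes:

```python
import math

def t_stat(a, b):
    """Two-sample t-statistic (pooled variance) for one gene."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))

def filter_select(X, y, s):
    """Score genes independently and return the indices of the top s.

    X is samples-by-genes; y is a 0/1 class label per sample.
    """
    genes = list(zip(*X))  # transpose to genes-by-samples
    scores = []
    for g, vals in enumerate(genes):
        a = [v for v, lab in zip(vals, y) if lab == 0]
        b = [v for v, lab in zip(vals, y) if lab == 1]
        scores.append((abs(t_stat(a, b)), g))
    scores.sort(reverse=True)  # largest |t| first
    return [g for _, g in scores[:s]]
```

Because each gene is scored on its own, the method is fast and the selected genes are directly interpretable as candidate markers, which is exactly the appeal noted on the slide.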

Gaussian ML discriminant rules

For multivariate Gaussian (normal) class densities, X | Y = k ~ N(µ_k, Σ_k), the ML classifier is
C(X) = argmin_k {(X − µ_k)' Σ_k^{-1} (X − µ_k) + log |Σ_k|}
In general this is a quadratic rule (quadratic discriminant analysis, or QDA). In practice, the population mean vectors µ_k and covariance matrices Σ_k are estimated by the corresponding sample quantities.

ML discriminant rules: special cases

[DLDA] Diagonal linear discriminant analysis: class densities share the same diagonal covariance matrix Σ = diag(s_1^2, …, s_p^2).
[DQDA] Diagonal quadratic discriminant analysis: class densities have different diagonal covariance matrices Σ_k = diag(s_1k^2, …, s_pk^2).
Note: the weighted gene voting of Golub et al. (1999) is a minor variant of DLDA for two classes (with a different variance calculation).
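A minimal DLDA sketch, under the assumptions above (shared diagonal covariance, sample estimates for the parameters; function names are illustrative): since log|Σ| is the same for every class, the rule reduces to the variance-scaled squared distance to each class mean.

```python
from collections import defaultdict

def dlda_fit(X, y):
    """Estimate class mean vectors and a pooled per-gene variance
    (DLDA: one diagonal covariance matrix shared by all classes)."""
    by_class = defaultdict(list)
    for row, lab in zip(X, y):
        by_class[lab].append(row)
    p, n = len(X[0]), len(X)
    means = {k: [sum(r[j] for r in rows) / len(rows) for j in range(p)]
             for k, rows in by_class.items()}
    # pooled within-class variance for each gene j
    var = [sum((r[j] - means[k][j]) ** 2
               for k, rows in by_class.items() for r in rows) / (n - len(by_class))
           for j in range(p)]
    return means, var

def dlda_predict(x, means, var):
    """C(x) = argmin_k sum_j (x_j - mu_kj)^2 / s_j^2: with a shared
    diagonal covariance the log-determinant term cancels across classes."""
    return min(means, key=lambda k: sum((xj - mj) ** 2 / vj
                                        for xj, mj, vj in zip(x, means[k], var)))
```

With class-specific variances in place of the pooled `var`, plus the log-variance term in the score, the same skeleton gives DQDA.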

Nearest neighbor classification

Based on a measure of distance between observations, e.g. Euclidean distance or one minus correlation.
The k-nearest-neighbor rule (Fix and Hodges, 1951) classifies an observation X as follows:
- find the k observations in the learning set closest to X;
- predict the class of X by majority vote, i.e., choose the class that is most common among those k observations.
The number of neighbors k can be chosen by cross-validation (more on this later).

[Figure: nearest neighbor rule illustrated on a two-class scatter plot.]
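The two steps of the rule translate directly into code. A minimal sketch with Euclidean distance (function name illustrative):

```python
import math
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    """k-nearest-neighbor rule: find the k learning-set observations
    closest to x (Euclidean distance) and take a majority vote."""
    dist = sorted(
        (math.dist(x, row), lab) for row, lab in zip(X_train, y_train)
    )
    votes = Counter(lab for _, lab in dist[:k])
    return votes.most_common(1)[0][0]
```

Swapping `math.dist` for a one-minus-correlation distance changes the metric without touching the voting step; choosing k is left to cross-validation, as the slide notes.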

Classification trees

Partition the feature space into a set of rectangles, then fit a simple model in each one.
Binary tree-structured classifiers are constructed by repeated splits of subsets (nodes) of the measurement space X into two descendant subsets, starting with X itself.
Each terminal subset is assigned a class label; the resulting partition of X corresponds to the classifier.

[Figure: example tree. Root split: gene 1, M_i1 < −0.67; second split: gene 2, M_i2 > 0.18. Leaves are labeled 0 = no recurrence, 1 = recurrence, and the corresponding rectangular partition of the (gene 1, gene 2) plane is shown.]
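A fitted tree is just nested binary splits ending in labeled leaves, so prediction is a walk from the root to a terminal node. The sketch below encodes one plausible reading of the slide's two-gene example; the thresholds are from the figure, but the leaf labels are an assumption, since the flattened figure is ambiguous.

```python
# One reading of the slide's example tree (leaf labels assumed;
# 0 = no recurrence, 1 = recurrence).
TREE = {
    "split": ("gene1", -0.67),          # go left when gene1 < -0.67
    "left": {"split": ("gene2", 0.18),  # go left when gene2 < 0.18
             "left": {"leaf": 1},
             "right": {"leaf": 0}},
    "right": {"leaf": 1},
}

def tree_predict(x, node=TREE):
    """Follow the binary splits down to a terminal node (each terminal
    subset carries a class label) and return that label."""
    while "leaf" not in node:
        var, thresh = node["split"]
        node = node["left"] if x[var] < thresh else node["right"]
    return node["leaf"]
```

Each root-to-leaf path corresponds to one rectangle of the feature-space partition shown in the figure.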

Performance assessment

Any classification rule needs to be evaluated for its performance on future samples. It is almost never the case in microarray studies that a large, independent, population-based collection of samples is available at the time of the initial classifier-building phase. One needs to estimate future performance based on what is available: often the same set that is used to build the classifier.
Performance of the classifier can be assessed by:
- cross-validation;
- a test set;
- independent testing on a future dataset.

Diagrams of performance assessment

Resubstitution estimation: build the classifier on the training set and assess performance on that same training set.
Test-set estimation: build the classifier on the training set and assess performance on an independent test set.
Cross-validation: repeatedly split the learning set into a training portion (used to build the classifier) and a held-out portion (used to assess it), and aggregate the resulting CV error estimates.

Performance assessment (III)

It is common practice to do feature selection on the whole learning set and then cross-validate only the model-building and classification steps. However, the features are usually not known in advance, and the intended inference includes feature selection; CV estimates computed this way tend to be downward biased. Features (variables) should be selected only from the training portion used to build the model within each fold, not from the entire set.

Summary of case study analysis

Classifier: k-NN. Feature-ranking statistic: F-statistic. Performance assessment: leave-one-out cross-validation error rate. Optimization parameters: number of neighbors and number of features (k = 1, 3 and p = 10, 30, 50, 100, 150, 200, 250, 300).
Apply the classifier with the smallest leave-one-out cross-validation error rate to the test set.
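The unbiased scheme described above can be sketched end to end: leave-one-out CV where the gene ranking is recomputed inside every fold, so the held-out sample never influences feature selection. This is an illustration of the principle, not the lecture's code; it uses a t-like score and k-NN for compactness where the case study used an F-statistic.

```python
import math
from collections import Counter

def t_score(vals, labs):
    """|pooled two-sample t|-like score for one gene."""
    a = [v for v, l in zip(vals, labs) if l == 0]
    b = [v for v, l in zip(vals, labs) if l == 1]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((v - ma) ** 2 for v in a) / max(len(a) - 1, 1)
    vb = sum((v - mb) ** 2 for v in b) / max(len(b) - 1, 1)
    sp2 = ((len(a) - 1) * va + (len(b) - 1) * vb) / (len(a) + len(b) - 2)
    return abs(ma - mb) / math.sqrt(sp2 * (1 / len(a) + 1 / len(b)) + 1e-12)

def knn(x, X, y, k):
    """Majority vote among the k nearest training points."""
    d = sorted((math.dist(x, r), l) for r, l in zip(X, y))
    return Counter(l for _, l in d[:k]).most_common(1)[0][0]

def loocv_error(X, y, p, k):
    """Leave-one-out CV error with feature selection redone inside
    each fold, so the held-out sample never affects gene ranking."""
    errors = 0
    for i in range(len(X)):
        Xtr = X[:i] + X[i + 1:]
        ytr = y[:i] + y[i + 1:]
        # rank genes on the training portion only, keep the top p
        genes = sorted(range(len(X[0])),
                       key=lambda g: t_score([r[g] for r in Xtr], ytr),
                       reverse=True)[:p]
        pred = knn([X[i][g] for g in genes],
                   [[r[g] for g in genes] for r in Xtr], ytr, k)
        errors += pred != y[i]
    return errors / len(X)
```

Moving the `genes = sorted(...)` line outside the loop reproduces the biased variant the slide warns against: the held-out sample would then help choose the genes it is later classified with.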

Results

The best classification is obtained with k = 3 and p = 200: 23/78, or 29%, misclassified.
Confusion matrix:
              Predicted
True        no rec   rec
no rec         33     11
rec            12     22
[Plot: Fisher exact test p-values for the association between true and predicted class labels; the green horizontal line is drawn at p = 0.01.]

Application to the test set

1. Identify the 200 variables with the highest F-statistic using the entire training set.
2. Build a k-NN classifier with k = 3 neighbors using the entire training set.
3. Apply that classifier to the test set.
Confusion matrix:
              Predicted
True        no rec   rec
no rec          5      2
rec             4      8
6/19, or 32%, of the test samples are misclassified.
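The Fisher exact test used above has a direct stdlib implementation for a 2x2 table: fix the margins, compute the hypergeometric probability of every possible table, and sum those no more probable than the observed one. A sketch (one common definition of the two-sided p-value; SciPy's `fisher_exact` uses the same convention):

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]:
    sum the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):  # P(upper-left cell = x) under fixed margins
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # small tolerance guards against float ties
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))
```

Applied to the training-set confusion matrix above, `fisher_exact_2x2(33, 11, 12, 22)` gives a p-value well below the 0.01 line drawn in the plot, i.e. the predicted labels are strongly associated with the true ones despite the 29% error rate.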

[Figure: heatmaps of the training and test sets with samples grouped by class (no recurrence vs. recurrence). Samples are clustered within class; genes are clustered based on the training set, and the resulting gene order is used for the test set.]

Case studies

Classification rule (learning set: bad vs. good prognosis): feature selection by correlation with the class labels, very similar to a t-test, using cross-validation to select 70 genes.

Reference 1, retrospective study: L. van 't Veer et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan 2002. Result: the gene expression profile is a more powerful predictor than standard systems based on clinical and histologic criteria.

Reference 2, cohort study: M. van de Vijver et al. A gene expression signature as a predictor of survival in breast cancer. The New England Journal of Medicine, Dec 2002. 295 samples selected from the Netherlands Cancer Institute tissue bank (1984–1995).

Reference 3, prospective trials (Aug 2003): clinical trials by Agendia (formed by researchers from the Netherlands Cancer Institute; started Oct 2003): 1) 5,000 subjects [Health Council of the Netherlands]; 2) 5,000 subjects [New York-based Avon Foundation]. Custom arrays made by Agilent, including the 70 genes plus 1,000 controls. http://www.agendia.com/

van de Vijver's breast data (NEJM, 2002)

A consecutive cohort of 295 additional breast cancer patients, a mix of node-negative and node-positive samples. The goal is to use the previously developed predictor to identify patients at risk for metastasis. The predicted class was significantly associated with time to recurrence in a multivariate Cox proportional hazards model.

Validation

- Apply the predictor developed on the training set to the new samples to obtain a binary label: recurrence or no recurrence.
- Include that label as a covariate in a Cox proportional hazards model for survival, along with known clinico-pathological risk factors.
- Assess whether the recurrence label is an independent predictor of time to recurrence (i.e., whether the covariate is significant in the multivariate model).
The answer is yes.

Software

Various packages in R:
PAM: http://www-stat.stanford.edu/~tibs/pam/index.html
MeV: http://www.tigr.org/software/tm4/mev.html
A number of proprietary tools also exist.

Acknowledgements

UCSF/CBMB: Ajay Jain, Mark Segal
UCSF Cancer Center Array Core, Jain Lab
SFGH: Agnes Paquet, David Erle, Andrea Barczak, UCSF Sandler Genomics Core Facility
UCB: Terry Speed, Jean Yang, Sandrine Dudoit