Classification with microarray data


Classification with microarray data Aron Charles Eklund eklund@cbs.dtu.dk DNA Microarray Analysis - #27612 January 8, 2010

The rest of today
Now: What is classification, and why do we do it? How to develop a classifier with microarray data. How to evaluate a classifier.
Lunch
Case study (Wiktor Mazin). Classification exercises.

Track A: The DNA Array Analysis Pipeline
Question/hypothesis -> Experimental design (design array or buy a standard array) -> Sample preparation -> Hybridization -> Image analysis (for NGS data: mapping) -> Expression index calculation -> Normalization -> Comparable gene expression data -> Statistical analysis / fit to model (time series) -> Advanced data analysis: clustering, PCA, gene annotation analysis, promoter analysis, classification, meta-analysis, survival analysis, regulatory networks.

Classification with gene expression profiles
Given an object with a set of features (input data), a classifier predicts the class of the object. Here the features are genes: the expression profile of a tumor specimen (e.g. patient #7: ESR1 876.4, ERBB2 100.1, BRCA1 798.8) goes into the classifier, which predicts a class: low-risk tumor or aggressive tumor.
A classifier is simply a set of rules, or an algorithm:
if (ESR1 > 100) AND (ERBB2 > 550) then tumor = low-risk, else tumor = aggressive
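Such a rule is trivially expressed as code. A minimal sketch (in Python for illustration; the gene names and thresholds follow the slide, and the example profile uses the slide's patient #7 values):

```python
def classify(profile):
    """Rule-based classifier from the slide: two expression thresholds."""
    if profile["ESR1"] > 100 and profile["ERBB2"] > 550:
        return "low-risk"
    return "aggressive"

# Expression profile of patient #7 (values from the slide)
patient7 = {"ESR1": 876.4, "ERBB2": 100.1, "BRCA1": 798.8}
print(classify(patient7))  # ERBB2 is below 550, so this rule says "aggressive"
```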

Why use classification with microarrays?
1. Classification as an end in itself: diagnosis of disease; selection of optimal chemotherapy; identification of pathogenic bacteria.
2. Learn from the classifier: identify drug targets. Or with SNP arrays: genetic screening, designer babies.

Two types of classifier
1. Simple methods (~ univariate): easy to understand, easy to derive, easy to evaluate.
2. Machine learning methods (~ multivariate): complicated.
Even though we teach #2, I recommend using #1!

A very simple classifier
One gene! If XIST > 7, sex = female.
(Figure: XIST expression values separate female from male samples.) Squamous cell carcinoma data from Raponi et al (2006), but I made up the classifier myself!

Another simple classifier
The average of 25 correlated genes. CIN = Chromosomal INstability; high = bad prognosis, low = better prognosis. Carter et al (2006) Nature Genetics 38:1043

CIN signature predicts clinical outcome
Figure 2: The CIN25 signature predicts survival in 12 independent cohorts representing six cancer types. Data sets are identified by the first author of the published manuscript and the type of cancer profiled. P values correspond to the log-rank test comparing the survival curves generated from the CIN25 signature as a categorical predictor.
Cohorts: Bhattacharjee, lung adenocarcinoma (N = 62, P = 0.0016); Lopez-Rios, mesothelioma (N = 86, P = 0.00027); Pomeroy, medulloblastoma (N = 60, P = 0.0035); Van de Vijver, breast (N = 254, P < 0.00001); Bild, breast (N = 170, P = 0.0013); Nutt, glioma (N = 50, P = 0.0041); Shipp, lymphoma (N = 58, P = 0.015); van 't Veer, breast (N = 97, P = 0.0037); Freije, glioma (N = 85, P = 0.0012); Phillips, glioma (N = 77, P = 0.029); Sotiriou, breast (N = 179, P = 0.00064); Wang, breast (N = 286, P = 0.0079).

Recipe for a simple univariate classifier
1a. Pick a gene or set of genes that you already know are relevant, or
1b. Choose a set of genes that are differentially expressed between the two groups (t-test, etc.).
2. Choose a simple rule for combining the gene measurements (mean, median, etc.).
3. Choose a decision threshold (or cutoff).
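The recipe above fits in a few lines of code. A sketch (in Python for illustration): a hypothetical 3-gene signature is combined by its mean and compared to an arbitrary cutoff; all gene names and values are made up.

```python
from statistics import mean

def signature_score(profile, genes):
    """Simple univariate-style score: mean expression of the signature genes."""
    return mean(profile[g] for g in genes)

def classify(profile, genes, cutoff):
    """Call the sample 'high' (bad prognosis) if its score exceeds the cutoff."""
    return "high" if signature_score(profile, genes) > cutoff else "low"

# Hypothetical 3-gene signature and sample, for illustration only
sig = ["geneA", "geneB", "geneC"]
sample = {"geneA": 5.2, "geneB": 7.1, "geneC": 6.0}
print(classify(sample, sig, cutoff=6.5))  # mean = 6.1 -> "low"
```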

Now, a complicated classifier
A 21-gene recurrence score for breast cancer. The recurrence score, on a scale from 0 to 100, is derived from the reference-normalized expression measurements in four steps. First, expression for each gene is normalized relative to the expression of the five reference genes (ACTB [the gene encoding beta-actin], GAPDH, GUS, RPLP0, and TFRC). Reference-normalized expression measurements range from 0 to 15, with a 1-unit increase reflecting approximately a doubling of RNA. Genes are grouped on the basis of function, correlated expression, or both. Second, the GRB7, ER, proliferation, and invasion group scores are calculated from individual gene-expression measurements, as follows: GRB7 group score = 0.9 × GRB7 + 0.1 × HER2 (if the result is less than 8, the GRB7 group score is considered 8); ER group score = (0.8 × ER + 1.2 × PGR + BCL2 + SCUBE2) ÷ 4; proliferation group score = (Survivin + KI67 + MYBL2 + CCNB1 [the gene encoding cyclin B1] + STK15) ÷ 5 (if the result is less than 6.5, the proliferation group score is considered 6.5); and invasion group score = (CTSL2 [the gene encoding cathepsin L2] + MMP11 [the gene encoding stromelysin 3]) ÷ 2. Third, the unscaled recurrence score (RSU) is calculated with the use of coefficients that are predefined on the basis of regression analysis of gene expression and recurrence in the three training studies: RSU = +0.47 × GRB7 group score − 0.34 × ER group score + 1.04 × proliferation group score + 0.10 × invasion group score + 0.05 × CD68 − 0.08 × GSTM1 − 0.07 × BAG1. A plus sign indicates that increased expression is associated with an increased risk of recurrence, and a minus sign indicates that increased expression is associated with a decreased risk of recurrence. Fourth, the recurrence score (RS) is rescaled from the unscaled recurrence score as follows: RS = 0 if RSU < 0; RS = 20 × (RSU − 6.7) if 0 ≤ RSU ≤ 100; and RS = 100 if RSU > 100.
...a complex function of several variables!! Paik et al (2004) NEJM 351:2817
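The four steps can be sketched as code. A minimal Python version (the gene values below are hypothetical; the rescaling is implemented here as a simple clamp of 20 * (RSU - 6.7) to the range [0, 100], which is one reading of the published formula):

```python
def recurrence_score(g):
    """Sketch of the 21-gene recurrence score arithmetic described above.
    g maps gene symbols to reference-normalized expression (0-15 scale)."""
    grb7 = max(0.9 * g["GRB7"] + 0.1 * g["HER2"], 8)        # floor at 8
    er = (0.8 * g["ER"] + 1.2 * g["PGR"] + g["BCL2"] + g["SCUBE2"]) / 4
    prolif = max((g["Survivin"] + g["KI67"] + g["MYBL2"]
                  + g["CCNB1"] + g["STK15"]) / 5, 6.5)      # floor at 6.5
    invasion = (g["CTSL2"] + g["MMP11"]) / 2
    rsu = (0.47 * grb7 - 0.34 * er + 1.04 * prolif + 0.10 * invasion
           + 0.05 * g["CD68"] - 0.08 * g["GSTM1"] - 0.07 * g["BAG1"])
    # Rescale: clamp 20 * (RSU - 6.7) to the 0-100 recurrence score range
    return min(max(20 * (rsu - 6.7), 0), 100)

genes = ["GRB7", "HER2", "ER", "PGR", "BCL2", "SCUBE2", "Survivin", "KI67",
         "MYBL2", "CCNB1", "STK15", "CTSL2", "MMP11", "CD68", "GSTM1", "BAG1"]
g = {k: 8.0 for k in genes}           # hypothetical profile: every gene at 8
print(round(recurrence_score(g), 1))  # -> 53.2
```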

Machine learning
The goal: given a set of training data (data of known class), develop a classifier that accurately predicts the class of novel data. This is also called statistical learning.
Training set (gene expression by tumor specimen number):
gene        #1     #2     #3     #4     #5     #6     #7
ESR1        143.5  165.0  134.2  522.4  599.1  288.3  209.0
ERBB2       801.1  291.7  293.1  827.3  980.0  155.5  128.1
BRCA1       129.1  238.0  883.3  281.0  95.1   385.2  383.4
aggressive  no     yes    yes    no     no     yes    ???

Classification as mapping
A classifier based on the expression levels of two genes, G1 and G2, maps each point in expression space to a class. Training data set: aggressive tumor / friendly tumor / unknown.
How do we do this algorithmically? Linear vs. nonlinear boundaries? Multiple dimensions...?

Recipe for classification with machine learning
1. Choose a classification method (simpler is probably better)
2. Feature selection / dimension reduction (fewer is probably better)
3. Train the classifier
4. Assess the performance of the classifier

1. Choose a classification method
Linear: linear discriminant analysis (LDA), nearest centroid.
Nonlinear: k nearest neighbors (knn), artificial neural network (ANN), support vector machine (SVM).

Linear discriminant analysis (LDA)
Find a line / plane / hyperplane separating the classes.
Assumptions: data sampled from a normal distribution; variance/covariance the same in each class.
R: lda (MASS package)

Nearest centroid
Calculate a centroid for each class; assign a new sample to the class with the nearest centroid. Similar to LDA. Can be extended to multiple (n > 2) classes.
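Nearest centroid is simple enough to sketch in full (in Python for illustration; the toy two-gene data and class labels below are hypothetical):

```python
from math import dist
from statistics import mean

def train_centroids(X, y):
    """Compute one centroid (per-feature mean) for each class."""
    return {c: [mean(x[j] for x, lab in zip(X, y) if lab == c)
                for j in range(len(X[0]))]
            for c in set(y)}

def predict(centroids, x):
    """Assign x to the class whose centroid is nearest (Euclidean distance)."""
    return min(centroids, key=lambda c: dist(x, centroids[c]))

# Toy 2-gene training data (hypothetical values)
X = [(1, 1), (2, 1), (8, 9), (9, 8)]
y = ["friendly", "friendly", "aggressive", "aggressive"]
cents = train_centroids(X, y)
print(predict(cents, (1.5, 2)))  # nearest centroid is (1.5, 1) -> "friendly"
```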

k nearest neighbors (knn)
For a test case, find the k nearest samples in the training set, and let them vote (e.g. k = 3).
You need to choose k (a hyperparameter; for two classes, k must be odd to avoid ties) and a distance metric.
No real learning involved: the training set defines the classifier.
R: knn (class package)
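knn also fits in a few lines (Python sketch with toy two-gene data, Euclidean distance, and a majority vote among k = 3 neighbors; all values are hypothetical):

```python
from math import dist
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Vote among the k training samples nearest to x (Euclidean distance)."""
    neighbors = sorted(zip(train_X, train_y), key=lambda p: dist(p[0], x))[:k]
    votes = Counter(lab for _, lab in neighbors)
    return votes.most_common(1)[0][0]

# Toy 2-gene training set (hypothetical values)
X = [(1, 1), (2, 1), (1, 2), (8, 9), (9, 8), (9, 9)]
y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(X, y, (2, 2), k=3))  # all 3 nearest neighbors are class "A"
```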

Artificial neural network (ANN)
Inputs (Gene1, Gene2, Gene3, ...) feed through neurons to outputs (Class A, Class B). Each neuron is simply a function (usually nonlinear); the network just represents a series of nested functions. Interpretation of results is difficult.
R: nnet (nnet package)

Support vector machine (SVM)
Find a line / plane / hyperplane that maximizes the margin; the training samples that define the margin are the support vectors. Uses the kernel trick to map data to higher dimensions. Very flexible; many parameters. Results can be difficult to interpret.
R: svm (e1071 package)

Comparison of machine learning methods
Linear discriminant analysis, nearest centroid, knn: relatively simple; based on distance calculations; good for simple problems; good for few training samples; a distribution of the data is assumed.
Neural networks, support vector machines: relatively complicated; involve machine learning; several adjustable parameters; many training samples required (e.g. 50-100); flexible methods.

Recipe for classification with machine learning
1. Choose a classification method
2. Feature selection / dimension reduction
3. Train the classifier
4. Assess classifier performance

Dimension reduction - why?
Previous slides: 2 genes × 16 samples. Your data: 22000 genes × 31 samples. More genes = more parameters in the classifier. Potential for over-training!! (aka the curse of dimensionality.)
The problem with overtraining: as the number of parameters grows, accuracy on the training set keeps improving, while accuracy on the test set (novel data) falls.

Dimension reduction
Two approaches:
1. Data transformation: principal component analysis (PCA; use top components); average co-expressed genes (e.g. from cluster analysis).
2. Feature selection (gene selection): significant genes (t-test / ANOVA); high-variance genes; hypothesis-driven.
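Feature selection by t-test can be sketched as ranking genes by the absolute Welch t-statistic between the two classes and keeping the top n (Python sketch; the toy expression values are hypothetical):

```python
from statistics import mean, variance

def t_stat(a, b):
    """Welch t-statistic between two groups of expression values."""
    return (mean(a) - mean(b)) / (variance(a) / len(a) + variance(b) / len(b)) ** 0.5

def select_genes(expr, labels, n):
    """Keep the n genes with the largest |t| between the two classes.
    expr: {gene: [value per sample]}; labels: class label per sample."""
    classes = sorted(set(labels))
    scored = {}
    for gene, vals in expr.items():
        a = [v for v, lab in zip(vals, labels) if lab == classes[0]]
        b = [v for v, lab in zip(vals, labels) if lab == classes[1]]
        scored[gene] = abs(t_stat(a, b))
    return sorted(scored, key=scored.get, reverse=True)[:n]

# Toy data: geneA separates the classes, geneB is noise (hypothetical values)
expr = {"geneA": [1, 2, 1, 8, 9, 8], "geneB": [5, 6, 5, 5, 6, 5]}
labels = ["x", "x", "x", "y", "y", "y"]
print(select_genes(expr, labels, 1))  # -> ['geneA']
```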

Recipe for classification with machine learning
1. Choose a classification method
2. Feature selection / dimension reduction
3. Train the classifier
4. Assess classifier performance

Train the classifier
If you know the exact classifier you want to use: just do it. But usually we want to try several classifiers, or several variations of a classifier (e.g. number of features, k in knn, etc.).
The problem: testing several classifiers and then choosing the best one leads to selection bias (= overtraining). This is bad. Instead, we can use cross-validation to choose a classifier.

Cross-validation
Split the data set into a temporary training set and a temporary testing set. Train the classifier on the temporary training set, then apply it to the temporary testing set and record the performance. Do this several times, each time selecting different sample(s) for the temporary testing set, and assess the average performance.
Leave-one-out cross-validation (LOOCV): do this once for each sample (i.e. the temporary testing set is a single sample).
If you are using cross-validation as an optimization step, choose the classifier (or classifier parameters) that results in the best performance.

Cross-validation example
Given data with 100 samples.
5-fold cross-validation: training on 4/5 of the data (80 samples), testing on 1/5 (20 samples); 5 different models and performance results.
Leave-one-out cross-validation (LOOCV): training on 99/100 of the data (99 samples), testing on 1/100 (1 sample); 100 different models.
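The fold bookkeeping can be sketched as simple index splitting (Python sketch; it assumes k divides the sample count evenly, and real use would shuffle the sample order first):

```python
def kfold_splits(n_samples, k):
    """Yield (train_idx, test_idx) index pairs for k-fold cross-validation.
    Assumes k divides n_samples; shuffle sample order beforehand in practice."""
    idx = list(range(n_samples))
    fold = n_samples // k
    for i in range(k):
        test = idx[i * fold:(i + 1) * fold]
        train = idx[:i * fold] + idx[(i + 1) * fold:]
        yield train, test

# 100 samples, 5 folds: each round trains on 80 samples and tests on 20
for train, test in kfold_splits(100, 5):
    print(len(train), len(test))  # prints "80 20" five times

# LOOCV is the special case k = n: 100 rounds, each testing a single sample
loocv = list(kfold_splits(100, 100))
```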

Recipe for classification 1. Choose a classification method 2. Feature selection / dimension reduction 3. Train the classifier 4. Assess classifier performance

Assess performance of the classifier
How well does the classifier perform on novel samples?
Bad: assess performance on the training set.
Better: assess performance on the training set using cross-validation.
Alternative: set aside a subset of your data as a test set.
Best: also assess performance on an entirely independent data set (e.g. data produced by another lab).

Problems with cross-validation
1. You cannot use the same cross-validation to optimize and to evaluate your classifier.
2. The classifier obtained at each round of cross-validation is different, and is different from the one obtained using all data.
3. Cross-validation will overestimate performance in the presence of experimental bias.
...therefore an independent test set is always better!

Assessing classifier performance (predict relapse of cancer)
Patient ID   prediction   actual outcome   result
#001         yes          no               FP
#002         no           no               TN
#003         yes          yes              TP
#004         no           yes              FN
TP = true positive, TN = true negative, FP = false positive, FN = false negative.
Confusion matrix:
             predict yes   predict no
actual yes   131 TP        21 FN
actual no    7 FP          212 TN

Classifier performance: accuracy
The fraction of predictions that are correct; range [0..1]. For the confusion matrix above (131 TP, 21 FN, 7 FP, 212 TN): accuracy = (131 + 212) / 371 = 0.92.

Unbalanced test data
A new classifier: does the patient have tuberculosis? Confusion matrix: 5345 TP, 21 FN (actual yes); 113 FP, 18 TN (actual no). Accuracy = 0.98 (?). But for patients who do not really have tuberculosis, this classifier is usually wrong (18 correct out of 131)!
>>> Accuracy can be misleading, especially when the test cases are unbalanced.

Classifier performance: sensitivity / specificity
Sensitivity = the ability to detect positives = TP / (TP + FN). Specificity = the ability to reject negatives = TN / (TN + FP). Range [0..1]. For the cancer confusion matrix (131 TP, 21 FN, 7 FP, 212 TN): sensitivity = 131/152 = 0.86; specificity = 212/219 = 0.97.

Classifier performance: Matthews correlation coefficient (MCC)
A single performance measure that is less influenced by unbalanced test sets; range [-1..1]. For the cancer confusion matrix (131 TP, 21 FN, 7 FP, 212 TN): MCC = 0.84.

Evaluating unbalanced test data
The tuberculosis classifier again (5345 TP, 21 FN, 113 FP, 18 TN):
Accuracy = 0.98 (?), sensitivity = 0.996, specificity = 0.14, MCC = 0.24.
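All four metrics follow directly from the confusion matrix counts. A sketch that reproduces the tuberculosis numbers above:

```python
from math import sqrt

def metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, specificity, and MCC from a confusion matrix."""
    acc = (tp + tn) / (tp + fn + fp + tn)
    sens = tp / (tp + fn)          # ability to detect positives
    spec = tn / (tn + fp)          # ability to reject negatives
    mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return acc, sens, spec, mcc

# Tuberculosis example: high accuracy, but poor specificity and MCC
print([round(m, 2) for m in metrics(tp=5345, fn=21, fp=113, tn=18)])
# -> [0.98, 1.0, 0.14, 0.24]
```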

Other performance metrics - survival
(Figure: Kaplan-Meier curves of the fraction with disease-free survival over time, gene-low vs. gene-high groups, with numbers at risk. Panel a: HR = 2.70 (1.00-7.00), P = 0.040. Panel b: HR = 1.05 (0.59-1.86), P = 0.88.)

Other performance metrics - ROC curve
What if we don't know where to set the threshold? A receiver operating characteristic (ROC) curve plots sensitivity against 1 - specificity across all possible thresholds (here, thresholds on XIST expression for calling female vs. male). AUC = area under the (ROC) curve; in this example, AUC = 0.965.
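The AUC has a convenient rank interpretation: it equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one (ties count half). A sketch (the XIST-like values below are made up for illustration, not the Raponi data):

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via the rank (Mann-Whitney) interpretation:
    the fraction of positive/negative pairs ranked correctly, ties count 0.5."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical XIST-like expression values: females mostly above males
females = [9.1, 8.4, 10.2, 7.5]
males = [6.0, 6.8, 7.9, 5.5]
print(auc(females, males))  # -> 0.9375 (15 of 16 pairs ranked correctly)
```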

Unbiased classifier assessment
Avoid information leakage!
1. Use separate data sets for training and for testing.
2. Try to avoid wearing out the testing data set. Testing multiple classifiers (and choosing the best one) can only increase the estimated performance; therefore it is biased!
3. Perform any feature selection using only the training set (or temporary training set).

Summary: classification with machine learning
1. Choose a classification method (knn, LDA, SVM, etc.)
2. Feature selection / dimension reduction (PCA, possibly t-tests)
3. Train the classifier (use cross-validation to optimize parameters)
4. Assess classifier performance (use an independent test set if possible; determine sensitivity and specificity when appropriate)

The data set in the exercise
Genome-wide expression profiling of human blood reveals biomarkers for Huntington's disease. Borovecki F, Lovrecic L, Zhou J, Jeong H, Then F, Rosas HD, Hersch SM, Hogarth P, Bouzou B, Jensen RV, Krainc D. Proc Natl Acad Sci U S A. 2005 Aug 2;102(31):11023-8.
31 samples: 14 normal, 5 presymptomatic, 12 symptomatic. Affymetrix HG-U133A arrays (also CodeLink UniSet).
Huntington's disease: a neurological disorder caused by a polyglutamine expansion in the huntingtin gene.
Why search for a marker of disease progression (not diagnosis)? To assess treatment efficacy, and as a surrogate endpoint in drug trials.

Questions and/or Lunch