Classification of cancer profiles. ABDBM Ron Shamir

Background: Cancer Classification
- Cancer classification is central to cancer treatment.
- Traditional cancer classification methods: location, morphology, cytogenesis.
- Limitations of morphological classification: tumors of similar histopathological appearance can have significantly different clinical courses and responses to therapy.
- Traditionally, cancer classification relied on specific biological insights.
- Challenges: finer classification of morphologically similar tumors at the molecular level; systematic and unbiased approaches.

Challenges
- Class prediction (classification): assign particular tumor samples to already-defined classes.
- Feature selection: identify the most informative genes for prediction.
- Class discovery: define previously unrecognized tumor subtypes (= clustering).
- Predict prognosis; suggest treatment!

Leukemia
- Golub et al., Science 286 (Oct 1999) 531-537.
- Computational paper: "Class Prediction and Discovery Using Gene Expression Data", Donna K. Slonim, Pablo Tamayo, Jill P. Mesirov, Todd R. Golub, Eric S. Lander, Proc. RECOMB 2000.
- Slides based on: Elashof-Horvath UCLA course, http://www.genetics.ucla.edu/horvathlab/biostat278/biostat278.htm

Background: Leukemia
- Acute leukemia: variability in clinical outcome; subtle differences in nuclear morphology.
- Subtypes: acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML).
- ALL subcategories: T-lineage ALL and B-lineage ALL.
- 1999 status: a combination of different tests was needed for diagnosis (morphology, histochemistry, immunophenotyping, ...).
- Although usually accurate, leukemia classification is imperfect and errors do occur.

Objective
- Develop a systematic approach to cancer classification based on gene expression data.
- Use leukemia as a test case.

The Data
- Primary samples: 38 bone marrow samples (27 ALL, 11 AML) obtained from acute leukemia patients at the time of diagnosis.
- Independent samples: 34 leukemia samples (24 bone marrow and 10 peripheral blood samples).
- Gene expression: Affymetrix arrays (6817 human genes).
- The first set is the training set; the second is the test set.
- Q: is there a class-specific signal in the data?

Metric for gene selection
- Want to find a set of predictive genes such that typical expression patterns differ a lot between classes, with low variance within each class.
- c: class vector, e.g. (1,1,1,1,1,1,0,0,0,0,0,0,0).
- g: expression vector of a gene.
- $\mu_i$: mean expression in class i; $\sigma_i$: standard deviation in class i.
- Golub / S2N (signal-to-noise) metric: $P(g,c) = (\mu_1 - \mu_0) / (\sigma_1 + \sigma_0)$.
- Pick the k genes g with highest P(g,c) as the predictor set.
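
The metric above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `s2n` and `top_k_genes`, the toy expression values, the use of the sample standard deviation, and the zero-denominator guard are all assumptions made here.

```python
from statistics import mean, stdev

def s2n(expr, labels):
    """Signal-to-noise score P(g,c) = (mu1 - mu0) / (sigma1 + sigma0) for one gene."""
    x1 = [x for x, c in zip(expr, labels) if c == 1]
    x0 = [x for x, c in zip(expr, labels) if c == 0]
    denom = stdev(x1) + stdev(x0)  # assumption: sample standard deviation
    if denom == 0:
        return 0.0                 # assumption: treat a constant gene as uninformative
    return (mean(x1) - mean(x0)) / denom

def top_k_genes(data, labels, k):
    """Return the k gene names with the highest P(g,c); data maps gene -> values."""
    scores = {g: s2n(v, labels) for g, v in data.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical toy data: 3 samples of class 1 followed by 3 samples of class 0.
labels = [1, 1, 1, 0, 0, 0]
data = {
    "geneA": [10, 12, 11, 2, 3, 1],   # high in class 1
    "geneB": [5, 5, 6, 5, 6, 5],      # no class difference
    "geneC": [1, 2, 3, 9, 8, 10],     # high in class 0
}
print(top_k_genes(data, labels, 1))
```

Genes with strongly negative scores (such as `geneC` here) mark class 0, so taking the lowest scores as well yields markers for both classes.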

Neighborhood Analysis: overview
- Define an "idealized expression pattern" c = (1,1,1,1,1,1,0,0,0,0,0,0,0).
- N(g) = number of genes g such that $P(g,c) > \alpha$.
- Randomly permute c to $\pi(c)$.
- R(g) = number of genes g such that $P(g,\pi(c)) > \alpha$.
- N(g) >> E(R(g)) would suggest that the classification is robust.
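
A minimal sketch of this permutation test follows, assuming the S2N score for P(g,c). The function names, the number of permutations, the fixed seed, and the toy genes are illustrative assumptions, not values from the paper.

```python
import random
from statistics import mean, stdev

def s2n(expr, labels):
    """P(g,c) = (mu1 - mu0) / (sigma1 + sigma0); returns 0.0 if the denominator is 0."""
    x1 = [x for x, c in zip(expr, labels) if c == 1]
    x0 = [x for x, c in zip(expr, labels) if c == 0]
    denom = stdev(x1) + stdev(x0)
    return (mean(x1) - mean(x0)) / denom if denom else 0.0

def count_neighbors(genes, labels, alpha):
    """Number of genes whose score against this labeling exceeds alpha."""
    return sum(1 for expr in genes if s2n(expr, labels) > alpha)

def neighborhood_analysis(genes, labels, alpha, n_perm=200, seed=0):
    """Return (N, estimate of E(R)): the observed count and the mean count
    over random permutations of the class vector."""
    rng = random.Random(seed)
    observed = count_neighbors(genes, labels, alpha)
    perm_counts = []
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)          # pi(c): same class sizes, random assignment
        perm_counts.append(count_neighbors(genes, perm, alpha))
    return observed, mean(perm_counts)

# Hypothetical toy data: three class-specific genes and one noise gene.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
genes = [
    [10, 11, 12, 13, 1, 2, 3, 4],
    [20, 22, 24, 26, 2, 4, 6, 8],
    [110, 111, 112, 113, 101, 102, 103, 104],
    [5, 7, 6, 8, 6, 5, 8, 7],
]
observed, expected = neighborhood_analysis(genes, labels, alpha=2.0)
print(observed, expected)
```

An observed count far above the permutation average is the pattern the slide describes as N(g) >> E(R(g)).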

Neighborhood analysis (contd)
- For each class c, plot the number of genes g with P(g,c) > x as a function of x:
- for the actual classes;
- for randomly permuted classes.

Neighborhood Analysis: Results
- On the training set, ~1100 genes were more highly correlated with the AML-ALL class distinction than would be expected by chance.
- => ample data for informative class prediction.

Predictor + Feature Selection
- Goal: create a classifier (predictor).
- Method: filtering. Choose the k genes most correlated with the label on the training set.

The Predictor size
- Pick the k genes g with highest P(g,c) as the predictor set, or
- pick the $k_1$ genes with highest P(g,c) and the $k_2$ genes with lowest.
- Choosing $k_1$, $k_2$: roughly equal.
- Few genes: most statistically significant, best for a clinical setting.
- Many genes: more robust, cover many biological processes.
- Too many genes: unlikely to be meaningful and independent.

Weighted voting
- S: set of features (genes) selected; x: new sample.
- Assign a weight $w(g) = P(g,c)$ to each gene in S.
- $b_g = (\mu_1 + \mu_0)/2$: half-way boundary between the class means for gene g.
- Vote of gene g: $V = w(g)\,(x_g - b_g)$.
- $V_+$ = sum of positive votes; $V_-$ = sum of negative votes.
- The winning class: the one with the larger absolute value.
- Prediction strength: $PS = (V_{winner} - V_{loser}) / (V_{winner} + V_{loser})$.
- Assign x to the winning class if PS > 0.3; otherwise, x is undetermined.
- Many other voting schemes are possible.
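
The voting rule can be sketched as follows. This is an illustrative reading of the slide, not the published implementation: the function names, the dictionary-based data layout, and the toy values are assumptions, and the vote totals are handled as magnitudes so that PS lies between 0 and 1.

```python
from statistics import mean, stdev

def train_weighted_voting(train, labels):
    """For each selected gene return (w, b): weight P(g,c) and half-way boundary."""
    model = {}
    for g, expr in train.items():
        x1 = [x for x, c in zip(expr, labels) if c == 1]
        x0 = [x for x, c in zip(expr, labels) if c == 0]
        w = (mean(x1) - mean(x0)) / (stdev(x1) + stdev(x0))
        b = (mean(x1) + mean(x0)) / 2
        model[g] = (w, b)
    return model

def predict(model, sample, threshold=0.3):
    """Return (label or None, prediction strength). None means undetermined."""
    votes = [w * (sample[g] - b) for g, (w, b) in model.items()]
    v_pos = sum(v for v in votes if v > 0)    # total vote for class 1
    v_neg = -sum(v for v in votes if v < 0)   # total vote for class 0, as a magnitude
    total = v_pos + v_neg
    if total == 0:
        return None, 0.0
    winner = 1 if v_pos >= v_neg else 0
    ps = abs(v_pos - v_neg) / total
    return (winner if ps > threshold else None), ps

# Hypothetical toy training set: geneA is high in class 1, geneC is high in class 0.
labels = [1, 1, 1, 0, 0, 0]
train = {"geneA": [10, 12, 11, 2, 3, 1], "geneC": [1, 2, 3, 9, 8, 10]}
model = train_weighted_voting(train, labels)
print(predict(model, {"geneA": 11, "geneC": 2}))     # both genes vote for class 1
print(predict(model, {"geneA": 7.5, "geneC": 6.5}))  # genes disagree, low PS
```

In the second call the two genes vote in opposite directions with similar magnitude, so PS falls below the 0.3 threshold and the sample is left undetermined.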

Testing the predictors
- Used a 50-gene predictor.
- LOOCV (leave-one-out cross-validation): assigned 36/38 samples as either AML or ALL, 2 as uncertain (PS < 0.3). All predictions were correct.
- Independent test: assigned 29/34 samples, at 100% accuracy.
- Median PS = 0.77 in cross-validation, 0.73 in the independent test (Fig. 3A).
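
A leave-one-out loop for such a predictor might look like the sketch below. It assumes that gene selection is repeated inside every fold (so the held-out sample never influences which genes are chosen), picks half of the k genes from each end of the S2N ranking, and uses hypothetical helper names and toy data throughout.

```python
from statistics import mean, stdev

def class_stats(expr, labels):
    """Return (mu1, mu0, sigma1, sigma0) for one gene."""
    x1 = [x for x, c in zip(expr, labels) if c == 1]
    x0 = [x for x, c in zip(expr, labels) if c == 0]
    return mean(x1), mean(x0), stdev(x1), stdev(x0)

def fit(train, labels, k):
    """Select k genes (k//2 highest and k//2 lowest S2N) and return {gene: (w, b)}."""
    params = {}
    for g, expr in train.items():
        m1, m0, s1, s0 = class_stats(expr, labels)
        w = (m1 - m0) / (s1 + s0) if (s1 + s0) else 0.0
        params[g] = (w, (m1 + m0) / 2)
    ranked = sorted(params, key=lambda g: params[g][0], reverse=True)
    chosen = ranked[: k // 2] + ranked[len(ranked) - k // 2:]
    return {g: params[g] for g in chosen}

def predict(model, sample, threshold=0.3):
    """Weighted vote; returns (label or None, prediction strength)."""
    votes = [w * (sample[g] - b) for g, (w, b) in model.items()]
    v_pos = sum(v for v in votes if v > 0)
    v_neg = -sum(v for v in votes if v < 0)
    if v_pos + v_neg == 0:
        return None, 0.0
    ps = abs(v_pos - v_neg) / (v_pos + v_neg)
    winner = 1 if v_pos >= v_neg else 0
    return (winner if ps > threshold else None), ps

def loocv(data, labels, k, threshold=0.3):
    """Hold out each sample in turn, refit on the rest, predict the held-out one."""
    results = []
    for i in range(len(labels)):
        train = {g: v[:i] + v[i + 1:] for g, v in data.items()}
        train_labels = labels[:i] + labels[i + 1:]
        model = fit(train, train_labels, k)
        sample = {g: v[i] for g, v in data.items()}
        results.append(predict(model, sample, threshold))
    return results

# Hypothetical toy data: 4 samples per class, two informative genes, one noise gene.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
data = {
    "up":   [10, 11, 12, 13, 1, 2, 3, 4],
    "down": [1, 2, 3, 4, 10, 11, 12, 13],
    "flat": [5, 7, 6, 8, 6, 5, 8, 7],
}
results = loocv(data, labels, k=2)
print([label for label, _ in results])
```

Samples whose prediction strength falls below the threshold come back as `None`, mirroring the "uncertain" calls reported on the slide.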

Testing the predictors (contd)
- The average PS was lower for samples from one laboratory, which used a very different protocol for sample preparation.
- Sample preparation should be standardized in clinical implementation.

How many features?
- Choosing k = 50 was a bit arbitrary.
- Results were insensitive to k: predictors based on 10-200 genes were all 100% accurate.
- This reflects the strong correlation of many genes with the AML-ALL distinction.

Class Discovery
- If the AML-ALL distinction was not already known, could we discover it simply on the basis of gene expression?
- Strategy: cluster the samples; assign a new sample to the class with the closest centroid.
- Test by cross-validation.
- Compare to results on random classes.
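
The cluster-then-assign strategy can be illustrated with the sketch below. Note that it swaps in a plain 2-means clustering for the SOM used in the study, purely to keep the example short; the function names, the deterministic initialization, and the toy two-gene samples are assumptions.

```python
def dist2(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of profiles."""
    return [sum(col) / len(points) for col in zip(*points)]

def nearest_centroid(sample, centroids):
    """Index of the centroid closest to the sample."""
    return min(range(len(centroids)), key=lambda j: dist2(sample, centroids[j]))

def two_means(samples, n_iter=20):
    """2-means clustering (a stand-in for a 2-cluster SOM).
    Deterministic start: the first sample and the sample farthest from it."""
    c0 = samples[0]
    c1 = max(samples, key=lambda s: dist2(s, c0))
    centroids = [c0, c1]
    assign = []
    for _ in range(n_iter):
        assign = [nearest_centroid(s, centroids) for s in samples]
        new = []
        for j in (0, 1):
            members = [s for s, a in zip(samples, assign) if a == j]
            new.append(centroid(members) if members else centroids[j])
        if new == centroids:   # converged
            break
        centroids = new
    return assign, centroids

# Hypothetical toy profiles (2 genes per sample) forming two obvious groups.
samples = [[1, 1], [1, 2], [2, 1], [9, 9], [9, 8], [8, 9]]
assign, centroids = two_means(samples)
print(assign)
print(nearest_centroid([8, 8], centroids))   # class of a new sample
```

To follow the slide's validation step, the discovered labels would then be fed to a class predictor and evaluated by cross-validation against randomly assigned classes.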

Class Discovery: Results
- A 2-cluster SOM was applied to cluster the 38 initial leukemia samples using the expression patterns of all 6817 genes.
- The clusters were first evaluated by comparing them to the known AML-ALL classes (Fig. 4A).
- Class A1: mostly ALL (24 of 25 samples).
- Class A2: mostly AML (10 of 13 samples).

Testing discovered classes
- (a) Construct predictors for "type A1" or "type A2."
- (b) Cross-validation: predictors with a wide range of different numbers of informative genes performed well.
- (c) Independent test: median PS 0.61; 74% of samples were above threshold.
- High prediction strengths indicate that the structure seen in the initial data set is also seen in the independent data set.

Testing discovered classes (2)
- (d) Random clusters yielded significantly poorer results in CV and the independent data set (Fig. 4B).
- => The A1-A2 distinction is meaningful, not a statistical artifact of the initial data set.
- => The AML-ALL distinction could have been automatically discovered and confirmed without previous biological knowledge.

Multiple cluster analysis
- A 4-cluster SOM divides the samples into four clusters, which largely corresponded to AML, T-ALL, and two B-ALL clusters (Fig. 4C).
- These classes were evaluated by constructing class predictors.
- The four classes could be distinguished from one another, with the exception of B3 versus B4 (Fig. 4D).

Multiple Clusters (2)
- The prediction tests confirmed the distinctions corresponding to AML, B-ALL, and T-ALL.
- They suggested merging classes B3 and B4, which are composed primarily of B-lineage ALL.

Todd Golub, Donna Slonim

Breast Cancer
- van 't Veer et al., Nature 2002

The Challenge
- Out of young women who have breast cancer, only 15-20% will develop metastases.
- These women must be treated aggressively (chemotherapy), but not the rest.
- Can expression data help to identify this group? Understand the disease process?

van 't Veer's data
- Goal: predict clinical outcome from expression.
- 98 primary breast cancers:
- Sporadic: 34 with metastases within 5 years (poor prognosis group, mean time to metastasis 2.5 yrs); 44 without (good prognosis group, mean follow-up 8.7 yrs). All <55 yrs old, lymph node negative.
- Carriers: 18 BRCA1 and 2 BRCA2 mutation carriers.
- Measured expression levels of ~25K genes (reference: mixture of the sporadic samples).
- Selected 5K genes that showed significant change.

Hierarchical clustering (unsupervised) gives two main clusters:
- Most carriers fall into one cluster.
- ER status and lymphocytic infiltrate differ between the clusters.

Supervised clustering
- On the sporadic samples, selected genes significantly correlated with metastasis: 231 genes with correlation coefficient (CC) > 0.3, ranked by CC.
- Added genes 5 at a time and checked classification using leave-one-out.
- Optimal accuracy with 70 genes: 83%.
- Raised the threshold so as to miss fewer poor-prognosis patients (dashed line, - - -).
- Independent validation: on 19 other cases, 2 misclassifications.
- Odds ratio (OR) for metastasis for women with the poor prognosis signature: 15. Previous methods: 2.4-6.4.
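
The ranking step can be sketched as follows. This is only an illustration of correlation-based gene ranking: it assumes Pearson correlation against a 0/1 outcome vector and thresholds on the absolute value of CC, and the function names and toy values are hypothetical. The leave-one-out evaluation of each growing subset is left out.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def rank_by_correlation(data, outcome, threshold=0.3):
    """Keep genes with |CC| above the threshold, ordered from strongest to weakest."""
    cc = {g: pearson(v, outcome) for g, v in data.items()}
    kept = [g for g in cc if abs(cc[g]) > threshold]
    return sorted(kept, key=lambda g: abs(cc[g]), reverse=True)

def gene_subsets(ranked, step=5):
    """Growing predictor sets: the top 5 genes, top 10, top 15, and so on."""
    return [ranked[:n] for n in range(step, len(ranked) + 1, step)]

# Hypothetical toy data: outcome 1 = metastasis, 0 = no metastasis.
outcome = [1, 1, 1, 0, 0, 0]
data = {
    "g1": [3, 3, 3, 1, 1, 1],   # perfectly correlated with outcome
    "g2": [1, 2, 1, 2, 1, 2],   # weakly anti-correlated
    "g3": [1, 2, 3, 1, 2, 3],   # uncorrelated, filtered out
}
print(rank_by_correlation(data, outcome))
```

Each subset returned by `gene_subsets` would then be scored by leave-one-out classification, and the subset size with the best accuracy kept.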

How many of the women would current medical guidelines subject to chemotherapy?

van de Vijver et al., NEJM 2002
- 295 consecutive patients with breast cancer: 151 with lymph node negative and 144 with lymph node positive disease.
- Applied the 70-gene poor prognosis signature to each: 180 poor, 115 good.
- Average 10-year survival rate: 55% vs 95%.
- Probability of being free of metastasis at 10 years: 50% vs 85% (hazard ratio: 5.1).

Conclusions
- The gene-expression profile we studied is a more powerful predictor of the outcome of disease in young patients with breast cancer than standard systems based on clinical and histologic criteria.

Laura van 't Veer

Act 2

A first breast cancer diagnostic chip
Phoenix, AZ April 21, 2005 - The Molecular Profiling Institute, Inc. (MPI) announced today that it is now providing the MammaPrint breast cancer test to breast cancer patients in the United States. This is the first commercially available microarray cancer diagnostic that analyzes patients' breast tumors for their individual DNA expression profile. "MammaPrint more accurately distinguishes between lymph node-negative breast cancer patients who would benefit from additional therapy from those who would not, helping oncologists offer more effective therapy to their patients. The 70 genes in a woman's tumor analyzed by MammaPrint predict the 10-year survival of the
http://www.eurekalert.org/pub_releases/2005-04/ttgr-tmp042105.php

Caveats
MammaPrint, Act 3

Ein-Dor et al., Bioinformatics 2005
- Reanalyzed the 96 sporadic samples of van 't Veer.
- Is the selected 70-gene signature unique?
- Training set: the same 77 patients as van 't Veer (vV).
- Ranked all genes by correlation to survival.
- Features for each classifier, by gene rank: (vV) genes 1-70; (1) 71-140; (2) 141-210; ...; (7) 701-770.
- Applied each classifier to all 96 samples.

Effect of training set
- van 't Veer, Ramaswamy 03.
- Selected, 10 times, a random set of 77 training samples out of the 96.
- For each, ranked the top 70 genes by correlation.
- Compared to their rank in the first training set.

Conclusions
- Many genes can be used to predict survival.
- No gene correlates very strongly.
- A gene's rank may fluctuate strongly.
- Identities of the top 70 genes are not robust.
- A much larger number of patients is needed to identify the genes whose rank is indicative of their importance to cancer pathology.
- But: for prognosis, fairly reliable signatures can be produced using a large enough gene set.

The dilemma
"If the results from adjuvant trials confirm the strong benefit for HER2-positive patients using adjuvant chemotherapy plus trastuzumab, would there be clinicians prepared to withhold adjuvant chemotherapy in a young patient with a node-negative, HER2-positive breast cancer and a good prognosis signature?"
Brenton et al., Journal of Clinical Oncology 23 (29), p. 7350 (2005)

A prospective study
MammaPrint, Act 4

10-year prospective study of 6,693 patients from 112 institutions in 9 countries
- C: classic clinical risk; MP: genomic risk.

The study design
- Of all patients with high clinical risk, treating based on MP would have saved chemo for 46% of the patients.
http://www.agendia.com

Results
- With chemo: 1.5% higher.
http://www.agendia.com

Multi-Class Cancer Classification
- Ramaswamy et al. (Golub's group), PNAS 2001

Data
- 218 profiles of tumors of 14 types.
- Affymetrix chips, 16K genes; 11K after variation filtering.
- Training set: 144 samples; test set: 54 samples.
- Additional set (PD): 20 poorly differentiated carcinomas. These are difficult to classify with traditional methods, as they lack the characteristic morphological hallmarks of the organ from which they arise.

Class Discovery
- Applied hierarchical clustering (Eisen) and SOM.
- Mixed success.

Classification
- One vs. All (OVA) approach.
- Use a 2-way classifier algorithm A.
- Label the members of one class 1 and the rest 0; train A; classify all samples and get confidence values for the assignments.
- Repeat for each class, giving 14 classifiers.
- Assign each sample to the class in which it was accepted with the highest confidence.
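
The OVA scheme can be sketched independently of the underlying 2-way classifier. In the example below the binary classifier A is a weighted-voting stand-in whose summed vote serves as the confidence value; that choice, the function names, and the three-class toy data are assumptions made for illustration.

```python
from statistics import mean, stdev

def train_binary(X, y):
    """Stand-in 2-way classifier A: per-gene S2N weight w and midpoint b.
    X is a list of samples (lists of gene values); y holds 0/1 labels."""
    model = []
    for j in range(len(X[0])):
        x1 = [row[j] for row, c in zip(X, y) if c == 1]
        x0 = [row[j] for row, c in zip(X, y) if c == 0]
        denom = stdev(x1) + stdev(x0)
        w = (mean(x1) - mean(x0)) / denom if denom else 0.0
        model.append((w, (mean(x1) + mean(x0)) / 2))
    return model

def confidence(model, x):
    """Signed total vote: larger means more confidently 'in the class'."""
    return sum(w * (xj - b) for (w, b), xj in zip(model, x))

def train_ova(X, labels):
    """One binary classifier per class: members labeled 1, all others 0."""
    return {c: train_binary(X, [1 if l == c else 0 for l in labels])
            for c in sorted(set(labels))}

def predict_ova(models, x):
    """Assign x to the class whose classifier accepts it with highest confidence."""
    return max(models, key=lambda c: confidence(models[c], x))

# Hypothetical toy data: three tumor types, each marked by one gene.
labels = ["A", "A", "B", "B", "C", "C"]
X = [[9, 1, 2], [10, 2, 1], [1, 9, 2], [2, 10, 1], [1, 2, 9], [2, 1, 10]]
models = train_ova(X, labels)
print(predict_ova(models, [1.5, 9.5, 1.5]))
```

Any classifier that returns a real-valued confidence (for example an SVM decision value) could replace `train_binary` and `confidence` without changing the OVA wrapper.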

- Weighted voting, KNN and SVM had significant prediction accuracy.
- SVM was consistently best.
- Genes: best S2N metric in OVA for each class.

Classification results

Recursive feature elimination
- The OVA SVM classifier outputs a hyperplane w, found by solving $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_i (w \cdot x_i + b) \ge 1$ for all $x_i$.
- Predicted class: $\mathrm{sign}\left(\sum_k w_k x_{ik} + b\right)$.
- Recursively remove the 10% of genes with the smallest $|w_k|$ values and retrain.
- Stop when accuracy decreases (or use to study gene number effects).
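
The elimination loop can be sketched as below. To stay self-contained, the sketch trains a soft-margin linear SVM by full-batch subgradient descent instead of calling an SVM solver, and it stops at a requested number of genes rather than monitoring accuracy; both choices, along with the function names, hyperparameters, and toy data, are assumptions.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Subgradient descent on the hinge loss with L2 penalty (stand-in for an
    SVM solver). y holds labels in {-1, +1}. Returns (w, b)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw = [lam * wk for wk in w]      # gradient of the L2 penalty
        gb = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wk * xk for wk, xk in zip(w, xi)) + b)
            if margin < 1:               # sample violates the margin
                for k in range(d):
                    gw[k] -= yi * xi[k] / n
                gb -= yi / n
        w = [wk - lr * g for wk, g in zip(w, gw)]
        b -= lr * gb
    return w, b

def rfe(X, y, n_keep, frac=0.10):
    """Repeatedly retrain and drop the fraction of genes with smallest |w_k|
    (at least one per round) until n_keep genes remain. Returns gene indices."""
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        Xa = [[row[k] for k in active] for row in X]
        w, _ = train_linear_svm(Xa, y)
        n_drop = min(max(1, int(frac * len(active))), len(active) - n_keep)
        order = sorted(range(len(active)), key=lambda k: abs(w[k]))
        dropped = set(order[:n_drop])
        active = [g for k, g in enumerate(active) if k not in dropped]
    return active

# Hypothetical toy data: gene 1 separates the classes; genes 0 and 2 are noise.
X = [[1, 2, -1], [-1, 3, 1], [1, -2, -1], [-1, -3, 1]]
y = [1, 1, -1, -1]
print(rfe(X, y, n_keep=1))
```

Recording test accuracy after each round, instead of stopping at a fixed `n_keep`, gives the accuracy-versus-gene-number curve mentioned on the slide.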

Accuracy vs. gene number
- OVA: one vs. all; AP: all pairs; WV: weighted voting.