Diagnosis of multiple cancer types by shrunken centroids of gene expression

Diagnosis of multiple cancer types by shrunken centroids of gene expression
Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu
PNAS 99(10):6567-6572, 14 May 2002

Nearest Centroid Classifier
Calculate a centroid for each class: $\bar{x}_{ik} = \sum_{j \in C_k} x_{ij} / n_k$
Calculate the distance between a test sample and each class centroid
Predict the class with the smallest squared distance: $d^2(x^*, \bar{x}_k) = \sum_i (x^*_i - \bar{x}_{ik})^2$
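
The rule above can be sketched in a few lines of NumPy; the function names and toy data are illustrative, not from the paper's software:

```python
import numpy as np

# Minimal nearest-centroid classifier (illustrative names and toy data).
# X: n_samples x n_genes expression matrix; y: integer class labels.
def fit_centroids(X, y):
    classes = np.unique(y)
    # centroid[k, i] = mean expression of gene i over the samples in class k
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def predict_nearest_centroid(X_test, classes, centroids):
    # squared Euclidean distance from each test sample to each class centroid
    d2 = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d2.argmin(axis=1)]      # class of the nearest centroid

X = np.array([[1.0, 0.0], [1.2, 0.1], [0.0, 1.0], [0.1, 1.1]])
y = np.array([0, 0, 1, 1])
classes, centroids = fit_centroids(X, y)
print(predict_nearest_centroid(np.array([[1.1, 0.0]]), classes, centroids))  # → [0]
```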

'Nearest Shrunken Centroid'
General idea: shrink each class centroid toward the overall centroid by a 'threshold' amount
Advantages: reduces noise; performs gene selection

Class Centroids
Mean expression of gene i in class k: $\bar{x}_{ik} = \sum_{j \in C_k} x_{ij} / n_k$
i-th component of the overall centroid: $\bar{x}_i = \sum_{j=1}^{n} x_{ij} / n$

Normalize by the Standard Deviation
Let $d_{ik} = \dfrac{\bar{x}_{ik} - \bar{x}_i}{m_k (s_i + s_0)}$
where $s_i$ is the pooled within-class standard deviation for gene i,
$s_i^2 = \dfrac{1}{n - K} \sum_{k=1}^{K} \sum_{j \in C_k} (x_{ij} - \bar{x}_{ik})^2,$
$m_k = \sqrt{1/n_k + 1/n}$, and $s_0$ is a small positive constant that guards against genes with very small variance

Shrink the $d_{ik}$ by Soft-Thresholding
$d'_{ik} = \mathrm{sign}(d_{ik}) \, (|d_{ik}| - \Delta)_+$
New shrunken centroids: $\bar{x}'_{ik} = \bar{x}_i + m_k (s_i + s_0) \, d'_{ik}$
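
The shrinkage step can be sketched as follows; the toy values for the centroids, $m_k$, $s_i$, and $s_0$ are assumptions for illustration, not data from the paper:

```python
import numpy as np

# Sketch of the shrunken-centroid update for one class k (toy values).
def soft_threshold(d, delta):
    # d'_ik = sign(d_ik) * (|d_ik| - delta)_+
    return np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)

def shrink_centroid(x_bar, x_bar_k, m_k, s, s0, delta):
    d = (x_bar_k - x_bar) / (m_k * (s + s0))   # standardized difference d_ik
    d_shrunk = soft_threshold(d, delta)        # shrink toward zero
    return x_bar + m_k * (s + s0) * d_shrunk   # rebuild the shrunken centroid

x_bar   = np.zeros(3)                 # overall centroid (3 genes)
x_bar_k = np.array([2.0, 0.3, -1.5])  # class-k centroid
s, s0, m_k = np.ones(3), 0.1, 0.5
print(shrink_centroid(x_bar, x_bar_k, m_k, s, s0, delta=1.0))
```

Note that the second gene, whose standardized difference is below $\Delta$, is pulled all the way back to the overall mean; a gene shrunk to the overall centroid in every class no longer influences classification, which is the gene-selection effect.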

Soft vs. Hard Thresholding
Hard thresholding: $d'_{ik} = d_{ik} \, I(|d_{ik}| > \Delta)$
Hard thresholding is more jumpy, and gives a higher minimum test error and higher gene expression estimation error
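
The two operators side by side, as a small sketch with illustrative values:

```python
import numpy as np

def soft(d, delta):
    # moves every surviving d_ik toward zero by delta
    return np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)

def hard(d, delta):
    # keeps or kills each d_ik outright: discontinuous at |d_ik| = delta
    return d * (np.abs(d) > delta)

d = np.array([-2.0, -0.5, 0.5, 2.0])
print(soft(d, 1.0))   # surviving values are shrunk by 1
print(hard(d, 1.0))   # surviving values are kept unchanged
```

The discontinuity at $|d_{ik}| = \Delta$ is what makes hard thresholding 'jumpy': a tiny change in the data can flip a gene in or out with no intermediate shrinkage.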

Contrasting the Shrunken Centroids Tibshirani et al, Stat Sci (2003)

Choose Δ by Cross-Validation
Use k-fold cross-validation: divide the data into k roughly equal parts
Fit the model for many values of Δ and use CV to estimate the error
Choose the value of Δ that gives the smallest error
Note: this assumes a separate test set
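
The CV loop can be sketched generically; `fit` and `error` below are placeholders for the shrunken-centroid fit and its misclassification rate, and the dummy pair in the usage example is purely illustrative:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    # divide 0..n-1 into k roughly equal, shuffled parts
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def choose_delta_by_cv(X, y, deltas, fit, error, k=10):
    folds = kfold_indices(len(y), k)
    cv_errs = []
    for delta in deltas:
        fold_errs = []
        for fold in folds:
            train = np.setdiff1d(np.arange(len(y)), fold)
            model = fit(X[train], y[train], delta)             # fit on k-1 parts
            fold_errs.append(error(model, X[fold], y[fold]))   # score held-out part
        cv_errs.append(np.mean(fold_errs))
    return deltas[int(np.argmin(cv_errs))]   # delta with the smallest CV error

# Dummy fit/error whose CV error is minimized at delta = 2 (illustration only).
deltas = np.array([0.0, 1.0, 2.0, 3.0])
best = choose_delta_by_cv(np.zeros((20, 3)), np.zeros(20), deltas,
                          fit=lambda X, y, d: d,
                          error=lambda model, X, y: (model - 2.0) ** 2,
                          k=5)
print(best)  # → 2.0
```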

Linear Discriminant Analysis
$\delta_k^{\mathrm{LDA}}(x^*) = (x^* - \bar{x}_k)^T W^{-1} (x^* - \bar{x}_k) - 2 \log \pi_k$
Compute the distance to the centroids; W is the pooled within-class covariance matrix
The shrunken centroid method can be seen as 'a heavily restricted form of LDA, necessary to cope with the large number of variables (genes)'

LDA vs. Nearest Centroid Equivalent if within-class covariance matrix is restricted to diagonal and prior is ignored Relative performance depends on correlation structure of the samples Tibshirani et al, Stat Sci (2003)
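
A small numeric sketch of the restricted case: with W forced to be diagonal and the prior ignored, the LDA score reduces to a per-gene-standardized centroid distance. All values below are toy assumptions:

```python
import numpy as np

def lda_score_diag(x_star, centroid, w_diag):
    # (x* - xbar_k)^T W^{-1} (x* - xbar_k) with W = diag(w_diag), prior ignored
    diff = x_star - centroid
    return float((diff ** 2 / w_diag).sum())

centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
w_diag = np.array([0.5, 2.0])         # per-gene within-class variances (toy)
x_star = np.array([0.8, 0.3])
scores = [lda_score_diag(x_star, c, w_diag) for c in centroids]
print(int(np.argmin(scores)))  # → 0
```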

Class Probabilities and Discriminant Functions
Purpose: correct for the relative numbers of samples in each class
Expression levels: $x^* = (x^*_1, x^*_2, \ldots, x^*_p)$
Discriminant score: $\delta_k(x^*) = \sum_{i=1}^{p} \dfrac{(x^*_i - \bar{x}'_{ik})^2}{(s_i + s_0)^2} - 2 \log \pi_k$

Class Probabilities and Discriminant Functions
New classification rule: $C(x^*) = l$ if $\delta_l(x^*) = \min_k \delta_k(x^*)$
Gaussian linear discriminant analysis gives estimated class probabilities:
$\hat{p}_k(x^*) = \dfrac{e^{-\delta_k(x^*)/2}}{\sum_{l=1}^{K} e^{-\delta_l(x^*)/2}}$
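
Both formulas can be sketched with toy centroids and priors (all values are illustrative):

```python
import numpy as np

def discriminant_scores(x_star, centroids, s, s0, priors):
    # delta_k(x*) = sum_i (x*_i - x'_ik)^2 / (s_i + s0)^2 - 2 log pi_k
    d2 = (((x_star - centroids) ** 2) / (s + s0) ** 2).sum(axis=1)
    return d2 - 2.0 * np.log(priors)

def class_probabilities(scores):
    # p_k(x*) = exp(-delta_k/2) / sum_l exp(-delta_l/2)
    w = np.exp(-0.5 * (scores - scores.min()))  # constant shift for stability
    return w / w.sum()

centroids = np.array([[1.0, 0.0], [0.0, 1.0]])   # shrunken centroids (toy)
s, s0, priors = np.ones(2), 0.1, np.array([0.5, 0.5])
scores = discriminant_scores(np.array([0.9, 0.1]), centroids, s, s0, priors)
print(scores.argmin())               # predicted class: smallest score wins
print(class_probabilities(scores))   # the two probabilities sum to 1
```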

Adaptive Threshold Scaling
Define a scaling vector $(\theta_1, \theta_2, \ldots, \theta_K)$ and include it in $d_{ik}$:
$d_{ik} = \dfrac{\bar{x}_{ik} - \bar{x}_i}{m_k \theta_k s_i}$
Adaptive procedure: start with all $\theta_k = 1$, reduce $\theta_k$ by 10% for the class with the largest training error, and repeat
Can dramatically reduce the total number of genes used without increasing the error rate
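
The iterative procedure can be sketched as below; `training_error_per_class` stands in for refitting the classifier and scoring each class, and the constant dummy in the usage line is for illustration only:

```python
import numpy as np

def adapt_thetas(training_error_per_class, K, rounds=5):
    # start with all theta_k = 1
    thetas = np.ones(K)
    for _ in range(rounds):
        errs = training_error_per_class(thetas)
        thetas[int(np.argmax(errs))] *= 0.9  # reduce theta by 10% for worst class
    return thetas

# Dummy errors: class 0 is always the worst, so only theta_0 is reduced.
print(adapt_thetas(lambda thetas: np.array([0.3, 0.1]), K=2))
```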

Overall Model: Predictive Analysis of Microarrays (PAM)
Typically an accurate classifier
Minimizes both the number of genes and the error
Results are simple to understand
Software available at: http://www-stat.stanford.edu/~tibs/pam

Diagnosis of multiple cancer types by shrunken centroids of gene expression

Goal 'Classify and predict the diagnostic category of a sample on the basis of its gene expression profile' Use a simple approach that performs well and is easy to interpret

Small Round Blue Cell Tumors
Occur in children and young adults, with a male predominance
Present as a large mass within the abdomen, usually in the pelvic region
Aggressive tumors with a poor prognosis

Experimental Data
Expression measurements on 2,308 genes from cDNA microarrays
4 tumor classes: Burkitt lymphoma (BL), Ewing sarcoma (EWS), neuroblastoma (NB), rhabdomyosarcoma (RMS)
88 total samples: 63 training samples and 25 test samples (including 5 control samples)

'Reference 5'
Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks
Javed Khan, Jun S. Wei, Markus Ringnér, Lao H. Saal, Marc Ladanyi, Frank Westermann, Frank Berthold, Manfred Schwab, Cristina R. Antonescu, Carsten Peterson, and Paul S. Meltzer
Nature Medicine 7(6):673-679, June 2001

The Artificial Neural Network

Khan et al. Conclusions
Developed a linear artificial neural network
Report 0% training and test errors
Identified 96 genes as 'important'
61 genes specifically expressed in a single cancer type
41 genes had not previously been reported as associated with these diseases

Apply the PAM Method
Utilize the nearest shrunken centroid method to eliminate noisy genes

Shrunken Centroids

Determination of Δ
Soft thresholding using the shrinkage parameter Δ: $d'_{ik} = \mathrm{sign}(d_{ik}) \, (|d_{ik}| - \Delta)_+$
10-fold cross-validation: error minimized for Δ = 4.34
tr = training set, cv = cross-validated set, te = test set

The Genes That Matter

The Genes That Matter

Heat Map Comparisons
Tibshirani: 43 genes, shrunken centroid method
Khan: 96 genes, artificial neural network method

Class Probabilities
Classified by true class vs. classified by predicted class
'All 63 of the training samples and all 20 of the test samples known to be SRBCT are correctly classified'

Findings
43 important genes identified
27 of these were also found by the neural network method
1 of 8 genes presently considered to be diagnostic for SRBCTs
The paper discusses other identified genes that play oncogenic roles

Conclusions 'The method of nearest shrunken centroids was successful in finding genes that accurately predict classes' 'The efficiency of our method in finding a relatively small number of predictive genes will facilitate the search for new diagnostic tools' 'The success of our methodology has implications for improving the diagnosis of cancer'

Leukemia Example
Another example of shrunken centroid classification
The data present a 2-class problem: 7,129 total genes and 34 total samples
20 acute lymphocytic leukemia (ALL)
14 acute myelogenous leukemia (AML)
The data were previously classified by Golub et al using a linear scoring procedure

Golub Gene Selection
Correlation measure: $a_i = \dfrac{\bar{x}_{i1} - \bar{x}_{i2}}{s_{i1} + s_{i2}}$
Weighted vote over the selected gene set $S(m)$:
$G(x^*) = \sum_{i \in S(m)} \dfrac{(\bar{x}_{i1} - \bar{x}_{i2}) \, x^*_i}{s_{i1} + s_{i2}} - \sum_{i \in S(m)} \dfrac{\bar{x}_{i1} - \bar{x}_{i2}}{s_{i1} + s_{i2}} \cdot \dfrac{\bar{x}_{i1} + \bar{x}_{i2}}{2}$
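
A sketch of the scoring and voting steps; the two-gene toy data and function names are illustrative, not Golub's actual code:

```python
import numpy as np

def golub_scores(X1, X2):
    # a_i = (mean_i1 - mean_i2) / (sd_i1 + sd_i2); b_i = midpoint of class means
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    s1, s2 = X1.std(axis=0, ddof=1), X2.std(axis=0, ddof=1)
    return (m1 - m2) / (s1 + s2), (m1 + m2) / 2.0

def weighted_vote(x_star, a, b, top):
    # vote with the `top` genes ranked by |a_i|; G(x*) > 0 favors class 1
    keep = np.argsort(-np.abs(a))[:top]
    return float((a[keep] * (x_star[keep] - b[keep])).sum())

X1 = np.array([[2.0, 1.0], [3.0, 2.0]])   # class-1 samples (toy)
X2 = np.array([[0.0, 1.5], [1.0, 0.5]])   # class-2 samples (toy)
a, b = golub_scores(X1, X2)
print(weighted_vote(np.array([2.5, 1.5]), a, b, top=1) > 0)  # → True
```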

2-Class Discriminant Scores
Original discriminant score: $\delta_k(x^*) = \sum_{i=1}^{p} \dfrac{(x^*_i - \bar{x}'_{ik})^2}{(s_i + s_0)^2} - 2 \log \pi_k$
2-class version:
$l(x^*) = \frac{1}{2}\left[\delta_2(x^*) - \delta_1(x^*)\right] = \sum_{i \in S(\Delta)} \dfrac{(\bar{x}'_{i1} - \bar{x}'_{i2}) \, x^*_i}{(s_i + s_0)^2} - \sum_{i \in S(\Delta)} \dfrac{(\bar{x}'_{i1} - \bar{x}'_{i2})(\bar{x}'_{i1} + \bar{x}'_{i2})}{2 (s_i + s_0)^2} + \log \dfrac{\pi_1}{\pi_2}$
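
A numeric check that the expanded 2-class expression really equals half the difference of the two discriminant scores; all quantities below are random toy values:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4
x_star = rng.normal(size=p)                        # test sample x*
c1, c2 = rng.normal(size=p), rng.normal(size=p)    # shrunken class centroids
s = np.abs(rng.normal(size=p)) + 0.5               # pooled standard deviations
s0, pi1, pi2 = 0.1, 0.6, 0.4

def delta(x, c, prior):
    # delta_k(x*) = sum_i (x_i - c_i)^2 / (s_i + s0)^2 - 2 log pi_k
    return (((x - c) ** 2) / (s + s0) ** 2).sum() - 2.0 * np.log(prior)

lhs = 0.5 * (delta(x_star, c2, pi2) - delta(x_star, c1, pi1))
rhs = ((c1 - c2) * x_star / (s + s0) ** 2).sum() \
      - ((c1 - c2) * (c1 + c2) / (2.0 * (s + s0) ** 2)).sum() \
      + np.log(pi1 / pi2)
print(np.isclose(lhs, rhs))  # → True
```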

Methodological Comparison
Variance vs. standard deviation
Hard vs. soft thresholding
Cross-validation (choosing m vs. Δ)
Added features

Leukemia Classification
tr = training set, cv = cross-validated set, te = test set

Findings
The number of significant genes shrank from 50 to 21
The test error was halved
Some incorporation of known marker genes

Overall Method Conclusions
The classifier is potentially useful in high-dimensional classification problems
Computations are straightforward
Simultaneously minimizes the error and the number of genes used
Questionable potential for gene identification
Can also be applied in conjunction with unsupervised methods