Diagnosis of multiple cancer types by shrunken centroids of gene expression Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu PNAS 99(10):6567-6572, May 14, 2002
Nearest Centroid Classifier
Calculate a centroid for each class: $\bar{x}_{ik} = \sum_{j \in C_k} x_{ij} / n_k$
Calculate the distance between a test sample and each class centroid
Predict the class with the smallest squared distance: $d^2(x^*, \bar{x}_k) = \sum_i (x_i^* - \bar{x}_{ik})^2$
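The two steps above can be sketched in a few lines of NumPy; the function names are illustrative, not from the paper's software.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Compute one centroid per class: xbar_ik = sum_{j in C_k} x_ij / n_k.

    X: (n_samples, n_genes) expression matrix; y: integer class labels.
    Returns (classes, centroids) with centroids of shape (K, n_genes).
    """
    classes = np.unique(y)
    centroids = np.vstack([X[y == k].mean(axis=0) for k in classes])
    return classes, centroids

def nearest_centroid_predict(x_star, classes, centroids):
    """Assign x_star to the class whose centroid minimizes squared distance."""
    d2 = ((x_star - centroids) ** 2).sum(axis=1)
    return classes[np.argmin(d2)]
```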
'Nearest Shrunken Centroid' General idea: shrink each class centroid toward the overall centroid by a 'threshold' amount Δ Advantages: reduces noise and performs gene selection
Class Centroids
Mean expression of gene i in class k: $\bar{x}_{ik} = \sum_{j \in C_k} x_{ij} / n_k$
$i$th component of the overall centroid: $\bar{x}_i = \sum_{j=1}^{n} x_{ij} / n$
Normalize by the Standard Deviation
Let $d_{ik} = \dfrac{\bar{x}_{ik} - \bar{x}_i}{m_k (s_i + s_0)}$
where $s_i$ is the pooled within-class standard deviation for gene i:
$s_i^2 = \dfrac{1}{n-K} \sum_{k=1}^{K} \sum_{j \in C_k} (x_{ij} - \bar{x}_{ik})^2$
and $m_k = \sqrt{1/n_k + 1/n}$, so that $m_k s_i$ is the estimated standard error of the numerator ($s_0$ is a small positive constant, set to the median of the $s_i$)
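A minimal NumPy sketch of these quantities; the function name is illustrative, and the default choice of $s_0$ (median of the $s_i$) follows the paper.

```python
import numpy as np

def standardized_differences(X, y, s0=None):
    """d_ik = (xbar_ik - xbar_i) / (m_k * (s_i + s0)): a t-like statistic
    per gene i and class k. X is (n_samples, n_genes), y holds class labels."""
    n, p = X.shape
    classes = np.unique(y)
    K = len(classes)
    overall = X.mean(axis=0)                                           # xbar_i
    centroids = np.vstack([X[y == k].mean(axis=0) for k in classes])   # xbar_ik
    # pooled within-class variance: s_i^2 = (1/(n-K)) sum_k sum_{j in C_k} (x_ij - xbar_ik)^2
    ss = np.zeros(p)
    for idx, k in enumerate(classes):
        ss += ((X[y == k] - centroids[idx]) ** 2).sum(axis=0)
    s = np.sqrt(ss / (n - K))
    if s0 is None:
        s0 = np.median(s)                    # the paper sets s0 to the median of the s_i
    n_k = np.array([(y == k).sum() for k in classes])
    m = np.sqrt(1.0 / n_k + 1.0 / n)         # m_k
    d = (centroids - overall) / (m[:, None] * (s + s0))   # d_ik
    return d, centroids, overall, s, s0, m
```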
Shrink the d_ik by Soft-Thresholding
Shrunken differences: $d'_{ik} = \mathrm{sign}(d_{ik})\,(|d_{ik}| - \Delta)_+$
New shrunken centroids: $\bar{x}'_{ik} = \bar{x}_i + m_k (s_i + s_0)\, d'_{ik}$
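The soft-threshold operator is one line of NumPy; the $(\cdot)_+$ notation means "positive part", i.e. negative values are clipped to zero.

```python
import numpy as np

def soft_threshold(d, delta):
    """d'_ik = sign(d_ik) * (|d_ik| - delta)_+ :
    shrink every d_ik toward zero by delta; values with |d_ik| <= delta become 0."""
    return np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)
```

Genes whose $d'_{ik}$ is zero for every class drop out of the classifier, which is how the method performs gene selection.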
Soft vs. Hard Thresholding
Hard thresholding: $d'_{ik} = d_{ik} \cdot I(|d_{ik}| > \Delta)$
Compared with soft thresholding, hard thresholding is more jumpy, with higher minimum test error and higher gene expression estimation error
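For contrast, hard thresholding is a keep-or-kill rule: surviving values are not shrunk at all, which is what makes the fitted centroids "jumpy" as Δ varies.

```python
import numpy as np

def hard_threshold(d, delta):
    """d'_ik = d_ik * I(|d_ik| > delta): zero out small values, keep the rest unchanged."""
    return d * (np.abs(d) > delta)
```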
Contrasting the Shrunken Centroids Tibshirani et al, Stat Sci (2003)
Choose Δ by Cross-Validation
Use k-fold cross-validation: divide the data into k roughly equal parts
Fit the model for many values of Δ and use CV to estimate the error
Choose the value of Δ that gives the smallest CV error
Note: assumes a separate test set
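The selection loop can be sketched generically; here `fit` and `predict` are placeholder callables standing in for the shrunken-centroid training and prediction steps, not functions from the paper's software.

```python
import numpy as np

def choose_delta_by_cv(X, y, deltas, fit, predict, k=10, seed=0):
    """k-fold CV: for each candidate delta, average the misclassification
    rate over the k held-out folds; return the delta with smallest CV error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errs = []
    for delta in deltas:
        fold_err = []
        for f in folds:
            train = np.setdiff1d(np.arange(len(y)), f)
            model = fit(X[train], y[train], delta)
            pred = predict(X[f], model)
            fold_err.append(np.mean(pred != y[f]))
        errs.append(np.mean(fold_err))
    return deltas[int(np.argmin(errs))]
```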
Linear Discriminant Analysis
$\delta_k^{LDA}(x^*) = (x^* - \bar{x}_k)^T W^{-1} (x^* - \bar{x}_k) - 2\log\pi_k$
Compute the distance to the centroids; W is the pooled within-class covariance matrix
The shrunken centroid method can be seen as 'a heavily restricted form of LDA, necessary to cope with the large number of variables (genes)'
LDA vs. Nearest Centroid Equivalent if within-class covariance matrix is restricted to diagonal and prior is ignored Relative performance depends on correlation structure of the samples Tibshirani et al, Stat Sci (2003)
Class Probabilities and Discriminant Functions
Purpose: correct for the relative numbers of samples in each class
Expression levels: $x^* = (x_1^*, x_2^*, \ldots, x_p^*)$
Discriminant score: $\delta_k(x^*) = \sum_{i=1}^{p} \dfrac{(x_i^* - \bar{x}'_{ik})^2}{(s_i + s_0)^2} - 2\log\pi_k$
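The score is a prior-adjusted, variance-standardized squared distance; a minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def discriminant_scores(x_star, shrunk_centroids, s, s0, priors):
    """delta_k(x*) = sum_i (x*_i - x'_ik)^2 / (s_i + s0)^2 - 2 log(pi_k).
    shrunk_centroids: (K, p) matrix of shrunken centroids x'_ik.
    Smaller score => closer, more probable class."""
    denom = (s + s0) ** 2
    return ((x_star - shrunk_centroids) ** 2 / denom).sum(axis=1) - 2.0 * np.log(priors)
```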
Class Probabilities and Discriminant Functions
New classification rule: $C(x^*) = l$ if $\delta_l(x^*) = \min_k \delta_k(x^*)$
Gaussian linear discriminant analysis gives class probabilities:
$\hat{p}_k(x^*) = \dfrac{e^{-\delta_k(x^*)/2}}{\sum_{l=1}^{K} e^{-\delta_l(x^*)/2}}$
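Turning scores into probabilities is a softmax over $-\delta_k/2$; subtracting the minimum score first leaves the ratios unchanged but avoids underflow when the scores are large (the stabilization is a standard numerical trick, not part of the paper's formula).

```python
import numpy as np

def class_probabilities(scores):
    """p_k(x*) = exp(-delta_k/2) / sum_l exp(-delta_l/2)."""
    w = np.exp(-0.5 * (scores - scores.min()))  # shift by the min score for stability
    return w / w.sum()
```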
Adaptive Threshold Scaling
Define a scaling vector $(\theta_1, \theta_2, \ldots, \theta_K)$ and include it in $d_{ik}$:
$d_{ik} = \dfrac{\bar{x}_{ik} - \bar{x}_i}{m_k \theta_k s_i}$
Adaptive procedure: start with all $\theta_k = 1$, reduce $\theta_k$ by 10% for the class k with the largest training error, and repeat
Can dramatically reduce the total number of genes used without increasing the error rate
Overall Model Predictive Analysis of Microarrays Typically accurate classifier Minimizes number of genes and error Results simple to understand Software available at: http://www-stat.stanford.edu/~tibs/pam
Diagnosis of multiple cancer types by shrunken centroids of gene expression
Goal 'Classify and predict the diagnostic category of a sample on the basis of its gene expression profile' Use a simple approach that performs well and is easy to interpret
Small Round Blue Cell Tumors Occur in children and young adults, with a male predominance Present as a large mass within the abdomen, usually the pelvic region Aggressive tumors with a poor prognosis
Experimental Data Expression measurements on 2,308 genes from cDNA microarrays 4 tumor classes: Burkitt lymphoma (BL) Ewing sarcoma (EWS) Neuroblastoma (NB) Rhabdomyosarcoma (RMS) 88 total samples 63 training samples 25 test samples (including 5 control samples)
'Reference 5' Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks Javed Khan, Jun S. Wei, Markus Ringner, Lao H. Saal, Marc Ladanyi, Frank Westermann, Frank Berthold, Manfred Schwab, Cristina R. Antonescu, Carsten Peterson, and Paul S. Meltzer Nature Medicine 7(6):673-679, June 2001
The Artificial Neural Network
Khan et al Conclusions Development of a linear network Report 0% training and test errors Decided that 96 genes were 'important' 61 genes specifically expressed in a cancer type 41 had not been previously reported as associated with these diseases
Apply PAM Method Utilize nearest shrunken centroid method Eliminate noisy genes
Shrunken Centroids
Determination of Δ
Soft thresholding using shrinkage parameter Δ: $d'_{ik} = \mathrm{sign}(d_{ik})\,(|d_{ik}| - \Delta)_+$
10-fold cross-validation: error minimized for Δ = 4.34
tr = training set, cv = cross-validated set, te = test set
The Genes That Matter
The Genes That Matter
Heat Map Comparisons Tibshirani - 43 Genes Shrunken Centroid Method Khan - 96 Genes Artificial Neural Network Method
Class Probabilities Classified by True Class Classified by Predicted Class 'All 63 of the training samples and all 20 of the test samples known to be SRBCT are correctly classified'
Findings 43 important genes identified 27 also found by neural network method 1 of 8 presently considered to be diagnostic for SRBCTs Discusses other genes that play oncogenic roles
Conclusions 'The method of nearest shrunken centroids was successful in finding genes that accurately predict classes' 'The efficiency of our method in finding a relatively small number of predictive genes will facilitate the search for new diagnostic tools' 'The success of our methodology has implications for improving the diagnosis of cancer'
Leukemia Example Another example of shrunken centroid classification Data present a 2-class problem 7,129 total genes and 34 total samples 20 acute lymphocytic leukemia (ALL) 14 acute myelogenous leukemia (AML) Data were previously classified by Golub et al using a linear scoring procedure
Golub Gene Selection
Correlation measure: $P_i = \dfrac{\bar{x}_{i1} - \bar{x}_{i2}}{s_{i1} + s_{i2}}$
Weighted vote over the selected gene set $S(m)$:
$G(x^*) = \sum_{i \in S(m)} \dfrac{\bar{x}_{i1} - \bar{x}_{i2}}{s_{i1} + s_{i2}} \left( x_i^* - \dfrac{\bar{x}_{i1} + \bar{x}_{i2}}{2} \right)$
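A small sketch of Golub's weighted vote, assuming per-class means and standard deviations have already been computed; the function name is illustrative.

```python
import numpy as np

def golub_weighted_vote(x_star, xbar1, xbar2, s1, s2, selected):
    """G(x*) = sum_{i in S(m)} a_i * (x*_i - b_i), with per-gene
    a_i = (xbar_i1 - xbar_i2) / (s_i1 + s_i2)  (the correlation measure)
    b_i = (xbar_i1 + xbar_i2) / 2              (midpoint of the class means).
    G > 0 votes for class 1, G < 0 for class 2."""
    a = (xbar1 - xbar2) / (s1 + s2)
    b = 0.5 * (xbar1 + xbar2)
    return float(np.sum(a[selected] * (x_star[selected] - b[selected])))
```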
2-Class Discriminant Scores
Original discriminant score equation: $\delta_k(x^*) = \sum_{i=1}^{p} \dfrac{(x_i^* - \bar{x}'_{ik})^2}{(s_i + s_0)^2} - 2\log\pi_k$
2-class equation:
$l(x^*) = \tfrac{1}{2}\left[\delta_2(x^*) - \delta_1(x^*)\right] = \sum_{i \in S(\Delta)} \dfrac{(\bar{x}'_{i1} - \bar{x}'_{i2})\, x_i^*}{(s_i + s_0)^2} - \sum_{i \in S(\Delta)} \dfrac{(\bar{x}'_{i1} - \bar{x}'_{i2})(\bar{x}'_{i1} + \bar{x}'_{i2})}{2 (s_i + s_0)^2} + \log\dfrac{\pi_1}{\pi_2}$
Classify to class 1 when $l(x^*) > 0$
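The 2-class score is linear in $x^*$; a short sketch (function name illustrative) that can be checked directly against the half-difference of the two discriminant scores:

```python
import numpy as np

def two_class_score(x_star, c1, c2, s, s0, pi1, pi2):
    """l(x*) = (1/2) * (delta_2(x*) - delta_1(x*)); classify to class 1 when l > 0.
    c1, c2 are the shrunken class centroids restricted to the surviving genes
    (over all genes the shrunk-to-zero differences contribute nothing)."""
    denom = (s + s0) ** 2
    diff = c1 - c2
    return (np.sum(diff * x_star / denom)
            - np.sum(diff * (c1 + c2) / (2.0 * denom))
            + np.log(pi1 / pi2))
```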
Methodological Comparison Variance vs. standard deviation Hard vs. soft thresholding Cross-validation (number of genes m vs. Δ) Added features
Leukemia Classification tr = training set, cv = cross-validated set, te = test set
Findings Significant genes shrunk from 50 to 21 Halved test error Some incorporation of marker genes
Overall Method Conclusions Classifier is potentially useful in high-dimensional classification problems Straightforward computations Simultaneous minimizing of error and number of genes used Questionable potential for gene identification Can also be applied in conjunction with unsupervised methods