Classification. Methods Course: Gene Expression Data Analysis - Day Five. Rainer Spang


From Ms. Smith to the DNA chip of Ms. Smith to the expression profile of Ms. Smith.

30,000 properties of Ms. Smith. The expression profile is a list of 30,000 numbers that are all properties of Ms. Smith. Some of them reflect her health problem (a tumor); the profile is a digital image of Ms. Smith's tumor. How can these numbers tell us (predict) whether Ms. Smith has tumor type A or tumor type B?

Looking for similarities? Compare Ms. Smith's profile to the profiles of patients with tumor type A and of patients with tumor type B.

Training and Application. There are patients of known class - the training samples - and patients of unknown class - the new samples - such as Ms. Smith.

Statistical Learning. Use the training samples to learn how to predict new samples such as Ms. Smith.

Prediction with 1 gene. Color-coded expression levels of the training samples for types A and B. Depending on where her expression level falls, Ms. Smith is called type A, type B, or borderline. Which color shade is a good decision boundary?

Data-optimal decision rule. Use the cutoff with the fewest misclassifications on the training samples - the smallest training error. The decision boundary lies between the distribution of expression values in type A and the distribution in type B; their overlap determines the training error.
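A minimal sketch of this rule, with invented expression values for one gene: scan the observed values as candidate cutoffs and keep the one with the fewest training misclassifications.

```python
def best_cutoff(values_a, values_b):
    """Return the cutoff with the smallest training error, calling samples
    below the cutoff 'type A' and samples at or above it 'type B'."""
    candidates = sorted(values_a + values_b)
    best_t, best_err = None, float("inf")
    for t in candidates:
        # training error: type-A samples at/above t plus type-B samples below t
        err = sum(v >= t for v in values_a) + sum(v < t for v in values_b)
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

# Hypothetical expression levels of one gene in the training samples
type_a = [1.2, 1.5, 1.9, 2.1]
type_b = [2.4, 2.8, 3.0, 3.3]
cutoff, train_err = best_cutoff(type_a, type_b)
print(cutoff, train_err)  # → 2.4 0
```

On these toy values the classes do not overlap, so the training error is zero; with overlapping distributions the same scan finds the cutoff in the overlap region.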

Overfitting. The decision boundary was chosen to minimize the training error. The two distributions of expression values for types A and B will be similar but not identical in a set of new cases. We cannot adjust the decision boundary, because we do not know the class of the new samples. Test errors are on average bigger than training errors. This phenomenon is called overfitting.

Accumulating information across genes. The top gene vs. the average of the top 10 genes; ALL vs. AML (Golub et al.).

Using a weighted average With good weights you get an improved separation

The geometry of weighted averages. Calculating a weighted average is identical to projecting the expression profiles (orthogonally) onto the line defined by the weight vector.
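This identity can be checked directly (weights and profile below are invented): the weighted average is a dot product, and dividing by the norm of the weight vector gives the length of the orthogonal projection.

```python
import math

def weighted_score(weights, profile):
    """Weighted average of expression values = dot product with the weight vector."""
    return sum(w * x for w, x in zip(weights, profile))

def projection_length(weights, profile):
    """Length of the orthogonal projection of the profile onto the weight vector."""
    norm = math.sqrt(sum(w * w for w in weights))
    return weighted_score(weights, profile) / norm

# Hypothetical 3-gene profile and weights
w = [0.5, 0.5, 0.0]
x = [2.0, 4.0, 100.0]   # the third gene gets weight 0 and is ignored
print(weighted_score(w, x))      # → 3.0
print(projection_length(w, x))   # same quantity, rescaled by 1/||w||
```

The two numbers differ only by the constant factor 1/||w||, which is why a weighted average and an orthogonal projection induce the same ordering of samples.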

Hyperplanes. With 2 genes or 3 genes: together with an offset w0, the weight vector defines an orthogonal hyperplane that cuts the data into two groups (A and B).

Linear Signatures.

Nearest Centroids

Diagonal Linear Discriminant Analysis (DLDA). Rescale the axes according to the variances of the genes.
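A minimal sketch of nearest centroids with DLDA-style variance rescaling; the profiles and per-gene variances below are invented for illustration.

```python
def centroid(profiles):
    """Component-wise mean of a list of expression profiles."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]

def dlda_distance(x, cent, variances):
    """Squared distance with each gene axis rescaled by its variance (DLDA)."""
    return sum((xi - ci) ** 2 / v for xi, ci, v in zip(x, cent, variances))

def classify(x, cent_a, cent_b, variances):
    """Assign the class whose (rescaled) centroid is nearest."""
    da = dlda_distance(x, cent_a, variances)
    db = dlda_distance(x, cent_b, variances)
    return "A" if da < db else "B"

# Hypothetical 2-gene training profiles
group_a = [[1.0, 5.0], [1.2, 5.4]]
group_b = [[3.0, 5.1], [3.2, 5.3]]
variances = [0.5, 0.5]          # assumed pooled per-gene variances
ms_smith = [1.4, 5.2]
print(classify(ms_smith, centroid(group_a), centroid(group_b), variances))  # → A
```

With all variances equal this reduces to plain nearest centroids; unequal variances down-weight noisy genes, which is the point of the rescaling.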

Discriminant Analysis. The data often shows evidence of non-identical covariances of genes in the two groups. Hence using LDA, DLDA or NC introduces a bias - but a good bias.

Gene Filtering. Rank genes according to a score, choose the top n genes, and build a signature with these genes only. There are still 30,000 weights, but most of them are zero. Note that the data decides which weights are zero and which are not. Limitation: with single-gene scores you have no chance to find genes that are only informative in combination.

How many genes? Is this a biological or a statistical question? Biology: How many genes carry diagnostic information? Statistics: How many genes should we use for classification? The microarray offers 30,000 genes or more.

Finding the needle in the haystack. A common myth: classification information is restricted to a small number of genes, and the challenge is to find them.

The Avalanche. Aggressive lymphomas with and without a MYC breakpoint (MYC-neg vs. MYC-pos). Verbundprojekt maligne Lymphome.

Genes, Bias & Overfitting. With more genes, the gap between training error and test error becomes wider. There is a good statistical reason for not including hundreds of genes in a model, even if they are biologically affected.

Soft Thresholding. The shrunken centroid method and the PAM package (Tibshirani et al. 2002).

Centroid Shrinkage
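The core of centroid shrinkage is soft thresholding: each per-gene centroid offset is moved toward zero by a fixed amount, and offsets smaller than that amount become exactly zero, removing the gene from the signature. A minimal sketch with invented offsets:

```python
def soft_threshold(d, delta):
    """Shrink d toward zero by delta; values with |d| < delta become exactly 0."""
    sign = 1.0 if d >= 0 else -1.0
    return sign * max(abs(d) - delta, 0.0)

# Hypothetical per-gene offsets of a class centroid from the overall centroid
offsets = [2.5, -0.3, 0.1, -1.7]
shrunken = [soft_threshold(d, delta=0.5) for d in offsets]
print(shrunken)  # → [2.0, -0.0, 0.0, -1.2]
```

Genes whose centroids barely differ between classes are zeroed out, so shrinkage performs gene selection and noise reduction in one step.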

How much shrinkage is good in PAM? Cross-validation: compute the CV performance for several values of Δ and pick the Δ that gives the smallest number of CV misclassifications. PAM does this adaptive model selection routinely.
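A minimal sketch of this selection loop. The one-gene "shrunken mean" classifier below is a toy stand-in for PAM, and all data and Δ values are invented; the point is the structure: one CV error count per candidate Δ, then take the minimizer.

```python
import random

def cross_val_errors(xs, ys, train_fn, predict_fn, folds=4, seed=0):
    """Generic K-fold cross-validation: count misclassified held-out samples."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    errors = 0
    for f in range(folds):
        test = set(idx[f::folds])
        model = train_fn([(xs[i], ys[i]) for i in idx if i not in test])
        errors += sum(predict_fn(model, xs[i]) != ys[i] for i in test)
    return errors

# Toy stand-in for PAM on a single gene: class means are shrunk toward the
# overall mean by delta, and samples go to the nearest shrunken mean.
def make_train(delta):
    def train(pairs):
        overall = sum(x for x, _ in pairs) / len(pairs)
        means = {}
        for lab in sorted(set(y for _, y in pairs)):
            vals = [x for x, y in pairs if y == lab]
            d = sum(vals) / len(vals) - overall
            means[lab] = overall + (1.0 if d >= 0 else -1.0) * max(abs(d) - delta, 0.0)
        return means
    return train

def predict(means, x):
    return min(means, key=lambda lab: abs(x - means[lab]))

xs = [1.0, 1.1, 1.3, 0.9, 3.0, 3.2, 2.9, 3.1]
ys = ["A", "A", "A", "A", "B", "B", "B", "B"]
deltas = [0.0, 0.5, 1.0, 2.0]
best = min(deltas, key=lambda d: cross_val_errors(xs, ys, make_train(d), predict))
print(best)  # the delta with the fewest CV misclassifications
```

On this clean toy data no shrinkage is needed, so the smallest Δ wins; on noisy high-dimensional data the CV curve typically has its minimum at an intermediate Δ, as the next slide describes.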

Model Selection Output of PAM. Small Δ: many genes, poor performance due to overfitting. High Δ: few genes, poor performance due to lack of information (underfitting). The optimal Δ is somewhere in the middle.

Signatures you miss. Assume protein A binds to protein B and inhibits it, and the clinical phenotype is caused by active protein A. The predictive information is then in the expression of A minus the expression of B. Naïve idea: don't calculate weights based on single-gene scores, but optimize over all possible hyperplanes.

Two different signatures based on the same genes Calling signature genes markers for a certain disease is misleading!

Only one of these problems exists. Problem 1: no separating line. Problem 2: many separating lines. Why is this a problem?

What about Ms. Smith? This problem is also related to overfitting... more soon

Prediction with 30,000 genes. With the microarray we have more genes than patients. Think about this in three dimensions: there are three genes, two patients with known diagnosis (red and yellow), and Ms. Smith (green). There is always one plane separating red and yellow with Ms. Smith on the yellow side, and a second separating plane with Ms. Smith on the red side. If all points fall onto one line it does not always work; however, for measured values this is very unlikely and never happens in practice.
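This geometric claim can be checked numerically. With three genes and only two training profiles, the weight vector of a hyperplane through the origin (no offset, for simplicity) has three free components, so we can prescribe which side each of the three points lands on and solve for the weights. All profiles below are invented.

```python
def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, 3):
            f = M[r][i] / M[i][i]
            for c in range(i, 4):
                M[r][c] -= f * M[i][c]
    w = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        w[i] = (M[i][3] - sum(M[i][c] * w[c] for c in range(i + 1, 3))) / M[i][i]
    return w

# Hypothetical 3-gene profiles: two training patients and the new sample
red, yellow = [1.0, 0.0, 2.0], [0.0, 1.0, 1.0]
ms_smith = [1.0, 1.0, 0.0]

sides = {}
for side in (+1.0, -1.0):
    # demand: red on the +1 side, yellow on the -1 side, Ms. Smith on `side`
    w = solve3([red, yellow, ms_smith], [1.0, -1.0, side])
    sides[side] = [sum(wi * xi for wi, xi in zip(w, p)) > 0
                   for p in (red, yellow, ms_smith)]
    print(sides[side])  # red/yellow always separated; Ms. Smith on either side
```

Both hyperplanes classify the training data perfectly, yet they give Ms. Smith opposite diagnoses - the training data alone cannot tell which one to trust.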

The overfitting disaster. From the data alone we can neither decide which genes are important for the diagnosis, nor give a reliable diagnosis for a new patient. This has little to do with medicine - it is a geometrical problem.

Consequences If you find a separating signature, it does not mean (yet) that you have a top publication...... in most cases it means nothing.

Meaningless vs. meaningful signatures. There always exist separating signatures caused by overfitting - meaningless signatures. Hopefully there is also a separating signature caused by a disease mechanism - a meaningful signature. We need to learn how to find and validate meaningful signatures.

How to distinguish a meaningful signature from a meaningless one? The meaningless signature might be separating (small training error), but it will not be predictive (large error in applications). The goal is not a separating signature but a predictive signature: good performance in clinical practice!

Back to the problem of many separating hyperplanes Which hyperplane is the best?

Support Vector Machines - large margin classifiers. Fat planes: with an infinitely thin plane the data can always be separated correctly, but not necessarily with a fat one. Again, if a large-margin separation exists, chances are good that we have found something relevant.

Maximal Margin Hyperplane. There are theoretical results that the size of the margin correlates with the test (!) error (V. Vapnik). SVMs are not only optimized to fit the training data but for predictive performance directly.
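The margin of a hyperplane is the smallest signed distance of any training point to it; the SVM picks the hyperplane that maximizes this quantity. A minimal sketch with invented 2-gene data and two hand-picked separating hyperplanes:

```python
import math

def margin(w, b, data):
    """Smallest signed distance y * (w.x + b) / ||w|| over the training set.
    Positive for every point means the hyperplane separates the classes."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, y in data)

# Hypothetical 2-gene training data with labels -1 / +1
data = [([1.0, 1.0], -1), ([2.0, 2.5], -1), ([4.0, 4.0], +1), ([5.0, 3.5], +1)]

# Two hypothetical separating hyperplanes; the SVM would keep the wider one
m_diag = margin([1.0, 1.0], -6.0, data)
m_vert = margin([1.0, 0.0], -3.0, data)
print(m_diag, m_vert)  # the first hyperplane has the larger margin
```

Both hyperplanes have zero training error, so the training error alone cannot distinguish them - the margin can.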

Non-separable training sets. Penalty for an error: the distance to the hyperplane multiplied by a parameter c, which balances over- and underfitting.
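A minimal sketch of such a penalized objective; the exact form below is the standard hinge-loss soft-margin objective, used here as an illustration with invented 1-gene data, not as the specific formula from the slides.

```python
def soft_margin_objective(w, b, data, c):
    """||w||^2 plus c times the total hinge loss: points on the wrong side of
    the margin are penalized in proportion to how far they fall inside it."""
    hinge = sum(max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
                for x, y in data)
    return sum(wi * wi for wi in w) + c * hinge

# Hypothetical non-separable 1-gene data (a +1 sample sits between the -1s)
data = [([1.0], -1), ([2.0], -1), ([1.8], +1), ([3.0], +1)]

# Large c punishes training errors hard (risking overfitting);
# small c tolerates them (risking underfitting)
obj_low = soft_margin_objective([2.0], -4.0, data, c=1.0)
obj_high = soft_margin_objective([2.0], -4.0, data, c=10.0)
print(obj_low, obj_high)
```

The same hyperplane scores very differently under the two values of c, which is exactly the over-/underfitting trade-off the parameter controls.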

Independent Validation. The accuracy of a signature on the data it was learned from is biased because of the overfitting phenomenon. Validating a signature requires independent test data.

Test Sets. Split the data into test and training data.

Selection Bias. The test data must not be used for gene selection or adaptive model selection; otherwise the observed accuracy is biased (selection bias).

Cross Validation. You cannot evaluate a fitted classification model (= signature) using cross-validation; cross-validation only evaluates the algorithm with which the signature was built. Gene selection must be repeated for every relearning step in the cross-validation: in-the-loop gene selection.

Leave-one-out Cross-Validation. Essentially the same, but you leave only one sample out at a time and predict it using the others. Good for small training sets.

Performance Estimation. Estimators of performance have a variance, which can be high: the chance that a meaningless signature produces 100% accuracy on test data is high if the test data includes only a few patients. Nested 10-fold CV over 100 random partitions makes this variance visible; reusing the same data shows no sample variance.

External Validation and Documentation. Documenting a signature is conceptually different from giving a list of genes, although a gene list is what most publications give you. In order to validate a signature on external data or apply it in practice, all model parameters need to be specified, and the scale of the normalized data to which the model refers needs to be specified (add-on normalization).

Establishing a signature. Split the data into training and test data. Training data only: machine learning - select genes, find the optimal number of genes, learn model parameters. Test data only: internal validation. Then: full quantitative specification and external validations.

DOs AND DON'Ts:
1. Decide on your diagnosis model (PAM, SVM, etc.) and don't change your mind later on.
2. Split your profiles randomly into a training set and a test set.
3. Put the data in the test set away... far away.
4. Train your model using only the data in the training set (select genes, define centroids, calculate normal vectors for large margin separators, perform adaptive model selection...) - don't even think of touching the test data at this time.
5. Apply the model to the test data - don't even think of changing the model at this time.
6. Do steps 1-5 only once and accept the result - don't even think of optimizing this procedure.

Questions?

The bias-variance trade-off

Which model is best? Experience: linear models work fine. Sparse data: expression data in high dimensions is sparse, and the data does not contain enough information to identify non-linear structures adequately, even if they exist.