Aspects of Statistical Modelling & Data Analysis in Gene Expression Genomics. Mike West Duke University

Similar documents
Bayesian Prediction Tree Models

Data analysis and binary regression for predictive discrimination. using DNA microarray data. (Breast cancer) discrimination. Expression array data

Trans-Study Projection of Genomic Biomarkers in Analysis of Oncogene Deregulation and Breast Cancer

ABS04. ~ Inaugural Applied Bayesian Statistics School EXPRESSION

Integrated Analysis of Copy Number and Gene Expression

Bayesian analysis of binary prediction tree models for retrospectively sampled outcomes

Risk-prediction modelling in cancer with multiple genomic data sets: a Bayesian variable selection approach

RNA preparation from extracted paraffin cores:

T. R. Golub, D. K. Slonim & Others 1999

Cross-Study Projections of Genomic Biomarkers: An Evaluation in Cancer Genomics

Computer Science, Biology, and Biomedical Informatics (CoSBBI) Outline. Molecular Biology of Cancer AND. Goals/Expectations. David Boone 7/1/2015

Bayesian Inference for Single-cell ClUstering and ImpuTing (BISCUIT) Elham Azizi

Advanced Bayesian Models for the Social Sciences

LETTERS. Oncogenic pathway signatures in human cancers as a guide to targeted therapies

Association mapping (qualitative) Association scan, quantitative. Office hours Wednesday 3-4pm 304A Stanley Hall. Association scan, qualitative

Challenges of CGH array testing in children with developmental delay. Dr Sally Davies 17 th September 2014

Biostatistical modelling in genomics for clinical cancer studies

Outline. Outline. Phillip G. Febbo, MD. Genomic Approaches to Outcome Prediction in Prostate Cancer

Advanced Bayesian Models for the Social Sciences. TA: Elizabeth Menninga (University of North Carolina, Chapel Hill)

Identification of Tissue Independent Cancer Driver Genes

Predictive Assays in Radiation Therapy

Using Bayesian Networks to Analyze Expression Data. Xu Siwei, s Muhammad Ali Faisal, s Tejal Joshi, s

Chapter 11: Advanced Remedial Measures. Weighted Least Squares (WLS)

Case Studies on High Throughput Gene Expression Data Kun Huang, PhD Raghu Machiraju, PhD

Layered-IHC (L-IHC): A novel and robust approach to multiplexed immunohistochemistry So many markers and so little tissue

Breast cancer: Molecular STAGING classification and testing. Korourian A : AP,CP ; MD,PHD(Molecular medicine)

VL Network Analysis ( ) SS2016 Week 3

Statistical Analysis of Single Nucleotide Polymorphism Microarrays in Cancer Studies

In-Vitro to In-Vivo Factor Profiling in Expression Genomics. 1.4 The E2F3 Signature in Ovarian, Lung, and Breast Cancer... 19

Selection and Combination of Markers for Prediction

Roadmap for Developing and Validating Therapeutically Relevant Genomic Classifiers. Richard Simon, J Clin Oncol 23:

Comparison of segmentation methods in cancer samples

Introduction to Gene Sets Analysis

Screening for novel oncology biomarker panels using both DNA and protein microarrays. John Anson, PhD VP Biomarker Discovery

Award Number: W81XWH TITLE: Characterizing an EMT Signature in Breast Cancer. PRINCIPAL INVESTIGATOR: Melanie C.

Cancer outlier differential gene expression detection

Bayesian Models for Combining Data Across Subjects and Studies in Predictive fmri Data Analysis

Interaction of Genes and the Environment

Proteomic Biomarker Discovery in Breast Cancer

Understanding DNA Copy Number Data

Supplementary Information Titles Journal: Nature Medicine

MBios 478: Systems Biology and Bayesian Networks, 27 [Dr. Wyrick] Slide #1. Lecture 27: Systems Biology and Bayesian Networks

Complex Trait Genetics in Animal Models. Will Valdar Oxford University

MS&E 226: Small Data

Bayesian graphical models for combining multiple data sources, with applications in environmental epidemiology

SSM signature genes are highly expressed in residual scar tissues after preoperative radiotherapy of rectal cancer.

Bayesian Variable Selection Tutorial

CHAPTER II EXPLORATORY FACTOR ANALYSIS

Practical challenges that copy number variation and whole genome sequencing create for genetic diagnostic labs

Introduction to Discrimination in Microarray Data Analysis

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014

Gene expression analysis. Roadmap. Microarray technology: how it work Applications: what can we do with it Preprocessing: Classification Clustering

Prosigna BREAST CANCER PROGNOSTIC GENE SIGNATURE ASSAY

Prosigna BREAST CANCER PROGNOSTIC GENE SIGNATURE ASSAY

Mammographic density and risk of breast cancer by tumor characteristics: a casecontrol

An Introduction to Bayesian Statistics

Intelligent Systems. Discriminative Learning. Parts marked by * are optional. WS2013/2014 Carsten Rother, Dmitrij Schlesinger

Profili di espressione genica

FISH mcgh Karyotyping ISH RT-PCR. Expression arrays RNA. Tissue microarrays Protein arrays MS. Protein IHC

International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 1, Jan Feb 2017

Applying Tissue Phenomics to Colorectal Clinical Questions

Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm

Janet E. Dancey NCIC CTG NEW INVESTIGATOR CLINICAL TRIALS COURSE. August 9-12, 2011 Donald Gordon Centre, Queen s University, Kingston, Ontario

Data Analysis Using Regression and Multilevel/Hierarchical Models

Final Project Report Sean Fischer CS229 Introduction

Nature Genetics: doi: /ng Supplementary Figure 1. Rates of different mutation types in CRC.

Triple Negative Breast Cancer

Statistical Assessment of the Global Regulatory Role of Histone. Acetylation in Saccharomyces cerevisiae. (Support Information)

Bayesian methods for combining multiple Individual and Aggregate data Sources in observational studies

Bayesian hierarchical modelling

Age-Specific Differences in Oncogenic Pathway Deregulation Seen in Human Breast Tumors

Big Image-Omics Data Analytics for Clinical Outcome Prediction

Chapter 1. Introduction

Comparison of discrimination methods for the classification of tumors using gene expression data

Russian Journal of Agricultural and Socio-Economic Sciences, 3(15)

Analysis of acgh data: statistical models and computational challenges

Relationship between genomic features and distributions of RS1 and RS3 rearrangements in breast cancer genomes.

EECS 433 Statistical Pattern Recognition

Bayesian Variable Selection Tutorial

Next-Gen Analytics in Digital Pathology

Exercises: Differential Methylation

Good Old clinical markers have similar power in breast cancer prognosis as microarray gene expression profilers q

Predicting Kidney Cancer Survival from Genomic Data

National Surgical Adjuvant Breast and Bowel Project (NSABP) Foundation Annual Progress Report: 2009 Formula Grant

Statistical Analysis of Biomarker Data

FAQs for UK Pathology Departments

Nature Methods: doi: /nmeth.3115

Microarray Analysis and Liver Diseases

Classification of cancer profiles. ABDBM Ron Shamir

SLAUGHTER PIG MARKETING MANAGEMENT: UTILIZATION OF HIGHLY BIASED HERD SPECIFIC DATA. Henrik Kure

MolEcular Taxonomy of BReast cancer International Consortium (METABRIC)

Graphical Modeling Approaches for Estimating Brain Networks

New Enhancements: GWAS Workflows with SVS

Contemporary Classification of Breast Cancer

CS2220 Introduction to Computational Biology

Using CART to Mine SELDI ProteinChip Data for Biomarkers and Disease Stratification

Airway epithelial gene expression in the diagnostic evaluation of smokers with suspect lung cancer

Appendix 1. Sensitivity analysis for ACQ: missing value analysis by multiple imputation

Benefits and pitfalls of new genetic tests

A REVIEW OF BIOINFORMATICS APPLICATION IN BREAST CANCER RESEARCH

Transcription:

Aspects of Statistical Modelling & Data Analysis in Gene Expression Genomics Mike West Duke University

Papers, software, many links: www.isds.duke.edu/~mw ABS04 web site: Lecture slides, stats notes, papers, data, links: www.isds.duke.edu/~mw/abs04 Integrated Cancer Biology Program icbp.genome.duke.edu Genome Institute @ Duke www.genome.duke.edu

Flightplan #1 #2 #3 #4 #5 #6 Genomics, Microarrays, Data: Big picture Bayesics - Regression and Shrinkage: Gene expression as predictors Patterns and Factors: Prediction via pattern profiling Sparse Modelling: Regression subset-structure uncertainty Sparse Models and Profiling: Gene expression as response: Designed experiments Sparse Models and Profiling: Gene expression as response: Latent factor models

PART 1: Sunday 11 th September 2005

Flightplan #1 #2 #3 #4 #5 #6 Genomics, Microarrays, Data: Big picture Bayesics - Regression and Shrinkage: Gene expression as predictors Patterns and Factors: Prediction via pattern profiling Sparse Modelling: Regression subset-structure uncertainty Sparse Models and Profiling: Gene expression as response: Designed experiments Sparse Models and Profiling: Gene expression as response: Latent factor models

Transitions in Biology: Data and Observation Observational science Molecular science Genomic science Data: Scale, Complexity Computational & Statistical Science

Phenotypes & data Low resolution phenotypes Small worlds, small data Breast cancer: Lymph node involvement Hormone receptor status Tumor size Visual assessment

Phenotypes & DATA Higher resolution Genome scale, big data Increased understanding Improved prediction

Modelling Genomic Data for Prediction Genomic Medicine Personalised Prognostics Integrated Clinico-Genomic Models (Breast cancer Pittman et al PNAS 04)

Genomic Data: Opportunities and Challenges Synthesis... or Translation Gene expression profiles: Signatures of states Laboratory/In vitro Laboratory/Animal models Human Observational Studies Human Clinical Studies

Genomic Data? Biological/disease state DNA copy (CGH) Proteomics DNA methylation Genotypes host host Serum gene expression SNPs Haplotypes Serum-metabolics Serum-proteomics Environmental factors Gene expression Metabolics Clinical markers

Affymetrix DNA Microarray Data Gene probesets Imaging/Scanning 100Mb raw data Expression intensity estimates X+/-S p genes, n samples Background, noise, gross defects, Cross-hybridization Sample-sample normalisation Low level data processing, analysis West et al 2001 Wong & Li (dchip) 2001 Bolstad, Irizarry, Speed et al 2003a,b RMA estimates www.bioconductor.org

First Generation Microarrays: Messy Data Multiple expression data sets Multiple array technologies Multiple species: genome A B mappings Same array platform: sample/lab/study/gene effects Assay/batch/reagent/hybridisation sensitivities... Sporadic - Sparse

Flightplan #1 #2 #3 #4 #5 #6 Genomics, Microarrays, Data: Big picture Bayesics - Regression and Shrinkage: Gene expression as predictors Patterns and Factors: Prediction via pattern profiling Sparse Modelling: Regression subset-structure uncertainty Sparse Models and Profiling: Gene expression as response: Designed experiments Sparse Models and Profiling: Gene expression as response: Latent factor models

Bayesics: Regression Gene expression as covariates (predictors) Molecular phenotyping: Predict aggressive vs. benign Disease susceptible vs. resistance Drug/treatment response Finding genes linked to response Patterns of association among genes Signatures of effect multiple genes

Regression Models and Shrinkage Phenotype z H=Subsets of genes LSE: (minimal) Bayes: Shrinkage priors Decision theory Regularisation - Ridge regression Key with many predictors Relevance of zero-mean location Prior: Posterior: Shrinkage:

Degrees and Dimensions of Shrinkage LSE as limiting case no shrinkage: Shrinks when it matters weak/no association Acts against over-fitting, improves stability and robustness in prediction Multiple shrinkage Shrinks out irrelevant covariates

Computation: MCMC in Regression Simulate Posterior: Iteratively resample conditional posteriors Sample means, histograms MC approximation of posterior

Computation: MCMC in Regression Modules in MCMC e.g. response error variance

Binary Regression Binary = thresholded latent continuous probit~normal, logit~logistic, Natural model/intepretation Computationally nice

Computation: MCMC in Binary Regression Linear model if z known Add module to impute latent z MC samples for z Easy summary, prediction

Binary Regression y=0/1 (ER -/+) Protein assay Immunohistochemical staining 0/1 (0-3) Basic Examples: Breast Cancer Data ER (O)Estrogen Receptor Status HER2 hormone status Lymph node (recurrence risk) status Frozen tumour: Gene expression Higher resolution Future clinical tests: Pr(ER+) nuclei of breast epithelial cells cytoplasm of breast epithelial cells ER positive tumour IHC for Estrogen Receptor (~60x magnification) brown-red & pink ~ ER+ nucleii of stromal cells; collagen

Prediction and {Gene, Variable, Feature} Selection Leave-one-out Cross-Validation (CV) analysis: (PNAS 2001 breast cancer) Honest assessment of precision Heterogeneity, small samples Pr(ER+) Feature/Variable selection Critical (dominant) component of predictive assessment

Prediction and {Gene, Variable, Feature} Selection Predicting lymph node status Pre-selection of 100 genes vs. Honest CV predictions Large p: Small models- Sparsity Pr(LN+) Variable selection, Uncertainty Complex interdependencies Multiplicities

Flightplan #1 #2 #3 #4 #5 #6 Genomics, Microarrays, Data: Big picture Bayesics - Regression and Shrinkage: Gene expression as predictors Patterns and Factors: Prediction via pattern profiling Sparse Modelling: Regression subset-structure uncertainty Sparse Models and Profiling: Gene expression as response: Designed experiments Sparse Models and Profiling: Gene expression as response: Latent factor models

Many Related Predictors: Patterns as Predictors Patterns of coordinately expressed genes: Signatures Metagenes PCA, SVD of expression data Biologically selected gene subsets Trained subset selection Clusters Cardiovascular disease: High/Low (Seo, West et al 2004)

Empirical Factor Regression SVD: Genes X = Metagene factors F Patterns-Factors underlying X are predictors PCA: X variable set selection p=n: Shrinkage priors key F variable selection

Expression Profiles: Signatures of States Translation of characterising genomic patterns Predictive profiling: oncogenic pathway deregulation Rb+/+ Rb-/- (Huang et al 03, Black et al 03)

Pathway Expression Characterisation Analysis Out-of of-sample prediction Cell line derived signatures predict differences in oncogenic activity in mouse tumours c-myc up-expression Metagene: gene subset & pattern as a predictor (Huang et al 03, Black et al 03)

Oncogene Sub-Pathway Profiles: Translation Myc Ras E2F3 Src β-cat Single Oncogenes pathway characterisation - potential targets Cell lines signatures Human lung cancers (ovarian, breast) Clinical prognostic - clinical evaluation - therapeutic evaluation (Bild et al 05)

Metagenes in Clinico-Genomic Prognostic Models Genomic Medicine Personalised Prognostics Gene expression clustering Metagene factors Non-linear regressions CART models Integration: non-genomic predictors (Breast cancer Pittman et al PNAS 04)

Flightplan #1 #2 #3 #4 #5 #6 Genomics, Microarrays, Data: Big picture Bayesics - Regression and Shrinkage: Gene expression as predictors Patterns and Factors: Prediction via pattern profiling Sparse Modelling: Regression subset-structure uncertainty Sparse Models and Profiling: Gene expression as response: Designed experiments Sparse Models and Profiling: Gene expression as response: Latent factor models

Standard Sparsity Priors in Regression Variable inclusion uncertainty Large p: parsimony sparsity small Sparsity priors: Augment: MCMC computation:

Large p - Shrinkage and Sparsity Model-based, automatic shrinkage - Simultaneous multiple tests Multiple shrinkage: conservative, parsimonious Decision theory/false discovery? Estimation versus Decision? Model/subset probabilities: Issues: Collinearities Multiple related models Computation with very large p (Clyde & George StatSci 04)

Stochastic Search Methods MCMC local search inspired Good models near good models Add/drop/replace variables Move by sampling new model Shoot out ALL neighbours: local proposals Swiftly find high probability regions of model space KEY: easily compute Catalogue of many good models Parallelisation (Hans, Dobra, West 05; Rich et al 2005 p=8400)

Cancer Genomics: Sparse Regressions & Prediction Brain cancer expression: p=8400 Survival regressions: multiple related 3-5 gene subsets key cellular motility/infiltration genes regression model uncertainty in prediction (Cancer Research, 05)

Sparsity and Penalizing Dimension Sparsity - Regression variable in/out probabilities Dimension - Implicit in Bayesian & other likelihood-based analyses (cf. BIC)

Regressions For Graphical Association Models p=8400 Cascade of regression models: - Models to predict/explain gene expression for survival predictive genes -and so on Generate Directed Acyclic Graphical models (DAGs) of association patterns in gene expression Exploratory data analysis, visualization uses http://graphexplore.cgt.duke.edu (Cancer Research, 05)

Sparse Graphs from Sparse Regressions EGFR Brain cancer gene expression Duke Keck Center for Neurooncogenomics (Dobra et al JMVA, 04; Jones et al Stat Sci 05)

Papers, software, many links: www.isds.duke.edu/~mw ABS04 web site: Lecture slides, stats notes, papers, data, links: www.isds.duke.edu/~mw/abs04 Integrated Cancer Biology Program icbp.genome.duke.edu Genome Institute @ Duke www.genome.duke.edu