Case Studies on High Throughput Gene Expression Data Kun Huang, PhD Raghu Machiraju, PhD

Similar documents
T. R. Golub, D. K. Slonim & Others 1999

Classification of cancer profiles. ABDBM Ron Shamir

Gene expression analysis. Roadmap. Microarray technology: how it work Applications: what can we do with it Preprocessing: Classification Clustering

VL Network Analysis ( ) SS2016 Week 3

Computer Science, Biology, and Biomedical Informatics (CoSBBI) Outline. Molecular Biology of Cancer AND. Goals/Expectations. David Boone 7/1/2015

Efficacy of the Extended Principal Orthogonal Decomposition Method on DNA Microarray Data in Cancer Detection

Comparison of discrimination methods for the classification of tumors using gene expression data

Network-assisted data analysis

Nature Methods: doi: /nmeth.3115

SubLasso:a feature selection and classification R package with a. fixed feature subset

38 Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'16

Structural Variation and Medical Genomics

An Improved Algorithm To Predict Recurrence Of Breast Cancer

The 16th KJC Bioinformatics Symposium Integrative analysis identifies potential DNA methylation biomarkers for pan-cancer diagnosis and prognosis

Contemporary Classification of Breast Cancer

Gene Selection for Tumor Classification Using Microarray Gene Expression Data

Deep Learning Analytics for Predicting Prognosis of Acute Myeloid Leukemia with Cytogenetics, Age, and Mutations

EXPression ANalyzer and DisplayER

CNV PCA Search Tutorial

Introduction to Discrimination in Microarray Data Analysis

for the TCGA Breast Phenotype Research Group

Classifica4on. CSCI1950 Z Computa4onal Methods for Biology Lecture 18. Ben Raphael April 8, hip://cs.brown.edu/courses/csci1950 z/

Integrated Analysis of Copy Number and Gene Expression

Journal: Nature Methods

RNA preparation from extracted paraffin cores:

Title: Human breast cancer associated fibroblasts exhibit subtype specific gene expression profiles

Cancer Informatics Lecture

National Surgical Adjuvant Breast and Bowel Project (NSABP) Foundation Annual Progress Report: 2009 Formula Grant

A COMBINATORY ALGORITHM OF UNIVARIATE AND MULTIVARIATE GENE SELECTION

MammaPrint, the story of the 70-gene profile

Defining Actionable Novel Discoveries, Annotating Genomes, and Reanalysis in Cancer A Laboratory Perspective

Statistical Analysis of Biomarker Data

Nature Neuroscience: doi: /nn Supplementary Figure 1

Identifying Thyroid Carcinoma Subtypes and Outcomes through Gene Expression Data Kun-Hsing Yu, Wei Wang, Chung-Yu Wang

Nature Genetics: doi: /ng Supplementary Figure 1. SEER data for male and female cancer incidence from

SUPPLEMENTARY INFORMATION

Genetic alterations of histone lysine methyltransferases and their significance in breast cancer

Figure S2. Distribution of acgh probes on all ten chromosomes of the RIL M0022

FISH mcgh Karyotyping ISH RT-PCR. Expression arrays RNA. Tissue microarrays Protein arrays MS. Protein IHC

Using CART to Mine SELDI ProteinChip Data for Biomarkers and Disease Stratification

Cancer outlier differential gene expression detection

Comparison of Gene Set Analysis with Various Score Transformations to Test the Significance of Sets of Genes

Data analysis in microarray experiment

Evaluation of public cancer datasets and signatures identifies TP53 mutant signatures with robust prognostic and predictive value

SUPPLEMENTARY INFORMATION

Quantitative Biology Lecture 1 (Introduction + Probability)

SUPPLEMENTARY APPENDIX

UNIVERSITI TEKNOLOGI MARA COPY NUMBER VARIATIONS OF ORANG ASLI (NEGRITO) FROM PENINSULAR MALAYSIA

Predicting clinical outcomes in neuroblastoma with genomic data integration

Figure S1. Analysis of endo-sirna targets in different microarray datasets. The

Single SNP/Gene Analysis. Typical Results of GWAS Analysis (Single SNP Approach) Typical Results of GWAS Analysis (Single SNP Approach)

Multigene Testing in NCCN Breast Cancer Treatment Guidelines, v1.2011

Supplementary Figures

Breast cancer. Risk factors you cannot change include: Treatment Plan Selection. Inferring Transcriptional Module from Breast Cancer Profile Data

Supplementary information for: Human micrornas co-silence in well-separated groups and have different essentialities

Expert-guided Visual Exploration (EVE) for patient stratification. Hamid Bolouri, Lue-Ping Zhao, Eric C. Holland

Comparison of Triple Negative Breast Cancer between Asian and Western Data Sets

Nature Genetics: doi: /ng Supplementary Figure 1. PCA for ancestry in SNV data.

Supplementary note: Comparison of deletion variants identified in this study and four earlier studies

Canadian College of Medical Geneticists (CCMG) Cytogenetics Examination. May 4, 2010

MicroRNA expression profiling and functional analysis in prostate cancer. Marco Folini s.c. Ricerca Traslazionale DOSL

Nature Genetics: doi: /ng Supplementary Figure 1

SSM signature genes are highly expressed in residual scar tissues after preoperative radiotherapy of rectal cancer.

The Tail Rank Test. Kevin R. Coombes. July 20, Performing the Tail Rank Test Which genes are significant?... 3

Mutational Impact on Diagnostic and Prognostic Evaluation of MDS

Aspects of Statistical Modelling & Data Analysis in Gene Expression Genomics. Mike West Duke University

Award Number: W81XWH TITLE: Characterizing an EMT Signature in Breast Cancer. PRINCIPAL INVESTIGATOR: Melanie C.

Molecular Methods in the Diagnosis and Prognostication of Melanoma: Pros & Cons

Wen et al. (1998) PNAS, 95:

Airway epithelial gene expression in the diagnostic evaluation of smokers with suspect lung cancer

Breast cancer classification: beyond the intrinsic molecular subtypes

Computational Identification and Prediction of Tissue-Specific Alternative Splicing in H. Sapiens. Eric Van Nostrand CS229 Final Project

Genomics in the Clinical Practice - Today and Tomorrow

Probability-Based Protein Identification for Post-Translational Modifications and Amino Acid Variants Using Peptide Mass Fingerprint Data

Chapter 1. Introduction

1.5. Research Areas Treatment Selection

5/2/18. After this class students should be able to: Stephanie Moon, Ph.D. - GWAS. How do we distinguish Mendelian from non-mendelian traits?

User Guide. Association analysis. Input

Nature Genetics: doi: /ng Supplementary Figure 1. Somatic coding mutations identified by WES/WGS for 83 ATL cases.

Detection of copy number variations in PCR-enriched targeted sequencing data

MODEL-BASED CLUSTERING IN GENE EXPRESSION MICROARRAYS: AN APPLICATION TO BREAST CANCER DATA

micrornas (mirna) and Biomarkers

Module 3: Pathway and Drug Development

MOST: detecting cancer differential gene expression

CS2220 Introduction to Computational Biology

Hybridized KNN and SVM for gene expression data classification

Corporate Medical Policy

International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 1, Jan Feb 2017

Models of cell signalling uncover molecular mechanisms of high-risk neuroblastoma and predict outcome

Digitizing the Proteomes From Big Tissue Biobanks

Implementation of nation-wide molecular testing in oncology in the French Health care system : quality assurance issues & challenges

Cross-Study Projections of Genomic Biomarkers: An Evaluation in Cancer Genomics

ncounter Assay Automated Process Immobilize and align reporter for image collecting and barcode counting ncounter Prep Station

IN SPITE of a very quick development of medicine within

SUPPLEMENTARY INFORMATION

MammaPrint Improving treatment decisions in breast cancer Support and Involvement of EU

CHAPTER 6. Conclusions and Perspectives

Supplementary Figure 1: Comparison of acgh-based and expression-based CNA analysis of tumors from breast cancer GEMMs.

Van test naar diagnose naar

ACE ImmunoID Biomarker Discovery Solutions ACE ImmunoID Platform for Tumor Immunogenomics

Transcription:

Case Studies on High Throughput Gene Expression Data Kun Huang, PhD Raghu Machiraju, PhD Department of Biomedical Informatics Department of Computer Science and Engineering The Ohio State University

Review

3 Figure by Lawrence Berkeley Lab Human Genome Center, Berkeley, California, USA

Synopsis - http://www.jsonline.com/news/health/112248249.html Causes - http://media.journalinteractive.com/documents/ dna2gr.pdf 4

http://cbm.msoe.edu/scienceolympiad/ module2012/xiaphome.html 5

6 Glossary - http://www.jsonline.com/news/health/ 105196904.html

Reading Assignment Chapter 2, Text 7

Outline What can we do with high throughput gene expression data? Profiling and comparative studies gene signatures Machine learning approaches biomarker discovery Correlation analysis gene function prediction using co-expression Network analysis network biomarkers Genomic variants the drivers (?) 8

Affymetrix GeneChip silicon chip oligonucleiotide probes lithographically synthesized on the array

How does microarray work?

How does microarray work?

Two-channel microarray Printed microarrays Long probe oligonucleotides (80-100) long are printed on the glass chip Comparative hybridization experiment

Reading Material Chapter 3, Text 13 13

Outline What can we do with high throughput gene expression data? Profiling and comparative studies Machine learning approaches biomarker discovery Correlation analysis gene function prediction using co-expression Network analysis network biomarkers 14

Data Analysis Raw data (image) QC (artifact) Probe level data PM vs MM probes Normalization Visualization QC (MA plot, box plot, batch effect, PCA) Comparative analysis Clustering analysis Classification (feature selection) Enrichment analysis Pathway and network analysis Reading for probes

Hypothesis Testing Pick a dataset from GEO

Hypothesis Testing Two set of samples sampled from two distributions

New Resource https://insilicodb.com/browse

Outline What can we do with high throughput gene expression data? Profiling and comparative studies Machine learning approaches biomarker discovery Correlation analysis gene function prediction using co-expression Network analysis network biomarkers 19

A Case Study 20 20

Breast Cancer Prognosis Marker 21 http://www.nature.com/nature/journal/v415/n6871/full/415530a.html

Derive Prognostic Markers Gene expression microarray data for 295 breast cancer patients including 69 ER-negative ones Survival time is a major evaluation criteria for cancer treatment, thus developing prognosis markers is particularly important for cancers Full clinical information ER, PR, HERR2 status, survival status and time, metastasis status, lymph node status, etc 22

Unsupervised Analysis the Data Clustering 23

24 Gene Feature Selection

Supervised Analysis Test Validation 25

Gene Markers The 70 gene signature was later incorporated into a test called Mammaprint and is used clinically (FDA approved) for breast cancer prognosis Other clinically used gene signatures also exists such as OncoDX, PAM50 Prognosis signature is not enough for personalized treatment. Ongoing work is more and more shifted toward Predictive maker discovery (e..g, to predict drug response). 26

Another Case Study 27 27

http://www.ncbi.nlm.nih.gov/pubmed/10521349 28

Goal - Stratification www.google.com J 29

PubMed Id: 10521349 30

Also Available @ http://www.broadinstitute.org/cgi-bin/cancer/ publications/pub_paper.cgi? mode=view&paper_id=43 http://www.broadinstitute.org/cancer/software/ genepattern/datasets/ 31

The Premise Class discovery to automatically distinguish between - acute myeloid leukemia: AML - acute lymphoblastic leukemia: ALL No knowledge of these classes Genetic causes known - specific chromosomal translocations Use microarrays 32

The Data 38 bone marrow samples (27 ALL, 11 AML) Obtained from patients at time of diagnosis RNA from bone marrow mononuclear cells Affymetrix microarrays Probes for 6817 human genes 33

The Methods Schematic illustration of methodology. (a) Strategy for cancer classification. Tumor classes may be known a priori or discovered on the basis of the expression data by using Self-Organizing Maps (SOMs) as described in the text. Class Prediction involves assignment of an unknown tumor sample to the appropriate class on the basis of gene expression pattern. This consists of several steps: neighborhood analysis to assess whether there is a significant excess of genes correlated with the class distinction, selection of the informative genes and construction of a class predictor, initial evaluation of class prediction by cross-validation, and final evaluation by testing in an independent data set. 34

Supervised Analysis 35 Neighborhood Analysis. The class distinction is represented by an 'idealized expression pattern c, in which the expression level is uniformly high in class 1 and uniformly low in class 2. Each gene is represented by an expression vector, consisting of its expression level in each of the tumor samples. In the figure, the dataset consists of 12 samples comprised of 6 AMLs and 6 ALLs. Gene g1 is well correlated with the class distinction, while g2 is poorly correlated. Neighborhood analysis involves counting the number of genes having various levels of correlation with c. The results are compared to the corresponding distribution obtained for random idealized expression patterns c*, obtained by randomly permuting the coordinates of c. An unusually high density of genes indicates that there are many more genes correlated with the pattern than expected by chance. The precise measure of distance and other methodological details are described in notes (16,17) and on our web site.

Prediction The prediction of a new sample is based on 'weighted votes' of a set of informative genes. Each such gene gi votes for either AML or ALL, depending on whether its expression level x_i in the sample is closer to mu_aml or mu_all (which denote, respectively, the mean expression levels of AML and ALL in a set of reference samples). The magnitude of the vote is w_i v_i, where w_i is a weighting factor that reflects how well the gene is correlated with the class distinction and v_i = x_i - (mu_aml + mu_all)/2 reflects the deviation of the expression level in the sample from the average of mu_aml and mu_all. The votes for each class are summed to obtain total votes V_AML and V_ALL. The sample is assigned to the class with the higher vote total, provided that the prediction strength exceeds a predetermined threshold. The prediction strength reflects the margin of victory and is defined as (V_win-V_lose)/ (V_win+V_lose), where as V_win and V_lose are the respective vote totals for the winning and losing classes. Methodological details are described in the paper (notes 19,20). 36

Testing Prediction strengths. The scatterplots show the prediction strengths (PS) for the samples in cross-validation (left) and on the independent sample (right). Median PS is denoted by a horizontal line. Predictions with PS below 0.3 are considered as uncertain. 37

Results Genes distinguishing ALL from AML. The 50 genes most highly correlated with the ALL/AML class distinction are shown. Each row corresponds to a gene, with the columns corresponding to expression levels in different samples. Expression levels for each gene are normalized across the samples such that the mean is 0 and the standard deviation is 1. Expression levels greater than the mean are shaded in red, and those below the mean are shaded in blue. The scale indicates standard deviations above or below the mean. The top panel shows genes highly expressed in ALL, the bottom panel shows genes more highly expressed in AML. Note that while these genes as a group appear correlated with class, no single gene is uniformly expressed across the class, illustrating the value of a multi-gene prediction method. 38

Outline What can we do with high throughput gene expression data? Profiling and comparative studies Machine learning approaches biomarker discovery Correlation analysis gene function prediction using co-expression Network analysis network biomarkers 39

Gene Co-Expression http://www.ncbi.nlm.nih.gov/pubmed/17922014 40

Gene Co-Expression Correlated gene expression profiles Both positive and negative correlation Why do genes co-express? Functional reasons? Mechanistic reasons? Genetic reasons? 41

Gene Co-Expression Genes co-express with multiple anchor genes Anchor genes were selected from BRCA1 pathways Goal using co-expression relationships to identify new breast cancer genes 42

Gene Co-Expression HMMR was ranked highest Experimental validation sirna silencing of HMMR led to similar phenotypes in HeLa and breast cancer cells as knockout of BRCA1 multiple centrosomes GWAS study (using CGEMS data) showed that certain mutation on HMMR is associated with increased breast cancer risk HMMR sirna 43

Outline What can we do with high throughput gene expression data? Profiling and comparative studies Machine learning approaches biomarker discovery Correlation analysis gene function prediction using co-expression Network analysis network biomarkers 44

Integrative Genomics Multiple Phenotypes 45

Microarray Data + Protein Interaction Networks Systems approach Instead of which genes changed to which part of the network is perturbed? Network-based classification of breast cancer metastasis. (Chung et al, Mol. Sys. Bio., 2007) 46

47 Microarray Data + Protein Interaction Networks

Microarray Data + Protein Interaction Networks Defining gene network activity score Correlate the activity score with different disease conditions (e.g., metastasis vs non-metastasis) Network mining a greedy approach 48

Microarray Data + Protein Interaction Networks Statistical evaluation of the identified gene modules based on random permutations 49

Microarray Data + Protein Interaction Networks Marker reproducibility and metastasis prediction performance. 50 2007 by European Molecular Biology Organization Han Yu Chuang et al. Mol Syst Biol 2007;3:140

Outline What can we do with high throughput gene expression data? Profiling and comparative studies Machine learning approaches biomarker discovery Correlation analysis gene function prediction using co-expression Network analysis network biomarkers Genomic variants the drivers 51

Iden%fying a causa%ve de novo muta%on Veltman and colleagues - Nat Genet. 2010 Dec;42(12):1109-12 Patient with idiopathic disorder (1) Sequence genome ~22,000 variants (exome re-sequencing) (2) Select only coding mutations MSGTCASTTR MSGTNASTTR ~5,640 coding variants (3) Exclude known variants seen in healthy people ~143 novel coding variants For 6/9 patients, they were able to identify a single likely-causative mutation (4) Sequence parents and exclude their private variants (5) Look at affected gene function and mutational impact ~5 de novo novel coding variants 52

53 53

DREAM 9.5 contest 54

55

Synopsis - http://www.jsonline.com/news/health/112248249.html Causes - http://media.journalinteractive.com/documents/ dna2gr.pdf 56

The End Next Dr. Parvin s Lecture on Sequencing T 57 57