Integration of high-throughput biological data

Similar documents
Introduction to Gene Sets Analysis

Analysis of paired mirna-mrna microarray expression data using a stepwise multiple linear regression model

Gene-microRNA network module analysis for ovarian cancer

Gene Ontology and Functional Enrichment. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein

Computer Science, Biology, and Biomedical Informatics (CoSBBI) Outline. Molecular Biology of Cancer AND. Goals/Expectations. David Boone 7/1/2015

Identification of Tissue Independent Cancer Driver Genes

A quick review. The clustering problem: Hierarchical clustering algorithm: Many possible distance metrics K-mean clustering algorithm:

Inferring condition-specific mirna activity from matched mirna and mrna expression data

Single SNP/Gene Analysis. Typical Results of GWAS Analysis (Single SNP Approach) Typical Results of GWAS Analysis (Single SNP Approach)

EXPression ANalyzer and DisplayER

Bootstrapped Integrative Hypothesis Test, COPD-Lung Cancer Differentiation, and Joint mirnas Biomarkers

Comparison of discrimination methods for the classification of tumors using gene expression data

Expert-guided Visual Exploration (EVE) for patient stratification. Hamid Bolouri, Lue-Ping Zhao, Eric C. Holland

Chapter 1. Introduction

Introduction to Discrimination in Microarray Data Analysis

RNA- seq Introduc1on. Promises and pi7alls

From mirna regulation to mirna - TF co-regulation: computational

Integrated analysis of mirna/mrna expression and gene methylation using sparse canonical correlation analysis.

R2: web-based genomics analysis and visualization platform

Systematic exploration of autonomous modules in noisy microrna-target networks for testing the generality of the cerna hypothesis

Network-based biomarkers enhance classical approaches to prognostic gene expression signatures

Risk-prediction modelling in cancer with multiple genomic data sets: a Bayesian variable selection approach

Profiles of gene expression & diagnosis/prognosis of cancer. MCs in Advanced Genetics Ainoa Planas Riverola

Comparison of Gene Set Analysis with Various Score Transformations to Test the Significance of Sets of Genes

Identifying Relevant micrornas in Bladder Cancer using Multi-Task Learning

a) List of KMTs targeted in the shrna screen. The official symbol, KMT designation,

Supplementary information for: Human micrornas co-silence in well-separated groups and have different essentialities

New Enhancements: GWAS Workflows with SVS

ncounter Assay Automated Process Immobilize and align reporter for image collecting and barcode counting ncounter Prep Station

Prediction and Inference under Competing Risks in High Dimension - An EHR Demonstration Project for Prostate Cancer

Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics, 2010

IPA Advanced Training Course

Case Studies on High Throughput Gene Expression Data Kun Huang, PhD Raghu Machiraju, PhD

Feature Vector Denoising with Prior Network Structures. (with Y. Fan, L. Raphael) NESS 2015, University of Connecticut

Aspects of Statistical Modelling & Data Analysis in Gene Expression Genomics. Mike West Duke University

Comparison of segmentation methods in cancer samples

Biostatistics II

Can melanoma treatment be guided by a panel of predictive and prognostic microrna Biomarkers?

Predictive statistical modelling approach to estimating TB burden. Sandra Alba, Ente Rood, Masja Straetemans and Mirjam Bakker

List of Figures. List of Tables. Preface to the Second Edition. Preface to the First Edition

Set the stage: Genomics technology. Jos Kleinjans Dept of Toxicogenomics Maastricht University, the Netherlands

RNA-seq Introduction

On the Reproducibility of TCGA Ovarian Cancer MicroRNA Profiles

Paternal exposure and effects on microrna and mrna expression in developing embryo. Department of Chemical and Radiation Nur Duale

Expanded View Figures

SUPPLEMENTARY FIGURES: Supplementary Figure 1

Certificate Courses in Biostatistics

Understanding DNA Copy Number Data

Data Mining in Bioinformatics Day 7: Clustering in Bioinformatics

Structured Association Advanced Topics in Computa8onal Genomics

RISK PREDICTION MODEL: PENALIZED REGRESSIONS

Gene Ontology 2 Function/Pathway Enrichment. Biol4559 Thurs, April 12, 2018 Bill Pearson Pinn 6-057

Integrative Omics for The Systems Biology of Complex Phenotypes

A Network Partition Algorithm for Mining Gene Functional Modules of Colon Cancer from DNA Microarray Data

microrna PCR System (Exiqon), following the manufacturer s instructions. In brief, 10ng of

Analysis of Rheumatoid Arthritis Data using Logistic Regression and Penalized Approach

Computer Age Statistical Inference. Algorithms, Evidence, and Data Science. BRADLEY EFRON Stanford University, California

Discovery of Novel Human Gene Regulatory Modules from Gene Co-expression and

Analysis of Massively Parallel Sequencing Data Application of Illumina Sequencing to the Genetics of Human Cancers

Part [2.1]: Evaluation of Markers for Treatment Selection Linking Clinical and Statistical Goals

Breast cancer. Risk factors you cannot change include: Treatment Plan Selection. Inferring Transcriptional Module from Breast Cancer Profile Data

Bayesian Inference for Single-cell ClUstering and ImpuTing (BISCUIT) Elham Azizi

Nature Neuroscience: doi: /nn Supplementary Figure 1

Big Image-Omics Data Analytics for Clinical Outcome Prediction

Screening for novel oncology biomarker panels using both DNA and protein microarrays. John Anson, PhD VP Biomarker Discovery

genomics for systems biology / ISB2020 RNA sequencing (RNA-seq)

ncounter Assay Automated Process Capture & Reporter Probes Bind reporter to surface Remove excess reporters Hybridize CodeSet to RNA

Nature Methods: doi: /nmeth.3115

Computer Models for Medical Diagnosis and Prognostication

Index. E Eftekbar, B., 152, 164 Eigenvectors, 6, 171 Elastic net regression, 6 discretization, 28 regularization, 42, 44, 46 Exponential modeling, 135

Identification, Review, and Systematic Cross-Validation of microrna Prognostic Signatures in Metastatic Melanoma

Gene expression analysis. Roadmap. Microarray technology: how it work Applications: what can we do with it Preprocessing: Classification Clustering

7SK ChIRP-seq is specifically RNA dependent and conserved between mice and humans.

Small RNAs and how to analyze them using sequencing

Integrative analysis of survival-associated gene sets in breast cancer

The 16th KJC Bioinformatics Symposium Integrative analysis identifies potential DNA methylation biomarkers for pan-cancer diagnosis and prognosis

Inferring Biological Meaning from Cap Analysis Gene Expression Data

Machine Learning to Inform Breast Cancer Post-Recovery Surveillance

Introduction to Systems Biology of Cancer Lecture 2

University of Pittsburgh Cancer Institute UPMC CancerCenter. Uma Chandran, MSIS, PhD /21/13

Colon cancer subtypes from gene expression data

A Vision-based Affective Computing System. Jieyu Zhao Ningbo University, China

Simple, rapid, and reliable RNA sequencing

Nature Genetics: doi: /ng Supplementary Figure 1. SEER data for male and female cancer incidence from

The value of Omics to chemical risk assessment

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Midterm, 2016

Biomarker Identification in Breast Cancer: Beta-Adrenergic Receptor Signaling and Pathways to Therapeutic Response

Statistical Assessment of the Global Regulatory Role of Histone. Acetylation in Saccharomyces cerevisiae. (Support Information)

Response and resistance in melanoma. Helen Rizos

Impact of Prognostic Factors

bivariate analysis: The statistical analysis of the relationship between two variables.

From reference genes to global mean normalization

RASA: Robust Alternative Splicing Analysis for Human Transcriptome Arrays

저작권법에따른이용자의권리는위의내용에의하여영향을받지않습니다.

Computational Analysis of UHT Sequences Histone modifications, CAGE, RNA-Seq

Ecological Statistics

GENE EXPRESSION AND microrna PROFILING IN LYMPHOMAS. AN INTEGRATIVE GENOMICS ANALYSIS

Identifying Thyroid Carcinoma Subtypes and Outcomes through Gene Expression Data Kun-Hsing Yu, Wei Wang, Chung-Yu Wang

White Paper Estimating Complex Phenotype Prevalence Using Predictive Models

Using Bayesian Networks to Analyze Expression Data. Xu Siwei, s Muhammad Ali Faisal, s Tejal Joshi, s

Transcription:

Integration of high-throughput biological data Jean Yang and Vivek Jayaswal School of Mathematics and Statistics University of Sydney Meeting the Challenges of High Dimension: Statistical Methodology, Theory and Applications 1-12 Oct, 2012

Gene Ontology SNP Clinical Data *! *! *! *! *! mirna data Protein: itraq Integrating biological data GeneChip Affymetrix Expression Literature Array CGH DNA copy number cdna / long oligo microarray Expression

omics data SNP data DNA- seq Exome capture genome DNA (A) Association Microarray RNA- seq microrna Quan<ta<ve data e.g. itraq, SILAC. transcriptome proteome mrna Protein Pathways Gene Ontology Transcrip<on factor other meta data Clinical data (B) Prediction phenome Phenotype Central dogma of molecular biology

omics data SNP data DNA- seq Exome capture genome DNA (A) Association Microarray RNA- seq microrna Quan<ta<ve data e.g. itraq, SILAC. transcriptome proteome mrna Protein Pathways Gene Ontology Transcrip<on factor other meta data Clinical data (B) Prediction phenome Phenotype Central dogma of molecular biology

Technologies mrna MicroRNA *! *! *! *! *! microarray! Next-gen Sequencing! Variable (5000-30000 genes) or (2000 mirnas)! N samples! Expression! data! Count! data!

Role or mirna Comprises ~1% of genes in animals (~300+ mirnas in human) ~20-30% of human genes are mirna targets: each mirna targets ~200 to 400 genes. mirna molecules target mrna (gene) transcript along short target sites via compelementary base-pairing.

Case study: Stage III Melanoma Melanomas A skin cancer more prevaleant in Caucasians Most common in sunny climates (often correlated with latitude) Metastasise (Stage III) About 40% die within a year 20% die between 1 and 4 years 40% are cleared of the cancer (after 4 years) Samples (n=79) were obtained from A/Professor Graham Mann's group from the Westmead Institute for Cancer Research and Melanoma Institute, Australia.

Clinical data Survival Time in Years Two survival groups Number of individuals 0 10 20 30 40 0 1 2 3 4 Bad prognosis: Survival < 1 year and died due to melanoma Good prognosis: Survival 4 years with no sign of relapse Years of Survival

Stage III Melanoma data set Gene and microrna expression data obtained for 45 matched samples 23 good prognosis (Alive NSR > 4 yrs) 22 bad prognosis (Dead melanoma < 1 yr) Gene expression data microrna expression data Classifier microrna-gene modules DE genes DE mirnas

Data Stage III Melanoma patients Clinical 48 x 21 Gene expression 47 x 26085 mirna 45 x 1347

External data : target prediction algorithms Several computational mirna-target prediction algorithms have been developed e.g. TargetScan, PicTar, microcosm (based on miranda), and TargetMiner Large variations in results obtained using different algorithms microcosm Low sensitivity values Most widely used approach combines the results from multiple target prediction algorithms TargetScan

Data Stage III Melanoma patients External database mirna - mrna Gene expression 47 x 26085 Focusing on gene expression data only

Simple approach: gene set test Test the hypothesis that a set of genes (rather than a single gene) is differen<ally expressed Current implementa<ons in R e.g. GSEA, PGSEA, and GSA packages do not allow for tes<ng of mul<ple condi<ons simultaneously. For example, cannot test a <mecourse gene expression data for iden<fying regulatory mirs Define the gene set as microrna targetss

Functional enrichment DE in A vs B or other statistics Functional Gene set S Is a functional gene set S (eg: target genes of mir-196) overrepresented among DE genes?

Functional enrichment DE in A vs B or other statistics Functional Gene set S Is a functional gene set S (eg: target genes of mir-196) overrepresented among DE genes? Fisher s exact test DE! Not DE! Target! 15! 200! Not Target! 100! 10000!

Functional enrichment DE in A vs B or other statistics Functional Gene set S Is a functional gene set S (eg: target genes of mir-196) overrepresented among DE genes? Fisher s exact test DE! Not DE! Target! 15! 200! Not Target! 100! 10000! Is a functional gene set S differentially expressed? Wilcoxon Rank Sum test

Single platform gene set approaches Cut-off specified No cut-off specified Self-contained Competitive Univariate Multivariate Univariate Multivariate Over-representation Test Hypergeometric Test Binomial Test of Proportions ROAST Hotelling s T 2 N-statistic Wilcoxon Rank Sum Test Gene Set Enrichment Analysis PAGE None found Logistic Regression CAMERA

Data Stage III Melanoma patients External database mirna - mrna Gene expression 47 x 26085 Limitations: Only able to Identify interesting mirnas one at a time.

Data Stage III Melanoma patients External database mirna - mrna Gene expression 47 x 26085 mirna 45 x 1347 Questions Association? Network? Regulation? Can we look for relationship between groups / clusters / modules?

mirna-gene clique identification mirna clusters Gene clusters Step I Step II Interested in the statistical significance of association between a mirna cluster and gene cluster even if every mirna-gene pair does not have a significant association

Step I: Selection of clusters Expression Data : X [M x N ] Mapping matrix, M mirna - mrna [M x G] Aimed to group genes based on expression profiles and mapping information. Unguided clustering e.g. PAM Guided clustering: Multivariate Random Forest (MRF) Unguided clustering e.g. PAM Evaluation of two clustering methods Xiao and Segal, PLoS Comp Bio, 09 Significant clusters MRF: Xiao and Segal, PLoS Comp Bio, 09

Unguided clustering Based on mirna-mrna mapping data Generate a matrix of N genes and M mirnas such that an element (i, j) of the matrix takes the value 1 if mirna I is predicted to target mrna j and 0, otherwise Partition around medoids algorithm Obtain the pairwise distances between genes Assign each gene to a cluster such that the distance between the genes within a cluster is minimized Identifies mirnas that share similar targets or genes that are co-regulated

Guided clustering Uses both expression data and mirna-mrna mapping information More likely to find genes that are grouped together because share similar targets (or regulators) similar expression profiles Based on multivariate random forest (MRF)

Identifying gene clusters using MRF Obtain the proximity matrix such that 0 ρ(i, j) 1 Obtain dissimilarity matrix = 1 proximity matrix Employ PAM to obtain gene clusters But there are questions How many clusters are there? How to evaluate whether MRF-based guided clustering is better than unguided clustering?

Guided vs. unguided clustering Permutation test For a given cluster, obtain a measure of cluster tightness (θ) Generate a random cluster of the same size as before and estimate cluster tightness, θ* Obtain P(θ* θ)

Step 2: Identify significant association? microrna mrna A mirmr pair is considered to have an association if mapping matrix has the value 1. evidence of a relationship between mirna expression and mrna expression based on a linear model u = α + βv, change in mrna expression change in mirna expression Testing H0: β = 0 vs. H1: β 0 using a t-test.

Methods for measuring association Conditions Methods Multiple conditions (N is large) Correlation Pearson s correlation Canonical correlation Bayesian model GenMir++ Two conditions unbalanced samples say case (disease) vs. control (healthy) Linear regression models 5-10 conditions e.g. a short timeseries dataset or different subtypes of MM Odds-ratio method

Association between mirna-mrna pair To determine the cluster pairs or modules of interest, we perform a permutation test Identify the number of mirnamrna pairs with statistically significant association, η Obtain this number for a randomly selected mrna cluster, η* I J1 J2 J3 J4 J5 Obtain P(η* η) I J1*

microrna-gene modules Back to our data Input for MRF based analysis 935 unique genes from a total of 15,568 unique genes 158 micrornas from a total of 1,347 Genes and micrornas were selected based on Unadjusted p-value for a comparison of GP vs BP < 0.05 and microrna-gene pair predicted in the union of PicTar, TargetScan and MicroCosm

75 clusters 40 clusters Some examples of clusters include mir-15a, -16, -195 let-7a, -7c, -7f, mir-98 mir-26a, -26b, -410 mir-107, -25, -32 Enriched for a few KEGG pathways, e.g. RNA degradation mrna surveillance pathway

Some results for the GBM dataset

omics data SNP data DNA- seq Exome capture genome DNA (A) Association Microarray RNA- seq microrna Quan<ta<ve data e.g. itraq, SILAC. transcriptome proteome mrna Protein Pathways Gene Ontology Transcrip<on factor other meta data Clinical data (B) Prediction phenome Phenotype Central dogma of molecular biology

Data Stage III Melanoma patients External database mirna - mrna Gene expression 47 x 26085 mirna 45 x 1347 Questions Association? Network? Regulation? Can we look for relationship between groups / clusters / modules?

Data Stage III Melanoma patients Clinical 48 x 21 Gene expression 47 x 26085 mirna 45 x 1347 Questions Prediction? Finding biomarkers?

Data Stage III Melanoma patients Clinical 48 x 21 Questions Prediction? Finding biomarkers?

Missing values 48 samples and 21 variables comprising of continuous, binary and ordinal variables Cleaned data Missing Data Multiple Imputation

Draw one bootstrap sample Bootstrap Multiple Imputation (B-MI) procedure Feature selection method

Results from clinical data BMI procedure LOOCV error rate* glm with stepwise selection 39% [9] lasso 43% [4] Elastic net 35% [13] Ridge 41% [21-all] *In [x] are the number of variables selected The set of 13 clinical variables which gives lowest CV error rate is used in subsequent integration models

Data Stage III Melanoma patients Clinical 48 x 21 Gene expression 47 x 26085 mirna 45 x 1347 Questions Prediction? Finding biomarkers?

Sparse canonical correlation analysis *!*! *! *! *! Data X n x p Data Z n x q Outcome Y n x 1 Composite latent variable for X Composite latent variable for Z Sparse linear combinations Xu and Zv Maximize Correlation between Xu and Zv Parkhomenko E, Tritchler D, Beyene, J : Statist. Appl. Genet. Mol. Biol. (2009).

SCCA Data X n x p Data Z n x q SCCA Clinical + Var1 = Xu OR Var2 = Zv Predict outcome

Weighted lasso Correlation between Gene and mirna; Weight 1 =1/correlation Non-zero coefficients in a SCCA between Gene and mirna; weights 2 = 1/coef Gene & mirna weight 1= 1/ correlation weight 2= 1/ coef Weighted lasso for gene expression data Selected genes

Weighted lasso (cont) Gene data [Selected genes] + Clinical Logistic regression Final predictive variables

Results Here the preselected clinical variables + Genes + mirnas selected through integration methods are used as variables. SCCA Weighted lasso weight 1 weight 2 Direct Clinical 35% Clinical + G + M 33% 25% 23% 41%

Summary and thoughts Based on MRF, we have a framework for finding interesting mirna -mrna modules. Further work is needed for developing better association measures under different conditions. Still very challenging to find approaches for methods evaluation?? Methods for integrating high-throughput data largely depends on the question of interest. Further research in finding robust and stable models utilizing both external and internal information. Biological interpretation.

Acknowledgements University of Sydney, School of mathematics and statistics Samuel Müller John Robinson John Ormerod Bioinformatics research group Anna Campain Vivek Jayaswal St Vincent Centre for Applied Biomedical Research David Ma Mark Lutherborrow Adam Bryant. Melanoma program at MIA/WMI Genomics Team Gulietta Pupo Svetlana Pianova Varsha Tembe Sarah-Jane Schramm Graham Mann Melanoma Program at MIA/RPA Tumour bank and pathology Cadance Carter Chitra De Silva Ken Lai James Wilmott Richard Scolyer Supported in part by ARC grant DP0770395 (YHY, AC, VJ)