DeconRNASeq: A Statistical Framework for Deconvolution of Heterogeneous Tissue Samples Based on mRNA-Seq Data


Ting Gong, Joseph D. Szustakowski
April 30, 2018

1 Introduction

Heterogeneous tissues (e.g. blood, tumor) are frequently collected from humans or model animals. mRNA-Seq samples are therefore often heterogeneous with regard to cell type, which makes it difficult to distinguish whether gene expression variation reflects a shift in cell populations, a change in cell-type-specific expression, or both ([1]).

In this vignette, we present an efficient pipeline and methodology: DeconRNASeq, an R package for deconvolution of heterogeneous tissues based on mRNA-Seq data. It adopts a globally optimized non-negative decomposition algorithm, solved through quadratic programming, to estimate the mixing proportions of distinct tissue types in next-generation sequencing data. We demonstrate the feasibility and validity of DeconRNASeq across a range of mixing levels and sources using mRNA-Seq data mixed in silico at known concentrations. This vignette walks through the workflow of the DeconRNASeq package.

The tool processes sequencing data to assess the performance of linear models and to estimate mixing fractions accurately for multiple tissue or cell types. It can provide accurate estimates even for relatively rare cell types (mixing fractions <= 0.02). We applied our approach to realistic simulations involving complex mixtures of multiple tissues derived from an appropriate experimental design.

2 File Structure Requirements

A single study analysis requires at least two inputs:

datasets  An m-by-n matrix of expression values, where m is the number of genes (or probes) under consideration and n is the total number of mixed mRNA-Seq samples, each composed of multiple cell or tissue types. These expression values should be normalized in some manner; the choice of normalization method is left to the user. Note that expression values should generally not be on the log2 scale, because a log transformation may break the linear model assumed in mRNA-Seq expression deconvolution.

signatures  An m-by-n matrix of expression values, where m is the number of cell- or tissue-type-specific genes (or probes) and n is the total number of cell or tissue types. These expression values should be normalized in the same manner as datasets. Again, we do not recommend applying a log transformation.

Our demo uses a simulated example data set, which can be accessed using the code given below.

> library(DeconRNASeq)
> ## multi_tissue: expression profiles for 10 mixing samples from
> ## multiple tissues
> data(multi_tissue)
> datasets <- x.data[,2:11]
> signatures <- x.signature.filtered.optimal[,2:6]
> proportions <- fraction

The mixtures contain 28745 genes across 10 samples. The in silico mixed data were simulated from the data of ([2]), with disparate proportions drawn from random numbers. The mixing proportions used for each tissue type are shown below. Note that we also investigated the influence of extremely low fractions of contaminating cell types (<2 percent).

> proportions
               brain muscle   lung  liver  heart
reads.1.rpkm  0.0463 0.0323 0.0805 0.0747 0.7662
reads.2.rpkm  0.0606 0.1156 0.0278 0.6960 0.1000
reads.3.rpkm  0.0728 0.6058 0.1051 0.1262 0.0900
reads.4.rpkm  0.0709 0.0887 0.7242 0.0975 0.0188
reads.5.rpkm  0.6672 0.1347 0.0486 0.0674 0.0821
reads.6.rpkm  0.1368 0.2181 0.1764 0.3678 0.1010
reads.7.rpkm  0.0780 0.1100 0.2800 0.1603 0.3717
reads.8.rpkm  0.1250 0.3997 0.1830 0.1198 0.1726
reads.9.rpkm  0.2309 0.1230 0.5723 0.0214 0.0524
reads.10.rpkm 0.4284 0.3242 0.0913 0.0644 0.0917

We adopted mRNA-Seq data from the Illumina BodyMap 2.0 (GSE30611) as a training data set to define tissue-specific signatures for different human tissues (adrenal gland, adipose, brain, breast, colon, heart, kidney, liver, lung, lymph node, ovary, prostate, skeletal muscle, testes, thyroid, and white blood cells).
We selected the tissues that overlap with the mixed data and generated tissue-specific transcriptional profiles. We assessed potential expression signatures via the methods described in ([3]) and used them as the basis matrix for deconvolution. In this case study, we conjecture that genes with extremely small or large read counts somehow violate our assumptions of linearity or are otherwise unreliable. Therefore, we removed the genes with RPKM less than 200 within any of the five tissues from our gene signatures. As a result, the tissue-type gene signatures used to deconvolute the data comprise 1570 genes across the five tissues.
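The RPKM filter described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the toy table, gene names, and variable names are hypothetical, and the rule is implemented as a literal reading of the text (a gene is dropped if it falls below the cutoff in any tissue), which may differ from how the shipped signature table was actually built.

```r
# Toy signature table: rows are genes, columns are tissues (values in RPKM).
# All names and numbers here are made up for illustration.
toy_signatures <- data.frame(
  brain  = c(950,  20, 400, 1200),
  muscle = c(300, 800, 150,  260),
  lung   = c(220, 510, 900,  340),
  row.names = c("geneA", "geneB", "geneC", "geneD")
)

rpkm_cutoff <- 200  # threshold quoted in the text

# Literal reading of the rule: keep a gene only if it reaches the cutoff
# in every tissue, i.e. drop it if any tissue is below the cutoff.
keep <- apply(toy_signatures >= rpkm_cutoff, 1, all)
filtered_signatures <- toy_signatures[keep, , drop = FALSE]

print(rownames(filtered_signatures))
```

In this toy table, geneB and geneC each fall below 200 RPKM in one tissue, so only geneA and geneD remain.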

> signatures <- x.signature.filtered.optimal[,2:6]
> attributes(signatures)[c(1,2)]
$names
[1] "brain"  "muscle" "lung"   "liver"  "heart"

$class
[1] "data.frame"

3 Deconvolution Analysis

With the inputs prepared as in the previous section, we can perform the deconvolution analysis as follows.

> DeconRNASeq(datasets, signatures, proportions, checksig=FALSE,
+             known.prop = TRUE, use.scale = TRUE, fig = TRUE)
svd calculated PCA
Importance of component(s):
                 PC1    PC2     PC3    PC4     PC5
R2            0.8388 0.1155 0.02277 0.0177 0.00514
Cumulative R2 0.8388 0.9544 0.97714 0.9948 0.99998

Attention: the number of pure cell types = 5 defined in the signature matrix;
PCA results indicate that the number of cell types in the mixtures = 4
$out.all
           brain     muscle       lung      liver
 [1,] 0.04684774 0.00000000 0.09392811 0.09471810
 [2,] 0.07181238 0.11724299 0.08290500 0.61870050
 [3,] 0.06846447 0.60805500 0.12654897 0.13545429
 [4,] 0.04241299 0.04891053 0.61470260 0.18971422
 [5,] 0.64705933 0.13437755 0.07087937 0.08121135
 [6,] 0.11852187 0.19122874 0.23043088 0.35351462
 [7,] 0.04647516 0.09320312 0.32264865 0.18147697
 [8,] 0.09992317 0.34888971 0.22163452 0.13292569
 [9,] 0.16108764 0.06903200 0.54792347 0.11886160
[10,] 0.43684023 0.31901296 0.10717750 0.07104230
           heart
 [1,] 0.76450606
 [2,] 0.10933912
 [3,] 0.06147727
 [4,] 0.10425967
 [5,] 0.06647239
 [6,] 0.10630389
 [7,] 0.35619610
 [8,] 0.19662690
 [9,] 0.10309529
[10,] 0.06592701

$out.pca
                  PC1     PC2     PC3     PC4     PC5
R2            0.83885 0.11552 0.02277 0.01770 0.00514
Cumulative R2 0.83885 0.95437 0.97714 0.99484 0.99998

$out.rmse
[1] 0.04380197

Figure: scatter plots of estimated versus actual mixing proportions, one panel each for brain, muscle, lung, liver, and heart, with the per-tissue RMSE printed in each panel.

The system we describe here assumes that all relevant cell types are accounted for in the cell-specific expression matrix. In reality, this might not always be the case, as complex tissue samples might include unexpected contaminants or rare cell populations that have not previously been characterized via expression profiling. Several studies have reported that identifying the correct number of transcriptional source signals in complex samples is very challenging. One common approach is to use principal component analysis (PCA) to estimate the number of sources based on their cumulative variance contributions. Our package includes procedures that help the user identify the appropriate number of sources, guided by PCA. When the number of pure cell or tissue types defined in the expression signature matrix is inconsistent with the PCA estimate, the package issues a notification. However, the decision about which signal components make up the mixtures is left to the user.
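The PCA-based check described above can be illustrated on simulated data. The following is a minimal base R sketch, not the package's internal code: it uses stats::prcomp rather than the package's own PCA routine, and the 99% cumulative-variance cutoff, the variable names, and the toy data are all assumptions made for illustration.

```r
set.seed(1)

n_genes   <- 200
n_sources <- 3
n_samples <- 8

# Hypothetical pure-tissue profiles (genes x sources), on the raw scale.
pure <- matrix(rexp(n_genes * n_sources, rate = 1 / 500),
               nrow = n_genes, ncol = n_sources)

# Random mixing fractions (sources x samples); each column sums to 1.
frac <- matrix(runif(n_sources * n_samples), nrow = n_sources)
frac <- sweep(frac, 2, colSums(frac), "/")

# Noise-free in-silico mixtures (genes x samples).
mixtures <- pure %*% frac

# PCA with genes as observations and samples as variables.
pca <- prcomp(mixtures, center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Assumed rule: smallest number of components whose cumulative
# variance reaches the threshold.
threshold <- 0.99
n_estimated <- which(cum_var >= threshold)[1]

print(round(cum_var, 4))
print(n_estimated)
```

Because the toy mixtures are noise-free and built from three sources, the first three components carry essentially all of the variance. With real data, noise and correlated tissues blur this boundary, which is consistent with leaving the final choice to the user.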

The output out.all contains the estimated mixing fractions of the multiple sources in each mixed sample. The output out.rmse gives the mean RMSE (root mean square error) over all estimated tissue proportions, and is available when the true proportions are known. If fig is TRUE, scatter plots of estimated tissue proportions (y axis) versus actual tissue proportions (x axis) are also generated.

4 Condition Number of the Expression Signatures

A basis matrix whose genes together form a complete but parsimonious set of robust markers for the tissue types of interest is crucial to the success of deconvolution on mRNA-Seq data. Therefore, if the mixing proportions are known, a series of candidate matrices built from different numbers of the most differentially expressed genes can be tested by comparing the results from each matrix to the known mixture fractions. The condition number of a matrix estimates the sensitivity of a system of linear equations to errors in the data, so a low condition number indicates a stable matrix. We therefore also provide a plot of the condition number versus the number of genes taken from the gene signature in the deconvolution experiments. An example is obtained by re-running the same experiment with the parameter checksig set to TRUE.

> DeconRNASeq(datasets, signatures, proportions, checksig=TRUE,
+             known.prop = TRUE, use.scale = TRUE, fig = TRUE)
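The idea behind the condition-number check can be reproduced in base R without the package. The sketch below is illustrative only: the signature matrix is random toy data, the helper cond_number and the gene-count grid are hypothetical, and it assumes the rows are already ranked so that the first g rows form the g-gene signature. How the package itself orders and subsets genes is not shown in this vignette.

```r
set.seed(2)

n_genes   <- 300
n_tissues <- 5

# Hypothetical signature matrix (genes x tissues); rows are assumed to be
# already ranked from most to least informative.
signature <- matrix(rexp(n_genes * n_tissues, rate = 1 / 300),
                    nrow = n_genes, ncol = n_tissues)

# 2-norm condition number: ratio of largest to smallest singular value.
cond_number <- function(m) {
  d <- svd(m, nu = 0, nv = 0)$d
  max(d) / min(d)
}

# Evaluate the condition number for growing gene subsets.
gene_counts <- seq(20, n_genes, by = 40)
cond_values <- vapply(
  gene_counts,
  function(g) cond_number(signature[seq_len(g), , drop = FALSE]),
  numeric(1)
)

print(data.frame(genes = gene_counts, condition_number = cond_values))
```

A signature size whose condition number is low suggests a basis matrix that is less sensitive to noise in the expression values, which is the property the plot is meant to help assess.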

Figure: condition number of the signature matrix (y axis) versus number of genes in the signature (x axis).

References

[1] Kuhn, A., et al. Population-specific expression analysis (PSEA) reveals molecular changes in diseased brain. Nat Meth 8, 945-947.

[2] Pan, Q., et al. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat Genet 40, 1413-1415.

[3] Gong, T. Optimal Deconvolution of Transcriptional Profiling Data Using Quadratic Programming with Application to Complex Clinical Blood Samples. PLoS One 6, e27156.