The LiquidAssociation Package

Similar documents
False Discovery Rates and Copy Number Variation. Bradley Efron and Nancy Zhang Stanford University

MEA DISCUSSION PAPERS

Memorial Sloan-Kettering Cancer Center

MLE #8. Econ 674. Purdue University. Justin L. Tobias (Purdue) MLE #8 1 / 20

AIMS: Absolute Assignment of Breast Cancer Intrinsic Molecular Subtype

Stat Wk 9: Hypothesis Tests and Analysis

Vega: Variational Segmentation for Copy Number Detection

MethylMix An R package for identifying DNA methylation driven genes

PSYCH-GA.2211/NEURL-GA.2201 Fall 2016 Mathematical Tools for Cognitive and Neural Science. Homework 5

User Guide. Association analysis. Input

R documentation. of GSCA/man/GSCA-package.Rd etc. June 8, GSCA-package. LungCancer metadi... 3 plotmnw... 5 plotnw... 6 singledc...

A Quick-Start Guide for rseqdiff

metaseq: Meta-analysis of RNA-seq count data

Generalized Estimating Equations for Depression Dose Regimes

A Cross-Study Comparison of Gene Expression Studies for the Molecular Classification of Lung Cancer

Analysis of Vaccine Effects on Post-Infection Endpoints Biostat 578A Lecture 3

Package CLL. April 19, 2018

Clinical trial design issues and options for the study of rare diseases

Binary Diagnostic Tests Paired Samples

Ulrich Mansmann IBE, Medical School, University of Munich

1. Objective: analyzing CD4 counts data using GEE marginal model and random effects model. Demonstrate the analysis using SAS and STATA.

GlobalAncova with Special Sum of Squares Decompositions

Hypothesis testing: comparig means

Introduction to antiprofiles

MODEL-BASED CLUSTERING IN GENE EXPRESSION MICROARRAYS: AN APPLICATION TO BREAST CANCER DATA

Lec 02: Estimation & Hypothesis Testing in Animal Ecology

Quality assessment of TCGA Agilent gene expression data for ovary cancer

cape: A package for the combined analysis of epistasis and pleiotropy

Missy Wittenzellner Big Brother Big Sister Project

Cancer outlier differential gene expression detection

Comments on Significance of candidate cancer genes as assessed by the CaMP score by Parmigiani et al.

Gene Selection for Tumor Classification Using Microarray Gene Expression Data

Package MSstatsTMT. February 26, Title Protein Significance Analysis in shotgun mass spectrometry-based

Sheila Barron Statistics Outreach Center 2/8/2011

AP Statistics. Semester One Review Part 1 Chapters 1-5

Package inote. June 8, 2017

LOGLINK Example #1. SUDAAN Statements and Results Illustrated. Input Data Set(s): EPIL.SAS7bdat ( Thall and Vail (1990)) Example.

Metabolomic Data Analysis with MetaboAnalyst

How To Use MiRSEA. Junwei Han. July 1, Overview 1. 2 Get the pathway-mirna correlation profile(pmset) and a weighting matrix 2

Using Bayesian Networks to Analyze Expression Data. Xu Siwei, s Muhammad Ali Faisal, s Tejal Joshi, s

Class discovery in Gene Expression Data: Characterizing Splits by Support Vector Machines

Hypothesis Testing. Richard S. Balkin, Ph.D., LPC-S, NCC

Power & Sample Size. Dr. Andrea Benedetti

4. STATA output of the analysis

Package MethPed. September 1, 2018

Inferential Statistics

EXPression ANalyzer and DisplayER

Introduction. Lecture 1. What is Statistics?

On Regression Analysis Using Bivariate Extreme Ranked Set Sampling

Package AbsFilterGSEA

Using Messina. Mark Pinese. October 13, Introduction The problem Example: Designing a colon cancer screening test...

Package inctools. January 3, 2018

A Case Study: Two-sample categorical data

splicer: An R package for classification of alternative splicing and prediction of coding potential from RNA-seq data

SCHOOL OF MATHEMATICS AND STATISTICS

Performance of Median and Least Squares Regression for Slightly Skewed Data

Supplementary Information

Content. Basic Statistics and Data Analysis for Health Researchers from Foreign Countries. Research question. Example Newly diagnosed Type 2 Diabetes

Experimental Design For Microarray Experiments. Robert Gentleman, Denise Scholtens Arden Miller, Sandrine Dudoit

Using mixture priors for robust inference: application in Bayesian dose escalation trials

Biostatistics II

The Pretest! Pretest! Pretest! Assignment (Example 2)

MOST: detecting cancer differential gene expression

Bayesian Bi-Cluster Change-Point Model for Exploring Functional Brain Dynamics

Abstract Title Page Not included in page count. Authors and Affiliations: Joe McIntyre, Harvard Graduate School of Education

ANOVA in SPSS (Practical)

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn

Example 7.2. Autocorrelation. Pilar González and Susan Orbe. Dpt. Applied Economics III (Econometrics and Statistics)

Yeast Cells Classification Machine Learning Approach to Discriminate Saccharomyces cerevisiae Yeast Cells Using Sophisticated Image Features.

6. Unusual and Influential Data

Day 11: Measures of Association and ANOVA

WDHS Curriculum Map Probability and Statistics. What is Statistics and how does it relate to you?

Regression models, R solution day7

Figure S2. Distribution of acgh probes on all ten chromosomes of the RIL M0022

One-Way Independent ANOVA

Single SNP/Gene Analysis. Typical Results of GWAS Analysis (Single SNP Approach) Typical Results of GWAS Analysis (Single SNP Approach)

RASA: Robust Alternative Splicing Analysis for Human Transcriptome Arrays

A Strategy for Identifying Putative Causes of Gene Expression Variation in Human Cancer

SubLasso:a feature selection and classification R package with a. fixed feature subset

8/28/2017. If the experiment is successful, then the model will explain more variance than it can t SS M will be greater than SS R

Math 215, Lab 7: 5/23/2007

BIOL 458 BIOMETRY Lab 7 Multi-Factor ANOVA

Clustering mass spectrometry data using order statistics

Stat Wk 8: Continuous Data

Missing data. Patrick Breheny. April 23. Introduction Missing response data Missing covariate data

Objectives. Quantifying the quality of hypothesis tests. Type I and II errors. Power of a test. Cautions about significance tests

Analytic Strategies for the OAI Data

MMI 409 Spring 2009 Final Examination Gordon Bleil. 1. Is there a difference in depression as a function of group and drug?

ABSTRACT THE INDEPENDENT MEANS T-TEST AND ALTERNATIVES SESUG Paper PO-10

simulatorapms October 25, 2011 An example of an ISI using apcomplex on simex

Examining differences between two sets of scores

About diffcoexp. Wenbin Wei, Sandeep Amberkar, Winston Hide

Logistic Regression and Bayesian Approaches in Modeling Acceptance of Male Circumcision in Pune, India

Bayesian Dose Escalation Study Design with Consideration of Late Onset Toxicity. Li Liu, Glen Laird, Lei Gao Biostatistics Sanofi

Linear Regression in SAS

Migraine Dataset. Exercise 1

Dr. Kelly Bradley Final Exam Summer {2 points} Name

Package ordibreadth. R topics documented: December 4, Type Package Title Ordinated Diet Breadth Version 1.0 Date Author James Fordyce

Definition 1: A fixed point iteration scheme to approximate the fixed point, p, of a function g, = for all n 1 given a starting approximation, p.

Chapter 9. Factorial ANOVA with Two Between-Group Factors 10/22/ Factorial ANOVA with Two Between-Group Factors

Transcription:

The LiquidAssociation Package Yen-Yi Ho October 30, 2018 1 Introduction The LiquidAssociation package provides analytical methods to study three-way interactions. It incorporates methods to examine a particular kind of three-way interaction called liquid association (LA). The term liquid association was first proposed by Li (2002). It describes the extend to which the correlation of a pair of variables depends on the value of a third variable. The term liquid was used in contrast with solid to emphasize that the association between a pair of variables changes according to the value of a third variable. Li first reported the existence of liquid association in the yeast cell cycle gene expression data set by Spellman et al. (1998). Furthermore, the liquid association phenomena could potentially be found in other kind of data as well. Building on the ground-breaking work by Li, Ho et al. (2009) extended liquid association to accommodate more intricate co-dependencencies among the 3 variables, calling the expanded statistic generalized liquid association, or GLA for short. This software package provides functions to implement the estimation of liquid association through two approaches: the direct and the model-based estimation approach. For the model-based approach, we introduce the conditional normal model (CNM) and provided a generalized estimating equations (GEE)-based estimation procedure. In addition, we provide functions to perform hypothesis testing using direct estimate (sgla) and model-based estimates (GEEb5) in this package. 2 Simple Usage Here we present a typical work flow to investigate liquid association using a gene triplet data. We start with load the R package and the example data. In this package, we use the yeast cell cycle gene expression data by Spellman (1998). The data can be obtained through package yeastcc. The annotation package for the yeast experiment org.sc.sgd.db can be obtained through Bioconductor. > library(liquidassociation) > library(yeastcc) > library(org.sc.sgd.db) > data(spycces) > lae <- spycces[,-(1:4)] > ### get rid of the NA elements > lae <- lae[apply(is.na(exprs(lae)),1,sum) < ncol(lae)*0.3,] > probname<-rownames(exprs(lae)) > genes <- c("hts1","atp1","cyt1") > genemap <- unlist(mget(genes,revmap(org.sc.sgdgenename),ifnotfound=na)) > whgene<-match(genemap,probname) After removing genes with high missing percentage, we keep 5721 genes for further analysis. We are interested in three genes: HTS1, ATP1, and CYT1 that are involved in the Yeast electron transport pathway. We would like to know whether the correlation of HTS1 and ATP1 gene can be modulated by the level of CYT1 gene. 1

> data<-t(exprs(lae[whgene,])) > esetdata<-lae[whgene,] > data<-data[!is.na(data[,1]) &!is.na(data[,2]) &!is.na(data[,3]),] > colnames(data)<-genes > str(data) num [1:66, 1:3] 0.3-0.25 0.18 0.05 0.05-0.16 0.26 0.14 0.03 0.01... - attr(*, "dimnames")=list of 2..$ : chr [1:66] "alpha_0" "alpha_7" "alpha_14" "alpha_21".....$ : chr [1:3] "HTS1" "ATP1" "CYT1" We notice that after removing missing observations, there are 66 observations left. We can use the plotgla function to examine whether the correlation of the first two genes changes according to the level of the third gene. The plotgla function produces scatter plot conditioning on the level of a third gene, X 3. We can specify which column in the data to be the third gene using the argument dim. For example, we can specify HTS1 and ATP1 gene as the first two modulated genes, and CYT1 as the third controller gene by setting dim=3. The cut argument is used to specify the number of grid points over the third controller gene. In the plotgla function, the input data can be in ExpressionSet class as well. > plotgla(data, cut=3, dim=3, pch=16, filen="glaplot", save=true) 1.8<X3<0.1 0.1<X3<1 ATP1 1.0 0.5 0.0 0.5 corr= 0.21 ATP1 0.5 0.0 0.5 corr= 0.63 0.6 0.4 0.2 0.0 0.2 0.4 0.6 HTS1 1.0 0.5 0.0 0.5 HTS1 Figure 1: Conditional distributions of (HTS1, ATP1 CYT1) according to the gene expression level of CYT1. According to Figure 1, we find the evidence for the existence of liquid association among the triplet since the correlation of HTS1, and ATP1 gene changes from -0.21 to 0.63. We can further perform estimation and hypothesis testing to quantify the strength of liquid association. As described in Li (2002) and Ho et al (2009), the calculation of the liquid association measures assumes all three variable are standardized with mean 0 and variance 1 and the third gene follows normal distribution. Hence in this package, within the GLA, LA, CNM.full, CNM.simple, getsgla, getsla functions the standardization and normalization steps are performed internally (so that the three variables are with mean 0 and variance 1, and the third variable is normally distributed through quantile normalization). We now use GLA static to robustly measure the magnitude of liquid association. The measure ranges from 2/π 0.798 to 2/π. When GLA=0, it means that the correlation of the two modulated genes does not change according to the level of the third controller gene, hence there is no evidence of LA. In addition, when GLA >0, it indicates that the correlation of the first two genes increases with increasing 2

value of the third gene and vise versa. We use the function GLA to calculate the GLA estimate for a given triplet data as follows: > LAest<-LA(data) > GLAest<-rep(0,3) > for ( dim in 1:3){ + GLAest[dim]<-GLA(data, cut=4, dim=dim) + } > LAest LA(HTS1,ATP1,CYT1) 0.5633369 > GLAest [1] 0.4162624 0.3085295 0.3219779 The data argument in these function can also be in ExpressionSet class as follows: > esetgla<-gla(esetdata, cut=4, dim=3, genemap=genemap) > esetgla GLA(HTS1,ATP1 CYT1) 0.3219779 In the above example, we calculate three GLA estimates by sequentially changing the third controller gene. In addition, the three-product-moment estimator proposed by Li (2002) can also be calculated using the LA function as shown above. In the example, we find noticeable differences between the three-prodcut-moment and GLA estimates, LAest and GLAest, respectively. As described in Ho (2009), the GLA estimator is more robust than LA in the sense that it could still correctly capture liquid association even when the marginal mean and variance depend on the third variable. The second approach to estimate liquid association is using the model-based estimator. We fit the CNM using the function CNM.full as shown below. Furthermore, the CNM model is written as: ( σ 2 where Σ = 1 ρσ 1 σ 2 ρσ 1 σ 2 σ2 2 as written below: X 3 N(µ 3, σ 2 3) X 1, X 2 X 3 N( ( µ1 µ 2 ), Σ). ). The mean vector (µ 1, µ 2 ) and variance matrix Σ depend on the level of X 3 µ 1 = β 1 X 3, µ 2 = β 2 X 3, log σ 2 1 = α 3 + β 3 X 3, log σ2 2 = α 4 + β 4 X 3, log [ 1 + ρ 1 ρ ] = α 5 + β 5 X 3. > FitCNM.full<-CNM.full(data) > FitCNM.full Model: CNM(HTS1,ATP1 CYT1) estimates san.se wald p value a3-0.01944285 0.20975976 0.008591625 9.261490e-01 3

a4-0.27887616 0.16905326 2.721295365 9.901763e-02 a5 0.87944227 0.43552192 4.077506175 4.345775e-02 b1-0.02107489 0.06517541 0.104559409 7.464253e-01 b2 0.42294929 0.07987766 28.036635027 1.190404e-07 b3 0.06814325 0.21511309 0.100348770 7.514115e-01 b4-0.18465426 0.20408239 0.818667618 3.655700e-01 b5 1.32577318 0.49759016 7.098962665 7.712858e-03 The main parameter of interest in the CNM for examining the existence of liquid association is b 5. As shown in the result above, the GEEb5 Wald test statistic is displayed in the 8th row 3rd column (GEEb5=7.1) with p value 0.008. We conclude there is statistically significant evidence that indicates the existence of liquid association for (HTS1, ATP1 CYT1). We can also perform hypothesis testing using the direct estimate approach using the getsgla function as follows: > sglaest<-getsgla(data, boots=20, perm=50, cut=4, dim=3) > sglaest sgla p value 2.981711 0.000000 > slaest<-getsla(data,boots=20, perm=50) > slaest sla p value 6.641536 0.000000 where the argument boots specify the number of bootstrap iteration for estimating the bootstrap standard error. For demonstration purpose, we set boots to 20. The perm argument specifies the number of iterations to calculate permuted p value. In addition, we can also perform hypothesis testing using sla statistic based on the three-product-moment estimator proposed by Li (2002). In real applications, sgla is likely to be more robust than sla. We draw similar conclusion using sgla test statistic and GEEb5 with p value less than 0.01. 3 Extended Examples In this section, we demonstrate an typical analysis procedure using the entire gene expression data set. In this example, we first filter out genes with small variances and keep 150 genes for further analysis. > lae<-t(exprs(lae)) > V<-apply(lae, 2, var, na.rm=true) > ibig<-v > 0.5 > sum(ibig) [1] 150 > big<-which(ibig) > bigtriplet<-lae[,big] > dim(bigtriplet) [1] 73 150 We can annotation the ORF ID in the data set to their gene names as follows: > x <- org.sc.sgdgenename > mappedgenes <- mappedkeys(x) > xx <- as.list(x[mappedgenes]) > mapid<-names(xx) 4

> orfid<-colnames(bigtriplet) > genename1<-xx[match(orfid, mapid)] > imap<-which(sapply(genename1, length)!=1) > bigtriplet<-bigtriplet[,-imap] > colnames(bigtriplet)<-genename1[-imap] After removing ORF probe set that can not be mapped to genes, a total number of 400995 possible triplet combinations that can be generated by the 135 genes. We now calculate the GLA estimates for all possible triplet combinations. Here, we demonstrate calculating GLA estimates for triplet combination #1 to #100. > num<-choose(ncol(bigtriplet),3) > pick<-t(combn(1:ncol(bigtriplet),3)) > GLAout<-matrix(0, nrow=100, 3) > for ( i in 1:100){ + dat1<-bigtriplet[,pick[i,]] + for ( dim in 1:3 ){ + GLAout[i,dim]<-GLA(dat1, cut=4, dim=dim) + } + } We would like to find triplets with large GLA estimates. For example, we can choose triplet with GLA greater than 0.2 as follows: > GLAmax<-apply(abs(GLAout),1, max) > imax<-which(glamax > 0.20) > pickmax<-pick[imax,] > GLAmax<-GLAout[imax,] > trip.order<-t(apply(t(abs(glamax)), 2, order, decreasing=true)) Furthermore, we perform hypothesis testing using the first triplet as demonstration. Based on the analysis result, we can not reject the null hypothesis that the liquid association is 0. We draw the same conclusion using GEEb5 and sgla statistics. > whtrip<-1 > data<-bigtriplet[,pickmax[whtrip,]] > data<-data[,trip.order[whtrip,]] > data<-data[!is.na(data[,1]) &!is.na(data[,2]) &!is.na(data[,3]),] > data<-apply(data,2,qqnorm2) > data<-apply(data,2, stand) > FitCNM1<-CNM.full(data) > FitCNM1 Model: CNM(SEO1,RFA1 SSA1) estimates san.se wald p value a3-0.02128987 0.1545360 0.01897957 0.890425061 a4-0.13810219 0.1666190 0.68699255 0.407189215 a5-0.32754023 0.2640596 1.53859884 0.214826486 b1 0.07713727 0.1149752 0.45011195 0.502281794 b2-0.05877725 0.1083481 0.29429007 0.587484380 b3-0.04416097 0.1449181 0.09286070 0.760571391 b4-0.49537099 0.1761766 7.90614258 0.004926721 b5-0.71803322 0.4028203 3.17735935 0.074665304 > sgla1<-getsgla(data, boots=20, perm=50, cut=4, dim=3) > sgla1 sgla p value -1.309713 0.200000 > 5

4 Reference References Ho, Y.-Y., Cope, L. M., Louis, T. A., and Parmigiani, G. (2009). Generalized liquid association. Technical report, Johns Hopkins University, Dept. of Biostatistics Working Papers. Working Paper 183. Li, K.-C. (2002). Genome-wide coexpression dynamics: theory and application. Proc Natl Acad Sci U S A 99, 16875 16880. Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D., and Futcher, B. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9, 3273 3297. 5 Session Information ˆ R version 2.8.1 (2008-12-22), x86_64-unknown-linux-gnu ˆ Locale: LC_CTYPE=en_US.iso885915;LC_NUMERIC=C;LC_TIME=en_US.iso885915;LC_COLLATE=en_US.iso885915;LC_M ˆ Base packages: base, datasets, graphics, grdevices, methods, stats, tools, utils ˆ Other packages: AnnotationDbi 1.4.2, Biobase 2.2.2, DBI 0.2-4, geepack 1.0-13, LiquidAssociation 1.0.4, org.sc.sgd.db 2.2.6, RSQLite 0.7-1, yeastcc 1.2.6 6