Experimental Design For Microarray Experiments. Robert Gentleman, Denise Scholtens Arden Miller, Sandrine Dudoit

Similar documents
MBios 478: Systems Biology and Bayesian Networks, 27 [Dr. Wyrick] Slide #1. Lecture 27: Systems Biology and Bayesian Networks

Supplement to SCnorm: robust normalization of single-cell RNA-seq data

10 Intraclass Correlations under the Mixed Factorial Design

Confidence Intervals On Subsets May Be Misleading

The Pretest! Pretest! Pretest! Assignment (Example 2)

Complexity DNA. Genome RNA. Transcriptome. Protein. Proteome. Metabolites. Metabolome

Simple Linear Regression the model, estimation and testing

Multiple Regression Analysis

CHAPTER OBJECTIVES - STUDENTS SHOULD BE ABLE TO:

Problem #1 Neurological signs and symptoms of ciguatera poisoning as the start of treatment and 2.5 hours after treatment with mannitol.

Hypothesis Testing. Richard S. Balkin, Ph.D., LPC-S, NCC

Introduction to Discrimination in Microarray Data Analysis

26:010:557 / 26:620:557 Social Science Research Methods

Clinical trial design issues and options for the study of rare diseases

THE USE OF MULTIVARIATE ANALYSIS IN DEVELOPMENT THEORY: A CRITIQUE OF THE APPROACH ADOPTED BY ADELMAN AND MORRIS A. C. RAYNER

EC352 Econometric Methods: Week 07

AP Statistics. Semester One Review Part 1 Chapters 1-5

Gene expression analysis. Roadmap. Microarray technology: how it work Applications: what can we do with it Preprocessing: Classification Clustering

Pitfalls in Linear Regression Analysis

Factorial Analysis of Variance

Lec 02: Estimation & Hypothesis Testing in Animal Ecology

Stepwise method Modern Model Selection Methods Quantile-Quantile plot and tests for normality

Breast cancer. Risk factors you cannot change include: Treatment Plan Selection. Inferring Transcriptional Module from Breast Cancer Profile Data

Dr. Kelly Bradley Final Exam Summer {2 points} Name

South Australian Research and Development Institute. Positive lot sampling for E. coli O157

HPS301 Exam Notes- Contents

Computational Capacity and Statistical Inference: A Never Ending Interaction. Finbarr Sloane EHR/DRL

Biostatistical modelling in genomics for clinical cancer studies

Part I. Boolean modelling exercises

CHAPTER 4 Designing Studies

Section 3.2 Least-Squares Regression

Processing of RNA II Biochemistry 302. February 14, 2005 Bob Kelm

Performance of Median and Least Squares Regression for Slightly Skewed Data

Problem Set 5 KEY

Cellular Communication

Multiple Treatments on the Same Experimental Unit. Lukas Meier (most material based on lecture notes and slides from H.R. Roth)

10.1 Estimating with Confidence. Chapter 10 Introduction to Inference

Running head: INDIVIDUAL DIFFERENCES 1. Why to treat subjects as fixed effects. James S. Adelman. University of Warwick.

Mark J. Anderson, Patrick J. Whitcomb Stat-Ease, Inc., Minneapolis, MN USA

Processing of RNA II Biochemistry 302. February 13, 2006

Genetic Analysis of Anxiety Related Behaviors by Gene Chip and In situ Hybridization of the Hippocampus and Amygdala of C57BL/6J and AJ Mice Brains

Analysis of Massively Parallel Sequencing Data Application of Illumina Sequencing to the Genetics of Human Cancers

Statistical Assessment of the Global Regulatory Role of Histone. Acetylation in Saccharomyces cerevisiae. (Support Information)

For more information about how to cite these materials visit

Experimental Methods. Anna Fahlgren, Phd Associate professor in Experimental Orthopaedics

Inference of Isoforms from Short Sequence Reads

High-Throughput Sequencing Course

9 research designs likely for PSYC 2100

Processing of RNA II Biochemistry 302. February 18, 2004 Bob Kelm

Multiple Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University

The value of Omics to chemical risk assessment

3 CONCEPTUAL FOUNDATIONS OF STATISTICS

Cancer and Tyrosine Kinase Inhibition

RNA-seq Introduction

Chapter 1: Exploring Data

Raymond Auerbach PhD Candidate, Yale University Gerstein and Snyder Labs August 30, 2012

Statistical Methods and Reasoning for the Clinical Sciences

Getting started with Eviews 9 (Volume IV)

Using Bayesian Networks to Analyze Expression Data. Xu Siwei, s Muhammad Ali Faisal, s Tejal Joshi, s

Mechanism of splicing

Prokaryotes and eukaryotes alter gene expression in response to their changing environment

WELCOME! Lecture 11 Thommy Perlinger

Application of Resampling Methods in Microarray Data Analysis

Identification of Tissue Independent Cancer Driver Genes

Formulating and Evaluating Interaction Effects

Diurnal Pattern of Reaction Time: Statistical analysis

Interpretation of Data and Statistical Fallacies

Statistical reports Regression, 2010

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES

Citation for published version (APA): Ebbes, P. (2004). Latent instrumental variables: a new approach to solve for endogeneity s.n.

A. Incorrect! It s not correct. Synergism of cytokines refers to two or more cytokines acting together.

Experimental Studies. Statistical techniques for Experimental Data. Experimental Designs can be grouped. Experimental Designs can be grouped

Summer II Class meets intermittently when I will give instructions, as we met in Maymester 2018.

Transcriptional control in Eukaryotes: (chapter 13 pp276) Chromatin structure affects gene expression. Chromatin Array of nuc

Applied Analysis of Variance and Experimental Design. Lukas Meier, Seminar für Statistik

Cancer Biology How a cell responds to DNA Damage

MEA DISCUSSION PAPERS

Exploration and Exploitation in Reinforcement Learning

Pancreatic Cancer Research and HMGB1 Signaling Pathway

SUPPLEMENTARY FIGURES: Supplementary Figure 1

Understandable Statistics

Exercises: Differential Methylation

p53 cooperates with DNA methylation and a suicidal interferon response to maintain epigenetic silencing of repeats and noncoding RNAs

A Statistical Framework for Classification of Tumor Type from microrna Data

Page 2. When sirna binds to mrna, name the complementary base pairs holding the sirna and mrna together. One of the bases is named for you. ...with...

Comparative Neuroanatomy (CNA) Evaluation Post-Test

Content. Basic Statistics and Data Analysis for Health Researchers from Foreign Countries. Research question. Example Newly diagnosed Type 2 Diabetes

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Midterm, 2016

A Case Study: Two-sample categorical data

Is Cognitive Science Special? In what way is it special? Cognitive science is a delicate mixture of the obvious and the incredible

MMI 409 Spring 2009 Final Examination Gordon Bleil. 1. Is there a difference in depression as a function of group and drug?

Estimating Heterogeneous Choice Models with Stata

Regulation. 1. Short term control 8-1

SPRING GROVE AREA SCHOOL DISTRICT. Course Description. Instructional Strategies, Learning Practices, Activities, and Experiences.

CHAPTER - 6 STATISTICAL ANALYSIS. This chapter discusses inferential statistics, which use sample data to

Chapter 11: Advanced Remedial Measures. Weighted Least Squares (WLS)

Evidence of a Pathway of Reduction in Bacteria: Reduced Quantities of Restriction Sites Impact trna Activity in a Trial Set

Introduction of Genome wide Complex Trait Analysis (GCTA) Presenter: Yue Ming Chen Location: Stat Gen Workshop Date: 6/7/2013

Regression Equation. November 29, S10.3_3 Regression. Key Concept. Chapter 10 Correlation and Regression. Definitions

The effects of multimedia learning on children with different special education needs

Transcription:

Experimental Design For Microarray Experiments Robert Gentleman, Denise Scholtens Arden Miller, Sandrine Dudoit Copyright 2002

Complexity of Genomic data the functioning of cells is a complex and highly structured process tools are being developed that allow us to explore this functioning in a multitude of different ways these include expression of RNA, expression of proteins and many other processes

Complexity of Genomic Data in the next slide we show a stylized biochemical pathway (adapted from Wagner, 2001) there are transcription factors, protein kinase and protein phosphatase reactions

An example of the interactions between some genes (adapted from Wagner 2001) transcription factor protein kinase protein phosphatase inactive transcription factor inactive P protein P active active DNA Gene 1 Gene 2 Gene 3 Gene 4 Gene 5

Overview Wagner (2001) suggests that the holy grail of functional genomics is the reconstruction of genetic networks in this tutorial we examine some methods for doing this in factorial genome wide RNA expression experiments such experiments are easy to carry out and are becoming widespread, tools for analyzing them are badly needed

Overview while much of the early microarray data have been observational there are many different experiments that can be carried out we consider some simple factorial experiments and their analysis we assume that there are two factors of interest, F 1 and F 2

Factorial Experiments we can obtain expression data on the balanced application of the factors, under the four conditions nothing F 1 alone F 2 alone F 1 and F 2

Factorial Experiments if more factors are of interest then fractionally replicated factorial designs should be considered biological replication, while not essential is certainly helpful

Factorial Experiments the observed data consist of measured levels of mrna at each of these conditions on patients or model organisms such as cell lines, yeast or mice the questions of interest are typically which genes are directly affected by the two factors F 1 and F 2

Factorial Experiments we do not just observe changes in the genes that have been directly affected by the factors (primary targets) we also observe changes in any other genes whose expression levels are affected by changes in the primary targets (secondary targets)

Gene Effects a factor can either inhibit or enhance the production of mrna for any gene the inhibition or enhancement of mrna production for any given gene can affect mrna production for other genes either through inhibition or enhancement

Targets we define a target of a factor to be a gene whose expression of mrna is altered by the presence of the factor a primary target is a target that is directly affected by the factor a secondary target is a target whose expression is altered only via the effects of some other gene (can be traced back to one or more primary targets)

Factorial Experiments these experiments can be contrasted with those proposed by Wagner (2001) he proposed perturbing each gene in the genome of interest and observing the gene specific effects in our experiments we observe genome wide changes and hence less specific information

Factorial Experiments here we consider carrying out very few experiments the two methods can be complimentary since the results of the genome wide study could be used to design several single gene experiments

Some Examples of Experiments methylation: inhibits transcription of specific genes if a factor that demethylates the genome were available then one could, in principle determine which genes were methylated (or affected by mythylated genes) however we could not determine which genes were primary and which were secondary targets

Some Examples many cellular reactions are carried out using energy that is provided by the ADP-ATP phosphorylation mechanism if a simple mechanism was available for halting this mechanism then that could be used as a factor in these experiments and information on genes whose transcription is affected by phosphorylation could be identified

Some Examples the addition of a second factor (say one such as cyclohexamide, CX, that inhibits translation) will often allow us to isolate the primary factors from the secondary factors a simple (but not quite accurate) way to think of the data is as follows N F 1 (N forms a baseline for just F 1 ) F 2 F 1 +F 2 (F 2 forms a baseline for F 1 +F 2 )

Inference if the effect of F 1 is the same in the presence and absence of F 2 then it is possibly a primary candidate this is especially true in the case of CX (since it has halted most translation) we can similarly find potential primary targets of F 2 by reversing the argument

Limitations while we may identify genes that are potentially primary targets and those that are potentially secondary targets we cannot identify specific gene gene interactions the experiments proposed by Wagner could do that the use of relevant meta-data, biological and publication, seems pertinent and could help resolve some of the interactions

Limitations a direct corollary of the preceding limitations is the fact that we cannot identify feed back loops that is genes, or sets of genes whose regulation is self controlled we can observe the effects but not attribute them

Complications complications include the fact that the both F 1 and F 2 will have effects on the cells and their functioning other than those we are interested in we could see effects due to either of them because of chemical interactions etc. for simplicity we will assume this does not happen

An Experiment we now consider a two factor experiment involving CX in detail

The Experiment there are two factors, E is known affect transcription of various genes (some known, some unknown) CX is known to stop all translation (with very few exceptions) the design is a classical factorial design with two factors and we are interested in the main effects and interactions

The Experiment we identify as targets all genes whose expression of mrna is affected by the application of E a target can be either primary or secondary primary if E directly affects expression of mrna secondary if mrna production is affected by some other gene (can be traced back to a primary target)

Scenario 1 assume that there are two related genes, B and D neither is expressed initially, but E causes B to be expressed and this in turn causes D to be expressed the addition of CX by itself may not affect expression of either B or D conditions with CX and E present will have elevated levels of mrna B and low levels of mrna D

No Factors applied B D GeneB is not active Gene D is not active

E only MRNA B B MRNA D Transcription Translation E B B D B is a Primary Target of E D is a Secondary Target of E Production of mrna B is enhanced by E Production of mrna D is enhanced by B

Interpretation: Scenario 1 for both genes B and D we expect to see significant regression coefficients for the presence of E note that while we show a direct relationship between the expression of B and of D we cannot detect such a relationship from these data (its purpose is purely pedagogical)

E and CX both present CX MRNA B Transcription No mrna D No Translation E B D B is a Primary Target Production of mrna B is enhanced by E Production of mrna D is decreased (prevented)

Interpretation: Scenario 1 in the presence of both CX and E we see increased expression of mrna B but not of mrna D this will be one of the principles we can use to differentiate between primary targets of E (such as B) and secondary targets of E (such as D)

Interpretation: Scenario 1 Nothing E CX E and CX mrna B Low High Low(?) High mrna D Low High Low (?) Low

Suppressors: Scenario 2 we now consider a similar setting where the effect of the gene being enhanced by E is to suppress the other gene D initially mrna D is being produced and mrna B is not the addition of E causes the production of mrna B and hence the inhibition of mrna D

Suppressors: Scenario 2 CX by itself may reduce production of mrna D CX and E together will yield levels of mrna B that are high, and levels of mrna D that are the same as those observed with CX alone

Normal Conditions mrna D B D B is not active Production of mrna D

Introduction of E Transcription mrna B Translation No mrna D B E B D B is a Primary Target of E D is a Secondary Target of E Production of mrna B is enhanced by E Production of mrna D is suppressed by B

Both E and CX present CX mrna B mrna D Transcription Production of E B Protein B is halted D B is a Primary Target of E D is a Secondary Target of E Production of mrna B is enhanced by E Production of mrna D is restored

Interpretation:Scenario 2 Nothing E CX E and CX mrna B Low High Low (a) High mrna D High Low High(b) High(b)

Interpretation: Scenario 2 the level of mrna D when both CX and E are present should be the same as the amount that is present when CX alone is present this could be different than the amount when both factors are absent mrna D could be translationally controlled and so it will be affected by CX

One more example: Scenario 3 here genes B and D are active, protein B is enhancing production of D E inhibits production of mrna B, which in turn affects production of D CX alone decreases production of mrna D, B may be unchanged CX and E together will result in decreases in the levels of both mrna B and mrna D

Normal State mrna B B mrna D Transcription B B D B is active Production of mrna D is enhanced by B

Addition of E E B D B is suppressed by E Production of mrna D is also suppressed

Addition of CX CX mrna B Transcription B D Production of mrna B Production of mrna D is halted

Addition of E and CX CX E B D Production of mrna B is halted Production of mrna D is halted

Interpretation: Scenario 3 Nothing E CX E and CX mrna B High Low High Low mrna D High Low Low Low

The Experiment a microarray experiment can detect changes in the level of mrna and for both mrna B and mrna D but there is a difference, B is a primary target of E, while D is a secondary target of E

Inference we are experimenting with a closed, functioning system there is no true baseline these two facts complicate the analysis and inference in many ways

Inference if gene X is any target for E the level of mrna X might not change when E is added mrna X might already be being made as fast as possible, so addition of E has no effect (if we had a true baseline we could eliminate this) production of mrna X might already be suppressed by some other compound

Inference the introduction of CX provides a form of baseline since (among other things) CX halts translation we should be able to use the presence or absence of CX to find out about primary versus secondary targets

The Experiment if we assume that there is a linear model for the observed expression value (possibly on transformed data) it is: y ig = β x β x β x x g Eg 1i CXg 2i ECX :, g 1i 2i µ + + + + ε ig where i indexes chips and g indexes genes x 1 indicates the presence of E and x 2 indicates the presence of CX

The Experiment for any gene we can interpret the coefficients in the linear model as follows the parameter β E can be interpreted as the effect of E genes for which β E is different from zero are potential targets as noted previously not all targets will have β E different from zero

The Experiment the parameter β CX can be interpreted as the effect due to CX if β CX is different from zero indicates that production of mrna is translationally regulated the interpretation of β E:CX is more difficult

The Experiment we now refer back to the preceding scenarios to determine sets of conditions that will allow us to identify both primary and secondary targets

Scenario 1 Primary Secondary β 1 > 0 > 0 β 2 = 0 = 0 β 3 = 0 - β 1

Scenario 2 Primary Secondary β 1 > 0 < 0 β 2 = 0 = 0 β 3 = 0 - β 1

Scenario 3 Primary Secondary β 1 < 0 < 0 β 2 = 0 β 1 < 0 ( ) β 3 = 0 -β 1

Primary Targets consider the case where we have only CX and CX+E since CX halts all translation then any differences between the condition where CX alone is present and CX+E is present should indicate primary targets of E

Primary Targets this is equivalent to testing the hypothesis: H 0 : µ+β E +β CX +β E:CX = µ+β CX another equivalent hypothesis is H 0 : µ+β E +β CX +β E:CX = µ+β CX

Primary Targets genes for which the hypothesis H 0 : µ+β E +β CX +β E:CX = µ+β CX is rejected are candidates for primary targets those with β Ε different from zero but for which we do not reject H 0 are secondary targets this holds for all Scenarios discussed above

Primary Targets we can identify primary targets in at least two different ways fold change, look at ratio of the means of the CX arrays with the CX+E arrays use a linear model and estimate the contrasts (possibly then estimate the ratio)

Secondary Targets a secondary target should have the property that β 1 is not zero this means that E had some observed effect on expression of the gene and that we did not determine that it was a primary target

Other information what other information is available from the experiment? it seems likely that some inference may be drawn from the relationship between β E and β E:CX, their signs and their significance levels

Limitations while we can identify primary and secondary targets there is no way to determine the relationship between any two genes a corollary of this is that it is not possible to identify feedback loops using these data

Gene Filtering some reduction by filtering out genes that are not expressed or that are not affected by the factors will help reduce the computation this is problematic since we have only 2 observations at each level of the factors our approach was to compute an average for each data pair

Gene Filtering thus for any gene we have four averages (a i ) if the maximum of the four averages for a given gene was less than 100 the gene was filtered and not analysed further

Outlier Detection the detection of outliers in factorial experiments is difficult the residuals from the fit of the linear model must satisfy a number of constraints and hence are not suitable for outlier detection however, outlier detection is important since the presence of outliers will inflate the estimated variance and hence decrease our ability to detect significant effects

Examples of single outliers 800 400 Expression 0-400 -800-1200 -1600 ce ce ce ce Ce Ce CE CE Treatment

Examples of single outliers 2400 2000 Expression 1600 1200 800 400 0 ce ce ce ce Ce Ce CE CE Treatment

Outlier Detection when there are replicate chips a simple but effective procedure can be employed Miller and Scholtens (xxxx) propose using the following process put in some pictures of the outliers/etc tables indicating the preliminary results

Outlier Detection we presume that the expression at some set of experimental conditions is Normally distributed with mean µ and variance σ 2 so that the difference d i = x i1 x i2, is N(µ, 2σ 2 ) then the ratio,, is F 1,3 and we d 2 i 2 j j i d / 3

Outlier Detection we use a p-value of 4*P(F 1,3 >f) to adjust for the fact that we have used the maximum of the d i s in our calculation

Relevance in most cases there is some literature on genes that are likely to be affected by the different factors it is prudent to obtain this information and examine its consistency with the experimental data

Relevance there is a great deal of metadata available this includes references in published literature relationships through protein protein interactions known promoter inhibitor relationships these data can all be used to further explore and understand the experimental data

References How to reconstruct a large genetic network in fewer than n 2 easy steps, Wagner, A., Bioinformatics, 2001, 1183 1197.