Lecture 21. RNA-seq: Advanced analysis

William Walker
5 years ago
1 Lecture 21 RNA-seq: Advanced analysis

2 Experimental design

3 Introduction An experiment is a process or study that results in the collection of data. Statistical experiments are conducted in situations in which researchers can manipulate the conditions of the experiment and can control the factors that are irrelevant to the research objectives.

4 Statistical design of experiments Experimental design is the process of planning a study to meet specified objectives. Planning an experiment properly is very important in order to ensure that the right type of data and a sufficient sample size and power are available to answer the research questions of interest as clearly and efficiently as possible.

5 Designing an experiment Perform the following steps when designing an experiment: 1. Define the problem and the questions to be addressed 2. Define the population of interest 3. Determine the need for sampling 4. Define the experimental design

6 Define problem Before data collection begins, specific questions that the researcher plans to examine must be clearly identified. In addition, a researcher should identify the sources of variability in the experimental conditions. One of the main goals of a designed experiment is to partition the effects of the sources of variability into distinct components in order to examine specific questions of interest. The objective of designed experiments is to improve the precision of the results in order to examine the research hypotheses.

7 Define population A population is a collective whole of people, animals, plants, or other items that researchers collect data from. Before collecting any data, it is important that researchers clearly define the population, including a description of the members. The designed experiment should designate the population for which the problem will be examined. The entire population for which the researcher wants to draw conclusions will be the focus of the experiment.

8 Determine the need for sampling A sample is one of many possible sub-sets of units that are selected from the population of interest. In many data collection studies, the population of interest is assumed to be much larger in size than the sample so, potentially, there are a very large (usually considered infinite) number of possible samples. The results from a sample are then used to draw valid inferences about the population.

9 Determine the need for sampling A random sample is a sub-set of units that are selected randomly from a population. A random sample represents the general population or the conditions that are selected for the experiment because the population of interest is too large to study in its entirety. Using techniques such as random selection after stratification or blocking is often preferred.
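
Random selection after stratification can be sketched as follows. This is a minimal illustration, not part of the lecture: the population of 20 mice, the strata (sex), and the function name `stratified_sample` are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, n_per_stratum, seed=0):
    """Draw a simple random sample of fixed size within each stratum."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    # Sample without replacement, separately inside every stratum.
    return {name: rng.sample(members, n_per_stratum)
            for name, members in sorted(strata.items())}

# Hypothetical population: 20 mice labelled by sex (10 F, 10 M).
population = [(f"mouse{i:02d}", "F" if i % 2 == 0 else "M") for i in range(20)]
chosen = stratified_sample(population, stratum_of=lambda u: u[1], n_per_stratum=3)
for stratum, members in chosen.items():
    print(stratum, [name for name, _ in members])
```

Sampling within strata guarantees that each stratum is represented, which a single simple random sample from the whole population does not.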

10 Determine the need for sampling Determining the sample size requires some knowledge of the observed or expected variance among sample members in addition to how large a difference among treatments you want to be able to detect. Another way to describe this aspect of the design stage is to conduct a prospective power analysis, which is a brief statement about the capability of an analysis to detect a practical difference. A power analysis is essential so that the data collection plan will work to enhance the statistical tests primarily by reducing residual variation, which is one of the key components of a power analysis study.
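
The dependence of sample size on variance and detectable difference can be made concrete with a textbook approximation. The sketch below assumes a two-sided comparison of two group means with equal group sizes, a known common standard deviation `sigma`, and a normal approximation; the function name and default values are illustrative. It is not specific to RNA-seq counts, which usually call for power methods based on count distributions.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate units per group needed to detect a mean difference `delta`.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    n = 2 * ((z_alpha + z_power) * sigma / delta) ** 2
    return math.ceil(n)

# Halving the residual standard deviation cuts the required n roughly fourfold.
print(n_per_group(delta=1.0, sigma=1.0))
print(n_per_group(delta=1.0, sigma=0.5))
```

The formula shows why reducing residual variation matters: the required sample size scales with the square of `sigma / delta`.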

11 Define experimental design Defining the experimental design consists of the following steps: 1. Identify the experimental unit. 2. Identify the types of variables. 3. Define the treatment structure. 4. Define the design structure.

12 Experimental units An experimental (or sampling) unit is the person or object that will be studied by the researcher. This is the smallest unit of analysis in the experiment from which data will be collected (e.g. patient, mouse, plant, or cell line).

13 Experimental units An entity receiving an independent application of a treatment is called an experimental unit. An experimental run is the process of applying a particular treatment combination to an experimental unit and recording its response. A replicate is an independent run carried out on a different experimental unit under the same conditions.

14 Example: Two pots. Experimental unit: the plant in the pot. No replication.

15 Types of variables A data collection plan considers how four important types of variables (background, constant, uncontrollable, and primary) fit into the study. The explanatory variables are referred to as factors. Results are likely to be inconclusive if any of these classifications is not adequately defined. It is important to consider all the relevant variables before the final data collection plan is approved in order to maximize confidence in the final results.

16 Background variables Background variables can be identified and measured yet cannot be controlled; they will influence the outcome of an experiment. Background variables will be treated as covariates in the model rather than primary variables.

17 Primary variables Primary variables are the variables of interest to the researcher. Primary variables are independent variables that are possible sources of variation in the response. These variables comprise the treatment and design structures and are referred to as factors. When background variables are used in an analysis, better estimates of the primary variables should result because the sources of variation that are supplied by the covariates have been removed.

18 Constant variables Constant variables can be controlled or measured but, for some reason, will be held constant over the duration of the study. This action increases the validity of the results by reducing extraneous sources of variation from entering the data. For this data collection plan, some of the variables that will be held constant include: the use of standard operating procedures; the use of one operator for each measuring device; all measurements taken at specific times and locations.

19 Uncontrollable variables Uncontrollable variables are those variables that are known to exist, but conditions prevent them from being manipulated, or it is very difficult (due to cost or physical constraints) to measure them. The experimental error is due to the influential effects of uncontrollable variables, which will result in less precise evaluations of the effects of the primary and background variables. The design of the experiment should eliminate or control these types of variables as much as possible in order to increase confidence in the final results.

20 Explanatory and response variables [Diagram labels: $X$: explanatory variables (factors); $Y$: response variables]

21 Factors [Diagram labels: $Z$: noise factor or blocking factor; $X$: treatment factor or design factor; $Y$: response variables] Levels: $X = x$. Treatment combination or treatment: a particular combination of factor levels (e.g. $(x_1, x_2)$ if there are two treatment factors).

22 Levels The levels of the primary factors represent the range of the inference space relative to a study. The levels of the primary factors can represent the entire range of possibilities or a random sub-set. It is also important to recognize and define when combinations of levels of two or more treatment factors are illogical or unlikely to exist.

23 Fixed effects Fixed effects treatment factors are usually considered to be "fixed" in the sense that all levels of interest are included in the study because they are selected by some non-random process, they consist of the whole population of possible levels, or other levels were not feasible to consider as part of the study. The fixed effects represent the levels of a set of precise hypotheses of interest in the research. A fixed factor can have only a small number of inherent levels; for example, the only relevant levels for gender are male and female. A factor should also be considered fixed when only certain values of it are of interest, even though other levels might exist. Treatment factors can also be considered "fixed" as opposed to "random" because they are the only levels about which you would want to make inferences.

24 Three basic principles of experimental design: replication, randomization, blocking.

25 Replication By replication we mean an independent repeat run of each treatment combination. Replication is essential for estimating experimental error. If a treatment condition appears more than one time, it is defined to be replicated. Misconceptions about the number of replications have often occurred in experiments where sub-samples or repeated observations on a unit have been mistaken as additional experimental units.

26 Randomization By randomization we mean that both the assignment of treatments to units and the order in which the individual runs of the experiments are to be performed are randomly determined. A completely randomized design is an experimental design in which treatments are assigned to all units by randomization.
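
A completely randomized design can be sketched in a few lines. The example is illustrative only: the eight pots, the two treatment names, and the function `completely_randomized` are hypothetical, and balanced replication (equal units per treatment) is an assumption of this sketch.

```python
import random

def completely_randomized(units, treatments, seed=0):
    """Randomly assign treatments to units and randomize the run order."""
    if len(units) % len(treatments) != 0:
        raise ValueError("units must divide evenly among treatments")
    rng = random.Random(seed)  # fixed seed so the plan can be reproduced
    reps = len(units) // len(treatments)
    # One label per unit, with equal replication for every treatment.
    labels = [t for t in treatments for _ in range(reps)]
    rng.shuffle(labels)
    assignment = dict(zip(units, labels))
    # The order in which the runs are performed is randomized separately.
    run_order = list(units)
    rng.shuffle(run_order)
    return assignment, run_order

pots = [f"pot{i}" for i in range(1, 9)]
assignment, run_order = completely_randomized(pots, ["control", "treated"])
print(assignment)
print(run_order)
```

Both the assignment and the run order are randomized, matching the two parts of the definition above.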

27 Example: Randomized. Experimental unit: the plant in the pot. 4 replicates for each treatment.

28 Blocking Most experimental designs require experimental units to be allocated to treatments either randomly or randomly with constraints, as in blocked designs. Blocks are groups of experimental units that are formed to be as homogeneous as possible with respect to the block characteristics. The term block comes from the agricultural heritage of experimental design, where a large block of land with uniform soil, drainage, sunlight, and other important physical characteristics was selected for the various treatments. Randomly allocating the treatment levels within each homogeneous block improves the comparison of treatments.

29 Blocking Blocking is an experimental design strategy used to reduce or eliminate the variability transmitted from nuisance factors, which may influence the response variable but in which we are not directly interested. Blocking is the grouping of experimental units that have similar properties. Within each block, treatments are randomly assigned to experimental units. The resulting design is called a randomized block design. This design enables more precise estimates of the treatment effects because comparisons between treatments are made among homogeneous experimental units in each block.

30 Blocking [Diagram labels: $Z$, $X$, $Y$]

31 Blocking example Blocking removes the variation in response among chambers, allowing more precise estimates and more powerful tests of the treatment effects.

32 Design structure The design structure consists of those factors that define the blocking of the experimental units into clusters. Commonly used design structures: completely randomized design; randomized complete block design; factorial design.

33 Completely randomized design Subjects are assigned to treatments completely at random.

34 Randomized complete block design Subjects are divided into b blocks according to demographic characteristics. Subjects in each block are then randomly assigned to treatments so that all treatment levels appear in each block.
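
The randomized complete block design can be sketched as a separate randomization inside every block. The block names (growth chambers), plant labels, treatment names, and the function `randomized_complete_block` are hypothetical; the sketch assumes each block holds exactly one unit per treatment.

```python
import random

def randomized_complete_block(blocks, treatments, seed=0):
    """Within each block, assign every treatment to exactly one unit at random."""
    rng = random.Random(seed)  # fixed seed so the plan can be reproduced
    design = {}
    for block_name, units in blocks.items():
        if len(units) != len(treatments):
            raise ValueError("each block needs one unit per treatment")
        order = list(treatments)
        rng.shuffle(order)  # an independent shuffle for every block
        design[block_name] = dict(zip(units, order))
    return design

# Hypothetical blocks: three growth chambers with three plants each.
blocks = {
    "chamber1": ["plant1", "plant2", "plant3"],
    "chamber2": ["plant4", "plant5", "plant6"],
    "chamber3": ["plant7", "plant8", "plant9"],
}
design = randomized_complete_block(blocks, ["control", "low", "high"])
for block_name, plan in design.items():
    print(block_name, plan)
```

Because every treatment level appears once in every block, differences between blocks do not bias the treatment comparison.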

35 Factorial design Many experiments in biology investigate more than one treatment factor, because: 1. answering two questions from a single experiment rather than just one makes more efficient use of time, supplies, and other costs 2. the factors might interact.

36 Factorial design An experiment having a factorial design investigates all treatment combinations of two or more treatment factors. A factorial design can measure interactions between factors. An interaction between two (or more) explanatory variables means that the effect of one variable on the response depends on the state of the other variable.
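
A small sketch may help to fix the idea of treatment combinations and interaction. The two factors, their levels, and the cell means below are made-up numbers chosen only for illustration.

```python
from itertools import product

# Hypothetical 2 x 2 factorial: two treatment factors with two levels each.
factors = {"genotype": ["wt", "mutant"], "drug": ["none", "dosed"]}

# A factorial design uses every combination of factor levels.
combos = list(product(*factors.values()))
print(combos)

# Made-up mean responses for each treatment combination.
cell_mean = {
    ("wt", "none"): 10.0,
    ("wt", "dosed"): 12.0,
    ("mutant", "none"): 11.0,
    ("mutant", "dosed"): 17.0,
}

# Effect of the drug within each genotype.
drug_effect_wt = cell_mean[("wt", "dosed")] - cell_mean[("wt", "none")]
drug_effect_mutant = cell_mean[("mutant", "dosed")] - cell_mean[("mutant", "none")]

# The interaction is the difference between those two effects:
# zero would mean the drug effect does not depend on genotype.
interaction = drug_effect_mutant - drug_effect_wt
print(drug_effect_wt, drug_effect_mutant, interaction)
```

Here the drug effect differs between genotypes, so the effect of one factor depends on the state of the other.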

37 Factorial design [Diagram labels: $X_1$, $X_2$, $Y$]

38 Analyzing data

39 Regression Regression is a method that 1. predicts the average values of a response variable from values of explanatory variables (focusing on regression to represent relationships between variables) 2. summarizes how the average values of a response variable vary over subpopulations defined by functions of explanatory variables (focusing on regression as a comparison of average outcomes)

40 Ordinary linear regression
$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1} + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, n$
$E[y_i] = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1}, \quad i = 1, \ldots, n$
$E[y \mid X] = X\beta$
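
For the special case of a single explanatory variable (p = 2), the least-squares estimates have a simple closed form. The sketch below uses that closed form on made-up data; the function name `fit_simple_ols` and the data values are illustrative, and the variance estimate uses the usual n - 2 residual degrees of freedom.

```python
def fit_simple_ols(x, y):
    """Least-squares fit of y_i = b0 + b1 * x_i + e_i (one explanatory variable)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx           # slope estimate
    b0 = y_bar - b1 * x_bar  # intercept estimate
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sigma2_hat = sum(r * r for r in residuals) / (n - 2)  # residual variance
    return b0, b1, sigma2_hat

# Made-up data for illustration.
x = [0, 1, 2, 3, 4]
y = [1.0, 3.1, 4.9, 7.2, 8.8]
b0, b1, sigma2_hat = fit_simple_ols(x, y)
print(b0, b1, sigma2_hat)
```

The fitted line `b0 + b1 * x` is the estimate of the conditional mean of the response at each value of the explanatory variable.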

41 Three components of the GLMs
1. Random component: $y = (y_1, \ldots, y_n)^T$ and its probability distribution. The observations $y_i$ are treated as independent.
2. Systematic component (or linear predictor): $X\beta$, where $X$ is an $n \times p$ model matrix and $\beta = (\beta_0, \ldots, \beta_{p-1})^T$.
3. Link function: a function $g$ applied to each component of $E[y]$ that relates it to the linear predictor, $g(E[y]) = X\beta$.

42 Random component of a GLM The random component of a GLM consists of a response variable $y$ with independent observations $y = (y_1, \ldots, y_n)^T$ having probability density or mass function for a distribution in the exponential family. By restricting GLMs to exponential family distributions, we obtain: 1. General expressions for the model likelihood equations 2. Asymptotic distributions of estimators for model parameters 3. An algorithm for fitting the models.

43 Systematic component of a GLM The linear predictor of a GLM relates parameters $\{\eta_i\}$ pertaining to $\{E[y_i]\}$ to the explanatory variables using a linear combination of them,
$\eta_i = \sum_{j=0}^{p-1} \beta_j x_{ij}$, where $x_{i0} = 1$.
This expression is linear in the parameters. In matrix form, we express the linear predictor as $\eta = X\beta$, where $\eta = (\eta_1, \ldots, \eta_n)^T$, $\beta = (\beta_0, \ldots, \beta_{p-1})^T$ is the column vector of model parameters, and $X$ is the $n \times p$ matrix of explanatory variable values. The matrix $X$ is called the model matrix or design matrix.

44 Link function of a GLM The link function connects the random component with the linear predictor. Let $\mu_i = E[y_i]$. The GLM links $\eta_i$ to $\mu_i$ by $\eta_i = g(\mu_i)$, where the link function $g$ is a monotonic, differentiable function:
$g(\mu_i) = \sum_{j=0}^{p-1} \beta_j x_{ij}$
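
The relation between the linear predictor and the mean can be illustrated with three common links. This is a minimal sketch: the dictionary `LINKS`, the function `mean_from_predictor`, and the coefficient values are illustrative and not taken from the lecture.

```python
import math

# Each entry holds (g, g_inverse): g maps the mean to the linear predictor scale.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (lambda mu: math.log(mu / (1 - mu)),
              lambda eta: 1 / (1 + math.exp(-eta))),
}

def mean_from_predictor(link_name, beta, x_row):
    """Compute mu_i = g^{-1}(eta_i) with eta_i = sum_j beta_j * x_ij."""
    eta = sum(b * x for b, x in zip(beta, x_row))
    _, g_inverse = LINKS[link_name]
    return g_inverse(eta)

# One observation with an intercept column (x_i0 = 1) and one covariate.
x_row = [1, 1]
print(mean_from_predictor("identity", [2.0, 0.5], x_row))
print(mean_from_predictor("log", [0.0, math.log(2)], x_row))
print(mean_from_predictor("logit", [0.0, 0.0], x_row))
```

The same linear predictor yields a mean on the whole real line, on the positive half-line, or in (0, 1), depending on the link.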

45 Linear model: a GLM with identity link function The link function $g(\mu_i) = \mu_i$ is called the identity link function. A GLM that uses the identity link function is called a linear model. This GLM has
$\mu_i = \sum_{j=0}^{p-1} \beta_j x_{ij}$
The standard version of the linear model, which we refer to as the ordinary linear model, assumes that the observations have constant variance, called homoscedasticity:
$y_i = \sum_{j=0}^{p-1} \beta_j x_{ij} + \varepsilon_i$, where $E[\varepsilon_i] = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$.
The ordinary normal linear model further assumes that $\varepsilon_i \sim N(0, \sigma^2)$.

46 Important GLMs
Random component   Systematic component   Link function       Model
Normal             Continuous             Identity            Regression
Normal             Categorical            Identity            ANOVA
Normal             Mixed                  Identity            ANCOVA
Binomial           Mixed                  Logit               Logistic regression
Binomial           Mixed                  Probit and others   Binary regression
Multinomial        Mixed                  Generalized logit   Multinomial response
Poisson            Mixed                  Log                 Poisson loglinear

47 Example: RNA-seq

48 Multiple factors Experiments with more than one factor influencing the counts can be analyzed using design formulas that include the additional variables. In fact, DESeq2 can analyze any possible experimental design that can be expressed with fixed effects terms (multiple factors, designs with interactions, designs with continuous variables, splines, and so on are all possible). By adding variables to the design, one can control for additional variation in the counts. For example, if the condition samples are balanced across experimental batches, including the batch factor in the design can increase the sensitivity for finding differences due to condition. There are multiple ways to analyze experiments when the additional variables are of interest and not just controlling factors.

49 Including type

50 Accounting for type We can account for the different types of sequencing, and get a clearer picture of the differences attributable to the treatment. As condition is the variable of interest, we put it at the end of the formula. Thus the results function will by default pull the condition results unless contrast or name arguments are specified. Then we can rerun DESeq.
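
To see what an additive design with type followed by condition means in terms of the model matrix, the sketch below builds the matrix by hand using treatment (reference-level) coding. It does not call DESeq2: the function `model_matrix`, the level names ("single", "paired", "untreated", "treated"), the four samples, and the column naming scheme are all assumptions made for illustration.

```python
def model_matrix(rows, levels):
    """Build an intercept-plus-indicators matrix with treatment coding.

    rows:   list of dicts mapping factor name -> level for each sample
    levels: dict mapping factor name -> ordered levels (first is the reference)
    """
    columns = ["Intercept"]
    for factor, lv in levels.items():
        columns += [f"{factor}_{level}_vs_{lv[0]}" for level in lv[1:]]
    matrix = []
    for row in rows:
        line = [1]  # intercept column, x_i0 = 1
        for factor, lv in levels.items():
            # One 0/1 indicator per non-reference level of the factor.
            line += [1 if row[factor] == level else 0 for level in lv[1:]]
        matrix.append(line)
    return columns, matrix

# Hypothetical samples; the variable of interest (condition) is listed last.
samples = [
    {"type": "single", "condition": "untreated"},
    {"type": "single", "condition": "treated"},
    {"type": "paired", "condition": "untreated"},
    {"type": "paired", "condition": "treated"},
]
levels = {"type": ["single", "paired"], "condition": ["untreated", "treated"]}
columns, matrix = model_matrix(samples, levels)
print(columns)
for line in matrix:
    print(line)
```

With this coding, the last column carries the condition effect adjusted for type, which is why placing the variable of interest last makes its coefficient the natural default to report.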

51 Accounting for type

52 Accounting for type

53 Accounting for type

54 Accounting for type It is also possible to retrieve the log2 fold changes, p values and adjusted p values of the type variable. The contrast argument of the function results takes a character vector of length three: the name of the variable, the name of the factor level for the numerator of the log2 ratio, and the name of the factor level for the denominator.
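
The orientation of the three-element contrast (variable, numerator level, denominator level) can be illustrated with a toy calculation. The sketch below is not how DESeq2 estimates fold changes, which come from the fitted model; it only computes a log2 ratio of group means on made-up values to show which level ends up in the numerator. The level names and the function `log2_ratio` are hypothetical.

```python
import math
from statistics import mean

def log2_ratio(values, labels, contrast):
    """log2(mean of numerator group / mean of denominator group)."""
    variable, numerator, denominator = contrast
    num = mean([v for v, lab in zip(values, labels) if lab[variable] == numerator])
    den = mean([v for v, lab in zip(values, labels) if lab[variable] == denominator])
    return math.log2(num / den)

# Made-up normalized counts for one gene in four samples.
values = [100, 120, 400, 480]
labels = [{"type": "single"}, {"type": "single"},
          {"type": "paired"}, {"type": "paired"}]

forward = log2_ratio(values, labels, ("type", "paired", "single"))
reverse = log2_ratio(values, labels, ("type", "single", "paired"))
print(forward, reverse)
```

Swapping the numerator and denominator levels flips the sign of the log2 fold change, so the order of the last two elements determines how "up" and "down" are read.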

55 Accounting for type

56 Gene Ontology

57 Annotating and exporting results Our result table only contains information about Ensembl gene IDs, but alternative gene names may be more informative for collaborators. Bioconductor's annotation packages help with mapping various ID schemes to each other.

58 Annotating and exporting results

59 Annotating and exporting results

60 Running topGO

61 Running topGO

62 Running topGO

63 Running topGO

64 Downregulated GO

65 Upregulated GO

66 Published results The top 10 most significant terms are shown for downregulated (D) and upregulated (E) genes, respectively.
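
Significance of a GO term is commonly assessed with a one-sided Fisher's exact test, which is equivalent to an upper hypergeometric tail probability; topGO offers this test among others. The transcript does not state which test or algorithm the lecture used, so the sketch below is only an illustration of the basic idea, with made-up counts and a hypothetical function name.

```python
from math import comb

def enrichment_p_value(n_universe, n_annotated, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn from n_universe genes,
    of which n_annotated carry the GO term (hypergeometric upper tail)."""
    total = comb(n_universe, n_selected)
    upper = min(n_annotated, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Made-up numbers: 10 genes in total, 4 annotated to the term,
# 3 genes called differentially expressed, 2 of which are annotated.
p = enrichment_p_value(n_universe=10, n_annotated=4, n_selected=3, n_overlap=2)
print(p)
```

Ranking terms by such p-values is one way to obtain a "top 10 most significant terms" list; in practice the p-values are usually examined alongside the dependence structure of the GO graph.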
