RNA-seq Differential analysis
Data transformations
Count data transformations In order to test for differential expression, we operate on raw counts and use discrete distributions. For other downstream analyses, e.g. for visualization or clustering, it might be useful to work with transformed versions of the count data. The most obvious choice of transformation is the logarithm. Since count values for a gene can be zero in some conditions, and the logarithm of zero is undefined, some advocate the use of pseudocounts, i.e. transformations of the form $y = \log_2(n + n_0)$, where $n$ represents the count values and $n_0$ is a positive constant.
normTransform(): $\log_2(n + 1)$
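As a minimal sketch of the pseudocount transformation, the Python snippet below applies log2(n + n0) to a handful of counts. The helper name `log2_pseudocount` and the example counts are hypothetical and chosen only for illustration. The sketch ignores library size normalization, so it shows the shape of the transformation rather than what any DESeq2 function computes.

```python
import math

def log2_pseudocount(counts, n0=1.0):
    """Return log2(n + n0) for each count n; n0 must be a positive constant."""
    if n0 <= 0:
        raise ValueError("n0 must be a positive constant")
    return [math.log2(n + n0) for n in counts]

# Hypothetical counts for one gene across five samples (illustration only).
counts = [0, 1, 3, 7, 1023]
transformed = log2_pseudocount(counts)
print(transformed)
```

With n0 = 1, a zero count maps to zero instead of being undefined, which is the reason the pseudocount is added.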
rlog and vst We discuss two alternative approaches that offer more theoretical justification and a rational way of choosing the parameter equivalent to $n_0$. The regularized logarithm or rlog incorporates a prior on the sample differences (Love, Huber, and Anders 2014), and the other uses the concept of variance stabilizing transformations (VST) (Tibshirani 1988; Huber et al. 2003; Anders and Huber 2010). Both transformations produce transformed data on the log2 scale which has been normalized with respect to library size.
rlog and vst The point of these two transformations is to remove the dependence of the variance on the mean, particularly the high variance of the logarithm of count data when the mean is low. Both rlog and VST use the experiment-wide trend of variance over mean, in order to transform the data to remove the experiment-wide trend. Note that we do not require or desire that all the genes have exactly the same variance after transformation. Indeed, you will see that after the transformations the genes with the same mean do not have exactly the same standard deviations, but that the experiment-wide trend has flattened. It is those genes with variance above the trend which will allow us to cluster samples into interesting groups.
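The dependence of the variance on the mean can be made concrete with a toy calculation. The sketch below assumes Poisson-distributed counts, which is an assumption made for illustration and not the negative binomial model behind rlog and VST. It compares the exact standard deviation of log2(n + 1) with that of the square root, a classical variance stabilizing transformation for Poisson data. The helper names are hypothetical.

```python
import math

def poisson_pmf(k, mu):
    """Poisson probability P(N = k), computed in log space for stability."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def sd_after_transform(mu, transform):
    """Standard deviation of transform(N) for N ~ Poisson(mu), by direct summation."""
    # Truncate the sum far in the upper tail, where the remaining mass is negligible.
    upper = int(mu + 12 * math.sqrt(mu) + 30)
    ks = range(upper + 1)
    probs = [poisson_pmf(k, mu) for k in ks]
    mean = sum(p * transform(k) for k, p in zip(ks, probs))
    var = sum(p * (transform(k) - mean) ** 2 for k, p in zip(ks, probs))
    return math.sqrt(var)

means = [5, 50, 500]  # assumed Poisson means, low to high expression
sd_log = {mu: sd_after_transform(mu, lambda n: math.log2(n + 1)) for mu in means}
sd_sqrt = {mu: sd_after_transform(mu, math.sqrt) for mu in means}

for mu in means:
    print(mu, round(sd_log[mu], 3), round(sd_sqrt[mu], 3))
```

Under these assumptions the standard deviation of the log-transformed counts shrinks as the mean grows, while the square-root values stay close to a constant. Removing a trend of this kind is what rlog and VST aim for with real count data.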
Blind dispersion estimation The two functions, rlog and vst, have an argument blind, which controls whether the transformation should be blind to the sample information specified by the design formula. When blind equals TRUE (the default), the functions will reestimate the dispersions using only an intercept. This setting should be used in order to compare samples in a manner wholly unbiased by the information about experimental groups, for example to perform sample QA (quality assurance).
Blind dispersion estimation However, blind dispersion estimation is not the appropriate choice if one expects that many or the majority of genes will have large differences in counts which are explainable by the experimental design, and one wishes to transform the data for downstream analysis. In this case, using blind dispersion estimation will lead to large estimates of dispersion, as it treats differences due to experimental design as unwanted noise, and will result in overly shrinking the transformed values towards each other. By setting blind to FALSE, the dispersions already estimated will be used to perform transformations, or if not present, they will be estimated using the current design formula.
Extracting transformed values The assay function is used to extract the matrix of normalized values.
rlog The function rlog, which stands for regularized log, transforms the original count data to the log2 scale by fitting a model with a term for each sample and a prior distribution on the coefficients which is estimated from the data. This is the same kind of shrinkage (sometimes referred to as regularization, or moderation) of log fold changes used by DESeq and nbinomWaldTest. The resulting data contains elements defined as: $\log_2(q_{ij}) = \beta_{i0} + \beta_{ij}$
rlog $q_{ij}$: a parameter proportional to the expected true concentration of fragments for gene $i$ and sample $j$. $\beta_{i0}$: an intercept which does not undergo shrinkage. $\beta_{ij}$: the sample-specific effect which is shrunk toward zero based on the dispersion-mean trend over the entire dataset.
VST The closed-form expression for the variance stabilizing transformation is used by varianceStabilizingTransformation.
Effects of transformations on the variance
Data visualization
Quality assessment Data quality assessment and quality control (i.e. the removal of insufficiently good data) are essential steps of any data analysis. These steps should typically be performed very early in the analysis of a new data set, preceding or in parallel to the differential expression testing. We define the term quality as fitness for purpose. Our purpose is the detection of differentially expressed genes, and we are looking in particular for samples whose experimental treatment suffered from an abnormality that renders the data points obtained from these particular samples detrimental to our purpose.
Heatmap of the count matrix
Heatmap of the sample-to-sample distances
PCA plot of the samples
Design of experiments
Statistical design of experiments The process of planning the experiment so that appropriate data will be collected and analyzed by statistical methods, resulting in valid and objective conclusions.
Explanatory and response variables
- $X$: explanatory variables (factors)
- $Y$: response variables
Factors
- $Z$: noise factor or blocking factor
- $X$: treatment factor or design factor
- $Y$: response variables
- Levels: $X = x$
- Treatment combination or treatment: a particular combination of factor levels (e.g. $(x_1, x_2)$ if there are two treatment factors)
Three basic principles of experimental design: randomization, replication, and blocking.
Randomization By randomization we mean that both the assignment of treatments to units and the order in which the individual runs of the experiments are to be performed are randomly determined. A completely randomized design is an experimental design in which treatments are assigned to all units by randomization.
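A completely randomized design can be sketched in a few lines of Python. The function name, the unit names, and the two treatment labels below are hypothetical. The sketch assumes a balanced design with the same number of replicates per treatment and uses a fixed seed only so that the result is reproducible.

```python
import random

def completely_randomized_design(units, treatments, replicates, seed=None):
    """Randomly assign treatments to units and randomize the order of the runs."""
    if len(units) != len(treatments) * replicates:
        raise ValueError("need exactly len(treatments) * replicates units")
    rng = random.Random(seed)
    labels = [t for t in treatments for _ in range(replicates)]
    rng.shuffle(labels)                  # random assignment of treatments to units
    assignment = dict(zip(units, labels))
    run_order = list(units)
    rng.shuffle(run_order)               # random order of the individual runs
    return assignment, run_order

# Hypothetical example: 8 plants, 2 treatments, 4 replicates each.
units = [f"plant_{i}" for i in range(1, 9)]
assignment, run_order = completely_randomized_design(
    units, ["control", "treated"], replicates=4, seed=1
)
for unit in run_order:
    print(unit, assignment[unit])
```

Both the assignment and the run order are shuffled, matching the two parts of the definition above.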
Replication By replication we mean an independent repeat run of each treatment combination.
Experimental units An entity receiving an independent application of a treatment is called an experimental unit. An experimental run is the process of applying a particular treatment combination to an experimental unit and recording its response. A replicate is an independent run carried out on a different experimental unit under the same conditions.
Example: Two pots. Experimental unit: plant on the pot. No replication.
Example: Randomized. Experimental unit: plant on the pot. 4 replicates for each treatment.
Blocking Blocking is an experimental design strategy used to reduce or eliminate the variability transmitted from nuisance factors, which may influence the response variable but in which we are not directly interested. Blocking is the grouping of experimental units that have similar properties. Within each block, treatments are randomly assigned to experimental units. The resulting design is called a randomized block design. This design enables more precise estimates of the treatment effects because comparisons between treatments are made among homogeneous experimental units in each block.
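The within-block randomization described above can be sketched as follows. The function name and the block and treatment labels are hypothetical. The sketch assumes a complete block design in which every block contains exactly one unit per treatment.

```python
import random

def randomized_block_design(blocks, treatments, seed=None):
    """Within each block, assign the treatments to the block's units in random order."""
    rng = random.Random(seed)
    design = {}
    for block in blocks:
        order = list(treatments)
        rng.shuffle(order)       # independent randomization inside each block
        design[block] = order    # order[i] is the treatment for the i-th unit of the block
    return design

# Hypothetical example: 5 blocks, 3 treatments.
blocks = [f"block_{i}" for i in range(1, 6)]
treatments = ["Control", "Low", "High"]
design = randomized_block_design(blocks, treatments, seed=42)
for block, order in design.items():
    print(block, order)
```

Every treatment appears once in every block, so treatment comparisons are made among units that share the same block.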
Blocking [Diagram: blocking factor $Z$, treatment factor $X$, response $Y$]
Blocking example Blocking removes the variation in response among chambers, allowing more precise estimates and more powerful tests of the treatment effects.
Blinding Blinding is the process of concealing from participants and researchers which participants receive which treatments. Single-blind experiment: participants are unaware of the treatment they have been assigned. This prevents participants from responding differently according to their knowledge of their treatment. Double-blind experiment: researchers administering the treatments and measuring the response are also unaware of which subjects are receiving which treatment.
Factorial design Many experiments in biology investigate more than one treatment factor, because: 1. answering two questions from a single experiment rather than just one makes more efficient use of time, supplies, and other costs 2. the factors might interact.
Factorial design An experiment having a factorial design investigates all treatment combinations of two or more treatment factors. A factorial design can measure interactions between factors. An interaction between two (or more) explanatory variables means that the effect of one variable on the response depends on the state of the other variable.
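To illustrate, the sketch below enumerates all treatment combinations of a two-factor design and computes an interaction contrast. The factor names, the levels, and the cell means are hypothetical numbers invented for illustration, not data from any experiment.

```python
from itertools import product

# Hypothetical treatment factors and their levels.
factors = {
    "temperature": ["low", "high"],
    "nutrient": ["absent", "present"],
}

# A factorial design uses every combination of factor levels.
combinations = list(product(*factors.values()))
print(combinations)

# Hypothetical mean responses for each treatment combination.
cell_means = {
    ("low", "absent"): 10.0,
    ("low", "present"): 12.0,
    ("high", "absent"): 11.0,
    ("high", "present"): 17.0,
}

# Effect of the nutrient at each temperature level.
effect_at_low = cell_means[("low", "present")] - cell_means[("low", "absent")]
effect_at_high = cell_means[("high", "present")] - cell_means[("high", "absent")]

# A nonzero difference means the nutrient effect depends on temperature.
interaction = effect_at_high - effect_at_low
print(effect_at_low, effect_at_high, interaction)
```

In this made-up example the nutrient effect is larger at high temperature than at low temperature, which is what an interaction between the two factors means.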
Factorial design [Diagram: treatment factors $X_1$ and $X_2$, response $Y$]
A unified model: general linear model $E[y] = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1}$
Basic linear models

| Model formula | Model | Design |
| --- | --- | --- |
| y ~ x | Linear regression | Dose-response |
| y ~ t | One-way ANOVA | Completely randomized |
| y ~ t + b | Two-way ANOVA | Randomized block |
| y ~ t1 + t2 + t1:t2 | Two-way, fixed-effect ANOVA | Factorial design |
| y ~ t + x | ANCOVA | Observational study with one known noise factor |
| y ~ x1 + x2 + x1:x2 | Multiple linear regression | Dose-response |

x: numerical, t: categorical treatment factor, b: categorical blocking factor
Randomized complete block design How does fish abundance affect the abundance and diversity of prey species?
Design: three fish abundance treatments, Control, Low (30 fish), and High (90 fish)
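Data: Zooplankton diversity in three fish abundance treatments

| Treatment | Block 1 | Block 2 | Block 3 | Block 4 | Block 5 |
| --- | --- | --- | --- | --- | --- |
| Control | 4.1 | 3.2 | 3.0 | 2.3 | 2.5 |
| Low | 2.2 | 2.4 | 1.5 | 1.3 | 2.6 |
| High | 1.3 | 2.0 | 1.0 | 1.0 | 1.6 |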
Model: y ~ t + b
$y_i = \beta_0 + \beta_1 t_i + \beta_2 b_i + \varepsilon_i$
H0: Mean zooplankton diversity is the same in every abundance treatment (reduced model: y ~ b)
H1: Mean zooplankton diversity is not the same in every abundance treatment (full model: y ~ t + b)
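The comparison of y ~ b against y ~ t + b can be sketched by hand for the zooplankton data. The Python code below is a plain sums-of-squares calculation that assumes the five data columns are the five blocks and relies on the design being balanced. It is not the output of any particular statistics package, and it stops at the F statistic because the standard library provides no F distribution.

```python
# Zooplankton diversity; each list holds the values for blocks 1 to 5.
diversity = {
    "Control": [4.1, 3.2, 3.0, 2.3, 2.5],
    "Low":     [2.2, 2.4, 1.5, 1.3, 2.6],
    "High":    [1.3, 2.0, 1.0, 1.0, 1.6],
}
n_treat = len(diversity)
n_block = 5

values = [y for row in diversity.values() for y in row]
grand_mean = sum(values) / len(values)
treat_means = {t: sum(row) / n_block for t, row in diversity.items()}
block_means = [sum(row[j] for row in diversity.values()) / n_treat
               for j in range(n_block)]

# Partition the total sum of squares into treatment, block, and residual parts.
ss_total = sum((y - grand_mean) ** 2 for y in values)
ss_treat = n_block * sum((m - grand_mean) ** 2 for m in treat_means.values())
ss_block = n_treat * sum((m - grand_mean) ** 2 for m in block_means)
ss_resid = ss_total - ss_treat - ss_block

df_treat = n_treat - 1
df_resid = (n_treat - 1) * (n_block - 1)

# F statistic for the treatment term after accounting for blocks.
f_stat = (ss_treat / df_treat) / (ss_resid / df_resid)
print(df_treat, df_resid, round(f_stat, 2))
```

The block sum of squares is removed from the residual before the treatment effect is tested, which is how blocking sharpens the comparison between treatments.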
Fitting the model to data
Adjusting for a known confounding factor Mole rats are the only known mammals with distinct social castes.
- A single queen and a small number of males are the only reproducing individuals in a colony.
- Workers gather food, defend the colony, care for the young, and maintain the burrows.
- Two worker castes in the Damaraland mole rat:
  - Frequent workers: do almost all of the work in the colony
  - Infrequent workers: do little work except on rare occasions after rains
Adjusting for a known confounding factor To assess the physiological differences between the two types of workers, researchers compared daily energy expenditures of wild mole rats during a dry season.
Known noise factor: Energy expenditure appears to vary with body mass in both groups, but infrequent workers are heavier than frequent workers.
Research question: How different is mean daily energy expenditure between the two groups when adjusted for differences in body mass?
Data
Model: y ~ t + x
H0: Castes do not differ in energy expenditure (reduced model: y ~ x)
H1: Castes differ in energy expenditure (full model: y ~ t + x)
Fitting the model to data
Multi-factor designs
Multiple factors Experiments with more than one factor influencing the counts can be analyzed using design formulas that include the additional variables. In fact, DESeq2 can analyze any possible experimental design that can be expressed with fixed effects terms (multiple factors, designs with interactions, designs with continuous variables, splines, and so on are all possible). By adding variables to the design, one can control for additional variation in the counts. For example, if the condition samples are balanced across experimental batches, by including the batch factor in the design, one can increase the sensitivity for finding differences due to condition. There are multiple ways to analyze experiments when the additional variables are of interest and not just controlling factors.
Including type
Accounting for type We can account for the different types of sequencing, and get a clearer picture of the differences attributable to the treatment. As condition is the variable of interest, we put it at the end of the formula. Thus the results function will by default pull the condition results unless contrast or name arguments are specified. Then we can rerun DESeq.
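To see what a two-factor design such as type + condition means numerically, the sketch below builds a treatment-coded design matrix in Python. DESeq2 constructs its model matrix internally in R, so this only mirrors the idea. The helper `design_matrix`, the sample table, the level names, and the column naming scheme are all hypothetical.

```python
def design_matrix(samples, factors):
    """Treatment coding: an intercept plus one 0/1 column per non-reference level.

    `factors` is a list of (name, levels) pairs; the first level is the reference.
    """
    columns = ["Intercept"]
    for name, levels in factors:
        columns += [f"{name}_{level}_vs_{levels[0]}" for level in levels[1:]]
    rows = []
    for sample in samples:
        row = [1]
        for name, levels in factors:
            row += [1 if sample[name] == level else 0 for level in levels[1:]]
        rows.append(row)
    return columns, rows

# Hypothetical sample table with two factors.
samples = [
    {"type": "single", "condition": "untreated"},
    {"type": "single", "condition": "treated"},
    {"type": "paired", "condition": "untreated"},
    {"type": "paired", "condition": "treated"},
]
# The variable of interest (condition) is placed last, as in the design formula.
factors = [("type", ["paired", "single"]),
           ("condition", ["untreated", "treated"])]

columns, rows = design_matrix(samples, factors)
print(columns)
for row in rows:
    print(row)
```

Each sample's expected value is the intercept plus the coefficients of the columns set to 1 in its row, so the condition coefficient is estimated while the type coefficient absorbs the difference between sequencing types.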
Accounting for type It is also possible to retrieve the log2 fold changes, p values and adjusted p values of the type variable. The contrast argument of the function results takes a character vector of length three: the name of the variable, the name of the factor level for the numerator of the log2 ratio, and the name of the factor level for the denominator.