RNA-seq. Differential analysis



Data transformations

Count data transformations

To test for differential expression, we operate on raw counts and use discrete distributions. For other downstream analyses, e.g. visualization or clustering, it might be useful to work with transformed versions of the count data. The most obvious choice of transformation is the logarithm. Since count values for a gene can be zero in some conditions, some advocate the use of pseudocounts, i.e. transformations of the form $y = \log_2(n + n_0)$, where $n$ represents the count values and $n_0$ is a positive constant.

normTransform(): $y = \log_2(n + 1)$
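As a minimal sketch of the pseudocount transformation, the following Python snippet applies $y = \log_2(n + n_0)$ to a handful of counts. The function name `log2_pseudocount` and the example counts are illustrative assumptions; this is not the DESeq2 implementation, which is an R function and is not reproduced here.

```python
import math

def log2_pseudocount(counts, n0=1.0):
    """Return log2(n + n0) for every count n (n0 must be positive)."""
    if n0 <= 0:
        raise ValueError("the pseudocount n0 must be positive")
    return [math.log2(n + n0) for n in counts]

# Hypothetical counts for one gene across five samples, including a zero.
counts = [0, 1, 3, 7, 1023]
transformed = log2_pseudocount(counts, n0=1.0)
print(transformed)
```

With $n_0 = 1$ a zero count maps to 0 instead of being undefined, which is the practical reason for adding the constant.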

rlog and vst

We discuss two alternative approaches that offer more theoretical justification and a rational way of choosing the parameter equivalent to $n_0$. The regularized logarithm, or rlog, incorporates a prior on the sample differences (Love, Huber, and Anders 2014), and the other uses the concept of variance stabilizing transformations (VST) (Tibshirani 1988; Huber et al. 2003; Anders and Huber 2010). Both transformations produce transformed data on the log2 scale which has been normalized with respect to library size.

The point of these two transformations is to remove the dependence of the variance on the mean, particularly the high variance of the logarithm of count data when the mean is low. Both rlog and VST use the experiment-wide trend of variance over mean in order to transform the data so that this trend is removed. Note that we do not require or desire that all the genes have exactly the same variance after transformation. Indeed, you will see that after the transformations the genes with the same mean do not have exactly the same standard deviations, but that the experiment-wide trend has flattened. It is those genes with variance above the trend which will allow us to cluster samples into interesting groups.

Blind dispersion estimation

The two functions, rlog and vst, have an argument blind, for whether the transformation should be blind to the sample information specified by the design formula. When blind equals TRUE (the default), the functions will re-estimate the dispersions using only an intercept. This setting should be used in order to compare samples in a manner wholly unbiased by the information about experimental groups, for example to perform sample QA (quality assurance).

However, blind dispersion estimation is not the appropriate choice if one expects that many or the majority of genes will have large differences in counts which are explainable by the experimental design, and one wishes to transform the data for downstream analysis. In this case, blind dispersion estimation will lead to large estimates of dispersion, because it treats differences due to experimental design as unwanted noise, and it will shrink the transformed values too strongly towards each other. By setting blind to FALSE, the dispersions already estimated will be used to perform transformations, or if not present, they will be estimated using the current design formula.

Extracting transformed values

The assay function is used to extract the matrix of normalized values.

rlog

The function rlog, which stands for regularized log, transforms the original count data to the log2 scale by fitting a model with a term for each sample and a prior distribution on the coefficients which is estimated from the data. This is the same kind of shrinkage (sometimes referred to as regularization, or moderation) of log fold changes used by DESeq and nbinomWaldTest. The resulting data contains elements defined as:

$\log_2(q_{ij}) = \beta_{i0} + \beta_{ij}$

- $q_{ij}$: a parameter proportional to the expected true concentration of fragments for gene $i$ and sample $j$.
- $\beta_{i0}$: an intercept which does not undergo shrinkage.
- $\beta_{ij}$: the sample-specific effect, which is shrunk toward zero based on the dispersion-mean trend over the entire dataset.

VST

The closed-form expression for the variance stabilizing transformation is used by varianceStabilizingTransformation.

Effects of transformations on the variance


Data visualization

Quality assessment

Data quality assessment and quality control (i.e. the removal of insufficiently good data) are essential steps of any data analysis. These steps should typically be performed very early in the analysis of a new data set, preceding or in parallel to the differential expression testing. We define the term quality as fitness for purpose. Our purpose is the detection of differentially expressed genes, and we are looking in particular for samples whose experimental treatment suffered from an abnormality that renders the data points obtained from these particular samples detrimental to our purpose.

Heatmap of the count matrix


Heatmap of the sample-to-sample distances

PCA plot of the samples


Design of experiments

Statistical design of experiments

Statistical design of experiments is the process of planning the experiment so that appropriate data will be collected and analyzed by statistical methods, resulting in valid and objective conclusions.

Explanatory and response variables

- $X$: explanatory variables (factors)
- $Y$: response variables

Factors

- $Z$: noise factor or blocking factor
- $X$: treatment factor or design factor
- $Y$: response variables
- Levels: $X = x$
- Treatment combination or treatment: a particular combination of factor levels (e.g. $(x_1, x_2)$ if there are two treatment factors)

Three basic principles of experimental design

- Randomization
- Replication
- Blocking

Randomization

By randomization we mean that both the assignment of treatments to units and the order in which the individual runs of the experiments are to be performed are randomly determined. A completely randomized design is an experimental design in which treatments are assigned to all units by randomization.
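A completely randomized design can be sketched in a few lines of Python. The snippet below is only an illustration under stated assumptions: the unit names, the treatment labels "A" and "B", the equal replication, and the helper name `completely_randomized_design` are all hypothetical. It randomizes both the assignment of treatments to units and the run order.

```python
import random

def completely_randomized_design(units, treatments, seed=None):
    """Randomly assign treatments to units (equal replication) and shuffle the run order."""
    if len(units) % len(treatments) != 0:
        raise ValueError("number of units must be a multiple of the number of treatments")
    rng = random.Random(seed)
    replicates = len(units) // len(treatments)
    # One label per unit, each treatment repeated `replicates` times, then shuffled.
    labels = [t for t in treatments for _ in range(replicates)]
    rng.shuffle(labels)
    assignment = dict(zip(units, labels))
    # The order in which the runs are performed is randomized separately.
    run_order = list(units)
    rng.shuffle(run_order)
    return assignment, run_order

# Eight hypothetical pots, two treatments, four replicates each.
units = [f"pot{i}" for i in range(1, 9)]
assignment, run_order = completely_randomized_design(units, ["A", "B"], seed=1)
print(assignment)
print(run_order)
```

Fixing the seed makes the layout reproducible, which is useful when the randomization has to be documented.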

Replication

By replication we mean an independent repeat run of each treatment combination.

Experimental units

An entity receiving an independent application of a treatment is called an experimental unit. An experimental run is the process of applying a particular treatment combination to an experimental unit and recording its response. A replicate is an independent run carried out on a different experimental unit under the same conditions.

Example: Two pots

- Experimental unit: plant in the pot
- No replication

Example: Randomized

- Experimental unit: plant in the pot
- 4 replicates for each treatment

Blocking

Blocking is an experimental design strategy used to reduce or eliminate the variability transmitted from nuisance factors, which may influence the response variable but in which we are not directly interested. Blocking is the grouping of experimental units that have similar properties. Within each block, treatments are randomly assigned to experimental units. The resulting design is called a randomized block design. This design enables more precise estimates of the treatment effects because comparisons between treatments are made among homogeneous experimental units in each block.
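The within-block randomization described above can be sketched as follows. This is a hedged illustration, not a prescribed procedure: the block names, unit names, treatment labels, and the helper `randomized_block_design` are assumptions, and each block is assumed to hold exactly one unit per treatment.

```python
import random

def randomized_block_design(blocks, treatments, seed=None):
    """Within each block, assign every treatment to exactly one unit at random."""
    rng = random.Random(seed)
    design = {}
    for block, units in blocks.items():
        if len(units) != len(treatments):
            raise ValueError(f"block {block!r} needs one unit per treatment")
        order = list(treatments)
        rng.shuffle(order)  # independent randomization inside each block
        design[block] = dict(zip(units, order))
    return design

# Hypothetical layout: three chambers (blocks), three units per chamber.
blocks = {
    "chamber1": ["u1", "u2", "u3"],
    "chamber2": ["u4", "u5", "u6"],
    "chamber3": ["u7", "u8", "u9"],
}
design = randomized_block_design(blocks, ["Control", "Low", "High"], seed=7)
for block, layout in design.items():
    print(block, layout)
```

Because every treatment appears once in every block, differences between blocks cancel out of the treatment comparisons.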

[Diagram: blocking factor $Z$, treatment factor $X$, response $Y$]

Blocking example

Blocking removes the variation in response among chambers, allowing more precise estimates and more powerful tests of the treatment effects.

Blinding

The process of concealing information from participants and researchers about which of them receive which treatments is called blinding.

- Single-blind experiment: participants are unaware of the treatment they have been assigned. This prevents participants from responding differently according to their knowledge of their treatment.
- Double-blind experiment: researchers administering the treatments and measuring the response are also unaware of which subjects are receiving which treatment.

Factorial design

Many experiments in biology investigate more than one treatment factor, because:

1. answering two questions from a single experiment rather than just one makes more efficient use of time, supplies, and other costs
2. the factors might interact.

An experiment having a factorial design investigates all treatment combinations of two or more treatment factors. A factorial design can measure interactions between factors. An interaction between two (or more) explanatory variables means that the effect of one variable on the response depends on the state of the other variable.
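To make "all treatment combinations" and "interaction" concrete, here is a small Python sketch for a 2 x 2 factorial design. The factor names, level names, and cell means are invented purely for illustration; the interaction is computed as the difference between the effect of one factor at each level of the other.

```python
from itertools import product

# Two hypothetical treatment factors with two levels each.
factors = {"x1": ["low", "high"], "x2": ["absent", "present"]}

# A factorial design uses every combination of factor levels.
combinations = list(product(*factors.values()))
print(combinations)

# Made-up mean responses for each treatment combination (x1, x2).
cell_means = {
    ("low", "absent"): 10.0,
    ("high", "absent"): 12.0,
    ("low", "present"): 11.0,
    ("high", "present"): 17.0,
}

# Effect of x1 at each level of x2.
effect_x1_when_absent = cell_means[("high", "absent")] - cell_means[("low", "absent")]
effect_x1_when_present = cell_means[("high", "present")] - cell_means[("low", "present")]

# A non-zero difference means the effect of x1 depends on x2: an interaction.
interaction = effect_x1_when_present - effect_x1_when_absent
print(effect_x1_when_absent, effect_x1_when_present, interaction)
```

In this made-up example the effect of x1 is 2 when x2 is absent and 6 when it is present, so the two factors interact.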

[Diagram: treatment factors $X_1$ and $X_2$, response $Y$]

A unified model: general linear model

$E[y] = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1}$

Basic linear models

| Model formula | Model | Design |
|---|---|---|
| y ~ x | Linear regression | Dose-response |
| y ~ t | One-way ANOVA | Completely randomized |
| y ~ t + b | Two-way ANOVA | Randomized block |
| y ~ t1 + t2 + t1 × t2 | Two-way, fixed-effect ANOVA | Factorial design |
| y ~ t + x | ANCOVA | Observational study with one known noise factor |
| y ~ x1 + x2 + x1 × x2 | Multiple linear regression | Dose-response |

x: numerical, t: categorical treatment factor, b: categorical blocking factor
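One way to see why models with categorical factors still fit the general linear form is to write each factor as 0/1 indicator columns. The sketch below builds a design matrix for y ~ t + b in plain Python. Treatment (reference-level) coding is assumed here as one common choice, and the helper names and the tiny six-row example are hypothetical.

```python
def treatment_coding(values, reference):
    """Encode a categorical factor as 0/1 columns, one per non-reference level."""
    levels = sorted(set(values) - {reference})
    rows = [[1.0 if v == level else 0.0 for level in levels] for v in values]
    return levels, rows

def design_matrix(t, b, t_reference, b_reference):
    """Build the columns of y ~ t + b: intercept, treatment indicators, block indicators."""
    t_levels, t_rows = treatment_coding(t, t_reference)
    b_levels, b_rows = treatment_coding(b, b_reference)
    header = ["Intercept"] + [f"t:{lv}" for lv in t_levels] + [f"b:{lv}" for lv in b_levels]
    rows = [[1.0] + tr + br for tr, br in zip(t_rows, b_rows)]
    return header, rows

# Hypothetical example: three treatments observed in each of two blocks.
t = ["Control", "Low", "High", "Control", "Low", "High"]
b = ["1", "1", "1", "2", "2", "2"]
header, X = design_matrix(t, b, t_reference="Control", b_reference="1")
print(header)
for row in X:
    print(row)
```

Each column of the matrix plays the role of one $x_k$ in the general linear model, so the coefficient of an indicator column is the difference between that level and the reference level.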

Randomized complete block design

How does fish abundance affect the abundance and diversity of prey species?

Design

[Figure: design with three fish abundance treatments, Control, Low (30 fish) and High (90 fish); the figure also carries two "3 mm" labels]

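Data: Zooplankton diversity in three fish abundance treatments

| Treatment | Block 1 | Block 2 | Block 3 | Block 4 | Block 5 |
|---|---|---|---|---|---|
| Control | 4.1 | 3.2 | 3.0 | 2.3 | 2.5 |
| Low | 2.2 | 2.4 | 1.5 | 1.3 | 2.6 |
| High | 1.3 | 2.0 | 1.0 | 1.0 | 1.6 |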

Model: y ~ t + b

$y_i = \beta_0 + \beta_1 t_i + \beta_2 b_i + \varepsilon_i$

- H0: Mean zooplankton diversity is the same in every abundance treatment (y ~ b)
- H1: Mean zooplankton diversity is not the same in every abundance treatment (y ~ t + b)
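The comparison of y ~ b against y ~ t + b can be illustrated with the classical randomized block ANOVA decomposition applied to the zooplankton data. This is a minimal sketch, not the analysis from the slides: it assumes the five data columns are the blocks, the function name `rcbd_anova` is hypothetical, and no p-value is computed because that would need an F distribution from an external library.

```python
# Zooplankton diversity: one list per treatment, one entry per block (assumed).
data = {
    "Control": [4.1, 3.2, 3.0, 2.3, 2.5],
    "Low": [2.2, 2.4, 1.5, 1.3, 2.6],
    "High": [1.3, 2.0, 1.0, 1.0, 1.6],
}

def rcbd_anova(data):
    """Sums of squares and F statistic for treatments in a randomized block design."""
    treatments = list(data)
    n_t = len(treatments)
    n_b = len(data[treatments[0]])
    values = [y for t in treatments for y in data[t]]
    grand_mean = sum(values) / len(values)
    treat_means = [sum(data[t]) / n_b for t in treatments]
    block_means = [sum(data[t][j] for t in treatments) / n_t for j in range(n_b)]

    ss_total = sum((y - grand_mean) ** 2 for y in values)
    ss_treat = n_b * sum((m - grand_mean) ** 2 for m in treat_means)
    ss_block = n_t * sum((m - grand_mean) ** 2 for m in block_means)
    ss_error = ss_total - ss_treat - ss_block  # what neither t nor b explains

    df_treat = n_t - 1
    df_error = (n_t - 1) * (n_b - 1)
    f_stat = (ss_treat / df_treat) / (ss_error / df_error)
    return {"ss_treat": ss_treat, "ss_block": ss_block, "ss_error": ss_error,
            "df_treat": df_treat, "df_error": df_error, "F": f_stat}

result = rcbd_anova(data)
print(result)
```

For these data the treatment F statistic comes out at about 16.4 on 2 and 8 degrees of freedom; a large value like this favors y ~ t + b over y ~ b.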

Fitting the model to data

Adjusting for a known confounding factor

Mole rats are the only known mammals with distinct social castes.

- A single queen and a small number of males are the only reproducing individuals in a colony.
- Workers gather food, defend the colony, care for the young, and maintain the burrows.
- Two worker castes in the Damaraland mole rat:
  - Frequent workers: do almost all of the work in the colony
  - Infrequent workers: do little work except on rare occasions after rains

To assess the physiological differences between the two types of workers, researchers compared daily energy expenditures of wild mole rats during a dry season.

Known noise factor: Energy expenditure appears to vary with body mass in both groups, but infrequent workers are heavier than frequent workers.

Research question: How different is mean daily energy expenditure between the two groups when adjusted for differences in body mass?

Data


Model: y ~ t + x

- H0: Castes do not differ in energy expenditure (y ~ x)
- H1: Castes differ in energy expenditure (y ~ t + x)

Fitting the model to data

Multi-factor designs

Multiple factors

Experiments with more than one factor influencing the counts can be analyzed using design formulas that include the additional variables. In fact, DESeq2 can analyze any experimental design that can be expressed with fixed-effects terms (multiple factors, designs with interactions, designs with continuous variables, splines, and so on are all possible). By adding variables to the design, one can control for additional variation in the counts. For example, if the condition samples are balanced across experimental batches, including the batch factor in the design can increase the sensitivity for finding differences due to condition. There are multiple ways to analyze experiments when the additional variables are of interest and not just controlling factors.

Including type

Accounting for type

We can account for the different types of sequencing and get a clearer picture of the differences attributable to the treatment. As condition is the variable of interest, we put it at the end of the formula. Thus the results function will by default pull the condition results unless the contrast or name arguments are specified. Then we can rerun DESeq.


It is also possible to retrieve the log2 fold changes, p values and adjusted p values of the type variable. The contrast argument of the function results takes a character vector of length three: the name of the variable, the name of the factor level for the numerator of the log2 ratio, and the name of the factor level for the denominator.
