Lecture 21. RNA-seq: Advanced analysis

Similar documents
RNA-seq. Design of experiments

RNA-seq. Differential analysis

Experimental Studies. Statistical techniques for Experimental Data. Experimental Designs can be grouped. Experimental Designs can be grouped

Use of GEEs in STATA

investigate. educate. inform.

Experimental Design for Immunologists

Biostatistics 2 nd year Comprehensive Examination. Due: May 31 st, 2013 by 5pm. Instructions:

Analysis of Environmental Data Conceptual Foundations: En viro n m e n tal Data

How to analyze correlated and longitudinal data?

f WILEY ANOVA and ANCOVA A GLM Approach Second Edition ANDREW RUTHERFORD Staffordshire, United Kingdom Keele University School of Psychology

BIOL 458 BIOMETRY Lab 7 Multi-Factor ANOVA

Ecological Statistics

In this module I provide a few illustrations of options within lavaan for handling various situations.

Analysis of Rheumatoid Arthritis Data using Logistic Regression and Penalized Approach

9.0 L '- ---'- ---'- --' X

Measurement Error in Nonlinear Models

QA 605 WINTER QUARTER ACADEMIC YEAR

List of Figures. List of Tables. Preface to the Second Edition. Preface to the First Edition

Data Analysis Using Regression and Multilevel/Hierarchical Models

Correlation and regression

Lecture Outline. Biost 590: Statistical Consulting. Stages of Scientific Studies. Scientific Method

Today: Binomial response variable with an explanatory variable on an ordinal (rank) scale.

Detecting Anomalous Patterns of Care Using Health Insurance Claims

Biostatistics II

EPI 200C Final, June 4 th, 2009 This exam includes 24 questions.

Lec 02: Estimation & Hypothesis Testing in Animal Ecology

Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm

Where does "analysis" enter the experimental process?

Part 8 Logistic Regression

Dr. Kelly Bradley Final Exam Summer {2 points} Name

Selection and Combination of Markers for Prediction

Chapter 5: Field experimental designs in agriculture

Unit 1 Exploring and Understanding Data

Name: emergency please discuss this with the exam proctor. 6. Vanderbilt s academic honor code applies.

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014

Research Methods in Forest Sciences: Learning Diary. Yoko Lu December Research process

Intelligent Systems. Discriminative Learning. Parts marked by * are optional. WS2013/2014 Carsten Rother, Dmitrij Schlesinger

Statistics 2. RCBD Review. Agriculture Innovation Program

Generalized Estimating Equations for Depression Dose Regimes

The Practice of Statistics 1 Week 2: Relationships and Data Collection

THE USE OF NONPARAMETRIC PROPENSITY SCORE ESTIMATION WITH DATA OBTAINED USING A COMPLEX SAMPLING DESIGN

Introduction to Machine Learning. Katherine Heller Deep Learning Summer School 2018

STATISTICAL CONCLUSION VALIDITY

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES

9 research designs likely for PSYC 2100

11/24/2017. Do not imply a cause-and-effect relationship

Why Mixed Effects Models?

MTH 225: Introductory Statistics

Parameter Estimation of Cognitive Attributes using the Crossed Random- Effects Linear Logistic Test Model with PROC GLIMMIX

Midterm Exam ANSWERS Categorical Data Analysis, CHL5407H

Exercises: Differential Methylation

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

isc ove ring i Statistics sing SPSS

Testing Statistical Models to Improve Screening of Lung Cancer

Applied Analysis of Variance and Experimental Design. Lukas Meier, Seminar für Statistik

Incorporating Infecting Pathogen Counts In Vaccine Trials. Dean Follmann National Institute of Allergy and Infectious Diseases

STATISTICAL METHODS FOR DIAGNOSTIC TESTING: AN ILLUSTRATION USING A NEW METHOD FOR CANCER DETECTION XIN SUN. PhD, Kansas State University, 2012

Poisson regression. Dae-Jin Lee Basque Center for Applied Mathematics.

Chapter 2 Planning Experiments

A SAS Macro for Adaptive Regression Modeling

Logistic regression. Department of Statistics, University of South Carolina. Stat 205: Elementary Statistics for the Biological and Life Sciences

CROSS-VALIDATION IN GROUP-BASED LATENT TRAJECTORY MODELING WHEN ASSUMING A CENSORED NORMAL MODEL. Megan M. Marron

A review of statistical methods in the analysis of data arising from observer reliability studies (Part 11) *

Lecture 14: Adjusting for between- and within-cluster covariates in the analysis of clustered data May 14, 2009

IAPT: Regression. Regression analyses

A Monte Carlo Simulation Study for Comparing Power of the Most Powerful and Regular Bivariate Normality Tests

The essential focus of an experiment is to show that variance can be produced in a DV by manipulation of an IV.

Principles of Experimental Design

Principles of Experimental Design

Quantitative Methods in Managment. An introduction to GLMs and measurement theory

Reproducibility is necessary but insufficient for addressing brain science issues.

Reliability of Ordination Analyses

MS&E 226: Small Data

Certificate Courses in Biostatistics

VALIDITY OF QUANTITATIVE RESEARCH

Chapter 1: Exploring Data

Index. Springer International Publishing Switzerland 2017 T.J. Cleophas, A.H. Zwinderman, Modern Meta-Analysis, DOI /

Modelling Research Productivity Using a Generalization of the Ordered Logistic Regression Model

Syntax Menu Description Options Remarks and examples Stored results Methods and formulas References Also see

Designing Experiments... Or how many times and ways can I screw that up?!?

Clincial Biostatistics. Regression

Modeling Binary outcome

Modeling unobserved heterogeneity in Stata

Glossary From Running Randomized Evaluations: A Practical Guide, by Rachel Glennerster and Kudzai Takavarasha

Biostatistics for Med Students. Lecture 1

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Midterm, 2016

Causal Research Design- Experimentation

What you should know before you collect data. BAE 815 (Fall 2017) Dr. Zifei Liu

Study Guide #2: MULTIPLE REGRESSION in education

Package inctools. January 3, 2018

Estimating drug effects in the presence of placebo response: Causal inference using growth mixture modeling

Logistic Regression Predicting the Chances of Coronary Heart Disease. Multivariate Solutions

CLASSICAL AND. MODERN REGRESSION WITH APPLICATIONS

Intro to SPSS. Using SPSS through WebFAS

Package speff2trial. February 20, 2015

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES

Chapter 2 Organizing and Summarizing Data. Chapter 3 Numerically Summarizing Data. Chapter 4 Describing the Relation between Two Variables

Structural Equation Modeling (SEM)

ANCOVA with Regression Homogeneity

Methodology for Non-Randomized Clinical Trials: Propensity Score Analysis Dan Conroy, Ph.D., inventiv Health, Burlington, MA

Transcription:

Lecture 21 RNA-seq: Advanced analysis

Experimental design

Introduction An experiment is a process or study that results in the collection of data. Statistical experiments are conducted in situations in which researchers can manipulate the conditions of the experiment and can control the factors that are irrelevant to the research objectives.

Statistical design of experiments Experimental design is the process of planning a study to meet specified objectives. Planning an experiment properly is very important in order to ensure that the right type of data and a sufficient sample size and power are available to answer the research questions of interest as clearly and efficiently as possible.

Designing an experiment Perform the following steps when designing an experiment: 1. Define the problem and the questions to be addressed 2. Define the population of interest 3. Determine the need for sampling 4. Define the experimental design

Define problem Before data collection begins, specific questions that the researcher plans to examine must be clearly identified. In addition, a researcher should identify the sources of variability in the experimental conditions. One of the main goals of a designed experiment is to partition the effects of the sources of variability into distinct components in order to examine specific questions of interest. The objective of designed experiments is to improve the precision of the results in order to examine the research hypotheses.

Define population A population is a collective whole of people, animals, plants, or other items that researchers collect data from. Before collecting any data, it is important that researchers clearly define the population, including a description of the members. The designed experiment should designate the population for which the problem will be examined. The entire population for which the researcher wants to draw conclusions will be the focus of the experiment.

Determine the need for sampling A sample is one of many possible sub-sets of units that are selected from the population of interest. In many data collection studies, the population of interest is assumed to be much larger in size than the sample so, potentially, there are a very large (usually considered infinite) number of possible samples. The results from a sample are then used to draw valid inferences about the population.

Determine the need for sampling A random sample is a sub-set of units that are selected randomly from a population. A random sample represents the general population or the conditions that are selected for the experiment because the population of interest is too large to study in its entirety. Using techniques such as random selection after stratification or blocking is often preferred.

Determine the need for sampling Determining the sample size requires some knowledge of the observed or expected variance among sample members in addition to how large a difference among treatments you want to be able to detect. Another way to describe this aspect of the design stage is to conduct a prospective power analysis, which is a brief statement about the capability of an analysis to detect a practical difference. A power analysis is essential so that the data collection plan will work to enhance the statistical tests primarily by reducing residual variation, which is one of the key components of a power analysis study.

Define experimental design Defining the experimental design consists of the following steps: 1. Identify the experimental unit. 2. Identify the types of variables. 3. Define the treatment structure. 4. Define the design structure.

Experimental units An Experimental (or sampling) unit is the person or object that will be studied by the researcher. This is the smallest unit of analysis in the experiment from which data will be collected (e.g. patient, mouse, plant, or cell line).

Experimental units An entity receiving an independent application of a treatment is called an experimental unit. An experimental run is the process of applying a particular treatment combination to an experimental unit and recording its response. A replicate is an independent run carried out on a different experimental unit under the same conditions.

Example: Two pots Experimental unit: plant on the pot No replication

Types of variables A data collection plan considers how four important variables: background, constant, uncontrollable, and primary, fit into the study. The explanatory variables are referred to as factors. Inconclusive results are likely to result if any of these classifications are not adequately defined. It is important to consider all the relevant variables before the final data collection plan is approved in order to maximize confidence in the final results.

Background variables Background variables can be identified and measured yet cannot be controlled; they will influence the outcome of an experiment. Background variables will be treated as covariates in the model rather than primary variables.

Primary variables Primary variables are the variables of interest to the researcher. Primary variables are independent variables that are possible sources of variation in the response. These variables comprise the treatment and design structures and are referred to as factors. When background variables are used in an analysis, better estimates of the primary variables should result because the sources of variation that are supplied by the covariates have been removed.

Constant variables Constant variables can be controlled or measured but, for some reason, will be held constant over the duration of the study. This action increases the validity of the results by reducing extraneous sources of variation from entering the data. For this data collection plan, some of the variables that will be held constant include: the use of standard operating procedures the use of one operator for each measuring device all measurements taken at specific times and locations

Uncontrollable variables Uncontrollable variables are those variables that are known to exist, but conditions prevent them from being manipulated, or it is very difficult (due to cost or physical constraints) to measure them. The experimental error is due to the influential effects of uncontrollable variables, which will result in less precise evaluations of the effects of the primary and background variables. The design of the experiment should eliminate or control these types of variables as much as possible in order to increase confidence in the final results.

Explanatory and response variables XX YY - Explanatory variables - Factors - Response variables

Factors - Noise factor - Blocking factor ZZ Treatment factor or design factor XX YY Response variables Levels: XX = xx Treatment combination or treatment: a particular combination of factor levels (e.g. xx 1, xx 2 if there are two treatment factors)

Levels The levels of the primary factors represent the range of the inference space relative to a study. The levels of the primary factors can represent the entire range of possibilities or a random sub-set. It is also important to recognize and define when combinations of levels of two or more treatment factors are illogical or unlikely to exist.

Fixed effects Fixed effects treatment factors are usually considered to be "fixed" in the sense that all levels of interest are included in the study because they are selected by some non-random process, they consist of the whole population of possible levels, or other levels were not feasible to consider as part of the study. The fixed effects represent the levels of a set of precise hypotheses of interest in the research. A fixed factor can have only a small number of inherent levels; for example, the only relevant levels for gender are male and female. A factor should also be considered fixed when only certain values of it are of interest, even though other levels might exist. Treatment factors can also be considered "fixed" as opposed to "random" because they are the only levels about which you would want to make inferences.

Three basic principles of experimental design Replication Randomization Blocking

Replication By replication we mean an independent repeat run of each treatment combination. Replication is essential for estimating experimental error. If a treatment condition appears more than one time, it is defined to be replicated. Misconceptions about the number of replications have often occurred in experiments where sub-samples or repeated observations on a unit have been mistaken as additional experimental units.

Randomization By randomization we mean that both the assignment of treatments to units and the order in which the individual runs of the experiments are to be performed are randomly determined. A completely randomized design is an experimental design in which treatments are assigned to all units by randomization.

Example: Randomized Experimental unit: plant on the pot 4 replicates for each treatment

Blocking Most experimental designs require experimental units to be allocated to treatments either randomly or randomly with constraints, as in blocked designs. Blocks are groups of experimental units that are formed to be as homogeneous as possible with respect to the block characteristics. The term block comes from the agricultural heritage of experimental design where a large block of land was selected for the various treatments, that had uniform soil, drainage, sunlight, and other important physical characteristics. Homogeneous clusters improve the comparison of treatments by randomly allocating levels of the treatments within each block.

Blocking Blocking is an experimental design strategy used to reduce or eliminate the variability transmitted from nuisance factors, which may influence the response variable but in which we are not directly interested. Blocking is the grouping of experimental units that have similar properties. Within each block, treatments are randomly assigned to experimental units. The resulting design is called a randomized block design. This design enables more precise estimates of the treatment effects because comparisons between treatments are made among homogeneous experimental units in each block.

Blocking ZZ XX YY

Blocking example Blocking removes the variation in response among chambers, allowing more precise estimates and more powerful tests of the treatment effects.

Design structure The design structure consists of those factors that define the blocking of the experimental units into clusters. The types of commonly used design structures: Completely randomized design Randomized complete block design Factorial design

Completely randomized design Subjects are assigned to treatments completely at random.

Randomized complete block design Subjects are divided into b blocks according to demographic characteristics. Subjects in each block are then randomly assigned to treatments so that all treatment levels appear in each block.

Factorial design Many experiments in biology investigate more than one treatment factor, because: 1. answering two questions from a single experiment rather than just one makes more efficient use of time, supplies, and other costs 2. the factors might interact.

Factorial design An experiment having a factorial design investigates all treatment combinations of two or more treatment factors. A factorial design can measure interactions between factors. An interaction between two (or more) explanatory variables means that the effect of one variable on the response depends on the state of the other variable.

Factorial design XX 2 XX 1 YY

Analyzing data

Regression Regression is a method that 1. predicts the average values of a response variable from values of explanatory variables (focusing on regression to represent relationships between variables) 2. summarizes how the average values of a response variable vary over subpopulations defined by functions of explanatory variables (focusing on regression as a comparison of average outcomes)

Ordinary linear regression yy ii = ββ 0 + ββ 1 xx iii + + ββ pp 1 xx iiii 1 + εε ii, εε ii NN 0, σσ 2, ii = 1,, n EE[yy ii ] = ββ 0 + ββ 1 xx iii + + ββ pp 1 xx iiii 1, ii = 1,, n EE[yy XX] = XXXX

Three components of the GLMs 1. Random component: yy = yy 1,, yy nn T and its probability distribution. The observations yy ii are treated as independent 2. Systematic component (or linear predictor): XXXX, where XX is a nn pp model matrix and ββ = ββ 0,, ββ pp 1 T 3. Link function: a function gg applied to each component of EE[yy] that relates it to the linear predictor, gg(ee yy ) = XXXX

Random component of a GLM The random component of a GLM consists of a response variable yy with independent observations yy = yy 1,, yy nn T having probability density or mass function for a distribution in the exponential family. By restricting GLMs to exponential family distributions, we obtain: 1. General expressions for the model likelihood equations 2. Asymptotic distributions of estimators for model parameters 3. An algorithm for fitting the models.

Systematic component of a GLM The linear predictor of a GLM relates parameters {ηη ii } pertaining to {EE yy ii } to the explanatory variables using a linear combination of them ηη ii = pp 1 jj=0 ββ jj xx iiii where xx ii0 = 1. This expression is linear in the parameters. In matrix form, we express the linear predictor as ηη = XXXX where ηη = ηη 1,, ηη nn T, ββ = ββ 0,, ββ pp 1 T is the column vector of model parameters, and XX is the nn pp matrix of explanatory variable values. The matrix XX is called the model matrix or design matrix.

Link function of a GLM The link function connects the random component with the linear predictor. Let μμ ii = EE yy ii. The GLM links ηη ii to μμ ii by ηη ii = gg(μμ ii ), where the link function gg is a monotonic, differentiable function: gg μμ ii = pp 1 jj=0 ββ jj xx iiii

Linear model: a GLM with identity link function The link function gg μμ ii = μμ ii is called the identity link function. A GLM that uses the identity link function is called a linear model. This GLM has pp 1 μμ ii = ββ jj xx iiii jj=0 The standard version of the linear model, which we refer to as the ordinary linear model, assumes that the observations have constant variance, called homoscedasticity: yy ii = pp 1 jj=0 ββ jj xx iiii + εε ii, where EE εε ii = 0 and VVVVVV εε ii = σσ 2. The ordinary normal linear model further assumes that εε ii NN(0, σσ 2 ).

Important GLMs Random component Systematic component Link function Model Normal Continuous Identity Regression Normal Categorical Identity ANOVA Normal Mixed Identity ANCOVA Binomial Mixed Logit Logistic regression Binomial Mixed Probit and others Binary regression Multinomial Mixed Generalized logit Multinomial response Poisson Mixed Log Poisson loglinear

Example: RNA-seq

Multiple factors Experiments with more than one factor influencing the counts can be analyzed using design formula that include the additional variables. In fact, DESeq2 can analyze any possible experimental design that can be expressed with fixed effects terms (multiple factors, designs with interactions, designs with continuous variables, splines, and so on are all possible). By adding variables to the design, one can control for additional variation in the counts. For example, if the condition samples are balanced across experimental batches, by including the batch factor to the design, one can increase the sensitivity for finding differences due to condition. There are multiple ways to analyze experiments when the additional variables are of interest and not just controlling factors.

Including type

Accounting for type We can account for the different types of sequencing, and get a clearer picture of the differences attributable to the treatment. As condition is the variable of interest, we put it at the end of the formula. Thus the results function will by default pull the condition results unless contrast or name arguments are specified. Then we can rerun DESeq.

Accounting for type

Accounting for type

Accounting for type

Accounting for type It is also possible to retrieve the log2 fold changes, p values and adjusted p values of the type variable. The contrast argument of the function results takes a character vector of length three: the name of the variable, the name of the factor level for the numerator of the log2 ratio, and the name of the factor level for the denominator.

Accounting for type

Gene Ontology

Annotating and exporting results Our result table only contains information about Ensembl gene IDs, but alternative gene names may be more informative for collaborators. Bioconductor s annotation packages help with mapping various ID schemes to each other.

Annotating and exporting results

Annotating and exporting results

Running topgo

Running topgo

Running topgo

Running topgo

Downregulated GO

Upregulated GO

Published results The top 10 most significant terms are shown for downregulated (D) and upregulated (E) genes, respectively.