RNA-seq: filtering, quality control and visualisation. COMBINE RNA-seq Workshop

Similar documents
Supplement to SCnorm: robust normalization of single-cell RNA-seq data

Tutorial: RNA-Seq Analysis Part II: Non-Specific Matches and Expression Measures

Nature Genetics: doi: /ng Supplementary Figure 1. Clinical timeline for the discovery WES cases.

Predicting clinical outcomes in neuroblastoma with genomic data integration

Exercises: Differential Methylation

Elms, Hayes, Shelburne 1

RNA-seq. Differential analysis

Supplementary Figures and Tables

Supplemental Figure S1. Expression of Cirbp mrna in mouse tissues and NIH3T3 cells.

Unit 1 Exploring and Understanding Data

SUPPLEMENTARY APPENDIX

Small RNAs and how to analyze them using sequencing

Clustering mass spectrometry data using order statistics

ChIP-seq hands-on. Iros Barozzi, Campus IFOM-IEO (Milan) Saverio Minucci, Gioacchino Natoli Labs

Author's response to reviews

RNA-Seq Preparation Comparision Summary: Lexogen, Standard, NEB

Supplementary Materials Extracting a Cellular Hierarchy from High-dimensional Cytometry Data with SPADE

Understandable Statistics

Exercise Verify that the term on the left of the equation showing the decomposition of "total" deviation in a two-factor experiment.

Statistics 202: Data Mining. c Jonathan Taylor. Final review Based in part on slides from textbook, slides of Susan Holmes.

Statistics: A Brief Overview Part I. Katherine Shaver, M.S. Biostatistician Carilion Clinic

APPENDIX D REFERENCE AND PREDICTIVE VALUES FOR PEAK EXPIRATORY FLOW RATE (PEFR)

Framework for Defining Evidentiary Standards for Biomarker Qualification: Minimal Residual Disease (MRD) in Multiple Myeloma (MM)

Abstract. Optimization strategy of Copy Number Variant calling using Multiplicom solutions APPLICATION NOTE. Introduction

Mostly Harmless Simulations? On the Internal Validity of Empirical Monte Carlo Studies

STATISTICAL METHODS FOR DIAGNOSTIC TESTING: AN ILLUSTRATION USING A NEW METHOD FOR CANCER DETECTION XIN SUN. PhD, Kansas State University, 2012

Peak-calling for ChIP-seq and ATAC-seq

Technical note on seasonal adjustment for Gross domestic product (Expenditure)

Knowledge Discovery and Data Mining I

High-Throughput Sequencing Course

Discussion Meeting for MCP-Mod Qualification Opinion Request. Novartis 10 July 2013 EMA, London, UK

Discovery of Novel Human Gene Regulatory Modules from Gene Co-expression and

User Guide. Association analysis. Input

A Statistical Framework for Classification of Tumor Type from microrna Data

Basic Statistics 01. Describing Data. Special Program: Pre-training 1

Supplementary Figures

CHAPTER 9: Producing Data: Experiments

Digitizing the Proteomes From Big Tissue Biobanks

DeconRNASeq: A Statistical Framework for Deconvolution of Heterogeneous Tissue Samples Based on mrna-seq data

Experimental Design For Microarray Experiments. Robert Gentleman, Denise Scholtens Arden Miller, Sandrine Dudoit

Catherine A. Welch 1*, Séverine Sabia 1,2, Eric Brunner 1, Mika Kivimäki 1 and Martin J. Shipley 1

Nature Structural & Molecular Biology: doi: /nsmb.2419

C-1: Variables which are measured on a continuous scale are described in terms of three key characteristics central tendency, variability, and shape.

Supplementary Figure 1: Experimental design. DISCOVERY PHASE VALIDATION PHASE (N = 88) (N = 20) Healthy = 20. Healthy = 6. Endometriosis = 33

Types of Data. Systematic Reviews: Data Synthesis Professor Jodie Dodd 4/12/2014. Acknowledgements: Emily Bain Australasian Cochrane Centre

Profiling of the Exosomal Cargo of Bovine Milk Reveals the Presence of Immune- and Growthmodulatory Non-coding RNAs (ncrna)

Package AbsFilterGSEA

RNA-seq Introduction

Exploring chromatin regulation by ChIP-Sequencing

# Assessment of gene expression levels between several cell group types is a common application of the unsupervised technique.

DNA Sequence Bioinformatics Analysis with the Galaxy Platform

International Journal of Research in Science and Technology. (IJRST) 2018, Vol. No. 8, Issue No. IV, Oct-Dec e-issn: , p-issn: X

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Midterm, 2016

Supplementary information for: Human micrornas co-silence in well-separated groups and have different essentialities

A novel and universal method for microrna RT-qPCR data normalization

Predominant contribution of cis-regulatory divergence in the evolution of mouse alternative splicing

7SK ChIRP-seq is specifically RNA dependent and conserved between mice and humans.

Supplemental Figure S1. PLAG1 kidneys contain fewer glomeruli (A) Quantitative PCR for Igf2 and PLAG1 in whole kidneys taken from mice at E15.

Identification of Tissue Independent Cancer Driver Genes

AVENIO family of NGS oncology assays ctdna and Tumor Tissue Analysis Kits

Results & Statistics: Description and Correlation. I. Scales of Measurement A Review

Georgina Salas. Topics EDCI Intro to Research Dr. A.J. Herrera

Sensory Cue Integration

Chapter 20: Test Administration and Interpretation

Observational studies; descriptive statistics

Appendix B1. Details of Sampling and Weighting Procedures

Effective Implementation of Bayesian Adaptive Randomization in Early Phase Clinical Development. Pantelis Vlachos.

On the Possible Pitfalls in the Evaluation of Brain Computer Interface Mice

4. Model evaluation & selection

Tutorial 3: MANOVA. Pekka Malo 30E00500 Quantitative Empirical Research Spring 2016

MULTIPLE REGRESSION OF CPS DATA

Methods for assessing consistency in Mixed Treatment Comparison Meta-analysis

Serum mirna signature diagnoses and discriminates murine colitis subtypes and predicts ulcerative colitis in humans

Human breast milk mirna, maternal probiotic supplementation and atopic dermatitis in offsrping

War and Relatedness Enrico Spolaore and Romain Wacziarg September 2015

Nature Neuroscience: doi: /nn Supplementary Figure 1. Behavioral training.

Bayesian Inference for Single-cell ClUstering and ImpuTing (BISCUIT) Elham Azizi

Discontinuous Traits. Chapter 22. Quantitative Traits. Types of Quantitative Traits. Few, distinct phenotypes. Also called discrete characters

QUANTIFYING THE EFFECT OF SETTING QUALITY CONTROL STANDARD DEVIATIONS GREATER THAN ACTUAL STANDARD DEVIATIONS ON WESTGARD RULES

Lecture 8 Understanding Transcription RNA-seq analysis. Foundations of Computational Systems Biology David K. Gifford

World Congress on Breast Cancer

Chapter 1: Explaining Behavior

BIMM 143. RNA sequencing overview. Genome Informatics II. Barry Grant. Lecture In vivo. In vitro.

Introduction to SPSS. Katie Handwerger Why n How February 19, 2009

Overview Detection epileptic seizures

NEW METHODS FOR SENSITIVITY TESTS OF EXPLOSIVE DEVICES

TEB. Id4 p63 DAPI Merge. Id4 CK8 DAPI Merge

Diurnal Pattern of Reaction Time: Statistical analysis

NGS ONCOPANELS: FDA S PERSPECTIVE

RNA-seq. Design of experiments

omiras: MicroRNA regulation of gene expression

Comments on Significance of candidate cancer genes as assessed by the CaMP score by Parmigiani et al.

DiffVar: a new method for detecting differential variability with application to methylation in cancer and aging

Dr. Allen Back. Sep. 30, 2016

Stats 95. Statistical analysis without compelling presentation is annoying at best and catastrophic at worst. From raw numbers to meaningful pictures

Iso-Seq Method Updates and Target Enrichment Without Amplification for SMRT Sequencing

SUPPLEMENTAL INFORMATION

genomics for systems biology / ISB2020 RNA sequencing (RNA-seq)

Chapter 11: Advanced Remedial Measures. Weighted Least Squares (WLS)

MicroRNA roles in signaling during lactation: an insight from differential expression, time course and pathway analyses of deep sequence data

Transcription:

RNA-seq: filtering, quality control and visualisation COMBINE RNA-seq Workshop

QC and visualisation (part 1)

Slide taken from COMBINE RNAseq workshop on 23/09/2016 RNA-seq of Mouse mammary gland Basal cells Virgin Pregnant Lactating n=2 n=2 n=2 Luminal cells Virgin Pregnant Lactating n=2 n=2 n=2 Fu et al. (2015) EGF-mediated induction of Mcl-1 at the switch to lactation is essential for alveolar cell survival Nat Cell Biol

Slide taken from COMBINE RNAseq workshop on 23/09/2016 (some) questions we can ask Which genes are differentially expressed between basal and luminal cells? between basal and luminal in virgin mice? between pregnant and lactating mice? between pregnant and lactating mice in basal cells?

Reading in the data counts data and sample information Formatting the data clean it up so we can look at it easily

Filtering out lowly expressed genes Genes with very low counts in all samples provide little evidence for differential expression Often samples have many genes with zero or very low counts A. Raw data B. Filtered data Density 0.20 0.15 0.10 0.05 10_6_5_11 9_6_5_11 purep53 JMS8 2 JMS8 3 JMS8 4 JMS8 5 JMS9 P7c JMS9 P8c Density 0.20 0.15 0.10 0.05 10_6_5_11 9_6_5_11 purep53 JMS8 2 JMS8 3 JMS8 4 JMS8 5 JMS9 P7c JMS9 P8c 0.00 0.00 10 5 0 5 10 15 5 0 5 10 15 Log cpm Log cpm

Filtering out lowly expressed genes Testing for differential expression for many genes simultaneously adds to the multiple testing burden, reducing the power to detect DE genes. IT IS VERY IMPORTANT to filter out genes that have all zero counts or very low counts. We filter using CPM values rather than counts because they account for differences in sequencing depth between samples.

Filtering out lowly expressed genes CPM = counts per million, or how many counts would I get for a gene if the sample had a library size of 1M. For a given gene: Library size Count CPM 1M 1 1 10M 10 1 20M 10 0.5

Filtering out lowly expressed genes Use a CPM threshold to define expressed and unexpressed As a general rule, a good threshold can be chosen for a CPM value that corresponds to a count of 10. In our dataset, the samples have library sizes of 20 to 20 something million. Library size Count CPM 1M 1 1 10M 10 1 20M 10 0.5

Filtering out lowly expressed genes Use a CPM threshold to define expressed and unexpressed As a general rule, a good threshold can be chosen for a CPM value that corresponds to a count of 10. In our dataset, the samples have library sizes of 20 to 20 something million. Library size Count We CPM use a CPM 1M 1 threshold of 0.5! 1 10M 10 1 20M 10 0.5

Filtering out lowly expressed genes Use a CPM threshold to define expressed and unexpressed As a general rule, a good threshold can threshold be chosen of 1 works for a well in most cases. CPM value that corresponds to a count of 10. In our dataset, the samples have library sizes of 20 to 20 something million. Library size Count We CPM use a CPM 1M 1 threshold of 0.5! 1 10M 10 1 But if this is too hard to work out, a CPM 20M 10 0.5

Filtering out lowly expressed genes We keep any gene that is (roughly) expressed in at least one group. 12 samples, 6 groups, 2 replicates in each group. Keep if CPM > 0.5 in at least 2 out of 12 samples Basal cells Virgin Pregnant Lactating Luminal cells Virgin Pregnant Lactating

Filtering out lowly expressed genes We keep any gene that is (roughly) expressed in at least one group. 12 samples, 6 groups, 2 replicates in each group. Keep if CPM > 0.5 in at least 2 out of 12 samples Basal cells Virgin expressed Pregnant expressed Lactating expressed Luminal cells Virgin unexpressed Pregnant unexpressed Lactating unexpressed

Filtering out lowly expressed genes We keep any gene that is (roughly) expressed in at least one group. 12 samples, 6 groups, 2 replicates in each group. Keep if CPM > 0.5 in at least 2 out of 12 samples Basal cells Virgin unexpressed Pregnant expressed Lactating unexpressed Luminal cells Virgin unexpressed Pregnant unexpressed Lactating unexpressed

Filtering out lowly expressed genes We keep any gene that is (roughly) expressed in at least one group. 12 samples, 6 groups, 2 replicates in each group. Keep gene if CPM > 0.5 in at least 2 or more samples Basal cells Virgin unexpressed Pregnant expressed Lactating unexpressed Luminal cells Virgin unexpressed Pregnant unexpressed Lactating unexpressed

QC and visualisation (part 2)

MDS Plots A visualisation of a principle components analysis which looks at where the greatest sources of variation in the data come from. Distances represents the typical log2-fc observed between each pair of samples e.g. 6 units apart = 2^6 = 64-fold difference Unsupervised separation based on data, no prior knowledge of experimental design. Useful for an overview of the data. Do samples separate by experimental groups? Quality control Outliers?

QC and visualisation (part 3)

Normalisation for composition bias 10_6_5_11 9_6_5_11 purep53 JMS8 2 JMS8 3 JMS8 4 JMS8 5 JMS9 P7c JMS9 P8c 5 0 5 10 15 A. Example: Unnormalised data Log cpm 10_6_5_11 9_6_5_11 purep53 JMS8 2 JMS8 3 JMS8 4 JMS8 5 JMS9 P7c JMS9 P8c 5 0 5 10 15 B. Example: Normalised data Log cpm If we ran a DE analysis on Sample 1 and Sample 3, almost all genes will be downregulated in Sample 1!!

Normalisation for composition bias 10_6_5_11 9_6_5_11 purep53 JMS8 2 JMS8 3 JMS8 4 JMS8 5 JMS9 P7c JMS9 P8c 5 0 5 10 15 A. Example: Unnormalised data Log cpm 10_6_5_11 9_6_5_11 purep53 JMS8 2 JMS8 3 JMS8 4 JMS8 5 JMS9 P7c JMS9 P8c 5 0 5 10 15 B. Example: Normalised data Log cpm

Normalisation for composition bias TMM normalisation (Robinson and Oshlack, 2010) How do we make the expression of all the genes go UP in the one sample? Scaling factors E.g. scale library size by 0.1 so effective library size is 1M. Library size Count CPM 10M 10 1 1M 10 10 Scaling factor <1 makes the CPM larger.

Voom

Variance of log-cpm depends on mean of log-cpm SEQC (tech var only) Mouse (low bio var) Simulations (mod bio var) Nigerian (high bio var) D melanogaster (systematically dif) Average log 2 (count + 0.5) 25

RNA-seqdata is discrete has non-constant mean-variance trend Voom Transform to log-counts per million Remove mean-var dependence through the use of precision weights Normal dist. assumes that the data is continuous has constant variance

Variance weights Obtain variance estimates for each observation using mean-var trend. Assign inverse variance weights to each observation. Weights remove mean-variance trend from the data. Before: trended variance After: constant variance Qtr-root variance Log2- variance Mean log-count Mean log-cpm