Supplement to SCnorm: robust normalization of single-cell RNA-seq data

Supplementary Note 1: SCnorm does not require spike-ins, because we find that the performance of spike-ins in scRNA-seq is often compromised, and many labs do not use them for normalization (ref. 2). Specifically, spike-ins are not routinely representative of the full range of expression, show substantial bias, and are often spiked in at much higher concentrations than targeted (Supplementary Figures 14-15). However, if good spike-ins are available, the performance of SCnorm may be improved in the post-normalization scaling step, which is required when multiple conditions are available. Since spike-ins are added in equal concentrations and are biologically inactive, between-condition scale factors can be computed over the spike-ins alone, as detailed in Methods. Since good spike-ins are expected to be equivalently expressed (not DE) between conditions, we expect this approach to be more accurate than using the full set of target genes in the rescaling step, especially when the overall proportion of DE genes is very high (e.g. over 50%).

Supplementary Note 2: While quantile regression proved to be more flexible and more robust to outliers than a generalized linear model based approach, we recognize that the non-linear log transformation introduces a bias in the count-depth relationship for small counts. Consequently, in addition to quantile regression, count-depth relationships are also assessed on untransformed data using negative binomial generalized linear model regression, fitted with the glm.nb function in R.

Supplementary Note 3: Like other methods for normalization (refs. 3-7), SCnorm leaves zeros unchanged. Consequently, the goal of SCnorm is to remove the effect of sequencing depth (and perhaps gene-specific features) among the non-zero counts. To do so, the count-depth relationship must be estimated prior to adjustment using only non-zero count data (Supplementary Figure 12).

MAST is commonly used to identify DE genes in scRNA-seq data. The user has the option to test for DE on the non-zero count data (continuous component), on the zeros (discrete component), or on both, which combines evidence from the continuous and discrete tests. A recent method, scDD (ref. 8), is similar in that tests for zeros and non-zeros are conducted separately. When two biological conditions are being compared, SCnorm rescales the normalized estimates so that the two conditions have similar means overall among the non-zero counts. Other normalization methods provide normalized estimates of expression that have similar means among all counts, which is problematic for fold-change calculations and DE testing (as shown in Figure 2). See Supplementary Figure 13 for further detail.
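
The contrast drawn in Supplementary Note 3 can be illustrated with a toy calculation. The Python sketch below is not SCnorm's rescaling procedure (that is detailed in Methods). The function names and the counts are hypothetical, chosen only to show why matching means over all counts can pull apart the non-zero means of an EE gene when the proportion of zeros differs between conditions.

```python
from statistics import mean


def nonzero_mean(counts):
    """Mean of the strictly positive counts."""
    return mean(c for c in counts if c > 0)


def global_scale(counts_a, counts_b):
    """Rescale condition B so its mean over ALL counts (zeros included) matches A."""
    factor = mean(counts_a) / mean(counts_b)
    return [c * factor for c in counts_b]


def nonzero_scale(counts_a, counts_b):
    """Rescale condition B so its mean over NON-ZERO counts matches A.

    Zeros stay zero because they are multiplied by the same factor.
    """
    factor = nonzero_mean(counts_a) / nonzero_mean(counts_b)
    return [c * factor for c in counts_b]


# Hypothetical EE gene: condition B is sequenced twice as deeply as A
# and happens to contain fewer zeros.
cond_a = [0, 0, 10, 10]
cond_b = [0, 20, 20, 20]

b_global = global_scale(cond_a, cond_b)
b_nonzero = nonzero_scale(cond_a, cond_b)

print("non-zero mean, A:", nonzero_mean(cond_a))
print("non-zero mean, B after global scaling:", nonzero_mean(b_global))
print("non-zero mean, B after non-zero scaling:", nonzero_mean(b_nonzero))
```

In this toy example the non-zero counts differ only by the two-fold difference in depth. Global scaling leaves condition B with a non-zero mean of about 6.67 against 10 in condition A, whereas scaling on the non-zero counts gives 10 in both.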

Supplementary Figure 1: Estimated count-depth relationships in bulk and single-cell datasets before and after normalization. Results are structurally identical to those shown in Figure 1, but using negative binomial generalized linear regression instead of median quantile regression to calculate gene-specific slopes.

Supplementary Figure 2: Fold-changes and ROC curves for SIM I. For each simulated dataset, genes are divided into four equally sized groups based on their median expression among non-zero un-normalized measurements. In each group, the gene-specific difference between estimated fold-change and true fold-change is calculated for SCnorm, SCnorm.SI, MR, TPM, and scran. Boxplots of these estimates are shown in panel (a) for 100 simulations of SIM I with K=1. Panel (b) shows ROC curves for detection of differentially expressed (DE) genes for 100 simulations of SIM I with K=1. The solid line is the average over the 100 iterations, and the dashed lines represent ROC curves for five randomly chosen iterations. Panels (c) and (d) are structurally identical, for SIM I with K=4.
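
The grouping of genes by median non-zero expression, used in this caption and in several later ones, can be sketched as follows. This is a minimal illustration in Python, not the authors' code. The function names are hypothetical, and two choices are assumptions made here: genes with no non-zero counts are skipped, and ties are broken by input order.

```python
from statistics import median


def nonzero_median(counts):
    """Median of the strictly positive counts, or None if there are none."""
    nonzero = [c for c in counts if c > 0]
    return median(nonzero) if nonzero else None


def expression_groups(expr_by_gene, n_groups=4):
    """Map each gene to a group index (0 = lowest expression).

    Genes are ranked by their median non-zero un-normalized expression
    and split into n_groups groups of (nearly) equal size.
    Assumption: genes whose counts are all zero are left out.
    """
    medians = {gene: nonzero_median(c) for gene, c in expr_by_gene.items()}
    ranked = sorted(
        (gene for gene, m in medians.items() if m is not None),
        key=lambda gene: medians[gene],
    )
    n = len(ranked)
    return {gene: rank * n_groups // n for rank, gene in enumerate(ranked)}


# Hypothetical counts for five genes across three cells.
toy_expr = {
    "g1": [0, 1, 1],
    "g2": [5, 0, 7],
    "g3": [100, 120, 0],
    "g4": [30, 0, 40],
    "g5": [0, 0, 0],
}
toy_groups = expression_groups(toy_expr, n_groups=2)
print(toy_groups)
```

Setting `n_groups=10` gives the ten-group split used for the count-depth density plots.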

Supplementary Figure 3: Fold-changes and ROC curves for SIM II. For each scenario in SIM II, panels (a)-(d) show the gene-specific difference calculated between estimated fold-change of non-zero counts and true fold-change for 100 simulations. Boxplots of the averages are shown for data normalized by SCnorm, TPM and scran. MR cannot be evaluated in these simulations as each gene contains at least one zero and so no genes pass the MR filter. Motivation for considering non-zero counts to calculate fold-change is discussed in Supplementary Note 3. Panels (e)-(h) are structurally identical to (a)-(d), but with fold-changes calculated with zeros included. Panels (i)-(k) show ROC curves for detection of differentially expressed (DE) genes for 100 simulations of SIM II, scenario 2 (panel (a)), 3 (panel (b)), and 4 (panel (c)) for data normalized by SCnorm, TPM, and scran. The solid line is the average over the 100 iterations, and the dashed lines represent ROC curves for five randomly chosen iterations.

Supplementary Figure 4: Fold-changes and DE genes calculated from the H9 case study data. For each gene, the fold-change of non-zero counts between the H9-4M and H9-1M groups was computed for data following normalization via SCnorm, MR, TPM, scran, SCDE, and BASiCS. Boxplots of gene-specific fold-changes are shown in panel (a) for data normalized by each method. The number of genes identified as DE using MAST is shown in panel (b). Genes are divided into four equally sized expression groups based on their median among non-zero un-normalized expression measurements and results are shown as a function of expression group. Motivation for considering non-zero counts to calculate fold-change is discussed in Supplementary Note 3.

Supplementary Figure 5: ROC curves for a comparison of S vs. G2/M in the H1-FUCCI data. For this evaluation, we subsampled cells from the S and G2/M H1-FUCCI case study data. For the subsampled cells, there are negligible differences in cellular detection rates (CDRs) between the two conditions and there is on average a 1.5-fold increase in sequencing depth (details in Methods). Without differences in CDR, we would expect an EE gene expressed at level x in S to be expressed at level 1.5*x in G2/M. Given this, we define a gold standard DE list to be those genes showing a fold-change greater than a threshold (or smaller than one over that threshold), adjusting for the expected increase in expression due to increased sequencing depth. MAST was applied as detailed in Methods to identify DE genes; thresholds equal to 1.5, 2, 2.5, and 3 are shown here.
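
The gold standard rule described in this caption can be written as a short check. The Python sketch below rests on one assumption not spelled out in the caption: "adjusting for the expected increase" is taken to mean dividing the observed G2/M-over-S fold-change by the expected depth ratio of 1.5. The function and parameter names are hypothetical.

```python
def is_gold_standard_de(observed_fc, threshold, depth_ratio=1.5):
    """Return True if a gene counts as gold standard DE.

    observed_fc : fold-change of G2/M over S before any adjustment.
    threshold   : fold-change cut-off (the caption uses 1.5, 2, 2.5, 3).
    depth_ratio : expected fold-change of an EE gene due to depth alone.

    Assumption: the adjustment is a division by depth_ratio.
    """
    adjusted_fc = observed_fc / depth_ratio
    return adjusted_fc > threshold or adjusted_fc < 1.0 / threshold


# An EE gene is expected to show an observed fold-change of 1.5,
# so it should not be called DE at any of the thresholds.
for cutoff in (1.5, 2, 2.5, 3):
    print(cutoff, is_gold_standard_de(1.5, cutoff))
```

The rule is symmetric on the ratio scale after adjustment, so genes that are lower in G2/M are treated in the same way as genes that are higher.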

Supplementary Figure 6: Count-depth relationship of H1-1M data. For each gene, median quantile regression was used to estimate the count-depth relationship before normalization and after normalization via SCnorm, MR, TPM, scran, SCDE, and BASiCS. Shown are densities of slopes within each of ten equally sized gene groups where a gene's group membership is determined by its median expression among non-zero un-normalized measurements.
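
A gene-specific slope of this kind can be sketched in plain Python. The example below is an illustration under stated assumptions, not the implementation used for the figures: it assumes the regression is of log non-zero count on log sequencing depth, and it fits the median (least absolute deviations) line by brute force over pairs of points, which is exact for a straight-line fit because some optimal line passes through two observations. The function names are hypothetical, and the cubic cost makes this suitable only for small examples.

```python
import math
from itertools import combinations


def median_regression(x, y):
    """Fit y = a + b*x by minimizing the sum of absolute residuals.

    Tries every line through two observations with distinct x values
    and keeps the one with the smallest total absolute residual.
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # vertical line, not a valid fit
        b = (y[j] - y[i]) / (x[j] - x[i])
        a = y[i] - b * x[i]
        loss = sum(abs(yk - (a + b * xk)) for xk, yk in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, a, b)
    if best is None:
        raise ValueError("need at least two observations with distinct x")
    return best[1], best[2]


def count_depth_slope(counts, depths):
    """Slope of log count on log depth, using non-zero counts only."""
    pairs = [(math.log(d), math.log(c)) for c, d in zip(counts, depths) if c > 0]
    x = [p[0] for p in pairs]
    y = [p[1] for p in pairs]
    _, slope = median_regression(x, y)
    return slope


# Hypothetical gene whose counts scale with depth, plus one dropout (zero)
# and one outlying cell. The median fit is not pulled by the outlier.
toy_depths = [1e5, 3e5, 2e5, 4e5, 8e5, 1.6e6]
toy_counts = [10, 0, 20, 40, 80, 1000]
toy_slope = count_depth_slope(toy_counts, toy_depths)
print(round(toy_slope, 6))
```

A slope near 1 on this scale means the count grows in proportion to depth, and a slope near 0 means depth has no remaining effect on the count.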

Supplementary Figure 7: Count-depth relationship of H1-4M data. Results are structurally identical to those shown in Supplementary Figure 6, but for the H1-4M data.

Supplementary Figure 8: Count-depth relationship for five publicly available datasets. Results are structurally identical to those shown in Supplementary Figure 6, but for five publicly available datasets (a) prior to normalization and (b) normalized by SCnorm.

Supplementary Figure 9: Estimated count-depth relationship of H1-1M data. Results are structurally identical to those shown in Supplementary Figure 6, but using negative binomial generalized linear regression instead of median quantile regression to calculate gene-specific slopes.

Supplementary Figure 10: Count-depth relationship of H1-4M data. Results are structurally identical to those shown in Supplementary Figure 7, but using negative binomial generalized linear regression instead of median quantile regression to calculate gene-specific slopes.

Supplementary Figure 11: Count-depth relationship for five publicly available datasets. Results are structurally identical to those shown in Supplementary Figure 8(b), but using negative binomial generalized linear regression instead of median quantile regression to calculate gene-specific slopes.

Supplementary Figure 12: Motivation for using quantile regression on non-zero data. Panel (a) shows true expression of one hypothetical gene in 100 cells. All variation shown is considered biological. Panel (b) shows ideal measured expression for the same gene as a function of sequencing depth where ideal measured expression is free of technical artifacts. Panel (c) shows the same gene where some cells show zero counts due to technical artifacts. Panel (d) shows expression vs. depth and estimated regression fits for quantile regression (blue) and negative binomial generalized linear model regression (red) without (solid) and with (dashed) the zero counts included. SCnorm leaves zeros unchanged and corrects for the count-depth relationship among non-zeros, which is more accurately summarized by the regression fits on non-zero data.

Supplementary Figure 13: Effect of normalization on non-zero means. Panel (a) shows measured expression for a hypothetical EE gene that is sequenced in two conditions. The sequencing depths in the first condition (red) are smaller than those in the second condition (blue). Panel (b) shows the same gene where counts have been normalized using a global scale factor based approach. Note that global scale factor based approaches provide normalized estimates of expression that have similar means among all counts, which results in non-zero counts having different mean expression levels if the proportion of zeros differs across conditions. Methods to identify DE genes in scRNA-seq data, such as MAST and scDD, test non-zero and zero counts separately and, consequently, would identify this EE gene as DE. Panel (c) shows counts normalized by SCnorm, which provides normalized estimates of expression that have similar means among non-zero counts.

Supplementary Figure 14: The proportion of spike-in expression counts to the total expression counts for each cell is shown for four publicly available datasets and the H1-1M and H1-4M datasets. Cells are ordered by sequencing depth.

Supplementary Figure 15: Boxplots of log read counts in each cell are shown separately for endogenous genes (left panel) and spike-ins (right panel) for four publicly available datasets and the H1 case study data. Counts smaller than one are not shown. Cells are ordered by sequencing depth.

Supplementary Figure 16: Results from Figure 2, with NODES included.

Supplementary Figure 17: Results are structurally identical to Figure 3, but with NODES included. Misclassification rates for SCnorm, MR, TPM, scran, SCDE, and NODES averaged across the three cell cycle phases are 0.12, 0.21, 0.22, 0.22, 0.24, and 0.31, respectively. Note that these rates differ from those shown in Figure 3 because NODES removes genes and cells prior to normalization, and here we restrict to the genes and cells retained by NODES to facilitate comparison across methods.

Supplementary Figure 18: Results from Supplementary Figure 5, with NODES included.

Supplementary Figure 19: Summary statistics of simulated and case-study data. The empirical cumulative distribution functions of the gene-specific variances and gene-specific means are shown in black in panels (a) and (b), respectively, for one SIM I dataset with. Shown in red are the empirical cumulative distribution functions of the gene-specific variances and gene-specific means for the genes sampled from the H1-1M and H1-4M datasets and used to simulate the SIM I data. Panels (c) and (d) are structurally identical, for one SIM I dataset with. Variances and means are computed on log non-zero expression measurements.

References

1. Risso, D., Schwartz, K., Sherlock, G. & Dudoit, S. GC-content normalization for RNA-Seq data. BMC Bioinformatics 12, 480 (2011).
2. Lin, Y. et al. Comparison of normalization and differential expression analyses using RNA-Seq data from 726 individual Drosophila melanogaster. BMC Genomics 17, 28 (2016).
3. Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010).
4. Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12, 323 (2011).
5. Lun, A. T. L., Bach, K. & Marioni, J. C. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17, 75 (2016).
6. Kharchenko, P. V., Silberstein, L. & Scadden, D. T. Bayesian approach to single-cell differential expression analysis. Nat. Methods 11, 740-742 (2014).
7. Vallejos, C. A., Marioni, J. C. & Richardson, S. BASiCS: Bayesian Analysis of Single-Cell Sequencing Data. PLOS Comput. Biol. 11, e1004333 (2015).
8. Korthauer, K. D. et al. A statistical approach for identifying differential distributions in single-cell RNA-seq experiments. Genome Biol. 17, 222 (2016).