Supplement to SCnorm: robust normalization of single-cell RNA-seq data

Supplementary Note 1: SCnorm does not require spike-ins, since we find that the performance of spike-ins in scRNA-seq is often compromised, and many labs do not use them for normalization [2]. Specifically, spike-ins are not routinely representative of the full range of expression, show substantial bias, and are often spiked in at much higher concentrations than targeted (Supplementary Figures 14-15). However, if good spike-ins are available, the performance of SCnorm may be improved in the post-normalization scaling step, which is required when multiple conditions are available. Since spike-ins are added in equal concentrations and are biologically inactive, between-condition scale factors can be computed over the spike-ins alone, as detailed in Methods. Since good spike-ins are expected to be equivalently expressed (not DE) between conditions, we expect this approach to be more accurate than using the full set of target genes in the rescaling step, especially when the overall proportion of DE genes is very high (e.g., over 50%).

Supplementary Note 2: While quantile regression proved to be more flexible and more robust to outliers than a generalized linear model based approach, we recognize that the non-linear log transformation introduces a bias in the count-depth relationship for small counts. Consequently, in addition to quantile regression, count-depth relationships are also assessed on untransformed data using generalized linear model regression with a negative binomial model, fit with the glm.nb function in R.

Supplementary Note 3: Like other methods for normalization [3-7], SCnorm leaves zeros unchanged. Consequently, the goal of SCnorm is to remove the effect of sequencing depth (and perhaps gene-specific features) among the non-zero counts. To do so, the count-depth relationship must be estimated prior to adjustment using only non-zero count data (Supplementary Figure 12). MAST is commonly used to identify DE genes in scRNA-seq data. The user has the option to test for DE on the non-zero count data (continuous component), on the zeros (discrete component), or on both, which combines evidence from the continuous and discrete tests. A recent method, scDD [8], is similar in that tests for zeros and non-zeros are conducted separately. When two biological conditions are being compared, SCnorm rescales the normalized estimates so that the two conditions have similar overall means among the non-zero counts. Other normalization methods provide normalized estimates of expression that have similar means among all counts, which is problematic for fold-change calculations and DE testing (as shown in Figure 2). See Supplementary Figure 13 for further detail.

Supplementary Figure 1: Estimated count-depth relationships in bulk and single-cell datasets before and after normalization. Results are structurally identical to those shown in Figure 1, but using negative binomial generalized linear regression instead of median quantile regression to calculate gene-specific slopes.

Supplementary Figure 2: Fold-changes and ROC curves for SIM I. For each simulated dataset, genes are divided into four equally sized groups based on their median expression among non-zero un-normalized measurements. In each group, the gene-specific difference between estimated fold-change and true fold-change is calculated for SCnorm, SCnorm.SI, MR, TPM, and scran. Boxplots of these estimates are shown in panel (a) for 100 simulations of SIM I with K=1. Panel (b) shows ROC curves for detection of differentially expressed (DE) genes for 100 simulations of SIM I with K=1. The solid line is the average over the 100 iterations, and the dashed lines represent ROC curves for five randomly chosen iterations. Panels (c) and (d) are structurally identical, for SIM I with K=4.

Supplementary Figure 3: Fold-changes and ROC curves for SIM II. For each scenario in SIM II, panels (a)-(d) show the gene-specific difference calculated between estimated fold-change of non-zero counts and true fold-change for 100 simulations. Boxplots of the averages are shown for data normalized by SCnorm, TPM, and scran. MR cannot be evaluated in these simulations as each gene contains at least one zero and so no genes pass the MR filter. Motivation for considering non-zero counts to calculate fold-change is discussed in Supplementary Note 3. Panels (e)-(h) are structurally identical to (a)-(d), but with fold-changes calculated with zeros included. Panels (i)-(k) show ROC curves for detection of differentially expressed (DE) genes for 100 simulations of SIM II, scenario 2 (panel (i)), 3 (panel (j)), and 4 (panel (k)), for data normalized by SCnorm, TPM, and scran. The solid line is the average over the 100 iterations, and the dashed lines represent ROC curves for five randomly chosen iterations.

Supplementary Figure 4: Fold-changes and DE genes calculated from the H9 case study data. For each gene, the fold-change of non-zero counts between the H9-4M and H9-1M groups was computed for data following normalization via SCnorm, MR, TPM, scran, SCDE, and BASiCS. Box-plots of gene-specific fold-changes are shown in panel (a) for data normalized by each method. The number of genes identified as DE using MAST is shown in panel (b). Genes are divided into four equally sized expression groups based on their median among non-zero un-normalized expression measurements, and results are shown as a function of expression group. Motivation for considering non-zero counts to calculate fold-change is discussed in Supplementary Note 3.

Supplementary Figure 5: ROC curves for a comparison of S vs. G2/M in the H1-FUCCI data. For this evaluation, we subsampled cells from the S and G2/M H1-FUCCI case study data. For the subsampled cells, there are negligible differences in cellular detection rates (CDRs) between the two conditions, and there is on average a 1.5-fold increase in sequencing depth (details in Methods). Without differences in CDR, we would expect an EE gene expressed at level x in S to be expressed at level 1.5*x in G2/M. Given this, we define a gold standard DE list to be those genes showing a fold change larger than a threshold (or smaller than one over that threshold), after adjusting for the expected increase in expression due to increased sequencing depth. MAST was applied as detailed in Methods to identify DE genes. Thresholds equal to 1.5, 2, 2.5, and 3 are shown here.
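As a rough illustration of the gold-standard definition in Supplementary Figure 5, the sketch below flags a gene as DE when its fold change, divided by the expected 1.5-fold depth effect, exceeds a threshold or falls below the reciprocal of that threshold. The function name, the use of per-condition mean expression, and the G2/M-over-S direction of the fold change are assumptions made for illustration; the procedure actually used is the one described in Methods.

```python
# Average increase in sequencing depth, G2/M relative to S (value from the caption).
EXPECTED_DEPTH_FOLD = 1.5


def is_gold_standard_de(mean_s, mean_g2m, threshold, depth_fold=EXPECTED_DEPTH_FOLD):
    """Return True if the depth-adjusted fold change lies beyond the threshold.

    Hypothetical helper: mean_s and mean_g2m are assumed to be a gene's
    positive mean expression in the S and G2/M cells, respectively.
    """
    fold_change = mean_g2m / mean_s
    adjusted = fold_change / depth_fold  # an EE gene is expected to sit near 1.0
    return adjusted > threshold or adjusted < 1.0 / threshold


thresholds = [1.5, 2, 2.5, 3]

# A gene that only tracks the depth increase (10 -> 15) is never flagged.
ee_gene = [is_gold_standard_de(10.0, 15.0, t) for t in thresholds]

# A gene with a 3-fold change beyond the depth effect (10 -> 45) is flagged
# for thresholds strictly below 3.
up_gene = [is_gold_standard_de(10.0, 45.0, t) for t in thresholds]
```

Dividing by the expected depth effect first keeps the threshold symmetric around "no change", so the same cut-off applies to up- and down-regulated genes.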

Supplementary Figure 6: Count-depth relationship of H1-1M data. For each gene, median quantile regression was used to estimate the count-depth relationship before normalization and after normalization via SCnorm, MR, TPM, scran, SCDE, and BASiCS. Shown are densities of slopes within each of ten equally sized gene groups, where a gene's group membership is determined by its median expression among non-zero un-normalized measurements.
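The grouping used in these captions (four or ten equally sized groups, ordered by median non-zero un-normalized expression) can be sketched as follows. This is a minimal illustration, not SCnorm's implementation: the function name, the dictionary input layout, and the choice to skip genes with no non-zero counts are all assumptions.

```python
import statistics


def expression_groups(counts_by_gene, n_groups=10):
    """Assign genes to equally sized groups by median non-zero count.

    counts_by_gene: dict mapping gene name -> list of un-normalized counts
    (hypothetical input layout). Genes without any non-zero count are skipped.
    Returns a dict mapping gene name -> group index (0 = lowest expression).
    """
    medians = {}
    for gene, counts in counts_by_gene.items():
        nonzero = [c for c in counts if c > 0]
        if nonzero:
            medians[gene] = statistics.median(nonzero)

    ranked = sorted(medians, key=medians.get)
    # Rank-based assignment keeps the group sizes as equal as possible.
    return {gene: rank * n_groups // len(ranked) for rank, gene in enumerate(ranked)}


toy_counts = {
    "g1": [0, 1, 2, 0],
    "g2": [5, 0, 7, 6],
    "g3": [50, 40, 0, 0],
    "g4": [300, 0, 500, 400],
    "g5": [0, 0, 0, 0],  # all zeros: no median among non-zero counts
}
groups = expression_groups(toy_counts, n_groups=4)
```

Assigning groups by rank rather than by fixed expression cut-offs is what makes the groups equally sized regardless of how skewed the expression distribution is.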

Supplementary Figure 7: Count-depth relationship of H1-4M data. Results are structurally identical to those shown in Supplementary Figure 6, but for the H1-4M data.

Supplementary Figure 8: Count-depth relationship for five publicly available datasets. Results are structurally identical to those shown in Supplementary Figure 6, but for five publicly available datasets (a) prior to normalization and (b) normalized by SCnorm.

Supplementary Figure 9: Estimated count-depth relationship of H1-1M data. Results are structurally identical to those shown in Supplementary Figure 6, but using negative binomial generalized linear regression instead of median quantile regression to calculate gene-specific slopes.

Supplementary Figure 10: Count-depth relationship of H1-4M data. Results are structurally identical to those shown in Supplementary Figure 7, but using negative binomial generalized linear regression instead of median quantile regression to calculate gene-specific slopes.

Supplementary Figure 11: Count-depth relationship for five publicly available datasets. Results are structurally identical to those shown in Supplementary Figure 8(b), but using negative binomial generalized linear regression instead of median quantile regression to calculate gene-specific slopes.

Supplementary Figure 12: Motivation for using quantile regression on non-zero data. Panel (a) shows true expression of one hypothetical gene in 100 cells. All variation shown is considered biological. Panel (b) shows ideal measured expression for the same gene as a function of sequencing depth, where ideal measured expression is free of technical artifacts. Panel (c) shows the same gene where some cells show zero counts due to technical artifacts. Panel (d) shows expression vs. depth and estimated regression fits for quantile regression (blue) and negative binomial generalized linear model regression (red) without (solid) and with (dashed) the zero counts included. SCnorm leaves zeros unchanged and corrects for the count-depth relationship among non-zeros, which is more accurately summarized by the regression fits on non-zero data.
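The point of panel (d) can be reproduced on a toy gene with a small median (least-absolute-deviation) regression. The sketch below is illustrative only: it uses an exhaustive search rather than the quantile regression software used for the paper, the toy depths and counts are invented, and putting both fits on a log(count + 1) scale so that zeros can be included is an assumption, since the caption does not state how zeros enter the dashed fits.

```python
import math
from itertools import combinations


def median_regression(x, y):
    """Fit y ~ intercept + slope * x by minimizing the sum of absolute residuals.

    With a single predictor, some optimal line passes through two data points,
    so an exhaustive search over point pairs is exact. It costs O(n^3) and is
    only meant for toy-sized inputs.
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line is not a valid fit
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        loss = sum(abs(yk - (intercept + slope * xk)) for xk, yk in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, slope, intercept)
    return best[1], best[2]


# Toy gene: counts proportional to depth, i.e. a slope of 1 on the log-log scale.
depths = [100 * k for k in range(1, 11)]
counts = [2 * d for d in depths]
for cell in (0, 1, 2):
    counts[cell] = 0  # technical zeros in the three lowest-depth cells

log_depth = [math.log(d) for d in depths]
log_count = [math.log1p(c) for c in counts]  # assumption: log(count + 1) scale

nonzero = [i for i, c in enumerate(counts) if c > 0]
slope_nonzero, _ = median_regression(
    [log_depth[i] for i in nonzero], [log_count[i] for i in nonzero]
)
slope_all, _ = median_regression(log_depth, log_count)
```

In this toy setup the fit on non-zero counts recovers a slope close to 1, while including the zeros pulls the line down at low depth and inflates the slope.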

Supplementary Figure 13: Effect of normalization on non-zero means. Panel (a) shows measured expression for a hypothetical EE gene that is sequenced in two conditions. The sequencing depths in the first condition (red) are smaller than those in the second condition (blue). Panel (b) shows the same gene where counts have been normalized using a global scale factor based approach. Note that global scale factor based approaches provide normalized estimates of expression that have similar means among all counts, which results in non-zero counts having different mean expression levels if there are differences in the proportion of zeros across conditions. Methods to identify DE genes in scRNA-seq data such as MAST and scDD test on non-zero and zero counts separately and, consequently, would identify this EE gene as DE. Panel (c) shows counts normalized by SCnorm, which provides normalized estimates of expression that have similar means among non-zero counts.
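The arithmetic behind panel (b) can be checked with a small worked example. The numbers below are hypothetical and chosen only so that the two conditions share the same mean over all counts while differing in their proportion of zeros.

```python
def mean(values):
    """Mean over all values, zeros included."""
    return sum(values) / len(values)


def nonzero_mean(values):
    """Mean over the strictly positive values only."""
    nonzero = [v for v in values if v > 0]
    return sum(nonzero) / len(nonzero)


# Hypothetical normalized counts for one EE gene (illustrative values only).
condition_1 = [10.0, 10.0, 0.0, 0.0]  # half of the cells are zeros
condition_2 = [5.0, 5.0, 5.0, 5.0]    # no zeros

# Means over all counts agree, so a fold change on all counts sees no difference.
overall_fold_change = mean(condition_2) / mean(condition_1)

# Means over non-zero counts differ by a factor of two, which a test on the
# non-zero (continuous) component would read as differential expression.
nonzero_fold_change = nonzero_mean(condition_2) / nonzero_mean(condition_1)
```

More generally, the non-zero mean equals the overall mean divided by the proportion of non-zero counts, so equal overall means imply equal non-zero means only when the proportion of zeros is the same in both conditions.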

Supplementary Figure 14: The proportion of spike-in expression counts to the total expression counts for each cell is shown for four publicly available datasets and the H1-1M and H1-4M datasets. Cells are ordered by sequencing depth.
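The quantity plotted in Supplementary Figure 14 can be sketched as below. This is a minimal illustration under stated assumptions: the function name and inputs are hypothetical, and both "total expression counts" and sequencing depth are taken to be the sum of endogenous and spike-in counts in a cell, which the caption does not spell out.

```python
def spike_in_proportions(endogenous_totals, spike_in_totals):
    """Per-cell spike-in share of all counts, with cells ordered by depth.

    Assumption: a cell's depth is its endogenous plus spike-in counts, and
    every cell has at least one count.
    """
    cells = []
    for endogenous, spike_in in zip(endogenous_totals, spike_in_totals):
        depth = endogenous + spike_in
        cells.append((depth, spike_in / depth))
    cells.sort(key=lambda cell: cell[0])  # order cells by sequencing depth
    return [proportion for _, proportion in cells]


# Three hypothetical cells with depths 1000, 400, and 2000.
proportions = spike_in_proportions([900, 300, 1500], [100, 100, 500])
```

If spike-ins behaved as intended, the proportion would be expected to fall as depth rises in cells with more endogenous RNA, so a flat or erratic profile across ordered cells is one informal sign of the problems described in Supplementary Note 1.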

Supplementary Figure 15: Box-plots of log read counts in each cell are shown separately for endogenous genes (left panel) and spike-ins (right panel) for four publicly available datasets and the H1 case study data. Counts smaller than one are not shown. Cells are ordered by sequencing depth.

Supplementary Figure 16: Results from Figure 2, with NODES included.

Supplementary Figure 17: Results are structurally identical to Figure 3, but with NODES included. Misclassification rates for SCnorm, MR, TPM, scran, SCDE, and NODES averaged across the three cell cycle phases are 0.12, 0.21, 0.22, 0.22, 0.24, and 0.31, respectively. Note that these rates differ from those shown in Figure 3 because NODES removes genes and cells prior to normalization, and here we restrict to the genes and cells retained by NODES to facilitate comparison across methods.

Supplementary Figure 18: Results from Supplementary Figure 5, with NODES included.

Supplementary Figure 19: Summary statistics of simulated and case-study data. The empirical cumulative distribution functions of the gene-specific variances and gene-specific means are shown in black in panels (a) and (b), respectively, for one SIM I dataset with. Shown in red are the empirical cumulative distribution functions of the gene-specific variances and gene-specific means for the genes sampled from the H1-1M and H1-4M datasets and used to simulate the SIM I data. Panels (c) and (d) are structurally identical, for one SIM I dataset with. Variances and means are computed on log non-zero expression measurements.
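The summary statistics in Supplementary Figure 19 can be sketched as follows. The function names are hypothetical, and two choices are assumptions because the caption does not specify them: the natural logarithm is used, and the variance is the sample variance (n - 1 denominator).

```python
import math
import statistics


def log_nonzero_summary(counts):
    """Mean and sample variance of a gene's log non-zero counts.

    Assumes natural log and at least two non-zero counts for the gene.
    """
    logs = [math.log(c) for c in counts if c > 0]
    return statistics.mean(logs), statistics.variance(logs)


def ecdf(values):
    """Empirical CDF as (value, cumulative proportion) pairs, sorted by value."""
    ordered = sorted(values)
    n = len(ordered)
    return [(value, (rank + 1) / n) for rank, value in enumerate(ordered)]


# One hypothetical gene: the zero is ignored, leaving log(2) and log(8).
gene_mean, gene_variance = log_nonzero_summary([0, 2, 8])

# ECDF over hypothetical gene-specific means.
curve = ecdf([3.0, 1.0, 2.0])
```

Comparing the two curves in each panel then amounts to computing these per-gene summaries on the simulated and on the sampled case-study genes and overlaying the resulting ECDFs.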

References

1. Risso, D., Schwartz, K., Sherlock, G. & Dudoit, S. GC-content normalization for RNA-Seq data. BMC Bioinformatics 12, 480 (2011).
2. Lin, Y. et al. Comparison of normalization and differential expression analyses using RNA-Seq data from 726 individual Drosophila melanogaster. BMC Genomics 17, 28 (2016).
3. Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010).
4. Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12, 323 (2011).
5. Lun, A. T. L., Bach, K. & Marioni, J. C. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 17, 75 (2016).
6. Kharchenko, P. V., Silberstein, L. & Scadden, D. T. Bayesian approach to single-cell differential expression analysis. Nat. Methods 11, 740-742 (2014).
7. Vallejos, C. A., Marioni, J. C. & Richardson, S. BASiCS: Bayesian Analysis of Single-Cell Sequencing Data. PLOS Comput. Biol. 11, e1004333 (2015).
8. Korthauer, K. D. et al. A statistical approach for identifying differential distributions in single-cell RNA-seq experiments. Genome Biol. 17, 222 (2016).