RNA-Seq Preparation Comparison Summary: Lexogen, Standard, NEB

CSF-NGS
January 22, 2014

Contents
1 Introduction
2 Experimental Details
3 Results And Discussion
3.1 ERCC spike ins
3.2 RNA alignment
3.3 Differential Expression

1 Introduction

A quick but reliable protocol for preparing Illumina sequencing libraries from total RNA was needed so that it could be offered as a service to the CSF-NGS users. The current in-house protocol ("standard") was judged to take too long, because hands-on time is the major cost factor. Many different protocols and kits for mRNA preparation are currently available (citations).

2 Experimental Details

Mouse total RNA from liver (Clontech cat. nr 63663) and kidney (Clontech cat. nr 636612) was used as source material. ERCC spike-in mix 1 or 2 (Life Technologies, catalog number 445674) was added to each source tube. The combined sample was then split into triplicates, and each replicate was prepared separately with either the in-house standard protocol (), the Lexogen SENSE kit (Lexogen catalog number 1.8) or the NEB kit (NEBNext Ultra Directional RNA Library Prep Kit for Illumina, catalog number E742).

The resulting samples were sequenced on a HiSeq 2000 in single-end (SE) mode with a read length of 50. The reads were 5'-trimmed for the Lexogen preparation method and adaptors were removed with cutadapt. The rRNA reads and ERCC spike-ins were removed by alignment with bowtie, and the remaining reads were aligned to the mouse mm10 genome and transcriptome using tophat 1.4.1. The reads aligned by tophat were counted per gene with HTSeq-count, and the counts were used for differential gene expression estimation with DESeq.
3 Results And Discussion

3.1 ERCC spike ins

Because the spike-in mix was added before the sample was split into the three technical replicates per condition, the replicates allow us to assess the reliability of each preparation. Normalizing the ERCC reads against the total number of reads obtained, we estimated the coefficient of variation per condition and preparation to lie between 0.01 (kidney) and 0.18 (liver) (table in Figure 1). Normalizing against the ERCC-aligned reads makes it possible to calculate a dose-response regression line (table in Figure 2). All experiments show a high correlation, with the NEB-prepared spike-ins showing the lowest variance.

TODO: calculate range of detection.

Table 1: Summary of results

Criteria                                    Lexogen      NEB                      Standard
hands-on time (a)                           5-6h         1d                       3d
detection of differential gene expression   very good    very good                very good
variance per gene                           low          very low                 low
strandedness                                very high    high                     high
variance of gene coverage                   spiky (b)    very low                 very low
5' coverage lag                             low          detectable               detectable
rRNA depletion (c)                          exhaustive   low                      low
ease of automation                          not known    protocols available (d)  low
duplication                                 medium (b)   very low                 low
dynamic range                               good         good                     good

(a) including RNA QC, excluding library QC
(b) expected due to priming method, no influence on differential expression
(c) only one round of poly-A enrichment
(d) for Hamilton STAR robot

prep   condition   mean   sd     cv
       liver       0.44   0.08   0.18
       kidney      0.49   0.03   0.07
       liver       0.57   0.04   0.06
       kidney      1.31   0.08   0.06
       liver       0.68   0.04   0.06
       kidney      0.77   0.01   0.01

[prep labels not recoverable; rows are in source order]

[Figure 1 plot: percent ERCC/total (y) per prep (x); legend: replicate 1, 2, 3; condition liver, kidney]

Figure 1: Relative abundance of ERCC spike-ins compared to the total number of reads. Reads were aligned with bowtie against the ERCC genes and the mouse rDNA cluster, and the uniquely aligning reads were counted.
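The coefficient of variation reported for the technical replicates can be sketched as follows. This is a minimal illustration, not the script used for the report: the function name and the read counts are hypothetical, and it assumes the sample standard deviation (n - 1) was used, which the report does not state.

```python
from statistics import mean, stdev


def coefficient_of_variation(ercc_reads, total_reads):
    """Return (mean, sd, cv) of the percent ERCC reads across replicates.

    ercc_reads and total_reads are per-replicate read counts.
    Assumption: sd is the sample standard deviation (n - 1).
    """
    percents = [100.0 * e / t for e, t in zip(ercc_reads, total_reads)]
    m = mean(percents)
    sd = stdev(percents)
    return m, sd, sd / m


# Hypothetical counts for three technical replicates (not data from this report).
ercc = [5000, 5500, 4500]
total = [1_000_000, 1_000_000, 1_000_000]
m, sd, cv = coefficient_of_variation(ercc, total)
print(f"mean={m:.2f}% sd={sd:.2f} cv={cv:.2f}")
```

Because the cv is the ratio sd/mean, it is unit-free and can be compared across preparations with different ERCC fractions.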

prep   condition   intercept   slope   R2
       liver       1.3         .861    .864
       kidney      .86         .877    .888
       liver       .967        .95     .899
       kidney      1.1         .892    .893
       liver       .67         .931    .96
       kidney      .384        .948    .966

[prep labels not recoverable; rows are in source order]

[Figure 2 plot: log2(rpkm) (y) against log2(expected attomoles) (x); panels: liver, kidney; legend: expected, lm, loess, ND]

Figure 2: ERCC dose response. The sum of all uniquely aligning reads per ERCC gene was normalized by the length of the gene and by the total number of reads aligning uniquely to the ERCC controls, and the resulting rpkm values were plotted against the expected number of RNA molecules. Linear regression parameters (top) and scatter plot (bottom) with expected counts (blue line), regression line (red line), loess curve (green line) and undetected genes at an arbitrary rpkm value (yellow dots).
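The dose-response fit described in the Figure 2 caption can be sketched in a few lines. This is an illustration under stated assumptions rather than the original analysis code: the function names and the spike-in values are hypothetical, rpkm is computed against the total of ERCC-aligned reads as the caption describes, and undetected spike-ins (zero reads) are assumed to be left out of the fit.

```python
import math


def rpkm(count, length_bp, total_ercc_reads):
    """Reads per kilobase per million ERCC-aligned reads."""
    return count * 1e9 / (length_bp * total_ercc_reads)


def linear_fit(xs, ys):
    """Ordinary least squares; returns (intercept, slope, r_squared).

    Needs at least two points with distinct x and distinct y values.
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope, sxy * sxy / (sxx * syy)


def dose_response(spikes):
    """spikes: list of (read_count, length_bp, expected_attomoles).

    Assumption: spike-ins with zero reads are excluded from the regression.
    """
    total = sum(count for count, _, _ in spikes)
    detected = [s for s in spikes if s[0] > 0]
    xs = [math.log2(att) for _, _, att in detected]
    ys = [math.log2(rpkm(c, length, total)) for c, length, _ in detected]
    return linear_fit(xs, ys)


# Hypothetical spike-ins (not data from this report); the last one is undetected.
example = [(100, 1000, 1.0), (200, 1000, 2.0), (400, 1000, 4.0), (0, 1000, 0.5)]
intercept, slope, r2 = dose_response(example)
print(f"intercept={intercept:.3f} slope={slope:.3f} R2={r2:.3f}")
```

A slope close to 1 on the log2 scale means that a doubling of input molecules gives a doubling of normalized reads.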

[Figure 3 plot: observed log2(l/k) (y) against expected log2(l/k) (x), one panel per preparation; legend: expected, lm. Regression annotations in source order: y = .13 + .9 x, r2 = .691; y = .56 + 1 x, r2 = .77; y = .23 + .97 x, r2 = .81]

Figure 3: ERCC fold-change response. The log2 ratios of the mean rpm per condition (liver/kidney) were plotted against the expected log2 ratios. Expected counts (red line) and regression line (green line) are indicated.

[Figure 4 plot: average coverage per million reads (y) against position % (x); one row per rank bin (0,5], (5,10], (10,20], (20,92]; colored by prep]

Figure 4: ERCC coverage across genes. The genes were binned per preparation method by their rank (top 5, 6-10, 11-20, rest), and the average coverage per million reads per bin is plotted against the length-normalized genes.
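The rank binning used for Figure 4 can be illustrated with a short sketch. It assumes that genes are ranked by read count within one preparation (the report only says "by their rank") and that the bins are top 5, 6-10, 11-20 and the rest of the 92 ERCC genes; the function names and gene ids are hypothetical.

```python
def rank_bin(rank):
    """Map a 1-based rank to the bin labels of Figure 4."""
    if rank <= 5:
        return "(0,5]"
    if rank <= 10:
        return "(5,10]"
    if rank <= 20:
        return "(10,20]"
    return "(20,92]"


def bin_genes_by_rank(counts):
    """counts: dict of gene id -> read count for one preparation.

    Assumption: rank 1 is the gene with the most reads.
    """
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {gene: rank_bin(i) for i, gene in enumerate(ordered, start=1)}


# Hypothetical counts (not data from this report).
counts = {f"spike_{i:02d}": 1000 - 10 * i for i in range(1, 31)}
bins = bin_genes_by_rank(counts)
print(bins["spike_01"], bins["spike_06"], bins["spike_30"])
```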

3.2 RNA alignment

[Figure 5 plot, titled "Alignment Distribution". Top panel: absolute counts (y) per sample id (x; sample ids 15956-15961, 15969-15974, 16153-16158), stacked by align type. Bottom panel: percent of total (y) per align type (x), by replicate 1, 2, 3 and condition liver, kidney. Align types: Cleaned, Cut, NM, R, U0, U1, U2]

Figure 5: Alignment statistics. The reads were 5'-trimmed for the Lexogen preparation method, adaptors were removed with cutadapt, the rRNA reads and ERCC spike-ins were removed by alignment with bowtie, and the remaining reads were aligned to the mouse mm10 genome and transcriptome using tophat 1.4.1. Absolute counts (top panel) and relative percentage (bottom panel) of each alignment category are shown (Cut: short adaptor-truncated reads that were removed; Cleaned: reads aligning to ERCC or rRNA; U0-U3: unique alignments with 0-3 mismatches; R: reads aligning repetitively; NM: reads not aligning).

[Figure 6 plot: cumulative percent of uniquely aligned reads (y) against X plicates (x); colored by preparation, replicate 1, 2, 3]

Figure 6: Xplicates. Uniquely aligned reads were binned by the number of overlaps at each position, and the cumulative sum was calculated with increasing level of duplication.
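One way to read the Xplicates curve is sketched below. It is an interpretation, not the original code: it assumes that "overlaps at each position" means the number of uniquely aligned reads sharing the same chromosome, start and strand, and the function name and read tuples are hypothetical.

```python
from collections import Counter


def cumulative_duplication(read_starts):
    """Return [(duplication_level, cumulative_percent_of_reads), ...].

    read_starts: iterable of hashable keys, e.g. (chrom, start, strand).
    Assumption: reads with the same key count as copies of each other.
    """
    per_position = Counter(read_starts)      # reads stacked at each position
    reads_by_level = Counter()
    for n in per_position.values():
        reads_by_level[n] += n               # n reads sit at an n-fold position
    total = sum(reads_by_level.values())
    running = 0
    curve = []
    for level in sorted(reads_by_level):
        running += reads_by_level[level]
        curve.append((level, 100.0 * running / total))
    return curve


# Hypothetical alignments (not data from this report).
starts = [("chr1", 100, "+")] * 3 + [("chr1", 200, "+"), ("chr2", 5, "-")]
print(cumulative_duplication(starts))
```

A curve that reaches a high percentage at low duplication levels indicates that most reads come from positions covered only once or a few times.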

[Figure 7 plot: normalized mean coverage (y) against bin (x, 0 to 100); panel columns: unstranded, same, opposite; one panel row per preparation; legend: condition kidney, liver; replicate 1, 2, 3]

Figure 7: Coverage across cDNA. Mean coverage across all length-normalized genes (cDNA).
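Length normalization of gene coverage, as plotted in Figure 7, can be sketched as rescaling each transcript to a fixed number of bins. The sketch assumes 100 equal-width bins and normalization by the mean coverage of the gene; both choices and the function name are assumptions for illustration, not details given in the report.

```python
def bin_coverage(per_base, n_bins=100):
    """Rescale per-base coverage of one gene to n_bins relative values.

    Each bin holds the mean depth of the bases that fall into it, divided
    by the mean depth of the whole gene. Genes shorter than n_bins leave
    some bins empty (reported as 0.0).
    """
    length = len(per_base)
    if length == 0:
        return [0.0] * n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i, depth in enumerate(per_base):
        b = min(i * n_bins // length, n_bins - 1)
        sums[b] += depth
        counts[b] += 1
    gene_mean = sum(per_base) / length
    if gene_mean == 0:
        return [0.0] * n_bins
    return [(s / c) / gene_mean if c else 0.0 for s, c in zip(sums, counts)]


# Hypothetical coverage with a depleted 5' end (not data from this report).
print(bin_coverage([1, 1, 3, 3], n_bins=2))
```

Averaging these per-gene profiles over all genes gives a curve in which a value below 1 near one end points to a coverage lag at that end.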

3.3 Differential Expression

[Figure 8 plot, titled "Scatter Plot Matrix": pairwise scatter plots of log2 fold changes per preparation (upper triangle) and correlation values (lower triangle; values in source order: .95, .93, .95)]

Figure 8: Scatter plot matrix of log2 fold changes per preparation. The log2 fold changes of the liver/kidney comparison for each preparation were plotted against each other (upper triangle), and the Spearman rank correlation was calculated (lower triangle).
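The Spearman rank correlation between the log2 fold changes of two preparations can be computed as the Pearson correlation of their ranks. The following is a self-contained sketch with hypothetical values and function names; ties receive average ranks, which is the usual convention but is an assumption here.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def spearman(xs, ys):
    """Spearman rank correlation of two equally long lists."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical log2 fold changes for the same genes in two preparations.
prep_a = [2.1, -0.5, 0.3, 4.0, -1.2]
prep_b = [1.8, -0.7, 0.1, 3.5, -0.9]
print(round(spearman(prep_a, prep_b), 3))
```

Because only ranks enter the calculation, the measure is insensitive to preparations that compress or stretch the fold changes monotonically.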

[Figure 9 plot: dispersion (y, log scale) against mean (x, log scale); panels for l (liver) and k (kidney) per preparation]

Figure 9: The variance of each condition was estimated with the DESeq function estimateDispersions; the model fit is indicated in red.

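[Figure 10 plot: four Venn diagrams, one per cutoff combination. Panel titles as extracted: "adj.p < .1 ; abs(log2fc) > 1", "adj.p < .1 ; abs(log2fc) > 5", "adj.p < .1 ; abs(log2fc) > 5", "adj.p < .1 ; abs(log2fc) > 1". Region counts as extracted, assignment to regions not recoverable: 625, 617, 6434, 118, 119, 1258, 1127, 1155, 1215, 169, 194, 392]

Figure 10: Venn diagrams of significantly differentially expressed genes under different cutoffs.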
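The gene sets behind such Venn diagrams can be derived by filtering each preparation's results on the adjusted p-value and the absolute log2 fold change, and then counting the set regions. The sketch below uses hypothetical gene ids, values and cutoffs; it assumes that genes without an adjusted p-value (reported as missing) are never called significant.

```python
def significant_genes(results, max_padj, min_abs_lfc):
    """results: dict of gene id -> (adjusted p-value or None, log2 fold change).

    Assumption: a missing adjusted p-value (None) is treated as not significant.
    """
    return {
        gene
        for gene, (padj, lfc) in results.items()
        if padj is not None and padj < max_padj and abs(lfc) > min_abs_lfc
    }


def venn3(a, b, c):
    """Sizes of the seven regions of a three-set Venn diagram."""
    return {
        "a_only": len(a - b - c),
        "b_only": len(b - a - c),
        "c_only": len(c - a - b),
        "a_b": len((a & b) - c),
        "a_c": len((a & c) - b),
        "b_c": len((b & c) - a),
        "a_b_c": len(a & b & c),
    }


# Hypothetical results for one preparation (not data from this report).
example = {
    "g1": (0.001, 2.0),
    "g2": (0.2, 3.0),
    "g3": (0.001, 0.5),
    "g4": (None, 4.0),
    "g5": (0.0, -1.5),
}
print(sorted(significant_genes(example, max_padj=0.01, min_abs_lfc=1.0)))
```

Calling significant_genes once per preparation and passing the three resulting sets to venn3 gives the numbers for one diagram; repeating this with other cutoffs gives the remaining panels.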