Methods: Biological Data


Transcriptome analysis of short read Illumina RNA sequencing: investigating baseline variability in gene expression levels and splice variants among human brain and Lymphoblastoid samples

Abstract

Understanding baseline variability in gene expression levels and splice variants is essential for interpreting many studies. Although recent advances in RNA sequencing enable the analysis of transcript variation at unprecedented resolution, little effort has been made to assess baseline variation between individuals. Biological variability in gene expression has been shown not to be eliminated by sequencing technology [13], and we show that the same holds for splice variation. We analyzed RNA-seq data from two studies that provide sufficient biological replicates: Cerebellum tissue data and Lymphoblastoid cell-line data from unrelated human individuals. We describe the variation between individuals within the Cerebellum tissue samples and within the Lymphoblastoid cell-line samples in terms of differential splice variant expression and differential gene expression. Compared with the Lymphoblastoid cell-line samples, the Cerebellum tissue samples show little differential expression of genes and splice variants. We also find that ~75% of the differentially expressed genes are differentially expressed because of a single sample. Although these can be biologically functional, they may also represent baseline variance of the Cerebellum cells. In contrast to Cerebellum cells, we find many differentially expressed genes and a large variation in differentially expressed splice variants for Lymphoblastoid cell-line samples. Three samples of the pool are very much alike and could be annotated as a subgroup. As Lymphoblastoid cells are stem-cell-like cells that differentiate into three subgroups (B lymphocytes, T lymphocytes and Natural Killer cells), we suggest that these three samples could, in terms of genotype, already belong to one of them.
Our results underline the importance of using sufficient biological replicates. The desired saturation level, which depends on sample type, should be taken into account before deciding on the sequencing depth.

Introduction

As a deep-sequencing tool, RNA-seq is able to accurately detect many types of RNA: mRNAs, small non-coding RNAs including miRNAs (micro RNAs) and Piwi-interacting RNAs (piRNAs)/21U-RNA sequences, rRNAs (ribosomal), tRNAs (transfer) and snoRNAs (small nucleolar). Analysis of gene expression by sequencing is highly reproducible and more sensitive than microarrays [2], although the current RNA-seq-based approaches for studying sRNAs are limited in their ability to provide an absolutely quantitative view of the transcripts [1,2]. RNA-seq has been used to interrogate transcriptomes of yeast, Caenorhabditis elegans, Drosophila melanogaster, mouse and human tissues and stem cells [3,4,5,6,7,8,9,10,11,12]. These studies show that the technology makes it possible to: find most annotated genes, splicing isoforms and alternative transcript start sites (in human brain [11]); find many previously unknown splice variants (in embryonic and neonatal mouse cortex [12]); quantify gene expression levels, major changes in expression during development and between males and hermaphrodites of Caenorhabditis elegans, and novel miRNA candidates and Piwi-interacting RNAs (piRNAs)/21U-RNAs [7]; detect alternative splicing and novel transcripts in Drosophila melanogaster [8]; identify potential genes involved in stress resistance (in C. elegans [10]); and reveal sequences and expression levels of known and novel miRNA genes in human embryonic stem cells [9].

Since its introduction in the last decade, RNA sequencing (RNA-seq), with its massively parallel cDNA sequencing, has brought firm advances in genomics and especially transcriptomics. Many previously unknown coding and non-coding RNA species have been found [7,8,9,12], and we have come to appreciate more the complexity of the transcriptome. Different methodologies and new analysis tools have been developed. Unfortunately, many studies using RNA-seq have used few if any biological replicates, perhaps because of the costs. A recent study shows that RNA-seq does not eliminate biological variability [13], and we want to show that this holds not only for gene expression but also for splice variants. In this research we investigate technical and biological variance, including the influence of sequencing depth on technical variability. We used data from the studies of Wang et al. [14] (2008) and Pickrell et al. [15] (2010). Both studies used high-throughput RNA-seq with Illumina technology and provide data from a fair number of biological replicates. Wang et al. focus on finding genes involved in neurogenesis and Pickrell et al. on finding mechanisms underlying gene expression.
With their data we investigate differential splice variant and gene expression: 1) How do they differ from each other? 2) How do they differ among biological replicates of the same cell type or tissue? 3) Is one biological replicate, as used in many studies using next-generation RNA sequencing, enough? This study takes a critical look at RNA-seq analysis and methodology. The results are discussed in the context of technical and biological variability.

Methods: Biological Data

As RNA-seq is still a new tool, few datasets are available to choose from and information about the data is not always complete. Important for this study is that the samples are Homo sapiens and are uniquely annotated to individuals. Sample IDs are provided on the SRA website. The age and gender of the individual behind each sample would be useful for further analyses, but not all of this information is available for the data sets used in this study. We obtained data for seven Cerebellum samples, which came from human males [14]. The Lymphoblastoid samples came from a very large pool of sixty-nine Nigerian individuals [15]. Sample information is shown in the next two tables.

Table 1 RNA-seq data. (a,b) Basic sample information of the Cerebellum tissue samples and Lymphoblastoid cell-line samples. SRA is the accession assigned by the online Sequence Read Archive. Number of reads indicates how 'deep' the sample is sequenced. The sample data were acquired by sequencing cDNA as single-end reads. The number of bases per read gives the read length. All samples come from unrelated humans; the Cerebellum samples are all from males and the Lymphoblastoid cell lines are from Nigerian individuals.

Data manipulation and non-junction mapping

The uploaded FastQ files, initially in Illumina format, need to be 'groomed' with FastQ Groomer so that all data are in Sanger format before further manipulation. The ultimate goal is to find differentially expressed splice variants and genes. The tool TopHat [16] first maps non-junction reads using Bowtie, an ultra-fast short-read mapping program [17]. Bowtie indexes the reference genome, human (Homo sapiens) hg19 full, using a technique borrowed from data compression: the Burrows-Wheeler transform [18,19]. TopHat finds junctions by mapping reads to the reference in two phases. First, all reads are mapped to the reference genome using Bowtie (see Figure 1). Bowtie takes into account that the 5' end of a read contains fewer sequencing errors than the 3' end [20] and allows so-called 'multi-reads' from genes with multiple copies to be reported, but discards low-complexity reads. All reads that do not map to the reference genome are set aside as initially unmapped reads (IUM reads). TopHat assembles the mapped reads using the assembly module in Maq [21].
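The grooming step described in this section amounts to re-encoding the per-base quality characters of each read. As a minimal sketch, assuming the input uses the Illumina 1.3+ encoding (Phred scores with an ASCII offset of 64) and the target is the Sanger encoding (offset 33), the conversion of one quality line could look as follows. The function name is hypothetical and this is not the FastQ Groomer implementation; older Solexa-scaled files would need a different conversion.

```python
ILLUMINA_OFFSET = 64  # assumption: Illumina 1.3+ files, Phred+64 encoding
SANGER_OFFSET = 33    # Sanger encoding, Phred+33


def illumina_to_sanger_quality(quality_line):
    """Re-encode one FASTQ quality line from Phred+64 to Phred+33."""
    converted = []
    for char in quality_line:
        phred = ord(char) - ILLUMINA_OFFSET
        if phred < 0:
            # Characters below '@' cannot occur in a Phred+64 quality line.
            raise ValueError(f"{char!r} is not a valid Phred+64 quality character")
        converted.append(chr(phred + SANGER_OFFSET))
    return "".join(converted)


# 'h' encodes Q40 and 'B' encodes Q2 in Phred+64.
print(illumina_to_sanger_quality("hhB"))  # II#
```

Only the quality line of each four-line FASTQ record changes; the identifier and sequence lines are left untouched.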

Figure 1 TopHat and Bowtie (non-)junction mapping. Reads are mapped using Bowtie. Initially unmapped (IUM) reads are first set aside. The mapped read sequences are searched with Maq for flanking potential donor/acceptor splice sites. These are joined to form potential splice junctions, against which the IUM reads are indexed and aligned.

Junction mapping

To map the IUM reads to splice junctions, TopHat first enumerates all canonical donor and acceptor sites within island sequences (as well as their reverse complements), defined by a default conservative parameter describing the allowed coverage gap between exons. Next, it considers all pairings of these sites that could form canonical (GT-AG) introns between neighboring (but not necessarily adjacent) islands. Each possible intron is checked against the IUM reads for reads that span the splice junction (see Figure 2). In order to detect junctions without sacrificing performance and specificity, the TopHat algorithm looks for introns within islands that are deeply sequenced. It can be set to be more sensitive in order to find more splice junctions, at the expense of running time. For each junction, the average depth of read coverage is computed for the left and right flanking regions of the junction separately. The number of alignments crossing the junction is divided by the coverage of the more deeply covered side to obtain an estimate of the minor isoform frequency.
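The minor isoform frequency estimate described above is a single ratio. The sketch below only illustrates that arithmetic; the function and argument names are hypothetical and it is not TopHat code.

```python
def minor_isoform_frequency(junction_alignments, left_coverage, right_coverage):
    """Alignments crossing a junction divided by the coverage of the
    more deeply covered flanking region."""
    deeper_side = max(left_coverage, right_coverage)
    if deeper_side <= 0:
        raise ValueError("at least one flanking region must have positive coverage")
    return junction_alignments / deeper_side


# 3 junction-spanning alignments, average flanking depths of 20.0 and 12.5
print(minor_isoform_frequency(3, 20.0, 12.5))  # 0.15
```

Dividing by the deeper side keeps the estimate conservative: a junction next to a highly expressed exon needs proportionally more spanning reads to reach the same frequency.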


Figure 2 Mapping IUM reads. Detecting introns within islands. In this island the intron of one splice variant is overlapped by the 5'-UTR of another transcript. When a large coverage gap between two exons is lacking, as in this picture, TopHat will look for introns within single islands to detect junctions. Both isoforms are found in mouse brain.

Creating a pool

To investigate variation in splicing and gene expression we need a pool that serves as a normalized background. For each tissue, Cerebellum and Lymphoblastoid, we created a pool of all samples in which each splice junction and each gene is represented, normalized as a group average.

Transcript abundance

The next step is estimating transcript abundance. We use algorithms that are not restricted by prior gene annotations and account for alternative transcription and splicing, allowing for simultaneous transcript discovery and abundance estimation for RNA-seq data. For finding the minimal set of transcripts that is supported by the fragment read alignments, we use a comparative transcriptome assembly algorithm. To find a maximum matching, compatibilities among fragments are represented in a weighted bipartite graph [14,22] (see Figure 3). Abundances are reported in FPKM (fragments per kilobase of exon model per million mapped fragments), the isoform-level relative abundance measure analogous to RPKM (reads per kilobase of exon model per million mapped reads). In these units, the relative abundances of transcripts are described in terms of expected biological objects (fragments) observed from an RNA-seq experiment. Confidence intervals for estimates are obtained using a Bayesian inference method based on importance sampling from the posterior distribution.
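To make the FPKM unit concrete, the sketch below applies the plain unit definition to a raw fragment count. It is only the unit conversion; CuffLinks itself derives abundances from a statistical model rather than from a direct count, and the function name and example numbers are hypothetical.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon model per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total mapped fragments must be positive")
    # Integer numerator and denominator keep the arithmetic exact until the
    # final division: count * 1e9 / (length in bp * total fragments).
    return fragment_count * 1_000_000_000 / (transcript_length_bp * total_mapped_fragments)


# 500 fragments on a 2,000 bp transcript in a sample of 2.5 million mapped fragments
print(fpkm(500, 2_000, 2_500_000))  # 100.0
```

For single-end data, as used in this study, each read is one fragment, so FPKM and RPKM coincide.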

Figure 3 Overview of CuffLinks. (a) The algorithm takes mapped reads as input, for instance from TopHat. (b) Incompatible fragments are identified and assembled with the other fragments in an overlap graph to find possible isoforms. (c) The minimal set of isoforms that covers all fragments is found. (d) A statistical model estimates transcript abundances. The probability of each splice variant is estimated by incorporating the probability of the associated possible transcript lengths when fragments are assigned to different isoforms. (e) The abundances that best explain the observed fragments are produced numerically and shown as a pie chart.

Differential expression between individual samples and their respective pool

Differential expression between samples is found by comparing the FPKM values of expressed genes or isoforms. When, for example, the expression of a particular gene is significantly different (significantly higher or lower FPKM values), the gene is reported as differentially expressed, at a False Discovery Rate (FDR) of 5%. In this study we are interested in differential expression of genes and splice variants of individual samples compared to a group of samples, which we call the pool of samples. For the Cerebellum tissue samples we created one pool, including all the Cerebellum samples listed in Table 1. For the Lymphoblastoid cell-line samples we made two pools, one including Lymphoblastoid samples 1-7 and the other including ten Lymphoblastoid samples. Abundance estimates are also produced for these pools, as explained in the previous paragraph. This means the FPKM values of the pools are comparable to those of individual samples. In this study we compare individual samples against the pool they are part of. For example, sample 1 of the Cerebellum tissue samples is compared to the pool of Cerebellum tissue samples 1-7 to find differential expression of genes or splice variants (the sample investigated is also included in the pool it is compared to). Differentially expressed splice variants and genes are annotated to loci and gene IDs, and estimates are given numerically. We only used significantly (FDR = 0.05) differentially expressed genes and splice variants for further study.

Characterization of Differential Expression

For further analysis we counted, for each significantly differentially expressed splice variant or gene, the number of samples of the same cell type in which it is found. With seven samples in a pool, a splice variant or gene can be found differentially expressed in one to seven samples. We wrote a Perl script to run this calculation (in supplementary information). With this information it is possible to determine how unique a differentially expressed gene or splice variant is and to get an idea of how it is regulated. When differential expression is found in all seven samples, we often find (over)expression of a splice variant or gene in one sample and no expression in the other samples. Finding differential expression in six out of seven samples means one of the samples has expression levels close to the estimated group average. Five out of seven means two of the samples have expression close to the estimated group average, and so on.
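The occurrence count described above can be sketched in a few lines. This is an illustrative Python version, not the Perl script from the supplementary information; the input layout (one set of significant gene or splice variant IDs per sample) and all names are assumptions.

```python
from collections import Counter


def count_occurrences(significant_by_sample):
    """Map each feature ID to the number of samples in which it is
    significantly differentially expressed."""
    occurrences = Counter()
    for features in significant_by_sample.values():
        occurrences.update(set(features))  # count each feature once per sample
    return occurrences


def occurrence_histogram(occurrences):
    """Map 'number of samples' to 'number of features found in that many samples'."""
    return Counter(occurrences.values())


# Hypothetical toy input: three samples, three features.
example = {
    "sample1": {"geneA", "geneB"},
    "sample2": {"geneA"},
    "sample3": {"geneA", "geneC"},
}
occ = count_occurrences(example)
print(occ["geneA"], occ["geneB"], occ["geneC"])  # 3 1 1
print(sorted(occurrence_histogram(occ).items()))  # [(1, 2), (3, 1)]
```

With seven samples per pool the histogram keys run from one to seven, which is the distribution summarized in the occurrence figures and tables of the Results.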
Software

Most tools (Bowtie, TopHat, CuffLinks) used in this study are part of a popular pipeline for RNA-seq data, provided by the free web-based Galaxy platform (also available for local use). A protocol is provided with a robust default setup, but it is also freely configurable [16,23]. To save time, datasets were acquired via the free online Sequence Read Archive (SRA). Here the raw data files are uploaded as archived SRA files for free and shared use in the community. Tools are available to convert the SRA files to many different file types. Galaxy imports FastQ (Sanger) files in different formats, depending on the instrument used to obtain the reads. The included quality data indicate how certain the given bases of the reads are, which is useful for further analyses of the RNA-seq data. The FastQ files obtained from SRA were intentionally restricted to data from Illumina instruments, to minimize technical variation. The RNA-seq data of the Cerebellum came from the study by Wang [14] and the Lymphoblastoid RNA-seq data from the study by Pickrell [15]. Pools were made with SAM Tools, provided by the Galaxy site, by combining the BAM files, which are among the output files of TopHat. Together the BAM files represent all the accepted hits of all the samples included.

Results

Output

For Lymphoblastoid cell-line samples we found loci where all samples differentially express splice variants compared to the samples taken together as a pool. Figure 4 shows a locus where splice variants are found differentially expressed for all seven samples. Finding all seven samples differentially expressed at a certain locus does not mean all samples express splice variants (or genes, when looking at differential gene expression): some samples express the splice variants and others do not. The figure shows expression only qualitatively; quantitative results are shown later.

Figure 4 Example of a locus on chromosome 1 of Lymphoblastoid cells where all samples show differentially expressed splice variants. This figure shows qualitatively, not quantitatively, which transcripts (accepted hits) and splice junctions are found in which sample from the pool of seven Lymphoblastoid cell-line samples, compared to the first row, which shows the accumulated accepted hits of the pooled samples on the reference genome. Samples one to seven are shown here as data 11 to 17 respectively. For each sample the accepted hits (black boxes) are shown mapped on the genome. A horizontal gray line between the hits depicts a possible splice junction, where the adjacent accepted hit is possibly part of a splice variant. The splice junctions found for each sample in the specified locus are shown in the next row. Note that for some samples no splice junctions were found in this region of the genome (chromosome 1: 155,935, ,948,387, human genome hg19).


Table 2 shows that when differential gene expression is found in seven samples at a particular locus, this often means that six samples have no gene expression at this locus and one sample does.

Table 2 Example of differential expression of splice variants in all seven Cerebellum samples for a particular locus. FPKM (fragments per kilobase of exon model per million mapped fragments) values are depicted for the seven samples of the Cerebellum for the locus where differential gene expression is found for all seven samples of the pool. The gene ID is also included, which can be used to find more information about this particular gene.

Looking up the gene shows that it is a heat shock factor, among other information. Sometimes finding differential gene expression in seven samples means that any number of the seven samples show gene expression, but all are differentially expressed with regard to the FPKM of the pool. Table 3 shows an example where all seven Lymphoblastoid cell-line samples are differentially expressed at a particular locus; four of them show expression and three do not.

Table 3 Example of differential expression of splice variants in all seven Lymphoblastoid samples for a particular locus. FPKM (fragments per kilobase of exon model per million mapped fragments) values are depicted for the seven Lymphoblastoid samples for the locus where differential gene expression is found for all seven samples of the pool. The gene ID is also included, which can be used to find more information about this particular gene.


Total Significant Differential Splice Variants Expression

There is a large difference between the total numbers of significantly differentially expressed splice variants found in Cerebellum tissue samples and Lymphoblastoid cell-line samples, as depicted in Figure 5.

Figure 5 Total Significant Differential Splice Variant Expression. Total significant differential splice variant expression is shown for Cerebellum tissue samples and Lymphoblastoid cell-line samples. The pool of Cerebellum samples consists of seven individual samples, whereas for Lymphoblastoid samples we have a pool of seven and a pool of ten samples.

Among Cerebellum tissue samples we find 24 to 35 significantly differentially expressed splice variants per sample. For Lymphoblastoid cell-line samples we are interested in how pool size affects the detection of differential expression, so the results for a pool of seven samples as well as a pool of ten Lymphoblastoid cell-line samples are depicted in Figure 5. Significantly differentially expressed splice variants are more numerous in Lymphoblastoid cell-line samples, for both pools, than in Cerebellum samples. On average we found 29 in Cerebellum samples and 226 and 242 in Lymphoblastoid samples for the pool of seven and ten samples respectively, a factor ~8 difference. We wondered whether our pools of seven samples were sufficient to represent the population. To test this we created a pool of seven samples and a pool of ten for Lymphoblastoid cell-line samples. Figure 5 shows that pool size makes a difference in the number of significantly differentially expressed splice variants, but the difference is small.

Figure 6 Total Significant Differential Gene Expression. Total significant differential gene expression is shown for Cerebellum tissue samples and Lymphoblastoid cell-line samples. The pool of Cerebellum samples consists of seven individual samples, whereas for Lymphoblastoid samples we have a pool of seven and a pool of ten samples.

The total number of differentially expressed genes is larger for Lymphoblastoid samples than for Cerebellum samples. On average we found 820 in Cerebellum samples and 5022 and 7198 in Lymphoblastoid samples for the pool of seven and ten samples respectively, a factor of roughly 6 to 9 difference. Increasing the pool from seven to ten samples does not make a large difference for the individual samples, with the exception of sample 4, SRR. This sample shows many more differentially expressed genes with the bigger pool, to which three different Lymphoblastoid cell-line samples are added. This suggests sample 4 differs proportionally much more from the three newly introduced samples than from the other samples in the original pool. In this study we will not go into further detail on this particular sample. Overall, we conclude that the total number of differentially expressed genes or splice variants found in the individual samples does not change greatly when the pool is enlarged, and that a pool of seven samples for both cell types suffices for this study.

Discovery Rate Bias: Number of Differentially Expressed Splice Variants or Genes Found Against Total Number of Reads of the Sample

The study by Wang [14] describes the importance of finding many transcripts in order to capture a large fraction of transcribed genes. The numbers of reads in the Cerebellum samples are all close to 2.5 million, and the Lymphoblastoid samples contain 4 million reads on average. The former has a standard deviation of 5% and the latter of 57%. This makes it reasonable to investigate whether the number of reads is sufficient and whether it influences the results. Wang et al. suggest a breaking point in the number of reads needed to find the total fraction of transcribed genes and splice variants: around 1 million reads to find ~100% of transcribed genes [14]. The smallest sample used in this study has 1.3 million reads (Lymphoblastoid), which should be enough, but we check to be sure.
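Two quantities drive this section: the spread of read counts expressed as a percentage of the mean, and the per-sample discovery rate (number of differentially expressed features divided by the total reads of the sample, as a percentage). The sketch below shows both under stated assumptions: the text does not say whether a population or a sample standard deviation was used, so the population form is assumed here, and all names and example numbers are hypothetical.

```python
from statistics import mean, pstdev


def relative_stdev_percent(read_counts):
    """Standard deviation of the read counts as a percentage of their mean.
    Assumption: population standard deviation (pstdev)."""
    return 100.0 * pstdev(read_counts) / mean(read_counts)


def discovery_rate_percent(n_differential, total_reads):
    """Differentially expressed features found per read, as a percentage."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * n_differential / total_reads


# Toy read counts with mean 5 and population standard deviation 2.
print(relative_stdev_percent([2, 4, 4, 4, 5, 5, 7, 9]))  # 40.0

# E.g. 29 differential splice variants in a sample of 2.5 million reads.
rate = discovery_rate_percent(29, 2_500_000)
print(f"{rate:.5f}")  # 0.00116
```

If the discovery rate keeps falling as reads are added, extra depth is no longer revealing new differential features at the same pace, which is the saturation argument made in this section.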

Figure 7 Discovery Rate Bias - Saturation. (a-f) Discovery rates vary among the samples (indicated as points in the graphs) and also differ when the pool size changes. Discovery rates of finding differential expression of genes or splice variants in Lymphoblastoid or Cerebellum samples are depicted as percentages: the total amount of differential expression per sample is divided by the total reads of the sample. Discovery rates of differential expression of splice variants and genes are shown for the pool of seven Cerebellum samples (a,b), the pool of seven Lymphoblastoid samples (c,d) and the pool of ten Lymphoblastoid samples (e,f).

In Figure 7 the discovery rate drops when the number of reads of the sample increases. This pattern is fairly clear for the Cerebellum discovery rates and much clearer for the discovery rates found in the Lymphoblastoid pools. It is important to look at the scales of the figures. The discovery rates are much smaller for Cerebellum samples in finding differentially expressed splice variants than for finding differentially expressed splice variants or genes in Lymphoblastoid samples. This suggests that a given number of reads is closer to sufficient for Cerebellum samples than for Lymphoblastoid samples. Only the three Lymphoblastoid samples with the most reads have roughly the same discovery rates and are therefore equally saturated. This suggests that the type of cell or tissue being sequenced makes a large difference to the number of reads needed to reach equal saturation levels. Saturation levels should be checked when acquiring data, and researchers should decide which saturation levels they find reasonable in a practical sense and account for that in their study. We find that saturation levels are better for the Cerebellum samples, and we expect that saturation is not optimal for the Lymphoblastoid samples.

Because we want to compare variability between individuals within different pools, we decided to go with two pools of seven samples, one of Cerebellum tissue samples and the other of Lymphoblastoid cell-line samples. We could not make the pool of Cerebellum samples bigger as no extra samples were available. We continued with the original pool of seven Lymphoblastoid samples to minimize confusion, although we would now choose the Lymphoblastoid samples with the most reads per sample, for the reasons explained above. Fortunately this would only mean swapping sample 5 with sample 8, as the three samples added to create the pool of ten are, together with sample 5, the four samples with the fewest reads.

Differential Splice Variant Variance

Further analysis of differential splicing between samples of Cerebellum tissue or Lymphoblastoid cell lines shows biological variance in more detail.

Figure 8 Occurrence of Differential Splice Variant Expression. The figure shows the ratio of differentially expressed splice variants of each cell type that is found in a given maximum number of samples.

Table 4 Occurrence of Differential Splice Variant Expression. The table shows the ratio of differentially expressed splice variants of each cell type that is found in a given maximum number of samples, and also shows in numbers how many differentially expressed splice variants are found in how many samples at the most.


As shown in Figure 8 and Table 4, most differentially expressed splice variants are differentially expressed in only a single sample. Cerebellum splice variants are found differentially expressed in 2.2 out of 7 samples on average, i.e. about 2. With an average of 3.4, Lymphoblastoid splice variants are often significantly differentially expressed in 3 out of 7 cell-line samples. In contrast to differentially expressed splice variants found in Cerebellum samples, differentially expressed splice variants in Lymphoblastoid samples are also found differentially expressed in 6 out of 7 or in all samples, relative to the group norm. A representation of the hits and junctions of the Lymphoblastoid samples for a gene where all seven samples differentially express splice variants is given in Figure 4. These results suggest more variation among Lymphoblastoid cells than among Cerebellum cells. Furthermore, we want to know how different the samples are from one another. The next figure and table shed more light on the relative differential splice variant expression.

Figure 9 Relative Differential Splice Variant Expression among Cerebellum tissue samples and Lymphoblastoid cell-line samples. For each of the samples one to seven, the ratio of samples in which an average differentially expressed splice variant from that sample is found is shown, with standard deviation error bars and the average ratio.

Table 5 Relative Differential Splice Variant Expression among Cerebellum tissue samples and Lymphoblastoid cell-line samples. The ratio of samples in which an average differentially expressed splice variant from the samples one to seven is found is shown in numbers.

Figure 9 and Table 5 show, for samples one to seven, the expected ratio of samples in which an average splice variant from the sample is found. For example, an average differential splice variant found in Cerebellum sample 1 is found in 32% of Cerebellum samples (which includes sample 1). This corresponds to 32% * 7 = ~2.2 samples, i.e. about two samples in total. The standard deviation (stdev) for Lymphoblastoid samples is more than six times larger than that of the Cerebellum samples.

Differential Gene Expression Variance

In contrast to Cerebellum tissue, many differentially expressed genes are found in the Lymphoblastoid samples: an average of 820 and 5022 respectively. We are interested in how these are distributed among the individual samples. See Figure 10 and Table 6.

Figure 10 Occurrence of Differential Gene Expression. The figure shows the ratio of differentially expressed genes of each cell type that is found in a given maximum number of samples.

Table 6 Occurrence of Differential Gene Expression. The table shows numerically how many differentially expressed genes are found in how many samples at the most.

For both Cerebellum tissue and Lymphoblastoid cells, most differentially expressed genes are found in only one sample, which means these differentially expressed genes seem to be unique to these samples. As explained earlier (Table 2), finding differential expression in seven samples often means finding expression of the gene or splice variant in just one of the samples and none in the remaining six. For Cerebellum tissue this means the differential expression found for all seven samples (18%) can often be attributed to only one of the seven samples, much like differential expression found in only one of the Cerebellum samples (58%). Together, ~75% of differentially expressed genes can thus be attributed to a single sample of the pool. Lymphoblastoid cells have a lower rate of differential expression attributable to just a single sample (57%). In comparison to Cerebellum tissue samples, differential expression at a particular locus is more often attributable to two (28%) or three (11%) samples. Note that in a pool of seven, differential expression of two samples at a particular locus means that five of the seven samples have expression close to the FPKM of the pool. The two differentially expressed samples can have an FPKM significantly lower, higher, or one higher and one lower than the FPKM of the pool.

Figure 11 Relative Differential Gene Expression among Cerebellum tissue samples and Lymphoblastoid cell-line samples. For each of the samples one to seven, the ratio of samples in which an average differentially expressed gene from that sample is found is shown, with standard deviation error bars and the average ratio.

Table 7 Relative Differential Gene Expression among Cerebellum tissue samples and Lymphoblastoid cell-line samples. The ratio of samples in which an average differentially expressed gene from the samples one to seven is found is shown numerically.

The standard deviations for Cerebellum and Lymphoblastoid cells are small, 3.6% and 4.7% respectively (Figure 11 and Table 7). This suggests all individual Cerebellum samples are about equally different from each other; the same goes for the Lymphoblastoid samples. This gives confidence that the results are reliable.
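The averages and ratios quoted in this section (for example a mean of 2.2 out of 7 samples, or 32% of seven samples being about 2.2 samples) follow from an occurrence histogram by simple weighted arithmetic. The sketch below shows that arithmetic on a hypothetical histogram; it is an assumption about how such summaries can be computed, not the authors' script, and the toy counts are not taken from the tables.

```python
def mean_occurrence(histogram):
    """Weighted mean of 'number of samples' over all differentially
    expressed features. histogram maps n_samples -> n_features."""
    total_features = sum(histogram.values())
    if total_features == 0:
        raise ValueError("histogram is empty")
    return sum(n * count for n, count in histogram.items()) / total_features


def ratio_to_samples(ratio, pool_size):
    """Convert a fraction of the pool into an expected number of samples."""
    return ratio * pool_size


# Hypothetical histogram: 6 features found in 1 sample, 2 in 2 samples, 2 in all 7.
toy_histogram = {1: 6, 2: 2, 7: 2}
print(mean_occurrence(toy_histogram))       # 2.4
print(round(ratio_to_samples(0.32, 7), 2))  # 2.24
```

Dividing the mean occurrence by the pool size gives the corresponding ratio of samples, which makes pools of different sizes comparable.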


Table 8 Differential Gene Expression Grouping of Lymphoblastoid Samples. The table shows how many times groups of Lymphoblastoid samples find differential gene expression together. The first three columns show results for genes where groups of two samples find differential gene expression, the next three where groups of three samples do, and the last three where groups of four samples do. Note that each gene shows up in only one of the columns, as genes are annotated by the maximum number of samples that find differential gene expression, as in Figures 10 and 11.

Table 8 shows how many times groups of samples find differential gene expression for the same genes. For example, samples 5 and 7 (5+7) uniquely find differential gene expression together for 1882 genes, for which the other samples find FPKM expression levels close to the pool FPKM. The most striking feature in the table is that ~50% of the genes that are found differentially expressed in three samples are in the group of samples 5+6+7. For virtually each of the 1882 genes the three samples show no expression and therefore show up as differentially expressed. Also, when looking at genes that are differentially expressed in two or four samples, these three samples stand out the most and clearly distinguish themselves from the other samples. Samples 1 to 4 do not seem to group together: as a group they account for only 0.33% of the differentially expressed genes found in four out of seven samples.
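The kind of tally behind Table 8 can be sketched as follows. This is a hedged Python illustration, not the procedure used to build the table. The function name `group_counts`, the input layout and the toy data are assumptions; only the "5+7" style of group label is taken from the text.

```python
from collections import Counter


def group_counts(de_calls, group_size):
    """Count how often each exact combination of `group_size` samples is
    differentially expressed together for the same gene."""
    counts = Counter()
    for samples in de_calls.values():
        # A gene is counted only under its full set of flagged samples,
        # so it appears in exactly one group-size column.
        if len(samples) == group_size:
            label = "+".join(str(s) for s in sorted(samples))
            counts[label] += 1
    return counts


# Hypothetical toy input: gene id -> samples flagged as differentially expressed.
toy_calls = {
    "g1": {5, 7},
    "g2": {5, 7},
    "g3": {5, 6},
    "g4": {5, 6, 7},
    "g5": {1, 2, 3},
    "g6": {2},
}

pairs = group_counts(toy_calls, 2)
triples = group_counts(toy_calls, 3)
share_567 = triples["5+6+7"] / sum(triples.values())
print(dict(pairs), dict(triples), share_567)
```

A group that stands out, as samples 5, 6 and 7 do in the text, would show up here as one label holding a large share of the counts for its group size.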


Table 9 Differential Gene Expression Grouping of Cerebellum Samples. The table shows how many times groups of Cerebellum samples find differential gene expression together. For example, samples 1 and 7 find differential gene expression together for 28 genes, for which the other samples find FPKM expression levels close to the pool FPKM. The first three columns show results for genes where groups of two samples find differential gene expression, the next three where groups of three samples do, and the last three where groups of four samples do. Note that each gene shows up in only one of the columns, as genes are annotated by the maximum number of samples that find differential gene expression, as in Figures 10 and 11.

Table 9 shows the same kind of data as Table 8, but for the Cerebellum samples. As no figures stand out in this table, there seems to be no clear grouping of samples.


Table 10 Differential Splice Variant Expression Grouping of Lymphoblastoid cell-line samples. The table shows how many times groups of Lymphoblastoid samples find differential splice variant expression together. For example, samples 5 and 7 find differential splice variant expression together for 51 splice variants, for which the other samples find FPKM values close to the pool FPKM. The first three columns show results for splice variants where groups of two samples find differential splice variant expression, the next three where groups of three samples do, and the last three where groups of four samples do. Note that each splice variant shows up in only one of the columns, as splice variants are annotated by the maximum number of samples that find differential splice variant expression, as in Figures 8 and 9.

Table 10 shows, as Table 8 does, that samples 5+7 and 5+6 group together well with regard to the differential splice variant expression they share. The sample group 5+6+7 also groups well when looking at differential splice variant expression in three samples. These three samples also group together well with sample 4.

Discussion

For every study it is important to know how to interpret the results. When a study uses few biological replicates, the results will probably say more about the individual samples than about the species in general. Because sequencing provides detailed information about samples, baseline variation will contribute to the variation in the results, but technical variation also needs to be investigated properly to provide context for the reliability of the results. A recent study has shown that biological variability in gene expression is not eliminated by sequencing technology 13, but little research has been done on baseline variability. With this research we want to contribute to understanding how baseline variability affects results in Next-Gen Sequencing.

In this study we want to understand how technical variation contributes to our results. We assess the methodology of our research and look for baseline variation in Cerebellum tissue samples and Lymphoblastoid cell-line samples. We do this by investigating in detail the differential expression of genes and splice variants between samples and pools of samples, and by comparing a pool of Cerebellum samples with a pool of Lymphoblastoid samples. We also look at how different pool sizes affect our results and how sequencing depth and saturation levels correlate.

Quantification and qualification

A lot of data about differential gene expression and splice variants has been obtained by Next-Gen Sequencing with the Illumina instrument in the two studies. This study shows that individual samples reflect in detail the state in which they were sampled, and that reads, differentially expressed splice variants and differentially expressed genes need to be put into biological context. A single sample provides enough data in terms of quantity, but for both qualitative and quantitative analyses context is highly important, and biological replicates should be included in all studies.

Technical Variation

Transcriptomes of different tissues were already known to be highly variable within the same individual, but this study shows that individual samples also have many unique characteristics when the differentially expressed splice variants and genes are examined in detail. We tried to minimize technical variation as much as possible by choosing our data carefully: all samples we use are deeply sequenced (more than 1 million reads per sample 14) and were produced by Illumina instruments. Nevertheless, we find that the saturation levels of our samples vary, so technical variation is not excluded as we had hoped. Read depth also varies from 1.2M to 7.8M reads per sample, which possibly introduces technical variation.

Saturation

Fig. 7 shows that discovery rates drop as the number of reads per sample increases. The read count is not the only variable in saturation; the type of tissue or sample that is investigated also makes a difference. To reach equal levels of saturation one would need many more reads for Lymphoblastoid samples than for Cerebellum samples. We suggest this is due to the larger biological variability of the Lymphoblastoid samples. When comparing different tissues, researchers should investigate the saturation levels of their samples, decide which saturation level they want, use that information to decide the sample sizes for each tissue individually, and keep read counts the same for individuals of the same tissue or cell type. Note that read counts should surpass the threshold required for finding genes, as suggested by Wang et al. 14.

Total Significant Differential Expression

We find much more differential expression of splice variants and genes for Lymphoblastoid cell-line samples than for Cerebellum tissue samples, as shown in Table 1. This also holds for samples with about the same number of reads. Even the Lymphoblastoid samples that have fewer reads than the Cerebellum samples (5, 9 and 10) show many times more differential expression. The Lymphoblastoid samples with the most reads even show less differential expression than the other Lymphoblastoid samples, for example samples 1, 2 and 3 against samples 5, 6 and 7.

Pool Size

Pool size seems to matter little when comparing a pool of seven Lymphoblastoid samples to a pool of ten Lymphoblastoid samples. One sample, sample 4, does clearly find more differential gene expression with the larger pool. We decided to continue the study with the original group of seven Lymphoblastoid samples, as this is also the size of the Cerebellum pool, which we cannot increase due to a lack of available data. We suggest that sample 4 is, relative to the other six samples in the pool of seven, more different from the newly introduced samples in the larger pool, and that we therefore find more differential gene expression for this sample.

Differential Splice Variant Expression

Figures 6, 8 and 9 show that Lymphoblastoid samples have more individual variability than Cerebellum samples, as total differential splice variant expression is substantially higher, although differentially expressed splice variants are less unique to individual samples.
The Lymphoblastoid samples also vary in the ratio of samples in which an average differentially expressed splice variant from a given sample is found. This suggests that some samples are more alike than others. For the Lymphoblastoid samples we investigate this further, but for the Cerebellum cells we do not, as too few differentially expressed splice variants are found in total. In Table 10 we find grouping of Lymphoblastoid samples. Samples 5, 6 and 7 seem to find, relative to the other groups, quite a few differentially expressed splice variants in common. This becomes clearer when differential gene expression is investigated in detail.

Differential Gene Expression

The Cerebellum samples find fewer differentially expressed genes than the Lymphoblastoid samples, and ~75% of their differentially expressed genes can be uniquely attributed to individual samples. In total the Cerebellum samples find about one-tenth as many differentially expressed genes (2212) as the Lymphoblastoid samples (21695), with a FDR of 0.05%. This suggests the Cerebellum samples are less biologically variable than the Lymphoblastoid samples. Also, when looking in detail at Table 9, there do not appear to be any subgroups.

Things are different for the Lymphoblastoid samples. Not only do they find many more differentially expressed genes than the Cerebellum tissue samples, Table 8 also shows three samples to be much alike: samples 5, 6 and 7. Unfortunately we do not clearly find another group among these samples. If samples 1 to 4 made up one group, we would expect combinations of these samples to share a large ratio of differentially expressed genes in subgroups, but we do not find this. Possibly these remaining four samples vary too much, genotypically speaking, to stand out in this table. Biological context might explain the biological variance we find in the Lymphoblastoid samples. They are known as cells with stem-cell-like behavior that differentiate into three different types of cells during lymphopoiesis: B lymphocytes, T lymphocytes and Natural Killer cells (large granular lymphocytes) 23. The cells in the samples could already be in the process of differentiating into one of these cell types, perhaps only genotypically.

Further studies

In this study we only investigated Lymphoblastoid cells and Cerebellum tissue, so our suggestions are, next to known biological context, mainly based on the comparison between the two. Further studies should include more cell types and at least as many, preferably more, biological replicates per type. Sequencing depth should be chosen with the desired saturation level for finding differential expression in mind.

Conclusions

We have found a higher biological variance in the Lymphoblastoid cell-line samples than in the Cerebellum tissue samples. Arguably Cerebellum cells are subject to more regulation, as we find that ~75% of differential gene expression can be attributed to just one of the samples, noting that total differential gene expression is about one-tenth of what is found in the Lymphoblastoid samples.
For the Cerebellum tissue samples this gives an idea of baseline variability, although in this study we cannot investigate the functional relevance of the differential expression found in the individual samples. The Lymphoblastoid cells in our sample are less alike, although three of the seven samples (samples 5, 6 and 7) show a strong similarity in differential expression of genes and splice variants. These three samples had the most sequencing depth of all samples and possibly find most differential expression together because of this, which would confirm that technical variance plays a significant role in RNA sequencing. Another possibility is that this variance can be explained by the biological context that Lymphoblastoid cells are stem-cell like: they normally differentiate into three subgroups, B lymphocytes, T lymphocytes and Natural Killer cells. Because this biological context possibly plays a role in the Lymphoblastoid samples, we cannot go into detail about the baseline variability of these cells.

Concluding this research, we find that technical variance and biological variance cannot be distinguished with the particular set-up of our study. We find that sequencing read depth should be adjusted specifically to cell type to provide equal saturation levels for all samples. Biological replicates should also be numerous enough to support general conclusions, as subgroups may be present, as we found among the Lymphoblastoid cell-line samples.

Supplementary information

Supplementary 1 Perl script for characterization of differential expression

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Count how many consecutive rows share the same value (here: gene_id)
    # and print every row together with the size of its run.
    open(my $fh, '<', 'data.txt') or die "could not open file!\n";

    my @same     = ();   # rows of the current run
    my $cnt_line = 0;    # number of rows in the current run

    while (my $line = <$fh>) {
        chomp $line;
        if (@same && $line ne $same[-1]) {
            # value changed: report the finished run and start a new one
            print "$_\t$cnt_line\n" for @same;
            @same     = ();
            $cnt_line = 0;
        }
        push(@same, $line);
        $cnt_line++;
    }
    # report the last run, which no following row would otherwise trigger
    print "$_\t$cnt_line\n" for @same;
    close($fh);

This is the Perl script used to characterize differential expression. The script reads one value per row, in our case the gene_id number. For each row it checks whether the value equals that of the previous row. If so, it adds 1 to the count of the current run; if not, it prints every row of the finished run together with its count, resets the count and starts counting again. It continues like this through the file. In the listing as originally extracted, the last run was never printed because no following row triggered the output, so the last rows had to be checked and counted manually; the version above prints the last run after the loop ends. Values are compared as strings (ne), so non-numeric gene identifiers are also handled.
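The counting described for Supplementary 1 can also be sketched in a few lines of Python. This is an illustrative equivalent, not the script used in the study. The function name `annotate_run_lengths` and the example identifiers are assumptions. Because `itertools.groupby` yields every run, including the last one, no manual check of the final rows is needed.

```python
from itertools import groupby


def annotate_run_lengths(values):
    """Return (value, run_length) for every row, where run_length is the
    number of consecutive rows that share that value."""
    annotated = []
    for _, run in groupby(values):
        rows = list(run)
        annotated.extend((row, len(rows)) for row in rows)
    return annotated


# Hypothetical gene_id column, sorted so that equal ids are adjacent.
rows = ["101", "101", "101", "102", "103", "103"]
result = annotate_run_lengths(rows)
for value, run_length in result:
    print(f"{value}\t{run_length}")
```

As with the Perl script, the input must be sorted or grouped by gene_id beforehand, since only consecutive equal rows are counted together.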

References

1 Ozsolak F., Milos P. M. RNA sequencing: advances, challenges and opportunities. Nature Genetics 12 (2010)
2 Marioni J. C., et al. RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Res 18 (2009)
3 Mortazavi A., Williams B. A., McCue K., Schaeffer L., Wold B. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nat Methods 5 (2008)
4 Nagalakshmi U., et al. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320 (2008)
5 Sultan M., et al. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science 321 (2008)
6 Wilhelm B. T., et al. Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature 453 (2008)
7 Kato M., de Lencastre A., Pincus Z., Slack F. J. Dynamic expression of small non-coding RNAs, including novel microRNAs and piRNAs/21U-RNAs, during Caenorhabditis elegans development. Genome Biology 10, R54 (2009)
8 Daines B., Wang H., Wang L., et al. The Drosophila melanogaster transcriptome by paired-end RNA sequencing. Genome Res 21 (2011)
9 Morin R. D., O'Connor M. D., Griffith M., et al. Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells. Genome Res 18 (2008)
10 Shin H., Lee H., Fejes A. P., Baillie D. L., Koo H., Jones S. J. M. Gene expression profiling of oxidative stress response of C. elegans aging defective AMPK mutants using massively parallel transcriptome sequencing. BMC Research Notes 4, 34 (2011)
11 Twine N. A., Janitz K., Wilkins M. R., Janitz M. Whole transcriptome sequencing reveals gene expression and splicing differences in brain regions affected by Alzheimer's disease. PLoS ONE 6(1), e16266 (2011)
12 Han X., Wu X., Chung W., Li T., Nekrutenko A., Altman N. S., Chen G., Ma H. Transcriptome of embryonic and neonatal mouse cortex by high-throughput RNA sequencing. PNAS 106(31) (2009)
13 Hansen K. D., Wu Z., Irizarry R. A., Leek J. T. Sequencing technology does not eliminate biological variability. Nature Biotechnology 29(7) (2011)
14 Wang E. T., Sandberg R., Luo S., Khrebtukova I., Zhang L., Mayr C., Kingsmore S. F., Schroth G. P., Burge C. B. Alternative isoform regulation in human tissue transcriptomes. Nature 456 (2008)
15 Pickrell J. K., Marioni J. C., Pai A. A., Degner J. F., Engelhardt B. E., Nkadori E., Veyrieras J., Stephens M., Gilad Y., Pritchard J. K. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 464 (2010)
16 Trapnell C., Pachter L., Salzberg S. L. TopHat: discovering splice junctions with RNA-seq. Bioinformatics 25(9) (2009)
17 Langmead B., et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009)
18 Burrows M., Wheeler D. A block sorting lossless data compression algorithm. Technical Report 124, DEC, Digital Systems Research Center, Palo Alto, California (1994)
19 Ferragina P., Manzini G. An experimental study of an opportunistic index. Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms. Washington, D. C., USA (2001)
20 Hillier L. W., et al. Whole-genome sequencing and variant discovery in C. elegans. Nat. Meth. 5 (2008)
21 Li H., et al. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18 (2008)
22 Haas B. J., et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31 (2003)
23 Wikipedia, Lymphopoiesis (as of June 17, 2011, 16:04 GMT)
