A Statistical Framework for Classification of Tumor Type from microRNA Data


DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2016

A Statistical Framework for Classification of Tumor Type from microRNA Data

JOSEFINE RÖHSS

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES

A Statistical Framework for Classification of Tumor Type from microRNA Data

JOSEFINE RÖHSS

Master's Thesis in Mathematical Statistics (30 ECTS credits)
Master Programme in Applied and Computational Mathematics (120 credits)
Royal Institute of Technology, 2016
Supervisor at Moderna Therapeutics: Hugh Salter
Supervisor at KTH: Timo Koski
Examiner: Timo Koski

TRITA-MAT-E 2016:46
ISRN-KTH/MAT/E--16/46-SE

Royal Institute of Technology
SCI School of Engineering Sciences
KTH SCI
SE Stockholm, Sweden

Abstract

Hepatocellular carcinoma (HCC) is a type of liver cancer with a low survival rate, not least due to the difficulty of diagnosing it at an early stage. The objective of this thesis is to build a random forest classification method based on microRNA (and messenger RNA) expression profiles from patients with HCC. The main purpose is to be able to distinguish between tumor samples and normal samples by measuring the miRNA expression. If successful, this method can be used to detect HCC at an earlier stage and to design new therapeutics. The microRNAs and messenger RNAs which have a significant difference in expression between tumor samples and normal samples are selected for building random forest classification models. These models are then tested on paired samples of tumor and surrounding normal tissue from patients with HCC. The results show that the classification models built for classifying tumor and normal samples have high prediction accuracy and hence show high potential for using microRNA and messenger RNA expression levels for diagnosis of HCC.

Keywords: miRNA, mRNA, random forest, classification, HCC, diagnosis

Acknowledgements

First of all, I would like to thank my supervisor at Moderna Therapeutics, Hugh Salter, who has given me the opportunity to write this thesis and introduced me to this interesting field. Thank you for all your enthusiasm for this project and all the help you have given me during the process. To all the people at Wallenberg Applied Bioinformatics Infrastructure (WABI) at SciLifeLab, and especially Pär Engström, thank you for inviting me to your group and for helping me with the microRNA data. I would also like to thank my supervisor at the Department of Mathematical Statistics at the Royal Institute of Technology, Professor Timo Koski, for his feedback during the process. Finally, many thanks to Peter Modin for all the proofreading, mental support and for discussing this subject with me so many times.

Stockholm, June 2016
Josefine Röhss

Contents

1 Introduction
2 Biological Background
    2.1 Messenger RNA
    2.2 MicroRNA
    2.3 Correlation between microRNA and mRNA
    2.4 RNA sequencing
    2.5 Modified RNA
3 Theoretical Background
    3.1 Normalization methods
    3.2 T-test and Fold Change
    3.3 Pearson correlation
    3.4 Multiple testing
    3.5 Random Forest
    3.6 Sensitivity, specificity and positive predictive value
4 Method
    4.1 Data preparation
    4.2 Data analysis
5 Results
6 Discussion
    6.1 Performance of the Model
    6.2 Selection of Method
    6.3 Future Studies
A Distribution plots
B Results random forest
    B.1 Case 1: Tumor and normal samples
    B.2 Case 2: High and low AFP

Chapter 1

Introduction

Hepatocellular carcinoma (HCC) is the sixth most common type of cancer worldwide and one of the main causes of cancer-related deaths [1]. The main reason for the large number of deaths is late detection of the tumor, and hence diagnosis at a late stage of disease, at which time survival is measured in months [2]. The overall 5-year survival rate from the time of diagnosis of HCC is 5-9% [1]. Currently no universal guidelines for diagnosis exist, but the serum marker alpha-fetoprotein (AFP) is commonly used for diagnostics and surveillance [3], since the majority of HCC patients have an increased level of AFP [4]. However, AFP is not an effective marker when used alone: far from all patients show increased levels of AFP at an early stage of disease, and other chronic liver diseases such as hepatitis C (HCV) can also cause increased AFP levels [5]. Since earlier diagnosis is crucial for obtaining a higher survival rate, much research focuses on diagnostic strategies to identify early HCC [3].

Recent studies have shown that microRNA (miRNA) and messenger RNA (mRNA) expression levels can be used for detection of different types of cancer. Classification of human tumors using miRNA expression profiles is currently an area of intense research, and several classifiers have been developed for different types of tumors [6]. There is, however, no current study of expression levels of miRNA and mRNA that takes the relationship between these into account. Since miRNA and mRNA are not independent measures, the correlation between them has to be included in the analysis [7]. In this thesis, we use the expression levels of miRNA and mRNA from two different conditions to build a classification method that can predict a condition given expression values from a patient with HCC.
We compare the results from four different inputs to the classification method: miRNA expression levels, mRNA expression levels, both miRNA and mRNA expression levels, and both miRNA and mRNA expression levels after taking the potential correlation between these into account.

Two different pairs of conditions are investigated. The first question is whether the classification method can predict if the observed expression profiles come from a tumor sample or a non-tumorous sample. The hope is that the resulting method can be used to determine the condition of an undiagnosed patient based on the expression profile of miRNA and mRNA. The second question is whether there is a connection between the AFP level and the expression profiles of miRNA and/or mRNA. Our hope is that the classification model with conditions (response variables) "high AFP level" and "low AFP level" will be able to predict the AFP level based on a specific miRNA (or mRNA) pattern. Since AFP is a diagnostic tool that is currently in use, and since miRNA can be used to identify drug targets, connecting the AFP level to a certain miRNA pattern can be a step towards personalized treatment of HCC.

Chapter 2

Biological Background

2.1 Messenger RNA

Messenger RNAs (mRNA) are single-stranded RNA molecules whose main purpose is to carry the genetic information transcribed (copied) from DNA in the nucleus to the site of protein synthesis in the ribosomes [8]. On the ribosomes the information is then translated into protein by "reading" the mRNA sequence and, according to this, using amino acids to build the proteins [9]. The transcription of DNA to mRNA is also called gene expression. The single-stranded RNAs obtained from the transcription of the DNA are called transcripts, and the transcriptome is the full range of these transcripts in a cell [8]. Hence, by studying the transcriptome, we learn more about the expression of genes, which increases our understanding of development and disease [10].

2.2 MicroRNA

MicroRNAs (miRNA) are single-stranded, non-coding small RNA molecules with an approximate length of 22 nucleotides. miRNAs regulate gene expression in many cellular processes by binding to the 3′ untranslated region (a section located at the 3′ end of the mRNA that often contains regulatory regions) of targeted mRNA and thereby repressing mRNA translation [11]. miRNA genes can be found in many different places in the genome, often in regions that historically have been referred to as "junk DNA", such as introns, because their function was unknown [12]. Since the discovery of miRNA in these regions, several studies comparing the expression of miRNA in tumor tissue and corresponding non-tumorous tissue have been published for different types of tumors. These studies have indicated that the expression of some miRNAs is altered in specific tumors, implying that miRNA may be involved in the development of cancer and other diseases [12]. Because of this, miRNA expression profiles have been suggested as possible biomarkers for identifying tumors [11].

2.3 Correlation between microRNA and mRNA

Since miRNAs repress translation of specific targeted mRNAs, there are associations between the expression profiles of miRNAs and their target mRNAs [14], making them dependent measures. An increase or decrease in the level of a miRNA causes the opposite reaction in the level of the target mRNA, resulting in a correlation between their expressions [7]. If both miRNA and mRNA expression profiles are used for building a classification method, the correlation between significantly differentially expressed miRNA and mRNA from the same class (e.g. tumor and normal samples) has to be taken into account. The correlation can be computed by pairwise Pearson correlation between all possible pairs of miRNA and mRNA from the same class. This results in a multiple testing problem which also has to be corrected for. More about the computation of

Pearson correlation and correction of multiple testing is explained in the chapter Theoretical Background.

2.4 RNA sequencing

The cellular transcriptome constantly changes depending on several factors such as environment, development and disease. Studying the variations of the transcriptome in response to specific conditions or treatments can give a deeper understanding of how changes in transcriptional activity can reflect or contribute to disease [15]. RNA sequencing (RNA-seq) is one of the most efficient tools for examining the variations of the transcriptome and has been revolutionizing the study of gene expression profiles [16]. RNA-seq is useful for detecting both known and novel features, and there are many different methods depending on the purpose of the experiment, such as sequencing mRNA, ribosome profiling or sequencing of small RNA species such as miRNA [16]. In the process of RNA-seq, mRNA (and other RNAs) are converted to cDNA (a DNA that is complementary to the mRNA strand) that is used as the input to a sequencing library preparation. The process is explained in more detail below for microRNA sequencing.

2.4.1 microRNA sequencing

microRNA sequencing (miRNA-seq) is an RNA-seq method for microRNA. This method has become very popular and is useful e.g. for studying expression profiles of miRNAs and discovering novel miRNAs. The process of miRNA-seq includes isolation of the miRNAs and construction of multiplexed miRNA libraries (collections of sequences). There are many different sequencing platforms which require different protocols, but they typically follow similar steps [17]. These steps are described below and can be seen in Figure 2.1.

1. Isolation of total RNA containing miRNA. In a given sample, all the RNA is extracted and isolated. There are several different methods for accomplishing this and more information can be found in [17].

2. Adapter ligation. For every RNA sample obtained from the isolation, DNA adapters are added to both ends of the small RNAs.
miRNAs are distinguished from other small RNAs, such as RNA degradation products, by a 5′ phosphate and a 3′ hydroxyl group. Based on these groups, the 3′ adapter that is ligated to the small RNAs is designed for capturing microRNAs. After that, miRNAs are ligated to a 5′ adapter which contains the binding site for the sequencing primer [17].

3. Reverse transcription to generate cDNA libraries. The resulting adapter-ligated miRNAs are reverse transcribed with a primer complementary to the 3′ adapter to generate a cDNA library [18].

4. Amplification by PCR. To provide a sufficient quantity for sequencing, the cDNA library is amplified by PCR. The amplified library is gel extracted and quality-checked prior to sequencing [17]. To sequence multiple libraries at the same time, each miRNA library can be barcoded by implementing a unique tag sequence as part of the adapter or PCR amplification primer during construction [18].

5. Sequencing. The sequencing process differs among platforms. The Illumina platform, which is used in this project, uses free nucleotides which build up sequences matching the reads. These nucleotides are fluorescently labeled, with a specific color corresponding to each base (A,

C, G, T), so that the color of each added nucleotide can be observed and hence reveals the sequence of bases [19]. The output from Illumina sequencing is a set of FASTQ files containing all the reads obtained and the corresponding quality information from the sequencing [20]. This data then has to be processed (adapter sequences removed and reads mapped against the genome) and normalized before any analyses can be performed.

Figure 2.1: The process of RNA-seq

2.4.2 FASTQ

FASTQ is a text-based format used for storing nucleotide sequences and the associated quality scores. A FASTQ file from a sequencing run consists of 4 lines for each read. The first line starts with "@" and is followed by a unique identifier for the read, optionally followed by some additional sequence description [17]. The second line contains the read as a raw sequence of nucleotides. The third line starts with "+" and may be followed by the same identifier and additional information as in the first line. The last line contains one quality score for each nucleotide of the sequence in the second line, denoting the probability that the corresponding base is sequenced incorrectly [21]. An example of a FASTQ file can be seen in Figure 2.2.

Figure 2.2: An example of four lines, representing one read, in a FASTQ file. This is a sequence of length 35 with identifier "HCC.4". The range of the encoded quality scores depends on the software used for sequencing; in this figure Illumina sequencing has been used. Here "!" represents the lowest quality while "5" is in the middle of the range of scores [22].

2.5 Modified RNA

RNA modifications are changes to the chemical composition of RNA which can result in altered function or stability. Current studies are evaluating the possibility of using modified RNA (modRNA) for treatment of a broad range of diseases, since RNA-based therapeutics make it possible to target proteins involved in diseases that have been difficult to target with existing medicines [13]. Binding sites for specific miRNAs can be incorporated into a modRNA. This restricts the expression of the therapeutic RNA to cells lacking that specific miRNA. Hence, if a miRNA is found to be highly expressed in non-tumorous tissue but lacking in the tumors, a modRNA containing binding sites for that miRNA will be selectively targeted to the tumor cells relative to the non-tumorous cells.
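The four-line record structure described in Section 2.4.2 can be illustrated with a minimal parsing sketch. It is not the processing pipeline used in the thesis: the names `FastqRecord`, `parse_fastq` and `phred_to_error_prob` are hypothetical, and the quality decoding assumes the Phred+33 encoding, under which "!" is the lowest score. Other encodings would need a different offset.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    header: str    # text after the leading "@" (identifier and optional description)
    sequence: str  # raw nucleotide sequence
    quality: str   # one encoded quality character per base


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record for every group of four non-empty lines."""
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per read")
    for i in range(0, len(cleaned), 4):
        header, seq, plus, qual = cleaned[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality strings differ in length")
        yield FastqRecord(header[1:], seq, qual)


def phred_to_error_prob(char: str, offset: int = 33) -> float:
    """Convert one quality character to an error probability.

    Assumes Phred scaling, Q = -10 * log10(p), with an ASCII offset of 33
    (an assumption of this sketch, not a statement about the thesis data).
    """
    q = ord(char) - offset
    return 10 ** (-q / 10)
```

Under this assumed encoding, "!" decodes to an error probability of 1 and "5" to 0.01, which agrees with the ordering of the two characters described in the caption of Figure 2.2.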

Chapter 3

Theoretical Background

3.1 Normalization methods

In order to discover important changes in expression from RNA-seq raw data, normalization of the data is an essential step in the preprocessing. The aim of normalization is to minimize systematic technical effects and experimental variation by correcting for differences in sample size and sequence lengths. Normalization of expression data is important in order to be able to compare different samples with each other and to ensure that a gene with the same expression level in two samples is not detected as differentially expressed [23]. There are several different methods for normalization of RNA-seq data and currently no universal guidelines exist for when to use which method. Since miRNA is a relatively new field of research, normalization methods were first developed for mRNA data. Some of these methods are applicable to miRNA, while others have been modified or developed for miRNA data [23]. In this thesis two different methods are used, which are further described below.

3.1.1 Reads Per Kilobase per Million reads (RPKM)

The most common normalization for expression data is to normalize to library size, since two samples with different library sizes would produce proportionally different numbers of reads for the same gene [24]. Current RNA-seq analysis methods often use the total number of reads to normalize data between samples. This is done by scaling the number of reads in each library to a common value across all the libraries in the experiment [25]. Reads per kilobase per million reads (RPKM) is a simple method for library size scaling that divides the number of reads for a gene by the total library size [24]. The RPKM method also divides the number of reads by the length of the transcript in kilobases (number of base pairs divided by 1000), since longer transcripts are more likely to have sequences mapped to their region.
This results in a higher number of reads, which creates bias in comparisons between transcripts of different lengths [24].

Definition 3.1 (RPKM) Let $C_R$ be the total number of reads mapped to a specific mRNA, $R$, in a sample. Further, let $L_R$ be the transcript length of $R$ in base pairs and let $N$ be the total number of reads in the sample. Then the measure reads per kilobase per million reads (RPKM) for $R$ is defined as

$$\mathrm{RPKM}_R = \frac{C_R \cdot 10^9}{L_R \, N} \tag{3.1}$$

where the factor $10^9 = 10^3 \cdot 10^6$ converts the length to kilobases and the library size to millions of reads.

3.1.2 The Trimmed Mean of M-values normalization method (TMM)

Library size scaling methods, such as RPKM, are too simple for some biological applications [25]. One example is if a large number of mRNAs (or miRNAs) is unique to one of the experimental

conditions (e.g. tumor or normal sample), or if there is a small number of highly expressed outliers. This affects the normalization of the remaining mRNAs (or miRNAs), and if the normalization method does not adjust for it, the analysis of the expression profiles can be skewed towards one of the experimental conditions [25]. There are some more advanced normalization methods that account for the sampling properties of the RNA-seq data and hence prevent the problem of a skewed normalization. One such method is the trimmed mean of M-values normalization method (TMM).

Definition 3.2 Assume that we have a library $k$ with a total number of $N_k$ reads. Define $Y_{gk}$ as the observed number of reads for gene $g$ in library $k$, $\mu_{gk}$ as the true, unknown expression level and $L_g$ as the length of gene $g$. The expected value of $Y_{gk}$ is then defined as

$$E[Y_{gk}] = \frac{\mu_{gk} L_g}{S_k} N_k \tag{3.2}$$

where

$$S_k = \sum_{g=1}^{G} \mu_{gk} L_g \tag{3.3}$$

Here, $S_k$ is the unknown total RNA output of a sample.

Since the expression levels and true lengths of every gene are unknown, the total RNA production, $S_k$, cannot be estimated directly. Instead, the relative RNA production of two samples, $f_k = S_k / S_{k'}$, is used. The TMM normalization method compares the overall expression levels of genes between samples under the assumption that the majority of the genes are not differentially expressed. The ratio of RNA production is estimated by using a weighted trimmed mean of the log expression ratios (trimmed mean of M-values, TMM). To compute normalization factors across several samples, one sample is selected as a reference and then the TMM factor is calculated for each non-reference sample [25].
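
Definition 3.3 (TMM) The TMM normalization factor for sample $k$ using reference sample $r$ is calculated as

$$\log_2\left(\mathrm{TMM}_k^{(r)}\right) = \frac{\sum_{g} w_{gk}^{r} M_{gk}^{r}}{\sum_{g} w_{gk}^{r}} \tag{3.4}$$

where the sums run over the genes that remain after trimming, the weights $w_{gk}^{r}$ are computed by

$$w_{gk}^{r} = \frac{N_k - Y_{gk}}{N_k Y_{gk}} + \frac{N_r - Y_{gr}}{N_r Y_{gr}} \tag{3.5}$$

and $M_{gk}^{r}$ is the log expression ratio of gene $g$ in library $k$ relative to reference library $r$, defined for $Y_{gk} > 0$ and $Y_{gr} > 0$ as

$$M_{gk}^{r} = \log_2\left(\frac{Y_{gk} / N_k}{Y_{gr} / N_r}\right) \tag{3.6}$$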
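The two normalization ideas of Section 3.1 can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation used in the thesis. In `rpkm` the transcript length is taken in base pairs, the unit for which the factor 10^9 yields reads per kilobase per million reads. `simple_tmm_factor` is a deliberately simplified, unweighted variant of TMM: it computes the M-values of equation 3.6 and takes a plain trimmed mean, omitting the weights of equation 3.5 and any trimming on absolute expression. The trim fraction of 0.3 and all function names are assumptions of this sketch.

```python
import math
from typing import List, Sequence


def rpkm(count: float, length_bp: float, total_reads: float) -> float:
    """Reads per kilobase per million reads, with the length in base pairs."""
    return count * 1e9 / (length_bp * total_reads)


def m_values(y_k: Sequence[float], y_r: Sequence[float]) -> List[float]:
    """Log2 ratios of library-size-scaled counts (sample k vs reference r).

    Genes with a zero count in either library are skipped, since the
    log ratio is undefined for them.
    """
    n_k, n_r = sum(y_k), sum(y_r)
    return [
        math.log2((a / n_k) / (b / n_r))
        for a, b in zip(y_k, y_r)
        if a > 0 and b > 0
    ]


def trimmed_mean(values: Sequence[float], trim: float = 0.3) -> float:
    """Mean after discarding the lowest and highest `trim` fraction."""
    ordered = sorted(values)
    cut = int(len(ordered) * trim)
    kept = ordered[cut:len(ordered) - cut]
    if not kept:
        raise ValueError("no values left after trimming")
    return sum(kept) / len(kept)


def simple_tmm_factor(y_k: Sequence[float], y_r: Sequence[float],
                      trim: float = 0.3) -> float:
    """Unweighted trimmed mean of M-values, returned on the linear scale."""
    return 2 ** trimmed_mean(m_values(y_k, y_r), trim)
```

For example, 100 reads mapped to a transcript of 2000 base pairs in a library of one million reads give an RPKM of 50. Two libraries with identical gene proportions give a factor of 1, since every M-value is zero.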

3.2 T-test and Fold Change

The t-test and fold change (FC) are two methods that are commonly used for significance testing in expression profile analysis. The two methods have different benefits and disadvantages, and it has therefore become increasingly common to apply the thresholds of both the t-test and FC in order to increase the confidence in a statistically significant result [26].

3.2.1 T-test

Two-population t-tests for equal means are statistical hypothesis tests which are commonly used for determining whether the means of two populations differ significantly from each other [27]. The first step in the process of the t-test is to formulate the hypothesis to be tested, called the null hypothesis ($H_0$), and an appropriate alternative hypothesis ($H_1$). A common null hypothesis is that the means of two populations are equal. Since we will reject the null hypothesis both if the difference is larger than zero and if it is less than zero, this is called a two-tailed hypothesis test. After formulating the hypothesis, a significance threshold $\alpha$ has to be decided in advance (usually $\alpha = 0.05$ or $\alpha = 0.01$). In order to know when to reject $H_0$, a test statistic is computed, and since there are different types of t-tests depending on the relation between the samples, the test statistic can be computed with different formulas. A p-value can be computed for any test statistic and is defined as the probability of obtaining a value at least as extreme as the observed one, given that the null hypothesis is true. The null hypothesis is rejected if the p-value is lower than the significance threshold $\alpha$.

In this thesis, two types of t-tests are used, one for independent samples and one for paired samples. Both test the null hypothesis that two samples have identical means under the assumptions that the samples are normally distributed and have equal variance.
Consider two populations with means $\mu_1$ and $\mu_2$ and standard deviations $\sigma_1$ and $\sigma_2$, where we want to test the null hypothesis that the means are equal, that is, $H_0: \mu_1 = \mu_2$ and $H_1: \mu_1 \neq \mu_2$. Let $X_1$ and $X_2$ be two samples of size $n$ drawn from these populations with sample means $\bar{X}_1$ and $\bar{X}_2$ and sample standard deviations $S_1$ and $S_2$. If $X_1$ and $X_2$ are independent, the test statistic ($T$) computed in the t-test is defined as in equation 3.7.

Definition 3.4 (Test statistic for independent samples) Consider the two samples $X_1 = \{x_1^{(1)}, x_1^{(2)}, \ldots, x_1^{(n)}\}$ and $X_2 = \{x_2^{(1)}, x_2^{(2)}, \ldots, x_2^{(n)}\}$ with sample means $\bar{X}_1$ and $\bar{X}_2$ and sample standard deviations $S_1$ and $S_2$. The test statistic ($T$) for testing the null hypothesis $H_0: \mu_1 = \mu_2$ against the alternative hypothesis $H_1: \mu_1 \neq \mu_2$ is defined as

$$T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{S_1^2 + S_2^2}{n}}} \tag{3.7}$$

If the two samples are related (have matched observations), they are no longer independent and the test statistic has to be computed in another way. For two dependent samples $X_1$ and $X_2$, the difference between each pair of matched observations is first computed. The new sample, $X_D$, consisting of the differences is treated as a random sample of size $n$ from a population with mean $\mu_D$. The new null hypothesis for equal means between the populations is then $H_0: \mu_D = 0$ and the test statistic is defined as in equation 3.9.

Definition 3.5 (Test statistic for dependent samples) Consider two dependent samples $X_1 = \{x_1^{(1)}, x_1^{(2)}, \ldots, x_1^{(n)}\}$ and $X_2 = \{x_2^{(1)}, x_2^{(2)}, \ldots, x_2^{(n)}\}$. Define the new sample $X_D$ as the difference between the samples $X_1$ and $X_2$

$$x_D^{(i)} = x_1^{(i)} - x_2^{(i)}, \quad i = 1, \ldots, n \tag{3.8}$$

with sample mean $\bar{X}_D$ and sample standard deviation $S_D$. The test statistic ($T$) for testing the null hypothesis $H_0: \mu_D = 0$ against the alternative hypothesis $H_1: \mu_D \neq 0$ can then be computed by

$$T = \frac{\bar{X}_D}{S_D / \sqrt{n}} \tag{3.9}$$

3.2.2 Fold Change

Fold change (FC) is a measure describing how much a value changes between data sets and is often used in biological applications such as comparing the mRNA expression level between two conditions. It is rarely used outside biological applications since it is more heuristic than the t-test. The FC evaluates the ratio between observations (mRNA expressions) of two different conditions, and all mRNAs that differ by more than a predefined threshold are considered differentially expressed [28]. A common threshold is FC = 2. This makes $\log_2(\mathrm{FC})$ a useful measure, especially for graphics, since the absolute value of both a two-fold decrease and a two-fold increase in expression level is equal to 1 at the cutoff.

Definition 3.6 (Fold Change) For two samples $X$ and $Y$, the fold change (FC) is computed by

$$\mathrm{FC} = \frac{X}{Y} \tag{3.10}$$

Then the log2-fold change is defined as

$$\log_2(\mathrm{FC}) = \log_2(X) - \log_2(Y) \tag{3.11}$$

3.2.3 T-test and Fold Change for testing differential expression

The first analyses of expression data only used fold change to measure differential expression, usually with a fold change of 2 being the cutoff [26]. However, since fold change cutoffs do not take the variance of the samples into account, this is not a suitable method to use alone [28]. Instead it became common to use a statistical test such as the t-test, but these methods have disadvantages as well. If an mRNA has a very low estimated sample variance, the t-statistic becomes large and the mRNA may be falsely selected as significant [28]. Hence, the most common requirement now is that differentially expressed genes satisfy both the p-value and the fold-change criteria simultaneously [26].
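The test statistics and the fold change of Section 3.2 can be sketched as follows. This is an illustrative sketch rather than the analysis code of the thesis. The independent-sample statistic follows equation 3.7 for two samples of equal size, and the paired statistic uses the standard deviation of the pairwise differences, which is the standard form of the paired t-test. Converting a statistic to a p-value requires the t-distribution and is left out here. Applying the fold change to the group means is an assumption of this sketch, as are all function names.

```python
import math
from statistics import mean, stdev, variance
from typing import Sequence


def t_independent(x1: Sequence[float], x2: Sequence[float]) -> float:
    """T statistic for two independent samples of equal size n."""
    if len(x1) != len(x2):
        raise ValueError("this sketch assumes equal sample sizes")
    n = len(x1)
    # statistics.variance is the sample variance (denominator n - 1)
    return (mean(x1) - mean(x2)) / math.sqrt((variance(x1) + variance(x2)) / n)


def t_paired(x1: Sequence[float], x2: Sequence[float]) -> float:
    """T statistic for matched observations, based on their differences."""
    if len(x1) != len(x2):
        raise ValueError("paired samples must have the same length")
    d = [a - b for a, b in zip(x1, x2)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))


def log2_fold_change(x: Sequence[float], y: Sequence[float]) -> float:
    """log2 of the ratio of mean expression in condition x to condition y."""
    return math.log2(mean(x)) - math.log2(mean(y))


def passes_fc_cutoff(x: Sequence[float], y: Sequence[float],
                     threshold: float = 2.0) -> bool:
    """True if expression changes at least `threshold`-fold in either direction."""
    return abs(log2_fold_change(x, y)) >= math.log2(threshold)
```

Working on the log2 scale makes the cutoff symmetric: a two-fold increase and a two-fold decrease both give an absolute value of 1, so a single comparison covers both directions.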

3.3 Pearson correlation

Pearson correlation is a common measure of correlation which shows the linear relationship between two normally distributed data sets [29]. The Pearson correlation coefficient, denoted $\rho$ for populations and $r$ for sample data, is defined as in equations 3.12 and 3.13 respectively.

Definition 3.7 (Pearson population correlation coefficient) For two random variables $X$ and $Y$, the Pearson correlation coefficient is defined as

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y} \tag{3.12}$$

where $\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$ is the covariance between $X$ and $Y$.
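For observed data the coefficient is estimated by replacing the expectations with sample averages. The sketch below computes this sample version and applies it to every miRNA-mRNA pair, as outlined in Section 2.3. The dictionary layout and the names `pearson_r` and `pairwise_correlations` are hypothetical, and the resulting correlations would still need a correction for multiple testing.

```python
import math
from typing import Dict, Sequence, Tuple


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Sample Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def pairwise_correlations(
    mirna: Dict[str, Sequence[float]],
    mrna: Dict[str, Sequence[float]],
) -> Dict[Tuple[str, str], float]:
    """Correlate every miRNA profile with every mRNA profile.

    Both dictionaries map a feature name to its expression values,
    measured over the same samples in the same order.
    """
    return {
        (mi_name, m_name): pearson_r(mi_values, m_values)
        for mi_name, mi_values in mirna.items()
        for m_name, m_values in mrna.items()
    }
```

A miRNA that represses its target would be expected to show a negative coefficient for that pair. With a miRNAs and b mRNAs this produces a times b coefficients, which is the source of the multiple testing problem.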

Definition 3.8 (Pearson sample correlation coefficient) For two samples $X = x_1, \ldots, x_n$ and $Y = y_1, \ldots, y_n$ with averages $\bar{x}$ and $\bar{y}$, the Pearson correlation coefficient is defined as

$$r_{X,Y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{3.13}$$

The Pearson correlation coefficient is always between $-1$ and $1$, where $\pm 1$ corresponds to a perfect linear relationship (high correlation).

3.4 Multiple testing

3.4.1 Single hypothesis testing

In single hypothesis testing we wish to test some null hypothesis $H_0$ against an alternative hypothesis $H_1$. There are two types of error which can occur in this process. A type I error, also called a false positive, occurs when the null hypothesis is incorrectly rejected ($H_0$ is rejected when it is actually true). A type II error, also called a false negative, occurs when the null hypothesis incorrectly fails to be rejected ($H_0$ is not rejected even though it is false). From the hypothesis test we obtain a p-value, and $H_0$ is rejected if this value is less than or equal to a significance threshold $\alpha$, representing the acceptable type I error [30]. Hence, the p-value obtained from testing an individual hypothesis is used to control the false positive rate of the test [31].

3.4.2 The problem of multiple testing

The analysis of large data sets has increased significantly during the past years, and it is common to test thousands of features in a genome data set simultaneously against some null hypothesis [32]. When testing a single hypothesis, a 5% chance that the result is a false positive is acceptable, but when testing multiple hypotheses this will result in numerous false positives. Therefore, in order to retain the same overall rate of false positives, we have to reduce the acceptable error $\alpha$ for each test [33]. One way of doing this is to control the family-wise error rate.

Definition 3.9 (FWER) Let $H_1, \ldots, H_m$ be a family of hypotheses and let $p_1, \ldots, p_m$ denote the corresponding p-values.
Then the family-wise error rate (FWER) can be defined as the probability of falsely rejecting at least one true null hypothesis $H_i$. This can be written as $\mathrm{FWER} = \Pr(\text{rejecting at least one } H_i \mid H_i \text{ is true})$. A multiple testing procedure is said to control the FWER at a significance level of $\alpha$ if $\mathrm{FWER} \le \alpha$.

There are many methods for controlling the FWER and one of the most well-known is the Bonferroni procedure [33].

Theorem 3.10 (Bonferroni Correction) Let $H_1, \ldots, H_m$ be a family of hypotheses and let $p_1, \ldots, p_m$ denote the corresponding p-values. The Bonferroni correction states that rejecting the null hypothesis for all $p_i \le \frac{\alpha}{m}$ controls the FWER.

Proof If $m$ is the number of hypotheses and $m_0$ is the number of these that are true, the FWER for the Bonferroni correction can be written as

$$\mathrm{FWER} = P\left\{ \bigcup_{i=1}^{m_0} \left( p_i \le \frac{\alpha}{m} \right) \right\} \tag{3.14}$$

since $p_i \le \frac{\alpha}{m}$ means that $H_i$ is rejected. By Boole's inequality we have that

$$P\left\{ \bigcup_{i=1}^{m_0} \left( p_i \le \frac{\alpha}{m} \right) \right\} \le \sum_{i=1}^{m_0} P\left( p_i \le \frac{\alpha}{m} \right) \le m_0 \frac{\alpha}{m} \le m \frac{\alpha}{m} = \alpha \tag{3.15}$$

Hence $\mathrm{FWER} \le \alpha$.

In many situations, controlling the FWER is too strict: it not only reduces the number of false positives but also the number of true discoveries, since biological data is derived from correlated networks [30]. In these situations there is a need for a method which identifies as many significant results as possible, while retaining a relatively low proportion of false positives.

3.4.3 False Discovery Rate (FDR)

The false discovery rate (FDR) is an error measure for multiple testing; controlling it limits the number of false positives while sacrificing fewer true discoveries than FWER control [34]. FDR is defined as the expected proportion of errors amongst the rejected hypotheses. Consider the problem of testing $m$ null hypotheses of which $m_0$ are true, and let $R$ be the number of rejected null hypotheses. This can be described by the following table:

                Declared not significant    Declared significant    Total
True H_0        U                           V                       m_0
False H_0       T                           S                       m - m_0
Total           m - R                       R                       m

Table 3.1: The number of errors made when testing $m$ null hypotheses.

Here $V$, $S$, $U$ and $T$ are unobserved random variables, where $V$ is the number of false positives (type I errors), $S$ is the number of true positives, $T$ is the number of false negatives (type II errors) and $U$ is the number of true negatives [34]. If each null hypothesis is tested separately at rejection level $\alpha$, the number of rejections $R = R(\alpha)$ increases with $\alpha$, which is not desirable. We introduce $Q$ as the proportion of rejected null hypotheses which are falsely rejected. This can be written as

$$Q = \frac{V}{V + S} \tag{3.16}$$

Then the FDR is defined as the expected value of $Q$.

Definition 3.11 (FDR) The false discovery rate (FDR) is defined as the expected proportion of rejected null hypotheses which are falsely rejected.
If $V$ is the number of false positives and $S$ is the number of true positives, the FDR is calculated by

$$\mathrm{FDR} = E\left[ \frac{V}{V + S} \right] = E\left[ \frac{V}{R} \right] \tag{3.17}$$

We want to keep this value below a threshold $q$, which is the acceptable proportion of false discoveries. The q-value is similar to the p-value, except that it is a measure of significance in terms of the false discovery rate rather than the false positive rate. The q-value of a particular feature can be described as the expected proportion of false positives among all features as or more extreme

than the observed one [32]. The procedure for keeping the FDR below $q$ is called the Benjamini-Hochberg procedure or the FDR procedure.

Theorem 3.12 (FDR procedure) Consider testing the null hypotheses $H_1, H_2, \ldots, H_m$ based on the corresponding p-values $p_1, p_2, \ldots, p_m$. Sort the p-values in increasing order $p_{(1)}, p_{(2)}, \ldots, p_{(m)}$ and denote by $H_{(i)}$ the null hypothesis corresponding to $p_{(i)}$. Let $k$ be the largest $i \in \{1, 2, \ldots, m\}$ for which

$$p_{(i)} \le \frac{i q}{m} \tag{3.18}$$

Then reject all hypotheses $H_{(1)}, H_{(2)}, \ldots, H_{(k)}$ [34].

3.5 Random Forest

Random forest is a machine learning method that uses decision trees for performing regression and classification [35]. Training data, consisting of dependent variables with corresponding independent variables, is used to build the model, which after training can be used for prediction of unknown dependent variables for a set of independent variables.

3.5.1 Regression trees and classification trees

A regression tree is a nonlinear predictive model used for predicting the values of continuous dependent variables, called response variables, from one or more predictor variables. The method starts by dividing the predictor space, the set of possible values of the predictor variables, into $J$ non-overlapping regions $R_1, \ldots, R_J$. For every observation that falls into region $R_j$, the prediction for that observation is defined as the average of the response values for all training observations in the region $R_j$. The regions $R_1, \ldots, R_J$ are constructed by minimizing the residual sum of squares (RSS) defined as in equation 3.19.

Definition 3.13 (Residual sum of squares, RSS) Given observations $X_1, \ldots, X_p$ and their corresponding response variables $y_1, \ldots, y_p$, the residual sum of squares (RSS) is defined as

$$\sum_{j=1}^{J} \sum_{i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2 \tag{3.19}$$

where $\hat{y}_{R_j}$ is the average of all $y_i$ such that $X_i \in R_j$.

Since it is computationally infeasible to consider every possible division of the feature space into $J$ regions, an approach called recursive binary splitting is used.
This approach starts with all observations in one single region and then successively splits the predictor space by choosing the best split at that particular step in the process. The best split is found by selecting a predictor X j and a cut-point s such that the splitting of the predictor space into the regions {X X j < s} and {X X j s} results in the smallest RSS. This process is then repeated by splitting each region in the same manner until a predefined stop criterion is reached. Classification trees are similar to regression trees, with the difference of qualitative response variables (classes). Instead of using the average of the response variables in a region as prediction, we predict that each observation belongs to the most commonly occurring class of the training observations in that region. To construct the regions, a classification error rate is minimized instead of the RSS. 12
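A single step of recursive binary splitting for a regression tree can be sketched as follows. This is only an illustrative sketch, not code from the thesis: the function names `rss` and `best_split`, the toy data, and the choice of candidate cut-points (midpoints between consecutive distinct values) are assumptions made for the example.

```python
def rss(values):
    """Residual sum of squares of the responses around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y):
    """Return (j, s, total_rss) for the single best binary split.

    X is a list of observations (each a list of predictor values) and y the
    list of responses. Every predictor j is tried, with every midpoint s
    between two consecutive distinct values of that predictor (an assumed,
    common choice of candidate cut-points).
    """
    best = (None, None, float("inf"))
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2.0
            # Regions {X | X_j < s} and {X | X_j >= s}
            left = [yi for row, yi in zip(X, y) if row[j] < s]
            right = [yi for row, yi in zip(X, y) if row[j] >= s]
            total = rss(left) + rss(right)
            if total < best[2]:
                best = (j, s, total)
    return best


# Toy data: the response jumps when the first predictor exceeds about 6.5.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [10.0, 4.0], [11.0, 7.0], [12.0, 6.0]]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
j, s, total = best_split(X, y)
print(j, s, round(total, 3))
```

A full tree would apply `best_split` again inside each resulting region until a stop criterion (for example a minimum number of observations per region) is met.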

Figure 3.1: A two-dimensional example of recursive binary splitting with four cut-points. (a) Output of recursive binary splitting. (b) Tree corresponding to the recursive binary splitting.

The output of the splitting (the resulting regions) can be seen in figure 3.1a. In figure 3.1b, a tree corresponding to this partition can be seen.

Definition 3.14 (Classification error rate) The classification error rate is defined as the fraction of the training observations in a region that do not belong to the most common class. So for the m:th region, the classification error rate E is defined as

\[ E = 1 - \max_k \left(\hat{p}_{mk}\right) \tag{3.20} \]

where $\hat{p}_{mk}$ represents the proportion of training observations in the m:th region that belong to class k.

3.5.2 Bagging

Bagging, also called Bootstrap Aggregation, is a method that reduces the variance of a statistical learning method. Since decision trees often suffer from high variance, this is a very useful method here [35]. The idea behind Bagging is that averaging a set of observations reduces the variance. Hence, to reduce the variance of a statistical learning method, separate prediction models could be built using several training sets from the population and the average of their predictions used. Since several training sets are generally not available, B sets of training data are instead created by repeatedly sampling with replacement from the original training set. Prediction models $\hat{f}_1(x), \ldots, \hat{f}_B(x)$ are built from each of these training sets and averaging them results in a single model $\hat{f}_{\mathrm{bag}}(x)$ with reduced variance

\[ \hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x) \tag{3.21} \]

This is the procedure of Bagging in the context of regression. In classification, the class predicted by each of the B models for a given test observation is recorded. The majority vote is then used for predicting the class of that test observation, i.e. the most commonly predicted class is chosen as the predicted class.

3.5.3 Random Forest

Random forest is a method similar to Bagging but with an improvement that decorrelates the trees. Random forest also builds decision trees from sets of training data, sampled from the original training set. The difference between the methods is the splitting in the process of building these decision trees. In Bagging, all p predictors are considered as candidates for every split. In Random Forest however, for each split a random sample of m predictors is chosen as candidates from the set of p predictors. From these m predictors, one is then chosen for the split. In other words, there are $p - m$ predictors that are not even considered as candidates for each split. This reduces the risk of using the same predictor for the first split in all the trees, which could happen if there is one very strong predictor compared to the others, resulting in highly correlated trees whose average gives a smaller reduction in variance. Random Forest is a common method to use for classification of samples using gene expression data because of its resemblance to the hierarchical structure of the cells and its ability to perform feature selection, which is essential when dealing with high-dimensional data sets [44].

3.6 Sensitivity, specificity and positive predictive value

Sensitivity and specificity, also known as true positive rate and true negative rate, are two statistical measures of the performance of a classification test. Sensitivity measures the proportion of the actual positives that are correctly classified and specificity measures the proportion of the actual negatives that are correctly classified [36].

Definition 3.15 Suppose that a group of patients are tested for a disease. Then we have the following terminology:

True positive (TP): the patient has the disease and the test is positive.
False positive (FP): the patient does not have the disease but the test is positive.
True negative (TN): the patient does not have the disease and the test is negative.
False negative (FN): the patient has the disease but the test is negative.

Using this terminology, the sensitivity and specificity are defined as in equations 3.22 and 3.23.

\[ \text{Sensitivity} = \frac{TP}{TP + FN} \tag{3.22} \]

\[ \text{Specificity} = \frac{TN}{TN + FP} \tag{3.23} \]

The positive predictive value (PPV) is another statistical measure, commonly used for answering the question of how likely it is that a patient has a disease given that the test result is positive. It measures the proportion of the classified positives that are true positives and is the measure most commonly used by clinicians.

Definition 3.16 Given the number of true positives (TP) and false positives (FP), the positive predictive value (PPV) is defined as in equation 3.24.

\[ \text{PPV} = \frac{TP}{TP + FP} \tag{3.24} \]
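Equations 3.22 to 3.24 can be illustrated with a short sketch. The function names, the label strings "tumor" and "normal", and the example predictions are hypothetical and chosen only for illustration; the functions assume that each denominator is nonzero.

```python
def confusion_counts(y_true, y_pred, positive="tumor"):
    """Count (TP, FP, TN, FN) with `positive` as the disease class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    return tp / (tp + fn)   # equation 3.22


def specificity(tn, fp):
    return tn / (tn + fp)   # equation 3.23


def ppv(tp, fp):
    return tp / (tp + fp)   # equation 3.24


# Made-up example: 5 tumor and 5 normal samples.
y_true = ["tumor"] * 5 + ["normal"] * 5
y_pred = ["tumor"] * 4 + ["normal"] + ["normal"] * 3 + ["tumor"] * 2
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(sensitivity(tp, fn), specificity(tn, fp), ppv(tp, fp))
```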

Chapter 4

Method

4.1 Data preparation

4.1.1 Training data

The data used as training data for building the classification method was obtained from The Cancer Genome Atlas (TCGA) [37], which is a publicly available database started by the National Cancer Institute and the National Human Genome Research Institute to identify genomic changes in different types of cancer. The data contains expression data for mRNA and microRNA from 377 patients with liver hepatocellular carcinoma. Approximately 50 of these have data from both the tumor (called tumor sample) and the margin of the tumor (called normal sample), while for the others there are no samples from the margin. There is also clinical data available, such as vital status at time of report and disease-specific diagnostic information. The clinical parameter of interest for this project is the AFP level.

The expression data for both miRNA and mRNA had already been normalized using RPKM when we downloaded the data. If the expression data for a specific mRNA or miRNA is missing for a patient, it has been set to zero. All miRNAs and mRNAs that have more than 80% zeros across the patients are excluded from the analysis since these are not as interesting for the analysis. We also exclude a few patients from the data set where the initial diagnosis of HCC was shown on pathology to be another type of cancer, such as colorectal carcinoma.

This data will be used as training data for both the previously mentioned questions to be examined, but with some alterations in the preprocessing. To be able to create a classification method for the first question, where we are distinguishing between tumor samples and normal samples, there has to be matched data (both tumor and normal samples) for all the patients. Since the data provided from TCGA only includes 50 patients with matched samples, all the other patients are excluded from the analysis. After the first filtering described above and exclusion of patients without matched samples, the data set includes matched pairs from 49 patients with expression levels for 1047 different miRNAs and mRNAs.

The data that is used for the second question, analysis of the AFP, has to include the level of alpha-fetoprotein for each patient. Hence, all patients for which we do not have this information are excluded from the analysis. The AFP level is regarded as low if it is less than 20 and high if it is higher than 200. Between 20 and 200 there is a grey zone where it is uncertain whether the level should be regarded as high or low, since the sensitivity and specificity of the test overlap, and therefore the few patients with an AFP in this interval are excluded. After this filtering of the data set we have 221 patients left with expression levels from 1047 different miRNAs and mRNAs.
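The two filtering rules above can be sketched as follows. This is a minimal illustration, not the code used in the thesis: the function names, the feature names, and the data layout (one list of expression values per miRNA or mRNA) are assumptions, as is the treatment of AFP values exactly equal to 20 or 200 as grey zone.

```python
def filter_sparse_features(expression, max_zero_fraction=0.8):
    """Keep features whose fraction of zero values does not exceed the limit.

    `expression` maps a feature name to its list of expression values,
    one value per sample (assumed layout).
    """
    kept = {}
    for name, values in expression.items():
        zero_fraction = sum(1 for v in values if v == 0) / len(values)
        if zero_fraction <= max_zero_fraction:
            kept[name] = values
    return kept


def afp_class(afp_level, low=20.0, high=200.0):
    """Return 'low', 'high', or None for the grey zone (patient excluded)."""
    if afp_level < low:
        return "low"
    if afp_level > high:
        return "high"
    return None  # boundary values are treated as grey zone (assumption)


# Hypothetical feature names and values, 10 samples each.
expression = {
    "miRNA_X": [3.1, 2.0, 5.5, 1.2, 4.4, 2.2, 3.3, 6.1, 0.9, 1.8],
    "miRNA_Y": [0, 0, 0, 0, 0, 0, 0, 0, 0, 2.5],      # 90% zeros: excluded
    "miRNA_Z": [0, 0, 0, 0, 0, 1.1, 2.3, 0.7, 4.0, 3.2],
}
kept = filter_sparse_features(expression)
classes = [afp_class(a) for a in [5, 20, 150, 200, 950]]
print(sorted(kept), classes)
```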

4.1.2 Validation data

This dataset was supposed to consist of expression levels of mRNA and miRNA generated from 100 surgical tumor samples with matched normal samples from the surgical margins, collected from patients over the last several years from HCC resections carried out at Karolinska University Hospital in Huddinge. Unfortunately, data from only six patients was received within the time frame of the project. These six patients were used as a validation set, but since this is a very small amount of data, a model will also be built from only two thirds of the training data, with the remaining third of that set used as a validation set.

The samples from Karolinska were prepared and sequenced at SciLifeLab in Solna. The raw data obtained from the sequencing, consisting of FASTQ files containing all the reads, needed some preparation before any analysis could be performed. This followed a common pipeline for preparation of raw expression data with the steps explained below. The process only had to be done for the miRNA data, since the corresponding preparation of the mRNA data had already been performed at SciLifeLab.

1. Removal of the 3' adapter and parsing by barcodes

Usually, the first part of the sequenced read corresponds to the miRNA sequence, followed by the adapter sequence and the barcode which were ligated in the library preparation process. The adapter sequence was found by inspection and comparison to sequences used in Illumina sequencing [38]. Since the adapter is not part of the original sequence, it has to be removed from all reads before continuing working with the data. The removal of the 3' adapter is accomplished by using the program Cutadapt [39], which searches for reads whose 3' ends align to the adapter sequence and then removes it. All the trimmed reads which are longer than 17 nucleotides are saved in a FASTQ file with the copy number for each specific sequence (how many times every specific sequence was found). The sequences shorter than 17 nucleotides are not saved since these can correspond to sequences other than miRNA. The reads were already parsed by barcode, since every patient had a specific barcode besides the adapter, and we retrieved the data separated for each patient.

2. Mapping against the genome

After the adapters have been removed, the reads are mapped against the genome. This is done by aligning the reads with a known reference sequence in a way that reveals how they are similar and finding the location where they are best matched. The reference sequences used for the mapping, called annotation files and containing information on where known miRNAs are located in the genome, were downloaded from miRBase [40]. First the miRNAs in the annotation files corresponding to the appropriate species have to be chosen, in this case Homo sapiens. The next step is to match the sequenced reads to the reference obtained from these files. For mapping the reads to the genome, the program Bowtie [41] was used. Bowtie takes annotation files and sequencing reads in FASTQ format as input and then outputs the set of aligned sequences in SAM format (Sequence Alignment/Map format). The SAM file contains all the reads and their aligned sequences (if the read was mapped), the mapping quality and some other information about the alignments. For some reads an alignment to the reference genome is not found and hence the read is not mapped to the genome. This can be an indication of a sequencing error or of a not yet known miRNA. The mapped reads are summarized by read counts to indicate which miRNAs are expressed in the different samples, by counting all the reads mapping to each sequence (i.e. to each miRNA).

3. Normalization

To compare expression levels from different libraries, the read counts have to be normalized to compensate for the fact that different miRNAs and mRNAs have been mapped from reads of differing lengths. The training data obtained from TCGA was normalized using RPKM, but since this is not the best available normalization method for miRNA expression data [25], we use TMM for the normalization of the miRNA data obtained from Karolinska. The mRNA data from Karolinska is already normalized using RPKM, as is appropriate for mRNA expression data. The normalized data set is then filtered in the same manner as the training set; all miRNAs and mRNAs with more than 80% zeros are excluded. When the data from Karolinska is used as validation set, the miRNAs and mRNAs are compared between this data and the data from TCGA, and all the miRNAs and mRNAs that do not exist in both data sets are excluded.

4.2 Data analysis

The data analysis follows this procedure. First, significantly differentially expressed miRNAs and mRNAs between the two classes (tumor and normal, or high and low AFP) are identified. Then the Pearson correlation between microRNA and mRNA expression levels for samples in each class (tumor sample, normal sample, high AFP and low AFP) is computed. Four random forest classification models are built with different inputs. One has only miRNA as input, one has mRNA, one has both miRNA and mRNA and one has all correlated miRNA and mRNA. These classification models are then tested using the validation data to compare which input gives the best performance of the model.

First, two thirds of the data from TCGA is used for building the model, since the rest of the data is used as validation set. After that, all of the data from TCGA is used as training data and the data from Karolinska containing 6 patients is used as validation set. When using the TCGA data as validation set, the two thirds used as training set are randomly chosen.
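One way to draw such a random two-thirds partition is sketched below. It assumes that the tumor and normal sample of a patient are kept on the same side of the split, which the text does not state explicitly; the function name `split_by_patient` and the integer patient identifiers are hypothetical.

```python
import random


def split_by_patient(patient_ids, train_fraction=2 / 3, seed=None):
    """Randomly assign whole patients to the training or validation set.

    `patient_ids` holds one identifier per sample, so a patient with a
    tumor and a normal sample appears twice. Returns two lists of sample
    indices (training, validation).
    """
    rng = random.Random(seed)
    patients = sorted(set(patient_ids))
    rng.shuffle(patients)
    n_train = round(train_fraction * len(patients))
    train_patients = set(patients[:n_train])
    train_idx = [i for i, pid in enumerate(patient_ids) if pid in train_patients]
    valid_idx = [i for i, pid in enumerate(patient_ids) if pid not in train_patients]
    return train_idx, valid_idx


# 49 patients with one tumor and one normal sample each (98 samples).
patient_ids = [pid for pid in range(49) for _ in range(2)]
train_idx, valid_idx = split_by_patient(patient_ids, seed=1)
print(len(train_idx), len(valid_idx))
```

With 49 matched patients this yields 66 training and 32 validation samples. Calling the function with different seeds gives different partitions.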
An average result of doing this 10 times is computed for higher statistical accuracy.

4.2.1 Differential expression analysis

For both the questions previously discussed, the first step after the preprocessing of the data is to find the differentially expressed miRNAs and mRNAs. Here "differentially expressed" means that the expression level changes systematically between two conditions (such as tumor and normal tissue in the first question we want to examine, or high and low AFP in the second question). The two methods used for finding the significant miRNAs and mRNAs are fold change with a cut-off at 0.5 ($|\log_2 \mathrm{FC}| > 0.5$) and t-test with significance level $\alpha = 0.01$.

In the first case, the training data consists of paired samples of expression data from tumor and normal tissue from 49 patients. Hence a two-sided, paired t-test is used. In the second case the training data consists of unpaired samples of expression data from patients with a high AFP (138 patients) and patients with a low AFP (83 patients). Hence the t-test in this case is a two-sided, unpaired t-test. In both these cases the obtained p-values are corrected for multiple testing using FDR. The miRNAs and mRNAs which are found to be significantly differentially expressed (i.e. have an FDR-corrected p-value of less than 0.01 and an absolute $\log_2$ fold change larger than 0.5) are chosen for building the classification model.

4.2.2 Building the random forest classification model

The random forest classification models were implemented with the Python RandomForestClassifier [42], using the default parameters suggested by the package. The classification models were built using four different inputs:

1. All significantly differentially expressed miRNAs
2. All significantly differentially expressed mRNAs
3. All significantly differentially expressed miRNAs and mRNAs
4. All significantly differentially expressed and correlated miRNAs and mRNAs

The models built with the first three inputs are straightforward. The expression levels of the miRNAs and/or mRNAs for each patient are used as predictors and the condition of the same patient is the corresponding response variable. First we use all the data from TCGA as training data. In the case where we compare tumor and normal samples, this results in 98 predictors (one expression vector per sample), each containing 118 or elements (depending on if the input is miRNA or mRNA), and 98 response variables of which 49 are tumor and 49 are normal. In the other case, where high and low levels of AFP are compared, there are 221 predictors, each containing 118 or elements, and 221 response variables of which 138 are high AFP and 83 are low AFP. When we use two thirds of the TCGA data as training set we have 66 predictors in the first case and 148 predictors in the second case.

When we use the correlated miRNAs and mRNAs as input, we first have to compute the correlations. Using Pearson correlation and a threshold of r > 0.7 for when a miRNA and an mRNA are considered correlated, the correlation between each mRNA-miRNA pair is computed, and the miRNAs and mRNAs which are not correlated to any of the others are excluded from the input data. In the first case this results in 98 predictors with corresponding response variables as previously explained, but the predictors now contain 69 elements for miRNA and 518 elements for mRNA. In the second case, the 221 predictors contain 64 elements for miRNA and 89 elements for mRNA.

4.2.3 Testing the model

When the random forest classification models with the four different inputs have been built, they are tested using the validation data set. When we use the data from Karolinska we have a validation data set containing 12 predictors (6 tumor samples and 6 normal samples), and when using one third of the data from TCGA we have 32 predictors when comparing tumor and normal samples and 96 predictors when comparing high and low AFP. The model outputs the predicted class (for the first case tumor or normal and for the second case high or low AFP) based on the input data, and since the class is already known for all the patients, this information can be used to evaluate the performance of the models. The models are evaluated by the percentage of correct classifications and by computing the sensitivity, specificity and the positive predictive value. The number of correct classifications is an average of the numbers obtained by changing the partitioning of the TCGA data into training and test sets and by using the same input to the model 100 times.
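Building and testing one such model can be sketched as below. The sketch assumes that the RandomForestClassifier cited as [42] is the class of that name in scikit-learn's `sklearn.ensemble` module, which the text does not confirm. The data is synthetic (the helper `simulate`, the number of features and the size of the class shift are made up) and only stands in for real expression data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)


def simulate(n_per_class, n_features=20, shift=3.0):
    """Synthetic stand-in for expression data (not real measurements).

    Class 1 ("tumor") is shifted in the first five features; class 0
    ("normal") is pure noise.
    """
    normal = rng.normal(0.0, 1.0, size=(n_per_class, n_features))
    tumor = rng.normal(0.0, 1.0, size=(n_per_class, n_features))
    tumor[:, :5] += shift
    X = np.vstack([normal, tumor])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y


X_train, y_train = simulate(33)   # 66 training samples
X_valid, y_valid = simulate(16)   # 32 validation samples

# Default parameters, with a fixed seed only to make the sketch repeatable.
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_valid)

accuracy = float(np.mean(y_pred == y_valid))
print(accuracy)
```

The predicted and known classes can then be fed into the sensitivity, specificity and PPV definitions of Section 3.6, and the whole procedure repeated over several random partitions to obtain averages.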

Chapter 5

Results

In the significance test for tumor and normal samples, 118 out of the 1047 miRNAs and of the mRNAs were found to be significantly differentially expressed at level $\alpha = 0.01$ and with an absolute $\log_2$ fold change larger than 0.5. The results are displayed with volcano plots (scatterplots of the negative $\log_{10}$-transformed p-values obtained from the t-test versus the $\log_2$ fold change) in figure 5.1a for the miRNAs and 5.1b for the mRNAs. miRNAs and mRNAs with statistically significant differential expression according to the t-test are located above the horizontal threshold line at 2, and miRNAs and mRNAs with large fold-change values are located outside the vertical threshold lines at -0.5 and 0.5. In the second case (where we compare high and low AFP), 161 out of the 1047 miRNAs and of the mRNAs were found to be significantly differentially expressed. The volcano plots for the miRNAs and the mRNAs can be seen in figure 5.1c and 5.1d respectively. The miRNAs and mRNAs considered significant are hence located in the upper left or upper right parts of the plot (in the plot these points are colored blue). The distributions in the tumor samples and the normal samples of some of the most significant miRNAs and mRNAs can be seen in Appendix A.

First the data from TCGA was divided with two thirds as training set and one third as validation set. Random forest classification models were built using the training set with four different inputs and then tested with the validation set. In the first case, where tumor and normal samples were compared, all the models performed very well with over 90% correct classifications. The sensitivity, specificity and standard deviations for the four different models can be seen in blue in figure 5.2. In the second case, with high and low AFP, the models did not perform as well as in the first case and had approximately 70% correct classifications. The sensitivity, specificity and standard deviations for these models can be seen in red in figure 5.2. The number of true and false positives and true and false negatives for all the tested models can be found in Appendix B.

When the data obtained from Karolinska was used as validation set and all the data from TCGA was used as training set, there were only approximately 40-60% correct classifications. We expected this result to be less accurate than desired due to the small amount of data in the validation set. Another problem is that the diagnosis of these six patients is not entirely determined. Because of this we will leave the results and discussion of the models using this data to further studies, when more data from Karolinska is available. The remainder of this report focuses on the results obtained using only the data from TCGA.


Figure 5.1: Volcano plots showing the negative $\log_{10}$-transformed p-values obtained from the t-test and the $\log_2$ fold change for all miRNAs and mRNAs. The significantly differentially expressed miRNAs and mRNAs are located in the upper left or upper right corners. (a) miRNA from tumor and normal samples. (b) mRNA from tumor and normal samples. (c) miRNA from samples with high and low AFP. (d) mRNA from samples with high and low AFP.

Figure 5.2: The bar chart shows the sensitivity, specificity and standard deviations for the random forest models with the four different inputs. In case 1, tumor and normal samples are compared (the blue bars) and in case 2, high and low AFP are compared (the red bars). (a) Sensitivity of all the random forest models. (b) Specificity of all the random forest models.
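The selection rule visualized in the volcano plots, FDR control combined with a fold-change cut-off, can be sketched as follows. The sketch implements the step-up procedure of Theorem 3.12 directly and assumes that "FDR-corrected p-value below 0.01" is equivalent to rejection by that procedure at q = 0.01. The function names and the example p-values and fold changes are made up, and the example uses q = 0.05 only to keep the numbers readable.

```python
def fdr_procedure(p_values, q=0.01):
    """Step-up FDR procedure of Theorem 3.12.

    Returns a list of booleans, True where the null hypothesis is rejected.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        # Largest rank i with p_(i) <= i * q / m
        if p_values[idx] <= rank * q / m:
            k = rank
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


def select_features(p_values, log2_fc, q=0.01, fc_cutoff=0.5):
    """Keep features that pass the FDR procedure and have |log2 FC| > cut-off."""
    rejected = fdr_procedure(p_values, q)
    return [r and abs(fc) > fc_cutoff for r, fc in zip(rejected, log2_fc)]


# Made-up p-values and log2 fold changes for five features.
p_values = [0.001, 0.008, 0.039, 0.0395, 0.6]
log2_fc = [1.2, -0.3, -0.9, 0.6, 2.0]
rejected = fdr_procedure(p_values, q=0.05)
selected = select_features(p_values, log2_fc, q=0.05)
print(rejected, selected)
```

In this example the third p-value exceeds its own threshold 3q/m = 0.03, but it is still rejected because the fourth p-value satisfies 0.0395 ≤ 4q/m = 0.04; this is the "largest i" rule of the theorem.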

In order to find out which miRNAs contributed the most to the classification, and to see if these could be used alone as input to the model without a large loss of accuracy, the variable importance, which describes how important each miRNA was for the classification process, was computed. The five miRNAs that contributed the most were selected and a new random forest model was built using only these as input to see how the performance would change. The results show a small decrease in classification accuracy when using these five miRNAs, but the specificity, sensitivity and PPV are still around 0.90, which we consider to be a very good result.
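A sketch of this variable-importance step is given below. It assumes scikit-learn's `RandomForestClassifier` and its impurity-based `feature_importances_` attribute; the text does not specify which importance measure was used. The data is synthetic, with five informative features planted by construction.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n_per_class, n_features = 49, 30

# Synthetic stand-in for expression data: only features 0-4 differ by class.
X = rng.normal(size=(2 * n_per_class, n_features))
y = np.array([0] * n_per_class + [1] * n_per_class)
X[n_per_class:, :5] += 3.0

forest = RandomForestClassifier(random_state=0).fit(X, y)
importances = forest.feature_importances_   # one value per feature

# Indices of the five most important features, most important first.
top5 = np.argsort(importances)[::-1][:5]

# New model built from the five selected features only.
reduced = RandomForestClassifier(random_state=0).fit(X[:, top5], y)

# Fresh synthetic validation data generated the same way.
X_valid = rng.normal(size=(2 * 16, n_features))
y_valid = np.array([0] * 16 + [1] * 16)
X_valid[16:, :5] += 3.0
reduced_accuracy = reduced.score(X_valid[:, top5], y_valid)
print(top5, reduced_accuracy)
```

Comparing `reduced_accuracy` with the accuracy of the full model indicates how much is lost by restricting the input to a handful of features.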

Chapter 6

Discussion

6.1 Performance of the Model

The classification models built and tested only with data obtained from TCGA performed very well overall. The models built for classifying samples as tumor or normal performed better than the ones distinguishing between high and low AFP. This was expected, since the samples used for building the AFP models are all tumor samples, which affects the levels of miRNAs and mRNAs. Hence, the effect that the AFP levels have on the expression levels is masked by the effect that the tumor has. However, we can still see a pattern in the expression levels that is linked to the AFP levels, which can be useful for further studies. The large difference in expression between tumor samples and normal samples can be seen in the distribution plots in Appendix A. There we can see that the plots showing the distributions in samples with high and low AFP are more similar to each other than the plots showing the distributions in tumors and normal samples, as expected.

There is only a very small difference in performance between the models with the four different inputs. Since many of the miRNAs and mRNAs are correlated, a similar result when using these as inputs to the model is expected. Hence, this shows that the model behaves in the desired and expected way.

The high prediction accuracy of the models suggests that expression levels of miRNA and mRNA could be used for diagnosis of HCC and hopefully also for designing new therapeutics using modified RNA with binding sites to some of the miRNAs found in this project. For this purpose there has to be a smaller selection of miRNAs (the reason for this is discussed in the section "Selection of Method" below). From the variable importance analysis we know that the model performs well with only a few miRNAs as input and hence these miRNAs could be used for such a purpose.
6.2 Selection of Method

In this project the number of variables (miRNA and mRNA) is much larger than the number of observations (patients), which makes variable selection a crucial part of the solution. One prospect of this project was that the results could be used for designing a therapeutic reagent with binding sites for miRNAs that would be toxic to tumor cells. It is not possible to use all the significantly differentially expressed miRNAs for this, since there can only be a limited number of binding sites, so we have to select a small subset of these miRNAs that still results in a good predictive performance. When using random forest it is easy to determine which variables contribute the most to the classification and hence to build a model based only on these variables that retains a good prediction accuracy. This is the reason for choosing to use random forest in this project. Other methods that could have been used are k-nearest neighbor (KNN), support vector machines (SVM) and linear discriminant analysis (LDA), but several prior studies [43][44][45] have found that random forest is the most successful method for distinguishing between disease and non-disease samples using gene expression data.

6.3 Future Studies

Since the data we used from TCGA was preprocessed and we did not have any control over how the samples were collected or over the degree of heterogeneity of the samples, we cannot be sure how reliable the results obtained using this data are. An immediate continuation of this project is to build classification models based only on data from the Karolinska hospital, to have full control over the collection and processing of the samples. There are 100 samples from patients at Karolinska that are currently being sequenced in order to enable such an analysis. Another possible improvement of the models could be achieved by optimizing the parameters used in the random forest models.

An interesting continuation of this project would be to study if the miRNAs with a significant difference in expression between liver tumors and normal liver samples could be used for treatment of HCC. Future studies can examine if modified RNA coding for a toxic payload can be used for binding to the miRNAs found to be lacking in the tumors and hence target only the tumor cells. Binding sites for a few of the miRNAs, the ones that were found to be the most important for the classification, can then be incorporated into the modRNA, which will make it selectively target the tumor cells.

There are some factors that have not been discussed in this thesis but may be very important for these kinds of treatments to work. One such factor is whether the miRNAs are up or down regulated in the tumor cells. If binding sites to miRNAs that are up regulated in tumor cells are incorporated in the modRNAs, the cells that are not tumors will be attacked instead of the opposite. Another important factor is the quantity of the miRNAs that are chosen. It has to be studied how high the quantity in the normal cells has to be in order to make sure that the modRNA binds to all the normal cells and hence does not target them, and how low the quantity has to be in the tumor cells in order to kill all of them. There are some miRNAs that are present at reasonable levels in normal liver samples but almost completely lacking in the tumor samples. These could be a good choice when selecting which miRNAs to use for developing a treatment.

Appendix A

Distribution plots

The distributions of some of the miRNAs and mRNAs with the largest significant differences can be seen below. All figures show the distribution in the tumor sample in red and the normal sample in blue. Most of the miRNAs and mRNAs are down regulated in the tumor samples, which can be seen in the figures by a lower mean of the distribution, but there are also some that are up regulated. Something worth noticing is the size distribution of the miRNAs. Some appear in large numbers (e.g. Figure A.1b) while some are very rare (e.g. Figure A.1d). This could be important for further studies. All names of the miRNAs and mRNAs in Figures A.1, A.2, A.3 and A.4 have been coded in order to protect intellectual property rights and enable future commercialization of diagnosis and treatment methods based on these discoveries.

Figure A.1 shows the distributions of some miRNAs in tumor samples and normal samples. Figure A.2 shows the distributions of some mRNAs in tumor samples and normal samples.

In the figures below, the distributions of some miRNAs and mRNAs in samples with high AFP (red) and low AFP (blue) are shown. The difference in distribution is smaller than in the case where we compared tumor samples and normal samples, as could be seen in the results. Figure A.3 shows the distributions of some miRNAs in samples with high and low AFP. Figure A.4 shows the distributions of some mRNAs in samples with high and low AFP.

Figure A.1: Distributions of some miRNAs in tumor samples and normal samples. (a) miRNA A. (b) miRNA B. (c) miRNA C. (d) miRNA D.

Figure A.2: Distributions of some mRNAs in tumor samples and normal samples. (a) mRNA A. (b) mRNA B. (c) mRNA C. (d) mRNA D.

Figure A.3: Distributions of some miRNAs in samples with high and low AFP respectively. (a) miRNA E. (b) miRNA F. (c) miRNA G. (d) miRNA H.

Figure A.4: Distributions of some mRNAs in samples with high and low AFP respectively. (a) mRNA E. (b) mRNA F. (c) mRNA G. (d) mRNA H.
