A Statistical Framework for Classification of Tumor Type from microRNA Data


DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2016

A Statistical Framework for Classification of Tumor Type from microRNA Data

JOSEFINE RÖHSS

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES


Master's Thesis in Mathematical Statistics (30 ECTS credits)
Master Programme in Applied and Computational Mathematics (120 credits)
Royal Institute of Technology, 2016
Supervisor at Moderna Therapeutics: Hugh Salter
Supervisor at KTH: Timo Koski
Examiner: Timo Koski

TRITA-MAT-E 2016:46
ISRN-KTH/MAT/E--16/46-SE

Royal Institute of Technology
SCI School of Engineering Sciences
KTH SCI
SE Stockholm, Sweden


Abstract

Hepatocellular carcinoma (HCC) is a type of liver cancer with a low survival rate, not least because it is difficult to diagnose at an early stage. The objective of this thesis is to build a random forest classification method based on microRNA (and messenger RNA) expression profiles from patients with HCC. The main purpose is to distinguish between tumor samples and normal samples by measuring miRNA expression. If successful, this method can be used to detect HCC at an earlier stage and to design new therapeutics. The microRNAs and messenger RNAs whose expression differs significantly between tumor samples and normal samples are selected for building random forest classification models. These models are then tested on paired samples of tumor and surrounding normal tissue from patients with HCC. The results show that the models built for classifying tumor and normal samples have high prediction accuracy, and hence that microRNA and messenger RNA expression levels have high potential for diagnosis of HCC.

Keywords: miRNA, mRNA, random forest, classification, HCC, diagnosis


Acknowledgements

First of all, I would like to thank my supervisor at Moderna Therapeutics, Hugh Salter, who gave me the opportunity to write this thesis and introduced me to this interesting field. Thank you for all your enthusiasm for this project and all the help you have given me during the process. To all the people at Wallenberg Applied Bioinformatics Infrastructure (WABI) at SciLifeLab, and especially Pär Engström, thank you for inviting me to your group and for helping me with the microRNA data. I would also like to thank my supervisor at the Department of Mathematical Statistics at the Royal Institute of Technology, Professor Timo Koski, for his feedback during the process. Finally, many thanks to Peter Modin for all the proofreading, mental support and for discussing this subject with me so many times.

Stockholm, June 2016
Josefine Röhss


Contents

1 Introduction
2 Biological Background
   2.1 Messenger RNA
   2.2 MicroRNA
   2.3 Correlation between microRNA and mRNA
   2.4 RNA sequencing
   2.5 Modified RNA
3 Theoretical Background
   3.1 Normalization methods
   3.2 T-test and Fold Change
   3.3 Pearson correlation
   3.4 Multiple testing
   3.5 Random Forest
   3.6 Sensitivity, specificity and positive predictive value
4 Method
   4.1 Data preparation
   4.2 Data analysis
5 Results
6 Discussion
   6.1 Performance of the Model
   6.2 Selection of Method
   6.3 Future Studies
A Distribution plots
B Results random forest
   B.1 Case 1: Tumor and normal samples
   B.2 Case 2: High and low AFP


Chapter 1 Introduction

Hepatocellular carcinoma (HCC) is the sixth most common type of cancer worldwide and one of the main causes of cancer-related deaths [1]. The main reason for the large number of deaths is late detection of the tumor and hence diagnosis at a late stage of disease, at which time survival is measured in months [2]. The overall 5-year survival rate from the time of diagnosis of HCC is 5-9% [1]. Currently no universal guidelines for diagnosis exist, but the serum marker alpha-fetoprotein (AFP) is commonly used for diagnostics and surveillance [3], since the majority of HCC patients have an increased level of AFP [4]. However, AFP is not effective when used alone: far from all patients show increased AFP levels at an early stage of disease, and other chronic liver diseases such as hepatitis C (HCV) can also cause increased AFP levels [5]. Since earlier diagnosis is crucial for obtaining a higher survival rate, much research focuses on diagnostic strategies to identify early HCC [3].

Recent studies have shown that microRNA (miRNA) and messenger RNA (mRNA) expression levels can be used for detection of different types of cancer. Classification of human tumors using miRNA expression profiles is currently an area of intense research, and several classifiers have been developed for different types of tumors [6]. There is, however, no current study of miRNA and mRNA expression levels that takes the relationship between them into account. Since miRNA and mRNA are not independent measures, the correlation between them has to be included in the analysis [7].

In this thesis, we use the expression levels of miRNA and mRNA from two different conditions to build a classification method that can predict a condition given expression values from a patient with HCC.
We compare the results from four different inputs to the classification method: miRNA expression levels, mRNA expression levels, both miRNA and mRNA expression levels, and both miRNA and mRNA expression levels after taking the potential correlation between them into account.

Two different pairs of conditions are investigated. The first question is whether the classification method can predict if the observed expression profiles come from a tumor sample or a non-tumorous sample. The hope is that the resulting method can be used to determine the condition of an undiagnosed patient based on the expression profile of miRNA and mRNA. The second question is whether there is a connection between the AFP level and the expression profiles of miRNA and/or mRNA. Our hope is that the classification model with conditions (response variables) "high AFP level" and "low AFP level" will be able to predict the AFP level based on a specific miRNA (or mRNA) pattern. Since AFP is a diagnostic tool in current use and miRNA can be used to identify drug targets, connecting the AFP level to a certain miRNA pattern can be a step towards personalized treatment of HCC.

Chapter 2 Biological Background

2.1 Messenger RNA

Messenger RNAs (mRNAs) are single-stranded RNA molecules whose main purpose is to carry the genetic information transcribed (copied) from DNA in the nucleus to the site of protein synthesis in the ribosomes [8]. On the ribosomes the information is then translated into protein by "reading" the mRNA sequence and assembling amino acids accordingly [9]. The transcription of DNA to mRNA is also called gene expression. The single-stranded RNAs obtained from the transcription of DNA are called transcripts, and the transcriptome is the full range of these transcripts in a cell [8]. Hence, by studying the transcriptome, we learn more about the expression of genes, which increases our understanding of development and disease [10].

2.2 MicroRNA

MicroRNAs (miRNAs) are single-stranded, non-coding small RNA molecules approximately 22 nucleotides long. miRNAs regulate gene expression in many cellular processes by binding to the 3' untranslated region (a section at the 3' end of the mRNA that often contains regulatory regions) of targeted mRNA and thereby repressing mRNA translation [11]. miRNA genes can be found in many different places in the genome, often in regions such as introns that historically have been referred to as "junk DNA" because their function was unknown [12]. Since the discovery of miRNA in these regions, several studies comparing miRNA expression in tumor tissue and corresponding non-tumorous tissue have been published for different types of tumors. These studies indicate that the expression of some miRNAs is altered in specific tumors, implying that miRNA may be involved in the development of cancer and other diseases [12]. Because of this, miRNA expression profiles have been suggested as possible biomarkers for identifying tumors [11].
2.3 Correlation between microRNA and mRNA

Since miRNAs repress translation of specific targeted mRNAs, there are associations between the expression profiles of miRNAs and their target mRNAs [14], making them dependent measures. An increase or decrease in the level of a miRNA causes the opposite reaction in the level of the target mRNA, resulting in a correlation between their expressions [7]. If both miRNA and mRNA expression profiles are used for building a classification method, the correlation between significantly differentially expressed miRNA and mRNA from the same class (e.g. tumor samples or normal samples) has to be taken into account. The correlation can be computed as the pairwise Pearson correlation between all possible pairs of miRNA and mRNA from the same class. This results in a multiple testing problem, which also has to be corrected for. The computation of Pearson correlation and the correction for multiple testing are explained in the chapter Theoretical Background.

2.4 RNA sequencing

The cellular transcriptome constantly changes depending on several factors such as environment, development and disease. Studying the variations of the transcriptome in response to specific conditions or treatments can give a deeper understanding of how changes in transcriptional activity can reflect or contribute to disease [15]. RNA sequencing (RNA-seq) is one of the most efficient tools for examining the variations of the transcriptome and has been revolutionizing the study of gene expression profiles [16]. RNA-seq is useful for detecting both known and novel features, and there are many different methods depending on the purpose of the experiment, such as sequencing mRNA, ribosome profiling or sequencing of small RNA species such as miRNA [16]. In the process of RNA-seq, mRNA (and other RNAs) are converted to cDNA (a DNA that is complementary to the mRNA strand), which is used as the input to a sequencing library preparation. The process is explained in more detail below for microRNA sequencing.

2.4.1 microRNA sequencing

microRNA sequencing (miRNA-seq) is an RNA-seq method for microRNA. This method has become very popular and is useful e.g. for studying expression profiles of miRNAs and discovering novel miRNAs. The process of miRNA-seq includes isolation of the miRNAs and construction of multiplexed miRNA libraries (collections of sequences). There are many different sequencing platforms which require different protocols, but they typically follow similar steps [17]. These steps are described below and can be seen in Figure 2.1.

1. Isolation of total RNA containing miRNA
In a given sample, all the RNA is extracted and isolated. There are several different methods for accomplishing this, and more information can be found in [17].

2. Adapter ligation
For every RNA sample obtained from the isolation, DNA adapters are added to both ends of the small RNAs. miRNAs are distinguished from other small RNAs, such as RNA degradation products, by a 5' phosphate and a 3' hydroxyl group. Based on these groups, the 3' adapter that is ligated to the small RNAs is designed for capturing microRNAs. After that, the miRNAs are ligated to a 5' adapter which contains the binding site for the sequencing primer [17].

3. Reverse transcription to generate cDNA libraries
The resulting adapter-ligated miRNAs are reverse transcribed with a primer complementary to the 3' adapter to generate a cDNA library [18].

4. Amplification by PCR
To provide a sufficient quantity for sequencing, the cDNA library is amplified by PCR. The amplified library is gel extracted and quality-checked prior to sequencing [17]. To sequence multiple libraries at the same time, each miRNA library can be barcoded by implementing a unique tag sequence as part of the adapter or PCR amplification primer during construction [18].

5. Sequencing
The sequencing process differs among platforms. The Illumina platform, which is used in this project, has free nucleotides which build up sequences matching the reads. These nucleotides are fluorescently labeled, with a specific color corresponding to each base (A, C, G, T), so that the color of each added nucleotide can be seen and hence indicates the sequence of bases [19].

The output from Illumina sequencing (a popular sequencing platform) is a set of FASTQ files with all the reads obtained and the corresponding quality information from the sequencing [20]. This data then has to be processed (adapter sequences removed and reads mapped against the genome) and normalized before any analyses can be performed.

Figure 2.1: The process of RNA-seq

2.4.2 FASTQ

The FASTQ format is a text-based format used for storing nucleotide sequences and the associated quality scores. A FASTQ file from a sequencing run consists of 4 lines for each read. The first line starts with "@" and is followed by a unique identifier for the read, optionally followed by some additional sequence description [17]. The second line contains the read as a raw sequence of nucleotides. The third line starts with "+" and is optionally followed by the same identifier and additional information as in the first line. The last line contains one quality score for each nucleotide of the sequence in the second line, encoding the probability that the corresponding base was sequenced incorrectly [21]. An example of a FASTQ file can be seen in Figure 2.2.

Figure 2.2: An example of four lines, representing one read, in a FASTQ file. This is a sequence of length 35 with identifier "HCC.4". The range of the encoded quality scores depends on the software used for sequencing; in this figure Illumina sequencing has been used. Here "!" represents the lowest quality while "5" is in the middle of the range of scores [22].

2.5 Modified RNA

RNA modifications are changes to the chemical composition of RNA which can result in altered function or stability. Current studies are evaluating the possibilities of using modified RNA (modRNA) for treatment of a broad range of diseases, since RNA-based therapeutics make it possible to target proteins involved in disease that have been difficult to target with existing medicines [13]. Binding sites for specific miRNAs can be incorporated into a modRNA. This restricts the expression of the therapeutic RNA to cells lacking that specific miRNA. Hence, if a miRNA is found to be highly expressed in non-tumorous tissue but lacking in the tumors, a modRNA containing binding sites for that miRNA will be selectively expressed in the tumor cells relative to the non-tumorous cells.
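As a concrete illustration of the FASTQ record layout from Section 2.4.2, the following Python sketch parses one four-line record and converts the quality characters to error probabilities. It assumes the Phred+33 encoding (score = character code minus 33), which is consistent with "!" being the lowest score; the function name `parse_fastq_record` and the example read are hypothetical and are not taken from the thesis data.

```python
def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into its parts.

    Assumption: Phred+33 quality encoding, i.e. score = ord(character) - 33.
    """
    header, sequence, separator, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@"):
        raise ValueError("first line of a record must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("third line of a record must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lines must have equal length")

    phred = [ord(ch) - 33 for ch in quality]
    # A Phred score Q corresponds to an error probability of 10^(-Q/10).
    error_prob = [10 ** (-q / 10) for q in phred]
    return {
        "id": header[1:].split()[0],
        "sequence": sequence,
        "phred": phred,
        "error_prob": error_prob,
    }


# Hypothetical record; only the identifier "HCC.4" is borrowed from Figure 2.2.
example = ["@HCC.4\n", "ACGTACGT\n", "+\n", "!!5555II\n"]
record = parse_fastq_record(example)
```

Under this assumed encoding, "!" decodes to score 0 (an error probability of 1) and "5" to score 20 (an error probability of 0.01).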

Chapter 3 Theoretical Background

3.1 Normalization methods

In order to discover important changes in expression from RNA-seq raw data, normalization of the data is an essential step in the preprocessing. The aim of normalization is to minimize systematic technical effects and experimental variation by correcting for differences in sample size and sequence lengths. Normalization of expression data is important in order to be able to compare different samples with each other and to ensure that a gene with the same expression level in two samples is not detected as differentially expressed [23]. There are several different methods for normalization of RNA-seq data, and currently no universal guidelines exist for when to use which method. Since miRNA is a relatively new field of research, normalization methods were first developed for mRNA data. Some of these methods are applicable to miRNA, while others have been modified or developed for miRNA data [23]. In this thesis two different methods are used, which are further described below.

3.1.1 Reads Per Kilobase per Million reads (RPKM)

The most common normalization for expression data is to normalize to library size, since two samples with different library sizes would produce proportionally different numbers of reads for the same gene [24]. Current RNA-seq analysis methods often use the total number of reads to normalize data between samples. This is done by scaling the number of reads in each library to a common value across all the libraries in the experiment [25]. Reads per kilobase per million reads (RPKM) is a simple method for library size scaling that divides the number of reads for a gene by the total library size [24]. The RPKM method also divides the number of reads by the length of the transcript in kilobases (number of base pairs divided by 1000), since longer transcripts are more likely to have sequences mapped to their region.
This results in a higher number of reads, which biases comparisons between transcripts of different lengths [24].

Definition 3.1 (RPKM) Let $C_R$ be the total number of reads mapped to a specific mRNA, $R$, in a sample. Further, let $L_R$ be the transcript length of $R$ in base pairs and let $N$ be the total number of reads in the sample. Then the measure reads per kilobase per million reads (RPKM) for $R$ is defined as

$$\mathrm{RPKM}_R = \frac{C_R \cdot 10^9}{L_R \, N} \tag{3.1}$$

The factor $10^9 = 10^3 \cdot 10^6$ converts base pairs to kilobases and reads to millions of reads.

3.1.2 The Trimmed Mean of M-values normalization method (TMM)

Library size scaling methods, such as RPKM, are too simple for some biological applications [25]. One example is when a large number of mRNAs (or miRNAs) is unique to one of the experimental conditions (e.g. tumor or normal sample), or when there is a small number of highly expressed outliers. This affects the normalization of the remaining mRNAs (or miRNAs), and if the normalization method does not adjust for it, the analysis of the expression profiles is skewed towards one of the experimental conditions [25]. Some more advanced normalization methods account for the sampling properties of RNA-seq data and hence prevent a skewed normalization. One such method is the trimmed mean of M-values normalization method (TMM).

Definition 3.2 Assume that we have a library $k$ with a total of $N_k$ reads. Define $Y_{gk}$ as the observed read count for gene $g$ in library $k$, $\mu_{gk}$ as the true, unknown expression level and $L_g$ as the length of gene $g$. The expected value of $Y_{gk}$ is then

$$E[Y_{gk}] = \frac{\mu_{gk} L_g}{S_k} N_k \tag{3.2}$$

where

$$S_k = \sum_{g=1}^{G} \mu_{gk} L_g \tag{3.3}$$

Here, $S_k$ is the unknown total RNA output of sample $k$ and $G$ is the number of genes.

Since the expression levels and true lengths of every gene are unknown, the total RNA production $S_k$ cannot be estimated directly. Instead, the relative RNA production of two samples, $f_k = S_k / S_{k'}$, is used. The TMM normalization method computes the overall expression levels of genes between samples under the assumption that the majority of the genes are not differentially expressed. The ratio of RNA production is estimated by a weighted trimmed mean of the log expression ratios (trimmed mean of M-values, TMM). To compute normalization factors across several samples, one sample is selected as a reference and the TMM factor is then calculated for each non-reference sample [25].
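Definition 3.3 (TMM) The TMM normalization factor for sample $k$ using reference sample $r$ is calculated as

$$\log_2\left(\mathrm{TMM}_k^{(r)}\right) = \frac{\sum_{g \in G^*} w_{gk}^r M_{gk}^r}{\sum_{g \in G^*} w_{gk}^r} \tag{3.4}$$

where $G^*$ is the set of genes that remain after trimming, the weights $w_{gk}^r$ are computed by

$$w_{gk}^r = \frac{N_k - Y_{gk}}{N_k Y_{gk}} + \frac{N_r - Y_{gr}}{N_r Y_{gr}} \tag{3.5}$$

and $M_{gk}^r$ is the log expression ratio of gene $g$ in library $k$ relative to reference library $r$, defined for $Y_{gk} \neq 0$ and $Y_{gr} \neq 0$ as

$$M_{gk}^r = \log_2\left(\frac{Y_{gk}/N_k}{Y_{gr}/N_r}\right) \tag{3.6}$$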
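The two normalizations of this section can be sketched in a few lines of Python. This is a simplified illustration and not the pipeline used in the thesis: `rpkm` assumes the transcript length is given in base pairs, so that the factor 10^9 yields reads per kilobase per million reads, and `tmm_factor` follows Equations 3.4 to 3.6 as written but trims only on the M-values, with a trim fraction of 0.3 chosen here as an assumption. All function and parameter names are hypothetical.

```python
import math


def rpkm(count, length_bp, total_reads):
    """Equation 3.1, assuming the transcript length is in base pairs."""
    return count * 1e9 / (length_bp * total_reads)


def m_value(y_gk, n_k, y_gr, n_r):
    """Equation 3.6: log2 ratio of the read proportions of one gene."""
    return math.log2((y_gk / n_k) / (y_gr / n_r))


def weight(y_gk, n_k, y_gr, n_r):
    """Equation 3.5 as written in this chapter."""
    return (n_k - y_gk) / (n_k * y_gk) + (n_r - y_gr) / (n_r * y_gr)


def tmm_factor(counts_k, counts_r, trim=0.3):
    """Equation 3.4 with a simplified trimming step (assumption).

    Genes with a zero count in either library are skipped, and the
    fraction `trim` of genes is dropped at each end of the sorted M-values.
    """
    n_k, n_r = sum(counts_k), sum(counts_r)
    genes = [
        (m_value(a, n_k, b, n_r), weight(a, n_k, b, n_r))
        for a, b in zip(counts_k, counts_r)
        if a > 0 and b > 0
    ]
    if not genes:
        raise ValueError("no gene has non-zero counts in both libraries")
    genes.sort(key=lambda mw: mw[0])
    cut = int(len(genes) * trim)
    kept = genes[cut:len(genes) - cut] if cut else genes
    numerator = sum(w * m for m, w in kept)
    denominator = sum(w for _, w in kept)
    return 2 ** (numerator / denominator)


# Hypothetical counts: library k is the reference sequenced twice as deep,
# so every M-value is zero and the factor is 1.
reference = [10, 20, 30, 40]
sample = [20, 40, 60, 80]
factor = tmm_factor(sample, reference)
```

When two libraries differ only in sequencing depth, as in the hypothetical counts above, all M-values vanish and no correction beyond library size is suggested.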

3.2 T-test and Fold Change

The t-test and fold change (FC) are two methods that are commonly used for significance testing in expression profile analysis. The two methods have different benefits and disadvantages, and it has therefore become increasingly common to apply the significance thresholds of both the t-test and FC in order to increase the confidence in a statistically significant result [26].

3.2.1 T-test

Two-population t-tests for equal means are statistical hypothesis tests commonly used for determining whether the means of two populations differ significantly from each other [27]. The first step of the t-test is to formulate the hypothesis to be tested, called the null hypothesis ($H_0$), and an appropriate alternative hypothesis ($H_1$). A common null hypothesis is that the means of two populations are equal. Since we reject the null hypothesis both if the difference is larger than zero and if it is less than zero, this is called a two-tailed hypothesis test. After formulating the hypotheses, a significance threshold $\alpha$ has to be decided in advance (usually $\alpha = 0.05$ or $\alpha = 0.01$). In order to know when to reject $H_0$, a test statistic is computed. Since there are different types of t-tests depending on the relation between the samples, the test statistic can be computed with different formulas. A p-value can be computed for any test statistic and is defined as the probability of obtaining a value at least as extreme as the observed one, given that the null hypothesis is true. The null hypothesis is rejected if the p-value is lower than the significance threshold $\alpha$.

In this thesis, two types of t-tests are used, one for independent samples and one for paired samples. Both test the null hypothesis that two samples have identical means under the assumptions that the samples are normally distributed and have equal variance.
Consider two populations with means $\mu_1$ and $\mu_2$ and standard deviations $\sigma_1$ and $\sigma_2$, where we want to test the null hypothesis that the means are equal, that is, $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 \neq \mu_2$. Let $X_1$ and $X_2$ be two samples of size $n$ drawn from these populations, with sample means $\bar{X}_1$ and $\bar{X}_2$ and sample standard deviations $S_1$ and $S_2$. If $X_1$ and $X_2$ are independent, the test statistic ($T$) computed in the t-test is defined as in equation 3.7.

Definition 3.4 (Test statistic for independent samples) Consider the two samples $X_1 = \{x_1^{(1)}, x_1^{(2)}, \ldots, x_1^{(n)}\}$ and $X_2 = \{x_2^{(1)}, x_2^{(2)}, \ldots, x_2^{(n)}\}$ with sample means $\bar{X}_1$ and $\bar{X}_2$ and sample standard deviations $S_1$ and $S_2$. The test statistic ($T$) for testing the null hypothesis $H_0: \mu_1 = \mu_2$ against the alternative hypothesis $H_1: \mu_1 \neq \mu_2$ is defined as

$$T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{S_1^2 + S_2^2}{n}}} \tag{3.7}$$

If the two samples are related (have matched observations), they are no longer independent and the test statistic has to be computed in another way. For two dependent samples $X_1$ and $X_2$, the difference between each pair of matched observations is first computed. The new sample, $X_D$, consisting of the differences is treated as a random sample of size $n$ from a population with mean $\mu_D$. The null hypothesis of equal means between the populations is then $H_0: \mu_D = 0$, and the test statistic is defined as in equation 3.9.

Definition 3.5 (Test statistic for dependent samples) Consider two dependent samples $X_1 = \{x_1^{(1)}, x_1^{(2)}, \ldots, x_1^{(n)}\}$ and $X_2 = \{x_2^{(1)}, x_2^{(2)}, \ldots, x_2^{(n)}\}$. Define the new sample $X_D$ as the difference between the samples $X_1$ and $X_2$,

$$x_D^{(i)} = x_1^{(i)} - x_2^{(i)}, \quad i = 1, \ldots, n \tag{3.8}$$

with sample mean $\bar{X}_D$ and sample standard deviation $S_D$. The test statistic ($T$) for testing the null hypothesis $H_0: \mu_D = 0$ against the alternative hypothesis $H_1: \mu_D \neq 0$ can then be computed by

$$T = \frac{\bar{X}_D}{S_D / \sqrt{n}} \tag{3.9}$$

3.2.2 Fold Change

Fold change (FC) is a measure describing how much a value changes between data sets and is often used in biological applications such as comparing the mRNA expression level between two conditions. It is rarely used outside biological applications, since it is more heuristic than the t-test. The FC evaluates the ratio between observations (mRNA expressions) of two different conditions, and all mRNAs that differ by more than a predefined threshold are considered differentially expressed [28]. A common threshold is FC = 2. This makes $\log_2(\mathrm{FC})$ a useful measure, especially for graphics, since a two-fold increase and a two-fold decrease in expression level both have absolute value 1 on the $\log_2$ scale.

Definition 3.6 (Fold Change) For two samples $X$ and $Y$, the fold change (FC) is computed by

$$\mathrm{FC} = \frac{X}{Y} \tag{3.10}$$

The log2-fold change is then defined as

$$\log_2(\mathrm{FC}) = \log_2(X) - \log_2(Y) \tag{3.11}$$

3.2.3 T-test and Fold Change for testing differential expression

The first analyses of expression data only used fold change to measure differential expression, usually with a fold change of 2 as the cutoff [26]. However, since fold change cutoffs do not take the variance of the samples into account, this is not a suitable method to use alone [28]. Instead it became common to use a statistical test such as the t-test, but these methods have disadvantages as well. If an mRNA has a very low variance, the estimated sample variance can be skewed, resulting in a large t-statistic, so that the mRNA is falsely selected as significant [28]. Hence, the most common requirement now is that differentially expressed genes satisfy both the p-value and fold-change criteria simultaneously [26].
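The quantities of this section are straightforward to compute. The Python sketch below evaluates the independent-samples statistic of Equation 3.7, a paired statistic (assuming the standard form, the mean difference divided by its standard error), the log2 fold change, and the combined selection rule. The function names, default thresholds and example values are hypothetical; p-values are taken as given, since computing them requires the t-distribution, which is outside the Python standard library.

```python
import math
from statistics import mean, stdev


def t_independent(x1, x2):
    """Equation 3.7: two independent samples of equal size n."""
    n = len(x1)
    if len(x2) != n:
        raise ValueError("this form of the statistic assumes equal sample sizes")
    s1, s2 = stdev(x1), stdev(x2)
    return (mean(x1) - mean(x2)) / math.sqrt((s1 ** 2 + s2 ** 2) / n)


def t_paired(x1, x2):
    """Paired statistic: mean of the differences over its standard error."""
    d = [a - b for a, b in zip(x1, x2)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))


def log2_fold_change(x, y):
    """Equation 3.11 for two positive expression values."""
    return math.log2(x) - math.log2(y)


def is_differentially_expressed(p_value, log2_fc, alpha=0.05, fc_cutoff=2.0):
    """Require both the p-value and the fold-change criterion to hold."""
    return p_value < alpha and abs(log2_fc) > math.log2(fc_cutoff)


# Hypothetical expression values for one gene in three matched sample pairs.
tumor = [5.0, 7.0, 9.0]
normal = [4.0, 5.0, 6.0]
t_pair = t_paired(tumor, normal)
t_ind = t_independent(tumor, normal)
```

For matched tumor and normal tissue from the same patients, the paired statistic is the relevant one; it is typically larger in magnitude than the independent statistic when the pairs are positively correlated, as in this toy example.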
3.3 Pearson correlation

Pearson correlation is a common measure of correlation which quantifies the linear relationship between two normally distributed data sets [29]. The Pearson correlation coefficient, denoted $\rho$ for populations and $r$ for sample data, is defined as in equations 3.12 and 3.13 respectively.

Definition 3.7 (Pearson population correlation coefficient) For two random variables $X$ and $Y$, the Pearson correlation coefficient is defined as

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y} \tag{3.12}$$

where $\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$ is the covariance between $X$ and $Y$.
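In practice the population coefficient is estimated from paired observations by replacing the expectations with sample sums. A minimal Python sketch of that estimate follows; the function name `pearson_r` and the expression values are hypothetical and only illustrate the inverse relationship expected between a miRNA and its target mRNA.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length n >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical expression levels of one miRNA and one target mRNA
# measured in the same four samples.
mirna = [1.0, 2.0, 3.0, 4.0]
mrna = [8.0, 6.0, 4.0, 2.0]
r = pearson_r(mirna, mrna)
```

Here the mRNA level falls linearly as the miRNA level rises, so the estimate is a perfect negative correlation.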

Definition 3.8 (Pearson sample correlation coefficient) For two samples $X = \{x_1, \ldots, x_n\}$ and $Y = \{y_1, \ldots, y_n\}$ with averages $\bar{x}$ and $\bar{y}$, the Pearson correlation coefficient is defined as

$$r_{X,Y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{3.13}$$

The Pearson correlation coefficient always lies between $-1$ and $1$, where $\pm 1$ corresponds to a perfect linear relationship (high correlation).

3.4 Multiple testing

3.4.1 Single hypothesis testing

In single hypothesis testing we wish to test some null hypothesis $H_0$ against an alternative hypothesis $H_1$. There are two types of error which can occur in this process. A type I error, also called a false positive, occurs when the null hypothesis is incorrectly rejected ($H_0$ is rejected when it is actually true). A type II error, also called a false negative, occurs when the null hypothesis incorrectly fails to be rejected ($H_0$ is not rejected even though it is false). From the hypothesis test we obtain a p-value, and $H_0$ is rejected if this value is less than or equal to a significance threshold $\alpha$, representing the acceptable type I error [30]. Hence, the p-value obtained from testing an individual hypothesis is used to control the false positive rate of the test [31].

3.4.2 The problem of multiple testing

The analysis of large data sets has increased significantly during the past years, and it is common to test thousands of features in a genome data set simultaneously against some null hypothesis [32]. When testing a single hypothesis, a 5% chance that the result is a false positive is acceptable, but when testing multiple hypotheses this results in numerous false positives: testing 10,000 true null hypotheses at $\alpha = 0.05$ yields about 500 false positives on average. Therefore, in order to retain the same overall rate of false positives, we have to reduce the acceptable error $\alpha$ for each test [33]. One way of doing this is to use a family-wise error rate.

Definition 3.9 (FWER) Let $H_1, \ldots, H_m$ be a family of hypotheses and let $p_1, \ldots, p_m$ denote the corresponding p-values. Then the family-wise error rate (FWER) is defined as the probability of falsely rejecting at least one true null hypothesis $H_i$. This can be written as $\mathrm{FWER} = \Pr(\text{rejecting at least one } H_i \mid H_i \text{ is true})$. A multiple testing procedure is said to control the FWER at a significance level of $\alpha$ if $\mathrm{FWER} \leq \alpha$.

There are many methods for controlling the FWER, and one of the most well-known is the Bonferroni procedure [33].

Theorem 3.10 (Bonferroni Correction) Let $H_1, \ldots, H_m$ be a family of hypotheses and let $p_1, \ldots, p_m$ denote the corresponding p-values. The Bonferroni correction states that rejecting the null hypothesis for all $p_i \leq \frac{\alpha}{m}$ controls the FWER at level $\alpha$.

Proof If $m$ is the number of hypotheses and $m_0$ is the number of these that are true, the FWER for the Bonferroni correction can be written as

$$\mathrm{FWER} = P\left\{ \bigcup_{i=1}^{m_0} \left( p_i \leq \frac{\alpha}{m} \right) \right\} \tag{3.14}$$

since $p_i \leq \frac{\alpha}{m}$ means that $H_i$ is rejected. By Boole's inequality we have that

$$P\left\{ \bigcup_{i=1}^{m_0} \left( p_i \leq \frac{\alpha}{m} \right) \right\} \leq \sum_{i=1}^{m_0} P\left( p_i \leq \frac{\alpha}{m} \right) \leq m_0 \frac{\alpha}{m} \leq m \frac{\alpha}{m} = \alpha \tag{3.15}$$

Hence $\mathrm{FWER} \leq \alpha$.

In many situations, controlling the FWER is too strict: it reduces not only the number of false positives but also the number of true discoveries, since biological data is derived from correlated networks [30]. In these situations there is a need for a method which identifies as many significant results as possible while retaining a relatively low proportion of false positives.

3.4.3 False Discovery Rate (FDR)

Controlling the false discovery rate (FDR) is a multiple testing approach which limits the proportion of false positives among the discoveries while sacrificing fewer true discoveries than FWER control [34]. The FDR is defined as the expected proportion of errors amongst the rejected hypotheses. Consider the problem of testing $m$ null hypotheses, of which $m_0$ are true, and let $R$ be the number of rejected null hypotheses. This can be described by the following table:

             Declared not significant   Declared significant   Total
True H_0     U                          V                      m_0
False H_0    T                          S                      m - m_0
Total        m - R                      R                      m

Table 3.1: The number of errors made when testing $m$ null hypotheses.

Here $V$, $S$, $U$ and $T$ are unobserved random variables, where $V$ is the number of false positives (type I errors), $S$ is the number of true positives, $T$ is the number of false negatives (type II errors) and $U$ is the number of true negatives [34]. If each null hypothesis is tested separately at rejection level $\alpha$, $R = R(\alpha)$ is increasing in $\alpha$, which is not desirable. We introduce $Q$ as the proportion of rejected null hypotheses which are falsely rejected. This can be written as

$$Q = \frac{V}{V + S} \tag{3.16}$$

with $Q = 0$ when no hypothesis is rejected. The FDR is then defined as the expected value of $Q$.

Definition 3.11 (FDR) The false discovery rate (FDR) is defined as the expected proportion of rejected null hypotheses which are falsely rejected. If $V$ is the number of false positives and $S$ is the number of true positives, the FDR is calculated by

$$\mathrm{FDR} = E\left[ \frac{V}{V + S} \right] = E\left[ \frac{V}{R} \right] \tag{3.17}$$

We want to keep this value below a threshold $q$, which is the acceptable proportion of false discoveries. The q-value is similar to the p-value, except that it is a measure of significance in terms of the false discovery rate rather than the false positive rate. The q-value of a particular feature can be described as the expected proportion of false positives among all features as or more extreme than the observed one [32]. The procedure for keeping the FDR below $q$ is called the Benjamini-Hochberg procedure or the FDR procedure.

Theorem 3.12 (FDR procedure) Consider testing the null hypotheses $H_1, H_2, \ldots, H_m$ based on the corresponding p-values $p_1, p_2, \ldots, p_m$. Sort the p-values in increasing order $p_{(1)}, p_{(2)}, \ldots, p_{(m)}$ and denote by $H_{(i)}$ the null hypothesis corresponding to $p_{(i)}$. Let $k$ be the largest $i \in \{1, 2, \ldots, m\}$ for which

$$p_{(i)} \leq \frac{i}{m} q \tag{3.18}$$

Then reject all hypotheses $H_{(1)}, H_{(2)}, \ldots, H_{(k)}$ [34].

3.5 Random Forest

Random forest is a machine learning method that uses decision trees for performing regression and classification [35]. Training data, consisting of dependent variables with corresponding independent variables, is used to build the model, which after training can be used to predict unknown dependent variables for a set of independent variables.

3.5.1 Regression trees and classification trees

A regression tree is a nonlinear predictive model used for predicting the values of continuous dependent variables, called response variables, from one or more predictor variables. The method starts by dividing the predictor space, the set of possible values of the predictor variables, into $J$ non-overlapping regions $R_1, \ldots, R_J$. For every observation that falls into region $R_j$, the prediction for that observation is defined as the average of the response values of all training observations in $R_j$. The regions $R_1, \ldots, R_J$ are constructed by minimizing the residual sum of squares (RSS), defined as in equation 3.19.

Definition 3.13 (Residual Sum of Squares, RSS) Given observations $X_1, \ldots, X_p$ and their corresponding response variables $y_1, \ldots, y_p$, the residual sum of squares (RSS) is defined as

$$\sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2 \tag{3.19}$$

where $\hat{y}_{R_j}$ is the average of all $y_i$ such that $X_i \in R_j$.

Since it is computationally infeasible to consider every possible division of the feature space into $J$ regions, an approach called recursive binary splitting is used.
This approach starts with all observations in one single region and then successively splits the predictor space by choosing the best split at that particular step in the process. The best split is found by selecting a predictor $X_j$ and a cut-point s such that splitting the predictor space into the regions $\{X \mid X_j < s\}$ and $\{X \mid X_j \ge s\}$ results in the smallest RSS. This process is then repeated by splitting each region in the same manner until a predefined stop criterion is reached.

Classification trees are similar to regression trees, except that the response variable is qualitative (a class). Instead of using the average of the response variables in a region as the prediction, we predict that each observation belongs to the most commonly occurring class of the training observations in that region. To construct the regions, a classification error rate is minimized instead of the RSS.
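One step of recursive binary splitting, as described above, can be sketched in a few lines. This is a minimal illustration and not the thesis's implementation: the function names, the toy data and the choice of candidate cut-points (midpoints between consecutive distinct values) are assumptions made for the example.

```python
def rss(values):
    """Residual sum of squares of the values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y):
    """Return (predictor index j, cut-point s, total RSS) of the best single split.

    X is a list of observations (each a list of predictor values) and y is the
    list of responses. Candidate cut-points are the midpoints between
    consecutive distinct values of each predictor (an illustrative assumption).
    """
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2.0
            # Regions {X | X_j < s} and {X | X_j >= s}
            left = [yi for row, yi in zip(X, y) if row[j] < s]
            right = [yi for row, yi in zip(X, y) if row[j] >= s]
            total = rss(left) + rss(right)
            if best is None or total < best[2]:
                best = (j, s, total)
    return best


# Toy data (hypothetical): the response jumps when predictor 0 exceeds about 6.5.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [10.0, 4.0], [11.0, 7.0], [12.0, 6.0]]
y = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9]
j, s, total = best_split(X, y)
print(j, s, round(total, 4))
```

A full tree would apply `best_split` again inside each of the two resulting regions until the stop criterion is met.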

[Figure 3.1: A two-dimensional example of recursive binary splitting with four cut-points. (a) Output of recursive binary splitting. (b) Tree corresponding to the recursive binary splitting. The output of the splitting (the resulting regions) can be seen in figure 3.1a. In figure 3.1b, a tree corresponding to this partition can be seen.]

Definition 3.14 (Classification error rate) The classification error rate of a region is defined as the fraction of the training observations in that region that do not belong to the most common class. So for the m:th region, the classification error rate E is defined as

$$E = 1 - \max_k (\hat{p}_{mk}) \qquad (3.20)$$

where $\hat{p}_{mk}$ represents the proportion of training observations in the m:th region that belong to class k.

3.5.2 Bagging

Bagging, also called bootstrap aggregation, is a method that reduces the variance of a statistical learning method. Since decision trees often suffer from high variance, this is a very useful method here [35]. The idea behind bagging is that averaging a set of observations reduces the variance. Hence, to reduce the variance of a statistical learning method, separate prediction models can be built using several training sets from a population, and the average of their predictions is then used. Since several independent training sets are usually not available, B sets of training data are instead created by repeatedly sampling (with replacement) from the original training set. Prediction models $\hat{f}_1(x), \ldots, \hat{f}_B(x)$ are built from each of these training sets, and averaging them results in a single model $\hat{f}_{bag}(x)$ with reduced variance:

$$\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x) \qquad (3.21)$$

This is the procedure of bagging in the context of regression. In classification, the class predicted by each of the B models for a given test observation is recorded. The majority vote is then used for predicting the class of that test observation, i.e. the most commonly predicted class is chosen as the predicted class.

3.5.3 Random Forest

Random forest is a method similar to bagging, but with an improvement that reduces the correlation between the trees. Random forest also builds decision trees from sets of training data sampled from the original training set. The difference between the methods lies in the splitting step when building these decision trees. In bagging, all p predictors are considered as candidates for every split. In random forest, however, a random sample of m predictors is chosen as candidates from the set of p predictors for each split. From these m predictors, one is then chosen for the split. In other words, there are p − m predictors that are not even considered as candidates for each split. This reduces the risk of using the same predictor for the first split in all the trees, which could happen if there is one very strong predictor compared to the others, resulting in highly correlated predictions and hence a smaller reduction in variance when the trees are averaged. Random forest is a common method for classification of samples using gene expression data because of its resemblance to the hierarchical structure of the cells and its ability to perform feature selection, which is essential when dealing with high-dimensional data sets [44].

3.6 Sensitivity, specificity and positive predictive value

Sensitivity and specificity, also known as true positive rate and true negative rate, are two statistical measures of the performance of a classification test. Sensitivity measures the proportion of the actual positives that are correctly classified and specificity measures the proportion of the actual negatives that are correctly classified [36].

Definition 3.15 Suppose that a group of patients are tested for a disease. Then we have the following terminology:

True positive (TP): the patient has the disease and the test is positive.
False positive (FP): the patient does not have the disease but the test is positive.
True negative (TN): the patient does not have the disease and the test is negative.
False negative (FN): the patient has the disease but the test is negative.

Using this terminology, the sensitivity and specificity are defined as in equations 3.22 and 3.23.

$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (3.22)$$

$$\text{Specificity} = \frac{TN}{TN + FP} \qquad (3.23)$$

The positive predictive value (PPV) is another statistical measure, commonly used for answering the question of how likely it is that a patient has a disease given that the test result is positive. It measures the proportion of the classified positives that are true positives and is the measure most commonly used by clinicians.

Definition 3.16 Given the number of true positives (TP) and false positives (FP), the positive predictive value (PPV) is defined as in equation 3.24.

$$\text{PPV} = \frac{TP}{TP + FP} \qquad (3.24)$$
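The three measures in equations 3.22 to 3.24 follow directly from the four counts of Definition 3.15. The sketch below assumes class labels given as strings, with "tumor" as the positive class; the function name and the example labels are hypothetical.

```python
def diagnostic_measures(y_true, y_pred, positive="tumor"):
    """Compute sensitivity, specificity and PPV from true and predicted labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # has the disease, test positive
            else:
                fp += 1  # no disease, test positive
        else:
            if truth == positive:
                fn += 1  # has the disease, test negative
            else:
                tn += 1  # no disease, test negative

    def ratio(num, den):
        # Undefined (None) when the denominator is zero.
        return num / den if den else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
    }


# Hypothetical example: 4 tumor samples and 6 normal samples.
y_true = ["tumor"] * 4 + ["normal"] * 6
y_pred = ["tumor", "tumor", "tumor", "normal",
          "normal", "normal", "normal", "normal", "tumor", "tumor"]
measures = diagnostic_measures(y_true, y_pred)
print(measures)
```

Here TP = 3, FN = 1, TN = 4 and FP = 2, so the sensitivity is 3/4, the specificity is 4/6 and the PPV is 3/5.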

Chapter 4 Method

4.1 Data preparation

4.1.1 Training data

The training data used for building the classification method was obtained from The Cancer Genome Atlas (TCGA) [37], a publicly available database started by the National Cancer Institute and the National Human Genome Research Institute to identify genomic changes in different types of cancer. The data contains expression data for mRNA and microRNA from 377 patients with liver hepatocellular carcinoma. Approximately 50 of these have data from both the tumor (called tumor sample) and the margin of the tumor (called normal sample), while for the others there are no samples from the margin. There is also clinical data available, such as vital status at time of report and disease-specific diagnostic information. The clinical parameter of interest for this project is the AFP level.

The expression data for both miRNA and mRNA had already been normalized by RPKM when we downloaded it. If the expression data for a specific mRNA or miRNA is missing for a patient, it has been set to zero. All miRNAs and mRNAs that are zero in more than 80% of the patients are excluded, since these are of less interest for the analysis. We also exclude a few patients from the data set for whom the initial diagnosis of HCC was shown on pathology to be another type of cancer, such as colorectal carcinoma.

This data will be used as training data for both of the previously mentioned questions, but with some alterations in the preprocessing. To be able to create a classification method for the first question, where we distinguish between tumor samples and normal samples, there has to be matched data (both tumor and normal samples) for all the patients. Since the data provided by TCGA only includes 50 patients with matched samples, all the other patients are excluded from the analysis.
After the first filtering described above and the exclusion of patients without matched samples, the data set includes matched pairs from 49 patients with expression levels for 1047 different miRNAs and mRNAs.

The data used for the second question, the analysis of AFP, has to include the level of alpha-fetoprotein for each patient. Hence, all patients for whom this information is missing are excluded from the analysis. The AFP level is regarded as low if it is less than 20 and high if it is higher than 200. Between 20 and 200 there is a grey zone where it is uncertain whether the level should be regarded as high or low, since the sensitivity and specificity of the test overlap, and therefore the few patients with an AFP level in this interval are excluded. After this filtering of the data set we have 221 patients left with expression levels from 1047 different miRNAs and mRNAs.
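The two filtering rules described above are simple enough to state as code. The sketch below is illustrative only: the data layout (one list of expression values per miRNA or mRNA, one AFP value per patient), the function names and all sample values are assumptions, and the 80% rule is read as "zero in more than 80% of the patients".

```python
def keep_feature(values, max_zero_fraction=0.8):
    """Keep a miRNA/mRNA unless more than 80% of its expression values are zero."""
    zeros = sum(1 for v in values if v == 0)
    return zeros / len(values) <= max_zero_fraction


def afp_class(afp, low_cutoff=20, high_cutoff=200):
    """Label an AFP level as 'low' or 'high'; return None in the grey zone."""
    if afp < low_cutoff:
        return "low"
    if afp > high_cutoff:
        return "high"
    return None  # between the cut-offs: patient is excluded


# Hypothetical expression values for ten patients.
expression = {
    "miRNA_A": [0, 0, 0, 0, 0, 0, 0, 0, 0, 3.1],                # 90% zeros
    "miRNA_B": [1.2, 0, 2.5, 0.7, 0, 4.0, 1.1, 0.9, 2.2, 3.3],  # 20% zeros
    "mRNA_C": [0, 0, 0, 0, 0, 0, 0, 0, 5.0, 6.0],               # exactly 80% zeros
}
kept = [name for name, values in expression.items() if keep_feature(values)]

# Hypothetical AFP levels.
afp_levels = {"patient_1": 5.0, "patient_2": 150.0, "patient_3": 1200.0}
labels = {p: afp_class(v) for p, v in afp_levels.items() if afp_class(v) is not None}
print(kept, labels)
```

With these inputs, `miRNA_A` is dropped and `patient_2` falls in the grey zone and is excluded.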

4.1.2 Validation data

This data set was supposed to consist of expression levels of mRNA and miRNA generated from 100 surgical tumor samples with matched normal samples from the surgical margins, collected over the last several years from HCC resections carried out at Karolinska University Hospital in Huddinge. Unfortunately, data from only six patients was received within the time frame of the project. These six patients were used as a validation set, but since this is a very small amount of data, a model is also built from only two thirds of the training data, with the remaining third of that data used as a validation set instead.

The samples from Karolinska were prepared and sequenced at SciLifeLab in Solna. The raw data obtained from the sequencing, consisting of FASTQ files containing all the reads, needed some preparation before any analysis could be performed. This followed a common pipeline for preparation of raw expression data, with the steps explained below. The process only had to be carried out for the miRNA data, since the corresponding preparation of the mRNA data had already been performed at SciLifeLab.

1. Removal of the 3' adapter and parsing by barcodes
Usually, the first part of the sequenced read corresponds to the miRNA sequence, followed by the adapter sequence and the barcode which were ligated in the library preparation process. The adapter sequence was found by inspection and comparison to sequences used in Illumina sequencing [38]. Since the adapter is not part of the original sequence, it has to be removed from all reads before continuing to work with the data. The removal of the 3' adapter is accomplished with the program Cutadapt [39], which searches for reads whose 3' ends align to the adapter sequence and then removes it. All the trimmed reads which are longer than 17 nucleotides are saved in a FASTQ file together with the copy number of each specific sequence (how many times that sequence was found).
The sequences shorter than 17 nucleotides are not saved, since these can correspond to sequences other than miRNA. The reads were already parsed by barcode, since every patient had a specific barcode besides the adapter, and we retrieved the data separated by patient.

2. Mapping against the genome
After the adapters have been removed, the reads are mapped against the genome. This is done by aligning the reads with a known reference sequence in a way that reveals how they are similar, and finding the location where they match best. The reference sequences used for the mapping, called annotation files and containing information on where known miRNAs are located in the genome, were downloaded from miRBase [40]. First the miRNAs in the annotation files corresponding to the appropriate species have to be chosen, in this case Homo sapiens. The next step is to match the sequenced reads to the reference obtained from these files. For mapping the reads to the genome, the program Bowtie [41] was used. Bowtie takes annotation files and sequencing reads in FASTQ format as input and outputs the set of aligned sequences in SAM format (Sequence Alignment/Map format). The SAM file contains all the reads and their aligned sequences (if the read was mapped), the mapping quality and some other information about the alignments. For some reads no alignment to the reference genome is found, and hence they are not mapped to the genome. This can be an indication of a sequencing error or of a not yet known miRNA. The mapped reads are summarized as read counts, obtained by counting all the reads mapping to each sequence (i.e. to each miRNA), to indicate which miRNAs are expressed in the different samples.

3. Normalization
To compare expression levels from different libraries, the read counts have to be normalized to compensate for the fact that different miRNAs and mRNAs have been mapped from reads of differing lengths. The training data obtained from TCGA was normalized using RPKM, but since RPKM is not the best normalization method known for miRNA expression data [25], we use TMM for the normalization of the miRNA data obtained from Karolinska. The mRNA data from Karolinska is already normalized using RPKM, as is appropriate for mRNA expression data.

The normalized data set is then filtered in the same manner as the training set: all miRNAs and mRNAs with more than 80% zeros are excluded. When the data from Karolinska is used as validation set, the miRNAs and mRNAs are compared between this data and the data from TCGA, and all the miRNAs and mRNAs that do not exist in both data sets are excluded.

4.2 Data analysis

The data analysis proceeds as follows. First, the miRNAs and mRNAs that are significantly differentially expressed between the two classes (tumor and normal, or high and low AFP) are identified. Then the Pearson correlation between microRNA and mRNA expression levels is computed for the samples in each class (tumor sample, normal sample, high AFP and low AFP). Four random forest classification models are built with different inputs: one has only miRNA as input, one has only mRNA, one has both miRNA and mRNA, and one has all correlated miRNA and mRNA. These classification models are then tested using the validation data to compare which input gives the best performance. First, two thirds of the data from TCGA is used for building the model and the rest of the data is used as validation set. After that, all of the data from TCGA is used as training data and the data from Karolinska, containing 6 patients, is used as validation set. When using the TCGA data as validation set, the two thirds used as training set are randomly chosen.
An average result of doing this 10 times is computed for higher statistical accuracy.

4.2.1 Differential expression analysis

For both of the questions previously discussed, the first step after the preprocessing of the data is to find the differentially expressed miRNAs and mRNAs. Here "differentially expressed" means that the expression level changes systematically between two conditions (such as tumor and normal tissue in the first question, or high and low AFP in the second question). The two methods used for finding the significant miRNAs and mRNAs are fold change with a cut-off at 0.5 ($|\log_2 \mathrm{FC}| > 0.5$) and a t-test with significance level α = 0.01. In the first case, the training data consists of paired samples of expression data from tumor and normal tissue from 49 patients. Hence a two-sided, paired t-test is used. In the second case, the data consists of unpaired samples of expression data from patients with a high AFP (138 patients) and patients with a low AFP (83 patients). Hence the t-test in this case is a two-sided, unpaired t-test. In both cases the obtained p-values are corrected for multiple testing using FDR. The miRNAs and mRNAs which are found to be significantly differentially expressed (i.e. have an FDR-corrected p-value of less than 0.01 and an absolute log2 fold change larger than 0.5) are chosen for building the classification model.

4.2.2 Building the random forest classification model

The random forest classification models were implemented with RandomForestClassifier from the Python package [42], using the package's default parameters. The classification models were built using four different inputs:

1. All significantly differentially expressed miRNAs
2. All significantly differentially expressed mRNAs
3. All significantly differentially expressed miRNAs and mRNAs
4. All significantly differentially expressed and correlated miRNAs and mRNAs

The models built with the first three inputs are straightforward. The expression levels of the miRNAs and/or mRNAs for each patient are used as predictors and the condition of the same patient is the corresponding response variable. First we use all the data from TCGA as training data. In the case where we compare tumor and normal samples, this results in 98 predictor vectors (one per sample), each containing 118 or […] elements (depending on whether the input is miRNA or mRNA), and 98 response variables of which 49 are tumor and 49 are normal. In the other case, where high and low levels of AFP are compared, there are 221 predictor vectors, each containing 118 or […] elements, and 221 response variables of which 138 are high AFP and 83 are low AFP. When we use two thirds of the TCGA data as training set, we have 66 predictor vectors in the first case and 148 in the second case.

When we use the correlated miRNAs and mRNAs as input, we first have to compute the correlations. Using Pearson correlation and a threshold of r > 0.7 for when a miRNA and an mRNA are considered correlated, the correlation between each mRNA-miRNA pair is computed, and the miRNAs and mRNAs which are not correlated with any of the others are excluded from the input data. In the first case this results in 98 predictor vectors with corresponding response variables as previously explained, but the predictor vectors now contain 69 elements for miRNA and 518 elements for mRNA. In the second case, the 221 predictor vectors contain 64 elements for miRNA and 89 elements for mRNA.

4.2.3 Testing the model

When the random forest classification models with the four different inputs have been built, they are tested using the validation data set.
When we use the data from Karolinska, we have a validation data set containing 12 predictor vectors (6 tumor samples and 6 normal samples), and when using one third of the data from TCGA we have 32 predictor vectors when comparing tumor and normal samples and 96 when comparing high and low AFP. The model outputs the predicted class (tumor or normal in the first case, high or low AFP in the second case) based on the input data, and since the class is already known for all the patients, this information can be used to evaluate the performance of the models. The models are evaluated by the percentage of correct classifications and by computing the sensitivity, the specificity and the positive predictive value. The number of correct classifications is an average of the numbers obtained by changing the partitioning of the TCGA data into training and test sets and by using the same input to the model 100 times.
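Returning to the selection rule of Section 4.2.1, the combination of the FDR procedure and the fold-change cut-off can be sketched as follows. The p-values are taken as given (the t-tests themselves are not shown), and the fold change is assumed here to be the ratio of mean expression between the two conditions; the function names and all numbers are hypothetical.

```python
import math


def benjamini_hochberg(pvalues, q=0.01):
    """Return the indices of the hypotheses rejected at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by increasing p
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            k = rank  # remember the largest rank satisfying the condition
    return sorted(order[:k])


def select_features(pvalues, mean_a, mean_b, q=0.01, min_abs_log2fc=0.5):
    """Indices that pass both the FDR procedure and the fold-change cut-off."""
    rejected = set(benjamini_hochberg(pvalues, q))
    selected = []
    for i in range(len(pvalues)):
        log2fc = math.log2(mean_a[i] / mean_b[i])
        if i in rejected and abs(log2fc) > min_abs_log2fc:
            selected.append(i)
    return selected


# Hypothetical p-values and mean expression levels for five miRNAs.
pvalues = [0.0001, 0.004, 0.03, 0.0002, 0.6]
tumor_means = [8.0, 2.1, 1.0, 1.0, 3.0]
normal_means = [2.0, 2.0, 4.0, 4.0, 3.0]

rejected = benjamini_hochberg(pvalues, q=0.01)
selected = select_features(pvalues, tumor_means, normal_means)
print(rejected, selected)
```

In this example three hypotheses are rejected by the FDR procedure, but the second miRNA has a log2 fold change close to zero and is therefore not selected.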

Chapter 5 Results

In the significance test for tumor and normal samples, 118 out of the 1047 miRNAs and […] of the […] mRNAs were found to be significantly differentially expressed at level α = 0.01 and with an absolute log2 fold change larger than 0.5. The results are displayed with volcano plots (scatterplots of the negative log10-transformed p-values obtained from the t-test versus the log2 fold change) in figure 5.1a for the miRNAs and figure 5.1b for the mRNAs. miRNAs and mRNAs with statistically significant differential expression according to the t-test are located above the horizontal threshold line at 2, and miRNAs and mRNAs with large fold-change values are located outside the vertical threshold lines at -0.5 and 0.5. In the second case (where we compare high and low AFP), 161 out of the 1047 miRNAs and […] of the […] mRNAs were found to be significantly differentially expressed. The volcano plots for the miRNAs and the mRNAs can be seen in figures 5.1c and 5.1d respectively. The miRNAs and mRNAs considered significant are hence located in the upper left or upper right parts of the plots (where these points are colored blue). The distributions in the tumor samples and the normal samples of some of the most significant miRNAs and mRNAs can be seen in Appendix A.

First the data from TCGA was divided into two thirds as training set and one third as validation set. Random forest classification models were built using the training set with four different inputs and then tested with the validation set. In the first case, where tumor and normal samples were compared, all the models performed very well, with over 90% correct classifications. The sensitivity, specificity and standard deviations for the four different models can be seen in blue in figure 5.2. In the second case, with high and low AFP, the models did not perform as well as in the first case and had approximately 70% correct classifications.
The sensitivity, specificity and standard deviations for these models can be seen in red in figure 5.2. The number of true and false positives and true and false negatives for all the tested models can be found in Appendix B.

When the data obtained from Karolinska was used as validation set and all the data from TCGA was used as training set, there were only approximately 40-60% correct classifications. We expected this result to be less accurate than desired due to the small amount of data in the validation set. Another problem is that the diagnosis of these six patients is not entirely determined. Because of this, we leave the results and discussion of the models using this data to further studies, when more data from Karolinska is available. The remainder of this report focuses on the results obtained using only the data from TCGA.

[Figure 5.1: Volcano plots showing the negative log10-transformed p-values obtained from the t-test and the log2 fold change for all miRNAs and mRNAs. The significantly differentially expressed ones are located in the upper left or upper right corners. Panels: (a) miRNA from tumor and normal samples, (b) mRNA from tumor and normal samples, (c) miRNA from samples with high and low AFP, (d) mRNA from samples with high and low AFP.]

[Figure 5.2: Bar charts showing the sensitivity, specificity and standard deviations for the random forest models with the four different inputs. In case 1, tumor and normal samples are compared (the blue bars) and in case 2, high and low AFP are compared (the red bars). Panels: (a) Sensitivity of all the random forest models, (b) Specificity of all the random forest models.]

In order to find out which miRNAs contributed the most to the classification, and to see if these could be used alone as input to the model without a large loss of accuracy, the variable importance, which describes how important each miRNA was for the classification, was computed. The five miRNAs that contributed the most were selected and a new random forest model was built using only these as input to see how the performance would change. The results show a small decrease in classification accuracy when using these five miRNAs, but the specificity, sensitivity and PPV are still around 0.90, which we consider to be a very good result.
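The variable-importance step can be illustrated with scikit-learn's `RandomForestClassifier` and its impurity-based `feature_importances_` attribute. This is a sketch under stated assumptions, not the thesis's code: it assumes scikit-learn is installed, uses synthetic data with 5 informative and 20 noise features in place of the real expression profiles, and the thesis does not state which importance measure was used.

```python
import random

from sklearn.ensemble import RandomForestClassifier

random.seed(0)
n_per_class, n_informative, n_noise = 40, 5, 20


def sample(shift):
    """One synthetic expression profile: shifted informative features plus noise."""
    informative = [random.gauss(shift, 1.0) for _ in range(n_informative)]
    noise = [random.gauss(0.0, 1.0) for _ in range(n_noise)]
    return informative + noise


X = [sample(0.0) for _ in range(n_per_class)] + [sample(3.0) for _ in range(n_per_class)]
y = ["normal"] * n_per_class + ["tumor"] * n_per_class

# Random split: two thirds for training, one third for validation.
indices = list(range(len(X)))
random.shuffle(indices)
cut = (2 * len(indices)) // 3
train, test = indices[:cut], indices[cut:]
X_train, y_train = [X[i] for i in train], [y[i] for i in train]
X_test, y_test = [X[i] for i in test], [y[i] for i in test]

# Model on all features, with default parameters apart from the fixed seed.
full_model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
importances = full_model.feature_importances_

# Indices of the five features with the highest importance.
top5 = sorted(range(len(importances)), key=lambda j: importances[j], reverse=True)[:5]

# Refit using only those five features and compare validation accuracy.
reduced_model = RandomForestClassifier(random_state=0).fit(
    [[row[j] for j in top5] for row in X_train], y_train
)
full_acc = full_model.score(X_test, y_test)
reduced_acc = reduced_model.score([[row[j] for j in top5] for row in X_test], y_test)
print(top5, full_acc, reduced_acc)
```

Comparing `full_acc` and `reduced_acc` mirrors the comparison described above between the model using all significant miRNAs and the model using only the five most important ones.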

Chapter 6 Discussion

6.1 Performance of the Model

The classification models built and tested only with data obtained from TCGA performed very well overall. The models built for classifying samples as tumor or normal performed better than the ones distinguishing between high and low AFP. This was expected, since the samples used for building the AFP models are all tumor samples, which affects the levels of miRNAs and mRNAs. Hence, the effect that the AFP levels have on the expression levels is diminished by the effect that the tumor has. However, we can still see a pattern in the expression levels that is linked to the AFP levels, which can be useful for further studies. The large difference in expression between tumor samples and normal samples can be seen in the distribution plots in Appendix A. There we can see that the distributions in samples with high and low AFP are more similar to each other than the distributions in tumor and normal samples, as expected.

There is only a very small difference in performance between the models with the four different inputs. Since many of the miRNAs and mRNAs are correlated, a similar result when using these as inputs to the model is expected. Hence, this shows that the model behaves in the desired and expected way.

The high prediction accuracy of the models suggests that expression levels of miRNA and mRNA could be used for diagnosis of HCC, and hopefully also for designing new therapeutics using modified RNA with binding sites for some of the miRNAs found in this project. For this purpose there has to be a smaller selection of miRNAs (the reason for this is discussed in Section 6.2, "Selection of Method"). From the variable importance analysis we know that the model performs well with only a few miRNAs as input, and hence these miRNAs could be used for such a purpose.
6.2 Selection of Method

In this project the number of variables (miRNA and mRNA) is much larger than the number of observations (patients), which makes variable selection a crucial part of the solution. One prospect of this project was that the results could be used for designing a therapeutic reagent with binding sites for miRNAs that would be toxic to tumor cells. It is not possible to use all the significantly differentially expressed miRNAs for this, since there can only be a limited number of binding sites, so we have to select a small subset of these miRNAs that still results in a good predictive performance. When using random forest it is easy to determine which variables contribute the most to the classification, and hence to build a model based only on these variables that retains a good prediction accuracy. This is the reason for choosing random forest in this project. Other methods that could have been used are k-nearest neighbors (KNN), support vector machines (SVM) and linear discriminant analysis (LDA), but several prior studies [43][44][45] have found that random forest is the most successful method for distinguishing between disease and non-disease samples using gene expression data.

6.3 Future Studies

Since the data we used from TCGA was preprocessed, and we did not have any control over how the samples were collected or over their degree of heterogeneity, we cannot be sure how reliable the results obtained using this data are. An immediate continuation of this project is to build classification models based only on data from the Karolinska hospital, in order to have full control over the collection and processing of the samples. There are 100 samples from patients at Karolinska that are currently being sequenced in order to enable such an analysis. Another possible improvement of the models could be achieved by optimizing the parameters used in the random forest models.

An interesting continuation of this project would be to study whether the miRNAs with a significant difference in expression between liver tumors and normal liver samples could be used for treatment of HCC. Future studies can examine whether modified RNA (modRNA) coding for a toxic payload can be made to bind to the miRNAs found to be lacking in the tumors and hence target only the tumor cells. Binding sites for a few of the miRNAs, the ones that were found to be the most important for the classification, can then be incorporated into the modRNA, which will make it selectively target the tumor cells.

There are some factors that have not been discussed in this thesis but that may be very important for these kinds of treatments to work. One such factor is whether the miRNAs are up- or down-regulated in the tumor cells. If binding sites for miRNAs that are up-regulated in tumor cells are incorporated in the modRNAs, the cells that are not tumor cells will be attacked instead of the tumor cells. Another important factor is the quantity of the miRNAs that are chosen.
It has to be studied how high the quantity in the normal cells has to be in order to make sure that the modRNA is bound in all the normal cells and hence does not target them, and how low the quantity has to be in the tumor cells in order to kill all of them. There are some miRNAs that are present at reasonable levels in normal liver samples but almost completely lacking in the tumor samples. These could be a good choice when selecting which miRNAs to use for developing a treatment.

Appendix A Distribution plots

The distributions of some of the miRNAs and mRNAs with the largest significant differences can be seen below. All figures show the distribution in the tumor samples in red and in the normal samples in blue. Most of the miRNAs and mRNAs are down-regulated in the tumor samples, which can be seen in the figures as a lower mean of the distribution, but there are also some that are up-regulated. Something worth noticing is the range of abundance of the miRNAs. Some appear in large numbers (e.g. Figure A.1b) while some are very rare (e.g. Figure A.1d). This could be important for further studies.

All names of the miRNAs and mRNAs in Figures A.1, A.2, A.3 and A.4 have been coded in order to protect intellectual property rights and enable future commercialization of diagnosis and treatment methods based on these discoveries.

Figure A.1 shows the distributions of some miRNAs in tumor samples and normal samples. Figure A.2 shows the distributions of some mRNAs in tumor samples and normal samples.

In the remaining figures, the distributions of some miRNAs and mRNAs in samples with high AFP (red) and low AFP (blue) are shown. The difference in distribution is smaller than in the case where we compared tumor samples and normal samples, as was seen in the results. Figure A.3 shows the distributions of some miRNAs in samples with high and low AFP. Figure A.4 shows the distributions of some mRNAs in samples with high and low AFP.

[Figure A.1: Distributions of some miRNA in tumor samples and normal samples. Panels: (a) miRNA A, (b) miRNA B, (c) miRNA C, (d) miRNA D.]

[Figure A.2: Distributions of some mRNA in tumor samples and normal samples. Panels: (a) mRNA A, (b) mRNA B, (c) mRNA C, (d) mRNA D.]

[Figure A.3: Distributions of some miRNA in samples with high and low AFP respectively. Panels: (a) miRNA E, (b) miRNA F, (c) miRNA G, (d) miRNA H.]

[Figure A.4: Distributions of some mRNA in samples with high and low AFP respectively. Panels: (a) mRNA E, (b) mRNA F, (c) mRNA G, (d) mRNA H.]


More information

Business Statistics Probability

Business Statistics Probability Business Statistics The following was provided by Dr. Suzanne Delaney, and is a comprehensive review of Business Statistics. The workshop instructor will provide relevant examples during the Skills Assessment

More information

Analysis of gene expression in blood before diagnosis of ovarian cancer

Analysis of gene expression in blood before diagnosis of ovarian cancer Analysis of gene expression in blood before diagnosis of ovarian cancer Different statistical methods Note no. Authors SAMBA/10/16 Marit Holden and Lars Holden Date March 2016 Norsk Regnesentral Norsk

More information

EpiGRAPH regression: A toolkit for (epi-)genomic correlation analysis and prediction of quantitative attributes

EpiGRAPH regression: A toolkit for (epi-)genomic correlation analysis and prediction of quantitative attributes EpiGRAPH regression: A toolkit for (epi-)genomic correlation analysis and prediction of quantitative attributes by Konstantin Halachev Supervisors: Christoph Bock Prof. Dr. Thomas Lengauer A thesis submitted

More information

CRITERIA FOR USE. A GRAPHICAL EXPLANATION OF BI-VARIATE (2 VARIABLE) REGRESSION ANALYSISSys

CRITERIA FOR USE. A GRAPHICAL EXPLANATION OF BI-VARIATE (2 VARIABLE) REGRESSION ANALYSISSys Multiple Regression Analysis 1 CRITERIA FOR USE Multiple regression analysis is used to test the effects of n independent (predictor) variables on a single dependent (criterion) variable. Regression tests

More information

MBios 478: Systems Biology and Bayesian Networks, 27 [Dr. Wyrick] Slide #1. Lecture 27: Systems Biology and Bayesian Networks

MBios 478: Systems Biology and Bayesian Networks, 27 [Dr. Wyrick] Slide #1. Lecture 27: Systems Biology and Bayesian Networks MBios 478: Systems Biology and Bayesian Networks, 27 [Dr. Wyrick] Slide #1 Lecture 27: Systems Biology and Bayesian Networks Systems Biology and Regulatory Networks o Definitions o Network motifs o Examples

More information

DNA codes for RNA, which guides protein synthesis.

DNA codes for RNA, which guides protein synthesis. Section 3: DNA codes for RNA, which guides protein synthesis. K What I Know W What I Want to Find Out L What I Learned Vocabulary Review synthesis New RNA messenger RNA ribosomal RNA transfer RNA transcription

More information

Research Methods in Forest Sciences: Learning Diary. Yoko Lu December Research process

Research Methods in Forest Sciences: Learning Diary. Yoko Lu December Research process Research Methods in Forest Sciences: Learning Diary Yoko Lu 285122 9 December 2016 1. Research process It is important to pursue and apply knowledge and understand the world under both natural and social

More information

BIOL 458 BIOMETRY Lab 7 Multi-Factor ANOVA

BIOL 458 BIOMETRY Lab 7 Multi-Factor ANOVA BIOL 458 BIOMETRY Lab 7 Multi-Factor ANOVA PART 1: Introduction to Factorial ANOVA ingle factor or One - Way Analysis of Variance can be used to test the null hypothesis that k or more treatment or group

More information

Conditional Distributions and the Bivariate Normal Distribution. James H. Steiger

Conditional Distributions and the Bivariate Normal Distribution. James H. Steiger Conditional Distributions and the Bivariate Normal Distribution James H. Steiger Overview In this module, we have several goals: Introduce several technical terms Bivariate frequency distribution Marginal

More information

Identification of Tissue Independent Cancer Driver Genes

Identification of Tissue Independent Cancer Driver Genes Identification of Tissue Independent Cancer Driver Genes Alexandros Manolakos, Idoia Ochoa, Kartik Venkat Supervisor: Olivier Gevaert Abstract Identification of genomic patterns in tumors is an important

More information

Biomarker adaptive designs in clinical trials

Biomarker adaptive designs in clinical trials Review Article Biomarker adaptive designs in clinical trials James J. Chen 1, Tzu-Pin Lu 1,2, Dung-Tsa Chen 3, Sue-Jane Wang 4 1 Division of Bioinformatics and Biostatistics, National Center for Toxicological

More information

Correlation and Regression

Correlation and Regression Dublin Institute of Technology ARROW@DIT Books/Book Chapters School of Management 2012-10 Correlation and Regression Donal O'Brien Dublin Institute of Technology, donal.obrien@dit.ie Pamela Sharkey Scott

More information

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo Please note the page numbers listed for the Lind book may vary by a page or two depending on which version of the textbook you have. Readings: Lind 1 11 (with emphasis on chapters 10, 11) Please note chapter

More information

CHAPTER III METHODOLOGY

CHAPTER III METHODOLOGY 24 CHAPTER III METHODOLOGY This chapter presents the methodology of the study. There are three main sub-titles explained; research design, data collection, and data analysis. 3.1. Research Design The study

More information

MEA DISCUSSION PAPERS

MEA DISCUSSION PAPERS Inference Problems under a Special Form of Heteroskedasticity Helmut Farbmacher, Heinrich Kögel 03-2015 MEA DISCUSSION PAPERS mea Amalienstr. 33_D-80799 Munich_Phone+49 89 38602-355_Fax +49 89 38602-390_www.mea.mpisoc.mpg.de

More information

An Improved Algorithm To Predict Recurrence Of Breast Cancer

An Improved Algorithm To Predict Recurrence Of Breast Cancer An Improved Algorithm To Predict Recurrence Of Breast Cancer Umang Agrawal 1, Ass. Prof. Ishan K Rajani 2 1 M.E Computer Engineer, Silver Oak College of Engineering & Technology, Gujarat, India. 2 Assistant

More information

Doing Quantitative Research 26E02900, 6 ECTS Lecture 6: Structural Equations Modeling. Olli-Pekka Kauppila Daria Kautto

Doing Quantitative Research 26E02900, 6 ECTS Lecture 6: Structural Equations Modeling. Olli-Pekka Kauppila Daria Kautto Doing Quantitative Research 26E02900, 6 ECTS Lecture 6: Structural Equations Modeling Olli-Pekka Kauppila Daria Kautto Session VI, September 20 2017 Learning objectives 1. Get familiar with the basic idea

More information

Binary Diagnostic Tests Two Independent Samples

Binary Diagnostic Tests Two Independent Samples Chapter 537 Binary Diagnostic Tests Two Independent Samples Introduction An important task in diagnostic medicine is to measure the accuracy of two diagnostic tests. This can be done by comparing summary

More information

Patnaik SK, et al. MicroRNAs to accurately histotype NSCLC biopsies

Patnaik SK, et al. MicroRNAs to accurately histotype NSCLC biopsies Patnaik SK, et al. MicroRNAs to accurately histotype NSCLC biopsies. 2014. Supplemental Digital Content 1. Appendix 1. External data-sets used for associating microrna expression with lung squamous cell

More information

Behavioral Data Mining. Lecture 4 Measurement

Behavioral Data Mining. Lecture 4 Measurement Behavioral Data Mining Lecture 4 Measurement Outline Hypothesis testing Parametric statistical tests Non-parametric tests Precision-Recall plots ROC plots Hardware update Icluster machines are ready for

More information

Still important ideas

Still important ideas Readings: OpenStax - Chapters 1 13 & Appendix D & E (online) Plous Chapters 17 & 18 - Chapter 17: Social Influences - Chapter 18: Group Judgments and Decisions Still important ideas Contrast the measurement

More information

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo Business Statistics The following was provided by Dr. Suzanne Delaney, and is a comprehensive review of Business Statistics. The workshop instructor will provide relevant examples during the Skills Assessment

More information

Research Questions, Variables, and Hypotheses: Part 2. Review. Hypotheses RCS /7/04. What are research questions? What are variables?

Research Questions, Variables, and Hypotheses: Part 2. Review. Hypotheses RCS /7/04. What are research questions? What are variables? Research Questions, Variables, and Hypotheses: Part 2 RCS 6740 6/7/04 1 Review What are research questions? What are variables? Definition Function Measurement Scale 2 Hypotheses OK, now that we know how

More information

3. Model evaluation & selection

3. Model evaluation & selection Foundations of Machine Learning CentraleSupélec Fall 2016 3. Model evaluation & selection Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr

More information

Gene expression analysis. Roadmap. Microarray technology: how it work Applications: what can we do with it Preprocessing: Classification Clustering

Gene expression analysis. Roadmap. Microarray technology: how it work Applications: what can we do with it Preprocessing: Classification Clustering Gene expression analysis Roadmap Microarray technology: how it work Applications: what can we do with it Preprocessing: Image processing Data normalization Classification Clustering Biclustering 1 Gene

More information

MODULE 3: TRANSCRIPTION PART II

MODULE 3: TRANSCRIPTION PART II MODULE 3: TRANSCRIPTION PART II Lesson Plan: Title S. CATHERINE SILVER KEY, CHIYEDZA SMALL Transcription Part II: What happens to the initial (premrna) transcript made by RNA pol II? Objectives Explain

More information

ChIP-seq data analysis

ChIP-seq data analysis ChIP-seq data analysis Harri Lähdesmäki Department of Computer Science Aalto University November 24, 2017 Contents Background ChIP-seq protocol ChIP-seq data analysis Transcriptional regulation Transcriptional

More information

Gene Selection for Tumor Classification Using Microarray Gene Expression Data

Gene Selection for Tumor Classification Using Microarray Gene Expression Data Gene Selection for Tumor Classification Using Microarray Gene Expression Data K. Yendrapalli, R. Basnet, S. Mukkamala, A. H. Sung Department of Computer Science New Mexico Institute of Mining and Technology

More information

CHAPTER 6 HUMAN BEHAVIOR UNDERSTANDING MODEL

CHAPTER 6 HUMAN BEHAVIOR UNDERSTANDING MODEL 127 CHAPTER 6 HUMAN BEHAVIOR UNDERSTANDING MODEL 6.1 INTRODUCTION Analyzing the human behavior in video sequences is an active field of research for the past few years. The vital applications of this field

More information

INVESTIGATION OF ROUNDOFF NOISE IN IIR DIGITAL FILTERS USING MATLAB

INVESTIGATION OF ROUNDOFF NOISE IN IIR DIGITAL FILTERS USING MATLAB Clemson University TigerPrints All Theses Theses 5-2009 INVESTIGATION OF ROUNDOFF NOISE IN IIR DIGITAL FILTERS USING MATLAB Sierra Williams Clemson University, sierraw@clemson.edu Follow this and additional

More information

Predicting Breast Cancer Survival Using Treatment and Patient Factors

Predicting Breast Cancer Survival Using Treatment and Patient Factors Predicting Breast Cancer Survival Using Treatment and Patient Factors William Chen wchen808@stanford.edu Henry Wang hwang9@stanford.edu 1. Introduction Breast cancer is the leading type of cancer in women

More information

Chapter 1: Exploring Data

Chapter 1: Exploring Data Chapter 1: Exploring Data Key Vocabulary:! individual! variable! frequency table! relative frequency table! distribution! pie chart! bar graph! two-way table! marginal distributions! conditional distributions!

More information

Ecological Statistics

Ecological Statistics A Primer of Ecological Statistics Second Edition Nicholas J. Gotelli University of Vermont Aaron M. Ellison Harvard Forest Sinauer Associates, Inc. Publishers Sunderland, Massachusetts U.S.A. Brief Contents

More information

Still important ideas

Still important ideas Readings: OpenStax - Chapters 1 11 + 13 & Appendix D & E (online) Plous - Chapters 2, 3, and 4 Chapter 2: Cognitive Dissonance, Chapter 3: Memory and Hindsight Bias, Chapter 4: Context Dependence Still

More information

Readings: Textbook readings: OpenStax - Chapters 1 13 (emphasis on Chapter 12) Online readings: Appendix D, E & F

Readings: Textbook readings: OpenStax - Chapters 1 13 (emphasis on Chapter 12) Online readings: Appendix D, E & F Readings: Textbook readings: OpenStax - Chapters 1 13 (emphasis on Chapter 12) Online readings: Appendix D, E & F Plous Chapters 17 & 18 Chapter 17: Social Influences Chapter 18: Group Judgments and Decisions

More information

Prediction of micrornas and their targets

Prediction of micrornas and their targets Prediction of micrornas and their targets Introduction Brief history mirna Biogenesis Computational Methods Mature and precursor mirna prediction mirna target gene prediction Summary micrornas? RNA can

More information

Analysis of Massively Parallel Sequencing Data Application of Illumina Sequencing to the Genetics of Human Cancers

Analysis of Massively Parallel Sequencing Data Application of Illumina Sequencing to the Genetics of Human Cancers Analysis of Massively Parallel Sequencing Data Application of Illumina Sequencing to the Genetics of Human Cancers Gordon Blackshields Senior Bioinformatician Source BioScience 1 To Cancer Genetics Studies

More information

SUMMER 2011 RE-EXAM PSYF11STAT - STATISTIK

SUMMER 2011 RE-EXAM PSYF11STAT - STATISTIK SUMMER 011 RE-EXAM PSYF11STAT - STATISTIK Full Name: Årskortnummer: Date: This exam is made up of three parts: Part 1 includes 30 multiple choice questions; Part includes 10 matching questions; and Part

More information

CHAPTER TWO REGRESSION

CHAPTER TWO REGRESSION CHAPTER TWO REGRESSION 2.0 Introduction The second chapter, Regression analysis is an extension of correlation. The aim of the discussion of exercises is to enhance students capability to assess the effect

More information

RNA SEQUENCING AND DATA ANALYSIS

RNA SEQUENCING AND DATA ANALYSIS RNA SEQUENCING AND DATA ANALYSIS Length of mrna transcripts in the human genome 5,000 5,000 4,000 3,000 2,000 4,000 1,000 0 0 200 400 600 800 3,000 2,000 1,000 0 0 2,000 4,000 6,000 8,000 10,000 Length

More information

Psychology, 2010, 1: doi: /psych Published Online August 2010 (

Psychology, 2010, 1: doi: /psych Published Online August 2010 ( Psychology, 2010, 1: 194-198 doi:10.4236/psych.2010.13026 Published Online August 2010 (http://www.scirp.org/journal/psych) Using Generalizability Theory to Evaluate the Applicability of a Serial Bayes

More information

White Paper Estimating Complex Phenotype Prevalence Using Predictive Models

White Paper Estimating Complex Phenotype Prevalence Using Predictive Models White Paper 23-12 Estimating Complex Phenotype Prevalence Using Predictive Models Authors: Nicholas A. Furlotte Aaron Kleinman Robin Smith David Hinds Created: September 25 th, 2015 September 25th, 2015

More information

Comparison of discrimination methods for the classification of tumors using gene expression data

Comparison of discrimination methods for the classification of tumors using gene expression data Comparison of discrimination methods for the classification of tumors using gene expression data Sandrine Dudoit, Jane Fridlyand 2 and Terry Speed 2,. Mathematical Sciences Research Institute, Berkeley

More information

Tutorial: RNA-Seq Analysis Part II: Non-Specific Matches and Expression Measures

Tutorial: RNA-Seq Analysis Part II: Non-Specific Matches and Expression Measures : RNA-Seq Analysis Part II: Non-Specific Matches and Expression Measures March 15, 2013 CLC bio Finlandsgade 10-12 8200 Aarhus N Denmark Telephone: +45 70 22 55 09 Fax: +45 70 22 55 19 www.clcbio.com support@clcbio.com

More information

Supplemental Figure S1. Expression of Cirbp mrna in mouse tissues and NIH3T3 cells.

Supplemental Figure S1. Expression of Cirbp mrna in mouse tissues and NIH3T3 cells. SUPPLEMENTAL FIGURE AND TABLE LEGENDS Supplemental Figure S1. Expression of Cirbp mrna in mouse tissues and NIH3T3 cells. A) Cirbp mrna expression levels in various mouse tissues collected around the clock

More information

WELCOME! Lecture 11 Thommy Perlinger

WELCOME! Lecture 11 Thommy Perlinger Quantitative Methods II WELCOME! Lecture 11 Thommy Perlinger Regression based on violated assumptions If any of the assumptions are violated, potential inaccuracies may be present in the estimated regression

More information

Chapter 11: Advanced Remedial Measures. Weighted Least Squares (WLS)

Chapter 11: Advanced Remedial Measures. Weighted Least Squares (WLS) Chapter : Advanced Remedial Measures Weighted Least Squares (WLS) When the error variance appears nonconstant, a transformation (of Y and/or X) is a quick remedy. But it may not solve the problem, or it

More information

Variant Classification. Author: Mike Thiesen, Golden Helix, Inc.

Variant Classification. Author: Mike Thiesen, Golden Helix, Inc. Variant Classification Author: Mike Thiesen, Golden Helix, Inc. Overview Sequencing pipelines are able to identify rare variants not found in catalogs such as dbsnp. As a result, variants in these datasets

More information

Breast cancer. Risk factors you cannot change include: Treatment Plan Selection. Inferring Transcriptional Module from Breast Cancer Profile Data

Breast cancer. Risk factors you cannot change include: Treatment Plan Selection. Inferring Transcriptional Module from Breast Cancer Profile Data Breast cancer Inferring Transcriptional Module from Breast Cancer Profile Data Breast Cancer and Targeted Therapy Microarray Profile Data Inferring Transcriptional Module Methods CSC 177 Data Warehousing

More information

RASA: Robust Alternative Splicing Analysis for Human Transcriptome Arrays

RASA: Robust Alternative Splicing Analysis for Human Transcriptome Arrays Supplementary Materials RASA: Robust Alternative Splicing Analysis for Human Transcriptome Arrays Junhee Seok 1*, Weihong Xu 2, Ronald W. Davis 2, Wenzhong Xiao 2,3* 1 School of Electrical Engineering,

More information

MODULE 4: SPLICING. Removal of introns from messenger RNA by splicing

MODULE 4: SPLICING. Removal of introns from messenger RNA by splicing Last update: 05/10/2017 MODULE 4: SPLICING Lesson Plan: Title MEG LAAKSO Removal of introns from messenger RNA by splicing Objectives Identify splice donor and acceptor sites that are best supported by

More information

Efficacy of the Extended Principal Orthogonal Decomposition Method on DNA Microarray Data in Cancer Detection

Efficacy of the Extended Principal Orthogonal Decomposition Method on DNA Microarray Data in Cancer Detection 202 4th International onference on Bioinformatics and Biomedical Technology IPBEE vol.29 (202) (202) IASIT Press, Singapore Efficacy of the Extended Principal Orthogonal Decomposition on DA Microarray

More information

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES Correlational Research Correlational Designs Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are

More information

ANOVA in SPSS (Practical)

ANOVA in SPSS (Practical) ANOVA in SPSS (Practical) Analysis of Variance practical In this practical we will investigate how we model the influence of a categorical predictor on a continuous response. Centre for Multilevel Modelling

More information

9 research designs likely for PSYC 2100

9 research designs likely for PSYC 2100 9 research designs likely for PSYC 2100 1) 1 factor, 2 levels, 1 group (one group gets both treatment levels) related samples t-test (compare means of 2 levels only) 2) 1 factor, 2 levels, 2 groups (one

More information

Measuring Focused Attention Using Fixation Inner-Density

Measuring Focused Attention Using Fixation Inner-Density Measuring Focused Attention Using Fixation Inner-Density Wen Liu, Mina Shojaeizadeh, Soussan Djamasbi, Andrew C. Trapp User Experience & Decision Making Research Laboratory, Worcester Polytechnic Institute

More information

12.1 Inference for Linear Regression. Introduction

12.1 Inference for Linear Regression. Introduction 12.1 Inference for Linear Regression vocab examples Introduction Many people believe that students learn better if they sit closer to the front of the classroom. Does sitting closer cause higher achievement,

More information

Supplementary Figures

Supplementary Figures Supplementary Figures Supplementary Figure 1. Pan-cancer analysis of global and local DNA methylation variation a) Variations in global DNA methylation are shown as measured by averaging the genome-wide

More information

Single SNP/Gene Analysis. Typical Results of GWAS Analysis (Single SNP Approach) Typical Results of GWAS Analysis (Single SNP Approach)

Single SNP/Gene Analysis. Typical Results of GWAS Analysis (Single SNP Approach) Typical Results of GWAS Analysis (Single SNP Approach) High-Throughput Sequencing Course Gene-Set Analysis Biostatistics and Bioinformatics Summer 28 Section Introduction What is Gene Set Analysis? Many names for gene set analysis: Pathway analysis Gene set

More information

Problem #1 Neurological signs and symptoms of ciguatera poisoning as the start of treatment and 2.5 hours after treatment with mannitol.

Problem #1 Neurological signs and symptoms of ciguatera poisoning as the start of treatment and 2.5 hours after treatment with mannitol. Ho (null hypothesis) Ha (alternative hypothesis) Problem #1 Neurological signs and symptoms of ciguatera poisoning as the start of treatment and 2.5 hours after treatment with mannitol. Hypothesis: Ho:

More information

Binary Diagnostic Tests Paired Samples

Binary Diagnostic Tests Paired Samples Chapter 536 Binary Diagnostic Tests Paired Samples Introduction An important task in diagnostic medicine is to measure the accuracy of two diagnostic tests. This can be done by comparing summary measures

More information

Empirical Knowledge: based on observations. Answer questions why, whom, how, and when.

Empirical Knowledge: based on observations. Answer questions why, whom, how, and when. INTRO TO RESEARCH METHODS: Empirical Knowledge: based on observations. Answer questions why, whom, how, and when. Experimental research: treatments are given for the purpose of research. Experimental group

More information

6. Unusual and Influential Data

6. Unusual and Influential Data Sociology 740 John ox Lecture Notes 6. Unusual and Influential Data Copyright 2014 by John ox Unusual and Influential Data 1 1. Introduction I Linear statistical models make strong assumptions about the

More information

RNA-Seq Preparation Comparision Summary: Lexogen, Standard, NEB

RNA-Seq Preparation Comparision Summary: Lexogen, Standard, NEB RNA-Seq Preparation Comparision Summary: Lexogen, Standard, NEB CSF-NGS January 22, 214 Contents 1 Introduction 1 2 Experimental Details 1 3 Results And Discussion 1 3.1 ERCC spike ins............................................

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introduction 1.1 Motivation and Goals The increasing availability and decreasing cost of high-throughput (HT) technologies coupled with the availability of computational tools and data form a

More information

Study Guide for the Final Exam

Study Guide for the Final Exam Study Guide for the Final Exam When studying, remember that the computational portion of the exam will only involve new material (covered after the second midterm), that material from Exam 1 will make

More information

Sheila Barron Statistics Outreach Center 2/8/2011

Sheila Barron Statistics Outreach Center 2/8/2011 Sheila Barron Statistics Outreach Center 2/8/2011 What is Power? When conducting a research study using a statistical hypothesis test, power is the probability of getting statistical significance when

More information

SPRING GROVE AREA SCHOOL DISTRICT. Course Description. Instructional Strategies, Learning Practices, Activities, and Experiences.

SPRING GROVE AREA SCHOOL DISTRICT. Course Description. Instructional Strategies, Learning Practices, Activities, and Experiences. SPRING GROVE AREA SCHOOL DISTRICT PLANNED COURSE OVERVIEW Course Title: Basic Introductory Statistics Grade Level(s): 11-12 Units of Credit: 1 Classification: Elective Length of Course: 30 cycles Periods

More information

Applied Statistical Analysis EDUC 6050 Week 4

Applied Statistical Analysis EDUC 6050 Week 4 Applied Statistical Analysis EDUC 6050 Week 4 Finding clarity using data Today 1. Hypothesis Testing with Z Scores (continued) 2. Chapters 6 and 7 in Book 2 Review! = $ & '! = $ & ' * ) 1. Which formula

More information

A NEW DIAGNOSIS SYSTEM BASED ON FUZZY REASONING TO DETECT MEAN AND/OR VARIANCE SHIFTS IN A PROCESS. Received August 2010; revised February 2011

A NEW DIAGNOSIS SYSTEM BASED ON FUZZY REASONING TO DETECT MEAN AND/OR VARIANCE SHIFTS IN A PROCESS. Received August 2010; revised February 2011 International Journal of Innovative Computing, Information and Control ICIC International c 2011 ISSN 1349-4198 Volume 7, Number 12, December 2011 pp. 6935 6948 A NEW DIAGNOSIS SYSTEM BASED ON FUZZY REASONING

More information

Readings: Textbook readings: OpenStax - Chapters 1 11 Online readings: Appendix D, E & F Plous Chapters 10, 11, 12 and 14

Readings: Textbook readings: OpenStax - Chapters 1 11 Online readings: Appendix D, E & F Plous Chapters 10, 11, 12 and 14 Readings: Textbook readings: OpenStax - Chapters 1 11 Online readings: Appendix D, E & F Plous Chapters 10, 11, 12 and 14 Still important ideas Contrast the measurement of observable actions (and/or characteristics)

More information

The Biology and Genetics of Cells and Organisms The Biology of Cancer

The Biology and Genetics of Cells and Organisms The Biology of Cancer The Biology and Genetics of Cells and Organisms The Biology of Cancer Mendel and Genetics How many distinct genes are present in the genomes of mammals? - 21,000 for human. - Genetic information is carried

More information

Prediction of Malignant and Benign Tumor using Machine Learning

Prediction of Malignant and Benign Tumor using Machine Learning Prediction of Malignant and Benign Tumor using Machine Learning Ashish Shah Department of Computer Science and Engineering Manipal Institute of Technology, Manipal University, Manipal, Karnataka, India

More information

Small RNAs and how to analyze them using sequencing

Small RNAs and how to analyze them using sequencing Small RNAs and how to analyze them using sequencing RNA-seq Course November 8th 2017 Marc Friedländer ComputaAonal RNA Biology Group SciLifeLab / Stockholm University Special thanks to Jakub Westholm for

More information

A Brief Introduction to Bayesian Statistics

A Brief Introduction to Bayesian Statistics A Brief Introduction to Statistics David Kaplan Department of Educational Psychology Methods for Social Policy Research and, Washington, DC 2017 1 / 37 The Reverend Thomas Bayes, 1701 1761 2 / 37 Pierre-Simon

More information

Day 11: Measures of Association and ANOVA

Day 11: Measures of Association and ANOVA Day 11: Measures of Association and ANOVA Daniel J. Mallinson School of Public Affairs Penn State Harrisburg mallinson@psu.edu PADM-HADM 503 Mallinson Day 11 November 2, 2017 1 / 45 Road map Measures of

More information

1 The conceptual underpinnings of statistical power
