EpiGRAPH regression: A toolkit for (epi-)genomic correlation analysis and prediction of quantitative attributes


EpiGRAPH regression: A toolkit for (epi-)genomic correlation analysis and prediction of quantitative attributes

by

Konstantin Halachev

Supervisors: Christoph Bock, Prof. Dr. Thomas Lengauer

A thesis submitted in conformity with the requirements for the degree of Master of Science
Computer Science Department
Saarland University
September 2006


Abstract

EpiGRAPH regression: A toolkit for (epi-)genomic correlation analysis and prediction of quantitative attributes
Konstantin Halachev
Master of Science, Department of Computer Science, Saarland University, 2006

Five years ago, the human genome sequence was published, an important milestone towards understanding human biology. However, basic cell processes cannot be explained by the genome sequence alone. Instead, further layers of control such as the epigenome will be important for significant advances towards better understanding of normal and disease-related phenotypes. A new research field in computational biology is currently emerging that is concerned with the analysis of functional information beyond the human genome sequence. Our goal is to provide biologists with means to navigate the large amounts of epigenetic data and with tools to screen these data for biologically interesting associations. We developed a statistical learning methodology that facilitates mapping of epigenetic data against the human genome, identifies areas of over- and underrepresentation, and finds significant correlations with DNA-related attributes. We implemented this methodology in a software toolkit called EpiGRAPH regression. EpiGRAPH regression is a prototype of a genome analysis tool that enables the user to analyze relationships between many attributes, and it provides a quick test whether a newly analyzed attribute can be efficiently predicted from already known attributes. Thereby, EpiGRAPH regression may significantly speed up the analysis of new types of genomic and epigenomic data.


I hereby declare that this thesis is entirely my own work except where otherwise indicated. I have used only the resources given in the list of references.

Konstantin Halachev
September 28, 2006


Acknowledgments

It is a pleasure to thank the many people who made this thesis possible. First and foremost, I would like to thank my thesis advisor, Christoph Bock, for his continuous guidance throughout the duration of this project. Our numerous scientific discussions and his many constructive comments greatly improved this work. I also thank him for letting me use the DNA melting temperature data that is not publicly available. But mostly, I thank him for showing me how to be a (successful) researcher.

I thank Prof. Thomas Lengauer for stirring up my enthusiasm for research in the field of computational biology and for inviting me to work in his group. The people of the group have been very friendly and open to discussions. I want to thank Joachim Büch for providing me with dedicated technical support. I acknowledge the International Max Planck Research School for Computer Science (IMPRS) program and its coordinator Kerstin Meyer-Ross for the financial support during my master studies. I also thank all my friends for their good company and for showing occasional interest in my work. And of course, I thank my mother, my father and my brother for always supporting me.

Last but not least, I want to say a big thank you to Laura Toloşi for her complete review of this manuscript, for her understanding and for encouraging me when it was most needed.


Contents

1 Introduction
  1.1 Problem statement
  1.2 Motivation
  1.3 Related Work
  1.4 Contribution

2 Basic Biological Background
  2.1 The Human Genome
  2.2 The Human Epigenome

3 EpiGRAPH regression - Methods and Implementation
  3.1 Methods
    Bivariate Analysis
    Multivariate Correlation Analysis
    Multivariate Prediction
  3.2 Design and Implementation
    Implementation
    Attribute management
    Visualization
    Software Structure

4 Results and Analysis
  Simulation Study
  Evolutionary Conservation vs. Melting Temperature
  Tissue-specific DNA Methylation Patterns

5 Conclusion and Future Work
  Contributions
  Future Work

A EpiGRAPH regression User Guide
  A.1 General
  A.2 Attributes
  A.3 Analysis
    A.3.1 Bivariate Analysis
    A.3.2 Multivariate Analysis Correlation
    A.3.3 Multivariate Analysis Prediction
    A.3.4 Compute
    A.3.5 List
    A.3.6 Extract
    A.3.7 Help

B Digital Attachments


Chapter 1

Introduction

1.1 Problem statement

Creating a unified functional view of the human genome will be an essential milestone in computational biology research. This involves understanding the relationship between the information encoded in the genome and the biological functions of the human organism (the genotype-phenotype relation). The human genome is organized into a highly complex chemical and logical structure. Identifying global genomic and epigenomic patterns of this structure is key towards building a detailed biological model. Regions in which we observe deviations from these patterns may indicate a relation to specific biological functions. Identifying these regions is important for selecting markers that characterize the onset and development of diseases, e.g. cancer.

In this work, our goal is to integrate data sets representing different genomic and epigenomic properties into a comprehensive view of the genome. To that end, we develop a methodology to analyze interrelations between such properties. Furthermore, we aim at identifying genomic regions where the local mutual behavior of the chosen properties deviates from the global trend, since such behavior might point to biological function.

1.2 Motivation

The research field of computational biology is mainly driven by the available experimental data. The increasing amount of experimental data measuring different genome-related properties justifies the initiation of many research projects in the field. Each improvement of the technologies used in the laboratories increases the quality and quantity of the measurements. The natural next step is to analyze the data and extract meaningful biological information from it. Such information may then be validated and used in drug design or disease treatment.

After the sequencing of the entire human genome in 2001, multiple large-scale genomic and epigenomic projects were launched (e.g. ENCODE [21], HapMap [22], the Human Epigenome Project [8]). These projects produce vast amounts of data mapped onto the human genome, which are too large for effective manual screening and analysis. However, the publicly available genome-data tools focus mainly on the visualization and integration of these data sets and provide little help for in-depth analysis. We believe that a toolkit that allows managing large data sets, analyzing mutual dependencies and using machine-learning techniques to identify function-related genome regions is the next step towards a unified view of the human genome.

1.3 Related Work

The currently most important publicly available servers for managing data sets mapped onto the genome are the UCSC Genome Browser [14] and Ensembl [2]. These genome browsers enable researchers to visualize and browse entire genome sequences of many different species with annotated information including gene prediction and structure, proteins, expression, regulation, variation, comparative analysis, etc. These annotations usually come from more than one source. The tools provide zoom, scroll and various other ways to visualize genome-related measurements. However, they do not include data mining functionality.

EpiGRAPH class [6], a recent web-based data mining tool proposed by Christoph Bock (2006), allows a test and a control set of genome regions to be analyzed in the context of over 1,000 attributes that relate to all aspects of DNA and genomic organization (isochores, repeats, epigenetic information, transcription factor binding sites, evolutionary conservation). This tool makes use of a variety of statistical learning methods (e.g. support vector machines) and allows the identification of attributes and groups of attributes that play an important role in distinguishing between the test and the control set of genome regions. In recent work [7], this tool achieved above 90% success rate in predicting DNA methylation in blood lymphocytes from DNA sequence, repetitive DNA motifs and predicted DNA structure.

1.4 Contribution

In this work, we propose and present a software toolkit that provides functionality to store, integrate and analyze large genomic and epigenomic data sets. It provides automated screening methods for the initial analysis of genomic and epigenomic data sets. We propose a method for quantifying the dependency between two genome-wide properties and identifying interesting local regions. Furthermore, we extend that method to estimate the more complex dependency between a specific measurement and a set of other measurements.

Furthermore, we propose methods to predict missing values of such genome-related properties in the case when they depend on other data sets. We implement these methods in a software toolkit called EpiGRAPH regression that integrates a large number of currently available genomic and epigenomic data sets and allows easy integration and analysis of new data sets. We validate the toolkit on simulated data. Furthermore, we use the EpiGRAPH regression toolkit to analyze the dependencies between methylation patterns measured in different tissue types and to estimate the dependency between evolutionary conservation and DNA melting temperature.

This work introduces a methodology and software toolkit for the automated screening and analysis of genomic and epigenomic data sets. In Chapter 2, we discuss the basic biological notions used in this work. We explain in detail the methods used and the key implementation decisions in Chapter 3. Validation of the toolkit on simulated data together with the analysis of experimental data is presented in Chapter 4. This work concludes with a discussion and future work in Chapter 5.

Chapter 2

Basic Biological Background

An important goal in biomedical research is to identify markers that characterize disease onset and development. Deviation from expected genomic and epigenomic patterns is an important characteristic of such markers. However, genomic and epigenomic patterns are frequently hard to detect and characterize due to the size and complexity of the experimental data. Bioinformatics can help by providing efficient algorithms and applications to mine these data in search of specific patterns and deviations from them. Section 2.1 introduces basic genome-related notions. A detailed explanation of several epigenomic mechanisms is presented in Section 2.2. The biological mechanisms introduced in this chapter are described in more detail in [4].

2.1 The Human Genome

This section discusses basic notions about the human genome that are necessary for understanding the data and the methodology proposed later in this work.

DNA

The hereditary genetic information of an organism is contained in its genome. The genetic material is organized as DNA (deoxyribonucleic acid) molecules, contained in the nucleus of every cell. The DNA is a double helix formed of two paired sequences of nucleotides of the following four types: adenine (A), thymine (T), cytosine (C) and guanine (G) (see Figure 2.1). The main force promoting the formation of the helix is complementary base pairing (see Figure 2.2): adenines form hydrogen bonds with thymines, and cytosines form hydrogen bonds with guanines.

The human DNA is 3 billion base pairs long. It is packed into macromolecules called chromosomes. In humans, the DNA is organized into 23 pairs of chromosomes. Each chromosome consists of a single DNA molecule associated with structural macromolecules called histone proteins that fold and pack the fine DNA thread into a more compact structure (see Figure 2.3).

Figure 2.1: Chemical structures of the four main nucleotides

Figure 2.2: DNA double helix structure and base pairs

Genes and regulatory DNA sequences

The DNA contains the genetic specifications of all biological processes of a cell. Of particular interest are the genes, specific DNA subsequences of ,000 base pairs in length. Genes are transcribed into single-stranded nucleic acid macromolecules called RNA, which can be further translated into proteins that perform most of the biological functions of the cell (see Figure 2.4). The transcription of genes into RNA and their further translation into proteins is regulated by the needs of the organism.

The genes in the human genome have an average size of 27,000 nucleotide pairs. A typical gene carries in its sequence of nucleotides the information for the sequence of the amino acids of a protein.

Figure 2.3: The various layers of chromosome structure

Only about 1,300 nucleotide pairs are required to encode an average-size protein. Most of the remaining DNA in a gene consists of long stretches of non-coding DNA, which interrupt relatively short segments of DNA that code for proteins. The coding sequences are called exons; the non-coding sequences are called introns. In addition to introns and exons, each gene is associated with regulatory DNA sequences, which are responsible for ensuring that the gene is expressed at the proper level and time.

2.2 The Human Epigenome

Cells from different tissues within the human body share the same genome sequence, but exhibit diverse phenotypes. This is largely caused by tissue-specific differences in gene expression, which are, in turn, governed by regulatory mechanisms that do not depend exclusively on the genome sequence. In this section, we discuss two epigenetic mechanisms that influence transcription via changes of the structural organization of the genome within the cell nucleus: DNA methylation and histone modifications (see Figure 2.5).

Figure 2.4: The process of producing a protein from a gene

DNA methylation

An important mechanism for the regulation of gene transcription is DNA methylation. Genes are transcribed into RNA and further translated into proteins by chemical processes which involve specific transcription factor proteins. These proteins associate with specific DNA sequences. Modification of these DNA sequences provides a mechanism for the regulation of gene expression. One such mechanism is the attachment of a methyl group to a cytosine nucleotide, in certain situations when a cytosine nucleotide is followed by a guanine nucleotide (CpG pattern). Patterns of DNA methylation are inherited by daughter DNA strands as a result of semi-conservative DNA replication. DNA methylation is found mainly in transcriptionally silent regions of the genome. Cells contain a family of proteins that bind to methylated DNA sequences. These proteins, in turn, interact with chromatin complexes and histone deacetylases that condense the chromatin so it becomes transcriptionally inactive (see Figure 2.6).

Genomic imprinting is another cell mechanism that involves DNA methylation. Mammalian cells are diploid, containing one set of genes inherited from the father and one set from the mother. In a few cases the expression of a gene has been found to depend on whether it is inherited from the mother or the father, a phenomenon called genomic imprinting. During the formation of the germ cells, genes subject to imprinting are marked by methylation according to whether they are present in a sperm or an egg. In this way, the parental origin of the gene can be subsequently detected in the embryo. DNA methylation is thus used as a mark to distinguish two copies of a gene that may be otherwise identical.

Figure 2.5: The two main epigenomic mechanisms: DNA methylation and histone modifications

Histone modifications

All eukaryotic organisms have elaborate ways of packaging DNA into chromosomes. This compression is performed by proteins that successively coil and fold the DNA into higher levels of organization. This dynamic structure allows rapid, on-demand access to specific DNA sequences. The proteins that are involved in the packaging of the DNA are divided into two general classes: the histones and the non-histone chromosomal proteins. Histones are responsible for the first and most basic level of chromosome organization, the nucleosome. It consists of a segment of DNA wound around a protein core formed from histones, called the nucleosome core. Each nucleosome core particle is separated from the next one by a region of linker DNA (see Figure 2.7). Histones can undergo posttranslational modifications, which alter their interaction with the DNA. The H3 and H4 types of histones have long tails protruding from the nucleosome, which can be covalently modified at several places.

Figure 2.6: DNA methylation is a mechanism for silencing genome regions

Figure 2.7: Nucleosome structure. A DNA strand wound around histone proteins forms a nucleosome core. Subsequent nucleosome cores are connected by linker DNA

The core of the histones can also be modified. Combinations of such modifications are thought to constitute a code, the so-called histone code. These covalent modifications of the histones influence the structure of the chromosome and the accessibility of specific DNA sequence regions and are important for gene regulation and DNA repair processes.

Chapter 3

EpiGRAPH regression - Methods and Implementation

In this chapter, we present in detail the methods and structure of EpiGRAPH regression. EpiGRAPH regression is a software toolkit that extends the functionality of the available public genome browsers (e.g. the UCSC Genome Browser [14], Ensembl [2]). It allows the management and data mining of large sets of experimental genome-wide measurements. It provides the user with functionality to identify global and local dependency patterns between different genome-related properties.

This chapter is divided into two main sections. In Section 3.1, we formulate the general problems that EpiGRAPH regression addresses, together with the approaches proposed. We provide a detailed explanation of the EpiGRAPH regression software structure and implementation in Section 3.2.

3.1 Methods

In order to enable the user to screen and analyze specific properties of the genome and the epigenome, the software has to process quantitative measurements. We refer to numerical genome-wide measurements with the notion of attributes:

Definition. An attribute is a finite sequence of tuples of the form (chromosome, start, end, score). Each such tuple uniquely identifies a genome region through its parameters (chromosome, start, end), and score is the corresponding measurement for this region.

A few examples of genome-wide attributes are given below:

DNA methylation. An epigenetic property that measures the number of methylated DNA bases in a genome region.

Histone modification. Another epigenetic property that represents the modifications of the histone proteins in a specific DNA region.

Evolutionary conservation. A DNA property that identifies the level of conservation of a region with respect to evolution.

DNA melting temperature. A DNA property that gives the temperature needed for the two strands of the DNA to dissociate.

Bivariate Analysis

Much research has been conducted on analyzing specific genomic attributes (e.g. the recent publications [15], [26]). More information regarding specific cell processes can be obtained by understanding the mutual dependencies between multiple attributes. We propose an automated method for bivariate analysis that addresses the problem of detecting mutual dependencies between two attributes.

Problem Description. Suppose we have two attributes A and B. The bivariate analysis problem is to quantify the overall dependency of A and B and to identify regions where the local dependency deviates significantly.

Biological motivation. Identifying the overall dependency quantifies how much these attributes are interrelated in different cell processes. For example, if two attributes are independent of each other on a global level, but highly dependent in a specific region, then this region might be related to a biological function.

Bivariate analysis - our solution. We propose a three-step algorithm: first, we choose a measure to quantify the overall dependency of the attributes. Subsequently, we investigate local behavior by analyzing selected regions of the genome. We then choose the most relevant of these regions based on statistical significance. A challenge of the overall method is the required computational effort.

Quantifying the overall dependency

The first step in our method is to quantify the overall dependency between two attributes. An appropriate choice for this purpose is the Pearson correlation coefficient. It is recommended in [5] for estimating the dependency between two random numerical vectors. Furthermore, this choice is supported by the wide use of the Pearson correlation coefficient in the research field of bioinformatics.

Intuitively, the Pearson correlation coefficient (or simply correlation coefficient) measures the strength of the linear dependency between two random variables. A positive correlation implies that when one variable is above its mean, the other one also tends to be, and likewise for both tending to be below their means. A negative correlation implies that when one variable is above its mean, the other one tends to be below its mean, and vice versa. We consider our attribute scores as samples drawn from the distributions of these two random variables.

Formally, the correlation coefficient ρ is defined as follows:

\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}, \qquad (3.1.1)

where X and Y are two random variables with expected values μ_X and μ_Y and standard deviations σ_X and σ_Y, respectively. The correlation coefficient has real values in the range [−1, 1]. A value of 1 shows perfect positive linear dependency and −1 shows perfect negative linear dependency between the variables. The values in between indicate the degree of the dependency. In the case of linearly independent random variables the correlation coefficient is 0.

In practice, we do not know the underlying distributions f(x) and g(y) of the random variables X and Y, and we cannot compute the corresponding means and standard deviations. Therefore, when only samples drawn from these distributions are available, we use an estimator ρ̂ of the correlation coefficient. If x_1, x_2, ..., x_n are i.i.d. samples from f and y_1, y_2, ..., y_n are i.i.d. samples from g, then

\hat{\rho}_{X,Y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y}, \qquad (3.1.2)

where

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \quad \text{and} \quad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i

are the sample means and

s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \quad \text{and} \quad s_y = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2}

are the sample standard deviations.

Identifying regions of deviating behavior

The next problem we address is that of detecting genomic regions where the dependency between the two attributes of interest is different from the global trend. However, since the number of all possible genomic regions can be as large as 10^18, it is impossible to compute all the corresponding correlation coefficients. Hence, we need to develop a region sampling strategy, which selects a computationally reasonable amount of regions. Moreover, the correlations of the attributes for this restricted set of regions should approximate well the correlation distribution over the whole genome. We present our region sampling strategy in the first part of this section. Subsequently, we need to select the relevant regions from the sample set, which reduces to solving a statistical significance problem. We present our solution in the second part of this section.
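To make Equation 3.1.2 concrete before turning to the sampling strategy, the following minimal Python sketch computes the sample correlation of two equally long score vectors. It is an illustration only and not code taken from the EpiGRAPH regression toolkit; the function name and the handling of constant vectors are our own choices.

import math

def pearson_correlation(x, y):
    # Sample estimate of the correlation coefficient (Equation 3.1.2)
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / float(n)
    mean_y = sum(y) / float(n)
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if ss_x == 0.0 or ss_y == 0.0:
        return 0.0  # a constant attribute carries no linear signal
    # Equal to ss_xy / ((n - 1) * s_x * s_y): the (n - 1) factors cancel
    return ss_xy / math.sqrt(ss_x * ss_y)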

Region sampling strategy. Every genomic region is uniquely defined by its starting position and size. Our approach is to restrict the possible region sizes to a fixed set of values and also restrict the starting positions using overlapping windows. Formally, let N denote the size of the genome, S ⊆ {1, 2, ..., N} be a set of region sizes and f : S → [0, 1) be a function defining the overlap for a fixed region size. Then for a certain region size s ∈ S we inspect all regions of the form:

[\,i \cdot s \cdot (1 - f(s)),\; i \cdot s \cdot (1 - f(s)) + s\,], \quad i = 0, 1, 2, \ldots \qquad (3.1.3)

In this manner, the number of inspected regions is reduced from O(N^2) to

O\left( \sum_{s \in S} \frac{N}{s} \cdot \frac{1}{1 - f(s)} \right).

Assessing significance. The region sampling procedure outputs a set of regions, for which we compute the corresponding correlation coefficients. In order to identify deviating local behavior, we analyze the distribution of the correlations. However, it is often not meaningful to compare correlation coefficients of regions with different sizes. That is because correlations in larger regions tend to depend on properties of the higher-order structure of the chromosome (see Figure 2.3), while correlations in smaller regions tend to depend on local chemical and sequence properties. Furthermore, genome regions associated with a common function rarely have large differences in size. For this reason, we perform a separate significance analysis for each specific window size.

We use an established method for assessing statistical significance based on z-score cutoffs. Each correlation coefficient c from the inspected sample is associated with a standard score z := (c − μ)/σ, where μ is the sample mean and σ is the sample standard deviation. A region with z-score z is called significant w.r.t. a cutoff q if |z| ≥ q. However, the z-score significance test is based on the assumption that the underlying distribution of the sample is Gaussian. As an alternative, we propose using quantile cutoffs, a method which does not require a prior distribution assumption. A p-quantile (p ∈ [0, 1]) is a value x_p which divides the distribution of a random variable X such that:

P(X \le x_p) \ge p \quad \text{and} \quad P(X \ge x_p) \ge 1 - p.

There are several methods for estimating the quantiles of a sample population, which we use to obtain the p-quantile value x_p, for example the empirical distribution function, the empirical distribution function with averaging, or a weighted average. Then, we call a region with correlation coefficient c significant w.r.t. a p-quantile cutoff if c ≥ x_p (if p ≥ 0.5) or c ≤ x_p (if p < 0.5).
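The window enumeration of Equation 3.1.3 and the z-score filter can be sketched in a few lines of Python. This is an illustration under our own naming choices rather than the toolkit's actual implementation; the overlap argument stands for the fraction f(s).

def sample_windows(genome_size, window_size, overlap):
    # Enumerate the overlapping windows of Equation 3.1.3 for one window size;
    # overlap is the fraction f(s) in [0, 1).
    step = max(1, int(window_size * (1.0 - overlap)))
    windows = []
    start = 0
    while start + window_size <= genome_size:
        windows.append((start, start + window_size))
        start += step
    return windows

def significant_regions(regions, correlations, z_cutoff):
    # Keep regions whose correlation deviates from the sample mean
    # by at least z_cutoff sample standard deviations.
    n = len(correlations)
    mean = sum(correlations) / float(n)
    sd = (sum((c - mean) ** 2 for c in correlations) / float(n - 1)) ** 0.5
    selected = []
    for region, c in zip(regions, correlations):
        if sd > 0 and abs(c - mean) / sd >= z_cutoff:
            selected.append((region, c))
    return selected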

Bivariate analysis algorithm

We summarize the bivariate analysis algorithm in Algorithm 1.

Algorithm 1 Bivariate Analysis
Input: Two attributes A = (a_1, a_2, ..., a_n) and B = (b_1, b_2, ..., b_n), a window size w, an overlap k, a z-score cutoff q
Output: A list of significant regions R_0 and their corresponding correlation coefficients C_0
1: Based on Equation 3.1.3, select a set of overlapping windows R = (r_1, r_2, ..., r_m)
2: Compute a correlation coefficient for each region from R using Equation 3.1.2, C = (c_1, c_2, ..., c_m)
3: Compute the associated z-score for each correlation coefficient, Z = (z_1, z_2, ..., z_m)
4: Select those regions r_i ∈ R for which |z_i| ≥ q, add them to R_0 and add the corresponding correlation coefficients to C_0

Multivariate Correlation Analysis

Each individual biological attribute potentially stores important functional information. We believe that function-related information is also contained in the characteristic interplay of genomic attributes. The bivariate analysis described in the previous section provides us with the possibility to detect global and local dependencies of two attributes. However, one might be interested in analyzing more complex dependencies that involve a larger number of attributes. For that purpose, we need an automated method (which we call multivariate correlation analysis) for quantifying global and local dependencies between an attribute and a group of attributes. This section describes the multivariate correlation analysis method in detail.

Problem Description. Let A be a target attribute and B = {B_1, B_2, ..., B_k} a group of attributes. The multivariate correlation analysis problem is to quantify the overall dependency between the target A and the properties B and to identify regions where the local dependency deviates significantly.

Biological motivation. The overall dependency between the target A and the properties B quantifies their interrelation in different cell processes. For example, suppose we want to understand the dependency between a target attribute A and the genome sequence. The genome sequence cannot be represented efficiently in one attribute. However, we might represent its behavior to some extent with DNA sequence patterns. A DNA sequence pattern of size k is a fixed sequence of k nucleotides (for example, one DNA sequence pattern of size 2 is AT; it represents areas in the genome sequence where an adenine nucleotide is followed by a thymine). A DNA sequence pattern score for a specific region of the genome represents the number of occurrences of this pattern in the genome sequence of this area.

Thus, to some extent we can represent the genome sequence using DNA sequence patterns. Hence, in order to understand the overall dependency of the target A on the DNA sequence, we apply the multivariate correlation analysis to the target A and, as properties, all DNA sequence patterns of size two. Therefore, the multivariate correlation analysis allows us to extend the set of biological hypotheses which we can investigate using the EpiGRAPH regression toolkit.

Multivariate correlation analysis - our solution. We propose a two-step algorithm: first, we use machine-learning regression methods to obtain a prediction of the target A based on the observations B. Then our problem reduces to estimating the global and local dependencies between the predicted values of A and A itself. The main challenge of this method is the reduction of the computational effort.

Quantifying the overall dependency

The first problem we need to address is quantifying the global mutual dependency between a target attribute A and a group of attributes B_1, B_2, ..., B_k. For this purpose, we decided to use supervised-learning methods. In a general setting, let X = (X_1, X_2, ..., X_k) be a set of features and Y an output. A supervised-learning method estimates a functional relationship (model) by mapping the inputs X to the output Y. Such a model is created based on a set of tuples (x_i, y_i), where x_i are the values of the inputs and y_i is the observed output value corresponding to them. This set of tuples is called the training set. If the output value of a supervised-learning method is numerical, the method is called regression. We apply regression to our particular setting as follows: the attributes B_1, ..., B_k correspond to the inputs X_1, ..., X_k by taking only the score values, and the target A corresponds to the output Y. Then, we fit a regression model, which estimates the dependencies between the inputs and the output. In what follows, we briefly present three well-established families of regression methods. For more technical details, see [12].

Regression methods. The EpiGRAPH regression toolkit implements three families of supervised-learning regression methods. In the following section we will refer to the input features as X_1, X_2, ..., X_k and to the output as Y. The regression methods included in EpiGRAPH regression are: linear regression with least squares, ridge regression and lasso, and support vector regression.

Linear regression with least squares. Linear regression, discussed in many statistics books, for example [19], attempts to model the relationship between the variables by fitting a linear function to the observed data. This is modeled by

\hat{y} = \hat{\beta}_0 + \sum_i \hat{\beta}_i x_i,

where ŷ is the estimated output value corresponding to the coefficients β̂. Fitting the linear model then reduces to computing the values for β that minimize a fixed loss function L(ŷ, y).

A loss function measures the quality of the approximation of y by ŷ. The most common loss function used for fitting a linear model is the residual sum of squares. The linear model that minimizes

L(\hat{y}, y) := \mathrm{RSS}(\beta) = \sum_i (y_i - \hat{y}_i)^2

is guaranteed by the Gauss-Markov theorem to have the smallest variance among all unbiased linear estimators of the data, under the assumption that the linear model is indeed correct. A biased estimator is one which for some reason over- or under-estimates the quantity that is being estimated. The variance of a random variable Z, var(Z) = E((Z − E(Z))^2), is a measure indicating how far from the expected value the values of Z usually are. Thus, linear regression with least squares produces the optimal unbiased linear model that fits the training set. Furthermore, fitting this model is not computationally expensive, since it is proven in [12] that the optimal values for β are given by

\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T y

and the fitted value at the input x_i is ŷ_i = ŷ(x_i) = x_i^T β̂.

Ridge regression and Lasso. Linear regression with least squares has low bias, but large variance. We can reduce the variance by fitting a biased estimator and thus improve the overall model accuracy. This can be achieved by shrinking the coefficients of the linear model through a penalty imposed on their values. This approach has the further advantage of improving the interpretability of the model, since the coefficients that correspond to features of low importance are shrunk towards zero. Formally, ridge regression [13] is a linear model as defined above, which minimizes the residual sum of squares penalized by the size of the coefficients:

\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \left\{ \sum_i \Big(y_i - \beta_0 - \sum_j \beta_j x_{ij}\Big)^2 + \lambda \sum_j \beta_j^2 \right\}.

Here λ ≥ 0 is a complexity parameter that controls the amount of shrinkage. Lasso [23] is an L_1-norm variant of the ridge regression shrinkage method. The lasso coefficients are chosen so as to minimize the following modified RSS:

\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta} \left\{ \sum_i \Big(y_i - \beta_0 - \sum_j \beta_j x_{ij}\Big)^2 \right\}, \quad \text{subject to} \quad \sum_j |\beta_j| \le t,

where t is again a complexity parameter that controls the amount of shrinkage.

Support Vector Regression. Real-world data dependency models are rarely linear. In the last decade, Support Vector Machines have proven to be a very effective and efficient supervised-learning method that can fit complex models to high-dimensional data.

In the following paragraphs, we briefly introduce the theory of Support Vector Regression, as described in [20]. Support Vector Machines (SVMs) were first developed as a method to classify a set of data points from R^n into two classes by selecting a hyperplane h in some, possibly high-dimensional, space L such that h separates the data points with maximum distance to the closest data point from each class. This hyperplane is called the maximum-margin hyperplane. This idea was extended for regression purposes in [25].

Suppose we are given training data {(x_1, y_1), (x_2, y_2), ..., (x_k, y_k)} ⊂ X × R, where X denotes the space of the input patterns (e.g. X = R^d). In ε-SV regression, the goal is to find a function f(x) for which |f(x_i) − y_i| ≤ ε for all i, and which at the same time is as flat as possible. Suppose f(x) is a linear function of the form f(x) = <β, x> + b, where β ∈ X, b ∈ R and <·,·> denotes the dot product in X. Flatness in this case means that one seeks a small β. One way to ensure this is to minimize the norm, i.e. ||β||^2 = <β, β>. We can formulate this problem as a convex optimization problem:

\text{minimize} \quad \frac{1}{2} \|\beta\|^2
\text{subject to} \quad \begin{cases} y_i - \langle \beta, x_i \rangle - b \le \varepsilon \\ \langle \beta, x_i \rangle + b - y_i \le \varepsilon \end{cases}

Sometimes, though, such a linear function f might not exist, and in these cases we want to allow some data points to lie outside the ε margin. Analogously to the soft margin loss function [1], which was adapted to SVMs by Cortes and Vapnik [9], one can introduce slack variables ξ_i, ξ_i* to cope with otherwise infeasible constraints of the optimization problem. Hence we arrive at the formulation stated in [24]:

\text{minimize} \quad \frac{1}{2} \|\beta\|^2 + C \sum_{i=1}^{l} (\xi_i + \xi_i^*)
\text{subject to} \quad \begin{cases} y_i - \langle \beta, x_i \rangle - b \le \varepsilon + \xi_i \\ \langle \beta, x_i \rangle + b - y_i \le \varepsilon + \xi_i^* \\ \xi_i, \xi_i^* \ge 0 \end{cases}

Solving this optimization problem is explained in detail in [20]. Important observations are that the complete algorithm can be described in terms of dot products between the data points and that, even when evaluating f(x), we do not need to compute β explicitly. These observations allow us to extend the set of possible functions f(x) by using kernel functions corresponding to dot products in some feature space F.

Using kernel functions, SVR can map the data to higher dimensions and identify the optimal function f(x) operating in these higher dimensions without additional computational effort. These possibilities promote Support Vector methods as a very effective and efficient technique for both classification and regression problems.

These regression methods cover a basic set of data models. They can easily be extended with the further advancement of the software.

Obtaining predictions. For each of these regression models, we obtain a prediction A_pr^linear, A_pr^ridge, A_pr^lasso, A_pr^SVM, respectively, for the target A. A loss function value computed for each of these predictions summarizes how well the features determine the target w.r.t. each specific regression model. For this purpose, we need to avoid model overfitting. Overfitting is the property of a regression method to adjust to the training data set ever more closely while losing the ability to generalize to data points that do not appear in the training set. We avoid overfitting the data by using a procedure similar to cross-validation. We divide the set of data points D := {(x_i, y_i) | i = 1, ..., n} into k parts D_1, D_2, ..., D_k, where ∪_{i=1}^{k} D_i = D and D_i ∩ D_j = ∅ for i ≠ j. In order to obtain predictions for the data points in D_i, we train a model M_i on the training set D̄_i := ∪_{j=1, j≠i}^{k} D_j. We use M_i to predict the fitted value ŷ for each data point (x, y) ∈ D_i. In this manner, the predicted target value ŷ_i for the data point (x_i, y_i) is obtained using a model that is not trained on this data point.

Prediction accuracy estimation. The regression models described above provide predictions of the target A. In order to quantify the overall dependency between the target and the features, we estimate the prediction accuracy for each regression model. This indicates how well the regression model approximates the true model of the data. Given a target attribute A = (a_1, a_2, ..., a_n) and a prediction A_pr = (a_1^pr, a_2^pr, ..., a_n^pr), we estimate the dependency between A and A_pr via two different loss (error) functions: the residual sum of squares (RSS) and the Pearson correlation coefficient. The correlation coefficient (3.1.2) estimates the linear dependency between the prediction and the real values. A correlation coefficient of 1 stands for a prediction that mimics exactly the behavior of the real values up to a possible bias, and the resemblance in behavior decreases as the correlation coefficient decreases. An alternative approach for estimating prediction accuracy is the RSS. A perfect prediction has an RSS of 0, and the prediction accuracy decreases as the RSS increases. EpiGRAPH regression provides both the RSS and the correlation coefficient as estimates of the overall dependency between the target attribute and the properties.
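The cross-validation-like procedure and the two accuracy measures described above can be sketched as follows. This is an illustration in plain Python with our own function names and an arbitrary default of ten folds, not toolkit code; it reuses the pearson_correlation function from the earlier sketch, and the train and predict arguments stand for whatever regression method is being evaluated.

import random

def out_of_fold_predictions(data_points, train, predict, k=10):
    # data_points is a list of (features, target) tuples; train(points) returns
    # a fitted model and predict(model, features) returns one predicted value.
    # Every target value is predicted by a model not trained on that point.
    folds = [random.randint(0, k - 1) for point in data_points]
    predictions = [None] * len(data_points)
    for i in range(k):
        training_set = [p for p, f in zip(data_points, folds) if f != i]
        model = train(training_set)
        for idx, (p, f) in enumerate(zip(data_points, folds)):
            if f == i:
                predictions[idx] = predict(model, p[0])
    return predictions

def prediction_accuracy(targets, predictions):
    # Return (RSS, Pearson correlation) between observed and predicted values.
    rss = sum((a - p) ** 2 for a, p in zip(targets, predictions))
    return rss, pearson_correlation(targets, predictions)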

Identifying regions of deviating behavior

In what follows, we focus on identifying regions where the local dependency between the target attribute A and the properties B deviates from the global dependency. We start with the target attribute A and a prediction A_pr, chosen from the set of predictions based on some optimality criterion (best correlation coefficient or best RSS). Given these two attributes, we want to detect the regions in which the local behavior deviates from the global one. However, we already solved this problem in the bivariate analysis case. If we perform the bivariate analysis on A versus A_pr, we quantify the global dependency and identify regions of deviating local behavior.

Multivariate correlation analysis algorithm

See Algorithm 2 for a summary of the multivariate correlation analysis.

Multivariate Prediction

We believe that an important addition to the multivariate correlation analysis is the possibility to predict missing attribute values. This is reasonable in the case when these values can be predicted with significant accuracy. In the following paragraphs we present our approach.

Problem Description. Suppose we have a target attribute A and a group of attributes B = {B_1, B_2, ..., B_k} such that there is a very strong and detectable dependency between A and B. Suppose also that there exists a genome region r for which the properties B have corresponding values (b_1^r, b_2^r, ..., b_k^r), but the attribute A does not have a value mapped to this region. The multivariate prediction problem is to predict the score of A in the region r based on the overall dependency between A and B and the values (b_1^r, b_2^r, ..., b_k^r).

Biological motivation. Prediction of missing attribute values has a number of important applications. One example are attributes that are highly dependent on other, already measured attributes, but that consume considerable time and/or resources when measured in the lab. Furthermore, large-scale genomic projects like ENCODE [21] focus their resources on analyzing only a selection of genomic regions. Multivariate prediction can extrapolate the data from such a limited number of observations to the whole genome. The predicted values for a biologically interesting attribute can then become the subject of new research.

Multivariate prediction - our solution. Our approach to multivariate prediction consists of two main steps. First, using the multivariate correlation analysis approach, we obtain the prediction accuracy measurements (correlation coefficient and RSS) for each of the investigated regression methods. We select the best regression methods w.r.t. correlation coefficient and RSS prediction accuracy and train the corresponding regression models. We use these models to predict the missing values for the target attribute.

Algorithm 2 Multivariate Correlation Analysis
Input: a target attribute A = (a_1, a_2, ..., a_n) and a group of m feature attributes B_1 = (b_1^1, b_2^1, ..., b_n^1), B_2 = (b_1^2, b_2^2, ..., b_n^2), ..., B_m = (b_1^m, b_2^m, ..., b_n^m), a list of regression methods RM_1, RM_2, ..., RM_s, a partition count p, a window size w, an overlap k, a z-score cutoff q
Output: global dependency correlation coefficients c_{RM_1}, c_{RM_2}, ..., c_{RM_s} and RSS scores r_{RM_1}, r_{RM_2}, ..., r_{RM_s}, a list of significant regions R_0 and a list of their corresponding correlation coefficients C_0, identified with the best prediction model w.r.t. correlation coefficient
1: Create a set of data points D = {((b_1^1, b_1^2, ..., b_1^m), a_1), ((b_2^1, b_2^2, ..., b_2^m), a_2), ..., ((b_n^1, b_n^2, ..., b_n^m), a_n)}
2: for j = 1 to s do
3:   Create a partition D_1, D_2, ..., D_p of the set of data points D = {(x_i, y_i) | i = 1, ..., n}, such that ∪_{i=1}^{p} D_i = D and D_i ∩ D_j = ∅ for i ≠ j. The partition is created by assigning every data point from D uniformly at random to one of the parts D_1, D_2, ..., D_p
4:   for i = 1 to p do
5:     Train a model M_j^i on the training set D̄_i := ∪_{t=1, t≠i}^{p} D_t
6:     Use M_j^i to predict ŷ for each data point (x, y) in D_i
7:   end for
8:   Unite these predictions in an attribute A_pr^{RM_j}
9:   Compute the correlation coefficient c_{RM_j} between A and A_pr^{RM_j}
10:  Compute the RSS r_{RM_j} between A and A_pr^{RM_j}
11: end for
12: Choose the best regression model w.r.t. correlation coefficient, RM_bestcorr = arg max_RM c_RM
13: Choose the best regression model w.r.t. RSS, RM_bestrss = arg min_RM r_RM
14: Perform the bivariate analysis algorithm with input (A, A_pr^{RM_bestcorr}, w, k, q) to identify regions of deviating behavior R_0 and their corresponding correlation coefficients C_0

Selection of prediction method

For each of the possible regression methods, we apply a limited version of the multivariate correlation analysis to obtain the prediction accuracy correlation coefficient and residual sum of squares. The multivariate correlation analysis is limited in the sense that it does not perform the bivariate analysis for the detection of local regions. In this manner, we obtain a ranking of the regression methods w.r.t. each prediction accuracy criterion. Subsequently, we select the highest-ranked regression method and train a model M on the complete training set. We use M to obtain predictions for the missing values of the target attribute.

Selection of the features to be used

Each regression algorithm uses a set of inputs on which it bases the prediction of the output value. In the setting of multivariate prediction, EpiGRAPH regression uses the values of the properties B_1, B_2, ..., B_k as inputs to the model. However, region-specific properties, e.g. the chromosome index, the region location and the region size, can also be correlated with the target. EpiGRAPH regression provides the user with the possibility to include any of these region-specific features in the analysis.

Multivariate prediction algorithm

We summarize the multivariate prediction algorithm in Algorithm 3.

Algorithm 3 Multivariate Prediction
Input: a target attribute A = (a_1, a_2, ..., a_n) and a group of m feature attributes B_1 = (b_1^1, b_2^1, ..., b_t^1), B_2 = (b_1^2, b_2^2, ..., b_t^2), ..., B_m = (b_1^m, b_2^m, ..., b_t^m), a list of regression methods RM_1, RM_2, ..., RM_s
Output: global correlation coefficients c_{RM_1}, c_{RM_2}, ..., c_{RM_s} and RSS scores r_{RM_1}, r_{RM_2}, ..., r_{RM_s}, and two lists with predicted values for the attribute A, one based on the best regression model w.r.t. correlation coefficient, (a_{n+1}^corr, a_{n+2}^corr, ..., a_t^corr), and one based on the best regression model w.r.t. RSS score, (a_{n+1}^rss, a_{n+2}^rss, ..., a_t^rss)
1: Apply a restricted version of the multivariate correlation analysis algorithm on the target attribute A and the restricted features B_1 = (b_1^1, b_2^1, ..., b_n^1), B_2 = (b_1^2, b_2^2, ..., b_n^2), ..., B_m = (b_1^m, b_2^m, ..., b_n^m) to obtain the dependency coefficients c_{RM_1}, c_{RM_2}, ..., c_{RM_s} and r_{RM_1}, r_{RM_2}, ..., r_{RM_s}. This restricted version does not invoke the bivariate analysis for the detection of local regions of interest in Step 14 of Algorithm 2.
2: Choose the best regression model w.r.t. correlation coefficient, RM_bestcorr = arg max_RM c_RM
3: Choose the best regression model w.r.t. RSS, RM_bestrss = arg min_RM r_RM
4: Train a model M_bestcorr on the training set D := {(b_i, a_i) | i = 1, ..., n} and obtain predictions for the missing values of the attribute A, (a_{n+1}^corr, a_{n+2}^corr, ..., a_t^corr)
5: Train a model M_bestrss on the training set D := {(b_i, a_i) | i = 1, ..., n} and obtain predictions for the missing values of the attribute A, (a_{n+1}^rss, a_{n+2}^rss, ..., a_t^rss)
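As a rough illustration of Algorithm 3, the sketch below ranks a set of regression methods by their out-of-fold correlation and then uses the best one to predict the scores of regions where the target is missing. It builds on the out_of_fold_predictions and prediction_accuracy helpers from the earlier sketches; the names and the dict-based interface for the regression methods are our own assumptions, and only the correlation criterion is shown (the RSS-based variant works analogously, selecting the minimum RSS).

def multivariate_prediction(known_points, missing_features, methods):
    # known_points: list of (features, target) tuples with measured target values.
    # missing_features: feature vectors of regions without a target value.
    # methods: dict mapping a method name to a (train, predict) pair of callables.
    best_name = None
    best_corr = None
    targets = [y for x, y in known_points]
    for name, (train, predict) in methods.items():
        preds = out_of_fold_predictions(known_points, train, predict)
        rss, corr = prediction_accuracy(targets, preds)
        if best_corr is None or corr > best_corr:
            best_name, best_corr = name, corr
    # Retrain the winning method on the complete training set (Step 4)
    train, predict = methods[best_name]
    model = train(known_points)
    return best_name, [predict(model, x) for x in missing_features]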

3.2 Design and Implementation

In this section, we give a detailed description of the design and implementation of the EpiGRAPH regression software toolkit. We start with a discussion of the solutions chosen to carry out the EpiGRAPH regression implementation, such as the programming language and the additional libraries employed. We continue with in-detail explanations of the two main modules of the software: the handling of the attribute data and the visualization of the results.

Implementation

In the following paragraphs, we discuss the technologies used for the implementation of the EpiGRAPH regression software toolkit. These include the main programming language and the required additional libraries.

Programming Language

The goals of the EpiGRAPH regression software impose a number of requirements that the programming language should meet. We consider the following characteristics important:

Well-established programming language. EpiGRAPH regression is designed to be a large-scale project with many potential users. An implementation based on a well-established programming language facilitates the further extensibility of the project and the reliability of the source code.

Platform independence. The choice of a platform-independent programming language is important for facilitating the development and testing of the EpiGRAPH regression software. This feature increases the possibilities for the further advancement of the software as a web server or stand-alone application.

Efficient libraries for mathematical operations. Extensive mathematical operations are the backbone of the analyses provided by EpiGRAPH regression. Their efficiency is essential for the applicability of the software.

Abstraction for large integers. During the various computations involved in the different analyses, EpiGRAPH regression often encounters large integers, i.e. integers that exceed the limits of the built-in integer types of languages such as C++ or Java. EpiGRAPH regression makes use of an efficient abstraction for large integers to eliminate a large number of possible problems.

Object-oriented data approach. A modular design of a large-scale project facilitates its management and future development.

A programming language that fulfills all these requirements is the well-established, script-based Python programming language [16]. For the EpiGRAPH regression implementation we used the Python 2.4 release.

Additional Libraries

EpiGRAPH regression uses additional modules for functionality that is not provided in the standard Python 2.4 implementation. We use the Numeric Python module for an efficient implementation of the array data structure and mathematical operations on it. We also use a Python-to-R interface module called RPy, which allows data and functionality transfer between Python and the R project for statistical computing [17]. Through RPy we use the standard R libraries for statistical-learning regression methods and for the visualization of statistical data. EpiGRAPH regression uses the R extensions lars for lasso, e1071 for support vector machines, stats for linear regression and MASS for ridge regression. Furthermore, EpiGRAPH regression requires Ghostscript for the graphical visualization of the data.

Attribute management

EpiGRAPH regression accesses and handles large amounts of biological data, which typically result from laboratory measurements. Each genome-wide measurement of a certain biological property is stored in our database and referred to as an attribute. Hence, one of the important implementation considerations in EpiGRAPH regression is the management of these attribute data sets. The management includes storing, representing, accessing and performing operations on attributes. Due to the large quantities of data, the attribute management is the most important factor for the computational efficiency of EpiGRAPH regression. In the following section, we present the EpiGRAPH regression approach to storing, representing, accessing and handling attributes.

Storage

As formally presented in Section 3.1, an attribute is a sequence of tuples (chromosome, chromosome start, chromosome end, score). EpiGRAPH regression offers the possibility of storing attributes as data files or organized in structured database tables.

File attributes. EpiGRAPH regression defines a flexible file format in which attribute data should be stored so that it is accessible through the toolkit. Suppose we have an attribute A = (a_1, a_2, ..., a_n), where a_i = (chrom_i, chromstart_i, chromend_i, score_i) for all i. If the attribute A is stored in a file, it should be in the following file format in order to be accessible through EpiGRAPH regression. Each line in the file encodes exactly one tuple. The elements of each tuple are separated by a fixed separator, for example a semicolon, a tabulator or a comma. The order of the elements within a tuple is not fixed, but it is the same for all tuples. The tuples that represent regions from the same chromosome are given sequentially and sorted according to the chromstart position.

Database attributes. EpiGRAPH regression also uses attributes stored in its database. Each attribute is stored in a separate table, where each row in the table represents one tuple of the attribute. The EpiGRAPH regression implementation uses an Oracle database.

Derived attributes. EpiGRAPH regression can compute and store attributes which can be derived from other, already existing attributes. Formally, let D be the domain of all possible genome regions, P be the domain of all possible genome positions and A_1, A_2, ..., A_k be attributes. We define the functions a_1^position, a_2^position, ..., a_k^position such that

a_j^{position} : P \to \mathbb{R} \cup \{\text{NotAvailable}\}

a_j^{position}(x) :=
\begin{cases}
s & \text{if } \exists\, r \in D \text{ such that } x \in r \text{ and } A_j \text{ contains the tuple } (r, s) \\
\text{NotAvailable} & \text{otherwise}
\end{cases}

We also define the functions a_1^region, a_2^region, ..., a_k^region such that

a_j^{region} : D \to \mathbb{R} \cup \{\text{NotAvailable}\}

a_j^{region}(r) :=
\begin{cases}
\sum_{r_i \in r} a_j^{position}(r_i) & \text{if } a_j^{position}(r_i) \ne \text{NotAvailable for all } r_i \in r \\
\text{NotAvailable} & \text{otherwise}
\end{cases}

An attribute B is called derived if it is defined on a set of regions RG and for every region r ∈ RG the corresponding score of B for the region r is given by the function f : RG → R ∪ {NotAvailable}, where f(r) := g(a_1^region(r), a_2^region(r), ..., a_k^region(r)) and g is known.

EpiGRAPH regression's support for the computation and storage of derived attributes has multiple advantages. One advantage is that users of EpiGRAPH regression can integrate a derived attribute through a simple script encoding the function g. Furthermore, the potentially large quantities of derived attribute values need not be stored, limiting the resource intensity of EpiGRAPH regression calculations.

External attributes. EpiGRAPH regression also provides the possibility to connect to data that is not formatted as an attribute. Thus we allow the extraction of attributes from other sources to be integrated in the EpiGRAPH framework. An example of such an external attribute is the human genome sequence. The human genome sequence is a potential source of attributes (for example genome sequence patterns, repeats, etc.). In order to use such attributes in EpiGRAPH regression, we need to store only the genome sequence and define a function to extract an attribute from it on the fly.
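To give a feeling for what the function g of a derived attribute might look like, the following hypothetical example combines the per-region scores of two existing attributes (assumed here to be counts of G and C nucleotides) into a GC-skew score. It is purely illustrative and not one of the derived attributes shipped with EpiGRAPH regression; None plays the role of the NotAvailable value.

def gc_skew(g_count, c_count):
    # Hypothetical derived-attribute function g: combine the region scores of
    # two existing attributes into one derived score.
    if g_count is None or c_count is None:
        return None  # NotAvailable propagates to the derived attribute
    total = g_count + c_count
    if total == 0:
        return None
    return (g_count - c_count) / float(total)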

Representation

In this section, we explain the representation of attributes in the EpiGRAPH regression toolkit. We chose to represent them via the Extensible Markup Language (XML). EpiGRAPH regression stores a document that contains the references of all the attributes currently available in the EpiGRAPH regression toolkit. In this document, every attribute is represented by a separate XML entry. In what follows, we present the XML format of an attribute reference together with a detailed description of the various attribute coding elements. We use the standard Python 2.4 module xml.dom as a programming interface for XML documents.

General attributes. EpiGRAPH regression accesses the references of the available attributes from an XML document. In this document, every attribute is stored in an XML node named Attribute. Every Attribute node has the child nodes AttributeName, AttributeStorageType, AttributeDataType, AttributeDelayedCalculation and AttributeOverlappingStructure.

AttributeName is a text element that defines the name of the attribute.

AttributeStorageType is an element that describes how this attribute is stored and how to retrieve and manipulate its values. This element has an XML attribute named type with the possible values File, DB and Runtime, thus differentiating between the different storage types of the attribute data. We explain the possible child elements of this node for each specific storage type later in this section.

AttributeDataType is an element that is introduced for compatibility with the EpiGRAPH class [6] tool.

AttributeDelayedCalculation is an element that specifies how the attribute data is to be processed. This text element has two possible values: No and Chromosome. The value No indicates that all values corresponding to this attribute should be read and processed at once. Thus, all the data of the attribute is stored simultaneously in the operating memory of the machine on which EpiGRAPH regression is running. The value Chromosome indicates that the data of the attribute should be processed on a per-chromosome basis. Thus, only the data for a specific chromosome is stored in the operating memory. However, this approach results in additional costs for retrieving the data for each chromosome separately. This XML element allows the user to adjust the tradeoff between the operating memory that the attribute uses and the computational time required for retrieving the attribute data.

AttributeOverlappingStructure is an element that indicates whether the attribute contains overlapping regions. EpiGRAPH regression attributes are restricted so that each position can have only one fixed value.

However, overlapping regions are common for data sets originating from experimental data. EpiGRAPH regression converts an attribute with overlapping regions to an attribute without overlapping regions via averaging. In order to save the processing time for checking whether the attribute needs to be converted, this property of a newly introduced attribute can be stored in its XML reference.

In the following paragraphs, we describe the XML representation of the different storage methods.

Storage-specific XML elements for file attributes. If the attribute reference defines a file attribute (<AttributeStorageType type="File">), then the children of the AttributeStorageType element provide more details on the file representation of the attribute data. The possible child elements are: FileName, FileColumns, FileSeparator and FileSkipFirstLine.

FileName is a text element that stores an absolute or relative (to the filesystem root of the project) name of the file containing the attribute data.

FileSeparator is a text element that stores the character used to separate the different values on one line.

FileSkipFirstLine is a text element with the possible values 0 and 1. It is 1 if the first line in the data file stores the names of the columns or other irrelevant information, and 0 otherwise.

FileColumns is an optional element that contains four text elements FileColumn. The FileColumns element is provided when the order of the values in the data file differs from the default order. The default order is: chromosome start, chromosome end, score, chromosome. If the order is different, then the correct indices of the columns are introduced in the following form:

<FileColumns>
  <FileColumn name="start">0</FileColumn>
  <FileColumn name="end">1</FileColumn>
  <FileColumn name="score">2</FileColumn>
  <FileColumn name="chrom">3</FileColumn>
</FileColumns>

where the XML attribute name of the FileColumn element indicates the type of the value for which this element specifies an index. The value of the FileColumn element is a 0-based integer index.

Storage-specific XML elements for database attributes. If the attribute reference defines a database attribute (<AttributeStorageType type="DB">), then the children of this element provide details on the database representation of the attribute data.

Storage specific XML elements for database attributes. If the attribute reference defines a database attribute (<AttributeStorageType type="DB">), then the children of this element provide details on the database representation of the attribute data. The possible children elements are: TableName, DBColumns and DBQuery.

TableName. Each database attribute is stored in a single database table. The name of the database table is stored in the element TableName.

DBColumns. For retrieving the attribute data, the information about the column names must also be stored. The attribute data is stored in the database table in four columns: start of the chromosome region, end of the chromosome region, chromosome, and the score for this region. Since the names of these columns may vary between attributes, EpiGRAPH regression allows the column names to be specified in the XML description. The AttributeStorageType element has a child DBColumns, which has four children elements DBColumn. Each of them has an XML attribute name specifying which column type it represents. The possible values for the XML attribute name are: start, end, chrom and score. The text value of each of these DBColumn elements is the database table column name for the corresponding column type. An example XML definition of the database column names for an attribute is given below:

<DBColumns>
  <DBColumn name="start">chromstart</DBColumn>
  <DBColumn name="end">chromend</DBColumn>
  <DBColumn name="score">score</DBColumn>
  <DBColumn name="chrom">chrom</DBColumn>
</DBColumns>

DBQuery. Another child element of the AttributeStorageType is the optional element DBQuery. This element is provided when the attribute requires a non-default database query in order to retrieve the data. The default query is

SELECT %(start)s, %(end)s, %(score)s, %(chromosome)s FROM %(table)s ORDER BY %(chromosome)s, %(start)s

where %(x)s is replaced with the column name extracted from the DBColumn element with XML attribute name x. The DBQuery element has an XML attribute type, which defines the type of the query. There are three main types of database queries: Default, Chromosome and Window. A query is of type Default if, when submitted to the database, it returns a list of all tuples (start, end, chromosome, score) corresponding to this attribute. A query is of type Chromosome if, when sent to the database with an additional parameter %(chrom)s, it retrieves only the data corresponding to the requested chromosome chrom. This query type is used for attributes whose data is too large to be processed at once, and it allows the attribute data to be processed chromosome by chromosome. Finally, a query is of type Window if, when submitted to the database for a fixed region, it retrieves a score for this specific region. This query type is used for attributes for which a corresponding score can be defined for each possible genome region.
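To make the template mechanism concrete, the short sketch below fills the default query's placeholders with the column names from the DBColumns example using plain Python string formatting; the table name is a made-up placeholder and no actual database connection is involved.

DEFAULT_QUERY = ("SELECT %(start)s, %(end)s, %(score)s, %(chromosome)s "
                 "FROM %(table)s ORDER BY %(chromosome)s, %(start)s")

# Column names taken from the DBColumns example above; the table name
# "conservation_track" is a hypothetical placeholder.
substitutions = {
    "start": "chromstart",
    "end": "chromend",
    "score": "score",
    "chromosome": "chrom",
    "table": "conservation_track",
}

query = DEFAULT_QUERY % substitutions
print(query)
# SELECT chromstart, chromend, score, chrom FROM conservation_track ORDER BY chrom, chromstart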

Storage specific XML elements for derived attributes. Derived attributes (encoded in the XML with <AttributeStorageType type="Derived">) are not explicitly stored. They are defined through their dependency on other attributes. A derived attribute uses a function (encoded in a Python method) to obtain the score for a specific region from the scores of the other attributes it depends on. The XML reference to a derived attribute is defined using the children elements ModuleName, MethodName and DependentAttributes.

ModuleName and MethodName. The Python method that encodes how the derived attribute depends on the other attributes is specified using a pair of XML elements: ModuleName is a text element specifying the Python module in which the method is implemented, and MethodName is a text element that stores the method name.

DependentAttributes. The XML elements that define the storage of a derived attribute also include the XML elements referring to the attributes it depends on. These attributes are defined in the element DependentAttributes. This element contains a sequence of elements DependentAttribute. Each element DependentAttribute defines a reference to an attribute already defined in EpiGRAPH regression. It contains an XML attribute storage that defines the storage type of the dependent attribute. It also contains an XML element AttributeName that specifies the name under which this attribute is stored in the XML representation. A preprocessing method can also be defined to convert the values for the purpose of easing the derived attribute computation effort; it is defined by a pair of elements ModuleName and MethodName.

The storage parameters for derived attributes also include a StorageCalculation XML element. This element has the possible values 1 and 0 and indicates whether the values of the derived attribute, once computed, should be stored (1) or not (0). An example definition of the element DependentAttributes is given below:

<DependentAttributes>
  <DependentAttribute storage="File">
    <AttributeName>GenomeConservationExternal</AttributeName>
  </DependentAttribute>
  <DependentAttribute storage="DB">
    <AttributeName>Genome</AttributeName>
    <ModuleName>Plugins.Extensions.PatternFreqCalculator</ModuleName>
    <MethodName>preprocessCountPattern</MethodName>
  </DependentAttribute>
</DependentAttributes>
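The sketch below illustrates what such a derived-attribute method might look like: a plain Python function that turns the (possibly preprocessed) values of the dependent attributes for one region into a single derived score. The function names, the signature and the CpG-counting example are hypothetical; they are not the actual EpiGRAPH regression plugin interface, and preprocess_count_pattern merely mimics the role of a preprocessing method such as preprocessCountPattern in the example above.

def preprocess_count_pattern(sequence, pattern="CG"):
    """Hypothetical preprocessing step: count a pattern in the genome
    sequence of a region (mimics the role of preprocessCountPattern)."""
    return sequence.upper().count(pattern)

def derived_score(region, dependent_values):
    """Hypothetical derived-attribute method: combine the values of the
    dependent attributes for one region (chrom, start, end) into one
    score, here a pattern count normalised by the region length."""
    chrom, start, end = region
    pattern_count = dependent_values["Genome"]
    return float(pattern_count) / max(end - start, 1)

# Example call with made-up values:
cpgs = preprocess_count_pattern("ACGCGTTACG" * 100)
print(derived_score(("chr1", 1000, 2000), {"Genome": cpgs}))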

External attributes. External attributes are also defined in an XML element Attribute. External attributes are allowed to be of the storage types File and DB. The storage XML elements for external attributes are the same as those provided for general attributes. The difference in the XML representation of external attributes lies in the children MethodName and ModuleName, which specify the Python method to be used for retrieving the attribute from the corresponding file or database table.

Operating with attributes

One challenge of the EpiGRAPH regression implementation is the efficient processing of the attributes. The bivariate analysis, the multivariate correlation analysis and the multivariate prediction require a set of operations defined on attributes. These operations are invoked very frequently during the analysis and must be implemented efficiently in order to achieve reasonably low running times. We explain these algorithms in the following paragraphs.

Creating an attribute collection. An attribute collection is a data structure that unites the data coming from two or more attributes.

Definition. An attribute collection derived from the attributes A_1, A_2, ..., A_k is a finite sequence of tuples of the form (chromosome, start, end, (score_1, score_2, ..., score_k)). Each such tuple uniquely identifies a genome region through its parameters (chromosome, start, end) and, for every i in [1, k], score_i is the corresponding measurement of attribute A_i for this region.

Creating an attribute collection consists of two main steps. The first step is identifying the genome regions that define this attribute collection. The attribute collection set of regions S is the smallest-cardinality set of regions such that each attribute A_1, A_2, ..., A_k has a single value in each region of S. In this manner, for every i, every region in S is a subregion of a region in A_i. The second step is to associate the attribute scores with each of the identified regions. This algorithm has a variation which allows one attribute to dominate the collection: in this case, the regions selected for the collection are exactly the regions of the dominating attribute. This type of collection is used in the multivariate analysis. Algorithm details are available in Algorithm 4.

Creating an averaged attribute. This procedure decreases the resolution of an attribute via averaging over a number of data points. This procedure is generally known as smoothing, and it helps us decrease the amount of data contained in an attribute while preserving its essential information. In practice, it is often the case that an attribute is too large (up to millions of measured regions) and processing it slows down the analysis. In the cases when such attributes are processed only for the purpose of quantifying a general dependency, EpiGRAPH regression allows the use of smoothing to decrease the complexity.
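A minimal sketch of this averaging step, assuming that the regions are already sorted by position and belong to a single chromosome, is given below; the exact block and boundary handling in EpiGRAPH regression may differ.

def average_attribute(regions, n=5):
    """Decrease the resolution of an attribute by replacing every n
    consecutive regions with one region carrying their mean score.
    regions: list of (chrom, start, end, score) tuples, sorted by
    position and all belonging to the same chromosome."""
    smoothed = []
    for i in range(0, len(regions), n):
        block = regions[i:i + n]
        chrom = block[0][0]
        start = block[0][1]
        end = block[-1][2]
        mean_score = sum(r[3] for r in block) / float(len(block))
        smoothed.append((chrom, start, end, mean_score))
    return smoothed

# Example with made-up 25bp regions:
demo = [("chr1", 25 * i, 25 * (i + 1), float(i % 3)) for i in range(10)]
print(average_attribute(demo, n=5))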

Algorithm 4 Attribute Collection
Input: Attributes A_1 = (a^1_1, a^1_2, ..., a^1_{n_1}), A_2 = (a^2_1, a^2_2, ..., a^2_{n_2}), ..., A_k = (a^k_1, a^k_2, ..., a^k_{n_k}), where for all i, j: a^i_j := (chr^i_j, start^i_j, end^i_j, score^i_j), and DomAttribute, whose value is the index of the dominant attribute for this collection, or 0 if the collection does not have a dominant attribute
Output: An attribute collection Ac = (ac_1, ac_2, ..., ac_m)
1: Ac = ()
2: for chr = chr1 to chrY do
3:   R_chr = ∅
4: end for
5: if DomAttribute == 0 then
6:   for i = 1 to k do
7:     for j = 1 to n_i do
8:       R_{chr^i_j} := R_{chr^i_j} ∪ {start^i_j, end^i_j}
9:     end for
10:  end for
11: else
12:  for j = 1 to n_DomAttribute do
13:    R_{chr^DomAttribute_j} := R_{chr^DomAttribute_j} ∪ {start^DomAttribute_j, end^DomAttribute_j}
14:  end for
15: end if
16: for chr = chr1 to chrY do
17:   Sort the elements of the set R_chr into a sequence S_chr
18: end for
19: for chr = chr1 to chrY do
20:   for position = 1 to |S_chr| - 1 do
21:     scores = ()
22:     for att = 1 to k do
23:       sum = 0
24:       for bp = S_chr[position] to S_chr[position + 1] do
25:         sum = sum + Value(A_att, chr, bp), where Value(A_att, chr, bp) is the value of the region of attribute A_att to which bp belongs
26:       end for
27:       s_att = sum / (S_chr[position + 1] - S_chr[position] + 1)
28:       scores.append(s_att)
29:     end for
30:     a = (chr, S_chr[position], S_chr[position + 1], scores)
31:     Ac.append(a)
32:   end for
33: end for
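A simplified, single-chromosome Python sketch of this construction is given below. It follows the structure of Algorithm 4 but uses a naive per-base-pair score lookup, which is far less efficient than the actual implementation and is meant only to make the two steps (boundary collection and score averaging) explicit.

def collection_boundaries(attributes, dominant=None):
    """Pool and sort the start/end coordinates of all regions (or only
    those of the dominating attribute, if one is given)."""
    if dominant is None:
        source = attributes
    else:
        source = [attributes[dominant]]
    boundaries = set()
    for attribute in source:
        for start, end, score in attribute:
            boundaries.update((start, end))
    return sorted(boundaries)

def value_at(attribute, position):
    """Score of the region of `attribute` containing `position` (0 if none)."""
    for start, end, score in attribute:
        if start <= position <= end:
            return score
    return 0.0

def attribute_collection(attributes, dominant=None):
    """Elementary intervals with one averaged score per attribute;
    attributes are lists of (start, end, score) tuples on one chromosome."""
    S = collection_boundaries(attributes, dominant)
    collection = []
    for left, right in zip(S, S[1:]):
        scores = []
        for attribute in attributes:
            values = [value_at(attribute, bp) for bp in range(left, right + 1)]
            scores.append(sum(values) / float(len(values)))
        collection.append((left, right, scores))
    return collection

A1 = [(0, 9, 1.0), (10, 19, 3.0)]
A2 = [(0, 4, 2.0), (5, 19, 4.0)]
print(attribute_collection([A1, A2]))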

3.2.3 Visualization

The visualization of the results is an important feature of the EpiGRAPH regression toolkit. The different types of analysis that the toolkit performs allow users to mine and screen large quantities of data. It is essential for the usability of the toolkit to provide structured and easy-to-interpret visualization of the different analysis results.

Layered Output

EpiGRAPH regression uses a three-layered visualization structure for the output (see Figure 3.1). On top of the hierarchy, the main analysis result is presented as a one-number summary. The second layer provides a more extended overview of the results for each of the performed experiments, addressing single chromosomes, different window sizes and regression methods (in the multivariate cases). The bottom layer gives access to detailed statistics for every part of the performed analysis.

Figure 3.1: Three layers of visualization

Bivariate Analysis Output

In this section we give a detailed description of the visualization structure of the bivariate analysis results. As described in Section 3.1.1, we perform a bivariate analysis on A and B with the goal of detecting the global dependency between them and identifying regions of deviating local behavior. As a general result of the bivariate analysis, EpiGRAPH regression displays a table which contains a summary of the analysis (see Figure 3.2). It consists of a correlation coefficient representing the global dependency between the attributes and, furthermore, the maximum and the minimum correlation coefficients obtained on chromosome level. Each cell of the summary contains a hyperlink which redirects the user to a heatmap with more detailed results (see Figure 3.3).

Figure 3.2: A summary of the results of the bivariate analysis between attributes A and B. A table cell that is the intersection of the column marked with A and the row marked with B represents information about the bivariate analysis between the attributes A and B. The first row in the cell indicates the global correlation coefficient. The values in the second row of the cell show the maximum and the minimum correlation coefficient among the correlation coefficients estimating the dependency between A and B on the different chromosomes. A table cell is colored according to the value of the global correlation coefficient. The colors range from green, which indicates a negative correlation of -1, to red, which indicates a positive correlation of +1. The displayed results matrix is symmetric to facilitate the analysis of the results in the case of multiple bivariate analyses.

The heatmap visualizes parameter details for the sample distributions of the correlation coefficients computed for regions identified by sliding windows of different sizes over each chromosome. Eventually, the user can investigate in more detail the sample distribution for a specific sliding window size iterated over a specific chromosome. The information for each such experiment is visualized as shown in Figure 3.4.

Multivariate Analysis: correlation

For the multivariate correlation analysis result, EpiGRAPH regression provides a similar three-layer structure. As a summary of the results, we provide coefficients indicating the global dependency between the attribute A and the properties B_1, B_2, ..., B_k w.r.t. each regression method (see Figures 3.5 and 3.6). The second visualization layer consists of detailed information for the analysis that can be inspected w.r.t. a specific regression method (see Figure 3.7), a specific chromosome (see Figure 3.8) and a specific sliding window size (see Figures 3.9 and 3.10).

Multivariate Analysis: prediction

In the case of multivariate prediction, EpiGRAPH regression provides a one-page result. It indicates how accurate the predictions of the different regression methods were w.r.t. the RSS and correlation coefficient error functions, as formally presented in Section 3.1 (see Figure 3.11).
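The green-to-red coloring used in these summary tables and heatmaps can be thought of as a simple linear mapping from the correlation value in [-1, +1] to a color; the sketch below only illustrates this idea and is not the toolkit's actual visualization code.

def correlation_to_rgb(r):
    """Map a correlation coefficient in [-1, +1] to an (R, G, B) triple:
    pure green for -1, pure red for +1 (illustrative only)."""
    r = max(-1.0, min(1.0, r))           # clamp to the valid range
    red = int(round(255 * (r + 1) / 2.0))
    green = 255 - red
    return (red, green, 0)

for value in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print((value, correlation_to_rgb(value)))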

3.2.4 Software Structure

In this section, we discuss the structure of the EpiGRAPH regression software. EpiGRAPH regression uses three main file types: Python files encoding the software; XML files coding the settings; and additional files (R, JavaScript, DTD) supporting various options of the software. A complete listing of the EpiGRAPH regression software implementation version 1.0 is attached as a digital copy in Appendix B.

The complete software structure of EpiGRAPH regression consists of 80 files, organized in 5 main modules: Attribute Handling, Input and Output, Logics, Statistics and Settings.

Attribute Handling is the module responsible for the representation of and operations on attributes. It contains the data structures for attributes and attribute collections together with the implementation of the operations on them. This functionality is encoded in 7 Python files containing 1521 lines of code.

Input and Output is the largest module of the software. The Input submodule is responsible for loading the software settings and the attribute data. The Output submodule codes the storage and the visualization of the results. This functionality is encoded in 13 files and 3633 lines of code.

Logics is the core module of the EpiGRAPH regression implementation. It manages the application workflow. It consists of 7 files and 1754 lines of code.

Statistics is the module that encodes the statistics-specific operations. An essential part of this module is the mathematical support for computing parameters of sample distributions. This module also includes the implementation of the regression methods used in the multivariate analysis. The module contains 13 files and 1133 lines of code.

Settings is the module that includes the settings of EpiGRAPH regression. These files include the initial settings of the software together with the reference files for the attributes that EpiGRAPH regression supports. It contains 19 files and 1701 lines of code.

Figure 3.3: This heatmap displays detailed results for the bivariate analysis between the attributes A and B. Every analyzed chromosome is presented as a column and every tested sliding window size is presented as a row. The cell at the intersection of the column corresponding to chromosome c and the row corresponding to sliding window size w contains a summary of the sample distribution of correlation coefficients extracted from the regions which were identified when a sliding window of size w was slid over chromosome c. The table cell contains three numbers representing this sample distribution: the sample mean, the sample standard deviation and the sample size. The coloring of the cells in the table is according to the sample mean values and ranges from red, which indicates a sample mean correlation coefficient of +1, to green, which indicates a sample mean correlation coefficient of -1. Additional color markers indicate the chromosome with the highest (red) and the lowest (green) global correlation between the two attributes. An informative sliding window size is recommended via blue coloring.

Figure 3.4: Visualization of a sample distribution of correlation coefficients corresponding to regions identified by sliding a window of size w over the data mapped onto chromosome c. In the top-left, the histogram of the sample distribution is displayed. Next to it, sample distribution parameters (such as sample size, sample mean, sample standard deviation, sample skewness and sample kurtosis) are displayed. At the bottom, a chromosome-wide plot of the correlation coefficients is shown. This plot includes the cytogenetic bands for the corresponding chromosome. The z-score cutoffs are indicated on this plot by red horizontal lines for the positive cutoffs and green horizontal lines for the negative cutoffs. The regions that exceed the selected cutoffs are isolated in separate lists displayed in the top-right part. The user is also provided with the possibility to investigate these regions directly in the Genome Browser.
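The z-score cutoffs mentioned in this caption can be illustrated with a short sketch: given the sample of window-wise correlation coefficients for one chromosome, windows whose correlation deviates from the sample mean by more than the chosen number of standard deviations are flagged as candidate regions. This is a simplified restatement of the selection rule, not the toolkit's exact code.

import math

def deviating_windows(correlations, cutoff=2.0):
    """Return (index, correlation, z-score) for windows whose |z-score|
    relative to the chromosome-wide sample exceeds the cutoff."""
    n = len(correlations)
    mean = sum(correlations) / float(n)
    variance = sum((c - mean) ** 2 for c in correlations) / float(n - 1)
    std = math.sqrt(variance)
    flagged = []
    for i, c in enumerate(correlations):
        z = (c - mean) / std
        if abs(z) > cutoff:
            flagged.append((i, c, z))
    return flagged

# Toy example: the fourth window clearly deviates from the rest.
sample = [0.02, -0.05, 0.01, 0.85, 0.03, -0.02, 0.0, -0.04]
print(deviating_windows(sample, cutoff=2.0))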

Figure 3.5: A summary of the multivariate analysis results. The regression methods used in the analysis are displayed as rows of the table. For each regression method, the maximum, the minimum and the average value (w.r.t. correlation coefficient and RSS) among the values obtained for the different chromosomes are displayed.

Figure 3.6: Multivariate correlation results. The regression methods used in this analysis are displayed as columns. The chromosomes on which the analysis was performed are arranged as rows. The cell at the intersection of the column associated with regression method r and the row associated with chromosome c summarizes how well the regression method r fits the dependency between the attributes A and B_1, B_2, ..., B_k for chromosome c. It contains the RSS and correlation coefficient values. Additional color markers indicate for each chromosome the best regression methods w.r.t. correlation coefficient (dark green) and RSS (light green).

Figure 3.7: Multivariate correlation results w.r.t. a specific regression method. This heatmap mimics the extended visualization of the bivariate analysis. It has an additional table that shows the residual sum of squares value for each chromosome using the specific regression method. This table also includes a color indication of the chromosomes with the best (red) and the worst (green) RSS.

Figure 3.8: Multivariate correlation results w.r.t. a specific chromosome. This heatmap visualizes how well the regression methods fit the data for each sliding window size. The upper table is the visualization of the correlation coefficients from the bivariate analyses. It is a heatmap where the cell at the intersection of the column for regression method r and the row for sliding window size w contains three values: the sample mean, sample standard deviation and sample size of the sample distribution corresponding to sliding a window of size w over the data. The color of the cell indicates the value of the sample mean and ranges from green, which indicates a correlation of -1, to red, which indicates a correlation of +1. Additionally, the regression method column cells are colored: the one with the highest correlation coefficient is colored in red and the one with the lowest correlation coefficient is colored in green. The window sizes that provided the most coverage of the data are colored in blue. In the lower table, the residual sums of squares for each regression method are shown. The regression methods with the highest and lowest RSS are colored accordingly.

Figure 3.9: Multivariate correlation results w.r.t. a specific sliding window size. This heatmap summarizes the analyses w.r.t. correlation coefficient that were performed for a specific sliding window size. The cell at the intersection of the column for regression method r and the row for chromosome c contains three parameters (sample mean, sample standard deviation and sample size) of the sample distribution corresponding to sliding a window of size w over the data. The color of the cell indicates the value of the sample mean and ranges from green, which indicates a correlation of -1, to red, which indicates a correlation of +1.

Figure 3.10: Multivariate correlation results w.r.t. a specific sliding window size. This heatmap summarizes the analyses w.r.t. residual sum of squares that were performed for a specific sliding window size. The cell at the intersection of the column for regression method r and the row for chromosome c contains the RSS value. If the cell is colored, this indicates that the regression method r has either the minimum (red) or the maximum (green) RSS for this chromosome.

Figure 3.11: Multivariate prediction. This table summarizes the prediction accuracies of the different regression methods for the different chromosomes. The cell at the intersection of the column corresponding to regression method r and the row corresponding to chromosome c shows the RSS value, the correlation coefficient and the training error. These indicate how accurate we expect the regression method r to be when predicting missing values on chromosome c. Additional color markers indicate for each chromosome the best regression methods w.r.t. correlation coefficient (dark green) and RSS (light green). The hyperlinks provided in the colored cells are pointers to the files containing the predictions.

Chapter 4

Results and Analysis

In this chapter, we describe the analyses that we conducted using the EpiGRAPH regression toolkit, both on simulated and on experimental data. A study on simulated data analyzing the accuracy of the EpiGRAPH regression methodology in various scenarios is presented in Section 4.1. We also analyze the dependency between evolutionary conservation and DNA melting temperature as a test case on real data; the results are listed in Section 4.2. Finally, an elaborate analysis of DNA methylation experimentally determined in 8 different cell lines was conducted to estimate the tissue methylation interdependency and the relationship between DNA methylation and DNA sequence. These results are listed in Section 4.3.

4.1 Simulation Study

In this section we present our validation scenario for the EpiGRAPH regression methodology. For this purpose, we perform an analysis on simulated data and estimate the accuracy of the results. The validation is performed only for the case of the bivariate analysis. As already described in Section 3.1.1, the bivariate analysis detects the mutual dependency between two attributes and identifies regions of deviating correlation. In order to accurately test the methodology implemented in EpiGRAPH regression, we simulate two genome attributes with known overall correlation and artificially introduce regions with known deviating local correlation. Subsequently, we perform the bivariate analysis on those attributes, assess the accuracy of the global correlation estimation, and measure the specificity and sensitivity of the detection of deviating regions.

Generating random vectors with fixed correlation coefficient

The first step in our validation experiment is data simulation. In the specific case of the bivariate analysis, we need to generate two genome-wide attributes with a known correlation coefficient. An algorithm that generates two numeric vectors of fixed size with a specified correlation coefficient was proposed in [11].

We present it in Algorithm 5.

Algorithm 5
Input: length k and correlation coefficient c
Output: Vectors A = (a_1, a_2, ..., a_k) and B = (b_1, b_2, ..., b_k), such that corr(A, B) = c
1: Draw two random vectors A_1 = (a^1_1, a^1_2, ..., a^1_k) and A_2 = (a^2_1, a^2_2, ..., a^2_k) from N(0, 1) independently
2: A := A_1
3: B := A_1 \cdot c + A_2 \cdot \sqrt{1 - c^2}

The general idea of the algorithm is to generate two random vectors with standard normal distribution and, using a predefined transformation on them, to create the desired vectors A and B. In the following paragraph we present a proof that the vectors A and B produced as output of Algorithm 5 are correlated with correlation coefficient c. The proof idea is based on [11].

First, the algorithm generates two random vectors A_1 and A_2 drawn from the normal distribution N(0, 1). Let us denote by X the vector \binom{A_1}{A_2} and by Y the vector to be found, \binom{A}{B}. It follows that E(X) = 0 and

cov(X) = E(X X^t) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.

Suppose that there exists a matrix H such that Y = H X. Then E(Y) = H E(X) = 0 and

cov(Y) = E(Y Y^t) = H E(X X^t) H^t = H H^t.

Let H = \begin{pmatrix} 1 & 0 \\ c & \sqrt{1 - c^2} \end{pmatrix}. Take A to be A_1 and B to be given by A_1 \cdot c + A_2 \cdot \sqrt{1 - c^2}. Now

cov(A, A) = cov(A_1, A_1) = 1

and

cov(B, B) = cov(A_1 c + A_2 \sqrt{1 - c^2}, A_1 c + A_2 \sqrt{1 - c^2}) = 1.

Furthermore,

cov(A, B) = cov(B, A) = cov(A_1, A_1 c + A_2 \sqrt{1 - c^2}) = E(A_1 (A_1 c + A_2 \sqrt{1 - c^2})) - E(A_1) E(A_1 c + A_2 \sqrt{1 - c^2}) = c.

Since Var(A) = Var(A_1) = 1 and Var(B) = Var(A_1 c + A_2 \sqrt{1 - c^2}) = 1, we obtain corr(A, B) = cov(A, B) = c. Hence, we proved that by choosing Y = \begin{pmatrix} 1 & 0 \\ c & \sqrt{1 - c^2} \end{pmatrix} X, we guarantee that corr(A, B) = c and also E(Y) = 0, Var(Y) = 1.
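As a quick sanity check of this construction, a direct NumPy translation of Algorithm 5 is sketched below; the population correlation of the generated pair is exactly c, so the empirical correlation coefficient approaches c as k grows.

# Sketch of Algorithm 5: two random vectors whose population correlation
# is c; the empirical correlation approaches c for large k.
import numpy as np

def correlated_vectors(k, c, seed=None):
    rng = np.random.RandomState(seed)
    a1 = rng.standard_normal(k)
    a2 = rng.standard_normal(k)
    A = a1
    B = a1 * c + a2 * np.sqrt(1.0 - c ** 2)
    return A, B

A, B = correlated_vectors(100000, 0.7, seed=0)
print(np.corrcoef(A, B)[0, 1])   # close to 0.7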

EpiGRAPH regression global dependency estimation

In this first experiment, we simulate two genome-wide attributes A and B with a fixed correlation coefficient c as described in Algorithm 5, for the values c in {-1, -0.9, -0.8, -0.7, -0.6, -0.5, -0.4, -0.3, -0.2, -0.1, 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1}. Subsequently, we use the EpiGRAPH regression bivariate analysis to detect the correlation between these attributes. The estimation of the global correlation between two attributes in EpiGRAPH regression is directly calculated as the correlation coefficient between their score vectors. Hence, our correlation estimator is unbiased, and for each of the simulated correlations EpiGRAPH regression reported exactly the simulated value.

EpiGRAPH regression deviating regions identification

Within the scope of this simulation, we continue with an analysis of the accuracy of EpiGRAPH regression when identifying regions with deviating correlation. For this purpose, we need to simulate two attributes in which both the global and the local correlation dependencies are adjusted. Algorithm 5 produces two random vectors with known global correlation c, but the local correlations might show large fluctuations around the value c, especially if the regions are small. We extend this method to obtain two random vectors where the global correlation as well as the correlation of every large enough region is approximately c. The method that solves this issue is presented in Algorithm 6.

Algorithm 6
Input: integer length k, correlation coefficient c, integer constant p << k, approximation constant q
Output: Vectors A = (a_1, a_2, ..., a_k) and B = (b_1, b_2, ..., b_k), such that corr(A, B) = c and corr((a_i, a_{i+1}, ..., a_j), (b_i, b_{i+1}, ..., b_j)) \approx c for all j > i + q \cdot p
1: A := () and B := ()
2: while size(A) is less than k and size(B) is less than k do
3:   Using Algorithm 5, generate two random vectors X_1 and X_2 of size p and correlation c
4:   Append A with the values of X_1 and append B with the values of X_2
5: end while

The general idea is to choose a small constant p and repeatedly create random vectors of size p with correlation c (using Algorithm 5) and append them to the vectors A and B. We repeat this procedure until the vectors A and B reach the desired size k. For such generated vectors A and B, it holds that E(A) = E(B) = 0 and Var(A) = Var(B) = 1. Furthermore, for every i < j,

corr((a_i, a_{i+1}, ..., a_j), (b_i, b_{i+1}, ..., b_j)) \approx \frac{1}{j - i + 1} \sum_{t=i}^{j} a_t b_t,

and since this sum decomposes into approximately (j - i + 1) / p consecutive blocks of p terms, each block contributing approximately p \cdot c, the whole expression is approximately c.
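The following NumPy sketch mirrors Algorithm 6, reusing the construction of Algorithm 5 block by block; with block length p = 5, both the global correlation and the correlation within any sufficiently long window come out close to the requested c.

import numpy as np

def correlated_block(p, c, rng):
    """One block of length p with population correlation c (Algorithm 5)."""
    a1 = rng.standard_normal(p)
    a2 = rng.standard_normal(p)
    return a1, a1 * c + a2 * np.sqrt(1.0 - c ** 2)

def piecewise_correlated_vectors(k, c, p=5, seed=None):
    """Concatenate blocks of length p until the vectors reach length k."""
    rng = np.random.RandomState(seed)
    A, B = [], []
    while len(A) < k:
        x1, x2 = correlated_block(p, c, rng)
        A.extend(x1)
        B.extend(x2)
    return np.array(A[:k]), np.array(B[:k])

A, B = piecewise_correlated_vectors(250000, 0.5, p=5, seed=1)
print(np.corrcoef(A, B)[0, 1])                        # global: close to 0.5
print(np.corrcoef(A[1000:2000], B[1000:2000])[0, 1])  # local window: also close to 0.5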

For our validation experiments, we fix a window size w for the artificial deviating regions from the values {20, 40, 80, 160, 320} and a local correlation coefficient l from {0.9, 0.7, 0.5, 0.3}. For each possible combination of the region size w and the local correlation coefficient l we conduct a separate experiment, in total 20 experiments. For each experiment, we generate two attributes A and B of size 250,000 with a correlation of 0, using p = 5 and Algorithm 6. We artificially include several regions with a window size of w and a local correlation coefficient of l. The number of artificially introduced regions is chosen so that the area those regions cover is less than 5% (12,500 positions) of the overall attribute size. We perform an EpiGRAPH regression bivariate analysis with general parameters (sliding window sizes of 2w/3 and 4w/3, sliding step overlap of 50% and z-score cutoffs of 1, 2 and 3) on the generated attributes. The regions identified by EpiGRAPH regression were inspected to estimate the sensitivity and specificity of the method, where an artificial region is considered detected if it overlaps with a region identified by EpiGRAPH regression.

In what follows, we present the accuracy results for detecting deviating regions based on the z-score cutoff. In general, the simulation reports very high specificity and sensitivity values. The sensitivity and specificity decrease as the size of the introduced regions and the local correlation in the marked regions decrease.

Specificity/Sensitivity
-/1     0.87/1   0.74/1   0.56/-
-/1     0.98/1   0.9/1    0.7/-
-/1     1/1      0.98/1   0.87/-
-/1     1/1      1/1      0.96/-
-/1     1/1      1/1      1/1

Table 4.1: EpiGRAPH regression accuracy results using a z-score cutoff of 1

Table 4.1 shows classical specificity vs. sensitivity estimations for the regions identified using a z-score cutoff of 1 (for a visualization of these results see Figure 4.1).
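The overlap criterion behind these accuracy estimates can be sketched as follows; treating sensitivity as the fraction of simulated regions hit by at least one reported region, and specificity as the fraction of reported regions that hit a simulated one, is an assumption about how such values can be computed rather than a statement of the thesis's exact definitions.

def overlaps(a, b):
    """Do two closed intervals (start, end) overlap?"""
    return a[0] <= b[1] and b[0] <= a[1]

def evaluate(simulated, reported):
    """Overlap-based sensitivity/specificity estimate (assumed definitions)."""
    detected = [s for s in simulated if any(overlaps(s, r) for r in reported)]
    correct = [r for r in reported if any(overlaps(r, s) for s in simulated)]
    sensitivity = len(detected) / float(len(simulated))
    specificity = len(correct) / float(len(reported))
    return sensitivity, specificity

# Toy example with made-up coordinates:
simulated = [(100, 180), (400, 480), (900, 980)]
reported = [(150, 220), (700, 760), (910, 950)]
print(evaluate(simulated, reported))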

Figure 4.1: A specificity vs. sensitivity plot for the regions identified using a z-score cutoff of 1. The results for each tested local correlation coefficient are marked with different colors. Each point in the plot corresponds to a simulation experiment, where regions with simulated local correlation were introduced among the values of the initial attributes. The EpiGRAPH regression bivariate analysis was performed for each such pair of generated attributes, and the regions identified with a z-score cutoff of 1 were matched against the list of the introduced regions to evaluate the specificity and sensitivity of the toolkit.

These results show that the sensitivity of the toolkit is generally very high for all experiments. The specificity decreases for smaller simulated regions and lower local correlation coefficients. In general, EpiGRAPH regression using a z-score cutoff of 1 retrieves almost all simulated regions. Furthermore, its specificity remains high even for smaller region sizes and correlation coefficients.

Similar simulation results for a z-score cutoff of 2 are presented in Table 4.2 and Figure 4.2. The results show very high overall specificity and sensitivity. As expected, the performance of EpiGRAPH regression improves for larger regions and with an increasing difference between the global and the local correlation.

Specificity/Sensitivity
-/1     0.99/-   -/-      -/-
-/1     1/1      1/-      -/-
-/1     1/1      1/1      0.99/-
-/1     1/1      1/1      1/-
-/1     1/1      1/1      1/1

Table 4.2: EpiGRAPH regression accuracy results using a z-score cutoff of 2

Figure 4.2: A specificity vs. sensitivity plot for the regions identified using a z-score cutoff of 2. The results for each tested local correlation coefficient are marked with different colors. Each point in the plot corresponds to a simulation experiment, where regions with simulated local correlation were introduced among the values of the initial attributes. The EpiGRAPH regression bivariate analysis was performed for each such pair of generated attributes, and the regions identified with a z-score cutoff of 2 were matched against the list of the introduced regions to evaluate the specificity and sensitivity of the toolkit.

The simulation results when EpiGRAPH regression uses a z-score cutoff of 3 are presented in Figure 4.3 and Table 4.3. The results show a very high specificity throughout the experiments, with varying sensitivity. The sensitivity is low for small regions with low correlation coefficients and increases with the size of the regions and the local correlation. For a z-score cutoff of 3, EpiGRAPH regression reports introduced regions almost exclusively, but only a fraction of them.

Specificity/Sensitivity
-/0.72   1/-      -/-      -/-
-/1      1/0.73   1/-      -/-
-/1      1/0.93   1/0.7    1/-
-/1      1/0.99   1/0.88   1/-
-/1      1/1      1/0.97   1/0.78

Table 4.3: EpiGRAPH regression accuracy results using a z-score cutoff of 3

Figure 4.3: A specificity vs. sensitivity plot for the regions identified using a z-score cutoff of 3. The results for each tested local correlation coefficient are marked with different colors. Each point in the plot corresponds to a simulation experiment, where regions with simulated local correlation were introduced among the values of the initial attributes. The EpiGRAPH regression bivariate analysis was performed for each such pair of generated attributes, and the regions identified with a z-score cutoff of 3 were matched against the list of the introduced regions to evaluate the specificity and sensitivity of the toolkit.

In conclusion, our simulation study validates the methodology used by the EpiGRAPH regression bivariate analysis. The simulation reports very high specificity and sensitivity for identifying regions with deviating local correlation and confirms that EpiGRAPH regression correctly detects the global dependencies. The good tradeoff between sensitivity and specificity when using z-score cutoffs of 1, 2 and 3 confirms their choice as default parameters for the application and allows the user to tune the identification analysis.

4.2 Evolutionary Conservation vs. Melting Temperature

Another experiment was conducted as part of the cooperation with Prof. Hovig (Radium Hospital, Oslo). We performed a bivariate analysis on data concerning evolutionary conservation (taken from the UCSC Genome Browser track) and DNA melting temperature (provided by Prof. Hovig).

Analysis attributes

The attributes used in this analysis are the evolutionary conservation and the DNA melting temperature. In the following paragraphs, we give a brief introduction to these biological properties.

Evolutionary conservation. During the management and copying of genetic information, random accidents and errors occur, altering the nucleotide sequence. In the course of evolution, some of these changes are preserved. Clearly, changes in some parts of the genome are tolerated more easily than changes in others. A segment of DNA that does not code for protein and has no significant regulatory role is free to change. In contrast, a gene that codes for a highly optimized essential protein cannot change that easily. The evolutionary conservation attribute represents a numeric measurement of how strongly genome regions are conserved across 17 different vertebrates, including mammalian, amphibian, bird, and fish species. The evolutionary tree of these 17 vertebrates is presented in Figure 4.4.

Figure 4.4: Evolutionary tree of the 17 different vertebrates used for the evolutionary conservation attribute

DNA melting temperature. The DNA melting temperature of a genomic region is the temperature at which the DNA helix structure denatures. The melting temperature is highly dependent on the number of hydrogen bonds forming the base pairs in this region. Hence, the melting temperature is directly related to the content of cytosines and guanines in the genome region. Regions rich in cytosine and guanine have a higher melting temperature, and regions with high adenine and thymine content have a lower melting temperature.
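As a toy illustration of the link between base composition and melting temperature, the snippet below computes the GC content of a sequence and a rough melting-temperature estimate using the Wallace rule (Tm roughly 2*(A+T) + 4*(G+C) degrees Celsius). The Wallace rule is only meant for short oligonucleotides and is not the thermodynamic model behind the genome-wide melting-temperature data analyzed here.

def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / float(len(seq))

def wallace_tm(seq):
    """Crude melting-temperature estimate (Wallace rule, short oligos only;
    NOT the model used for the genome-wide melting-temperature attribute)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc   # degrees Celsius, rough estimate

for probe in ("ATATATATATATATATATAT", "GCGCGCGCGCGCGCGCGCGC"):
    print((probe, gc_content(probe), wallace_tm(probe)))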

Bivariate analysis results

The purpose of this analysis is to detect the overall dependency between evolutionary conservation and DNA melting temperature and to identify potentially interesting genome regions that exhibit low melting temperature (i.e. are prone to random mutations) but high evolutionary conservation, pointing towards strong selective pressure.

Figure 4.5: Bivariate Analysis on evolutionary conservation and DNA melting temperature.

The bivariate analysis results show a very low correlation (0.09) between these two attributes (see Figure 4.5). The correlation coefficients on chromosome level vary, with a maximum of 0.31.

Figure 4.6: Detailed results for the bivariate analysis of evolutionary conservation and DNA melting temperature, presented as a heatmap.

Figure 4.7: Histogram visualization of the distribution of the correlation coefficients for chromosomes 10 and 19

The bivariate analysis performed on these attributes was computed in the reasonable time of 18 minutes. The detailed results presented in Figure 4.6 indicate an overall stable, very low correlation between these attributes. A notable exception is chromosome 19, where we observe a significantly higher correlation.

We expect evolutionary conservation to be high in gene-rich genome regions due to their important functional role. We also expect such regions to have a high DNA melting temperature due to the higher content of cytosines and guanines in those regions (especially in the gene promoter sites). Hence, in gene-rich regions or chromosomes we expect a higher correlation of the attributes. Chromosome 19 has the highest ratio of gene count to chromosome size of all chromosomes (see Table 4.4): it is one of the smallest chromosomes, with a size of 63,811,651 bp, and contains around 1128 known genes. We observe a very strong correlation of 0.8 between the ratio gene count : chromosome size and the dependency of evolutionary conservation and DNA melting temperature.

We continue with a comparison of the distribution of the correlation coefficients for chromosome 19 and for the other chromosomes. We inspected the experiments performed with a sliding window of size 500,000 over chromosome 19 and chromosome 10 (as a typical representative of the rest of the chromosomes). The histograms in Figure 4.7 clearly show that while for chromosome 10 the distribution resembles a normal distribution with mean 0, for chromosome 19 the mean of the distribution is strongly shifted towards positive values. We also present the chromosome-wide visualization of the correlation coefficients (see Figure 4.8). For chromosome 10 the correlation coefficients vary around 0 throughout the chromosome. On the other hand, for chromosome 19 we observe low to medium correlation along the whole chromosome. It only decreases to 0 around the centromere and telomeres, which might be due to the lack of genes in those regions.

Chromosome   #genes   Size   Gene count : size   Dependency
(per-chromosome values for chromosomes 1-22, X and Y)

Table 4.4: Gene count and gene size compared to the correlation between evolution and melting temperature. A very high correlation of 0.8 is observed between the ratio of gene count to chromosome size and the dependency estimated from the bivariate analysis on evolutionary conservation and DNA melting temperature.

The overall results from this analysis confirm our expectations and thus validate the methodology used by EpiGRAPH regression, this time on real-world data. Based on the results from the EpiGRAPH regression bivariate analysis, we observed that the dependency between melting temperature and evolutionary conservation is highly correlated (0.8) with the ratio of gene count per chromosome to chromosome size. As future work, the identified regions of deviating low correlation are to be inspected.

4.3 Tissue-specific DNA Methylation Patterns

We applied the EpiGRAPH regression methodology to a recently published DNA methylation data set from Richard M. Myers' lab at Stanford University [18]. The data set consists of methylation scores for eight different cancer-related cell lines derived from six different tissue types (see Figure 4.9). The data for each cell line represent smoothed scores for experimentally determined regions of unmethylated CpGs in the ENCODE regions. In the following sections we describe the analyses that we performed on this data set together with their results.

The methylation data set

The experimental data are such that higher scores indicate regions that are more strongly methylated. The scores for each cell line are represented as a separate attribute in the EpiGRAPH regression toolkit. The data have a very high resolution: the regions in the attributes have an average size of 25 bp, and every attribute contains scores for approximately 721,000 regions. We calculated an additional average methylation attribute, which for each region is the mean of the values of the eight different cell lines.

Bivariate experiments

We performed a bivariate analysis on every possible pair of cell line methylation attributes. The purpose of the experiment is to detect similarities and differences between the methylation patterns of the different cell lines. The results from these analyses also provide us with a quantification of the overall similarity between the DNA methylation of any two cell lines. Furthermore, each of these analyses identifies regions of significantly higher and lower correlation. Regions of low or no correlation are particularly interesting since they are potentially involved in tissue-specific or cancer-specific processes.

The methylation data set consists of 9 methylation pattern attributes (8 different cell lines and an average attribute). The complete bivariate experiment on this data set results in 36 separate bivariate analyses. The general results from those experiments are listed in Figure 4.10. As expected, we observe moderate overall correlation between the different cell lines. The lowest correlation of 0.25 is between the SNU182 and HT1080 cell lines, and the highest (excluding the analyses with the average attribute), of 0.61, is between U87 and CRL1690.

Due to the very high resolution of the data, we perform smoothing of the attributes to decrease the noise. We smooth the methylation attributes by averaging over 5 and 10 points, as described in Chapter 3. The bivariate analysis results for the attributes averaged over 5 points are displayed in Figure 4.11, and those for averaging over 10 points in Figure 4.12. We observe that the correlation coefficients between the different cell lines vary over a similar range. These two experiments show similar results, which are slightly higher than those from the initial experiment.

This gives us reason to believe that the smoothing procedure reduced the noise component in the experimental results. Therefore, we continue with analyzing the results for the attributes averaged over 5 points. The computational time needed for these analyses was 7 hours for the complete bivariate experiment on the original data set, 5 hours when averaged over 5 data points, and 4 hours when averaged over 10 data points.

From these results we conclude that there is a moderate correlation between the methylation patterns of the different cell lines. The difference in the methylation patterns may partially be due to tissue-specific and/or cancer-specific methylation. Therefore it is important to explore the details of the bivariate analyses for the different pairs of cell line attributes. Detailed results for every analysis are attached to this document in digital format in Appendix B.

As a common observation across those experiments, we note that very high overall correlation coefficients were achieved for chromosome 6 in the different analyses. This leads to the conclusion that the methylation pattern of chromosome 6 does not vary among the different cell lines. Therefore, we presume that chromosome 6 does not contain multiple regions with tissue-specific or cancer-specific methylation. We also observe repeatedly low correlations for chromosomes 4, 8, 12 and 18. These results potentially indicate that these chromosomes contain tissue- or cancer-specific methylated sites. The results from one of the bivariate analyses that support these findings, between the methylation patterns of HCT116 and HT1080, are displayed in Figure 4.13.

In conclusion, we observe medium correlations between the different cell lines, with notably higher correlations for chromosome 6 and relatively lower correlations for chromosomes 4, 8, 12 and 18. The identified regions of low correlation in those experiments are going to be further inspected and compared to current knowledge in this area.
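The pairwise set-up of these experiments can be sketched as follows: the average attribute is the per-region mean over the eight cell lines, and the 36 analyses correspond to all unordered pairs of the nine attributes. The correlation computed below is a plain Pearson coefficient over a common set of regions with made-up toy scores; the full EpiGRAPH regression bivariate analysis of course does considerably more (sliding windows, z-score cutoffs, per-chromosome statistics), and the last three cell-line names are placeholders since only five are named in the text.

import itertools
import numpy as np

rng = np.random.RandomState(0)
names = ["HCT116", "HT1080", "SNU182", "U87", "CRL1690",
         "line6", "line7", "line8"]            # last three names are placeholders
cell_lines = dict((name, rng.rand(1000)) for name in names)

# Average methylation attribute: per-region mean over the eight lines.
cell_lines["Average"] = np.mean([cell_lines[n] for n in names], axis=0)

# All unordered pairs of the nine attributes -> 36 bivariate analyses.
pairs = list(itertools.combinations(sorted(cell_lines), 2))
print(len(pairs))                               # 36
for a, b in pairs[:3]:
    r = np.corrcoef(cell_lines[a], cell_lines[b])[0, 1]
    print((a, b, round(r, 3)))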

Figure 4.8: Chromosome-wide visualization of the distribution of the correlation coefficients for chromosomes 10 and 19

Figure 4.9: The eight different cell lines contained in the methylation data set

Figure 4.10: Results of the bivariate analysis performed on the methylation data set

Figure 4.11: Results of the bivariate analysis performed on the methylation attributes averaged over 5 points

Figure 4.12: Results of the bivariate analysis performed on the methylation attributes averaged over 10 points

Figure 4.13: Bivariate analysis results for HCT116 and HT1080


Breaking Up is Hard to Do (At Least in Eukaryotes) Mitosis

Breaking Up is Hard to Do (At Least in Eukaryotes) Mitosis Breaking Up is Hard to Do (At Least in Eukaryotes) Mitosis Chromosomes Chromosomes were first observed by the German embryologist Walther Fleming in 1882. Chromosome number varies among organisms most

More information

A. Lipids: Water-Insoluble Molecules

A. Lipids: Water-Insoluble Molecules Biological Substances found in Living Tissues Lecture Series 3 Macromolecules: Their Structure and Function A. Lipids: Water-Insoluble Lipids can form large biological molecules, but these aggregations

More information

Data mining with Ensembl Biomart. Stéphanie Le Gras

Data mining with Ensembl Biomart. Stéphanie Le Gras Data mining with Ensembl Biomart Stéphanie Le Gras (slegras@igbmc.fr) Guidelines Genome data Genome browsers Getting access to genomic data: Ensembl/BioMart 2 Genome Sequencing Example: Human genome 2000:

More information

RESPONSE SURFACE MODELING AND OPTIMIZATION TO ELUCIDATE THE DIFFERENTIAL EFFECTS OF DEMOGRAPHIC CHARACTERISTICS ON HIV PREVALENCE IN SOUTH AFRICA

RESPONSE SURFACE MODELING AND OPTIMIZATION TO ELUCIDATE THE DIFFERENTIAL EFFECTS OF DEMOGRAPHIC CHARACTERISTICS ON HIV PREVALENCE IN SOUTH AFRICA RESPONSE SURFACE MODELING AND OPTIMIZATION TO ELUCIDATE THE DIFFERENTIAL EFFECTS OF DEMOGRAPHIC CHARACTERISTICS ON HIV PREVALENCE IN SOUTH AFRICA W. Sibanda 1* and P. Pretorius 2 1 DST/NWU Pre-clinical

More information

DNA, Genes, and Chromosomes. The instructions for life!!!

DNA, Genes, and Chromosomes. The instructions for life!!! DNA, Genes, and Chromosomes The instructions for life!!! Gene Segment of DNA that has the information (the code) for a protein or RNA. A single molecule of DNA has thousands of genes on the molecule. Remember

More information

10-1 MMSE Estimation S. Lall, Stanford

10-1 MMSE Estimation S. Lall, Stanford 0 - MMSE Estimation S. Lall, Stanford 20.02.02.0 0 - MMSE Estimation Estimation given a pdf Minimizing the mean square error The minimum mean square error (MMSE) estimator The MMSE and the mean-variance

More information

Lecture Series 2 Macromolecules: Their Structure and Function

Lecture Series 2 Macromolecules: Their Structure and Function Lecture Series 2 Macromolecules: Their Structure and Function Reading Assignments Read Chapter 4 (Protein structure & Function) Biological Substances found in Living Tissues The big four in terms of macromolecules

More information

2 3 Carbon Compounds. Proteins. Proteins

2 3 Carbon Compounds. Proteins. Proteins 2 3 Carbon Compounds Proteins Proteins Proteins are macromolecules that contain nitrogen, carbon, hydrogen, and oxygen. Proteins are polymers of molecules called amino acids. There are 20 amino acids,

More information

Genetic Variation Junior Science

Genetic Variation Junior Science 2018 Version Genetic Variation Junior Science http://img.publishthis.com/images/bookmarkimages/2015/05/d/5/c/d5cf017fb4f7e46e1c21b874472ea7d1_bookmarkimage_620x480_xlarge_original_1.jpg Sexual Reproduction

More information

For general queries, contact

For general queries, contact Much of the work in Bayesian econometrics has focused on showing the value of Bayesian methods for parametric models (see, for example, Geweke (2005), Koop (2003), Li and Tobias (2011), and Rossi, Allenby,

More information

Accessing and Using ENCODE Data Dr. Peggy J. Farnham

Accessing and Using ENCODE Data Dr. Peggy J. Farnham 1 William M Keck Professor of Biochemistry Keck School of Medicine University of Southern California How many human genes are encoded in our 3x10 9 bp? C. elegans (worm) 959 cells and 1x10 8 bp 20,000

More information

Data mining for Obstructive Sleep Apnea Detection. 18 October 2017 Konstantinos Nikolaidis

Data mining for Obstructive Sleep Apnea Detection. 18 October 2017 Konstantinos Nikolaidis Data mining for Obstructive Sleep Apnea Detection 18 October 2017 Konstantinos Nikolaidis Introduction: What is Obstructive Sleep Apnea? Obstructive Sleep Apnea (OSA) is a relatively common sleep disorder

More information

CHAPTER - 6 STATISTICAL ANALYSIS. This chapter discusses inferential statistics, which use sample data to

CHAPTER - 6 STATISTICAL ANALYSIS. This chapter discusses inferential statistics, which use sample data to CHAPTER - 6 STATISTICAL ANALYSIS 6.1 Introduction This chapter discusses inferential statistics, which use sample data to make decisions or inferences about population. Populations are group of interest

More information

Mammogram Analysis: Tumor Classification

Mammogram Analysis: Tumor Classification Mammogram Analysis: Tumor Classification Term Project Report Geethapriya Raghavan geeragh@mail.utexas.edu EE 381K - Multidimensional Digital Signal Processing Spring 2005 Abstract Breast cancer is the

More information

1.4 - Linear Regression and MS Excel

1.4 - Linear Regression and MS Excel 1.4 - Linear Regression and MS Excel Regression is an analytic technique for determining the relationship between a dependent variable and an independent variable. When the two variables have a linear

More information

T. R. Golub, D. K. Slonim & Others 1999

T. R. Golub, D. K. Slonim & Others 1999 T. R. Golub, D. K. Slonim & Others 1999 Big Picture in 1999 The Need for Cancer Classification Cancer classification very important for advances in cancer treatment. Cancers of Identical grade can have

More information

Numerical Integration of Bivariate Gaussian Distribution

Numerical Integration of Bivariate Gaussian Distribution Numerical Integration of Bivariate Gaussian Distribution S. H. Derakhshan and C. V. Deutsch The bivariate normal distribution arises in many geostatistical applications as most geostatistical techniques

More information

Performance of Median and Least Squares Regression for Slightly Skewed Data

Performance of Median and Least Squares Regression for Slightly Skewed Data World Academy of Science, Engineering and Technology 9 Performance of Median and Least Squares Regression for Slightly Skewed Data Carolina Bancayrin - Baguio Abstract This paper presents the concept of

More information

5.2. Mitosis and Cytokinesis. Chromosomes condense at the start of mitosis.

5.2. Mitosis and Cytokinesis. Chromosomes condense at the start of mitosis. 5.2 Mitosis and Cytokinesis VOCABULARY chromosome histone chromatin chromatid centromere prophase metaphase anaphase telophase Biochemistry As you will learn in the chapter From DNA to Proteins, a nucleotide

More information

Mammogram Analysis: Tumor Classification

Mammogram Analysis: Tumor Classification Mammogram Analysis: Tumor Classification Literature Survey Report Geethapriya Raghavan geeragh@mail.utexas.edu EE 381K - Multidimensional Digital Signal Processing Spring 2005 Abstract Breast cancer is

More information

SNPrints: Defining SNP signatures for prediction of onset in complex diseases

SNPrints: Defining SNP signatures for prediction of onset in complex diseases SNPrints: Defining SNP signatures for prediction of onset in complex diseases Linda Liu, Biomedical Informatics, Stanford University Daniel Newburger, Biomedical Informatics, Stanford University Grace

More information

Problem Set 5 KEY

Problem Set 5 KEY 2006 7.012 Problem Set 5 KEY ** Due before 5 PM on THURSDAY, November 9, 2006. ** Turn answers in to the box outside of 68-120. PLEASE WRITE YOUR ANSWERS ON THIS PRINTOUT. 1. You are studying the development

More information

Assignment 5: Integrative epigenomics analysis

Assignment 5: Integrative epigenomics analysis Assignment 5: Integrative epigenomics analysis Due date: Friday, 2/24 10am. Note: no late assignments will be accepted. Introduction CpG islands (CGIs) are important regulatory regions in the genome. What

More information

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES 24 MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES In the previous chapter, simple linear regression was used when you have one independent variable and one dependent variable. This chapter

More information

the nature and importance of biomacromolecules in the chemistry of the cell: synthesis of biomacromolecules through the condensation reaction lipids

the nature and importance of biomacromolecules in the chemistry of the cell: synthesis of biomacromolecules through the condensation reaction lipids the nature and importance of biomacromolecules in the chemistry of the cell: synthesis of biomacromolecules through the condensation reaction lipids and their sub-units; the role of lipids in the plasma

More information

Answers to end of chapter questions

Answers to end of chapter questions Answers to end of chapter questions Chapter 1 What are the three most important characteristics of QCA as a method of data analysis? QCA is (1) systematic, (2) flexible, and (3) it reduces data. What are

More information

Lesson 4A Chromosome, DNA & Gene

Lesson 4A Chromosome, DNA & Gene Lesson 4A Chromosome, DNA & Gene Chromosome, Gene and DNA Chromosome: A thread-like structure made mostly of DNA Found in the nucleus Chromosome, Gene and DNA DNA: Deoxyribonucleic acid Materials found

More information

Molecular building blocks

Molecular building blocks 2.22 Cell Construction Elemental l composition of ftypical lbacterial cell C 50%, O 20%, N 14%, H 8%, P 3%, S 1%, and others (K +, Na +, Ca 2+, Mg 2+, Cl -, vitamin) Molecular building blocks Lipids Carbohydrates

More information

4. Model evaluation & selection

4. Model evaluation & selection Foundations of Machine Learning CentraleSupélec Fall 2017 4. Model evaluation & selection Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr

More information

List of Figures. List of Tables. Preface to the Second Edition. Preface to the First Edition

List of Figures. List of Tables. Preface to the Second Edition. Preface to the First Edition List of Figures List of Tables Preface to the Second Edition Preface to the First Edition xv xxv xxix xxxi 1 What Is R? 1 1.1 Introduction to R................................ 1 1.2 Downloading and Installing

More information

An Introduction to Bayesian Statistics

An Introduction to Bayesian Statistics An Introduction to Bayesian Statistics Robert Weiss Department of Biostatistics UCLA Fielding School of Public Health robweiss@ucla.edu Sept 2015 Robert Weiss (UCLA) An Introduction to Bayesian Statistics

More information

CHAPTER 6 HUMAN BEHAVIOR UNDERSTANDING MODEL

CHAPTER 6 HUMAN BEHAVIOR UNDERSTANDING MODEL 127 CHAPTER 6 HUMAN BEHAVIOR UNDERSTANDING MODEL 6.1 INTRODUCTION Analyzing the human behavior in video sequences is an active field of research for the past few years. The vital applications of this field

More information

SUPPLEMENTARY INFORMATION. Table 1 Patient characteristics Preoperative. language testing

SUPPLEMENTARY INFORMATION. Table 1 Patient characteristics Preoperative. language testing Categorical Speech Representation in the Human Superior Temporal Gyrus Edward F. Chang, Jochem W. Rieger, Keith D. Johnson, Mitchel S. Berger, Nicholas M. Barbaro, Robert T. Knight SUPPLEMENTARY INFORMATION

More information

Bayesian graphical models for combining multiple data sources, with applications in environmental epidemiology

Bayesian graphical models for combining multiple data sources, with applications in environmental epidemiology Bayesian graphical models for combining multiple data sources, with applications in environmental epidemiology Sylvia Richardson 1 sylvia.richardson@imperial.co.uk Joint work with: Alexina Mason 1, Lawrence

More information

Study of cigarette sales in the United States Ge Cheng1, a,

Study of cigarette sales in the United States Ge Cheng1, a, 2nd International Conference on Economics, Management Engineering and Education Technology (ICEMEET 2016) 1Department Study of cigarette sales in the United States Ge Cheng1, a, of pure mathematics and

More information

Unit 5 Part B Cell Growth, Division and Reproduction

Unit 5 Part B Cell Growth, Division and Reproduction Unit 5 Part B Cell Growth, Division and Reproduction Cell Size Are whale cells the same size as sea stars cells? Yes! Cell Size Limitations Cells that are too big will have difficulty diffusing materials

More information

5.2. Mitosis and Cytokinesis. Chromosomes condense at the start of mitosis. Connecting

5.2. Mitosis and Cytokinesis. Chromosomes condense at the start of mitosis. Connecting 5.2 Mitosis and Cytokinesis KEY CONCEPT Cells divide during mitosis and cytokinesis. MAIN IDEAS Chromosomes condense at the start of mitosis. Mitosis and cytokinesis produce two genetically identical daughter

More information

Survey on Breast Cancer Analysis using Machine Learning Techniques

Survey on Breast Cancer Analysis using Machine Learning Techniques Survey on Breast Cancer Analysis using Machine Learning Techniques Prof Tejal Upadhyay 1, Arpita Shah 2 1 Assistant Professor, Information Technology Department, 2 M.Tech, Computer Science and Engineering,

More information

Review Questions in Introductory Knowledge... 37

Review Questions in Introductory Knowledge... 37 Table of Contents Preface..... 17 About the Authors... 19 How This Book is Organized... 20 Who Should Buy This Book?... 20 Where to Find Answers to Review Questions and Exercises... 20 How to Report Errata...

More information

Underweight Children in Ghana: Evidence of Policy Effects. Samuel Kobina Annim

Underweight Children in Ghana: Evidence of Policy Effects. Samuel Kobina Annim Underweight Children in Ghana: Evidence of Policy Effects Samuel Kobina Annim Correspondence: Economics Discipline Area School of Social Sciences University of Manchester Oxford Road, M13 9PL Manchester,

More information

Introduction to Genetics

Introduction to Genetics Introduction to Genetics Table of contents Chromosome DNA Protein synthesis Mutation Genetic disorder Relationship between genes and cancer Genetic testing Technical concern 2 All living organisms consist

More information

Boosted PRIM with Application to Searching for Oncogenic Pathway of Lung Cancer

Boosted PRIM with Application to Searching for Oncogenic Pathway of Lung Cancer Boosted PRIM with Application to Searching for Oncogenic Pathway of Lung Cancer Pei Wang Department of Statistics Stanford University Stanford, CA 94305 wp57@stanford.edu Young Kim, Jonathan Pollack Department

More information

Chapter 3 Software Packages to Install How to Set Up Python Eclipse How to Set Up Eclipse... 42

Chapter 3 Software Packages to Install How to Set Up Python Eclipse How to Set Up Eclipse... 42 Table of Contents Preface..... 21 About the Authors... 23 Acknowledgments... 24 How This Book is Organized... 24 Who Should Buy This Book?... 24 Where to Find Answers to Review Questions and Exercises...

More information

VARIABLE SELECTION WHEN CONFRONTED WITH MISSING DATA

VARIABLE SELECTION WHEN CONFRONTED WITH MISSING DATA VARIABLE SELECTION WHEN CONFRONTED WITH MISSING DATA by Melissa L. Ziegler B.S. Mathematics, Elizabethtown College, 2000 M.A. Statistics, University of Pittsburgh, 2002 Submitted to the Graduate Faculty

More information

Computational Analysis of UHT Sequences Histone modifications, CAGE, RNA-Seq

Computational Analysis of UHT Sequences Histone modifications, CAGE, RNA-Seq Computational Analysis of UHT Sequences Histone modifications, CAGE, RNA-Seq Philipp Bucher Wednesday January 21, 2009 SIB graduate school course EPFL, Lausanne ChIP-seq against histone variants: Biological

More information

Chapter 8 Guiding Questions

Chapter 8 Guiding Questions Chapter 8 Guiding Questions If there is something that you don t know how to answer, please seek help so you understand and master each concept! 1. What does the cell cycle include? 2. What are the phases

More information

Technical Specifications

Technical Specifications Technical Specifications In order to provide summary information across a set of exercises, all tests must employ some form of scoring models. The most familiar of these scoring models is the one typically

More information

CS229 Final Project Report. Predicting Epitopes for MHC Molecules

CS229 Final Project Report. Predicting Epitopes for MHC Molecules CS229 Final Project Report Predicting Epitopes for MHC Molecules Xueheng Zhao, Shanshan Tuo Biomedical informatics program Stanford University Abstract Major Histocompatibility Complex (MHC) plays a key

More information

1 Simple and Multiple Linear Regression Assumptions

1 Simple and Multiple Linear Regression Assumptions 1 Simple and Multiple Linear Regression Assumptions The assumptions for simple are in fact special cases of the assumptions for multiple: Check: 1. What is external validity? Which assumption is critical

More information

Chapter 1 Heredity. Prepared by: GOAD s Team

Chapter 1 Heredity. Prepared by: GOAD s Team Chapter 1 Heredity Prepared by: GOAD s Team IMPORTANT VOCABULARY WORDS Traits Character Genes Allele Genotype homozygote heterozygote Dominant recessive phenotype WHAT IS HEREDITY? HEREDITY - is a passing

More information

Carbon. Isomers. The Chemical Building Blocks of Life

Carbon. Isomers. The Chemical Building Blocks of Life The Chemical Building Blocks of Life Carbon Chapter 3 Framework of biological molecules consists primarily of carbon bonded to Carbon O, N, S, P or H Can form up to 4 covalent bonds Hydrocarbons molecule

More information

Statistical Methods and Reasoning for the Clinical Sciences

Statistical Methods and Reasoning for the Clinical Sciences Statistical Methods and Reasoning for the Clinical Sciences Evidence-Based Practice Eiki B. Satake, PhD Contents Preface Introduction to Evidence-Based Statistics: Philosophical Foundation and Preliminaries

More information

A Brief Introduction to Bayesian Statistics

A Brief Introduction to Bayesian Statistics A Brief Introduction to Statistics David Kaplan Department of Educational Psychology Methods for Social Policy Research and, Washington, DC 2017 1 / 37 The Reverend Thomas Bayes, 1701 1761 2 / 37 Pierre-Simon

More information

Molecular Graphics Perspective of Protein Structure and Function

Molecular Graphics Perspective of Protein Structure and Function Molecular Graphics Perspective of Protein Structure and Function VMD Highlights > 20,000 registered Users Platforms: Unix (16 builds) Windows MacOS X Display of large biomolecules and simulation trajectories

More information

BIOL 458 BIOMETRY Lab 7 Multi-Factor ANOVA

BIOL 458 BIOMETRY Lab 7 Multi-Factor ANOVA BIOL 458 BIOMETRY Lab 7 Multi-Factor ANOVA PART 1: Introduction to Factorial ANOVA ingle factor or One - Way Analysis of Variance can be used to test the null hypothesis that k or more treatment or group

More information

Title: A robustness study of parametric and non-parametric tests in Model-Based Multifactor Dimensionality Reduction for epistasis detection

Title: A robustness study of parametric and non-parametric tests in Model-Based Multifactor Dimensionality Reduction for epistasis detection Author's response to reviews Title: A robustness study of parametric and non-parametric tests in Model-Based Multifactor Dimensionality Reduction for epistasis detection Authors: Jestinah M Mahachie John

More information

Machine Learning! Robert Stengel! Robotics and Intelligent Systems MAE 345,! Princeton University, 2017

Machine Learning! Robert Stengel! Robotics and Intelligent Systems MAE 345,! Princeton University, 2017 Machine Learning! Robert Stengel! Robotics and Intelligent Systems MAE 345,! Princeton University, 2017 A.K.A. Artificial Intelligence Unsupervised learning! Cluster analysis Patterns, Clumps, and Joining

More information