Systematic Analysis for Identification of Genes Impacting Cancers


Arpita Singhal
Stanford University
Saint Francis High School

ABSTRACT

Currently, vast amounts of molecular information involving genomic characterizations exist for various types of cancers. However, the integration of the various forms of biological data, necessary for a better understanding of the key processes underlying cancer, remains challenging. This project uses microarray-based comparative genomic hybridization (aCGH) data to study genomic alterations in various tumor samples, using statistical procedures in R. To find the hidden copy number states of each chromosome and thereby characterize genomic alterations, this project applies Hidden Markov Models to datasets from cancer patients. The efficacy of the homogeneous and heterogeneous Hidden Markov Models is evaluated against the known truth of simulated data by looking at the true positive rates and false discovery rates for breakpoint detection. This project mainly determines the number and types of copy number variations present in the chromosomes of the tumor datasets, obtained from The Cancer Genome Atlas portal. Recurrent chromosomal aberrations at particular genome locations may indicate the presence of tumor suppressor genes or oncogenes. After recognizing the chromosomes with high copy number changes, the genes associated with these copy number variations are identified. The association between chromosomal location and cancer phenotype provides a more reliable and informative cancer genome characterization that can lead to useful insights into cancer biology for further disease classification, prognosis, and personalized treatment.

INTRODUCTION

A central issue in cancer biology is the identification of specific chromosomal regions that are involved in cancer progression and other biological processes.
Unbalanced chromosomal abnormalities, which result in gains and losses of chromosomal segments, often cause human genetic disorders, including cancer. Driven by an accumulation of genetic and epigenetic changes, tumors exhibit altered levels of gene expression and the disruption of normal cell growth and survival. A variety of cancers exhibit gains in proto-oncogenes and losses in tumor suppressor genes; thus, growth-limiting functions and self-repair processes of cancerous regions are often seriously harmed. The genomic alterations observed in tumors reflect underlying failures in the maintenance of genetic stability.

Copy Number Variations (CNVs)

Copy Number Variations collectively describe deletions, insertions, duplications, and other complex variants present in the human genome. Redon et al. (2006) defined a CNV as a DNA segment of one kilobase or larger that is present at a variable copy number in comparison to a reference genome. A CNV can be simple in structure, such as a duplication, or it may involve complex gains or losses of homologous sequences at multiple sites in the genome. Chromosomal copy numbers are defined to be 2 for normal cells, 1 or 0 for single and double deletions, and 3 or higher for single copy gains or higher level amplifications. Figure 1 shows the various forms of chromosome changes. Cancer progression is usually a result of copy number variations, which may represent the over-expression of proto-oncogenes or down-regulation of tumor suppressor genes in cancer genomes. Structural variations, such as CNVs, influence the expression of different phenotypic traits and are found to impact various diseases and affect the development of tumors. DNA copy number variations are used in cancer research to search for novel genes involved in cancers through the analysis of genes located in specific regions.
Therefore, it is of considerable importance to identify as precisely as possible the chromosomal regions with abnormal copy numbers.

Array CGH

Through the use of microarray-based comparative genomic hybridization, the genomic regions with altered copy numbers can be identified. This technique characterizes the relationship between target sequences on an unknown test genome and a reference genome. Array CGH has been developed to identify CNVs within cancerous regions. As an indispensable tool for understanding disease mechanisms, aCGH detects and maps changes in the copy number of DNA sequences and can be used to analyze tumor genomes and chromosomal aberrations. The log-ratio values obtained from the aCGH data are used as the emissions in the Hidden Markov Model in order to find the hidden states representing the copy numbers of the chromosomes.

This technique uses a test DNA sample, such as tumor genomic DNA, and a reference DNA sample, such as normal genomic DNA, each labeled with a different fluorescent dye. The DNA samples are then combined with unlabeled Cot-1 DNA, a reagent used to block repetitive DNA sequences and prevent nonspecific hybridization. The two samples are hybridized together onto a microarray, and a microarray scanner is used to measure the fluorescent signals and capture digital images. The fluorescence intensity signals from labeled DNA, hybridized on target probes, are processed and normalized. The difference between the intensity signals of each probe from the test and reference genomes is expressed as a log ratio and can be analyzed to detect genomic alterations and aberrations. In the ideal case, the log ratio is equal to 0, demonstrating that no copy number change has occurred in that region of the genome; a higher or lower log ratio implies a change in copy number. The calculation of the log ratios determines the copy number variation. The log ratio changes with the test intensity, while the reference copy number stays constant at 2, representing the two copies present in the normal sample.
When the tumor sample has no copy of the particular region identified on the chromosome, the value log2(0/2) tends to negative infinity, indicating that the region of the chromosome has experienced a homozygous deletion. The value log2(1/2) is observed when the copy number is equal to 1; since log2(1/2) is equal to -1, a heterozygous deletion has occurred. When the tumor copy number is equal to 3, the log2 ratio of (3/2) is approximately 0.585, implying that a heterozygous duplication has taken place. Lastly, when the tumor copy number is equal to 4, the log2 ratio of (4/2) is equal to 1 and implies that a homozygous duplication has occurred. The array CGH data is further analyzed with appropriate statistical methods. A log ratio greater than 0 represents a higher number of target sequences in the test genome when compared to the reference genome; conversely, a log ratio less than 0 indicates a lower number of target sequences in the test genome. However, the complexity of eukaryotic genomes often causes the total signal of a microarray hybridization to be diluted, which makes aCGH data noisy and unreliable for determining the exact copy number of a region. Thus, methods that can accurately use aCGH data must be implemented. The analysis of aCGH data can help determine the location of DNA copy number aberrations within the tumor genome for improved cancer diagnosis, drug development, and molecular therapy. A representation of microarray-based comparative genomic hybridization is shown in Figure 2. With more array CGH data sets emerging, more efficient algorithms that detect regions of gains and losses are necessary to provide an accurate estimate of error for the detection.
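The log2 ratio values quoted for copy numbers 0 through 4 can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not part of the original R analysis; the function name log2_ratio and the constant REFERENCE_COPIES are illustrative choices. Note that log2(0/2) diverges to negative infinity, which the sketch returns explicitly.

```python
import math

# Assumption for illustration: the reference sample is diploid (2 copies),
# as stated in the text.
REFERENCE_COPIES = 2


def log2_ratio(tumor_copies, reference_copies=REFERENCE_COPIES):
    """Return log2(tumor / reference); zero tumor copies gives -infinity."""
    if tumor_copies == 0:
        # log2(0) is undefined; the limit from above is negative infinity.
        return float("-inf")
    return math.log2(tumor_copies / reference_copies)


# Idealized, noise-free log ratios for copy numbers 0 through 4.
for copies in range(5):
    print(f"copy number {copies}: log2 ratio = {log2_ratio(copies):.3f}")
```

Real aCGH log ratios are noisy and rarely sit exactly on these values, which is why a probabilistic model is used to assign copy number states.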
The research conducted for this study uses an algorithm to categorize the chromosomes based on the types of copy number aberrations to accurately identify genes relevant to tumor progression. The objectives of this project are to (1) analyze cancer genomic data in order to predict the hidden copy number states for each chromosomal region and (2) use the hidden copy number states of each region to accurately identify proto-oncogenes and tumor suppressor genes.

General approach used in this project

The approach used in this project can be divided into the following five steps, which are discussed in detail later.

1. Upload data from Data Portal
2. Normalization of Data
3. Segmentation of data
4. Applying Hidden Markov model
5. Results
   a. Comparing the Efficacy of the Hidden Markov Models for True Positive Rate (TPR) and False Discovery Rate (FDR)
   b. Detection of Genes through Analysis of Gains and Losses

Previous Approaches for Array CGH Data Analysis

With more array CGH data sets emerging, more efficient algorithms that detect regions of gains or losses and provide an accurate estimate of error for the detection are necessary. Previously, researchers have devised means for analyzing the array CGH data sets. Wang et al. (2004) used the method of Clustering Along Chromosomes to detect the signal regions by depicting the spatial structure within genomic alterations. Olshen et al. (2004) utilized circular binary segmentation to segment a chromosome into contiguous regions, using a parametric model of the data together with a permutation reference distribution. However, these methods do not take into account the various biological covariates, including the distance between clones, that impact segmentation of the array CGH data. The research conducted for this study uses an algorithm to categorize the chromosomes based on the types of copy number aberrations to accurately identify genes relevant to tumor progression.

The Cancer Genome Atlas (TCGA)

The array CGH data used for this project was obtained from the TCGA data portal. This platform allows access to data sets, and it provides various types of data, including clinical information, genomic characterization data, and high level sequence analysis of the tumor genomes.

METHODS

Data

The data used for this project was obtained from the TCGA platform.
GBM Level 1 Array CGH data, from the Agilent Human Genome CGH Microarray 244A platform processed at the Harvard Medical School Center, was downloaded from the TCGA data portal. Level 1 data represents raw signals per probe for each participant's tumor sample. All data sets were processed using the R packages Bioconductor (Gentleman, 2004), limma (Smyth and Speed, 2003), and snapCGH (Smith, 2009).

Data Normalization during Pre-Processing

Raw array CGH data is often affected by many experimental and biological factors that make it difficult to identify the true copy number for a genomic clone. Biological factors include the purity and ploidy of a sample. In order to correct for this, background correction and normalization techniques were performed on each array. With normalization, the ploidy of the reference sample no longer played a role. The arrays were normalized using the normalizeWithinArrays() function within the limma package. This function normalized the expression log ratios for two-color spotted microarray experiments, so that the log ratios averaged to zero within each array. The backgroundCorrect() function, also within the limma package, was used to correct the background of the microarray expression intensities by subtracting the average signal intensity of the area between spots.
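The two pre-processing ideas described here, subtracting local background and centering the log ratios of each array on zero, can be illustrated with a small Python sketch. This is a deliberately simplified stand-in and not what limma does internally: it uses plain subtraction with a hypothetical positive floor and centers on the array median rather than fitting an intensity-dependent curve. The function names and the floor parameter are assumptions made for illustration.

```python
import math
from statistics import median


def background_correct(foreground, background, floor=1.0):
    """Subtract the local background from each spot intensity.

    The floor (a hypothetical choice) keeps intensities positive so that
    a log ratio can still be taken afterwards.
    """
    return [max(f - b, floor) for f, b in zip(foreground, background)]


def centered_log_ratios(test, reference):
    """Per-probe log2(test / reference), shifted so the array median is 0."""
    ratios = [math.log2(t / r) for t, r in zip(test, reference)]
    center = median(ratios)
    return [x - center for x in ratios]


# Toy intensities for three probes on one array.
test = background_correct([220, 420, 820], [20, 20, 20])
reference = background_correct([120, 120, 120], [20, 20, 20])
print(centered_log_ratios(test, reference))
```

After centering, a log ratio of 0 corresponds to the typical probe on the array, which is the baseline against which gains and losses are later judged.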

Segmentation of Data

Each array CGH data set was processed using the processCGH() function from the snapCGH package. This function used the normalized MAList, which contained the log expression ratios and was created by the normalization and background correction. It then ordered and filtered the clones based on the mapping information of the log ratios. Thus, the datasets were segmented. Using segmentation models, specific segments were identified and the segment variance of log ratio values was minimized.

Hidden Markov Models (HMMs)

Hidden Markov Models are a formal foundation for making probabilistic models of sequence labeling problems (Eddy, 2004). An HMM consists of a finite set of states, each with an emission probability distribution and specific transition probabilities to the other states. At each state, a residue is produced from the state's emission probability distribution. Then, the next state is chosen based on the state's transition probability distribution. The model thus generates two sets of information: the underlying state path, which is created while transitioning from state to state and is hidden, and the observed sequence, which consists of the residues emitted from each state in the state path. Because HMMs can effectively uncover the relationship between the underlying states and the observed emissions, they are useful in analyzing array CGH data. The log ratios obtained from the array CGH data are the emissions, and the underlying states are the copy number values of each region on the chromosome, which correspond to the emissions based on specific probabilities. Two types of HMMs exist to identify the underlying states of the array CGH data, representing the copy number aberrations: the homogeneous model and the heterogeneous model, which both have their own distinct advantages. The former option, the homogeneous HMM, estimates the number of hidden states via model selection and performs an analysis for each chromosome.
It regards the underlying states as segments of a common mean that represent the copy number values of each region. The homogeneous HMM assumes that the transition probability matrix is the same at every position along the chromosome and thus does not consider the distance between clones. To fit the unsupervised homogeneous HMM for each dataset, methods in the Bioconductor package snapCGH were used; the function runHomHMM() was used to discover the hidden copy numbers for each chromosome from the patient datasets for GBM. On the other hand, the heterogeneous HMM utilizes transition probabilities that are dependent on the distance between clones; specifically, the probability of remaining in the same hidden state is a decreasing function of the distance between one probe and the probe before it. When the distance between two clones is at its maximum, the state of a probe is not affected by the state of the previous clone. The function runBioHMM() was used for the heterogeneous HMM. This project uses both the homogeneous and heterogeneous HMMs to identify which one assesses the copy number variations more accurately, using simulated data and the corresponding True Positive Rates and False Discovery Rates.

RESULTS

First, the efficacy of the homogeneous HMM was compared to that of the heterogeneous HMM. A three-step algorithm was then used to identify the altered chromosomal regions in the cancer data. The three steps of the algorithm consist of the data pre-processing and segmentation of data, the identification of the hidden copy number states of the cancer data using the HMMs, and the quantification of the specific gains or losses to detect the genes in the regions of interest. The three-step algorithm is applied to array CGH GBM datasets for five different patients.
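The difference between the homogeneous and heterogeneous HMMs described in the Methods can be made concrete with a small Viterbi decoder. The Python sketch below is illustrative only and is not the snapCGH implementation: it assumes Gaussian emissions with a shared standard deviation, fixed state means, and a hypothetical exponential-decay form for the probability of staying in the same state. The names stay_probability, p_near, and scale are invented for this example. Passing no distances gives the homogeneous case, in which the same transition matrix is used everywhere.

```python
import math


def log_gauss(x, mean, sd):
    """Log density of a normal distribution, used as the emission model."""
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))


def stay_probability(distance, n_states, p_near=0.99, scale=1000.0):
    """Hypothetical decreasing function of inter-clone distance.

    Equals p_near at distance 0 and decays to 1 / n_states, at which point
    the previous state carries no information about the next one.
    """
    floor = 1.0 / n_states
    return floor + (p_near - floor) * math.exp(-distance / scale)


def viterbi(log_ratios, means, sd=0.2, distances=None, p_near=0.99, scale=1000.0):
    """Most likely hidden state path (indices into means); needs >= 2 states."""
    k = len(means)
    if distances is None:
        # Homogeneous HMM: identical transition probabilities at every step.
        distances = [0.0] * (len(log_ratios) - 1)
    score = [math.log(1.0 / k) + log_gauss(log_ratios[0], m, sd) for m in means]
    back = []
    for x, d in zip(log_ratios[1:], distances):
        stay = stay_probability(d, k, p_near, scale)
        move = (1.0 - stay) / (k - 1)
        new_score, pointers = [], []
        for j in range(k):
            cands = [score[i] + math.log(stay if i == j else move) for i in range(k)]
            best = max(range(k), key=cands.__getitem__)
            pointers.append(best)
            new_score.append(cands[best] + log_gauss(x, means[j], sd))
        score = new_score
        back.append(pointers)
    state = max(range(k), key=score.__getitem__)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# State means taken from the idealized log2 ratios for 1, 2, and 3 copies.
means = [-1.0, 0.0, 0.585]
observed = [0.0, 0.05, 0.6, 0.55, 0.62, 0.58, -0.02, 0.01]
print(viterbi(observed, means))
```

In this sketch a single outlying probe is smoothed over when clones are adjacent, but is allowed to form its own state when its neighbors are very far away, which is the behavior the heterogeneous model is designed to capture.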
True Positive Rate and False Discovery Rate

The efficacy of the homogeneous and heterogeneous HMMs is evaluated against the known truth of the simulated data by looking at the true positive and false discovery rates for breakpoint detection, as seen in Figure 3. The data was simulated using the simulateData() function in the snapCGH package. This function simulates aCGH data and was used to create 10 arrays to account for variation in copy number data. The compareSegmentations() function was used to create a matrix consisting of the true positive rates and the false discovery rates for each HMM; this function evaluates the performance of a segmentation method against the known truth of the simulated data. The boxplot() function was used to generate a plot of the rates to effectively compare the two HMMs. The true positive rates and the false discovery rates of both the homogeneous and heterogeneous HMMs demonstrated that the heterogeneous HMM was more successful in identifying the copy number values accurately.

Normalization, Background Correction, and Segmentation

The first step, the data pre-processing, helped eliminate background errors within the data using normalization and background correction methods. These methods made the subsequent steps less prone to error. Segmentation was carried out using the snapCGH package in R, which first splits each dataset into various segments based on the variation of copy number. Then the unsupervised HMM was used to find the copy number states of each chromosome. After the segmentation, the smoothed log ratios for each patient's data were plotted, as shown in Figure 4. Each figure represents the dataset from a different GBM patient and shows the log ratios of each patient plotted against genomic position in kilobases. The different colors represent the twenty-four total chromosomes in the human genome. These log ratios were used as the emissions in the HMM, necessary for determining the copy number states of each patient.

Use of the Hidden Markov Models

Both the homogeneous HMM and heterogeneous HMM were used to identify the copy number states of each chromosome for every patient; however, only results from the heterogeneous HMM are shown because of its higher efficacy. Figure 5 displays the plots of the hidden states of each patient that were found for each chromosome.
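Returning briefly to the evaluation described under True Positive Rate and False Discovery Rate: the two rates for breakpoint detection can be computed from a true state sequence and a called state sequence as in the Python sketch below. This is a minimal illustration under a stated assumption, namely that a called breakpoint counts as correct only if it falls at exactly the same probe as a true breakpoint. It is not the snapCGH implementation, and the function names are invented for this example.

```python
def breakpoints(states):
    """Indices i at which the state changes between probe i-1 and probe i."""
    return {i for i in range(1, len(states)) if states[i] != states[i - 1]}


def tpr_fdr(true_states, called_states):
    """Return (true positive rate, false discovery rate) for breakpoints.

    TPR = correctly called breakpoints / all true breakpoints
    FDR = wrongly called breakpoints / all called breakpoints
    Assumption: exact positional match is required for a correct call.
    """
    truth = breakpoints(true_states)
    called = breakpoints(called_states)
    hits = len(truth & called)
    tpr = hits / len(truth) if truth else 0.0
    fdr = (len(called) - hits) / len(called) if called else 0.0
    return tpr, fdr


# Toy example: simulated truth versus one segmentation of the same probes.
truth = [2, 2, 3, 3, 3, 2, 2, 1, 1]
calls = [2, 2, 3, 3, 3, 3, 2, 2, 2]
print(tpr_fdr(truth, calls))
```

A method with a high true positive rate and a low false discovery rate finds most real breakpoints while calling few spurious ones, which is the criterion used to compare the two HMMs.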
The plot for Patient 1 shows up-regulation of genetic data in autosomes 5 and 14 and sex chromosome Y, which is shown as chromosome 24; down-regulation of genetic data is observed in chromosomes 4 and 21. The plot of the states for Patient 2 demonstrates up-regulation in chromosomes 2, 4, 5, 7, 8, 9, 12, 14, 20, 21, and 23; down-regulation is seen in autosomes 1, 16, and 18 and sex chromosome Y. The states of Patient 3 show few copy number changes: autosomes 10 and 15 and sex chromosome Y have an increased copy number, and chromosome 1 has a decrease in copy number. On the other hand, the states of Patient 4 show greater copy number variance. Autosomes 3, 12, 14, 15, 16, 17, and 22 and sex chromosomes X and Y all demonstrate a greater copy number, and this patient has no losses in genetic data. Lastly, Patient 5 also has several copy number gains, in autosomes 2, 6, 7, 9, 12, 13, 14, 15, 20, and 22 and sex chromosome Y. While it is important to consider that each individual's genome contains several mutations and some copy number variations, whole chromosomal aberrations are quite often indications of disease. The identification of the copy number states of each chromosome in the genomes of cancer patients is useful for identifying common chromosomes that may impact the progression of the Glioblastoma Multiforme tumor. If alterations are observed in several tumors, genes can be identified as oncogenes or tumor suppressor genes through the analysis of the specific chromosomal position. In addition, the individual variance in copy number of each chromosome for each patient allows for personalized treatment.

Identification of Specific Gains or Losses

The third and final step was conducted by comparing the log ratio plots of the five patient samples, as seen in Figure 4, identifying the common regions with similar gains or losses, and mapping those regions to specific genes.
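One simple way to look for recurrent alterations is to tally how many patients share a gain or loss on each chromosome. The Python sketch below does this for the chromosome-level hidden-state calls listed above for the five patients; the dictionaries are transcribed from that description, and the function name recurrent and the threshold of three patients are illustrative choices. The comparison in this project was made on the log ratio plots rather than on such a tally, so this is a complementary sketch, not a reproduction of the analysis.

```python
from collections import Counter

# Chromosome-level calls transcribed from the text. Chromosome "23" is kept
# as printed (assumption: it likely denotes X, since Y is plotted as 24).
gains = {
    "Patient 1": ["5", "14", "Y"],
    "Patient 2": ["2", "4", "5", "7", "8", "9", "12", "14", "20", "21", "23"],
    "Patient 3": ["10", "15", "Y"],
    "Patient 4": ["3", "12", "14", "15", "16", "17", "22", "X", "Y"],
    "Patient 5": ["2", "6", "7", "9", "12", "13", "14", "15", "20", "22", "Y"],
}
losses = {
    "Patient 1": ["4", "21"],
    "Patient 2": ["1", "16", "18", "Y"],
    "Patient 3": ["1"],
    "Patient 4": [],
    "Patient 5": [],  # no losses are listed for this patient in the text
}


def recurrent(calls_by_patient, min_patients):
    """Chromosomes altered in at least min_patients patients, with counts."""
    counts = Counter()
    for chromosomes in calls_by_patient.values():
        counts.update(set(chromosomes))  # count each patient at most once
    return {c: n for c, n in counts.items() if n >= min_patients}


print(recurrent(gains, min_patients=3))
print(recurrent(losses, min_patients=2))
```

Chromosome-level tallies of this kind only narrow down candidate regions; mapping to specific genes still requires the probe-level positions.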
While some datasets displayed a more drastic change in the log ratios as compared to the other datasets, a majority of the datasets exhibited an elevated copy number at chromosomes 12 and Y and a decreased copy number at chromosome 1. The chromosome numbers are identified through the heterogeneous HMM analysis on single chromosomes. The variance in copy number among the different patients can be attributed to the diversity of genomic data from individual to individual. While each patient's genome may show common gains and losses, there are several external conditions that influence the expression of regions of the genome, including the patient's age and medical history. The gains or losses of certain chromosomal regions were identified using the plots of the copy numbers that rely on the log ratios.

DISCUSSION

This project aims to design an algorithm that can identify the copy number states for each chromosome. The method yields interesting data for analysis. This project applies the methods to Glioblastoma multiforme array CGH data to determine the copy number states for each chromosome. It also efficiently matches the corresponding copy number gain or loss to a certain region of interest that may be involved in the progression of the tumor. The results from this project can be used for improved and personalized treatment by identifying genes that are up-regulated or under-expressed. Each data set, obtained from a different patient affected by the same disease, has some differing log ratios and copy numbers. The variance in copy number among patients is due to factors, including environmental and hereditary information, that impact the log ratios and, thus, the copy number variations. For further research, patient medical history, age, and other medical factors can be included in the study in order to more accurately study chromosomal aberrations that are involved in GBM. Some similar regions of interest were identified amongst the GBM patients. Most of the datasets contain duplications at chromosome 12. Using the GeneName data, the names of the genes contributing to the elevated copy number were found. Chromosome 12 contains the genes PDE3A and ST8SIA1. PDE3A, or Phosphodiesterase 3A, plays a critical role in many cellular processes by regulating the amplitude and duration of the intracellular cyclic nucleotide signals. ST8SIA1, or ST8 Alpha-N-Acetyl-Neuraminide Alpha-2,8-Sialyltransferase 1, is important for cell adhesion and growth of malignant cells.
The dysregulation of these genes may contribute to the progression of cancer, as these genes are important in maintaining cell processes and seem to affect the growth of malignant cells. In chromosome 1, the genes AMY2A and KIFAP2 were under-expressed; this decrease in expression may have caused the cells to stop functioning normally and thus encouraged tumor growth. Additionally, heterozygous and homozygous duplications were seen near chromosome 19, which contains the genes MLL4 and PSENEN. MLL4, or Myeloid lymphoid or mixed-lineage leukemia 4, is most commonly seen in leukemia; however, it is often amplified in tumor cell lines and may be involved in the formation of the GBM tumor. Also, some patients had an increased copy number at chromosome 7, which represents the amplification of the Epidermal Growth Factor Receptor (EGFR) gene that causes cells to grow and divide. EGFR is a highly prominent oncogene present in various types of cancer, including GBM and lung cancers. In addition to the genes identified across all samples, genes specific to certain patients can be used for more personalized treatment.

CONCLUSION

This project has successfully utilized array CGH data to discover various genes that may impact the formation and progression of the GBM tumor in patients. The copy number phenotype discovered for each cancer patient is associated with a known biological marker that may be associated with the progression of the cancer, either by its over-expression or under-expression. If the gene is over-expressed, it is most likely an oncogene that causes cells to grow and divide, as observed in cancers. When the gene is under-expressed, the gene may be a cause of the tumor development because it is probably an important cell cycle gene that suppresses the formation of tumors in cells. The resulting copy number phenotype, determined with the HMM used in this project, is associated with biological markers that may be previously unassociated with the cancer phenotype.
This association will help provide the most reliable and informative genome characterization of cancer and the development of more specialized disease classification, prognosis, and personalized treatment for the cancer patient. Since this algorithm has been used on Level 1 data, this project has successfully demonstrated the analysis of the raw data by normalization, segmentation, and implementation of the HMMs to identify cancer biomarkers for the development of a better and more personalized form of treatment for patients affected with GBM. For further research, the algorithm used in this project can be used on more GBM datasets to more successfully find the biological markers that may cause the formation of the brain tumor within the cancer patients. Additionally, the algorithm used in this project can be utilized on other cancer types for a similar analysis of cancer biomarkers. While incorporating this algorithm, other medical factors can be taken into account to eliminate any interference in the study of the copy number variations. Further research can be conducted that will standardize the data to incorporate factors including the age and previous medical conditions of the patient.

ACKNOWLEDGEMENTS

I am grateful to Professor Susan Holmes from the Statistics Department at Stanford University for her valuable time, help, and guidance provided while I was conducting this project and taking the BioStatistics course; Professor Trevor Martin for his help during the BioStatistics course; and Julia Fukuyama for her advice on how to approach certain issues while using R. Also, my Physics-Honors teacher, Mrs. Segal, provided me with valuable advice while conducting my project. In addition, I am very grateful to Dr. Sean Davis, Staff Scientist at the Center for Cancer Research at the National Cancer Institute, for his valuable time and feedback provided while conducting this project. Also, I am thankful to my parents for their continuous support.

ANNOTATED BIBLIOGRAPHY

Eddy, Sean R. What Is a Hidden Markov Model. Nature.com. Nature Publishing Group, Web. 5 Oct

This research article discusses the definition of a Hidden Markov Model. The author defines a Hidden Markov Model as a formal foundation for making probabilistic models of sequences by considering transition probabilities. His definition encompasses the significance of this project, which uses Hidden Markov Models to find the underlying states from the given emissions.
Additionally, the author uses examples based on genetic sequences. Through these examples, he notes that the sequence, in terms of A, C, T, and G, represents the observed emissions, and the underlying state path is hidden and must be discovered through the use of the Hidden Markov Model, which contains transition probabilities. The author of this research article presents his research in a highly credible fashion since he first defines the Hidden Markov Model and then provides examples supporting his definition. In addition, he makes use of several sources from credible authors; for example, he cites Rabiner, who wrote a tutorial on Hidden Markov Models. Dr. Sean R. Eddy works at Howard Hughes Medical Institute and the Department of Genetics at Washington University School of Medicine. He has authored research papers that have used Hidden Markov Models. Thus, he is a credible source as he has the knowledge necessary for defining and demonstrating what a Hidden Markov Model is.

Olshen, A.B., E. S. Venkatraman, Robert Lucito, and Michael Wigler. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostat (2004) 5 (4): , doi: /biostatistics/kxh008.

The research paper, Circular binary segmentation for the analysis of array-based DNA copy number data, discusses another approach for analyzing array CGH data. The authors have utilized array CGH data and the circular binary segmentation method to translate noisy intensity measurements into regions of equal copy number. They have applied this method to test breast cancer data, as well as simulated data with known copy number alterations, to test the efficacy of their new method. They have effectively discovered another method for analyzing array CGH data to detect regions of gains and losses based on the segments that they found with their method.

The authors of this research paper present the research in a highly efficient and credible way as they have demonstrated a new development while applying it to simulated data and test data. Their method is one approach for analyzing array CGH data to obtain the over-expressed and down-regulated regions. Dr. Venkatraman is from the Department of Epidemiology and Biostatistics at the Memorial Sloan-Kettering Cancer Center; his position gives him the credibility for conducting this research. The other two authors, Robert Lucito and Michael Wigler, also have significant experience in the cancer field as they conduct cancer research at the Cold Spring Harbor Laboratory in New York.

Wang, P., Y. Kim, J. Pollack, B. Narasimhan, and R. Tibshirani. A Method for Calling Gains and Losses in Array CGH Data. Biostatistics 6.1 (2004): Web.

This research paper focuses on the development of a new method for detecting gains and losses in array CGH data. The authors utilize clustering to identify crucial regions. They have developed a new algorithm, Clustering along Chromosomes (CLAC), to detect specific regions. CLAC builds hierarchical clustering-style trees along each chromosome arm or chromosome and then selects the interesting clusters by controlling the False Discovery Rate. They have applied the method to a lung cancer microarray CGH data set. Their clustering algorithm is iterative: it starts from clusters containing one gene each and repeatedly merges two adjacent clusters until one big cluster is formed. The authors of this research paper all work in different departments at Stanford University and thus represent an interdisciplinary approach to this paper. The main author, Dr. Wang, works in the Statistics Department and thus is extremely knowledgeable in this field.
Their research provides a valuable insight into another way of analyzing array CGH data and underscores the necessity of analyzing array CGH data to find the regions that have demonstrated gains or losses for better disease treatment in the future.

WORKS CITED

Albertson, D.G. and Daniel Pinkel. Genomic microarrays in Human Genetic Disease and cancer. Hum. Mol. Genet. (2003) 12 (suppl 2): R145-R152, August 5, 2003, doi: /hmg/ddg261

Eddy, Sean R. What Is a Hidden Markov Model. Nature.com. Nature Publishing Group, Web. 5 Oct

Gentleman, R.C., Vincent J. Carey, Douglas M. Bates, Ben Bolstad, Marcel Dettling, Sandrine Dudoit, Byron Ellis, Laurent Gautier, Yongchao Ge, Jeff Gentry, Kurt Hornik, Torsten Hothorn, Wolfgang Huber, Stefano Iacus, Rafael Irizarry, Friedrich Leisch, Cheng Li, Martin Maechler, Anthony J. Rossini, Gunther Sawitzki, Colin Smith, Gordon Smyth, Luke Tierney, Jean Y. H. Yang, and Jianhua Zhang. Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology, 5:R80,

Marioni, J.C., N.P. Thorne, S. Tavare, F. Radyanyi. BioHMM: A heterogeneous Hidden Markov Model for Segmenting array CGH data. Bioinformatics. 2006; 22:

Olshen, A.B., E. S. Venkatraman, Robert Lucito, and Michael Wigler. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostat (2004) 5 (4): , doi: /biostatistics/kxh008.

Rabiner, L.R. A Tutorial on Hidden Markov Model and Selected Applications in Speech Recognition. Proceedings of the IEEE, Volume 77, February 1989,

Smith, M.L., John C. Marioni, Steven McKinney, Thomas Hardcastle and Natalie P. Thorne (2009). snapCGH: Segmentation, normalisation and processing of aCGH data. R package version

Redon, Richard, Shumpei Ishikawa, Karen R. Fitch, Lars Feuk, George H. Perry, T. Daniel Andrews, Heike Fiegler, Michael H. Shapero, Andrew R. Carson, Wenwei Chen, Eun Kyung Cho, Stephanie Dallaire, Jennifer L. Freeman, Juan R. González, Mònica Gratacòs, Jing Huang, Dimitrios Kalaitzopoulos, Daisuke Komura, Jeffrey R. Macdonald, Christian R. Marshall, Rui Mei, Lyndal Montgomery, Kunihiro Nishimura, Kohji Okamura, Fan Shen, Martin J. Somerville, Joelle Tchinda, Armand Valsesia, Cara Woodwark, Fengtang Yang, Junjun Zhang, Tatiana Zerjal, Jane Zhang, Lluis Armengol, Donald F. Conrad, Xavier Estivill, Chris Tyler-Smith, Nigel P. Carter, Hiroyuki Aburatani, Charles Lee, Keith W. Jones, Stephen W. Scherer, and Matthew E. Hurles. "Global Variation in Copy Number in the Human Genome." Nature (2006): Web.

Smyth, G.K. Limma: linear models for microarray data. In: Bioinformatics and Computational Biology Solutions using R and Bioconductor. R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, W. Huber (eds), Springer, New York, pages , Web.

Wang, P., Y. Kim, J. Pollack, B. Narasimhan, and R. Tibshirani. A Method for Calling Gains and Losses in Array CGH Data. Biostatistics 6.1 (2004): Web.

Zhang, N. DNA Copy Number Profiling in Normal and Tumor Genomes. Frontiers in Computational and Systems Biology. Vol. 15. London: Springer, Web.

FIGURES

Figure 1: Forms of chromosome changes.

Figure 2: Schematic representation of array CGH.

Figure 3: Boxplots comparing the efficacy of the Hidden Markov Models.

Figure 4: Log ratios for the five patients, which were used as the emissions in the Hidden Markov Models.

Figure 5: The states identified with the heterogeneous Hidden Markov Model for the five patients; they range from 0 to 5 for the chromosomes, depending on the patient.
