A METHOD FOR FINDING NOVEL ASSOCIATIONS BETWEEN GENOME-WIDE COPY NUMBER AND DNA METHYLATION PATTERNS


Man-Hung Eric Tang 1), Vinay Varadan 2), Sid Kamalakaran 2), Michael Q. Zhang 3), Nevenka Dimitrova 2), James Hicks 1)

1. Cold Spring Harbor Laboratory, 1 Bungtown Rd, NY 12724, USA
2. Philips Research North America, 345 Scarborough Rd, Briarcliff Manor, NY 10510, USA
3. The University of Texas at Dallas, Richardson TX 75080, USA and Tsinghua University, Beijing, China

mtang@cshl.edu

Abstract

We present a computational method that combines genome-wide DNA methylation and copy number variation data in an integrated fashion, with the aim of finding mechanistic associations between genome instability and local DNA methylation changes. The method is applied to early-stage Luminal A breast tumour samples and focuses on methylation events occurring at frequently rearranged genome locations. Our method accommodates both array and sequencing platforms for methylation and DNA copy number estimates. We find that significant local methylation changes in tumours tend to occur in the vicinity of breakpoint-rich regions, with 80% of the differentially methylated regions occurring within 2 Mb of a breakpoint-rich locus.

Keywords: breast cancer, genome instability, DNA methylation

I. INTRODUCTION

Breast cancer is a complex genetic disease characterized by multiple genetic and epigenetic changes, which have been widely studied over the past two decades. Pioneering work by Perou et al. [1] and Sørlie et al. [2] showed that breast tumours can be classified into five molecular subtypes with clinically different outcomes: Luminal A, Luminal B, HER2 positive, basal-like and normal-like. As new high-throughput methodologies have emerged, other genetic anomalies have been studied. Copy number variation (CNV) profiling is a well-established methodology for surveying major chromosomal rearrangements in the genome. Many studies [3,4,5] have shown that CNV patterns are important discriminating features between subtypes of breast and other cancers. Similarly, the characterization of cancer methylomes and their corresponding normal profiles is an important aspect of biomarker discovery. Kamalakaran et al. [6] showed that luminal and non-luminal breast tumours have different methylation patterns and that genes differentially methylated between tumour and normal cells could be used as prognostic factors. Epigenetic subtyping of breast cancer has also been addressed, for example in [7], which describes the epigenotypes of Luminal A, Luminal B, HER2 positive and basal-like breast tumours.

The relationship between gene expression, copy number and DNA methylation is still unclear. The problem has often been tackled from the gene expression perspective, looking at the impact of changes in copy number, methylation levels or both on gene expression, with the aim of finding potential therapeutic targets [6,8]. Alongside these classic gene-focused studies, it would be interesting to see whether epigenetic and genetic anomalies occur randomly and whether genome instability could be associated with local variations of DNA methylation. For this we need to look at where changes occur rather than at the actual levels, as the classic methods do. We propose an integrative framework that combines CNV profiles and DNA methylation information from the same tumours, focusing on events occurring at common chromosomal rearrangement breakpoints. We show in Luminal A samples that genomic loci with local DNA methylation changes tend to lie within 2 Mb of chromosomal breakpoint-dense regions, significantly more often than expected at random.

II. METHOD

A. Tumour sample set

We used the Norwegian breast cancer dataset (n = 119) described in Sørlie et al. [2]. Each patient in the study is further classified into one of the following sub-groups: Luminal A (40 patients), Luminal B (15), ERBB2 positive (19), basal-like (12), normal-like (14) and undefined (8). The normal tissue dataset consisted of 11 adjacent breast tissue samples. For each sample, we surveyed DNA methylation and copy number variation using the experimental platforms described below. The statistical analysis was performed on Luminal A samples only, which represent the largest and most homogeneous group in our dataset.

B. MOMA platform

We surveyed the methylome of each tumour sample using the MOMA platform [9]. Each CpG island is covered by one or several MOMA fragments that undergo MspI cleavage followed by McrBC or mock digestion. McrBC-digested and mock-digested fragments are then labeled and hybridized on a chip. The hybridization ratio reflects the level of methylation of the probed CpG island. In total, the CpG islands annotated by the UCSC Genome Browser (hg17 build) are covered by MOMA fragments. The data are normalized by converting the hybridization log-ratios into the probabilistic space using an Expectation-Maximization (EM) method [6]. Each MOMA fragment is assigned one of the following states: high methylation (+1), partial methylation (0) or low methylation (-1).

C. ROMA platform

To measure copy number variation across the genome, we used the ROMA platform described in Lucito et al. [5]. The genome is covered by regularly spaced probes printed on an array, providing a coverage of the genome of nucleotides resolution. Copy number ratios are measured using the skin fibroblast cell line CHPSKN-1 as reference. Since CHPSKN-1 cells come from a male individual, we restricted our analysis to the 22 autosomes. Copy number values are obtained using Circular Binary Segmentation (CBS) [10].

D. Flow diagram

Figure 1 presents the steps of the analysis procedure. The model contains three layers: input methylation and copy number data (dotted-line round boxes), computational modules (solid-line round boxes) and output data (square boxes). In the pre-processing step, we derive the profile of copy number gains and losses in the studied set of breast tumours and define windows of similar copy number status (amplified or deleted) across the genome. For each of these windows, a statistical test is performed to evaluate whether the methylation distribution differs from the background. We then obtain a list of regions with local methylation deviations, which we compare with loci of high chromosomal breakpoint density.

E. CNV analysis across tumour samples

We partitioned the genome into variable-length windows within which the copy number ratio remains constant in each sample. Windows are delimited by the union of all breakpoints obtained by segmenting the copy number values of each sample with the CBS algorithm. Long intervals describe regions that have very little copy number change across all patients, while short intervals correspond to regions with frequent copy number changes, i.e. many breaks across different samples. We defined three levels of amplification in order to bin samples into three categories.
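The window construction and three-way binning of this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`partition_windows`, `ratio_in_window`, `copy_state`) and the `(start, end, ratio)` tuple layout for CBS segments are hypothetical, and the coordinates are toy values. Only the 0.9 and 1.1 thresholds come from the text.

```python
def partition_windows(segments_by_sample, chrom_start, chrom_end):
    """Cut [chrom_start, chrom_end] at every segment boundary seen in any sample.

    segments_by_sample: one list per sample of (start, end, ratio) tuples
    (hypothetical layout for CBS output on a single chromosome).
    """
    cuts = {chrom_start, chrom_end}
    for segments in segments_by_sample:
        for start, end, _ratio in segments:
            cuts.add(start)
            cuts.add(end)
    edges = sorted(c for c in cuts if chrom_start <= c <= chrom_end)
    # Consecutive edges delimit windows in which every sample is constant.
    return list(zip(edges[:-1], edges[1:]))


def ratio_in_window(segments, window):
    """Return the ratio of the segment that covers the whole window."""
    w_start, w_end = window
    for start, end, ratio in segments:
        if start <= w_start and w_end <= end:
            return ratio
    raise ValueError(f"no segment covers window {window}")


def copy_state(ratio, low=0.9, high=1.1):
    """Bin a linear copy number ratio using the thresholds given in the text."""
    if ratio > high:
        return "amplified"
    if ratio < low:
        return "deleted"
    return "normal"


# Toy example: two samples on one chromosome spanning positions 0 to 100.
sample_a = [(0, 50, 1.0), (50, 100, 1.3)]
sample_b = [(0, 30, 0.8), (30, 100, 1.0)]

windows = partition_windows([sample_a, sample_b], 0, 100)
states = [
    [copy_state(ratio_in_window(s, w)) for s in (sample_a, sample_b)]
    for w in windows
]
for window, window_states in zip(windows, states):
    print(window, window_states)
```

Because each window lies entirely inside one segment of every sample, a single lookup per sample is enough to assign its state; the fraction of amplified or deleted samples per window then follows by counting.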
In each given interval, samples with a ROMA ratio greater than 1.1 are defined as amplified, samples with a linear ratio less than 0.9 as deleted, and samples whose ratio falls between these two values as normal. The thresholds that define the normal copy number ratio were chosen empirically to account for the measurement noise around 1. The CNV profile of the dataset can then be plotted as the fraction of samples showing amplifications and deletions across the genome (Figure 2).

F. Detection of local methylation changes

To identify local variations of DNA methylation in the 40 Luminal A samples, we compared the distribution of methylation calls within each of the intervals defined by the copy number breakpoints with the distribution observed across the genome. Each MOMA fragment is surveyed, and we associate with each fragment a triplet of observations giving the number of '+1's, '0's and '-1's seen across all samples. For example, a fragment can be called +1 in 30 samples, 0 in 3 samples and -1 in 7 samples.

Local changes in DNA methylation across the genome were identified using Hotelling's $T^2$ test, a generalization of Student's t-test to multivariate hypothesis testing. The null hypothesis $H_0$ is defined by the distribution of '+1's, '0's and '-1's observed at each fragment across the MOMA platform. It is estimated from the observations and has expectation $\mu_0 = (\mu_{01}, \mu_{02}, \mu_{03})$ and covariance $B$. If a window contains $n$ MOMA fragments, let $x_1, x_2, \ldots, x_n$ be $n$ independent 3-dimensional vectors, each following the normal law $N(\mu, B)$. Then the $T^2$ statistic can be expressed as:

$$T^2 = n\,(\hat{\mu} - \mu_0)^T S^{-1} (\hat{\mu} - \mu_0) \tag{1}$$

where

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{2}$$

and

$$S = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \hat{\mu})(x_i - \hat{\mu})^T \tag{3}$$

are the sample estimators of $\mu$ and $B$. Then $T^2$ has Hotelling's T-square distribution, and the statistic

$$F = \frac{n-p}{p\,(n-1)}\, T^2 \tag{4}$$

has a Fisher's F distribution with $p$ and $n-p$ degrees of freedom, where $p = 3$, and a non-centrality parameter determined by $(\mu - \mu_0)^T B^{-1} (\mu - \mu_0)$, which vanishes under $H_0$.
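As a rough illustration of equations (1) to (4), the statistic and its p-value can be computed as below. This is a sketch under stated assumptions rather than the authors' code: the function name `hotelling_t2_test` is hypothetical and the input is synthetic. It relies on `numpy.cov` (which uses the 1/(n-1) normalization by default) and on `scipy.stats.f.sf` for the upper tail of the F distribution.

```python
import numpy as np
from scipy import stats


def hotelling_t2_test(x, mu0):
    """One-sample Hotelling T^2 test of H0: mean(x) == mu0.

    x   : (n, p) array, one p-dimensional observation per row
    mu0 : length-p background mean
    Returns (T^2, F statistic, p-value).
    """
    x = np.asarray(x, dtype=float)
    mu0 = np.asarray(mu0, dtype=float)
    n, p = x.shape
    if n <= p:
        raise ValueError("need more observations than dimensions (n > p)")

    mu_hat = x.mean(axis=0)              # equation (2)
    s = np.cov(x, rowvar=False)          # equation (3), 1/(n-1) normalization
    diff = mu_hat - mu0
    # Equation (1); solve() avoids forming the explicit inverse of S.
    t2 = n * diff @ np.linalg.solve(s, diff)
    f_stat = (n - p) / (p * (n - 1)) * t2    # equation (4)
    p_value = stats.f.sf(f_stat, p, n - p)   # upper tail of F(p, n-p)
    return t2, f_stat, p_value


# Synthetic window of 12 fragments (illustrative values only, not real data).
rng = np.random.default_rng(0)
window = rng.normal(loc=[30.0, 3.0, 7.0], scale=[3.0, 1.0, 2.0], size=(12, 3))

t2_far, f_far, p_far = hotelling_t2_test(window, mu0=[0.0, 0.0, 0.0])
t2_same, f_same, p_same = hotelling_t2_test(window, mu0=window.mean(axis=0))
print(p_far, p_same)
```

One caveat with count triplets: if every triplet sums to the same total (for example 40 calls per fragment), the three components are linearly dependent and $S$ is singular, so `np.linalg.solve` will fail or be numerically unstable. The text does not say how this case is handled; the synthetic data above deliberately avoids it.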
To test whether the null hypothesis $H_0: \mu = \mu_0$ is rejected, we compute the F statistic from the observations $x_1, x_2, \ldots, x_n$ of the 3-dimensional normal law $N(\mu, B)$ and derive the associated p-value. A window is considered to have a significant deviation in its methylation pattern if its p-value is smaller than the significance cut-off (see Results).

G. Breakpoint dense regions

We used the segment starts and ends defined by the CBS algorithm for the CNV profile of each sample as our breakpoints. We then calculated the density function of these breakpoints and defined the centers of the breakpoint-dense regions as the local maxima of the density.

III. RESULTS

A. Identification of local methylation changes within frequently recombined regions

Figure 2 summarizes the genome-wide analysis, integrating DNA copy number and DNA methylation for our 40 Luminal A samples. The aim is to find all local differentially methylated foci along the copy number aberration profile derived from the whole dataset. The top track (CNV) shows the frequency of gains and losses across the genome for all the tumours combined. The genome is partitioned into variable-length windows in which all the samples share a similar copy number state: amplified or deleted. We compute the methylation profile of each of these windows and detect those showing a local difference compared to the genome background. The mean methylation distribution of each window along the genome is shown by the tri-color stripe, in which the fractions of unmethylated, partially methylated and methylated states are shown in blue, yellow and red respectively. The scores of significant loci are shown by the red peaks.

Using a p-value cut-off of $10^{-3}$ after Benjamini and Hochberg FDR correction, we identified 66 regions in the genome with a significant methylation deviation compared to the distribution seen genome-wide. We compared these locations with the regions of high breakpoint density, shown in the bottom track of Figure 2. The results are described in the next section.

B. Local methylation changes in Luminal A samples co-localize with breakpoint rich regions within 2 Mb

The overall spatial distribution of local methylation changes relative to background, shown in Figure 2, seemed to indicate that they frequently occur near frequently rearranged genomic loci. To verify this, we plotted the cumulative fraction of methylation peaks as a function of the distance to the nearest breakpoint-dense region (solid red line, Figure 3). The result showed that the observed regions with strong methylation change compared to background do tend to occur within 2 Mb of a breakpoint-dense region. The null distribution was estimated by randomizing the locations of these methylation changes 1000 times and plotting the mean cumulative distribution (dotted red line). We also tested where the difference between the two curves was the most significant (blue line). An FDR-corrected Wilcoxon test was performed between the two distance distributions, taking that of the randomized data as the reference. The statistic thus shows where the curves differ the most. We found that the difference in coverage is maximized at a distance of 2 Mb, where 80% of the local methylation changes in Luminal A co-localize with breakpoint-rich regions. The random model, on the other hand, showed lower coverage, with 70% of the simulated events falling within 2 Mb.

IV. CONCLUSIONS

We designed a computational framework integrating DNA copy number and DNA methylation with the aim of uncovering certain aspects of the mechanisms involved in large chromosomal rearrangements in breast cancer.
CNV and DNA methylation are combined in an integrated fashion, focusing on regions with frequent chromosomal rearrangements. We found that breakpoint-rich genomic regions tend to coincide with local changes in DNA methylation pattern. Genome instability is often associated with a high density of LINE and SINE elements, which have been shown to be methylation resistant in cancer cell lines. In future work, it will be interesting to investigate further the link between methylation, genome instability and retro-transposable repeats.

REFERENCES

[1] Perou CM, Sørlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA, Fluge O, Pergamenschikov A, Williams C, Zhu S, Lønning PE, Børresen-Dale AL, Brown PO, Botstein D. Molecular portraits of human breast tumours. Nature Aug 17;406(6797):
[2] Sørlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, Deng S, Johnsen H, Pesich R, Geisler S, Demeter J, Perou CM, Lønning PE, Brown PO, Børresen-Dale AL, Botstein D. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci U S A Jul 8;100(14): Epub 2003 Jun 26.
[3] Bergamaschi A, Kim YH, Wang P, Sørlie T, Hernandez-Boussard T, Lonning PE, Tibshirani R, Børresen-Dale AL, Pollack JR. Distinct patterns of DNA copy number alteration are associated with different clinicopathological features and gene-expression subtypes of breast cancer. Genes Chromosomes Cancer Nov;45(11):
[4] Chin K, DeVries S, Fridlyand J, Spellman PT, Roydasgupta R, Kuo WL, Lapuk A, Neve RM, Qian Z, Ryder T, Chen F, Feiler H, Tokuyasu T, Kingsley C, Dairkee S, Meng Z, Chew K, Pinkel D, Jain A, Ljung BM, Esserman L, Albertson DG, Waldman FM, Gray JW. Genomic and transcriptional aberrations linked to breast cancer pathophysiologies. Cancer Cell Dec;10(6):
[5] Lucito R, Healy J, Alexander J, Reiner A, Esposito D, Chi M, Rodgers L, Brady A, Sebat J, Troge J, West JA, Rostan S, Nguyen KC, Powers S, Ye KQ, Olshen A, Venkatraman E, Norton L, Wigler M. Representational oligonucleotide microarray analysis: a high-resolution method to detect genome copy number variation. Genome Res Oct;13(10):
[6] Kamalakaran S, Varadan V, Giercksky Russnes HE, Levy D, Kendall J, Janevski A, Riggs M, Banerjee N, Synnestvedt M, Schlichting E, Kåresen R, Shama Prasada K, Rotti H, Rao R, Rao L, Eric Tang MH, Satyamoorthy K, Lucito R, Wigler M, Dimitrova N, Naume B, Borresen-Dale AL, Hicks JB. DNA methylation patterns in luminal breast cancers differ from non-luminal subtypes and can identify relapse risk independent of other clinical variables. Mol Oncol Feb;5(1): Epub 2010 Dec 2.
[7] Bediaga NG, Acha-Sagredo A, Guerra I, Viguri A, Albaina C, Ruiz Diaz I, Rezola R, Alberdi MJ, Dopazo J, Montaner D, de Renobales M, Fernández AF, Field JK, Fraga MF, Liloglou T, de Pancorbo MM. DNA methylation epigenotypes in breast cancer molecular subtypes. Breast Cancer Res Sep 29;12(5):R77.
[8] Staaf J, Jönsson G, Ringnér M, Vallon-Christersson J, Grabau D, Arason A, Gunnarsson H, Agnarsson BA, Malmström PO, Johannsson OT, Loman N, Barkardottir RB, Borg A. High-resolution genomic and expression analyses of copy number alterations in HER2-amplified breast cancer. Breast Cancer Res. 2010;12(3):R25. Epub 2010 May 6.
[9] Kamalakaran S, Kendall J, Zhao, Tang C, Khan S, Ravi K, Auletta T, Riggs M, Wang Y, Helland A, Naume B, Dimitrova N, Børresen-Dale AL, Hicks J, Lucito R. Methylation detection oligonucleotide microarray analysis: a high-resolution method for detection of CpG island methylation. Nucleic Acids Res Jul;37(12):e89. Epub 2009 May 27.
[10] Venkatraman ES, Olshen AB. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics Mar 15;23(6): Epub 2007 Jan 18.

ACKNOWLEDGMENT

Philips Research grant to Cold Spring Harbor Laboratory and NIH ES and HG grants to MQZ.


Figure 1. Flowchart of the analysis pipeline.

Figure 2. Detection of significant local changes in DNA methylation distribution across the genome. The CNV profile of the 40 Luminal A samples is shown on the top track (CNV). Local deviations of the distribution of methylated, unmethylated and partially methylated samples are computed in regions that are frequently amplified or deleted (orange peaks). The actual methylation distribution across samples is shown in the tri-color bar (red-yellow-blue). We compared the locations of the deviations with those where breakpoints frequently occur (Breakpoint density track).


Figure 3. Co-localization of regions with significant methylation change with breakpoint-rich regions. The solid red line represents the cumulative fraction of identified regions with methylation change occurring within a given distance of the nearest breakpoint-rich region. The true data differ significantly from random (dotted red line) at a distance of 2 Mb.
