LTA Analysis of HapMap Genotype Data


Introduction. This supplement to "Global variation in copy number in the human genome", by Redon et al., describes the details of the LTA analysis used to screen HapMap CNV calls for somatic artifacts on the basis of the HapMap genotype data (as presented in Supplementary Table 5). Additional analyses of the HapMap data based on the LTA approach are also described in this supplement.

It is generally accepted that an analysis of high-quality SNP genotype data from related individuals can yield information about the location of segregating germline deletions. Recently, methods have been developed that exploit unusual features of such data (i.e., Mendelian inheritance errors (MIs), departures from Hardy-Weinberg equilibrium, and patterns of uncalled genotypes). These methods have been successfully applied to the high-density genomic SNP data generated by the HapMap project to detect experimentally validated deletions (Conrad et al. 2006; McCarroll et al. 2006). In principle, it should be possible to follow a similar tactic to identify deletions that occur in the soma, for example in cell lines generated from somatic tissue. We set out to explore the feasibility of such an approach, develop a method (if possible), and apply it to the Consortium CNV data to detect potential cell-line artifacts.

The data used for these analyses are the Consortium data (both 500K EA and WGTP calls) and the Phase I HapMap data (release 20, for all autosomes) (International HapMap Consortium 2005). We make limited use of the Phase II data when trying to establish which CNV calls made in the present study are likely to be somatic. Filtering of the HapMap data was done as described in Conrad et al. (2006).

Outline of the LTA Method. The pattern that we wish to exploit comes from a deletion occurring in the cell line of a previously copy-number-normal trio member, at an autosomal locus where the other two trio members have normal (n=2) copy number.
Specifically, we are trying to find clusters of SNPs at which an allele transmitted from parent to child has subsequently been deleted in the parent. We call this approach "Loss of Transmitted Allele" analysis, or LTA for short. This method will not work if the cell-line artifact is a duplication, or if the deletion occurs within a person who carried only one copy of the segment to begin with. The method is based on the prediction that a SNP genotyping experiment will call a deletion hemizygote as homozygous for the allele that is present. When an allele that is present only once in all four parental gametes is deleted after transmission to a child, that information is erased and it appears that the child has inherited a de novo mutation. SNPs typed across a cell-line deletion would hence be enriched for Mendelian inheritance (MI) errors (e.g., LTA Figure 1).

LTA Figure 1: Example of trio genotype configurations used in this analysis.

Proof of Principle. We extended the model described in Conrad et al. (2006) to include genotype configurations that are informative of a somatic deletion, leading to 8 general trio configurations. To gain a better understanding of the patterns of trio genotypes generated by a somatic deletion, we scored the Phase I HapMap SNP genotypes underlying a known cell-line deletion in NA07055 (del(2q23-q24), greater than 15Mb long); the resulting counts are presented in LTA Table 1. The run of Type II MIs in the Phase I data caused by this artifact is apparent upon visual inspection (LTA Figure 2).
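The genotype signature described above can be illustrated with a short sketch. The Python below is illustrative only: the function names and the two-character genotype encoding (for example "AB") are assumptions introduced here, not part of the original analysis. It checks whether a called trio genotype is Mendelian-consistent, and whether the child carries an allele that appears in neither parental call, which is the pattern expected when a transmitted allele is later lost from a parental cell line.

```python
def mendel_consistent(child, mother, father):
    """True if the child's call can be built from one maternal and one paternal allele."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


def has_unexplained_allele(child, mother, father):
    """True if the child carries an allele absent from both parental calls.

    This mimics an apparent de novo mutation, the signature expected when a
    transmitted allele has subsequently been deleted in the parent.
    """
    parental_alleles = set(mother) | set(father)
    return any(allele not in parental_alleles for allele in child)


# Hypothetical example: the mother was truly A/B and transmitted B, but her
# cell line later lost the B-carrying segment, so she is called "AA".
child, mother, father = "AB", "AA", "AA"
print(mendel_consistent(child, mother, father))       # inheritance error
print(has_unexplained_allele(child, mother, father))  # LTA-like signature
```

A configuration such as child "BB", mother "AA", father "BB" is also a Mendelian error, but every allele in the child is present in some parental call, so the second check does not flag it.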

Trio Class   Description                                             Count
1            Type I MI, deletion in mother (germline or somatic)     54
2            Type I MI, deletion in father (germline or somatic)     0
3            Type I MI, deletion in either parent (somatic only)     28
4            Uninformative
5            Compatible with somatic deletion in mother
6            Compatible with somatic deletion in father
7            Incompatible with somatic deletion in either parent
8            Compatible with somatic deletion in either parent       532

LTA Table 1: Distribution of trio genotype configurations within known cell line artifact

LTA Figure 2: Pattern of Type II MIs created by somatic deletion. Each Type II MI in the Phase I HapMap data from chromosome 2 is plotted by physical position (NCBI34) for all 30 CEU families. The run of MIs caused by a somatic event starting near 150Mb is quite apparent. Data are from release 16c.1.

Several important points stand out. First, there is a strong, specific signal in the pattern of MIs from the family of this individual. Individual NA07055 is the mother of the trio. All MIs are either Type II or Type I but from the mother only. The rate of MIs in this stretch is 0.01, twenty-five times the genome average. Second, within the class of somatic-deletion-compatible trio configurations, configurations specific to maternal deletions outnumber paternal configurations by almost 3:1. However, the fact that paternal-specific configurations occur at all indicates that there is some noise in the data and that our modeling assumptions do break down at points. This phenomenon has been noted in prior work, and can, for instance, lead to the erroneous splitting of a single deletion-compatible stretch into multiple smaller stretches.

Screening HapMap data for artifacts. Two major questions can be addressed with this modeling approach. First, how many CNVs called in the present study are not true germline deletions but artifacts of the cell culture process? Second, how many cell-line events have occurred within the HapMap as a whole?

500K EA and WGTP HapMap data. Consortium Data. Screening the HapMap CNV calls essentially amounts to a problem of multiple testing, where for each CNV we test the null hypothesis "This is a germline CNV" based on the trio genotypes at the underlying N HapMap SNPs. We assume that the number of Type II MIs in a set of genotypes from a trio is a binomial random variable, and that the genotyping experiment on a given SNP is independent of experiments at all other SNPs (an assumption not likely to hold for some Perlegen SNPs and instances of pooling). We estimate θ_c, the rate of Type II MIs, across the entire HapMap for each combination of genotyping platform and genotyping center, a total of 11 rates. If we see x Type II MIs underlying a called CNV, we calculate the probability that the run of MIs is due to random genotyping error as the probability of observing x or more Type II MIs in a stretch of N trio genotypes,

$$P(X \ge x) = \sum_{k=x}^{N} \binom{N}{k}\, \theta^{k} (1-\theta)^{N-k}.$$

Here θ is taken to be the arithmetic mean of the expected Type II rate for all SNPs within the CNV. 500K EA and WGTP CNVs are assessed separately, as are CNVs in each (CEU, YRI) population.

Results. The data for this analysis were all preliminary deletion calls in the CEU and YRI parents (3252 from the WGTP, 1506 from 500K EA). The procedure for deciding which regions were significant was somewhat convoluted, but its ultimate goal was to create a conservative (leaning towards over-calling) set of somatic artifact CNVs. First, all CNV calls with no Type II MIs were removed from the list (2748/3252, WGTP; 1392/1506, 500K EA). Notably, only 3.2% (153/4758) of all calls overlapped 2 or more Type II MIs. A p-value was then calculated for each CNV as described above. During this analysis, it became clear that Type II MIs were sometimes created when two hemizygous parents (each called as a SNP homozygote) give rise to a homozygous null child (erroneously called a heterozygote). Our explanation of this phenomenon is that at high deletion frequencies (perhaps > 40%), SNP genotyping algorithms often cluster CNV status instead of SNP genotype status. To avoid such false positives, we removed CNVs from regions with more than 2 "deletion" CNVs segregating in the same population and detected by the same platform. Other, similar frequency thresholds were tested but did not substantially change the number of artifact calls (data not shown). We note that identifying somatic variants at loci with common germline variation is a challenging problem for any approach using only cell line DNA (see Caveats section). CNVs from the remaining set were deemed significant if they had a significant Bonferroni-corrected p-value, corrected against the number of CNVs left within the same platform-population group (e.g., CEU-WGTP CNVs). In total, we made 16 calls, with uncorrected p-values ranging from to 3.3 x (presented in Supplementary Table 5).
In a separate analysis of offspring genotypes, one additional CNV was called as a cell-line artifact on the basis of the SNP failure pattern. Strictly speaking, it is not possible to distinguish a de novo mutation from a somatic event in this case without additional biological material from the donor (or perhaps DNA from such a person's offspring).
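The per-CNV test described under Consortium Data can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the per-SNP expected rates are assumed to be supplied by the caller, and the significance level of 0.05 is an assumption, since the text does not state the level used before correction.

```python
from math import comb


def binom_tail(x, n, theta):
    """P(X >= x) for X ~ Binomial(n, theta)."""
    return sum(comb(n, k) * theta**k * (1.0 - theta)**(n - k)
               for k in range(x, n + 1))


def cnv_p_value(n_type2, expected_rates):
    """Tail probability of seeing n_type2 or more Type II MIs under a CNV.

    expected_rates holds one expected Type II MI rate per SNP (for example
    the rate for that SNP's platform/center combination); theta is their
    arithmetic mean, as described in the text.
    """
    n = len(expected_rates)
    theta = sum(expected_rates) / n
    return binom_tail(n_type2, n, theta)


def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values that remain below alpha after Bonferroni correction.

    alpha=0.05 is an assumed level, not one stated in the source.
    """
    m = len(p_values)
    return [p * m < alpha for p in p_values]


# Hypothetical CNV spanning 40 SNPs, each with expected rate 5e-4, with 2 Type II MIs.
p = cnv_p_value(2, [5e-4] * 40)
print(p)
```

In use, `bonferroni_significant` would be applied separately to the p-values of each platform-population group, matching the grouping described above.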


Power. The power to detect somatic events in the HapMap data is a function of the number of SNPs that are typed within the event and the allele frequencies at those SNPs. One possible limitation to the power of our analysis is the relationship between SNP density and the location of CNVs. Analysis by our group and others suggests that SNP density is substantially lower near CNV regions. To assess the impact of SNP density, we conducted a power simulation using the complete set of 4758 parental deletions. For each CNV locus, we simulated a somatic event in each parent of each of the other 29 families, and recorded the number of times the event would have been detected using the same thresholds of significance as our initial screen (YRI, p < 0.003; CEU, p < ). As expected, the mean power to detect WGTP somatic CNVs (YRI, 0.61; CEU, 0.48) was much greater than the power to detect somatic CNVs identified with the 500K EA (YRI, 0.23; CEU, 0.15). Results are shown in LTA Figure 3. The WGTP numbers are likely to be somewhat inflated, as the size of each event detected on this platform will be over-estimated on average. Although power is low in many regions, the total number of artifacts called is not likely to be off by more than a factor of 5 based on this analysis.

LTA Figure 3: Power to detect artifacts in consortium data. The relationship between CNV size and power to reject the hypothesis that the CNV is a germline event using Phase II HapMap data. Red points indicate CNVs detected with the WGTP; blue points, 500K EA.
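To show why power depends so strongly on the number of SNPs under an event, here is a simplified analytic stand-in for the simulation described above. It is not the authors' procedure: it assumes that each SNP inside a somatic deletion independently produces a Type II MI with a single probability `theta_somatic`, a hypothetical parameter that in practice would depend on allele frequencies. All function names and numerical values are illustrative.

```python
from math import comb


def binom_tail(x, n, theta):
    """P(X >= x) for X ~ Binomial(n, theta)."""
    return sum(comb(n, k) * theta**k * (1.0 - theta)**(n - k)
               for k in range(x, n + 1))


def critical_count(n, theta_null, alpha):
    """Smallest count of Type II MIs whose null tail probability is below alpha."""
    for x in range(n + 1):
        if binom_tail(x, n, theta_null) < alpha:
            return x
    return None  # no count among n SNPs can reach significance


def power(n, theta_null, theta_somatic, alpha):
    """Probability of reaching the critical count when a somatic deletion is present."""
    x_crit = critical_count(n, theta_null, alpha)
    if x_crit is None:
        return 0.0
    return binom_tail(x_crit, n, theta_somatic)


# Illustrative values only: error rate 5e-4, somatic per-SNP rate 0.05, alpha 0.003.
for n_snps in (20, 200):
    print(n_snps, power(n_snps, 5e-4, 0.05, 0.003))
```

Under these assumptions an event covering 200 SNPs is detected far more often than one covering 20, which is consistent with the lower power reported for the sparser 500K EA calls.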

HapMap Full-Genome Screen. There has been some speculation that the CEU cell lines, which were collected many years ago, may have undergone more somatic rearrangement than the cell lines collected for the other HapMap samples. The total rate of Type II MIs in the Phase I HapMap CEU genotype data, 3.9 x 10^-4, is slightly lower than the rate in YRI, 5.0 x 10^-4. This is in contrast to the rate of Type I MIs, which should be created by both somatic and germline deletions, where the rate in YRI is twice the rate in CEU (1.1 x 10^-3 and 5.3 x 10^-4, respectively).

As a second approach to identifying cell-line deletions, we disregard the location of CNVs detected by the Consortium and simply sift through the Phase I HapMap data using a sliding-window approach. The choice of window size involves a trade-off between resolution and power; after running power analyses with different window sizes, we settled on a 100-SNP window (LTA Figure 4). The median size of the 100-SNP window (264kb) falls in the middle of the range of CNV events detected by the Consortium (mean over platforms 249kb, median 165kb). The median power is higher in YRI (0.61) than CEU (0.5). For each family, we split the genome into non-overlapping 100-SNP windows and calculate the probability that all Type II MIs within that window are due to random error. The resulting p-values from CEU and YRI were ranked separately, and thresholds of significance were determined that control the false discovery rate (FDR) at 0.05 using the method of Benjamini and Hochberg (1995). Based on this analysis, we retained 65 windows with p < from CEU and 34 windows with p < in YRI.
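The two mechanical steps of the genome screen, windowing and FDR control, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the input is assumed to be a per-SNP list of booleans marking Type II MIs for one family, and dropping a trailing partial window is a choice made here, not one described in the text. The step-up rule is the standard Benjamini-Hochberg procedure.

```python
def window_counts(is_type2_mi, window=100):
    """Count Type II MIs in consecutive non-overlapping windows of SNPs.

    A trailing window with fewer than `window` SNPs is dropped (an assumption).
    """
    n_full = len(is_type2_mi) // window
    return [sum(is_type2_mi[i * window:(i + 1) * window]) for i in range(n_full)]


def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of hypotheses rejected at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value is under rank * q / m.
        if p_values[idx] <= rank * q / m:
            k_max = rank
    return sorted(order[:k_max])


# Hypothetical family with 250 typed SNPs and four Type II MIs.
flags = [False] * 250
for position in (3, 120, 130, 240):
    flags[position] = True
print(window_counts(flags))

# Hypothetical window p-values.
print(benjamini_hochberg([0.205, 0.001, 0.06, 0.008]))
```

Each window count would then be converted to a p-value with a binomial tail probability before the FDR step, with CEU and YRI ranked separately as described above.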

LTA Figure 4: Power of genome screen. Histograms of estimated power for each 100-SNP window within all CEU and YRI families, estimated using the Phase I HapMap data.

The results of this analysis are presented in LTA Tables 2 and 3. (These results are from a different analysis than the results presented in Supplementary Table 5.) In these tables, we highlight the correspondence between unusual regions identified in the genome-wide screen, preliminary CNV calls removed on the basis of the Phase II data, and genomic regions where no CNVs were detected by either platform. Although we average the expected rate of Type II MIs across the entire window, it is possible that a deletion substantially smaller than the window size can lead to a significant test. If the positions of the outermost Type II MIs in a window are taken as the breakpoints of a putative deletion, the median length of CEU somatic events is 78.6kb, with YRI events slightly larger at 86kb; the median number of Type II MIs involved in an event was 3 in each population. Interestingly, 6/65 CEU windows overlap immunoglobulin loci, while 8/65 overlap the del(2q23-q24) artifact detected in NA07055 (described above). Notably, 4/6 of the most significant somatic artifacts called in the preliminary CNV calls are detected in this screen. The two that are not detected are on chromosomes 15 and 19; the center responsible for typing the bulk of Phase I SNPs for these chromosomes filtered Type II MIs from their data prior to release. Presumably, a next-generation genome screen using Phase II data will pick up these unusual features.

Unless otherwise noted, these unusual regions from the genome-wide screen overlap loci of (often considerable) CNV polymorphism detected in the present study. Although we suspect that many of the significant windows in these regions may simply be an artifact of systematic genotyping error in the presence of high-frequency copy number variation (see Caveats section), there remains the possibility that some may represent true somatic events. Other methods will be required to unravel such complexity.
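The breakpoint convention used for the event lengths above (outermost Type II MIs in a significant window) is simple enough to state as code. The sketch below is illustrative: the function name is hypothetical and the positions are made-up coordinates, not values from the HapMap data.

```python
from statistics import median


def putative_deletion(mi_positions):
    """Return (start, stop, length) spanned by the outermost Type II MIs in a window."""
    if not mi_positions:
        return None
    start, stop = min(mi_positions), max(mi_positions)
    return start, stop, stop - start


# Hypothetical Type II MI positions (bp) in three significant windows.
windows = [
    [150_221_400, 150_260_000, 150_300_000],
    [20_000_000, 20_050_000],
    [7_500_000, 7_590_000, 7_640_000],
]
lengths = [putative_deletion(w)[2] for w in windows]
print(median(lengths))
```

Because the true breakpoints lie somewhere beyond the outermost MIs, lengths computed this way are lower bounds on the size of the underlying event.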

LTA Table 2: List of 34 regions of unusual Type II MI clusters in YRI Phase I HapMap data. Expected Type II MI rate is the expected rate of Type II MIs across the region, given a model of random genotype error and conditional on the platform and center used to type the SNPs. The number of Type I and Type II MIs are broken down for each cluster. "Removed from prelim": corresponds to a CNV removed from preliminary CNV calls before downstream analysis; "no CNV call": no CNV was called on either platform in any population at this locus.

Columns: Child ID, Mother ID, Father ID, Chr, Start, Stop, P-value, Expected Type II MI Rate, Type I MI mother, Type I MI father, Type II MI, Comments

NA19132 NA19131 NA E
NA19211 NA19209 NA E
NA19173 NA19172 NA E
NA19142 NA19140 NA E no CNV call
NA18863 NA18861 NA E
NA19142 NA19140 NA E
NA19100 NA19099 NA E
NA18863 NA18861 NA E E
NA19205 NA19204 NA E E Removed from prelim
NA19205 NA19204 NA E Removed from prelim
NA19240 NA19238 NA E E no CNV call
NA18860 NA18858 NA E
NA19145 NA19143 NA E
NA18863 NA18861 NA E
NA19194 NA19193 NA E E no CNV call
NA18872 NA18870 NA E

NA19103 NA19102 NA E no CNV call
NA19161 NA19159 NA E
NA18863 NA18861 NA E
NA18863 NA18861 NA E
NA18863 NA18861 NA E
NA18863 NA18861 NA E
NA18854 NA18852 NA E
NA19120 NA19116 NA E E no CNV call
NA19100 NA19099 NA E E no CNV call
NA19205 NA19204 NA E no CNV call
NA18506 NA18508 NA E no CNV call
NA19132 NA19131 NA E no CNV call
NA19103 NA19102 NA E
NA19221 NA19222 NA E no CNV call
NA18857 NA18855 NA E no CNV call
NA19173 NA19172 NA E E IgH
NA19211 NA19209 NA E no CNV call
NA19154 NA19152 NA E E IgL

LTA Table 3: List of 65 regions of unusual Type II MI clusters in CEU Phase I HapMap data. Expected Type II MI rate is the expected rate of Type II MIs across the region, given a model of random genotype error and conditional on the platform and center used to type the SNPs. The number of Type I and Type II MIs are broken down for each cluster. "Removed from prelim": corresponds to a CNV removed from preliminary CNV calls before downstream analysis; "no CNV call": no CNV was called on either platform in any population at this locus.

Columns: Child ID, Mother ID, Father ID, Chr, Start, Stop, P-value, Expected Type II MI Rate, Type I MI mother, Type I MI father, Type II MI, Comments

NA12753 NA12763 NA E
NA07029 NA07000 NA E no CNV call
NA10839 NA12006 NA E
NA12801 NA12813 NA E
NA10857 NA12044 NA E
NA12753 NA12763 NA E no CNV call
NA12707 NA12717 NA E
NA12707 NA12717 NA E no CNV call
NA07348 NA07345 NA E
NA06991 NA06985 NA E
NA07029 NA07000 NA E no CNV call
NA12752 NA12761 NA E no CNV call
NA10839 NA12006 NA E
NA10835 NA12249 NA E E
NA10839 NA12006 NA E E
NA07048 NA07055 NA E E q23.1-q24.3 deletion
NA07048 NA07055 NA E E q23.1-q24.3 deletion
NA07048 NA07055 NA E E q23.1-q24.3 deletion
NA07048 NA07055 NA E E q23.1-q24.3 deletion

NA07048 NA07055 NA E E q23.1-q24.3 deletion
NA07048 NA07055 NA E E q23.1-q24.3 deletion
NA07048 NA07055 NA E E q23.1-q24.3 deletion
NA07048 NA07055 NA E E q23.1-q24.3 deletion
NA12753 NA12763 NA E
NA12865 NA12875 NA E E no CNV call
NA07348 NA07345 NA E Removed from prelim
NA07348 NA07345 NA E Removed from prelim
NA12864 NA12873 NA E no CNV call
NA12865 NA12875 NA E
NA12801 NA12813 NA no CNV call
NA07019 NA07056 NA no CNV call
NA10846 NA12145 NA E
NA12740 NA12751 NA E
NA12802 NA12815 NA E
NA07048 NA07055 NA E Removed from prelim
NA10854 NA11840 NA E
NA10860 NA11993 NA E no CNV call
NA10839 NA12006 NA E
NA12740 NA12751 NA E E
NA12753 NA12763 NA E E no CNV call
NA10851 NA12057 NA E E no CNV call
NA12752 NA12761 NA E E no CNV call
NA07019 NA07056 NA E E no CNV call
NA07348 NA07345 NA E E Removed from prelim
NA12752 NA12761 NA E E no CNV call
NA12740 NA12751 NA E E no CNV call

NA12753 NA12763 NA E no CNV call
NA10847 NA12239 NA E
NA12801 NA12813 NA E no CNV call
NA10857 NA12044 NA E no CNV call
NA10846 NA12145 NA E no CNV call
NA10846 NA12145 NA no CNV call
NA12752 NA12761 NA E no CNV call
NA10839 NA12006 NA E
NA10830 NA12236 NA E E IgH
NA10860 NA11993 NA E
NA10855 NA11832 NA E E no CNV call
NA12864 NA12873 NA E
NA12878 NA12892 NA E no CNV call
NA12864 NA12873 NA E E IgL
NA07029 NA07000 NA E IgL
NA12864 NA12873 NA E IgL
NA10860 NA11993 NA E IgL
NA12878 NA12892 NA IgL
NA12753 NA12763 NA E

Proportion of Type II MIs due to somatic deletion. Based on these results, it appears that only a small fraction of any cell-line genome has undergone somatic rearrangement. Although estimating extremely small proportions can be difficult, the abundance of SNP data gave us the confidence to attempt to estimate what fraction of the genome has experienced rearrangement in the typical HapMap individual. We formulate the problem as a mixture model. In this case, the total number of Type II MIs in each population is due to a mixture of two processes: somatic rearrangement and genotyping error. If we have good estimates of the rates at which Type II MIs occur under each of these processes (θ_1 and θ_2), we may be able to estimate the extent to which each process contributes to our total data.

Our data are the counts of Type II MIs within each of N 100-SNP windows across all 30 families, tabulated separately for the CEU and YRI populations using the Phase I data. There were 315, SNP windows in CEU, 306,912 in YRI. The number of Type II MIs within each 100-SNP window is modeled as a binomial random variable k. The likelihood is a function of the mixture parameter π. The Type II MI rate due to genotyping error, θ_2, and the Type II MI rate due to somatic deletions, θ_1, are estimated as follows. θ_2 is an 11-element vector, created by tabulating the frequency of Type II MIs for each combination of center and platform. To estimate θ_1, we simulate cell-line deletions spanning each 100-SNP window and record the proportion of Type II MIs, averaging across all windows and all families. The object of our inference is the true value of π, the proportion of 100-SNP windows overlapping regions of somatic rearrangement. After a first-pass analysis over the entire range of π, the likelihood function was evaluated over a grid of 100 equally spaced points from 0.9 to 1.0; the results are shown in LTA Figure 5.
It is interesting to note that despite a lower Type II error rate in CEU, the results of this analysis suggest that a greater percentage of CEU genomes are contained in artifacts when compared to YRI genomes.
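A two-component binomial mixture of the kind described in this section can be evaluated on a grid as sketched below. This is a hedged illustration, not the likelihood used in the supplement: the parametrization (a weight `w_somatic` on the somatic component), the use of a single scalar error rate in place of the 11-element vector, the function names, and the toy data are all assumptions made here.

```python
from math import comb, log


def binom_pmf(k, n, theta):
    """P(X = k) for X ~ Binomial(n, theta)."""
    return comb(n, k) * theta**k * (1.0 - theta)**(n - k)


def log_likelihood(w_somatic, counts, n, theta_somatic, theta_error):
    """Log-likelihood of per-window Type II MI counts under a two-component mixture.

    w_somatic is the assumed weight of the somatic-deletion component;
    the genotyping-error component has weight 1 - w_somatic.
    """
    total = 0.0
    for k in counts:
        mix = (w_somatic * binom_pmf(k, n, theta_somatic)
               + (1.0 - w_somatic) * binom_pmf(k, n, theta_error))
        total += log(mix)
    return total


def grid_mle(counts, n, theta_somatic, theta_error, grid):
    """Grid value of w_somatic with the highest log-likelihood."""
    return max(grid, key=lambda w: log_likelihood(w, counts, n, theta_somatic, theta_error))


# Toy data: 1000 windows of 100 SNPs, ten of which carry five Type II MIs.
counts = [0] * 990 + [5] * 10
grid = [i / 1000 for i in range(1, 101)]
best = grid_mle(counts, 100, 0.05, 5e-4, grid)
print(best)
```

A coarse first pass over the whole unit interval followed by a finer grid near the maximum, as described in the text, would reuse `grid_mle` with two different grids.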

LTA Figure 5: Mixture model analysis. Plot of log L(π) evaluated over a grid of values; the maximum likelihood estimate for π is indicated with a vertical line for CEU ( , red) and YRI ( , blue). The scale of the y-axis shows the value of the likelihood function evaluated for CEU data; YRI values shown are L+10,000 for display purposes.

Caveats. The results of this work and previous work suggest that large somatic deletions in the trio parents of the HapMap data should be easy to detect using patterns of SNP failures, when the deletion occurs in a region of the genome that harbors little or no population copy number variation. It will be a much more difficult problem to detect cell-line artifacts in regions where there is already substantial germline copy number variation. Unfortunately, such regions may be the most inclined to experience somatic rearrangement if CNV frequency is related to the underlying mutation rate (Lam and Jeffreys 2006). At high deletion frequencies (perhaps > 40%), SNP genotyping algorithms sometimes cluster CNV status instead of SNP genotype status. During this analysis, it became clear that Type II MIs were sometimes created when two hemizygous parents (called as SNP homozygotes) give rise to a homozygous null child (erroneously called a SNP heterozygote).

As mentioned in the introduction, there are other phenomena that could create Type II MIs that are not captured in our simple model. A homozygous tract of > 100Mb on 1q in a CEU individual was first thought to be a large cell-line deletion by Conrad et al. (2006), but subsequent analysis revealed it to be a case of uniparental isodisomy (UPD). Such events could occur by mitotic recombination, and would produce a pattern of SNP failures identical to that of a somatic deletion. In cases where large somatic deletions are predicted on the basis of SNP data but are not detected in the underlying intensity data, UPD is one possible explanation. Another possible explanation for discordant results between the SNP-based method and the array-based methods used in the current paper is the use of different lots of cells at different points in time. Cell lines that are mosaic for artifacts should also behave unpredictably when typed on various platforms at various points in time. Finally, several sample mix-ups were identified from unusual clusters of MIs in an earlier analysis of the Phase I data (Conrad et al. 2006). Unresolved sample mix-ups could continue to contribute to the signal we are detecting here.

References

Benjamini, Y. and Y. Hochberg. 1995. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society 57.

Conrad, D.F., T.D. Andrews, N.P. Carter, M.E. Hurles, and J.K. Pritchard. 2006. A high-resolution survey of deletion polymorphism in the human genome. Nat Genet 38.

International HapMap Consortium. 2005. A haplotype map of the human genome. Nature 437.

Lam, K.W. and A.J. Jeffreys. 2006. Processes of copy-number change in human DNA: the dynamics of α-globin gene deletion. Proc Natl Acad Sci U S A 103.

McCarroll, S.A., T.N. Hadnott, G.H. Perry, P.C. Sabeti, M.C. Zody, J.C. Barrett, S. Dallaire, S.B. Gabriel, C. Lee, M.J. Daly, and D.M. Altshuler. 2006. Common deletion polymorphisms in the human genome. Nat Genet 38.
