LTA Analysis of HapMap Genotype Data


Introduction. This supplement to "Global variation in copy number in the human genome", by Redon et al., describes the details of the LTA analysis used to screen HapMap CNV calls for somatic artifacts on the basis of the HapMap genotype data (as presented in Supplementary Table 5). Additional analyses of the HapMap data based on the LTA approach are also described in this supplement.

It is generally accepted that an analysis of high-quality SNP genotype data from related individuals can yield information about the location of segregating germline deletions. Recently, methods have been developed that exploit unusual features of such data (i.e., Mendelian inheritance errors (MIs), departures from Hardy-Weinberg equilibrium, and patterns of uncalled genotypes). These methods have been successfully applied to the high-density genomic SNP data generated by the HapMap project to detect experimentally validated deletions (Conrad et al. 2006; McCarroll et al. 2006). In principle, it should be possible to follow a similar tactic to identify deletions that occur in the soma, for example in cell lines generated from somatic tissue. We set out to explore the feasibility of such an approach, develop a method (if possible), and apply it to the Consortium CNV data to detect potential cell-line artifacts.

The data used for these analyses are the Consortium data (both 500K EA and WGTP calls) and the Phase I HapMap data (release 20, http://www.hapmap.org) for all autosomes (International HapMap Consortium 2005). We make limited use of the Phase II data when trying to establish which CNV calls made in the present study are likely to be somatic. Filtering of the HapMap data was done as described in (Conrad et al. 2006).

Outline of the LTA Method. The pattern that we wish to exploit comes from a deletion occurring in the cell line of a previously copy-number normal trio member, at an autosomal locus where the other two trio members have normal (n=2) copy number. Specifically, we are trying to find clusters of SNPs at which an allele transmitted from parent to child has subsequently been deleted in the parent. We call this approach "Loss of Transmitted Allele" analysis, or LTA for short. This method will not work if the cell line artifact is a duplication, or if the deletion occurs within a person who was only haploid for the segment to begin with.

The method is based on the prediction that a SNP genotyping experiment will call a deletion hemizygote as homozygous for the allele that is present. When an allele that is present only once in all four parental gametes is deleted in the parent after transmission to a child, that information is erased and it appears that the child has inherited a de novo mutation. SNPs typed across a cell line deletion would hence be enriched for Mendelian inheritance (MI) errors (e.g., LTA Figure 1).

LTA Figure 1: Example of trio genotype configurations used in this analysis.

Proof of Principle. We extended the model described in (Conrad et al. 2006) to include genotype configurations that are informative of a somatic deletion, leading to 8 general trio configurations. To gain a better understanding of the patterns of trio genotypes generated by a somatic deletion, we scored the Phase I HapMap SNP genotypes underlying a known cell-line deletion in NA07055 (del(2q23-q24), greater than 15 Mb long); the results are presented in LTA Table 1. The run of Type II MIs in the Phase I data caused by this artifact is apparent upon visual inspection (LTA Figure 2).
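The genotype-calling prediction behind the LTA method can be illustrated with a minimal sketch. It assumes a biallelic SNP with alleles labeled "A" and "B" and a perfect assay that calls a hemizygote as homozygous for the remaining allele. The function names `called_genotype` and `is_mendelian_consistent` are hypothetical and are not taken from the original analysis.

```python
from itertools import product

def called_genotype(alleles, deleted_index=None):
    """Genotype a SNP assay would report, as a sorted two-letter string.

    alleles: the two true alleles carried by the individual.
    deleted_index: index (0 or 1) of the allele lost to a deletion, if any.
    A hemizygote is reported as homozygous for the allele that remains.
    """
    if deleted_index is None:
        return "".join(sorted(alleles))
    remaining = alleles[1 - deleted_index]
    return remaining * 2

def is_mendelian_consistent(mother, father, child):
    """True if the child's call can be formed from one allele of each parent."""
    return any("".join(sorted(m + f)) == child for m, f in product(mother, father))

# Illustrative trio: the mother is truly A/B and transmits B; the father is A/A.
mother_true, father_true = ("A", "B"), ("A", "A")
child = called_genotype(("A", "B"))

# Before any cell-line event, the trio is consistent.
before = is_mendelian_consistent(
    called_genotype(mother_true), called_genotype(father_true), child)

# A somatic deletion in the mother's cell line removes the transmitted B allele,
# so she is called A/A and the child's B allele appears to come from nowhere.
after = is_mendelian_consistent(
    called_genotype(mother_true, deleted_index=1), called_genotype(father_true), child)

print(before, after)
```

In this toy trio the deletion turns a consistent configuration into an apparent Mendelian inheritance error, which is the kind of signal the method looks for in clusters.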

LTA Table 1: Distribution of trio genotype configurations within known cell line artifact

| Trio Class | Description | Count |
|---|---|---|
| 1 | Type I MI, deletion in mother (germline or somatic) | 54 |
| 2 | Type I MI, deletion in father (germline or somatic) | 0 |
| 3 | Type I MI, deletion in either parent (somatic only) | 28 |
| 4 | Uninformative | 3601 |
| 5 | Compatible with somatic deletion in mother | 1351 |
| 6 | Compatible with somatic deletion in father | 546 |
| 7 | Incompatible with somatic deletion in either parent | 312 |
| 8 | Compatible with somatic deletion in either parent | 532 |

LTA Figure 2: Pattern of Type II MIs created by somatic deletion. The physical position (NCBI34) of each Type II MI in the Phase I HapMap data from chromosome 2 is plotted for all 30 CEU families. The run of MIs caused by a somatic event starting near 150 Mb is quite apparent. Data are from release 16c.1.

Several important points stand out. First, there is a strong, specific signal in the pattern of MIs from the family of this individual. Individual NA07055 is the mother of the trio. All MIs are either Type II or Type I but from the mother only. The rate of MIs in this stretch is 0.01, a rate twenty-five times above the genome average. Second, within the class of somatic-deletion-compatible trio configurations, configurations specific to maternal deletions outnumber paternal configurations by roughly 2.5:1 (1351 versus 546 in LTA Table 1). However, the presence of paternal-specific and incompatible configurations indicates that there is some noise in the data and that our modeling assumptions do break down at points. This phenomenon has been noted in prior work, and can, for instance, lead to the erroneous splitting of a single deletion-compatible stretch into multiple smaller stretches.

Screening HapMap data for artifacts. Two major questions can be addressed with this modeling approach. First, how many CNVs called in the present study are not true germline deletions but artifacts of the cell culture process? Second, how many cell line events have occurred within the HapMap as a whole?

Consortium Data (500K EA and WGTP). Screening the HapMap CNV calls essentially amounts to a problem of multiple testing, where for each CNV we test the null hypothesis "This is a germline CNV" based on the trio genotypes at the underlying N HapMap SNPs. We assume that the number of Type II MIs in a set of genotypes from a trio is a binomial random variable, and that the genotyping experiment on a given SNP is independent of experiments at all other SNPs (an assumption not likely to hold for some Perlegen SNPs and instances of pooling). We estimate $\theta_c$, the rate of Type II MIs, across the entire HapMap, for each combination of genotyping platform and genotyping center, a total of 11 rates. If we see x Type II MIs underlying a called CNV, we calculate the probability that the run of MIs is due to random genotyping error as the probability of observing x or more Type II MIs in a stretch of N trio genotypes,

$$P(X \ge x) = \sum_{k=x}^{N} \binom{N}{k} \theta^{k} (1-\theta)^{N-k},$$

where $\theta$ is taken to be the arithmetic mean of the expected Type II rate for all SNPs within the CNV. 500K EA and WGTP CNVs are assessed separately, as are CNVs in each (CEU, YRI) population.

Results. The data for this analysis were all preliminary deletion calls in the CEU and YRI parents (3252 from the WGTP, 1506 from the 500K EA). The procedure for deciding which regions were significant was somewhat convoluted, but its ultimate goal was to create a conservative set of somatic artifact CNVs (leaning towards over-calling). First, all CNV calls with no Type II MIs were removed from the list (2748/3252, WGTP; 1392/1506, 500K EA). Notably, only 3.2% (153/4758) of all calls overlapped 2 or more Type II MIs. A p-value was then calculated for each CNV as described above.

During this analysis, it became clear that Type II MIs were sometimes created when two hemizygous parents (each called as a SNP homozygote) give rise to a homozygous null child (erroneously called a heterozygote). Our explanation of this phenomenon is that at high deletion frequencies (perhaps > 40%), SNP genotyping algorithms often cluster CNV status instead of SNP genotype status. To avoid such false positives, we removed CNVs from regions with more than 2 "deletion" CNVs segregating in the same population and detected by the same platform. Other, similar frequency thresholds were tested but did not substantially change the number of artifact calls (data not shown). We note that identifying somatic variants at loci with common germline variation is a challenging problem for any approach using only cell line DNA (see Caveats section).

CNVs from the remaining set were deemed significant if they had a significant Bonferroni-corrected p-value, corrected against the number of CNVs left within the same platform-population group (e.g., CEU-WGTP CNVs). In total, we made 16 calls, with uncorrected p-values ranging from 0.002 to $3.3 \times 10^{-189}$ (presented in Supplementary Table 5).
In a separate analysis of offspring genotypes, one additional CNV was called as a cell-line artifact on the basis of the SNP failure pattern. Strictly speaking, it is not possible to distinguish a de novo mutation from a somatic event in this case without additional biological material from the donor (or perhaps DNA from such a person's offspring).
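The per-CNV test described in this section can be sketched as an upper-tail binomial probability followed by a Bonferroni comparison. This is a minimal illustration and not the authors' code: the function names, the example counts, and the error rate of 4e-4 are hypothetical, and the tail is summed in log space only so that large N does not overflow.

```python
from math import exp, lgamma, log, log1p

def binom_tail(x, n, theta):
    """P(X >= x) for X ~ Binomial(n, theta), with 0 < theta < 1."""
    def log_pmf(k):
        # log of C(n, k) * theta^k * (1 - theta)^(n - k)
        return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                + k * log(theta) + (n - k) * log1p(-theta))
    return sum(exp(log_pmf(k)) for k in range(x, n + 1))

def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values below alpha divided by the number of tests in the group."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Hypothetical group of three CNVs: (Type II MIs observed, SNPs typed, mean error rate).
cnvs = [(5, 250, 4e-4), (1, 40, 4e-4), (2, 600, 4e-4)]
p_values = [binom_tail(x, n, theta) for x, n, theta in cnvs]
print(p_values)
print(bonferroni_significant(p_values))
```

Here `alpha=0.05` is an assumed family-wise level; the source states only that a Bonferroni correction was applied within each platform-population group.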

Power. The power to detect somatic events in the HapMap data is a function of the number of SNPs that are typed within the event and the allele frequencies at those SNPs. One possible limitation to the power of our analysis is the relationship between SNP density and the location of CNVs. Analyses by our group and others suggest that SNP density is substantially lower near CNV regions. To assess the impact of SNP density, we conducted a power simulation using the complete set of 4758 parental deletions. For each CNV locus, we simulated a somatic event in each parent of each of the other 29 families, and recorded the number of times the event would have been detected using the same thresholds of significance as our initial screen (YRI, p < 0.003; CEU, p < 0.0021). As expected, the mean power to detect WGTP somatic CNVs (YRI, 0.61; CEU, 0.48) was much greater than the power to detect somatic CNVs identified with the 500K EA (YRI, 0.23; CEU, 0.15). Results are shown in LTA Figure 3. The WGTP numbers are likely to be somewhat inflated, as the size of each event detected on this platform will be over-estimated on average. Although power is low in many regions, the total number of artifacts called is not likely to be off by more than a factor of 5 based on this analysis.

LTA Figure 3: Power to detect artifacts in consortium data. The relationship between CNV size and power to reject the hypothesis that the CNV is a germline event using Phase II HapMap data. Red points indicate CNVs detected with the WGTP; blue points, CNVs detected with the 500K EA.
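The dependence of power on the number of typed SNPs can be illustrated with a simplified analytic sketch. It is not the simulation described above, which re-used the real genotypes of the other families. It assumes instead that every SNP inside a somatic deletion independently produces a Type II MI at a single rate `theta1`. The values of `theta0` and `theta1` and the function names are hypothetical.

```python
from math import comb

def binom_tail(x, n, theta):
    """P(X >= x) for X ~ Binomial(n, theta); intended for modest n."""
    return sum(comb(n, k) * theta**k * (1 - theta)**(n - k) for k in range(x, n + 1))

def critical_count(n, theta0, alpha):
    """Smallest number of Type II MIs whose tail probability under theta0 is below alpha."""
    for x in range(n + 2):
        if binom_tail(x, n, theta0) < alpha:
            return x

def power(n, theta0, theta1, alpha):
    """Probability of reaching the critical count when MIs arise at rate theta1."""
    return binom_tail(critical_count(n, theta0, alpha), n, theta1)

theta0 = 0.0005   # hypothetical background Type II MI rate
theta1 = 0.03     # hypothetical Type II MI rate inside a somatic deletion
for n_snps in (10, 50, 200):
    print(n_snps, round(power(n_snps, theta0, theta1, alpha=0.003), 3))
```

Under these assumptions power rises with the number of SNPs in the event, which is the qualitative behavior the paragraph above attributes to SNP density.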

HapMap Full-Genome Screen. There has been some speculation that the CEU cell lines, which were collected many years ago, may have undergone more somatic rearrangement than the cell lines collected for the other HapMap samples. The total rate of Type II MIs in the Phase I HapMap CEU genotype data, $3.9 \times 10^{-4}$, is slightly lower than the rate in YRI, $5.0 \times 10^{-4}$. This is in contrast to the rate of Type I MIs, which should be created by both somatic and germline deletions, where the rate in YRI is twice the rate in CEU ($1.1 \times 10^{-3}$ and $5.3 \times 10^{-4}$, respectively).

As a second approach to identifying cell-line deletions, we disregard the location of CNVs detected by the Consortium and simply sift through the Phase I HapMap data using a sliding-window approach. The choice of window size involves a trade-off between resolution and power; after running power analyses with different window sizes, we settled on a 100-SNP window (LTA Figure 4). The median size of the 100-SNP window (264 kb) falls in the middle of the range of CNV events detected by the Consortium (mean over platforms 249 kb, median 165 kb). The median power is higher in YRI (0.61) than in CEU (0.5). For each family, we split the genome into non-overlapping 100-SNP windows and calculate the probability that all Type II MIs within each window are due to random error. The resulting p-values from CEU and YRI were ranked separately, and thresholds of significance were determined that control the false discovery rate (FDR) at 0.05 using the method of (Benjamini and Hochberg 1995). Based on this analysis, we retained 65 windows with p < 0.000216 from CEU and 34 windows with p < 0.000117 in YRI.
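The window screen can be sketched as three steps: count Type II MIs in non-overlapping 100-SNP windows, compute a binomial tail probability per window, and choose a cutoff with the Benjamini-Hochberg step-up rule. The sketch is illustrative only. The per-SNP flags are made up, a single error rate is used for all windows, and the function names are hypothetical. The rate 0.00165008 is taken from the first row of LTA Table 2, for which 4 Type II MIs in one window give a tail probability close to the tabulated 2.56E-05.

```python
from math import comb

WINDOW = 100  # SNPs per window, as in the text

def window_counts(mi_flags, size=WINDOW):
    """Type II MI counts in consecutive non-overlapping windows.

    mi_flags holds 1 for a SNP with a Type II MI and 0 otherwise;
    a trailing partial window is dropped.
    """
    return [sum(mi_flags[i:i + size])
            for i in range(0, len(mi_flags) - size + 1, size)]

def tail_p(x, theta, n=WINDOW):
    """P(X >= x) for X ~ Binomial(n, theta)."""
    return sum(comb(n, k) * theta**k * (1 - theta)**(n - k) for k in range(x, n + 1))

def bh_threshold(p_values, q=0.05):
    """Largest p-value rejected by the Benjamini-Hochberg procedure, or None."""
    m = len(p_values)
    passing = [p for rank, p in enumerate(sorted(p_values), start=1)
               if p <= rank * q / m]
    return max(passing) if passing else None

# Made-up flags for one family: four full windows plus a partial one.
flags = [0] * 100 + [1, 1, 1, 1] + [0] * 96 + [0] * 100 + [1] + [0] * 99 + [0] * 50
counts = window_counts(flags)
p_values = [tail_p(c, theta=0.00165008) for c in counts]
cutoff = bh_threshold(p_values)
print(counts, p_values, cutoff)
```

Windows with a p-value at or below the returned cutoff would be retained. The actual analysis ranked p-values across all families within a population rather than within one family.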

LTA Figure 4: Power of genome screen. Histograms of estimated power for each 100-SNP window within all CEU and YRI families, estimated using the Phase I HapMap data.

The results of this analysis are presented in LTA Tables 2 and 3. (These results are from a different analysis than the results presented in Supplementary Table 5.) In these tables, we highlight the correspondence between unusual regions identified in the genome-wide screen, preliminary CNV calls removed on the basis of the Phase II data, and genomic regions where no CNVs were detected by either platform. Although we average the expected rate of Type II MIs across the entire window, it is possible that a deletion substantially smaller than the window size can lead to a significant test. Defining the positions of the outermost Type II MIs in a window as the breakpoints of a putative deletion gives a median length of 78.6 kb for CEU somatic events, with YRI events slightly larger at 86 kb; the median number of Type II MIs involved in an event was 3 in each population. Interestingly, 6/65 CEU windows overlap immunoglobulin loci, while 8/65 overlap the del(2q23-q24) artifact detected in NA07055 (described above).

Notably, 4/6 of the most significant somatic artifacts called in the preliminary CNV calls are detected in this screen. The two that are not detected are on chromosomes 15 and 19; the center responsible for typing the bulk of Phase I SNPs for these chromosomes filtered Type II MIs from their data prior to release. Presumably, a next-generation genome screen using Phase II data will pick up these unusual features. Unless otherwise noted, these unusual regions from the genome-wide screen overlap loci of (often considerable) CNV polymorphism detected in the present study. Although we suspect that many of the significant windows in these regions may simply be an artifact of systematic genotyping error in the presence of high-frequency copy number variation (see Caveats section), there remains the possibility that some may represent true somatic events. Other methods will be required to unravel such complexity.

LTA Table 2: List of 34 regions of unusual Type II MI clusters in YRI Phase I HapMap data. Expected Type II MI rate is the expected rate of Type II MIs across the region, given a model of random genotype error and conditional on the platform and center used to type the SNPs. The numbers of Type I and Type II MIs are broken down for each cluster. "Removed from prelim": the CNV was removed from the preliminary CNV calls before downstream analysis; "no CNV call": no CNV was called on either platform in any population at this locus.

| Child ID | Mother ID | Father ID | Chr | Start | Stop | P-value | Expected Type II MI rate | Type I MI (mother) | Type I MI (father) | Type II MI | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NA19132 | NA19131 | NA19130 | 1 | 2271746 | 2506638 | 2.56E-05 | 0.00165008 | 0 | 0 | 4 | |
| NA19211 | NA19209 | NA19210 | 1 | 16684633 | 16762981 | 2.88E-05 | 0.00170027 | 0 | 1 | 4 | |
| NA19173 | NA19172 | NA19171 | 1 | 72479773 | 72500711 | 2.13E-05 | 0.0015729 | 0 | 0 | 4 | |
| NA19142 | NA19140 | NA19141 | 1 | 111091292 | 111099706 | 1.96E-05 | 0.00153944 | 0 | 0 | 4 | no CNV call |
| NA18863 | NA18861 | NA18862 | 1 | 149574444 | 149579554 | 2.23E-05 | 0.00159247 | 0 | 0 | 4 | |
| NA19142 | NA19140 | NA19141 | 1 | 193469848 | 193567155 | 2.36E-05 | 0.00161567 | 1 | 0 | 4 | |
| NA19100 | NA19099 | NA19098 | 1 | 193502780 | 193706732 | 2.36E-05 | 0.00161567 | 0 | 0 | 4 | |
| NA18863 | NA18861 | NA18862 | 2 | 56196069 | 56243026 | 3.53E-05 | 8.47E-05 | 0 | 0 | 2 | |
| NA19205 | NA19204 | NA19203 | 2 | 152607691 | 152613692 | 1.20E-05 | 4.94E-05 | 0 | 1 | 2 | Removed from prelim |
| NA19205 | NA19204 | NA19203 | 2 | 153234852 | 153408561 | 8.44E-13 | 0.00010248 | 0 | 3 | 5 | Removed from prelim |
| NA19240 | NA19238 | NA19239 | 2 | 195960304 | 196013260 | 2.97E-05 | 7.77E-05 | 0 | 0 | 2 | no CNV call |
| NA18860 | NA18858 | NA18859 | 3 | 164007292 | 164054574 | 1.27E-06 | 0.000199612 | 0 | 0 | 3 | |
| NA19145 | NA19143 | NA19144 | 4 | 133123383 | 133211475 | 1.56E-05 | 0.00046427 | 0 | 0 | 3 | |
| NA18863 | NA18861 | NA18862 | 4 | 145112687 | 145362154 | 1.56E-05 | 0.00046374 | 0 | 0 | 3 | |
| NA19194 | NA19193 | NA19192 | 5 | 170969171 | 170969444 | 1.97E-05 | 6.32E-05 | 0 | 0 | 2 | no CNV call |
| NA18872 | NA18870 | NA18871 | 6 | 26760120 | 27003543 | 2.79E-05 | 0.00168717 | 0 | 1 | 4 | |
| NA19103 | NA19102 | NA19101 | 6 | 34142419 | 34321274 | 2.63E-05 | 0.00166081 | 0 | 1 | 4 | no CNV call |
| NA19161 | NA19159 | NA19160 | 7 | 75523985 | 75859846 | 1.94E-05 | 0.00049894 | 0 | 0 | 3 | |
| NA18863 | NA18861 | NA18862 | 7 | 75970283 | 76218945 | 2.02E-07 | 0.000481 | 0 | 0 | 4 | |
| NA18863 | NA18861 | NA18862 | 7 | 89470259 | 89488957 | 1.64E-05 | 0.000471377 | 0 | 0 | 3 | |
| NA18863 | NA18861 | NA18862 | 7 | 126180421 | 126195740 | 1.77E-05 | 0.0004839 | 0 | 0 | 3 | |
| NA18863 | NA18861 | NA18862 | 7 | 126439398 | 126443830 | 1.79E-05 | 0.0004859 | 0 | 0 | 3 | |
| NA18854 | NA18852 | NA18853 | 8 | 12356911 | 12474620 | 1.31E-06 | 0.000201681 | 0 | 0 | 3 | |
| NA19120 | NA19116 | NA19119 | 8 | 71430183 | 71470666 | 1.30E-05 | 5.13E-05 | 0 | 0 | 2 | no CNV call |
| NA19100 | NA19099 | NA19098 | 8 | 124948189 | 125032627 | 1.69E-05 | 5.85E-05 | 0 | 0 | 2 | no CNV call |
| NA19205 | NA19204 | NA19203 | 8 | 143991808 | 143993541 | 9.80E-05 | 0.00014138 | 0 | 0 | 2 | no CNV call |
| NA18506 | NA18508 | NA18507 | 10 | 22838757 | 22973453 | 1.72E-05 | 0.00148909 | 0 | 0 | 4 | no CNV call |
| NA19132 | NA19131 | NA19130 | 12 | 6915708 | 7133345 | 1.21E-05 | 0.00135925 | 0 | 0 | 4 | no CNV call |
| NA19103 | NA19102 | NA19101 | 12 | 36292899 | 36540509 | 1.97E-07 | 0.00123551 | 0 | 0 | 5 | |
| NA19221 | NA19222 | NA19223 | 13 | 18893347 | 18989263 | 2.78E-05 | 0.0016848 | 0 | 0 | 4 | no CNV call |
| NA18857 | NA18855 | NA18856 | 13 | 113839930 | 113891307 | 2.99E-05 | 0.001717 | 1 | 0 | 4 | no CNV call |
| NA19173 | NA19172 | NA19171 | 14 | 105392635 | 105831373 | 2.88E-05 | 7.64E-05 | 1 | 0 | 2 | IgH |
| NA19211 | NA19209 | NA19210 | 17 | 39609781 | 39610053 | 9.46E-05 | 0.000138842 | 0 | 0 | 2 | no CNV call |
| NA19154 | NA19152 | NA19153 | 22 | 21348800 | 21513685 | 3.10E-08 | 5.77E-05 | 1 | 0 | 3 | IgL |

LTA Table 3: List of 65 regions of unusual Type II MI clusters in CEU Phase I HapMap data. Expected Type II MI rate is the expected rate of Type II MIs across the region, given a model of random genotype error and conditional on the platform and center used to type the SNPs. The numbers of Type I and Type II MIs are broken down for each cluster. "Removed from prelim": the CNV was removed from the preliminary CNV calls before downstream analysis; "no CNV call": no CNV was called on either platform in any population at this locus.

| Child ID | Mother ID | Father ID | Chr | Start | Stop | P-value | Expected Type II MI rate | Type I MI (mother) | Type I MI (father) | Type II MI | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NA12753 | NA12763 | NA12762 | 1 | 76166 | 807391 | 1.27E-05 | 0.001377 | 0 | 0 | 4 | |
| NA07029 | NA07000 | NA06994 | 1 | 4529766 | 4714715 | 7.33E-06 | 0.001196 | 0 | 0 | 4 | no CNV call |
| NA10839 | NA12006 | NA12005 | 1 | 16669895 | 16981451 | 1.24E-05 | 0.001369 | 0 | 0 | 4 | |
| NA12801 | NA12813 | NA12812 | 1 | 16689923 | 16981451 | 1.24E-05 | 0.001369 | 1 | 0 | 4 | |
| NA10857 | NA12044 | NA12043 | 1 | 116194079 | 116258701 | 9.53E-06 | 0.00128 | 0 | 0 | 4 | |
| NA12753 | NA12763 | NA12762 | 1 | 144367112 | 144507262 | 8.52E-06 | 0.001243 | 0 | 0 | 4 | no CNV call |
| NA12707 | NA12717 | NA12716 | 1 | 145895756 | 145974405 | 3.65E-07 | 0.001402 | 0 | 0 | 5 | |
| NA12707 | NA12717 | NA12716 | 1 | 147491983 | 147677237 | 1.17E-10 | 0.001349 | 0 | 0 | 7 | no CNV call |
| NA07348 | NA07345 | NA07357 | 1 | 149164560 | 149399723 | 3.73E-09 | 0.00123 | 1 | 0 | 6 | |
| NA06991 | NA06985 | NA06993 | 1 | 149374481 | 149399723 | 1.92E-07 | 0.00123 | 1 | 0 | 5 | |
| NA07029 | NA07000 | NA06994 | 1 | 160093751 | 160147513 | 8.82E-06 | 0.001254 | 1 | 0 | 4 | no CNV call |
| NA12752 | NA12761 | NA12760 | 1 | 165681682 | 165859249 | 5.93E-06 | 0.001133 | 0 | 0 | 4 | no CNV call |
| NA10839 | NA12006 | NA12005 | 1 | 235089125 | 235157125 | 7.04E-06 | 0.001184 | 0 | 0 | 4 | |
| NA10835 | NA12249 | NA12248 | 2 | 74372340 | 74601694 | 2.07E-05 | 6.49E-05 | 0 | 0 | 2 | |
| NA10839 | NA12006 | NA12005 | 2 | 97628640 | 97716175 | 1.20E-05 | 4.94E-05 | 0 | 0 | 2 | |
| NA07048 | NA07055 | NA07034 | 2 | 151153215 | 151159969 | 1.41E-08 | 4.44E-05 | 1 | 0 | 3 | 2q23.1-q24.3 deletion |
| NA07048 | NA07055 | NA07034 | 2 | 151699069 | 151781189 | 1.29E-10 | 7.58E-05 | 0 | 0 | 4 | 2q23.1-q24.3 deletion |
| NA07048 | NA07055 | NA07034 | 2 | 151979433 | 151997267 | 9.92E-06 | 4.48E-05 | 0 | 0 | 2 | 2q23.1-q24.3 deletion |
| NA07048 | NA07055 | NA07034 | 2 | 155308445 | 155428988 | 3.68E-08 | 6.11E-05 | 3 | 0 | 3 | 2q23.1-q24.3 deletion |
| NA07048 | NA07055 | NA07034 | 2 | 158117047 | 158188594 | 1.28E-08 | 4.29E-05 | 0 | 0 | 3 | 2q23.1-q24.3 deletion |
| NA07048 | NA07055 | NA07034 | 2 | 160806539 | 160812068 | 3.16E-05 | 8.02E-05 | 2 | 0 | 2 | 2q23.1-q24.3 deletion |
| NA07048 | NA07055 | NA07034 | 2 | 161739675 | 161873088 | 2.54E-08 | 5.40E-05 | 0 | 0 | 3 | 2q23.1-q24.3 deletion |
| NA07048 | NA07055 | NA07034 | 2 | 165192099 | 165242741 | 9.23E-06 | 4.33E-05 | 3 | 0 | 2 | 2q23.1-q24.3 deletion |
| NA12753 | NA12763 | NA12762 | 3 | 131342780 | 131407107 | 3.02E-06 | 0.000267 | 0 | 1 | 3 | |
| NA12865 | NA12875 | NA12874 | 4 | 46936641 | 46990113 | 1.57E-05 | 5.64E-05 | 0 | 0 | 2 | no CNV call |
| NA07348 | NA07345 | NA07357 | 4 | 190853064 | 190907148 | 9.68E-06 | 0.000395 | 1 | 0 | 3 | Removed from prelim |
| NA07348 | NA07345 | NA07357 | 4 | 191032772 | 191046861 | 1.00E-05 | 0.0004 | 0 | 0 | 3 | Removed from prelim |
| NA12864 | NA12873 | NA12872 | 6 | 1381797 | 1565626 | 9.96E-06 | 0.001294 | 0 | 0 | 4 | no CNV call |
| NA12865 | NA12875 | NA12874 | 6 | 29963924 | 29993593 | 9.07E-06 | 0.001264 | 0 | 0 | 4 | |
| NA12801 | NA12813 | NA12812 | 6 | 52310351 | 52474721 | 0.000183 | 0.00107 | 0 | 0 | 3 | no CNV call |
| NA07019 | NA07056 | NA07022 | 6 | 52355078 | 52508784 | 0.000183 | 0.00107 | 0 | 0 | 3 | no CNV call |
| NA10846 | NA12145 | NA12144 | 6 | 79024288 | 79089088 | 1.27E-22 | 0.001258 | 0 | 1 | 13 | |
| NA12740 | NA12751 | NA12750 | 6 | 103845442 | 103868154 | 4.68E-11 | 0.001182 | 0 | 0 | 7 | |
| NA12802 | NA12815 | NA12814 | 6 | 105599420 | 105698258 | 1.05E-05 | 0.001311 | 0 | 0 | 4 | |
| NA07048 | NA07055 | NA07034 | 7 | 38091904 | 38124908 | 4.66E-06 | 0.000309 | 0 | 3 | 3 | Removed from prelim |
| NA10854 | NA11840 | NA11839 | 7 | 65967542 | 66338547 | 7.91E-08 | 0.00038 | 0 | 0 | 4 | |
| NA10860 | NA11993 | NA11992 | 8 | 26818151 | 26878210 | 1.87E-06 | 0.000227 | 0 | 0 | 3 | no CNV call |
| NA10839 | NA12006 | NA12005 | 8 | 39393310 | 39451477 | 7.02E-11 | 0.000249 | 0 | 0 | 5 | |
| NA12740 | NA12751 | NA12750 | 8 | 50768163 | 50768208 | 4.18E-05 | 9.21E-05 | 0 | 0 | 2 | |
| NA12753 | NA12763 | NA12762 | 8 | 75852017 | 75916088 | 2.79E-05 | 7.53E-05 | 0 | 1 | 2 | no CNV call |
| NA10851 | NA12057 | NA12056 | 9 | 428286 | 439798 | 2.70E-05 | 7.41E-05 | 0 | 0 | 2 | no CNV call |
| NA12752 | NA12761 | NA12760 | 9 | 1523064 | 1569834 | 3.54E-05 | 8.48E-05 | 0 | 0 | 2 | no CNV call |
| NA07019 | NA07056 | NA07022 | 9 | 3385848 | 3434392 | 2.53E-05 | 7.17E-05 | 0 | 0 | 2 | no CNV call |
| NA07348 | NA07345 | NA07357 | 9 | 7364808 | 7451862 | 3.15E-05 | 8.00E-05 | 0 | 1 | 2 | Removed from prelim |
| NA12752 | NA12761 | NA12760 | 9 | 22237395 | 22261287 | 2.62E-05 | 7.29E-05 | 0 | 0 | 2 | no CNV call |
| NA12740 | NA12751 | NA12750 | 9 | 118450398 | 118454125 | 2.88E-05 | 7.65E-05 | 0 | 0 | 2 | no CNV call |
| NA12753 | NA12763 | NA12762 | 10 | 13590878 | 13700863 | 8.94E-06 | 0.001259 | 0 | 0 | 4 | no CNV call |
| NA10847 | NA12239 | NA12146 | 10 | 88877798 | 89007800 | 3.63E-07 | 0.001401 | 0 | 1 | 5 | |
| NA12801 | NA12813 | NA12812 | 12 | 39018021 | 39030055 | 2.21E-05 | 0.000522 | 0 | 0 | 3 | no CNV call |
| NA10857 | NA12044 | NA12043 | 12 | 42909711 | 42988771 | 8.55E-05 | 0.000825 | 0 | 0 | 3 | no CNV call |
| NA10846 | NA12145 | NA12144 | 12 | 87370554 | 87400866 | 6.24E-05 | 0.000741 | 0 | 0 | 3 | no CNV call |
| NA10846 | NA12145 | NA12144 | 12 | 121220732 | 121337383 | 0.000102 | 0.000876 | 0 | 1 | 3 | no CNV call |
| NA12752 | NA12761 | NA12760 | 13 | 28868010 | 29052743 | 8.43E-06 | 0.00124 | 0 | 0 | 4 | no CNV call |
| NA10839 | NA12006 | NA12005 | 13 | 113253980 | 113509398 | 1.12E-05 | 0.001334 | 0 | 0 | 4 | |
| NA10830 | NA12236 | NA12154 | 14 | 105377886 | 105382838 | 3.79E-05 | 8.78E-05 | 1 | 1 | 2 | IgH |
| NA10860 | NA11993 | NA11992 | 16 | 20162598 | 20460369 | 0.000189 | 1.90E-06 | 0 | 0 | 1 | |
| NA10855 | NA11832 | NA11831 | 19 | 50103781 | 50252903 | 4.09E-05 | 9.11E-05 | 0 | 0 | 2 | no CNV call |
| NA12864 | NA12873 | NA12872 | 20 | 44373357 | 44520617 | 9.22E-06 | 0.001269 | 0 | 1 | 4 | |
| NA12878 | NA12892 | NA12891 | 20 | 54495971 | 54584835 | 9.68E-06 | 0.001285 | 0 | 0 | 4 | no CNV call |
| NA12864 | NA12873 | NA12872 | 22 | 20799464 | 20998193 | 1.59E-24 | 9.81E-05 | 6 | 0 | 9 | IgL |
| NA07029 | NA07000 | NA06994 | 22 | 21035437 | 21042363 | 1.08E-06 | 0.000189 | 0 | 0 | 3 | IgL |
| NA12864 | NA12873 | NA12872 | 22 | 21036340 | 21199051 | 4.98E-09 | 0.000189 | 9 | 0 | 4 | IgL |
| NA10860 | NA11993 | NA11992 | 22 | 21055701 | 21087205 | 1.08E-06 | 0.000189 | 5 | 3 | 3 | IgL |
| NA12878 | NA12892 | NA12891 | 22 | 21067339 | 21194102 | 0.000175 | 0.000189 | 1 | 0 | 2 | IgL |
| NA12753 | NA12763 | NA12762 | 22 | 47462572 | 47466791 | 5.22E-05 | 0.000103 | 0 | 0 | 2 | |

Proportion of Type II MIs due to somatic deletion. Based on these results, it appears that only a small fraction of any cell-line genome has undergone somatic rearrangement. Although estimating extremely small proportions can be difficult, the abundance of SNP data gave us the confidence to attempt to estimate what fraction of the genome has experienced rearrangement in the typical HapMap individual. We formulate the problem as a mixture model. In this case, the total number of Type II MIs in each population is due to a mixture of 2 processes, somatic rearrangement and genotyping error. If we have good estimates of the rates at which Type II MIs occur under each of these processes (θ 1 and θ 2 ), we may be able to estimate the extent to which each process contributes to our total data. Our data is the count of Type II MIs within each of N 100-SNP windows across all 30 families, tabulated separately for the CEU and YRI populations using the Phase I data. There were 315,420 100-SNP windows in CEU, 306,912 in YRI. The number of Type II MIs within each 100- SNP window is modeled as a binomial random variable k. The likelihood function for the mixture parameter π is The Type II MI rate due to genotyping error, θ 2, and the Type II MI rate due to somatic deletions, θ 1, are estimated as follows. θ 2 is an 11-element vector, created by tabulating the frequency of Type II MI for each combination of center/platform. To estimate θ 1, we simulate cell line deletions spanning each 100-SNP window and record the proportion of Type II MIs, averaging across all windows and all families. The object of our inference is the true value of π, the proportion of 100-SNP windows overlapping regions of somatic rearrangement. After a first-pass analysis over the entire range of π, the likelihood function was evaluated over a grid of 100 equally spaced points from 0.9 to 1.0; the results are shown in LTA Figure 5. 
Interestingly, despite a lower Type II error rate in CEU, the results of this analysis suggest that a greater percentage of the genome is contained in artifacts in CEU than in YRI.
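The mixture analysis described above can be illustrated with a minimal sketch. It assumes a standard two-component binomial mixture in which windows are independent, so that log L(π) = Σᵢ log[π·Bin(kᵢ; n, θ₁) + (1 − π)·Bin(kᵢ; n, θ₂)]. This form, the number of trials per window n, the use of a single scalar θ₂ (the text uses an 11-element vector), the function names, and the toy data are all assumptions made for illustration; they are not taken from the original analysis.

```python
import math

def log_binom_pmf(k, n, theta):
    """Log probability of k events in n trials at rate theta (0 < theta < 1)."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_coeff + k * math.log(theta) + (n - k) * math.log1p(-theta)

def mixture_log_likelihood(pi, counts, n, theta1, theta2):
    """log L(pi) for an assumed two-component binomial mixture over windows.

    pi weights the somatic-deletion component (rate theta1);
    1 - pi weights the genotyping-error component (rate theta2).
    """
    total = 0.0
    for k in counts:
        a = math.log(pi) + log_binom_pmf(k, n, theta1) if pi > 0 else -math.inf
        b = math.log1p(-pi) + log_binom_pmf(k, n, theta2) if pi < 1 else -math.inf
        m = max(a, b)  # log-sum-exp keeps tiny probabilities from underflowing
        total += m + math.log(math.exp(a - m) + math.exp(b - m))
    return total

def grid_mle(counts, n, theta1, theta2, lo, hi, points=100):
    """Evaluate log L(pi) on an equally spaced grid and return the best point."""
    grid = [lo + (hi - lo) * i / (points - 1) for i in range(points)]
    scores = [mixture_log_likelihood(p, counts, n, theta1, theta2) for p in grid]
    best = max(range(points), key=scores.__getitem__)
    return grid[best], scores[best]

# Toy data (hypothetical): 10 of 10,000 windows carry many Type II MIs.
n_trials = 100            # assumed number of trials per window
theta1, theta2 = 0.2, 0.001  # assumed per-trial Type II MI rates
counts = [0] * 9990 + [20] * 10

# 101 points are used here so that the grid step is a round 0.0001.
pi_hat, best_ll = grid_mle(counts, n_trials, theta1, theta2, 0.0, 0.01, points=101)
print(pi_hat, best_ll)
```

Working in log space with `lgamma` avoids overflow in the binomial coefficient. A coarse first pass over the whole range of π followed by a finer grid near the maximum mirrors the two-stage evaluation described in the text.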

LTA Figure 5: Mixture model analysis. Plot of log L(π) evaluated over a grid of values; the maximum likelihood estimate for π is indicated with a vertical line for CEU (0.00033, red) and YRI (0.00015, blue). The scale of the y-axis shows the value of the likelihood function evaluated for CEU data; YRI values shown are L+10,000 for display purposes.

Caveats. The results of this work and previous work suggest that large somatic deletions in the trio parents of the HapMap data should be easy to detect using patterns of SNP failures when the deletion occurs in a region of the genome that harbors little or no population copy number variation. It will be much more difficult to detect cell line artifacts in regions where there is already substantial germline copy number variation. Unfortunately, such regions may be the most inclined to experience somatic rearrangement if CNV frequency is related to the underlying mutation rate (Lam and Jeffreys 2006). At high deletion frequencies (perhaps > 40%), SNP genotyping algorithms sometimes cluster CNV status instead of SNP genotype status. During this analysis, it became clear that Type II MIs were sometimes created when two hemizygous parents (called as SNP homozygotes) gave rise to a homozygous null child (erroneously called a SNP heterozygote).

As mentioned in the introduction, there are other phenomena that could create Type II MIs that are not captured in our simple model. A homozygous tract of > 100 Mb on 1q in a CEU individual was first thought by Conrad et al. (2006) to be a large cell line deletion, but subsequent analysis revealed it to be a case of uniparental isodisomy, a form of uniparental disomy (UPD). Such events could occur by mitotic recombination and would produce a pattern of SNP failures identical to that of a somatic deletion. In cases where large somatic deletions are predicted on the basis of SNP data but are not detected in the underlying intensity data, UPD is one possible explanation. Another possible explanation for discordant results between the SNP-based method and the array-based methods used in the current paper is the use of different lots of cells at different points in time. Cell lines that are mosaic for artifacts should also behave unpredictably when typed on various platforms at various points in time. Finally, several sample mix-ups were identified from unusual clusters of MIs in an earlier analysis of the Phase I data (Conrad et al. 2006). Unresolved sample mix-ups could continue to contribute to the signal we are detecting here.

References

Benjamini, Y. and Y. Hochberg. 1995. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society 57: 289-300.

Conrad, D.F., T.D. Andrews, N.P. Carter, M.E. Hurles, and J.K. Pritchard. 2006. A high-resolution survey of deletion polymorphism in the human genome. Nat Genet 38: 75-81.

International HapMap Consortium. 2005. A haplotype map of the human genome. Nature 437: 1299-1320.

Lam, K.W. and A.J. Jeffreys. 2006. Processes of copy-number change in human DNA: the dynamics of α-globin gene deletion. Proc Natl Acad Sci U S A 103: 8921-8927.

McCarroll, S.A., T.N. Hadnott, G.H. Perry, P.C. Sabeti, M.C. Zody, J.C. Barrett, S. Dallaire, S.B. Gabriel, C. Lee, M.J. Daly, and D.M. Altshuler. 2006. Common deletion polymorphisms in the human genome. Nat Genet 38: 86-92.