Supplementary note: Comparison of deletion variants identified in this study and four earlier studies

Here we compare the results of this study to potentially overlapping results from four earlier studies of human structural variation. These studies used three different experimental approaches:

1. Representational oligonucleotide microarray analysis (ROMA), which involves hybridization of a size-selected set of genomic restriction fragments to an oligonucleotide microarray;
2. Comparative genomic hybridization to microarrays of bacterial artificial chromosomes (BAC array CGH);
3. Sequencing the ends of 589,275 fosmids from a single individual, and searching for paired end reads that map more than 48 kb apart on the reference sequence (which identified deletions in that individual relative to the reference sequence, since virtually no fosmids have inserts that large).

The current work adds a fourth approach, which searches for particular aberrant patterns of genotypes in SNP genotype data.

Method             Ref.        Requires        Individuals assayed   Potential deletion variants identified
ROMA               1           Microarray      19                    76 CNPs (of which an unknown subset are deletions)
BAC array CGH      2           Microarray      55                    255 LCVs (of which an unknown subset are deletions)
BAC array CGH      3           Microarray      47                    119 CNPs (of which an unknown subset are deletions)
Fosmid end reads   4           Resequencing    1                     101 deletions
SNP genotypes      this work   SNP genotypes   269                   540 deletions

Deletions vs. multi-copy duplications

Copy number variation can result from either a deletion variant (a haplotype containing no copies of the sequence) or a multi-copy duplication (different haplotypes carrying different positive numbers of copies of the sequence). ROMA and BAC array CGH identify sites of copy number variation, that is, gains and losses of copy number in a proband relative to a reference sample. Because absolute copy number in the reference sample is not known, a copy number loss identified by ROMA or BAC array CGH could represent a deletion variant in the proband, or a multi-copy duplication that is present in more copies in the reference sample than in the proband.
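The ambiguity of relative copy number measurements can be illustrated with a minimal sketch. The copy counts and the helper name `relative_call` are hypothetical and chosen only for illustration; this is not the analysis performed by ROMA or BAC array CGH, only a demonstration that a relative measurement cannot distinguish the scenarios.

```python
def relative_call(proband_copies: int, reference_copies: int) -> str:
    """Classify a locus as a relative (proband vs. reference) assay would.

    Only the comparison between the two samples is observable;
    the absolute copy numbers are not.
    """
    if proband_copies < reference_copies:
        return "loss"
    if proband_copies > reference_copies:
        return "gain"
    return "no change"


# Scenario A (hypothetical counts): the proband carries a deletion variant
# on one haplotype, the reference sample has the usual two copies.
deletion_in_proband = relative_call(proband_copies=1, reference_copies=2)

# Scenario B (hypothetical counts): a multi-copy duplication that is present
# in more copies in the reference sample than in the proband.
duplication_in_reference = relative_call(proband_copies=2, reference_copies=3)

# Scenario C (hypothetical counts): the reference sample itself carries
# the deletion variant, so the proband appears to have gained a copy.
deletion_in_reference = relative_call(proband_copies=2, reference_copies=1)

print(deletion_in_proband, duplication_in_reference, deletion_in_reference)
```

Scenarios A and B produce the same call even though only A involves a deletion variant, and scenario C shows how a deletion in the reference sample would be reported as a gain.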

Sensitivity as a function of variant size

The sensitivity of all four approaches to detecting variants is strongly related to the size of the variant. This is evident from the technical requirements of each approach.

Method             Factor determining size sensitivity
ROMA               Requirement of differential hybridization to at least three consecutive probes (which are on average 32 kb apart)
BAC array CGH      Requirement of detectable differential hybridization to a BAC probe (about 150 kb in size)
Fosmid end reads   The deletion must be larger than natural variation in fosmid insert sizes (+/- 8 kb)
SNP genotypes      SNP density (at least two distinct SNPs must yield the same pattern of aberrant genotypes)

Most CNPs discovered by ROMA are larger than 100 kb (only 7 are smaller than 20 kb).

The sizes of deletion variants identified by Tuzun et al. (2005) from fosmid end pair sequencing are estimated from the apparent discrepancy of fosmid insert sizes (relative to the expected 40 kb). A discrepancy of at least 8 kb is required for discovery, due to natural variation in the sizes of fosmid inserts.

The size distribution of deletion variants identified in this work is estimated from the distance spanned by the aberrant SNP genotypes.
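The two size estimates described above can be sketched as follows. The 40 kb expected insert size and the 8 kb threshold come from the text; the function names and the example coordinates are hypothetical, and the treatment of a discrepancy of exactly 8 kb as discoverable is an assumption.

```python
from typing import Optional

EXPECTED_INSERT_BP = 40_000  # expected fosmid insert size (from the text)
MIN_DISCREPANCY_BP = 8_000   # natural variation in insert size (from the text)


def fosmid_deletion_size(left_read_pos: int, right_read_pos: int) -> Optional[int]:
    """Estimate deletion size from the mapped distance between paired end reads.

    Returns None when the discrepancy from the expected insert size is
    below the discovery threshold.
    """
    mapped_span = right_read_pos - left_read_pos
    discrepancy = mapped_span - EXPECTED_INSERT_BP
    return discrepancy if discrepancy >= MIN_DISCREPANCY_BP else None


def snp_span_size(leftmost_snp: int, rightmost_snp: int) -> int:
    """Size estimate from the distance spanned by the aberrant SNP genotypes."""
    return rightmost_snp - leftmost_snp


# Hypothetical coordinates, for illustration only.
print(fosmid_deletion_size(100_000, 160_000))  # end reads map 60 kb apart
print(fosmid_deletion_size(100_000, 145_000))  # within natural insert variation
print(snp_span_size(1_000_000, 1_012_500))
```

The fosmid estimate measures the excess over the expected insert size, whereas the SNP-based estimate measures only the region between the outermost aberrant SNPs.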

How different methods locate variants

To assess whether different approaches have identified the same variant, it is important to understand how each approach identifies the location of a variant. No current approach identifies the exact breakpoints of a variant (though the resequencing of fosmids that contain variants will ultimately accomplish this).

The current work identifies the SNPs that are covered by a deletion variant. These are inner boundaries: the breakpoints of the deletion should lie outside of these SNPs. The resolution of these boundaries depends on SNP density, which is about one SNP per 3 kb in the current version of HapMap but will be one SNP per 1 kb in future versions.

The fosmid-end-sequencing approach identifies paired end reads that flank a deletion variant. These are outer boundaries: the breakpoints of the deletion should lie inside of these sequence reads. These boundaries are initially at least 48 kb apart, but the discovery of additional, overlapping fosmids that also cover the deletion variant can refine these boundaries.

ROMA identifies a series of restriction fragments that show differential hybridization between a proband and a reference sample. These are thought to lie inside the variant and therefore to be inner boundaries. The principal limit on precision is that the restriction fragments are separated by an average of 32 kb genomic distance.

BAC array CGH identifies a BAC-sized region (about 150 kb) in which a significant amount of sequence is present in greater or fewer copies in a proband relative to a reference sample. It identifies a neighborhood that contains, overlaps with, or is contained by a large variant.

Mutual discoveries

Mutual discoveries between this work and the fosmid-end-sequencing method

To assess the mutual discoveries between this work and the fosmid-end-sequencing approach, we therefore looked for deletion variants from our set (540 variants) that fell completely inside deletion variants identified from the fosmid approach (102 variants). We found 28 such mutual discoveries (vs. less than one expected by chance). 25 of these 28 variants were identified in more than one individual in our study, suggesting that they are common variants (making them more likely to have been sampled in the single individual from whom the fosmid library was constructed). As the fosmid-end-pair-sequencing approach will ultimately be applied to additional individuals (including some of the same individuals sampled for HapMap), we expect these approaches to converge toward agreement on a set of common deletion variants. The end-pair-sequencing approach also detects insertions relative to the reference sequence, which our approach does not; we found no overlaps between the insertions identified in that work and the deletions identified in the present work.

        This work                     Tuzun et al., 2005
Chrom   Leftmost SNP   Rightmost SNP   Left end-read boundary   Right end-read boundary
chr1    34,606,761     34,610,715      34,591,979               34,617,030
chr1    72,137,668     72,176,870      72,104,907               72,193,110
chr1    109,527,309    109,534,259     109,522,724              109,551,608
chr1    149,771,758    149,800,260     149,766,538              149,825,054
chr1    149,977,953    149,986,389     149,963,017              149,990,180
chr2    89,039,268     89,049,267      89,008,235               89,065,342
chr2    147,075,728    147,086,685     147,071,619              147,096,571
chr3    163,833,596    163,943,569     163,829,860              163,953,604
chr3    194,196,286    194,205,086     194,187,307              194,212,927
chr3    194,457,389    194,459,618     194,444,572              194,483,617
chr4    9,969,524      9,980,122       9,949,506                9,989,442
chr6    103,784,319    103,807,031     103,757,028              103,813,103
chr7    97,008,440     97,012,729      96,997,304               97,035,029
chr7    109,002,325    109,011,761     108,987,475              109,022,463
chr7    115,492,184    115,494,416     115,472,521              115,507,536
chr7    141,456,537    141,472,512     141,455,775              141,511,462
chr7    141,921,685    141,931,471     141,902,964              141,956,537
chr8    6,810,705      6,811,452       6,802,213                6,847,036
chr8    51,082,185     51,083,978      51,077,741               51,094,841
chr11   4,940,386      4,941,077       4,923,545                4,949,682
chr11   55,147,167     55,149,063      55,134,385               55,245,719
chr14   68,010,231     68,011,603      67,992,406               68,020,456
chr14   104,215,047    104,275,522     104,202,520              104,369,924
chr15   18,840,317     18,844,987      18,831,471               18,864,056
chr15   32,437,866     32,525,037      32,401,286               32,556,820
chr20   1,564,704      1,567,374       1,546,392                1,594,647
chr20   14,789,361     14,818,472      14,747,001               14,944,555
chr22   37,615,466     37,624,865      37,593,346               37,639,623

Mutual discoveries between this work and ROMA

We looked for all places in which we found a deletion that overlapped with a ROMA CNP and covered at least 20% of the region assigned to the CNP. There were four mutual discoveries: one on chr6, one on chr15, one on chr14 (the immunoglobulin heavy chain locus), and one on chr22 (the immunoglobulin lambda locus). All four mutual discoveries involved common variants (that had been observed multiple times in one or both of the two studies).

        This work                     Sebat et al., 2004
Chr     Leftmost SNP   Rightmost SNP   Leftmost probe   Rightmost probe
chr6    78,995,494     79,027,965      78,997,800       79,090,884
chr15   32,437,866     32,525,037      32,410,643       32,581,135
chr14   104,485,754    104,965,621     104,230,277      104,993,730
chr22   21,026,944     21,558,650      21,127,641       21,512,863

Mutual discoveries between this work and BAC array CGH

We looked for all places in which we found a deletion variant that covered at least 20% of a BAC probe that had identified an LCV/CNP in the earlier studies. There were three mutual discoveries with Iafrate et al. (2004): one on chr4, one on chr14 (the immunoglobulin heavy chain locus), and one on chrX. All three mutual discoveries involved common variants (that had been observed multiple times in one or both of the two studies).

        This work                     Iafrate et al., 2004
Chr     Leftmost SNP   Rightmost SNP   BAC left end   BAC right end
chr4    34,677,422     34,724,191      34,674,501     34,823,905
chr14   104,485,754    104,965,621     104,767,866    105,076,137
chrX    91,086,005     91,109,766      90,900,000     91,100,000
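The two matching criteria used in these comparisons (complete containment within the fosmid end-read boundaries, and coverage of at least 20% of a CNP region or BAC probe) can be sketched as simple interval tests. This is a minimal illustration, not the code used in the study; the function names are hypothetical, and the example coordinates are taken from the first chr1 row of the fosmid comparison and the chr4 row of the Iafrate et al. comparison.

```python
from typing import Tuple

Interval = Tuple[int, int]  # (start, end) on one chromosome, with start < end


def contained_in(inner: Interval, outer: Interval) -> bool:
    """True if the inner interval lies completely inside the outer interval."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def fraction_covered(deletion: Interval, region: Interval) -> float:
    """Fraction of `region` (a CNP or BAC probe) covered by `deletion`."""
    region_length = region[1] - region[0]
    if region_length <= 0:
        raise ValueError("region must have positive length")
    overlap = min(deletion[1], region[1]) - max(deletion[0], region[0])
    return max(overlap, 0) / region_length


# Fosmid criterion: SNP-defined inner boundaries inside end-read outer boundaries.
snp_span_chr1 = (34_606_761, 34_610_715)
fosmid_span_chr1 = (34_591_979, 34_617_030)
print(contained_in(snp_span_chr1, fosmid_span_chr1))

# ROMA / BAC criterion: deletion covers at least 20% of the region.
snp_span_chr4 = (34_677_422, 34_724_191)
bac_span_chr4 = (34_674_501, 34_823_905)
coverage = fraction_covered(snp_span_chr4, bac_span_chr4)
print(coverage >= 0.20)
```

Containment is the natural test for the fosmid comparison because SNPs give inner boundaries and end reads give outer boundaries; a fractional-coverage test suits the array methods, which localize a variant only to a neighborhood.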

There were six mutual discoveries with Sharp et al. (2005), including the immunoglobulin lambda and heavy chain loci:

        This work                           Sharp et al., 2005
Chr     Leftmost marker   Rightmost marker   BAC left end   BAC right end
chr4    70,447,409        70,542,965         70,432,219     70,591,332
chrX    46,929,298        47,028,433         46,881,874     47,078,955
                                             46,939,097     47,119,352
chr14   104,215,047       104,275,522        104,194,660    104,377,772
chr14   104,485,754       104,965,621        104,413,088    104,573,219
                                             104,580,604    104,731,664
chr15   32,437,866        32,525,037         32,447,228     32,598,686
chr22   21,026,944        21,558,650         21,389,432     21,565,251

Summary of mutual discoveries

We shared 28 mutual discoveries with the fosmid-end-sequencing method, 4 with ROMA, and 3 and 6 with the two studies that used BAC array CGH. (35 shared discoveries total, since four loci were discovered in two earlier studies, and one locus was discovered in three earlier studies.) The larger number of mutual discoveries with the fosmid-end-sequencing method almost certainly reflects the sensitivity of that method for detecting variants in an intermediate size range (8+ kb) that overlaps significantly with the size range of the variants identified here. Fewer than 10% (35/540) of the deletion variants identified in the present work are shared with earlier studies.

Are most large CNPs and LCVs duplications or deletions?

ROMA and BAC array CGH identify sites of copy number variation, that is, gains and losses of copy number in a proband relative to a reference sample. Because absolute copy number in the reference sample is not known, a copy number loss identified by ROMA or BAC array CGH could represent a deletion variant in the proband, or a multi-copy duplication that is present in more copies in the reference sample than in the proband. For example, of the 5 overlaps between our discoveries and the Sharp et al. study, two were reported as copy number gains in the earlier study, perhaps reflecting the presence of the deletion variant in the reference sample.

Because most of these CNPs and LCVs are quite large (72% of the ROMA CNPs are larger than 100 kb, and the loci underlying the array CGH discoveries are assumed to be sufficiently large to result in a reproducible differential hybridization to a 150 kb BAC probe), more than 85% of them cover an ample number of HapMap SNPs for detecting common deletion variants if they exist at these sites. Yet deletion variants discovered in the present work appeared to explain only 10 of the 300 variants previously discovered by ROMA and BAC array CGH. We found no SNP support for potential deletion variants underneath 95% of the large (100+ kb) copy number polymorphisms identified by ROMA, despite the fact that 90% of these copy number polymorphisms have many SNPs (at least 20) available for detecting such deletion variants. We suggest that these CNPs are therefore likely to represent multi-copy duplications. This possibility was suggested in the earlier studies, and is consistent with the observation that selection may be more tolerant of polysomy than of deletion at scales of hundreds of kilobases (Brewer et al., Am. J. Hum. Genet. 64, 1702-1708, 1999; Lindsley et al., Genetics 71, 157-184, 1972).

References

1. Sebat, J. et al. Large-scale copy number polymorphism in the human genome. Science 305, 525-8 (2004).
2. Iafrate, A. J. et al. Detection of large-scale variation in the human genome. Nat Genet 36, 949-51 (2004).
3. Sharp, A. J. et al. Segmental duplications and copy-number variation in the human genome. Am J Hum Genet 77, 78-88 (2005).
4. Tuzun, E. et al. Fine-scale structural variation of the human genome. Nat Genet 37, 727-32 (2005).