Supplementary Information
Comparison between assembly-based SV calls and array CGH results
Genome-wide array assessment of copy number changes, such as array comparative genomic hybridization (aCGH), is widely accepted by the scientific community for detecting copy number variations (CNVs), which, in principle, result from structural variation events. Array CGH typically has a resolution at the 10 kbp scale and does not define exact breakpoints, so it does not match well the length spectrum and breakpoint precision of our assembly-based method. Nevertheless, a comparison between the two technologies is still informative. We performed aCGH between each pair of three genomes (the YH genome, the anonymous reference genome, and a Promega female sample (www.promega.com)) to identify putative aberrant copy number changes specific to the YH genome. In total, 144 CNVs were called, comprising 42 multi-probe and 102 single-probe signals. Using a reciprocal overlap threshold of 50%, we found that 20 (47.6%) multi-probe CNVs and 11 (10.7%) single-probe CNVs called by aCGH overlapped SVs from our assembly-based approach (Supplementary Dataset). Of note, 19 (61%) of the 31 overlapping CNVs contained multiple SV events within their genomic ranges. For example, a copy number loss on chromosome X called by aCGH involved 60 different deletion and insertion events called from the assembly (Figure S8). This indicates that aCGH not only yields ambiguous breakpoints but also aggregates multiple signals into a misleading average, losing the internal detail.
Comparison between SV call sets with other studies
The HuRef assembly 6, resolved by conventional Sanger sequencing, had better contiguity than both assemblies in our study. Therefore, it is interesting to see in which SV categories we still have a margin for improvement.
A comparison between call sets (Table S4a) showed that our assembly-based SVs have a larger reciprocal overlap with the HuRef SV calls than those of any other method, which indicates that our method has better power to discover SVs. The overlap rate for small indels is much higher than that for large SVs, which could be because: 1) large SVs are more individual-specific than small indels, as they tend to be more deleterious and are under stronger negative selection; and/or 2) current assembly contiguity needs to be improved so that large SVs fall within a single contig. We also compared our call sets with those of Pang et al. 7, obtained from a range of technologies (Table S4b). Among the Pang et al. call sets, those called from SR mapping had the highest overlap with our call sets, rather than those called from PEM or arrays. Given that indels called by the SR method are smaller than those from other methods, this again suggests that large indels are less likely than small ones to be shared between individuals.
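The 50% reciprocal overlap threshold used in the comparisons above can be illustrated with a minimal Python sketch. The function names, the half-open interval convention and the example coordinates are assumptions made for illustration; the text does not specify how the threshold was applied to CNVs that span several SV calls.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Return the smaller of the two overlap fractions of intervals a and b.

    Intervals are treated as half-open [start, end); this convention is an
    assumption of the sketch, not something stated in the text.
    """
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    frac_a = overlap / (a_end - a_start)  # fraction of a covered by b
    frac_b = overlap / (b_end - b_start)  # fraction of b covered by a
    return min(frac_a, frac_b)


def passes_reciprocal_overlap(cnv, sv, threshold=0.5):
    """True if the overlap covers at least `threshold` of both intervals."""
    return reciprocal_overlap(cnv[0], cnv[1], sv[0], sv[1]) >= threshold


# Hypothetical coordinates, for illustration only.
cnv = (100, 200)
sv = (150, 250)
print(passes_reciprocal_overlap(cnv, sv))
```

Because the criterion takes the smaller of the two fractions, a small SV lying entirely inside a much larger CNV does not pass on its own, which is one reason a coarse aCGH call and fine-scale assembly calls can disagree.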
Supplementary figures
Figure S1. A gapped alignment plot showing a deletion between NCBI36 chromosome 1 (coordinates 246,118,102 to 246,124,262) and scaffold6804 of the YH whole-genome de novo assembly (breakpoint at 8,953).
[Plot axes: Scaffold 6804 (5,000 to 20,000); Chromosome 1 (246,110,000 to 246,130,000).]
Figure S2. Illustration of how read-pair alignment and read depth change at sites of insertion or deletion in the reference and the assembly. A well-assembled sequence should always achieve good pairwise alignment and read depth, since around indels a PE read would be aligned to the reference as two single-end reads, and the read depth of deleted regions in the reference should be very low. [Panels a and b.]
Figure S3. A case of complex structural variation in the YH genome. The figure illustrates a ~22 kbp inversion (pink line, shown as a cross between assembly and reference) on chromosome 10 with repetitive sequences (gray block) and several other events, including insertions ranging from 1 bp to ~18 kbp (green line and block) and deletions (violet line and block), within a hyper-mutation region (with over 10 insertion and deletion events spaced less than 200 bp apart). Read depth (uppermost line chart) and PE read alignment (middle curve chart) show why this SV event is difficult to detect with previous approaches, including RD, PEM and SR.
Figure S4. The distribution of SVs across the whole genome, and regions with significantly different numbers of SVs between the YH and NA18507 genomes. Histograms show the number of SVs in each 1-Mb bin along the chromosomes. Regions with a significant difference between the two genomes are marked in purple (YH higher than NA18507) or green (NA18507 higher than YH) to the right of the chromosomes.
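The per-bin counts behind histograms like those in Figure S4 can be sketched in Python as below. The function name, the input layout and the choice to assign each SV to a bin by its start coordinate are assumptions for illustration; the caption does not say how SVs spanning a bin boundary were handled.

```python
from collections import Counter

BIN_SIZE = 1_000_000  # 1-Mb bins, as in the caption


def count_svs_per_bin(svs, bin_size=BIN_SIZE):
    """Count SVs per (chromosome, bin index).

    `svs` is an iterable of (chromosome, start) pairs. Each SV is assigned
    to the bin containing its start coordinate (an assumption of this sketch).
    """
    counts = Counter()
    for chrom, start in svs:
        counts[(chrom, start // bin_size)] += 1
    return counts


# Hypothetical SV start positions, for illustration only.
example = [("chr1", 500), ("chr1", 999_999), ("chr1", 1_000_000), ("chr2", 10)]
for key, n in sorted(count_svs_per_bin(example).items()):
    print(key, n)
```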
Figure S5. Stacked histogram showing the proportion of SVs in different length ranges that overlap unique and repetitive annotated regions in the NCBI human reference genome build 36.
Figure S6. Venn diagram showing the numbers of affected gene features among genes overlapping SVs. CDS (green); 3'-UTR (red); 5'-UTR (blue); intron (yellow). Each number gives the number of genes with one or several gene features affected in the YH genome, followed by that in the NA18507 genome.
Figure S7. The frequency of structural variations detected in coding sequences (x-axis) showed a negative correlation with their length (y-axis).
[Plot axes: x, Frequency (0.01 to 1); y, Mean length (bp) (0 to 1,500).]
Figure S8. Comparison between array CGH signals and assembly-based SV calls, showing that aCGH signals are averaged from multiple smaller-scale SV events.
[Figure labels: Insertion; Deletion; 91.45 Mb; 11,438 bp; 38,282 bp; 92.18 Mb.]
Supplementary tables
Table S1. Primers, sequences of randomly selected structural variations and Sanger capillary sequencing results for PCR validation.
Table S1_PCR validation.xls
Table S2. (a) Summary of Fosmid sequence validation results. (b) Details, including chromosomes and coordinates, of the Fosmid sequence validation results.
Table S2_Fosmid validation.xls
Table S3. Structural variations predicted in the YH and NA18507 genomes were each compared to sets of variants discovered by alternative approaches. Numbers before the slash (/) are the numbers of overlapping variants in the NA18507 genome; numbers after it are the numbers of overlapping variants in the YH genome. A hyphen (-) means not applicable. The criterion FxOy extends the identified variants by x bp of flanking sequence on both sides of their breakpoints for comparison, and requires the intersection between the validated and the identified variants to cover at least y bp of the length of the union of the intervals. DIP 1, small indels found as gaps in the paired-end alignment between the Fosmid end sequences and the reference; ESP 2, large structural variants found by analyzing discordant Fosmid clone-end alignments; three separate sets of structural variants (maximum parsimony structural variation (MPSV) weighted, MPSV unweighted and probabilistic) predicted by VariationHunter 3; MoDIL 4, the set of variants predicted by the MoDIL utility; BreakDancer 5, a merged set of variants predicted by BreakDancerMini and BreakDancerMax. The dbSNP version 130 (v130) set refers to homozygous indels of 30 bp or shorter in dbSNP version 130. The BreakSeq-YRI set refers to variants predicted in NA18507 by BreakSeq and a breakpoint library.
Table S3_Computational validation.xls
Table S4. Comparison between SVs detected in the YH genome, Levy et al. 6 and Pang et al. 7
Table S4_Compare to Levy and Pang.xlsx
Table S5. Classification of those strongly conserved (dn/ds 0.1) genes containing SVs.
Table S5_gene function.xls
Supplementary Dataset
Supplementary_aCGH.txt
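One possible reading of the FxOy criterion described for Table S3 is sketched below in Python. This is an interpretation, not the authors' implementation: it assumes that x is the number of flanking base pairs added to each side of the identified variant and that y is a minimum intersection length in base pairs. The caption's reference to the length of the union of the intervals is not modelled here, and the function name and example coordinates are hypothetical.

```python
def matches_fxoy(identified, validated, flank, min_overlap):
    """Check one reading of the FxOy criterion for two (start, end) intervals.

    Assumptions of this sketch:
      * `flank` (x) extends the identified variant on both sides;
      * `min_overlap` (y) is the minimum intersection length in bp;
      * intervals are half-open [start, end).
    """
    start = max(0, identified[0] - flank)  # do not extend below coordinate 0
    end = identified[1] + flank
    overlap = min(end, validated[1]) - max(start, validated[0])
    return overlap > 0 and overlap >= min_overlap


# Hypothetical intervals: the variants are 20 bp apart and only meet
# once 50 bp of flanking sequence is added.
identified = (1000, 1100)
validated = (1120, 1300)
print(matches_fxoy(identified, validated, flank=0, min_overlap=1))
print(matches_fxoy(identified, validated, flank=50, min_overlap=20))
```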
References
1. Bentley, D.R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53-9 (2008).
2. Kidd, J.M. et al. Mapping and sequencing of structural variation from eight human genomes. Nature 453, 56-64 (2008).
3. Hormozdiari, F., Alkan, C., Eichler, E.E. & Sahinalp, S.C. Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome Res 19, 1270-8 (2009).
4. Lee, S., Hormozdiari, F., Alkan, C. & Brudno, M. MoDIL: detecting small indels from clone-end sequencing with mixtures of distributions. Nat Methods 6, 473-4 (2009).
5. Chen, K. et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods 6, 677-81 (2009).
6. Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biol 5, e254 (2007).
7. Pang, A.W. et al. Towards a comprehensive structural variation map of an individual human genome. Genome Biol 11, R52 (2010).