SUPPLEMENTARY INFORMATION

Size: px

Start display at page:

Download "SUPPLEMENTARY INFORMATION"

Lucas Alexander
5 years ago
Views:

1 Supplementary methods Genome Sequencing Paired-end and singleton 36 nucleotide-long reads were generated using Illumina GA and GAII instruments using standard protocols 1. Sequences were retained if of high quality (defined as those passing the default quality filtering parameters used in the Illumina GA Pipeline GERALD stage). Sequencing runs with poor average nucleotide quality score graphs (associated with a minority of reads passing the default quality filtering parameters) were discarded. Sequence Alignment High-quality reads were aligned to human reference genome build 36.3 using the GSNAP 2 alignment tool, a derivative of GMAP 3 that was developed for alignment of short, paired reads, allowing up to 5% mismatches. Only the highest scoring alignment(s) was retained. SNPs and indels were identified using the Alpheus software system 4-6. GSNAP is a short-read alignment program based on GMAP that employs a hash table and a compressed version of the reference genome, which is constructed once for that genome. The reference may include arbitrary contigs (up to 4 billion), so that one may also align to a reference transcriptome, with redundancy allowed among the contigs. The hash table contains the locations of a given 12-mer in the genome, subject to sampling. The sampling step occurs during pre-processing of the genome, so that genomic locations are stored only for every third 12-mer in the genome. Sampling is needed to reduce the memory footprint of the program below 4 gigabytes for a human-sized genome. GSNAP can handle short reads of > 24 or more nucleotides (nt), with each read in the input potentially having a varying length. There is theoretically no upper bound on the length of the query sequence, except that this bound is compiled into GSNAP by default at 200 nt; longer sequences can be handled simply by changing this constant at compile time. However, sequences of > 200 nt are probably best handled by other programs, such as GMAP. GSNAP has specialized algorithms for identifying exact mappings, one-mismatch mappings, multiple-mismatch mappings, and indel mappings (including a user-specified number of mismatches). Exact mappings are identified by taking the intersection of genomic positions over a spanning set of 12-mers in the query sequence. The spanning set must contain 12-mers in the same phase modulo 3, to account for the sampling used in pre-processing the genome, so the program must test each of the three possible phases. For spanning set members that overhang the ends of the query sequence by 1 or 2 nt, the relevant genomic positions can be obtained by substituting 1 or 2 wildcard nt, respectively, and taking the union of genomic locations in the hash table. Candidates for one-mismatch mappings are similarly identified by computing an incomplete intersection, in which one 12-mer in the spanning set does not contain the given genomic location. These candidate genomic mappings are then compared against a compressed version of the genome to verify that only one mismatch was present. Candidates for multiple-mismatch mappings are determined by processing a sorted list of genomic locations from all 12-mers in the query sequence. This sorted list is computed efficiently using a heapbased priority queue. For each candidate genomic location, a floor on the number of mismatches can be computed from the pattern of query positions of the 12-mers that match the genomic location. Candidates with a sufficiently low floor (based either on a user-specified limit or on the best mapping determined so far) are then compared against the compressed genome to determine the actual number of mismatches. 1

2 For identifying indel mappings, GSNAP accumulates partial genomic alignments during the multiple-mismatch algorithm, where a partial alignment can be supported by a single 12-mer in the query sequence. These partial alignments are then scanned in genomic order to identify pairs that are sufficiently close to constitute a candidate indel, where the default distances are 30 nucleotides for an insertion and 12 nucleotides for a deletion. These candidate pairs are then compared against the compressed genome to determine the number of mismatches. To identify indels occurring at either end of the query sequence, the program computes floors that exclude the 12-mers on either end. Candidates with a sufficiently low floor are then compared against the compressed genome to identify a possible indel at the end and to count the actual number of mismatches. Although GSNAP allows repetitive regions of the genome to be masked before building the genomic data structure, in typical usage (as herein) the genome is not pre-masked. Therefore, GSNAP is able to align sequences to redundant regions in the genome, including repetitive regions, and report all such alignments. In default mode (as herein), the program reports only the best alignments, those with the fewest mismatches, although the program also can be run to identify and report suboptimal alignments. GSNAP differs from both MAQ and ELAND in that it processes the reference genome first, constructs a hash table of the genome, and then aligns the short reads to the genome. In contrast, MAQ, ELAND, and almost all other short-read alignment programs process the short reads first, construct a hash table of the short reads, and then scan the genome to find matches. To our knowledge, the only short-read aligners that pre-process the genome other than GSNAP are BLAT (which is not specifically designed for this task and is extremely slow), SOAP (which is also extremely slow and requires 14 gigabytes of memory), Bowtie (which cannot identify indels), and Mosaik (which can identify indels, but uses Smith-Waterman alignments plus heuristic filters that appear to generate a high rate of false negatives in our testing). GSNAP does not use the Illumina quality scores (as MAQ and RMAP can do), although that is a feature that may be included in future versions of the program. Pre-release versions of GSNAP are available upon request to TDW at twu@gene.com. SNP and Indel Detection To create reliable filter conditions to exclude false positives in raw SNP calls, we compared massive parallel sequencing data with genotypes from microarrays. Array-based genotyping was performed using the Human BeadChip CNV370-Duo, the Human BeadChip 610-Quad (Illumina, San Diego, CA), and standard procedures 7. Filters optimizing PPV and sensitivity were determined in genomic positions assayed on the Human BeadChip CNV370-Duo where SNPs where identified through genotyping and/or SBS sequencing. Human BeadChip 610-Quad genomic positions that were nonredundant with those on the Human BeadChip CNV370-Duo Beadchip were used for filter validation. SNPs were called that met filters obtained by comparison to Illumina genotypes that optimized PPV and sensitivity (Supplementary Table 3, 4, 7, 8). (1) SNP filters for the autosomes: SNPs called by 4 or more unique reads, 20% or higher aligned reads, and an average quality of 20 or higher; autosomal SNPs were separated into heterozygous and homozygous classes using a 90% cutoff of percent aligned reads calling the SNP, where PPV for determining zygosity of SNPs was determined to be highest. (2) SNP filters for the sex chromosomes: SNPs called by 2 or more unique reads, 50% or higher aligned reads, and an average quality of 20 or higher. For pseudo-autosomal regions on sex chromosomes, the filter condition was same as that of the autosomes. SNPs that fell within gene boundaries were characterized as synonymous or non-synonymous by comparisons to refgene sequences from UCSC genome browser (downloaded Feb , available at 2

3 Indel calls were made with filters identified by comparison to a set of indels resequenced using Sanger technology. (Supplementary Discussion) (1) Indel filters for autosomes: indels called by 4 or more unique reads, 20% or higher of aligned reads, and an average quality of 20 or higher; autosomal indels were separated into heterozygous and homozygous classes using a 60% cutoff of percent aligned reads calling the indel. (2) Indel filters for sex chromosome: indels called by 2 or more unique reads, 50% or higher aligned reads, and an average quality of 20 or higher. For pseudo-autosomal regions on sex chromosomes, the filter condition was the same as that of the autosomes. Adjacent indels were compressed for same-sized indels within 10 nt. Coding domain indels were called by comparison to refgene as described above for SNPs. To exclude apparent indel errors identified due to problems of the reference sequence, we compared RefSeq transcript (NCBI; downloaded Sept. 2, 2008) with the reference genome sequence. Where RefSeq Transcript alleles did not match the corresponding reference genome alleles, no indel call was made. (Supplementary Table 24). Also, those indels which fell within the gene boundaries mentioned above were considered as CD indels. Validation Studies We selected 60 indels for Sanger validation. Those regions were validated using genomic DNA extracted from peripheral blood and amplified by PCR with flanking primers (Supplementary Table 23). PCR amplification was performed in 50 μl with 50 ng genomic DNA, 10 pmol of forward and reverse primer each, standard volume of Ex Taq (Takara), Ex Taq buffer (Takara), and dntps (Takara), at 95 0 C for 10 min, thirty five cycles of 95 0 C for 30s, 60 0 C for 30s, 72 0 C for 30s, and finally 72 0 C for 10 min. PCR products were purified with AccuPrep PCR purification kit (BioNeer Inc., Korea). Then the purified products were validated using ABI 3730xl DNA analyzer and ABI BigDye Terminator cycle sequencing (Applied Biosystems, Foster City, CA). Detection of true positive deletions with paired end GA II data from 1,132 BACs Large deletion in the BAC clone DNA could be detected by paired end GA sequencing data on the basis of stretched pairs and coverage fold changes. Stretched pairs indicate that certain sequence in the reference genome between these pairs are absent in the sample. Because the longest average insert size of paired-end libraries were 390 bp and the standard deviation (SD) values were less than 50 bp, we used 500 bp (> mean + 2 SD) as the conservative threshold value to define stretched pairs. To find CNV deletion, all stretched pairs from data was extracted and retained if 15 or more of them were clustered in the same region. One stretched pair has to be within 500 bp range from neighboring stretched pair in the region (This value was derived by the same reason mentioned above). Subsequently, we filtered out uncertain candidates by applying another criterion of coverage depth to choose accurate deletions. If the coverage value of a candidate region was less than half of coverage of 1 kb flanking regions of both sides, it was retained in the deletion list. Detection of Structural Variations using Microarrays In addition to 1,132 BACs, we detected putative structural variations with microarrays (Illumina BeadChip HumanCNV 370-Duo, HumanCNV 610-Quad, and Agilent 24M Human comparative genome hybridization array). For Illumina BeadChips, normalized bead intensity data and genotype calls were obtained with Illumina BeadStudio 3.1 software. Copy number variants (CNVs) were detected on the basis of 3

4 deflected log R ratios. The Agilent Human comparative genome hybridization (CGH) array, which consists of 24 million oligonucleotide probes, comprising of 24 X 1.1M Agilent acgh platform, was also used. The array is designed by Agilent to cover both coding and non-coding genomic regions, and repeat-masked to omit repeated regions such as Alu elements and variable number tandem repeats (VNTRs). Whole genomic DNA from the test and reference samples were sonicated, labeled and hybridized as per Agilent CGH array protocols. Briefly, for each of 1.1M acgh experiment, 1.5 µg of sample and reference DNA were sonicated for 1 minute to shred the genomic DNA. Random primers (Invitrogen Inc.) were added to both reference and sample DNAs and the tubes were incubated at 95 0 C for 3 minutes. The test and reference samples were labeled with fluorescent Cy-5 and Cy-3 dyes, respectively. The labeled genomic DNA was cleaned with Microcon YM-30 filters and dye-incorporation was quantified using a Nanodrop Spectophotometer to confirm the labeling and cleanup efficacy. The labeled samples were added to Cot-1 DNA (Invitrogen), Blocking agent (Agilent) and Hi-RPM (Agilent) buffer and prehybridized at 37 0 C for 2 hours. After prehybridization, the sample was applied to the 1.1M Agilent array and left for hybridization in a rotating Agilent hybridization oven at 65 0 C for 40 hours. After the hybridization, the array was immersed consecutively in wash buffer 1 and 2 for 5 and 3 minutes, respectively. After the washing, the array was scanned with Agilent DNA microarray scanner at 2 micron resolution. The resulting image was extracted by Agilent s Feature Extraction software. The log2 ratios were analyzed using NEXUS software. Each aberration call was manually checked to confirm the accuracy of the calls. Detection of Amplification with LIPE CNV insertion could be detected in Long Insert sized Paired End (LIPE) data on the basis of compressed pairs which indicated insertion of certain sequence between pair reads. The average insert size of the LIPE library was 2,700 bp and the standard deviation (SD) was about 300 bp. Based on these figures, a compressed pair was defined as one with insert size shorter than 2,000 bp (< mean 2 SD). Similar to method used to detect deletion, compressed pairs were defined to be clustered if one compressed pair was within 3,300 bp (mean + 2 SD) to neighboring compressed pair. To remove redundant pairs present in LIPE library, two or more pairs with both reads aligned to exactly the same position of reference genome were defined as one pair and named de-redundant. If 10 or more deredundant compressed pairs were found in one cluster, we defined this region as a candidate region for amplification. High-Throughput Annotation; Trait-o-matic To produce an initial list of clinically significant variants the Trait-o-matic web-service ( accepts a list of SNPs and parses this list to focus on non-synonymous coding changes. SNPs that are listed in Evidence-Base (a publicly curated database coordinated with Genetests.org, HGMD, OMIM, and SNP frequency data) are flagged for further review in separate sections by the Trait-o-matic service. Coding changes that are determined to be rare (potentially deleterious), based on the BLOSUM100 matrix, are listed in a separate section. The general availability of this web-service will facilitate a standardized interpretation of individual whole genomes by clinicians. The community activity around it should also enable the standardization of the return of data to individuals involved in research or direct-toconsumer activities. We will be replacing this soon with a wrapping webpage that includes evidencebase and the community wiki. 4

5 References Bentley, D.R. et al., Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, (2008). Nacu, S., Kan, Z., Seshagiri, S., & Wu, T., Sequencing the transcriptome in prostate cancer. Poster at the 2008 European Conference on Computational Biology ( (2008). Wu, T.D. & Watanabe, C.K., GMAP: a genomic mapping and alignment program for mrna and EST sequences. Bioinformatics 21, (2005). Miller, N.A. et al., Management of High-Throughput DNA Sequencing Projects: Alpheus. J. Comput. Sci. Syst. Biol. 1, (2008). Mudge, J. et al., Genomic convergence analysis of schizophrenia: mrna sequencing reveals altered synaptic vesicular transport in post-mortem cerebellum. PLoS ONE 3, e3625 (2008). Sugarbaker, D.J. et al., Transcriptome sequencing of malignant pleural mesothelioma tumors. Proc. Natl. Acad. Sci. USA 105, (2008). Steemers, F.J. et al., Whole-genome genotyping with the single-base extension assay. Nat. Methods 3, (2006). 5

6 Supplementary discussions Population characteristics of five individuals whose genomes have been sequenced Dr. James D. Watson father's ancestors were of English descent. His mother's father was Scots and his mother s mother was the daughter of Irish immigrants who arrived in the United States about Dr. J. Craig Venter s ancestors have been traced back to 1821 (paternal) and the 1700s (maternal) in England. Despite lack of detailed ancestry, ancestry informative markers confirm that the HapMap sample, NA18507, is Yoruban African. YH has been described as a Han Chinese. AK1 is a Korean male with at least 4 of generations of Korean ancestors. Genetic heterogeneity, relationships, and migration of human populations have been studied using various biological markers 1. The relatively small, easily sequenced mitochondrial genome has been extensively used to perform distribution and migration analyses, despite being transferred only through the mother s side. We compared the haplogroups and the sequences of mitochondrial DNA from individuals whose personal genomes are currently available (Supplementary Figure 17). The haplogroups for Watson and NA18501 were H and L1, respectively (we were unable to find mitochondrial sequence information of Venter). This is in accord with the characteristics of these populations since the haplogroups H and L1 are representative of Western Europe and West Africa, respectively. YH was categorized as haplogroup M, and further specified to be of type M7c. According to Tanaka and colleagues, M7c occurs rarely in Japan, but is observed more frequently in Sabah and Philippines, while highest diversity can be seen in China 2. We were unable to find individuals with M7c haplotypes upon examination of 72 Koreans and 58 Mongolians via total mitochondrial DNA sequencing (unpublished result). AK1 has been categorized as haplogroup D. The D haplogroup, which is frequently observed amongst East Asians as well as American Indians, has been observed in 31 Koreans (43.1%) and 12 Mongolians (20.7%), respectively (unpublished data). Technical improvement resulting in longer reads The reformulated cleavage reagent was developed by Illumina. It was used for the first time in the current study as a beta-test reagent. This reformulated cleavage reagent is now available in all Illumina reagent kits. Sequences generated using the reformulated cleavage reagent showed lower error rates and GC bias, since the reagent removed thymine fluorophores more completely and reduced background signals (Supplementary Figure 2). Hence the reformulated reagent allowed generation of sequence reads of up to 2 X 106, much longer than ordinary read length (2 X 36). Using this reagent, we succeeded in generating 18.8 Gb of passing sequence per flowcell. Longer reads are advantageous for indel calling and short read de novo assembly. Due to longer reads, we could detect indels up to 29 bases in length (Supplementary Figure 5). Coverage for autosomes and sex chromosomes using different alignment methods There exist at least three ways (multiple, random, and unique alignment) to address the issue of a selected read aligning to one or more places in the reference genome. The multiple alignment method allows the alignment of reads to multiple possible locations with identical match scores, while the random alignment randomly selects only one location. Unique alignment considers only those reads that show alignment to a single place in the reference genome. In an attempt to understand the influence of repetitive sequences, we tested all three methods of alignment. We applied the multiple alignment method (except that we discarded alignments for reads with more than five alignments) for SNP and indel detection, which maximized the efficiency of variant detection. For analyzing whole genome coverage or coverage fold-change analysis for finding CNV regions we applied the random alignment method. To identify the influence of repetitive sequence on observed spikes in average 6

7 coverage in 10 Kb windows, we evaluated the unique alignment method (Supplementary Figure 18, 19). Supplementary Figure 18 represents whole genome coverage plots for all three alignment methods. Except for the locations of spikes of especially high coverage, the overall pattern of alignments was similar. Coverage spikes were most prominent in the multiple alignment method, and least prominent in the unique alignment method. The majority of spikes were located near telomeric and centromeric gap regions, which are mostly repetitive. A closer look at the reference genome sequences around these spikes confirmed that these regions were, in fact, repetitive sequences in most cases (data not shown). Interestingly, some of the less prominent spikes were still present in coverage plots obtained using unique alignments. To investigate whether these residual spikes are artifacts originating from experimental error or specific problems of the GSNAP alignment tool, we utilized short read data of the genome of NA10851, available from 1,000 genome project (ftp://ftp-trace.ncbi.nih.gov/1000genomes; also generated using Illumina GA sequencing) and calculated the coverage with the SOAP alignment tool, allowing only unique alignments. Coverage spikes were observed in NA10851 in identical locations to those observed in AK1 (Supplementary Figure 19), thus confirming that these are not alignment artifacts nor unique to AK1 sequences. Presumably, coverage spikes originate from DNA sequences derived from the sequence gap regions of the human reference genome. Such repetitive sequence considerations also affect the coverage ratio between autosomal and sex chromosomes. Multiple, random, and unique alignment of reads to autosomes and sex chromosomes resulted in vs fold coverage, vs fold coverage, and vs fold coverage, respectively. The autosome to sex chromosome coverage approaches the expected 2:1 ratio only when the unique alignment method is applied. In other words, the inflation of coverage value by repetitive sequence is greater in the sex chromosomes and, in particular, the Y chromosome, where a 38x coverage value is obtained upon multiple alignment (Supplementary Figure 18). For reference, we utilized the short read data of the NA10851 genome mentioned above, and calculated the coverage with the SOAP alignment tool allowing random alignment, and found that the coverage of Y chromosome was also inflated in NA10851 (Supplementary Figure 19). Validation of SNP, indel and CNV by Sanger sequencing Numerous variants detected in Illumina GA sequence data were validated using PCR and Sanger sequencing. Primer information and electropherogram results are shown in Supplementary Table 23 and Supplementary Figure 16, respectively. Of 65 SNP validations attempted, all were confirmed as true SNPs. Also, these validation experiments confirmed the validity of imputed zygosities for all 65 SNPs. Among the 65 SNPs, 29 were randomly selected and 36 were nssnps. They included 28 homo- or hemizygous SNPs and 37 heterozygous SNPs. Among the 65 SNPs, 35 were novel. Such verification, together with the confirmatory SNP BeadArray experiments, provided a high degree of confidence that the SNP filters selected provided accurate SNP calls and zygosity imputation. 60 indels identified in whole genome Illumina sequences were resequenced using PCR and Sanger sequencing. Among them were 13 indels that are not described in Supplementary Table 11, having been removed due to a mismatch between the reference genome and reference transcript. 14 additional indels which were discovered in deep sequencing of the minimum tiling BAC path of chromosome 20 but undercalled in whole genome sequencing were also validated using PCR and Sanger sequencing. Using these results, the indel-detection filters were optimized, and the PPV and sensitivity for indel detection and the PPV for zygosity determination were calculated. In PCR and Sanger sequencing, the homo- or heterozygousity of indels was determined on the basis of the 7

8 presence of asynchronous peaks caused by a frame-shift observed in the chromatogram. For CNV regions, we attempted validation of 40 AK1 deletion regions found from targeted BAC sequencing of 1,132 clones that covered 390 CNV regions selected from the public database (DGV). Some of the deletion regions could not be validated by Sanger sequencing due to failure of primer design within repetitive flanking sequence. Ten regions were successfully amplified using PCR, and all were confirmed by Sanger sequencing. Such verification, together with the confirmatory CGH array experiments, provided a high degree of confidence that the CNV filters selected provided accurate CNV calls and zygosity imputation. Heterozygosity of SNPs Comparative analysis of SNPs from 5 individual genomes revealed a high ratio of heterozygous SNPs in AK1 (Supplementary Table 9). The hetero- to homozygous SNP ratio and the π value (heterozygous SNP density) were 1.5 and 0.74/kb, respectively, second only to those of Yoruba NA A direct comparison between the published individual genomes and AK1 is not readily feasible owing to the differences in sequencing platforms, coverage, alignment algorithms and SNP filter criteria. In particular, the low fold-coverage of several individual genomes (7x) is likely to yield potential undercalls for heterozygous SNPs, further hindering with analysis. When compared, heterozygous SNPs in the AK1 genome were more than those of YH but fewer than those in NA The absolute number of homozygous SNPs in AK1 and YH were similar (1.34 million and 1.35 million, respectively). In contrast, there were about 0.4 million additional heterozygous SNPs in AK1 compared to YH (2.11 and 1.72 million found in AK1 and YH, respectively). Heterozygosity can be elevated as the ancestral diversity broadens. Detection of novel, low allele frequency SNPs (which are more likely to be heterozygous than common SNPs) through precise sequencing also increases heterozygosity. Failure to properly eliminate false positive raw SNPs due to inaccurate filter conditions may contribute to the increase in heterozygosity as well. Much attention is thus required in interpreting heterozygosity. It is unlikely that the number of heterozygous SNPs in AK1 increased as a result of false positives, since the false positive rate of SNP calls in AK1 was well defined at below 0.1% through filter training using microarrays, in addition to the fact that no false positives were identified while validating 65 SNPs via Sanger sequencing (Supplementary Table 3, 4, 7, 8). We investigated the ratio of hetero- to homozygote SNPs using common autosomal SNPs found in AK1, YH and NA As a result, we found a total of 1,251,853 common SNPs, with 0.58, 0.57 and 0.75, hetero- to homozygote SNP ratio in AK1, YH and NA18507, respectively. Similarly, when we compared about 4 million common SNPs defined by HapMap Phase II, the hetero- to homozygote SNP ratio in AK1, YH and NA18057 was 0.93, 1.00 and 1.24, respectively. Thus, it should be noted that the ratio of hetero- to homozygous SNPs varied dependent on the SNP sets used for the analysis, and there was little or no difference between AK1 and YH when we used common SNP sets. Presumably, the difference in heterozygosity of total SNPs found in AK1 and YH may be due to technical differences in finding rare novel SNPs rather than differences between the genomes. One of the technical differences may reside in the predominant usage of single-end libraries (about 60% of total reads) in YH sequencing compared to AK1 sequencing (less than 20% of reads generated from single-end libraries). To understand the heterozygosity issue in more detail, will require further individual genome sequences to be generated using common technical approaches. Sensitivity of SNP and short indel detection Validating sensitivity estimates is more difficult than validating the accuracy of detection for nucleotide variants obtained by diploid, whole genome sequencing. This is primarily due to the 8

9 requirement of additional independent experimentation to find true genetic variations not present in the sequence set. We set the filter condition to represent 95% sensitivity during the process of SNP filter training using the 370k microarray. To confirm the accuracy of this sensitivity value, two additional methods were employed to verify the sensitivity of SNP detection. First, we used the 610k array, containing 240k probes not present on the 370k array used for filter training. Upon hybridization of AK1 genomic DNA to the second microarray, we were able to detect an additional 51,413 homozygous and 63,838 heterozygous SNPs. Comparative analysis of the latter with the SNPs found through diploid whole genome sequencing yielded 97.97% and 90.52% sensitivity for homozygous and heterozygous SNPs, respectively (Supplementary Table 7). Also, we were able to distinguish homozygous and heterozygous SNPs with 99.08% accuracy (Supplementary Table 8). Second, the results obtained from high-depth BAC clone sequencing of chromosome 20 helped confirm sensitivity values. Some advantages of using BAC clones include: easier separation of SNPs from sequencing errors; flexibility of high-depth sequencing with >150x coverage depth through targeted sequencing. From the Mb blocks of pure haploid regions with no clone overlap (from 742 BAC clones that cover chromosome 20 with a minimum tiling path) we were able to obtain a list of SNPs with high confidence (16,101 SNPs) by selecting the regions where >90% of aligned reads contain the SNP % of SNPs (15,140) overlapped with SNPs found in diploid whole genome sequencing. This result further confirms our high sensitivity of SNP detection since it is similar to values obtained with the 370k (94.40%) and 610k (93.84%) arrays. In the case of short indels, there is no comparable high throughput genotyping system available. However, we verified the sensitivity of short indel detection using the high-depth BAC clone sequencing data of chromosome 20, similar to our SNP sensitivity validation. We found 688 short indels in the haploid sequences of chromosome 20, of which 76.60% (527) overlapped with the short indels detected in diploid whole genome sequences. A more detailed discussion of the relatively low sensitivity for indel detection compared to SNP detection with similar filter conditions is described below. Errors in detection of homozygous or heterozygous indels in local repetitive or homopolymeric regions In the process of validating sensitivity of indel detection, we noticed that, in contrast to SNP detection, detection of indels in short reads is not always perfect, as mentioned above. The relatively low sensitivity of detection of short indels is partly explained by problems in indel detection in local repeat regions. An example of this phenomenon is shown in the Supplementary Figure 18 where there is a threebase insertion in the AK1 genome, which was confirmed to be homozygous via Sanger sequencing (the electropherogram can be found in Supplementary Figure 16). From this example, we have found that insertions or deletions in local repeat regions may contribute to the underestimation of indels in two possible ways. First, we cannot determine where the 3 bases are inserted exactly, because we can interpret this insertion as 6 possible combinations (AAC, ACA, or CAA insertion in two different positions for each kind) due to local repeat structure or symmetry. In fact, we found that GSNAP, the alignment program we used, decoded this case as two different insertions, neither of which passed the initial filter conditions. To overcome this problem, we merged close insertions or deletions found in local repeat sequences. A second problem occurred during the alignment process. Supplementary Figure 20 shows that only 4 reads confirmed a deletion among 7 uniquely aligned reads at the site illustrated because the 3 sequence reads containing the indel region near their ends were perfectly aligned to this location as 9

10 wild types. For these reasons, some erroneous underestimation of indels was unavoidable, resulting in a homozygote indel being called a heterozygote or a heterozygote indel as wild type. Hence, threshold of percent reads required to call an indel as a homozygote was defined as lower than that for homozygous SNPs ( 60% and 90%, respectively). Under such filter conditions, PPV for calling homo-/hemizygous and heterozygous indel were estimated to be 100% (45/45) and 93.3% (14/15) from validated indels using the Sanger method. When the read lengths of sequences are long, e.g. ABI 3730 or GS FLX, detection of indels becomes more sensitive, since long reads have lower likelihood of aligning only part of indel region, which cannot detect indel. For example, the number of indels for Venter and Watson, whose genome is sequenced by ABI 3730 and GS FLX, is 214,691 and 222,718, respectively. Upon correction for estimated sensitivity (76.60%) of indel detection in AK1, the likely indel number in AK1 increases to 222,196, which is nearly identical to that of Ventor or Watson. Erroneous estimation is unavoidable unless a new algorithm is developed to distinguish such ambiguously aligning short reads. To our knowledge, this is the first attempt to address such limitations in homo- heterozygote indel calling as well as indel sensitivity. Mismatches between the Reference Genome and Reference Transcript Sequences We detected short indels using the human NCBI reference genome sequence as the primary reference for read alignment. When exonic indels were detected in alignments to the NCBI reference genome, there were instances where homozygous indels detected in AK1 were noted to be wild type when cross-checked with reference transcript sequences (RefSeq Transcript or CDSS or by translation and comparison with reference protein databases or other reference genome sequences, such as the Celera assembly). An example of this is the Trehalase gene, in which AK1 showed a homozygous one base insertion. This appears to be due to the presence of single base deletion in the NCBI reference genome, rather than an error in the RefSeq Transcript database since AK1 showed a normal phenotype for this gene and since RefSeq transcript differs from the NCBI reference genome at this position. It should be noted that the same indel difference was found in YH and NA18507 genomes. Upon taking the codon frame of exons into consideration, the NCBI reference genome sequence was determined to contain quite a few such differences. At present, the origin of these discrepancies is still undefined and yet to be clarified. However, for the present manuscript, the most parsimonious response was to omit reporting an insertion in the AK1 Trehalase gene, nor indels in other genes where there was discordance between the NCBI reference genome and the other databases mentioned above, and where the indel caused a frame-shift mutaiton, since the results would have had significant impact on the application of such knowledge to personalized medicine. Due to the aforementioned reasons, exonic indels found in the AK1 genome with discrepancy between the NCBI reference genome and RefSeq Transcript database were excluded from further consideration. The list of 7,910 positions where we found such differences in the reference genome is shown in Supplementary Table 24. Comparison of nssnps and genes containing nssnps in AK1, YH, and NA18507 Nonsynonymous SNP (nssnp) have high likelihood of association with a phenotype and are therefore considered prominently in personalized medicine. The determination of exonic SNPs and how many of these are nssnps can vary in respect to the definitions given to genes and the reported nucleotide boundaries of exons. By applying uniform search methods, we looked for nssnps from AK1, YH and NA18507 genome utilizing the information of genes and exon coordinates obtained through the UCSC genome browser. As a result, we found 10,162, 9,039, and 11,251 nssnps for AK1, YH and NA18507, respectively (Figure 3a). 3,775 of these were common among the 3 individuals whereas 10

11 3,511, 2,373, and 5,509 nssnps were unique to each AK1, YH and NA Interestingly, the order of nssnp numbers as well as the ratio of hetero- to homozygous SNPs were both YH < AK1 < NA This suggests that the composition of individual genomes, as well as the methods used for SNP detection, influences the number of nssnps. nssnps common to AK1 and YH alone was 1,900, which is twice as much of that common to AK1 and NA18507 alone, probably reflective of the relative closeness of genetic ancestry between AK1 and YH. The number of genes containing at least one nssnps was 5,480, 5,184, and 6,036 in AK1, YH, and NA18507, respectively. Among these, 3,123 nssnps were found common to all 3 genomes. Of nssnps found in AK1, 34.6% were unique, 18.7% overlapped with YH, 9.6% overlapped with NA18507, and 37.1% were common to all 3 genomes. The percent of genes containing nssnps was 16.8% for those unique to AK1, 14.7% for those common to AK1 and YH, 11.5% common to AK1 and NA18507, and 57.0% common to all 3 genomes. The ratio of genes containing nssnps and total nssnps seems to increase as the genomes become genetically more distant (Figure 3a). 2,060 of 5,480 genes (37.6%) in AK1 contained 2 or more nssnps which was similar to the ratio obtained in YH (36.4%) and NA18507 (39.7%). 47, 34, and 37 genes were found to contain 10 or more nssnps in AK1, YH, and NA18507, respectively. Advantage of BAC sequencing for CNV detection We examined the characteristics of CNV regions detected by Illumina GA sequencing using BAC clones. Since nearly 100 Kb of the DNA insert in BAC clones were derived from a single chromatid, CNVs were detected with greater clarity than in diploid, whole genome shotgun sequencing. We were also able to select target regions and sequence BACs to obtain high depth of coverage of CNV-rich regions, again making CNV detection much easier. We identified 40 highly reliable deletions through sequencing of 1,132 BACs representing 390 frequent CNV regions, selected from DGV. These deletions featured increases in stretched, paired sequences, sudden coverage drops and loss of heterozygous SNP calls. We were able to confirm these CNV candidates using Agilent 24M CGH and genotyping microarrays. 24M array CGH We used a customized CGH array (acgh) designed for effective screening of CNV regions (CNVR) in individual human genomes and to obtain accurate true positive CNVR information from AK1. The Agilent Human acgh, which consists of 24 million oligonuclotide probes, was used for these experiments (the detailed information on the probes will be posted through a website). The array was designed to cover both coding and non-coding genomic regions, and repeat-masked to omit repetitive regions such as Alu elements and variable number tandem repeats (VNTRs). As a control, NA10851 from the HapMap collection was run side by side, and the result analyzed with NEXUS software. The significance threshold was set at 1.0E-6, maximum contiguous probe spacing (Kbp) 1000, and the minimum number of probes per segment at 5. CNVRs were categorized using average log2ratio as standard with high gain 0.6, gain 0.2, loss -0.2, and big loss -1.0 (which are commonly used criteria). Direct comparison between Illumina GA sequencing results and acgh is not straight forward due to differences between platforms and in the control sample used. Comparisons will require further research and analysis. 11

12 Reference 1. Cavalli-Sforza, L.L., Menozzi, P., & Piazza, A. The History and Geography of Human Genes. (Princeton University Press, Princeton, 1994) 2. Tanaka, M. et al., Mitochondrial genome variation in Eastern Asian and the peopling of Japan. Genome Res. 14, (2004) 12

13 Supplementary Figures Supplementary Figure 1. Workflow summarizing AK1 genome sequencing. 13

paired-end reads. (b) Graph showing acceptable error rates of the PhiX control lane in 2 x 106 bp reads.

14 Supplementary Figure 2. Illumina GA II flowcell generating 18 billion bases of 2 x 106 bp sequences. (a) Sequences generated from all 8 lanes of a flowcell sequenced on an Illumina GA II sequencer, showing quality scores greater than 20 in both paired-end reads. (b) Graph showing acceptable error rates of the PhiX control lane in 2 x 106 bp reads. (c) Complete removal of the thymidine fluorophore in the reformulated cleavage reagent (bottom) prevents increase in the thymidine background signal (noise) compared to the previous reagents (top). (a) (b) (C) 14

15 Supplementary Figure 3. Flow chart summarizing methods underpinning SNP results. SNP filter conditions were set by comparing SNPs identified in whole genome sequences to SNPs identified on the Illumina BeadChip 370K, and were validated by comparing non-redundant SNPs identified on the 610K genotyping array with those identified in whole genome sequencing. This process was reiterated until filter sets were optimized. The resulting final filter set was used to generate SNP calls. 15

16 Supplementary Figure 4. Distribution of the percentage of genome sequencing reads calling the alternate allele for SNPs shown to be heterozygous (light grey) or homozygous (dark grey) through genotyping on the Illumina BeadChip 370K. 16

17 Supplementary Figure 5. Histogram of the distribution of indels by size. 17

18 Supplementary Figure 6. Histogram of the distribution of CDS indels by size. 18

19 Supplementary Figure 7. Correlation between SNP-indel densities on all chromosomes (per 10 Kb window). From top, mean coverage; SNP density; indel density; SNP and indel densities and depth of sequence coverage (moving average of ten 10 Kb windows and, for depth of sequence, normalized to average of the entire chromosome). Provided as separate figure (pdf). SuppFig7_AK1_SNP_indels.pdf Depicted below is a preview of the full version. 19

20 Supplementary Figure 8. Overview of approaches employed for structural variant detection. Structural variants were identified by 5 different methods and, confirmed using results from whole genome sequencing. 20

21 Supplementary Figure 9. Plots showing deletion candidates (between broken green vertical lines) found in 1,132 BAC clones. When available, both chromatids were shown to demonstrate heterozygous deletion. Blue or grey lines depict fold coverage. Provided as separate figure (pdf). SuppFig9_deletion_1132BAC_covplot.pdf Depicted below is a preview of the full version. 21

22 Supplementary Figure 10. Coverage plot showing deletion regions confirmed with whole genome sequencing. Candidate deletion regions from haploid sequencing, 370K & 610K microarray, and 24 M probe array CGH were confirmed using whole genome sequence at the deletion regions (between the two vertical green lines). The black horizontal line in the lower part of the plot represents reference genome contigs, and short black and red vertical lines depict hetero- and homozygous SNPs, respectively. The red horizontal line shows the mean coverage value of 30. Provided as separate figure (pdf). SuppFig10_AK1_deletions_covplot_heterozygosity_ver2.pdf Depicted below is a preview of the full version. 22

23 Supplementary Figure 11. Histogram showing distribution of CNVs by size detected using Agilent custom 24 M CGH array. 23

Supplementary Figure 12. Coverage plot showing CNV gain regions from 370K, 610K and Agilent array CGH platforms that were confirmed by coverage increases observed in whole genome sequencing.

24 Supplementary Figure 12. Coverage plot showing CNV gain regions from 370K, 610K and Agilent array CGH platforms that were confirmed by coverage increases observed in whole genome sequencing. The black horizontal line in the lower part of the plot represents reference genome contigs, and short black and red vertical lines depict hetero- and homozygous SNPs, respectively. The red horizontal line shows the mean coverage value of 30. CNV gain region boundaries are depicted by two vertical green lines. Provided as separate figure (pdf). SuppFig12_AK1_Gains_array_covplot_heterozygosity.pdf Depicted below is a preview of the full version. 24

Supplementary Figure 13. Coverage plot showing CNV gain regions from BAC end sequencing that were confirmed by coverage increases observed in whole genome sequencing.

25 Supplementary Figure 13. Coverage plot showing CNV gain regions from BAC end sequencing that were confirmed by coverage increases observed in whole genome sequencing. The black horizontal line in the lower part of the plot represents reference genome contigs, and short black and red vertical lines depict hetero- and homozygous SNPs, respectively. The red horizontal line shows the mean coverage value of 30. CNV gain region boundaries are depicted by two vertical green lines. Provided as separate figure (pdf). SuppFig13_BACendins3_rand_heterozygosity.pdf Depicted below is a preview of the full version. 25

26 Supplementary Figure 14. Coverage plot showing CNV gain regions identified with long-insert paired-end sequencing and confirmed by coverage increases observed in whole genome sequencing. CNV gain region boundaries are depicted by two vertical green lines. The black horizontal line in the lower part of the plot represents reference genome contigs, and short black and red vertical lines depict hetero- and homozygous SNPs, respectively. The red horizontal line shows the mean coverage value of 30. Provided as separate figure (pdf). SuppFig14_LIPE_gain.pdf Depicted below is a preview of the full version. 26

27 Supplementary Figure 15. Chromosomal analysis of AK1 showing a normal karyotype. 27

28 Supplementary Figure 16. Validation of SNPs, indels, and CNV losses by Sanger sequencing. Provided as separate figure (pdf). SuppFig16_AK1_validation_all.pdf Depicted below is a preview of the full version. 28

29 Supplementary Figure 17. Mitochondrial haplogroup of 4 individual genome sequenced using next generation sequencing technology, and their migration route according to Mitomap ( 29

30 Supplementary Figure 18. Graph depicting mean fold-coverage along each chromosome by using either of multiple, random, and unique alignments in AK1. The black line near the bottom of each graph shows the alignment with human reference contigs. The red line displays the mean coverage of 30-fold. Provided as separate figure (pdf). SuppFig18_AK1_106randUniq_covplot_oneline.pdf Depicted below is a preview of the full version. 30

31 Supplementary Figure 19. Graph depicting mean fold-coverage along each chromosome by unique alignments using the SOAP alignment tool in NA The figure shows spikes that overlap with those of AK1. Provided as separate figure (pdf). SuppFig19_NA10851soapUniq.pdf Depicted below is a preview of the full version. 31

32 Supplementary Figure 20. A representative example showing limitations of indel detection in local repetitive or homopolymeric sequence. 32

33 Supplementary Tables Supplementary Table 1. List of BAC clones with unique sequence at both ends. SuppTable01_kogenome6_double_end-clone_1132_742.xls Depicted below is a preview of the full version. 33

34 Supplementary Table 2. Sequencing run summary for AK1 genome using GA. Sequencer Run* Template DNA # Reads Read Length (bp) bp Generated Library** Insert Size (bp) S1 gdna 30,153, ,085,517,396 SE1 - S2 gdna 29,365, ,057,173,372 SE1 - S3 gdna 26,337, ,161,376 SE1 - S4 gdna 26,905, ,597,784 SE2 - S5 gdna 32,075, ,154,703,924 SE1,SE2 - S6 gdna 34,323, ,235,652,408 SE1 - S7 gdna 31,444, ,132,015,464 SE1 - S8 gdna 29,645, ,067,255,316 SE1 - S9 gdna 35,342, ,272,332,664 SE1 - S10 gdna 37,819, ,361,495,196 SE1 - S11 gdna 43,027, ,548,978,192 SE1 - S12 gdna 43,741, ,574,677,692 SE1 - S13 gdna 40,206, ,447,448,364 SE1 - S14 gdna 79,097, ,847,494,700 PE1 - P1 gdna 28,131,026 2 x 36 2,025,433,872 PE1,PE2 130,390 P2 gdna 45,861,896 2 x 36 3,302,056,512 PE1 130 P3 gdna 36,516,167 2 x 36 2,629,164,024 PE1 130 P4 gdna 36,211,167 2 x 36 2,607,204,024 PE1 130 P5 gdna 59,918,876 2 x 36 4,314,159,072 PE1 130 P6 gdna 57,012,545 2 x 36 4,104,903,240 PE1 130 P7 gdna 75,097,500 2 x 36 5,407,020,000 PE1 130 P8 gdna 53,246,895 2 x 36 3,833,776,440 PE1 130 P9 gdna 72,835,614 2 x 36 5,244,164,208 PE1 130 P10 gdna 78,516,427 2 x 36 5,653,182,744 PE1 130 P11 gdna 72,130,403 2 x 36 5,193,389,016 PE1 130 P12 gdna 67,711,888 2 x 36 4,875,255,936 PE1 130 P13 gdna 65,120,193 2 x 36 4,688,653,896 PE2 390 P14 gdna 74,961,071 2 x 36 5,397,197,112 PE2 390 P15 gdna 61,661,384 2 x 88 10,852,403,584 PE2 390 P16 gdna 88,708,061 2 x ,806,108,932 LIPE 2700 SB1 Chr.20 BAC 17,160, ,793,156 BAC1 - PB1 Chr.20 BAC 26,813,486 2 x 36 1,930,570,992 BAC2 130 PB2 Chr.20 BAC 64,973,571 2 x 36 4,678,097,112 BAC2 130 PB3 Chr.20 BAC 49,523,603 2 x 36 3,565,699,416 BAC3 130 PB4 Chr.20 BAC 48,346,754 2 x 36 3,480,966,288 BAC4 390 PB5 WG BAC 52,846,314 2 x 36 3,804,934,608 BAC5 390 PB6 WG BAC 64,217,376 2 x 36 4,623,651,072 BAC6 390 Total Reads 3,097,371,573 Total bp 130,337,289,104 * Sequencer Run ** Library S: Whole Genome Single End Run SE1: Single End library 1 P: Whole Genome Paired End Run SE2: Single End library 2 SB: BAC Single End Run PB: BAC Paired End Run PE1: Paired End library 1, insert size 130bp PE2: Paired End library 2, insert size 390bp BAC1: chr.20 BAC Single End library BAC2: chr.20 BAC Paired End library 1, 742BACs BAC3: chr.20 BAC Paired End library 2, 742BACs BAC4: chr.20 BAC Paired End library 3, 43BACs BAC5: whole chr. BAC Paired End library, ~600BACs BAC6: whole chr. BAC Paired End library, ~600BACs LIPE: Long Insert size Paired End library 34

Supplementary Table 3. List of filter training conditions obtained by comparing SNP calls identified through sequencing with SNPs genotyped on the Illumina 370K.

35 Supplementary Table 3. List of filter training conditions obtained by comparing SNP calls identified through sequencing with SNPs genotyped on the Illumina 370K. Filters that optimize PPV and sensitivity for autosomal and sex chromosome SNPs were chosen and are highlighted in yellow. SuppTable03_370k_filters_ver200.xls Depicted below is a preview of the full version. 35

36 Supplementary Table 4. List of filter training conditions for differentiating heterozygous and homozygous SNP calls identified through sequencing with SNPs genotyped on the Illumina 370K. Yellow highlights the final filter. minpct unique_call avgq Minimum pct for detecting homozygous SNPs Correctly identified true positive SNPs Het called as Homo Hom called as Het Total Het/Homo Mismatched , % , % optimized % % PPV 36

37 Supplementary Table 5. Statistics of SNPs identified in AK1 genome sequences through application of optimized filters. Total Homo/Hemizygous Heterozygous Total SNPs Autosomal SNPs Sex Chromosomal SNPs CDS nssnps Novel * non-exonic UTR CDS * Compared with dbsnp

38 Supplementary Table 6. List of nssnps detected in the AK1 genome. SuppTable6_nsSnp_AK1.xls Depicted below is a preview of the full version. 38

39 Supplementary Table 7. Validation of filter conditions for detection of SNPs by comparison of non-redundant SNPs detected with an Illumina Beadchip 610K with those detected by whole genome sequencing for autosomal and sex chromosomes. Diploid region (autosomes & pseudo-autosome) het minpct reads_call unique_cal l pct avgq maxq True Positives False Positives PPV Sens Absolute Sens raw all % raw hetero % raw homo % optimized all % optimized hetero % optimized homo % Haploid (Sex chromosome excluding pseudo-autosomal region) het minpct reads_call unique_ca ll pct avgq maxq True Positives False Positives PPV Relative Sens Absolute Sens raw hemi optimized hemi

40 Supplementary Table 8. Validation of filter conditions for differentiation of hetero- and homozygous SNP calls by comparison of non-redundant SNPs detected using Illumina 610K with those detected by whole genome sequencing. minpct unique_call avgq Minimum pct for detecting homozygous SNPs Correctly identified true positive SNPs Het called as Homo Hom called as Het Total Het/Homo Mismatched PPV optimized % 40

41 Supplementary Table 9. Comparative analysis of SNPs and indels found in five personal genomes. Genome Method Depth db NCBI Bases % Cov. # SNPs 1 SNP # hetero # homo % new # ns # cds # indels % new # indels # cd Het/Hom π (SNP) SNP Ref of NCBI Ref NCBI Density SNP 1 SNP 1 SNPs 2 SNP SNP indels same 3 indels SNP (10-4 ) Venter Sanger Watson Roche YH Illumina NA18507 Illumina AK1 Illumina % : In reads aligned to NCBI ref., with application of filters and validated by genotyping array data 2: Not in dbsnp 3: # indels of size < 3 nt 41

42 Supplementary Table 10. Summary statistics for short indels identified in AK1 genome. Diploid Region Haploid Region Total Novel * Total Novel Homozygous 67,212 36,975 3,988 2,709 non-exonic Heterozygous 96,994 64, Subtotal 164, ,674 3,988 2,709 Homozygous UTR Heterozygous 1, Subtotal 1,762 1, Homozygous CDS Heterozygous Subtotal Sum Total 166, ,833 4,027 2,736 * Compared with dbsnp 129 indel size frequency

43 Supplementary Table 11. List of 170,202 short indels of AK1 genome. SuppTable11_AK1_Indel_090506_list.txt Depicted below is a preview of the full version. 43

44 Supplementary Table 12. List of 212 CDS indels for AK1 genome. SuppTable12_AK1_CDS_indels.xls Depicted below is a preview of the full version. 44

45 Supplementary Table 13. OMIM annotation for 26 homozygous CD indels of AK1 genome. Gene Symbol OMIM ID Official Full Name Disease Relationship ABCF ATP-BINDING CASSETTE, SUBFAMILY F, MEMBER 1 - APOBEC3H APOLIPOPROTEIN B mrna EDITING ENZYME, CATALYTIC POLYPEPTIDE-LIKE 3H; APOBEC3H Susceptibility to retroviral infections; the anti-hiv activity of APOBEC3H ATG9B AUTOPHAGY 9, S. CEREVISIAE, HOMOLOG OF, B - CDCP CUB DOMAIN-CONTAINING PROTEIN 2 Predicted to have oxidoreductase activity and is potentially involved in cholesterol biosynthesis CLTCL CLATHRIN, HEAVY POLYPEPTIDE-LIKE 1; CLTCL1 - CREB3L camp RESPONSE ELEMENT-BINDING PROTEIN 3-LIKE 2 member of the old astrocyte specifically induced substance (OASIS) DNA binding and basic leucine zipper dimerization (bzip) family of transcription factors CYP4F Expression analysis of CYP4F8 shows that high levels are found in epithelial linings and CYTOCHROME P450, FAMILY 4, SUBFAMILY F, POLYPEPTIDE 8; CYP4F8 epidermis of psoriatic lesions DHDH DIMERIC DIHYDRODIOL DEHYDROGENASE; DHDH - EMCN ENDOMUCIN Interfere with the assembly of focal adhesion complexes and inhibits interaction between cells and the extracellular matrix FAM26B CALCIUM HOMEOSTASIS MODULATOR 2; CALHM2 - GAS2L GROWTH ARREST-SPECIFIC 2-LIKE 1; GAS2L1 - ITIH INTER-ALPHA-TRYPSIN INHIBITOR, HEAVY CHAIN 5 Suggested that ITIH5 may play a role in breast tumorigenesis MUC MUCIN 19, OLIGOMERIC; MUC19 - PADI PEPTIDYLARGININE DEIMINASE, TYPE VI Common variants on 1p36 and 1q42 are associated with cutaneous basal cell carcinoma but not with melanoma or pigmentation traits; PAX PAIRED BOX GENE 8 Essential for the formation of thyroxine-producing follicular cells PLCD PHOSPHOLIPASE C, DELTA-3; PLCD3 - PNLIPRP PANCREATIC LIPASE-RELATED PROTEIN 2; PNLIPRP2 - PRLHR PROLACTIN-RELEASING HORMONE RECEPTOR PRLHR is a 7-transmembrane domain receptor for prolactin-releasing hormone REXO RNA EXONUCLEASE 1, S. CEREVISIAE, HOMOLOG OF; REXO1 - RHBG RHESUS BLOOD GROUP, B GLYCOPROTEIN - RTTN ROTATIN; RTTN RTTN is required for the early developmental processes of left-right (L-R) specification and axial rotation and may play a role in notochord development SARM STERILE ALPHA AND TIR MOTIFS-CONTAINING PROTEIN 1; SARM1 A candidate gene in the onset of hereditary infectious/inflammatory diseases. SEBOX SKIN-, EMBRYO-, BRAIN-, AND OOCYTE-SPECIFIC HOMEOBOX - SLC38A SOLUTE CARRIER FAMILY 38 (AMINO ACID TRANSPORTER), MEMBER 3 A putative tumour suppressor gene SMAD SMAD family member 5 Associated with schizophrenia; up-regulated Smad5 mediates apoptosis of gastric epithelial cells induced by Helicobacter pylori infection; activation by BMP-2 in lung cancer cells ZNF ZINC FINGER PROTEIN

46 Supplementary Table 14. Pearson correlations between SNP and indel densities in AK1 and YH genomes. SuppTable14_snp_indel_correlation_ver300.xls Depicted below is a preview of the full version. 46

47 Supplementary Table 15. List of CNV regions identified by the custom Agilent 24M probe CGH array. SuppTable15_AK1_Agilent24M_CNVR.xls Depicted below is a preview of the full version. 47

48 Supplementary Table 16. List of CNV regions identified by the Illumina Beadchip 370K and 610K. Platform Chr Length Start Stop CopyNumber Confidence 1st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence:

49 Supplementary Table 17. List of deletions in AK1 genome confirmed by whole genome GA sequencing. SuppTable17_deletion_list_ver3.xls Depicted below is a preview of the full version. 49

50 Supplementary Table 18. List of regions with copy number gains in AK1 genome generated by Illumina Beadchip 370K, 610K and the Agilent 24M CGH array, and confirmed Copy Number Gains detected using microarrays and confirmed using Genome Analyzer Index Chr Size (nt) Start End Affected Loci Accession # * Method(s) PDE4DIP Variation_0685 Agilent24M Variation_4249 Variation_ RNASEH1 Agilent24M K1,370K2,610K,Agilent24M SPAST Agilent24M MRPL19, C2orf3 610K,Agilent24M K,Agilent24M abpart 610K ZC3H6 Agilent24M Agilent24M BFSP2 Agilent24M FGF12 Agilent24M MUC20 Agilent24M MUC4 Agilent24M DEFB131, LOC Variation_ K Variation_31214 Agilent24M FLJ43080 Agilent24M PCDHB7, PCDHB8 Agilent24M C6orf127, CLPS Agilent24M Agilent24M FAM115C Agilent24M Variation_22883 Agilent24M Variation_ Variation_38082 Agilent24M GTF3C5, CEL Agilent24M CACNA1B Agilent24M Variation_ K Variation_24747 Variation_24748 Variation_ DNAJC15 Agilent24M Agilent24M HNRNPA1L2, SUGT1 610K,Agilent24M Agilent24M Agilent24M Variation_4853 Agilent24M LOC727832, POTEB, OR4M2 Variation_ K,Agilent24M Variation_ LOC K,Agilent24M K Agilent24M Variation_5434 Agilent24M CES1 Agilent24M Agilent24M FKSG24, RAB3A Agilent24M ZNF676, LOC342994, ZNF98, ZNF492, ZNF99 370K1,370K2,610K,Agilent24M FFAR3 Variation_32249 Agilent24M Variation_7201 Variation_ ZNF468 Variation_ K,Agilent24M Variation_ VPREB1 Variation_10609 Agilent24M Agilent24M 45 X Agilent24M * Compared with database of genomic variants (DGV) (Nov ) 50

51 Supplementary Table 19. List of regions with copy number gains in AK1 genome identified by BAC end sequencing and confirmed by whole genome GA sequencing. Copy Number Gains using BAC end sequences Index Chr Estimated Size Start Stop Gene Accession # Methods RMND5A BAC end sequencing LOC BAC end sequencing FRG2C, BC BAC end sequencing BAC end sequencing CTAGE4,FLJ43692,OR2A1,OR2A7,ARHGEVariation_31398 BAC end sequencing Variation_ ANKRD30A BAC end sequencing HYDIN, KIAA1864 BAC end sequencing A26B2 BAC end sequencing 9 Y RBMY1 BAC end sequencing * Compared with database of genomic variants (DGV) (Nov ) 51

52 Supplementary Table 20. List of regions with copy number gains in AK1 genome detected by compressed pairs in LIPE (long-inserted paired-end) whole genome sequencing and confirmed by whole genome coverage depth. Copy Number Gains using LIPE Index Chr Expected size of Gains (bp) Start Stop Gene Accession # * avg_insert_size Method EPHA8 Variation_ LIPE LIPE DLEC LIPE SORCS LIPE Variation_ LIPE LIPE ICA LIPE LIPE LIPE GOLM LIPE LIPE LIPE LIPE LIPE CLEC17A LIPE LIPE PEPD LIPE HNF4A LIPE LIPE LIPE LIPE 22 X LIPE 23 X LIPE * Compared with database of genomic variants (DGV) (Nov ) 52

53 Supplementary Table 21. List of gene ontologies for genes in which nssnps were found, showing the total number of genes containing nssnps and the number of genes containing nssnps that were common to AK1, YH, and NA SuppTable21_GO_analysis.xls Depicted below is a preview of the full version. 53

54 Supplementary Table 22. List of 773 SNPs with potential clinical implications identified by the Trait-o-matic algorithm. SuppTable22_AK1_773_TraitOmatic.xls Depicted below is a preview of the full version. 54

55 Supplementary Table 23. PCR primers used for validation of SNPs, short indels, and CNVs by Sanger sequencing. SuppTable23_all_validation_list_primers_ver600.xls Depicted below is a preview of the full version. 55

Nature Biotechnology: doi: /nbt.1904

Nature Biotechnology: doi: /nbt.1904 Supplementary Information Comparison between assembly-based SV calls and array CGH results Genome-wide array assessment of copy number changes, such as array comparative genomic hybridization (acgh), is