SUPPLEMENTARY INFORMATION

Size: px
Start display at page:

Download "SUPPLEMENTARY INFORMATION"

Transcription

1 Supplementary methods Genome Sequencing Paired-end and singleton 36 nucleotide-long reads were generated using Illumina GA and GAII instruments using standard protocols 1. Sequences were retained if of high quality (defined as those passing the default quality filtering parameters used in the Illumina GA Pipeline GERALD stage). Sequencing runs with poor average nucleotide quality score graphs (associated with a minority of reads passing the default quality filtering parameters) were discarded. Sequence Alignment High-quality reads were aligned to human reference genome build 36.3 using the GSNAP 2 alignment tool, a derivative of GMAP 3 that was developed for alignment of short, paired reads, allowing up to 5% mismatches. Only the highest scoring alignment(s) was retained. SNPs and indels were identified using the Alpheus software system 4-6. GSNAP is a short-read alignment program based on GMAP that employs a hash table and a compressed version of the reference genome, which is constructed once for that genome. The reference may include arbitrary contigs (up to 4 billion), so that one may also align to a reference transcriptome, with redundancy allowed among the contigs. The hash table contains the locations of a given 12-mer in the genome, subject to sampling. The sampling step occurs during pre-processing of the genome, so that genomic locations are stored only for every third 12-mer in the genome. Sampling is needed to reduce the memory footprint of the program below 4 gigabytes for a human-sized genome. GSNAP can handle short reads of > 24 or more nucleotides (nt), with each read in the input potentially having a varying length. There is theoretically no upper bound on the length of the query sequence, except that this bound is compiled into GSNAP by default at 200 nt; longer sequences can be handled simply by changing this constant at compile time. However, sequences of > 200 nt are probably best handled by other programs, such as GMAP. GSNAP has specialized algorithms for identifying exact mappings, one-mismatch mappings, multiple-mismatch mappings, and indel mappings (including a user-specified number of mismatches). Exact mappings are identified by taking the intersection of genomic positions over a spanning set of 12-mers in the query sequence. The spanning set must contain 12-mers in the same phase modulo 3, to account for the sampling used in pre-processing the genome, so the program must test each of the three possible phases. For spanning set members that overhang the ends of the query sequence by 1 or 2 nt, the relevant genomic positions can be obtained by substituting 1 or 2 wildcard nt, respectively, and taking the union of genomic locations in the hash table. Candidates for one-mismatch mappings are similarly identified by computing an incomplete intersection, in which one 12-mer in the spanning set does not contain the given genomic location. These candidate genomic mappings are then compared against a compressed version of the genome to verify that only one mismatch was present. Candidates for multiple-mismatch mappings are determined by processing a sorted list of genomic locations from all 12-mers in the query sequence. This sorted list is computed efficiently using a heapbased priority queue. For each candidate genomic location, a floor on the number of mismatches can be computed from the pattern of query positions of the 12-mers that match the genomic location. Candidates with a sufficiently low floor (based either on a user-specified limit or on the best mapping determined so far) are then compared against the compressed genome to determine the actual number of mismatches. 1

2 For identifying indel mappings, GSNAP accumulates partial genomic alignments during the multiple-mismatch algorithm, where a partial alignment can be supported by a single 12-mer in the query sequence. These partial alignments are then scanned in genomic order to identify pairs that are sufficiently close to constitute a candidate indel, where the default distances are 30 nucleotides for an insertion and 12 nucleotides for a deletion. These candidate pairs are then compared against the compressed genome to determine the number of mismatches. To identify indels occurring at either end of the query sequence, the program computes floors that exclude the 12-mers on either end. Candidates with a sufficiently low floor are then compared against the compressed genome to identify a possible indel at the end and to count the actual number of mismatches. Although GSNAP allows repetitive regions of the genome to be masked before building the genomic data structure, in typical usage (as herein) the genome is not pre-masked. Therefore, GSNAP is able to align sequences to redundant regions in the genome, including repetitive regions, and report all such alignments. In default mode (as herein), the program reports only the best alignments, those with the fewest mismatches, although the program also can be run to identify and report suboptimal alignments. GSNAP differs from both MAQ and ELAND in that it processes the reference genome first, constructs a hash table of the genome, and then aligns the short reads to the genome. In contrast, MAQ, ELAND, and almost all other short-read alignment programs process the short reads first, construct a hash table of the short reads, and then scan the genome to find matches. To our knowledge, the only short-read aligners that pre-process the genome other than GSNAP are BLAT (which is not specifically designed for this task and is extremely slow), SOAP (which is also extremely slow and requires 14 gigabytes of memory), Bowtie (which cannot identify indels), and Mosaik (which can identify indels, but uses Smith-Waterman alignments plus heuristic filters that appear to generate a high rate of false negatives in our testing). GSNAP does not use the Illumina quality scores (as MAQ and RMAP can do), although that is a feature that may be included in future versions of the program. Pre-release versions of GSNAP are available upon request to TDW at twu@gene.com. SNP and Indel Detection To create reliable filter conditions to exclude false positives in raw SNP calls, we compared massive parallel sequencing data with genotypes from microarrays. Array-based genotyping was performed using the Human BeadChip CNV370-Duo, the Human BeadChip 610-Quad (Illumina, San Diego, CA), and standard procedures 7. Filters optimizing PPV and sensitivity were determined in genomic positions assayed on the Human BeadChip CNV370-Duo where SNPs where identified through genotyping and/or SBS sequencing. Human BeadChip 610-Quad genomic positions that were nonredundant with those on the Human BeadChip CNV370-Duo Beadchip were used for filter validation. SNPs were called that met filters obtained by comparison to Illumina genotypes that optimized PPV and sensitivity (Supplementary Table 3, 4, 7, 8). (1) SNP filters for the autosomes: SNPs called by 4 or more unique reads, 20% or higher aligned reads, and an average quality of 20 or higher; autosomal SNPs were separated into heterozygous and homozygous classes using a 90% cutoff of percent aligned reads calling the SNP, where PPV for determining zygosity of SNPs was determined to be highest. (2) SNP filters for the sex chromosomes: SNPs called by 2 or more unique reads, 50% or higher aligned reads, and an average quality of 20 or higher. For pseudo-autosomal regions on sex chromosomes, the filter condition was same as that of the autosomes. SNPs that fell within gene boundaries were characterized as synonymous or non-synonymous by comparisons to refgene sequences from UCSC genome browser (downloaded Feb , available at 2

3 Indel calls were made with filters identified by comparison to a set of indels resequenced using Sanger technology. (Supplementary Discussion) (1) Indel filters for autosomes: indels called by 4 or more unique reads, 20% or higher of aligned reads, and an average quality of 20 or higher; autosomal indels were separated into heterozygous and homozygous classes using a 60% cutoff of percent aligned reads calling the indel. (2) Indel filters for sex chromosome: indels called by 2 or more unique reads, 50% or higher aligned reads, and an average quality of 20 or higher. For pseudo-autosomal regions on sex chromosomes, the filter condition was the same as that of the autosomes. Adjacent indels were compressed for same-sized indels within 10 nt. Coding domain indels were called by comparison to refgene as described above for SNPs. To exclude apparent indel errors identified due to problems of the reference sequence, we compared RefSeq transcript (NCBI; downloaded Sept. 2, 2008) with the reference genome sequence. Where RefSeq Transcript alleles did not match the corresponding reference genome alleles, no indel call was made. (Supplementary Table 24). Also, those indels which fell within the gene boundaries mentioned above were considered as CD indels. Validation Studies We selected 60 indels for Sanger validation. Those regions were validated using genomic DNA extracted from peripheral blood and amplified by PCR with flanking primers (Supplementary Table 23). PCR amplification was performed in 50 μl with 50 ng genomic DNA, 10 pmol of forward and reverse primer each, standard volume of Ex Taq (Takara), Ex Taq buffer (Takara), and dntps (Takara), at 95 0 C for 10 min, thirty five cycles of 95 0 C for 30s, 60 0 C for 30s, 72 0 C for 30s, and finally 72 0 C for 10 min. PCR products were purified with AccuPrep PCR purification kit (BioNeer Inc., Korea). Then the purified products were validated using ABI 3730xl DNA analyzer and ABI BigDye Terminator cycle sequencing (Applied Biosystems, Foster City, CA). Detection of true positive deletions with paired end GA II data from 1,132 BACs Large deletion in the BAC clone DNA could be detected by paired end GA sequencing data on the basis of stretched pairs and coverage fold changes. Stretched pairs indicate that certain sequence in the reference genome between these pairs are absent in the sample. Because the longest average insert size of paired-end libraries were 390 bp and the standard deviation (SD) values were less than 50 bp, we used 500 bp (> mean + 2 SD) as the conservative threshold value to define stretched pairs. To find CNV deletion, all stretched pairs from data was extracted and retained if 15 or more of them were clustered in the same region. One stretched pair has to be within 500 bp range from neighboring stretched pair in the region (This value was derived by the same reason mentioned above). Subsequently, we filtered out uncertain candidates by applying another criterion of coverage depth to choose accurate deletions. If the coverage value of a candidate region was less than half of coverage of 1 kb flanking regions of both sides, it was retained in the deletion list. Detection of Structural Variations using Microarrays In addition to 1,132 BACs, we detected putative structural variations with microarrays (Illumina BeadChip HumanCNV 370-Duo, HumanCNV 610-Quad, and Agilent 24M Human comparative genome hybridization array). For Illumina BeadChips, normalized bead intensity data and genotype calls were obtained with Illumina BeadStudio 3.1 software. Copy number variants (CNVs) were detected on the basis of 3

4 deflected log R ratios. The Agilent Human comparative genome hybridization (CGH) array, which consists of 24 million oligonucleotide probes, comprising of 24 X 1.1M Agilent acgh platform, was also used. The array is designed by Agilent to cover both coding and non-coding genomic regions, and repeat-masked to omit repeated regions such as Alu elements and variable number tandem repeats (VNTRs). Whole genomic DNA from the test and reference samples were sonicated, labeled and hybridized as per Agilent CGH array protocols. Briefly, for each of 1.1M acgh experiment, 1.5 µg of sample and reference DNA were sonicated for 1 minute to shred the genomic DNA. Random primers (Invitrogen Inc.) were added to both reference and sample DNAs and the tubes were incubated at 95 0 C for 3 minutes. The test and reference samples were labeled with fluorescent Cy-5 and Cy-3 dyes, respectively. The labeled genomic DNA was cleaned with Microcon YM-30 filters and dye-incorporation was quantified using a Nanodrop Spectophotometer to confirm the labeling and cleanup efficacy. The labeled samples were added to Cot-1 DNA (Invitrogen), Blocking agent (Agilent) and Hi-RPM (Agilent) buffer and prehybridized at 37 0 C for 2 hours. After prehybridization, the sample was applied to the 1.1M Agilent array and left for hybridization in a rotating Agilent hybridization oven at 65 0 C for 40 hours. After the hybridization, the array was immersed consecutively in wash buffer 1 and 2 for 5 and 3 minutes, respectively. After the washing, the array was scanned with Agilent DNA microarray scanner at 2 micron resolution. The resulting image was extracted by Agilent s Feature Extraction software. The log2 ratios were analyzed using NEXUS software. Each aberration call was manually checked to confirm the accuracy of the calls. Detection of Amplification with LIPE CNV insertion could be detected in Long Insert sized Paired End (LIPE) data on the basis of compressed pairs which indicated insertion of certain sequence between pair reads. The average insert size of the LIPE library was 2,700 bp and the standard deviation (SD) was about 300 bp. Based on these figures, a compressed pair was defined as one with insert size shorter than 2,000 bp (< mean 2 SD). Similar to method used to detect deletion, compressed pairs were defined to be clustered if one compressed pair was within 3,300 bp (mean + 2 SD) to neighboring compressed pair. To remove redundant pairs present in LIPE library, two or more pairs with both reads aligned to exactly the same position of reference genome were defined as one pair and named de-redundant. If 10 or more deredundant compressed pairs were found in one cluster, we defined this region as a candidate region for amplification. High-Throughput Annotation; Trait-o-matic To produce an initial list of clinically significant variants the Trait-o-matic web-service ( accepts a list of SNPs and parses this list to focus on non-synonymous coding changes. SNPs that are listed in Evidence-Base (a publicly curated database coordinated with Genetests.org, HGMD, OMIM, and SNP frequency data) are flagged for further review in separate sections by the Trait-o-matic service. Coding changes that are determined to be rare (potentially deleterious), based on the BLOSUM100 matrix, are listed in a separate section. The general availability of this web-service will facilitate a standardized interpretation of individual whole genomes by clinicians. The community activity around it should also enable the standardization of the return of data to individuals involved in research or direct-toconsumer activities. We will be replacing this soon with a wrapping webpage that includes evidencebase and the community wiki. 4

5 References Bentley, D.R. et al., Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, (2008). Nacu, S., Kan, Z., Seshagiri, S., & Wu, T., Sequencing the transcriptome in prostate cancer. Poster at the 2008 European Conference on Computational Biology ( (2008). Wu, T.D. & Watanabe, C.K., GMAP: a genomic mapping and alignment program for mrna and EST sequences. Bioinformatics 21, (2005). Miller, N.A. et al., Management of High-Throughput DNA Sequencing Projects: Alpheus. J. Comput. Sci. Syst. Biol. 1, (2008). Mudge, J. et al., Genomic convergence analysis of schizophrenia: mrna sequencing reveals altered synaptic vesicular transport in post-mortem cerebellum. PLoS ONE 3, e3625 (2008). Sugarbaker, D.J. et al., Transcriptome sequencing of malignant pleural mesothelioma tumors. Proc. Natl. Acad. Sci. USA 105, (2008). Steemers, F.J. et al., Whole-genome genotyping with the single-base extension assay. Nat. Methods 3, (2006). 5

6 Supplementary discussions Population characteristics of five individuals whose genomes have been sequenced Dr. James D. Watson father's ancestors were of English descent. His mother's father was Scots and his mother s mother was the daughter of Irish immigrants who arrived in the United States about Dr. J. Craig Venter s ancestors have been traced back to 1821 (paternal) and the 1700s (maternal) in England. Despite lack of detailed ancestry, ancestry informative markers confirm that the HapMap sample, NA18507, is Yoruban African. YH has been described as a Han Chinese. AK1 is a Korean male with at least 4 of generations of Korean ancestors. Genetic heterogeneity, relationships, and migration of human populations have been studied using various biological markers 1. The relatively small, easily sequenced mitochondrial genome has been extensively used to perform distribution and migration analyses, despite being transferred only through the mother s side. We compared the haplogroups and the sequences of mitochondrial DNA from individuals whose personal genomes are currently available (Supplementary Figure 17). The haplogroups for Watson and NA18501 were H and L1, respectively (we were unable to find mitochondrial sequence information of Venter). This is in accord with the characteristics of these populations since the haplogroups H and L1 are representative of Western Europe and West Africa, respectively. YH was categorized as haplogroup M, and further specified to be of type M7c. According to Tanaka and colleagues, M7c occurs rarely in Japan, but is observed more frequently in Sabah and Philippines, while highest diversity can be seen in China 2. We were unable to find individuals with M7c haplotypes upon examination of 72 Koreans and 58 Mongolians via total mitochondrial DNA sequencing (unpublished result). AK1 has been categorized as haplogroup D. The D haplogroup, which is frequently observed amongst East Asians as well as American Indians, has been observed in 31 Koreans (43.1%) and 12 Mongolians (20.7%), respectively (unpublished data). Technical improvement resulting in longer reads The reformulated cleavage reagent was developed by Illumina. It was used for the first time in the current study as a beta-test reagent. This reformulated cleavage reagent is now available in all Illumina reagent kits. Sequences generated using the reformulated cleavage reagent showed lower error rates and GC bias, since the reagent removed thymine fluorophores more completely and reduced background signals (Supplementary Figure 2). Hence the reformulated reagent allowed generation of sequence reads of up to 2 X 106, much longer than ordinary read length (2 X 36). Using this reagent, we succeeded in generating 18.8 Gb of passing sequence per flowcell. Longer reads are advantageous for indel calling and short read de novo assembly. Due to longer reads, we could detect indels up to 29 bases in length (Supplementary Figure 5). Coverage for autosomes and sex chromosomes using different alignment methods There exist at least three ways (multiple, random, and unique alignment) to address the issue of a selected read aligning to one or more places in the reference genome. The multiple alignment method allows the alignment of reads to multiple possible locations with identical match scores, while the random alignment randomly selects only one location. Unique alignment considers only those reads that show alignment to a single place in the reference genome. In an attempt to understand the influence of repetitive sequences, we tested all three methods of alignment. We applied the multiple alignment method (except that we discarded alignments for reads with more than five alignments) for SNP and indel detection, which maximized the efficiency of variant detection. For analyzing whole genome coverage or coverage fold-change analysis for finding CNV regions we applied the random alignment method. To identify the influence of repetitive sequence on observed spikes in average 6

7 coverage in 10 Kb windows, we evaluated the unique alignment method (Supplementary Figure 18, 19). Supplementary Figure 18 represents whole genome coverage plots for all three alignment methods. Except for the locations of spikes of especially high coverage, the overall pattern of alignments was similar. Coverage spikes were most prominent in the multiple alignment method, and least prominent in the unique alignment method. The majority of spikes were located near telomeric and centromeric gap regions, which are mostly repetitive. A closer look at the reference genome sequences around these spikes confirmed that these regions were, in fact, repetitive sequences in most cases (data not shown). Interestingly, some of the less prominent spikes were still present in coverage plots obtained using unique alignments. To investigate whether these residual spikes are artifacts originating from experimental error or specific problems of the GSNAP alignment tool, we utilized short read data of the genome of NA10851, available from 1,000 genome project (ftp://ftp-trace.ncbi.nih.gov/1000genomes; also generated using Illumina GA sequencing) and calculated the coverage with the SOAP alignment tool, allowing only unique alignments. Coverage spikes were observed in NA10851 in identical locations to those observed in AK1 (Supplementary Figure 19), thus confirming that these are not alignment artifacts nor unique to AK1 sequences. Presumably, coverage spikes originate from DNA sequences derived from the sequence gap regions of the human reference genome. Such repetitive sequence considerations also affect the coverage ratio between autosomal and sex chromosomes. Multiple, random, and unique alignment of reads to autosomes and sex chromosomes resulted in vs fold coverage, vs fold coverage, and vs fold coverage, respectively. The autosome to sex chromosome coverage approaches the expected 2:1 ratio only when the unique alignment method is applied. In other words, the inflation of coverage value by repetitive sequence is greater in the sex chromosomes and, in particular, the Y chromosome, where a 38x coverage value is obtained upon multiple alignment (Supplementary Figure 18). For reference, we utilized the short read data of the NA10851 genome mentioned above, and calculated the coverage with the SOAP alignment tool allowing random alignment, and found that the coverage of Y chromosome was also inflated in NA10851 (Supplementary Figure 19). Validation of SNP, indel and CNV by Sanger sequencing Numerous variants detected in Illumina GA sequence data were validated using PCR and Sanger sequencing. Primer information and electropherogram results are shown in Supplementary Table 23 and Supplementary Figure 16, respectively. Of 65 SNP validations attempted, all were confirmed as true SNPs. Also, these validation experiments confirmed the validity of imputed zygosities for all 65 SNPs. Among the 65 SNPs, 29 were randomly selected and 36 were nssnps. They included 28 homo- or hemizygous SNPs and 37 heterozygous SNPs. Among the 65 SNPs, 35 were novel. Such verification, together with the confirmatory SNP BeadArray experiments, provided a high degree of confidence that the SNP filters selected provided accurate SNP calls and zygosity imputation. 60 indels identified in whole genome Illumina sequences were resequenced using PCR and Sanger sequencing. Among them were 13 indels that are not described in Supplementary Table 11, having been removed due to a mismatch between the reference genome and reference transcript. 14 additional indels which were discovered in deep sequencing of the minimum tiling BAC path of chromosome 20 but undercalled in whole genome sequencing were also validated using PCR and Sanger sequencing. Using these results, the indel-detection filters were optimized, and the PPV and sensitivity for indel detection and the PPV for zygosity determination were calculated. In PCR and Sanger sequencing, the homo- or heterozygousity of indels was determined on the basis of the 7

8 presence of asynchronous peaks caused by a frame-shift observed in the chromatogram. For CNV regions, we attempted validation of 40 AK1 deletion regions found from targeted BAC sequencing of 1,132 clones that covered 390 CNV regions selected from the public database (DGV). Some of the deletion regions could not be validated by Sanger sequencing due to failure of primer design within repetitive flanking sequence. Ten regions were successfully amplified using PCR, and all were confirmed by Sanger sequencing. Such verification, together with the confirmatory CGH array experiments, provided a high degree of confidence that the CNV filters selected provided accurate CNV calls and zygosity imputation. Heterozygosity of SNPs Comparative analysis of SNPs from 5 individual genomes revealed a high ratio of heterozygous SNPs in AK1 (Supplementary Table 9). The hetero- to homozygous SNP ratio and the π value (heterozygous SNP density) were 1.5 and 0.74/kb, respectively, second only to those of Yoruba NA A direct comparison between the published individual genomes and AK1 is not readily feasible owing to the differences in sequencing platforms, coverage, alignment algorithms and SNP filter criteria. In particular, the low fold-coverage of several individual genomes (7x) is likely to yield potential undercalls for heterozygous SNPs, further hindering with analysis. When compared, heterozygous SNPs in the AK1 genome were more than those of YH but fewer than those in NA The absolute number of homozygous SNPs in AK1 and YH were similar (1.34 million and 1.35 million, respectively). In contrast, there were about 0.4 million additional heterozygous SNPs in AK1 compared to YH (2.11 and 1.72 million found in AK1 and YH, respectively). Heterozygosity can be elevated as the ancestral diversity broadens. Detection of novel, low allele frequency SNPs (which are more likely to be heterozygous than common SNPs) through precise sequencing also increases heterozygosity. Failure to properly eliminate false positive raw SNPs due to inaccurate filter conditions may contribute to the increase in heterozygosity as well. Much attention is thus required in interpreting heterozygosity. It is unlikely that the number of heterozygous SNPs in AK1 increased as a result of false positives, since the false positive rate of SNP calls in AK1 was well defined at below 0.1% through filter training using microarrays, in addition to the fact that no false positives were identified while validating 65 SNPs via Sanger sequencing (Supplementary Table 3, 4, 7, 8). We investigated the ratio of hetero- to homozygote SNPs using common autosomal SNPs found in AK1, YH and NA As a result, we found a total of 1,251,853 common SNPs, with 0.58, 0.57 and 0.75, hetero- to homozygote SNP ratio in AK1, YH and NA18507, respectively. Similarly, when we compared about 4 million common SNPs defined by HapMap Phase II, the hetero- to homozygote SNP ratio in AK1, YH and NA18057 was 0.93, 1.00 and 1.24, respectively. Thus, it should be noted that the ratio of hetero- to homozygous SNPs varied dependent on the SNP sets used for the analysis, and there was little or no difference between AK1 and YH when we used common SNP sets. Presumably, the difference in heterozygosity of total SNPs found in AK1 and YH may be due to technical differences in finding rare novel SNPs rather than differences between the genomes. One of the technical differences may reside in the predominant usage of single-end libraries (about 60% of total reads) in YH sequencing compared to AK1 sequencing (less than 20% of reads generated from single-end libraries). To understand the heterozygosity issue in more detail, will require further individual genome sequences to be generated using common technical approaches. Sensitivity of SNP and short indel detection Validating sensitivity estimates is more difficult than validating the accuracy of detection for nucleotide variants obtained by diploid, whole genome sequencing. This is primarily due to the 8

9 requirement of additional independent experimentation to find true genetic variations not present in the sequence set. We set the filter condition to represent 95% sensitivity during the process of SNP filter training using the 370k microarray. To confirm the accuracy of this sensitivity value, two additional methods were employed to verify the sensitivity of SNP detection. First, we used the 610k array, containing 240k probes not present on the 370k array used for filter training. Upon hybridization of AK1 genomic DNA to the second microarray, we were able to detect an additional 51,413 homozygous and 63,838 heterozygous SNPs. Comparative analysis of the latter with the SNPs found through diploid whole genome sequencing yielded 97.97% and 90.52% sensitivity for homozygous and heterozygous SNPs, respectively (Supplementary Table 7). Also, we were able to distinguish homozygous and heterozygous SNPs with 99.08% accuracy (Supplementary Table 8). Second, the results obtained from high-depth BAC clone sequencing of chromosome 20 helped confirm sensitivity values. Some advantages of using BAC clones include: easier separation of SNPs from sequencing errors; flexibility of high-depth sequencing with >150x coverage depth through targeted sequencing. From the Mb blocks of pure haploid regions with no clone overlap (from 742 BAC clones that cover chromosome 20 with a minimum tiling path) we were able to obtain a list of SNPs with high confidence (16,101 SNPs) by selecting the regions where >90% of aligned reads contain the SNP % of SNPs (15,140) overlapped with SNPs found in diploid whole genome sequencing. This result further confirms our high sensitivity of SNP detection since it is similar to values obtained with the 370k (94.40%) and 610k (93.84%) arrays. In the case of short indels, there is no comparable high throughput genotyping system available. However, we verified the sensitivity of short indel detection using the high-depth BAC clone sequencing data of chromosome 20, similar to our SNP sensitivity validation. We found 688 short indels in the haploid sequences of chromosome 20, of which 76.60% (527) overlapped with the short indels detected in diploid whole genome sequences. A more detailed discussion of the relatively low sensitivity for indel detection compared to SNP detection with similar filter conditions is described below. Errors in detection of homozygous or heterozygous indels in local repetitive or homopolymeric regions In the process of validating sensitivity of indel detection, we noticed that, in contrast to SNP detection, detection of indels in short reads is not always perfect, as mentioned above. The relatively low sensitivity of detection of short indels is partly explained by problems in indel detection in local repeat regions. An example of this phenomenon is shown in the Supplementary Figure 18 where there is a threebase insertion in the AK1 genome, which was confirmed to be homozygous via Sanger sequencing (the electropherogram can be found in Supplementary Figure 16). From this example, we have found that insertions or deletions in local repeat regions may contribute to the underestimation of indels in two possible ways. First, we cannot determine where the 3 bases are inserted exactly, because we can interpret this insertion as 6 possible combinations (AAC, ACA, or CAA insertion in two different positions for each kind) due to local repeat structure or symmetry. In fact, we found that GSNAP, the alignment program we used, decoded this case as two different insertions, neither of which passed the initial filter conditions. To overcome this problem, we merged close insertions or deletions found in local repeat sequences. A second problem occurred during the alignment process. Supplementary Figure 20 shows that only 4 reads confirmed a deletion among 7 uniquely aligned reads at the site illustrated because the 3 sequence reads containing the indel region near their ends were perfectly aligned to this location as 9

10 wild types. For these reasons, some erroneous underestimation of indels was unavoidable, resulting in a homozygote indel being called a heterozygote or a heterozygote indel as wild type. Hence, threshold of percent reads required to call an indel as a homozygote was defined as lower than that for homozygous SNPs ( 60% and 90%, respectively). Under such filter conditions, PPV for calling homo-/hemizygous and heterozygous indel were estimated to be 100% (45/45) and 93.3% (14/15) from validated indels using the Sanger method. When the read lengths of sequences are long, e.g. ABI 3730 or GS FLX, detection of indels becomes more sensitive, since long reads have lower likelihood of aligning only part of indel region, which cannot detect indel. For example, the number of indels for Venter and Watson, whose genome is sequenced by ABI 3730 and GS FLX, is 214,691 and 222,718, respectively. Upon correction for estimated sensitivity (76.60%) of indel detection in AK1, the likely indel number in AK1 increases to 222,196, which is nearly identical to that of Ventor or Watson. Erroneous estimation is unavoidable unless a new algorithm is developed to distinguish such ambiguously aligning short reads. To our knowledge, this is the first attempt to address such limitations in homo- heterozygote indel calling as well as indel sensitivity. Mismatches between the Reference Genome and Reference Transcript Sequences We detected short indels using the human NCBI reference genome sequence as the primary reference for read alignment. When exonic indels were detected in alignments to the NCBI reference genome, there were instances where homozygous indels detected in AK1 were noted to be wild type when cross-checked with reference transcript sequences (RefSeq Transcript or CDSS or by translation and comparison with reference protein databases or other reference genome sequences, such as the Celera assembly). An example of this is the Trehalase gene, in which AK1 showed a homozygous one base insertion. This appears to be due to the presence of single base deletion in the NCBI reference genome, rather than an error in the RefSeq Transcript database since AK1 showed a normal phenotype for this gene and since RefSeq transcript differs from the NCBI reference genome at this position. It should be noted that the same indel difference was found in YH and NA18507 genomes. Upon taking the codon frame of exons into consideration, the NCBI reference genome sequence was determined to contain quite a few such differences. At present, the origin of these discrepancies is still undefined and yet to be clarified. However, for the present manuscript, the most parsimonious response was to omit reporting an insertion in the AK1 Trehalase gene, nor indels in other genes where there was discordance between the NCBI reference genome and the other databases mentioned above, and where the indel caused a frame-shift mutaiton, since the results would have had significant impact on the application of such knowledge to personalized medicine. Due to the aforementioned reasons, exonic indels found in the AK1 genome with discrepancy between the NCBI reference genome and RefSeq Transcript database were excluded from further consideration. The list of 7,910 positions where we found such differences in the reference genome is shown in Supplementary Table 24. Comparison of nssnps and genes containing nssnps in AK1, YH, and NA18507 Nonsynonymous SNP (nssnp) have high likelihood of association with a phenotype and are therefore considered prominently in personalized medicine. The determination of exonic SNPs and how many of these are nssnps can vary in respect to the definitions given to genes and the reported nucleotide boundaries of exons. By applying uniform search methods, we looked for nssnps from AK1, YH and NA18507 genome utilizing the information of genes and exon coordinates obtained through the UCSC genome browser. As a result, we found 10,162, 9,039, and 11,251 nssnps for AK1, YH and NA18507, respectively (Figure 3a). 3,775 of these were common among the 3 individuals whereas 10

11 3,511, 2,373, and 5,509 nssnps were unique to each AK1, YH and NA Interestingly, the order of nssnp numbers as well as the ratio of hetero- to homozygous SNPs were both YH < AK1 < NA This suggests that the composition of individual genomes, as well as the methods used for SNP detection, influences the number of nssnps. nssnps common to AK1 and YH alone was 1,900, which is twice as much of that common to AK1 and NA18507 alone, probably reflective of the relative closeness of genetic ancestry between AK1 and YH. The number of genes containing at least one nssnps was 5,480, 5,184, and 6,036 in AK1, YH, and NA18507, respectively. Among these, 3,123 nssnps were found common to all 3 genomes. Of nssnps found in AK1, 34.6% were unique, 18.7% overlapped with YH, 9.6% overlapped with NA18507, and 37.1% were common to all 3 genomes. The percent of genes containing nssnps was 16.8% for those unique to AK1, 14.7% for those common to AK1 and YH, 11.5% common to AK1 and NA18507, and 57.0% common to all 3 genomes. The ratio of genes containing nssnps and total nssnps seems to increase as the genomes become genetically more distant (Figure 3a). 2,060 of 5,480 genes (37.6%) in AK1 contained 2 or more nssnps which was similar to the ratio obtained in YH (36.4%) and NA18507 (39.7%). 47, 34, and 37 genes were found to contain 10 or more nssnps in AK1, YH, and NA18507, respectively. Advantage of BAC sequencing for CNV detection We examined the characteristics of CNV regions detected by Illumina GA sequencing using BAC clones. Since nearly 100 Kb of the DNA insert in BAC clones were derived from a single chromatid, CNVs were detected with greater clarity than in diploid, whole genome shotgun sequencing. We were also able to select target regions and sequence BACs to obtain high depth of coverage of CNV-rich regions, again making CNV detection much easier. We identified 40 highly reliable deletions through sequencing of 1,132 BACs representing 390 frequent CNV regions, selected from DGV. These deletions featured increases in stretched, paired sequences, sudden coverage drops and loss of heterozygous SNP calls. We were able to confirm these CNV candidates using Agilent 24M CGH and genotyping microarrays. 24M array CGH We used a customized CGH array (acgh) designed for effective screening of CNV regions (CNVR) in individual human genomes and to obtain accurate true positive CNVR information from AK1. The Agilent Human acgh, which consists of 24 million oligonuclotide probes, was used for these experiments (the detailed information on the probes will be posted through a website). The array was designed to cover both coding and non-coding genomic regions, and repeat-masked to omit repetitive regions such as Alu elements and variable number tandem repeats (VNTRs). As a control, NA10851 from the HapMap collection was run side by side, and the result analyzed with NEXUS software. The significance threshold was set at 1.0E-6, maximum contiguous probe spacing (Kbp) 1000, and the minimum number of probes per segment at 5. CNVRs were categorized using average log2ratio as standard with high gain 0.6, gain 0.2, loss -0.2, and big loss -1.0 (which are commonly used criteria). Direct comparison between Illumina GA sequencing results and acgh is not straight forward due to differences between platforms and in the control sample used. Comparisons will require further research and analysis. 11

12 Reference 1. Cavalli-Sforza, L.L., Menozzi, P., & Piazza, A. The History and Geography of Human Genes. (Princeton University Press, Princeton, 1994) 2. Tanaka, M. et al., Mitochondrial genome variation in Eastern Asian and the peopling of Japan. Genome Res. 14, (2004) 12

13 Supplementary Figures Supplementary Figure 1. Workflow summarizing AK1 genome sequencing. 13

14 Supplementary Figure 2. Illumina GA II flowcell generating 18 billion bases of 2 x 106 bp sequences. (a) Sequences generated from all 8 lanes of a flowcell sequenced on an Illumina GA II sequencer, showing quality scores greater than 20 in both paired-end reads. (b) Graph showing acceptable error rates of the PhiX control lane in 2 x 106 bp reads. (c) Complete removal of the thymidine fluorophore in the reformulated cleavage reagent (bottom) prevents increase in the thymidine background signal (noise) compared to the previous reagents (top). (a) (b) (C) 14

15 Supplementary Figure 3. Flow chart summarizing methods underpinning SNP results. SNP filter conditions were set by comparing SNPs identified in whole genome sequences to SNPs identified on the Illumina BeadChip 370K, and were validated by comparing non-redundant SNPs identified on the 610K genotyping array with those identified in whole genome sequencing. This process was reiterated until filter sets were optimized. The resulting final filter set was used to generate SNP calls. 15

16 Supplementary Figure 4. Distribution of the percentage of genome sequencing reads calling the alternate allele for SNPs shown to be heterozygous (light grey) or homozygous (dark grey) through genotyping on the Illumina BeadChip 370K. 16

17 Supplementary Figure 5. Histogram of the distribution of indels by size. 17

18 Supplementary Figure 6. Histogram of the distribution of CDS indels by size. 18

19 Supplementary Figure 7. Correlation between SNP-indel densities on all chromosomes (per 10 Kb window). From top, mean coverage; SNP density; indel density; SNP and indel densities and depth of sequence coverage (moving average of ten 10 Kb windows and, for depth of sequence, normalized to average of the entire chromosome). Provided as separate figure (pdf). SuppFig7_AK1_SNP_indels.pdf Depicted below is a preview of the full version. 19

20 Supplementary Figure 8. Overview of approaches employed for structural variant detection. Structural variants were identified by 5 different methods and, confirmed using results from whole genome sequencing. 20

21 Supplementary Figure 9. Plots showing deletion candidates (between broken green vertical lines) found in 1,132 BAC clones. When available, both chromatids were shown to demonstrate heterozygous deletion. Blue or grey lines depict fold coverage. Provided as separate figure (pdf). SuppFig9_deletion_1132BAC_covplot.pdf Depicted below is a preview of the full version. 21

22 Supplementary Figure 10. Coverage plot showing deletion regions confirmed with whole genome sequencing. Candidate deletion regions from haploid sequencing, 370K & 610K microarray, and 24 M probe array CGH were confirmed using whole genome sequence at the deletion regions (between the two vertical green lines). The black horizontal line in the lower part of the plot represents reference genome contigs, and short black and red vertical lines depict hetero- and homozygous SNPs, respectively. The red horizontal line shows the mean coverage value of 30. Provided as separate figure (pdf). SuppFig10_AK1_deletions_covplot_heterozygosity_ver2.pdf Depicted below is a preview of the full version. 22

23 Supplementary Figure 11. Histogram showing distribution of CNVs by size detected using Agilent custom 24 M CGH array. 23

24 Supplementary Figure 12. Coverage plot showing CNV gain regions from 370K, 610K and Agilent array CGH platforms that were confirmed by coverage increases observed in whole genome sequencing. The black horizontal line in the lower part of the plot represents reference genome contigs, and short black and red vertical lines depict hetero- and homozygous SNPs, respectively. The red horizontal line shows the mean coverage value of 30. CNV gain region boundaries are depicted by two vertical green lines. Provided as separate figure (pdf). SuppFig12_AK1_Gains_array_covplot_heterozygosity.pdf Depicted below is a preview of the full version. 24

25 Supplementary Figure 13. Coverage plot showing CNV gain regions from BAC end sequencing that were confirmed by coverage increases observed in whole genome sequencing. The black horizontal line in the lower part of the plot represents reference genome contigs, and short black and red vertical lines depict hetero- and homozygous SNPs, respectively. The red horizontal line shows the mean coverage value of 30. CNV gain region boundaries are depicted by two vertical green lines. Provided as separate figure (pdf). SuppFig13_BACendins3_rand_heterozygosity.pdf Depicted below is a preview of the full version. 25

26 Supplementary Figure 14. Coverage plot showing CNV gain regions identified with long-insert paired-end sequencing and confirmed by coverage increases observed in whole genome sequencing. CNV gain region boundaries are depicted by two vertical green lines. The black horizontal line in the lower part of the plot represents reference genome contigs, and short black and red vertical lines depict hetero- and homozygous SNPs, respectively. The red horizontal line shows the mean coverage value of 30. Provided as separate figure (pdf). SuppFig14_LIPE_gain.pdf Depicted below is a preview of the full version. 26

27 Supplementary Figure 15. Chromosomal analysis of AK1 showing a normal karyotype. 27

28 Supplementary Figure 16. Validation of SNPs, indels, and CNV losses by Sanger sequencing. Provided as separate figure (pdf). SuppFig16_AK1_validation_all.pdf Depicted below is a preview of the full version. 28

29 Supplementary Figure 17. Mitochondrial haplogroup of 4 individual genome sequenced using next generation sequencing technology, and their migration route according to Mitomap ( 29

30 Supplementary Figure 18. Graph depicting mean fold-coverage along each chromosome by using either of multiple, random, and unique alignments in AK1. The black line near the bottom of each graph shows the alignment with human reference contigs. The red line displays the mean coverage of 30-fold. Provided as separate figure (pdf). SuppFig18_AK1_106randUniq_covplot_oneline.pdf Depicted below is a preview of the full version. 30

31 Supplementary Figure 19. Graph depicting mean fold-coverage along each chromosome by unique alignments using the SOAP alignment tool in NA The figure shows spikes that overlap with those of AK1. Provided as separate figure (pdf). SuppFig19_NA10851soapUniq.pdf Depicted below is a preview of the full version. 31

32 Supplementary Figure 20. A representative example showing limitations of indel detection in local repetitive or homopolymeric sequence. 32

33 Supplementary Tables Supplementary Table 1. List of BAC clones with unique sequence at both ends. SuppTable01_kogenome6_double_end-clone_1132_742.xls Depicted below is a preview of the full version. 33

34 Supplementary Table 2. Sequencing run summary for AK1 genome using GA. Sequencer Run* Template DNA # Reads Read Length (bp) bp Generated Library** Insert Size (bp) S1 gdna 30,153, ,085,517,396 SE1 - S2 gdna 29,365, ,057,173,372 SE1 - S3 gdna 26,337, ,161,376 SE1 - S4 gdna 26,905, ,597,784 SE2 - S5 gdna 32,075, ,154,703,924 SE1,SE2 - S6 gdna 34,323, ,235,652,408 SE1 - S7 gdna 31,444, ,132,015,464 SE1 - S8 gdna 29,645, ,067,255,316 SE1 - S9 gdna 35,342, ,272,332,664 SE1 - S10 gdna 37,819, ,361,495,196 SE1 - S11 gdna 43,027, ,548,978,192 SE1 - S12 gdna 43,741, ,574,677,692 SE1 - S13 gdna 40,206, ,447,448,364 SE1 - S14 gdna 79,097, ,847,494,700 PE1 - P1 gdna 28,131,026 2 x 36 2,025,433,872 PE1,PE2 130,390 P2 gdna 45,861,896 2 x 36 3,302,056,512 PE1 130 P3 gdna 36,516,167 2 x 36 2,629,164,024 PE1 130 P4 gdna 36,211,167 2 x 36 2,607,204,024 PE1 130 P5 gdna 59,918,876 2 x 36 4,314,159,072 PE1 130 P6 gdna 57,012,545 2 x 36 4,104,903,240 PE1 130 P7 gdna 75,097,500 2 x 36 5,407,020,000 PE1 130 P8 gdna 53,246,895 2 x 36 3,833,776,440 PE1 130 P9 gdna 72,835,614 2 x 36 5,244,164,208 PE1 130 P10 gdna 78,516,427 2 x 36 5,653,182,744 PE1 130 P11 gdna 72,130,403 2 x 36 5,193,389,016 PE1 130 P12 gdna 67,711,888 2 x 36 4,875,255,936 PE1 130 P13 gdna 65,120,193 2 x 36 4,688,653,896 PE2 390 P14 gdna 74,961,071 2 x 36 5,397,197,112 PE2 390 P15 gdna 61,661,384 2 x 88 10,852,403,584 PE2 390 P16 gdna 88,708,061 2 x ,806,108,932 LIPE 2700 SB1 Chr.20 BAC 17,160, ,793,156 BAC1 - PB1 Chr.20 BAC 26,813,486 2 x 36 1,930,570,992 BAC2 130 PB2 Chr.20 BAC 64,973,571 2 x 36 4,678,097,112 BAC2 130 PB3 Chr.20 BAC 49,523,603 2 x 36 3,565,699,416 BAC3 130 PB4 Chr.20 BAC 48,346,754 2 x 36 3,480,966,288 BAC4 390 PB5 WG BAC 52,846,314 2 x 36 3,804,934,608 BAC5 390 PB6 WG BAC 64,217,376 2 x 36 4,623,651,072 BAC6 390 Total Reads 3,097,371,573 Total bp 130,337,289,104 * Sequencer Run ** Library S: Whole Genome Single End Run SE1: Single End library 1 P: Whole Genome Paired End Run SE2: Single End library 2 SB: BAC Single End Run PB: BAC Paired End Run PE1: Paired End library 1, insert size 130bp PE2: Paired End library 2, insert size 390bp BAC1: chr.20 BAC Single End library BAC2: chr.20 BAC Paired End library 1, 742BACs BAC3: chr.20 BAC Paired End library 2, 742BACs BAC4: chr.20 BAC Paired End library 3, 43BACs BAC5: whole chr. BAC Paired End library, ~600BACs BAC6: whole chr. BAC Paired End library, ~600BACs LIPE: Long Insert size Paired End library 34

35 Supplementary Table 3. List of filter training conditions obtained by comparing SNP calls identified through sequencing with SNPs genotyped on the Illumina 370K. Filters that optimize PPV and sensitivity for autosomal and sex chromosome SNPs were chosen and are highlighted in yellow. SuppTable03_370k_filters_ver200.xls Depicted below is a preview of the full version. 35

36 Supplementary Table 4. List of filter training conditions for differentiating heterozygous and homozygous SNP calls identified through sequencing with SNPs genotyped on the Illumina 370K. Yellow highlights the final filter. minpct unique_call avgq Minimum pct for detecting homozygous SNPs Correctly identified true positive SNPs Het called as Homo Hom called as Het Total Het/Homo Mismatched , % , % optimized % % PPV 36

37 Supplementary Table 5. Statistics of SNPs identified in AK1 genome sequences through application of optimized filters. Total Homo/Hemizygous Heterozygous Total SNPs Autosomal SNPs Sex Chromosomal SNPs CDS nssnps Novel * non-exonic UTR CDS * Compared with dbsnp

38 Supplementary Table 6. List of nssnps detected in the AK1 genome. SuppTable6_nsSnp_AK1.xls Depicted below is a preview of the full version. 38

39 Supplementary Table 7. Validation of filter conditions for detection of SNPs by comparison of non-redundant SNPs detected with an Illumina Beadchip 610K with those detected by whole genome sequencing for autosomal and sex chromosomes. Diploid region (autosomes & pseudo-autosome) het minpct reads_call unique_cal l pct avgq maxq True Positives False Positives PPV Sens Absolute Sens raw all % raw hetero % raw homo % optimized all % optimized hetero % optimized homo % Haploid (Sex chromosome excluding pseudo-autosomal region) het minpct reads_call unique_ca ll pct avgq maxq True Positives False Positives PPV Relative Sens Absolute Sens raw hemi optimized hemi

40 Supplementary Table 8. Validation of filter conditions for differentiation of hetero- and homozygous SNP calls by comparison of non-redundant SNPs detected using Illumina 610K with those detected by whole genome sequencing. minpct unique_call avgq Minimum pct for detecting homozygous SNPs Correctly identified true positive SNPs Het called as Homo Hom called as Het Total Het/Homo Mismatched PPV optimized % 40

41 Supplementary Table 9. Comparative analysis of SNPs and indels found in five personal genomes. Genome Method Depth db NCBI Bases % Cov. # SNPs 1 SNP # hetero # homo % new # ns # cds # indels % new # indels # cd Het/Hom π (SNP) SNP Ref of NCBI Ref NCBI Density SNP 1 SNP 1 SNPs 2 SNP SNP indels same 3 indels SNP (10-4 ) Venter Sanger Watson Roche YH Illumina NA18507 Illumina AK1 Illumina % : In reads aligned to NCBI ref., with application of filters and validated by genotyping array data 2: Not in dbsnp 3: # indels of size < 3 nt 41

42 Supplementary Table 10. Summary statistics for short indels identified in AK1 genome. Diploid Region Haploid Region Total Novel * Total Novel Homozygous 67,212 36,975 3,988 2,709 non-exonic Heterozygous 96,994 64, Subtotal 164, ,674 3,988 2,709 Homozygous UTR Heterozygous 1, Subtotal 1,762 1, Homozygous CDS Heterozygous Subtotal Sum Total 166, ,833 4,027 2,736 * Compared with dbsnp 129 indel size frequency

43 Supplementary Table 11. List of 170,202 short indels of AK1 genome. SuppTable11_AK1_Indel_090506_list.txt Depicted below is a preview of the full version. 43

44 Supplementary Table 12. List of 212 CDS indels for AK1 genome. SuppTable12_AK1_CDS_indels.xls Depicted below is a preview of the full version. 44

45 Supplementary Table 13. OMIM annotation for 26 homozygous CD indels of AK1 genome. Gene Symbol OMIM ID Official Full Name Disease Relationship ABCF ATP-BINDING CASSETTE, SUBFAMILY F, MEMBER 1 - APOBEC3H APOLIPOPROTEIN B mrna EDITING ENZYME, CATALYTIC POLYPEPTIDE-LIKE 3H; APOBEC3H Susceptibility to retroviral infections; the anti-hiv activity of APOBEC3H ATG9B AUTOPHAGY 9, S. CEREVISIAE, HOMOLOG OF, B - CDCP CUB DOMAIN-CONTAINING PROTEIN 2 Predicted to have oxidoreductase activity and is potentially involved in cholesterol biosynthesis CLTCL CLATHRIN, HEAVY POLYPEPTIDE-LIKE 1; CLTCL1 - CREB3L camp RESPONSE ELEMENT-BINDING PROTEIN 3-LIKE 2 member of the old astrocyte specifically induced substance (OASIS) DNA binding and basic leucine zipper dimerization (bzip) family of transcription factors CYP4F Expression analysis of CYP4F8 shows that high levels are found in epithelial linings and CYTOCHROME P450, FAMILY 4, SUBFAMILY F, POLYPEPTIDE 8; CYP4F8 epidermis of psoriatic lesions DHDH DIMERIC DIHYDRODIOL DEHYDROGENASE; DHDH - EMCN ENDOMUCIN Interfere with the assembly of focal adhesion complexes and inhibits interaction between cells and the extracellular matrix FAM26B CALCIUM HOMEOSTASIS MODULATOR 2; CALHM2 - GAS2L GROWTH ARREST-SPECIFIC 2-LIKE 1; GAS2L1 - ITIH INTER-ALPHA-TRYPSIN INHIBITOR, HEAVY CHAIN 5 Suggested that ITIH5 may play a role in breast tumorigenesis MUC MUCIN 19, OLIGOMERIC; MUC19 - PADI PEPTIDYLARGININE DEIMINASE, TYPE VI Common variants on 1p36 and 1q42 are associated with cutaneous basal cell carcinoma but not with melanoma or pigmentation traits; PAX PAIRED BOX GENE 8 Essential for the formation of thyroxine-producing follicular cells PLCD PHOSPHOLIPASE C, DELTA-3; PLCD3 - PNLIPRP PANCREATIC LIPASE-RELATED PROTEIN 2; PNLIPRP2 - PRLHR PROLACTIN-RELEASING HORMONE RECEPTOR PRLHR is a 7-transmembrane domain receptor for prolactin-releasing hormone REXO RNA EXONUCLEASE 1, S. CEREVISIAE, HOMOLOG OF; REXO1 - RHBG RHESUS BLOOD GROUP, B GLYCOPROTEIN - RTTN ROTATIN; RTTN RTTN is required for the early developmental processes of left-right (L-R) specification and axial rotation and may play a role in notochord development SARM STERILE ALPHA AND TIR MOTIFS-CONTAINING PROTEIN 1; SARM1 A candidate gene in the onset of hereditary infectious/inflammatory diseases. SEBOX SKIN-, EMBRYO-, BRAIN-, AND OOCYTE-SPECIFIC HOMEOBOX - SLC38A SOLUTE CARRIER FAMILY 38 (AMINO ACID TRANSPORTER), MEMBER 3 A putative tumour suppressor gene SMAD SMAD family member 5 Associated with schizophrenia; up-regulated Smad5 mediates apoptosis of gastric epithelial cells induced by Helicobacter pylori infection; activation by BMP-2 in lung cancer cells ZNF ZINC FINGER PROTEIN

46 Supplementary Table 14. Pearson correlations between SNP and indel densities in AK1 and YH genomes. SuppTable14_snp_indel_correlation_ver300.xls Depicted below is a preview of the full version. 46

47 Supplementary Table 15. List of CNV regions identified by the custom Agilent 24M probe CGH array. SuppTable15_AK1_Agilent24M_CNVR.xls Depicted below is a preview of the full version. 47

48 Supplementary Table 16. List of CNV regions identified by the Illumina Beadchip 370K and 610K. Platform Chr Length Start Stop CopyNumber Confidence 1st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: st 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: nd 370k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence: k CNV Confidence:

49 Supplementary Table 17. List of deletions in AK1 genome confirmed by whole genome GA sequencing. SuppTable17_deletion_list_ver3.xls Depicted below is a preview of the full version. 49

50 Supplementary Table 18. List of regions with copy number gains in AK1 genome generated by Illumina Beadchip 370K, 610K and the Agilent 24M CGH array, and confirmed Copy Number Gains detected using microarrays and confirmed using Genome Analyzer Index Chr Size (nt) Start End Affected Loci Accession # * Method(s) PDE4DIP Variation_0685 Agilent24M Variation_4249 Variation_ RNASEH1 Agilent24M K1,370K2,610K,Agilent24M SPAST Agilent24M MRPL19, C2orf3 610K,Agilent24M K,Agilent24M abpart 610K ZC3H6 Agilent24M Agilent24M BFSP2 Agilent24M FGF12 Agilent24M MUC20 Agilent24M MUC4 Agilent24M DEFB131, LOC Variation_ K Variation_31214 Agilent24M FLJ43080 Agilent24M PCDHB7, PCDHB8 Agilent24M C6orf127, CLPS Agilent24M Agilent24M FAM115C Agilent24M Variation_22883 Agilent24M Variation_ Variation_38082 Agilent24M GTF3C5, CEL Agilent24M CACNA1B Agilent24M Variation_ K Variation_24747 Variation_24748 Variation_ DNAJC15 Agilent24M Agilent24M HNRNPA1L2, SUGT1 610K,Agilent24M Agilent24M Agilent24M Variation_4853 Agilent24M LOC727832, POTEB, OR4M2 Variation_ K,Agilent24M Variation_ LOC K,Agilent24M K Agilent24M Variation_5434 Agilent24M CES1 Agilent24M Agilent24M FKSG24, RAB3A Agilent24M ZNF676, LOC342994, ZNF98, ZNF492, ZNF99 370K1,370K2,610K,Agilent24M FFAR3 Variation_32249 Agilent24M Variation_7201 Variation_ ZNF468 Variation_ K,Agilent24M Variation_ VPREB1 Variation_10609 Agilent24M Agilent24M 45 X Agilent24M * Compared with database of genomic variants (DGV) (Nov ) 50

51 Supplementary Table 19. List of regions with copy number gains in AK1 genome identified by BAC end sequencing and confirmed by whole genome GA sequencing. Copy Number Gains using BAC end sequences Index Chr Estimated Size Start Stop Gene Accession # Methods RMND5A BAC end sequencing LOC BAC end sequencing FRG2C, BC BAC end sequencing BAC end sequencing CTAGE4,FLJ43692,OR2A1,OR2A7,ARHGEVariation_31398 BAC end sequencing Variation_ ANKRD30A BAC end sequencing HYDIN, KIAA1864 BAC end sequencing A26B2 BAC end sequencing 9 Y RBMY1 BAC end sequencing * Compared with database of genomic variants (DGV) (Nov ) 51

52 Supplementary Table 20. List of regions with copy number gains in AK1 genome detected by compressed pairs in LIPE (long-inserted paired-end) whole genome sequencing and confirmed by whole genome coverage depth. Copy Number Gains using LIPE Index Chr Expected size of Gains (bp) Start Stop Gene Accession # * avg_insert_size Method EPHA8 Variation_ LIPE LIPE DLEC LIPE SORCS LIPE Variation_ LIPE LIPE ICA LIPE LIPE LIPE GOLM LIPE LIPE LIPE LIPE LIPE CLEC17A LIPE LIPE PEPD LIPE HNF4A LIPE LIPE LIPE LIPE 22 X LIPE 23 X LIPE * Compared with database of genomic variants (DGV) (Nov ) 52

53 Supplementary Table 21. List of gene ontologies for genes in which nssnps were found, showing the total number of genes containing nssnps and the number of genes containing nssnps that were common to AK1, YH, and NA SuppTable21_GO_analysis.xls Depicted below is a preview of the full version. 53

54 Supplementary Table 22. List of 773 SNPs with potential clinical implications identified by the Trait-o-matic algorithm. SuppTable22_AK1_773_TraitOmatic.xls Depicted below is a preview of the full version. 54

55 Supplementary Table 23. PCR primers used for validation of SNPs, short indels, and CNVs by Sanger sequencing. SuppTable23_all_validation_list_primers_ver600.xls Depicted below is a preview of the full version. 55

Nature Biotechnology: doi: /nbt.1904

Nature Biotechnology: doi: /nbt.1904 Supplementary Information Comparison between assembly-based SV calls and array CGH results Genome-wide array assessment of copy number changes, such as array comparative genomic hybridization (acgh), is

More information

Abstract. Optimization strategy of Copy Number Variant calling using Multiplicom solutions APPLICATION NOTE. Introduction

Abstract. Optimization strategy of Copy Number Variant calling using Multiplicom solutions APPLICATION NOTE. Introduction Optimization strategy of Copy Number Variant calling using Multiplicom solutions Michael Vyverman, PhD; Laura Standaert, PhD and Wouter Bossuyt, PhD Abstract Copy number variations (CNVs) represent a significant

More information

Advance Your Genomic Research Using Targeted Resequencing with SeqCap EZ Library

Advance Your Genomic Research Using Targeted Resequencing with SeqCap EZ Library Advance Your Genomic Research Using Targeted Resequencing with SeqCap EZ Library Marilou Wijdicks International Product Manager Research For Life Science Research Only. Not for Use in Diagnostic Procedures.

More information

Supplementary note: Comparison of deletion variants identified in this study and four earlier studies

Supplementary note: Comparison of deletion variants identified in this study and four earlier studies Supplementary note: Comparison of deletion variants identified in this study and four earlier studies Here we compare the results of this study to potentially overlapping results from four earlier studies

More information

Genomic structural variation

Genomic structural variation Genomic structural variation Mario Cáceres The new genomic variation DNA sequence differs across individuals much more than researchers had suspected through structural changes A huge amount of structural

More information

Introduction to LOH and Allele Specific Copy Number User Forum

Introduction to LOH and Allele Specific Copy Number User Forum Introduction to LOH and Allele Specific Copy Number User Forum Jonathan Gerstenhaber Introduction to LOH and ASCN User Forum Contents 1. Loss of heterozygosity Analysis procedure Types of baselines 2.

More information

Detection of aneuploidy in a single cell using the Ion ReproSeq PGS View Kit

Detection of aneuploidy in a single cell using the Ion ReproSeq PGS View Kit APPLICATION NOTE Ion PGM System Detection of aneuploidy in a single cell using the Ion ReproSeq PGS View Kit Key findings The Ion PGM System, in concert with the Ion ReproSeq PGS View Kit and Ion Reporter

More information

Global variation in copy number in the human genome

Global variation in copy number in the human genome Global variation in copy number in the human genome Redon et. al. Nature 444:444-454 (2006) 12.03.2007 Tarmo Puurand Study 270 individuals (HapMap collection) Affymetrix 500K Whole Genome TilePath (WGTP)

More information

Agilent s Copy Number Variation (CNV) Portfolio

Agilent s Copy Number Variation (CNV) Portfolio Technical Overview Agilent s Copy Number Variation (CNV) Portfolio Abstract Copy Number Variation (CNV) is now recognized as a prevalent form of structural variation in the genome contributing to human

More information

ChIP-seq data analysis

ChIP-seq data analysis ChIP-seq data analysis Harri Lähdesmäki Department of Computer Science Aalto University November 24, 2017 Contents Background ChIP-seq protocol ChIP-seq data analysis Transcriptional regulation Transcriptional

More information

Dr Rick Tearle Senior Applications Specialist, EMEA Complete Genomics Complete Genomics, Inc.

Dr Rick Tearle Senior Applications Specialist, EMEA Complete Genomics Complete Genomics, Inc. Dr Rick Tearle Senior Applications Specialist, EMEA Complete Genomics Topics Overview of Data Processing Pipeline Overview of Data Files 2 DNA Nano-Ball (DNB) Read Structure Genome : acgtacatgcattcacacatgcttagctatctctcgccag

More information

Mutation Detection and CNV Analysis for Illumina Sequencing data from HaloPlex Target Enrichment Panels using NextGENe Software for Clinical Research

Mutation Detection and CNV Analysis for Illumina Sequencing data from HaloPlex Target Enrichment Panels using NextGENe Software for Clinical Research Mutation Detection and CNV Analysis for Illumina Sequencing data from HaloPlex Target Enrichment Panels using NextGENe Software for Clinical Research Application Note Authors John McGuigan, Megan Manion,

More information

LTA Analysis of HapMap Genotype Data

LTA Analysis of HapMap Genotype Data LTA Analysis of HapMap Genotype Data Introduction. This supplement to Global variation in copy number in the human genome, by Redon et al., describes the details of the LTA analysis used to screen HapMap

More information

CNV Detection and Interpretation in Genomic Data

CNV Detection and Interpretation in Genomic Data CNV Detection and Interpretation in Genomic Data Benjamin W. Darbro, M.D., Ph.D. Assistant Professor of Pediatrics Director of the Shivanand R. Patil Cytogenetics and Molecular Laboratory Overview What

More information

Analysis of Massively Parallel Sequencing Data Application of Illumina Sequencing to the Genetics of Human Cancers

Analysis of Massively Parallel Sequencing Data Application of Illumina Sequencing to the Genetics of Human Cancers Analysis of Massively Parallel Sequencing Data Application of Illumina Sequencing to the Genetics of Human Cancers Gordon Blackshields Senior Bioinformatician Source BioScience 1 To Cancer Genetics Studies

More information

Below, we included the point-to-point response to the comments of both reviewers.

Below, we included the point-to-point response to the comments of both reviewers. To the Editor and Reviewers: We would like to thank the editor and reviewers for careful reading, and constructive suggestions for our manuscript. According to comments from both reviewers, we have comprehensively

More information

DETECTION OF LOW FREQUENCY CXCR4-USING HIV-1 WITH ULTRA-DEEP PYROSEQUENCING. John Archer. Faculty of Life Sciences University of Manchester

DETECTION OF LOW FREQUENCY CXCR4-USING HIV-1 WITH ULTRA-DEEP PYROSEQUENCING. John Archer. Faculty of Life Sciences University of Manchester DETECTION OF LOW FREQUENCY CXCR4-USING HIV-1 WITH ULTRA-DEEP PYROSEQUENCING John Archer Faculty of Life Sciences University of Manchester HIV Dynamics and Evolution, 2008, Santa Fe, New Mexico. Overview

More information

Identifying Mutations Responsible for Rare Disorders Using New Technologies

Identifying Mutations Responsible for Rare Disorders Using New Technologies Identifying Mutations Responsible for Rare Disorders Using New Technologies Jacek Majewski, Department of Human Genetics, McGill University, Montreal, QC Canada Mendelian Diseases Clear mode of inheritance

More information

Colorspace & Matching

Colorspace & Matching Colorspace & Matching Outline Color space and 2-base-encoding Quality Values and filtering Mapping algorithm and considerations Estimate accuracy Coverage 2 2008 Applied Biosystems Color Space Properties

More information

Structural Variation and Medical Genomics

Structural Variation and Medical Genomics Structural Variation and Medical Genomics Andrew King Department of Biomedical Informatics July 8, 2014 You already know about small scale genetic mutations Single nucleotide polymorphism (SNPs) Deletions,

More information

Cytogenetics 101: Clinical Research and Molecular Genetic Technologies

Cytogenetics 101: Clinical Research and Molecular Genetic Technologies Cytogenetics 101: Clinical Research and Molecular Genetic Technologies Topics for Today s Presentation 1 Classical vs Molecular Cytogenetics 2 What acgh? 3 What is FISH? 4 What is NGS? 5 How can these

More information

Variant Classification. Author: Mike Thiesen, Golden Helix, Inc.

Variant Classification. Author: Mike Thiesen, Golden Helix, Inc. Variant Classification Author: Mike Thiesen, Golden Helix, Inc. Overview Sequencing pipelines are able to identify rare variants not found in catalogs such as dbsnp. As a result, variants in these datasets

More information

MRC-Holland MLPA. Description version 08; 30 March 2015

MRC-Holland MLPA. Description version 08; 30 March 2015 SALSA MLPA probemix P351-C1 / P352-D1 PKD1-PKD2 P351-C1 lot C1-0914: as compared to the previous version B2 lot B2-0511 one target probe has been removed and three reference probes have been replaced.

More information

Hands-On Ten The BRCA1 Gene and Protein

Hands-On Ten The BRCA1 Gene and Protein Hands-On Ten The BRCA1 Gene and Protein Objective: To review transcription, translation, reading frames, mutations, and reading files from GenBank, and to review some of the bioinformatics tools, such

More information

Analysis of Genetic Inheritance in a Family Quartet by Whole-Genome Sequencing

Analysis of Genetic Inheritance in a Family Quartet by Whole-Genome Sequencing www.sciencemag.org/cgi/content/full/science.1186802/dc1 Supporting Online Material for Analysis of Genetic Inheritance in a Family Quartet by Whole-Genome Sequencing Jared C. Roach, Gustavo Glusman, Arian

More information

RASA: Robust Alternative Splicing Analysis for Human Transcriptome Arrays

RASA: Robust Alternative Splicing Analysis for Human Transcriptome Arrays Supplementary Materials RASA: Robust Alternative Splicing Analysis for Human Transcriptome Arrays Junhee Seok 1*, Weihong Xu 2, Ronald W. Davis 2, Wenzhong Xiao 2,3* 1 School of Electrical Engineering,

More information

AVENIO ctdna Analysis Kits The complete NGS liquid biopsy solution EMPOWER YOUR LAB

AVENIO ctdna Analysis Kits The complete NGS liquid biopsy solution EMPOWER YOUR LAB Analysis Kits The complete NGS liquid biopsy solution EMPOWER YOUR LAB Analysis Kits Next-generation performance in liquid biopsies 2 Accelerating clinical research From liquid biopsy to next-generation

More information

Challenges of CGH array testing in children with developmental delay. Dr Sally Davies 17 th September 2014

Challenges of CGH array testing in children with developmental delay. Dr Sally Davies 17 th September 2014 Challenges of CGH array testing in children with developmental delay Dr Sally Davies 17 th September 2014 CGH array What is CGH array? Understanding the test Benefits Results to expect Consent issues Ethical

More information

Interactive analysis and quality assessment of single-cell copy-number variations

Interactive analysis and quality assessment of single-cell copy-number variations Interactive analysis and quality assessment of single-cell copy-number variations Tyler Garvin, Robert Aboukhalil, Jude Kendall, Timour Baslan, Gurinder S. Atwal, James Hicks, Michael Wigler, Michael C.

More information

Annotation of Chimp Chunk 2-10 Jerome M Molleston 5/4/2009

Annotation of Chimp Chunk 2-10 Jerome M Molleston 5/4/2009 Annotation of Chimp Chunk 2-10 Jerome M Molleston 5/4/2009 1 Abstract A stretch of chimpanzee DNA was annotated using tools including BLAST, BLAT, and Genscan. Analysis of Genscan predicted genes revealed

More information

Investigating rare diseases with Agilent NGS solutions

Investigating rare diseases with Agilent NGS solutions Investigating rare diseases with Agilent NGS solutions Chitra Kotwaliwale, Ph.D. 1 Rare diseases affect 350 million people worldwide 7,000 rare diseases 80% are genetic 60 million affected in the US, Europe

More information

A complete next-generation sequencing workfl ow for circulating cell-free DNA isolation and analysis

A complete next-generation sequencing workfl ow for circulating cell-free DNA isolation and analysis APPLICATION NOTE Cell-Free DNA Isolation Kit A complete next-generation sequencing workfl ow for circulating cell-free DNA isolation and analysis Abstract Circulating cell-free DNA (cfdna) has been shown

More information

Analysis with SureCall 2.1

Analysis with SureCall 2.1 Analysis with SureCall 2.1 Danielle Fletcher Field Application Scientist July 2014 1 Stages of NGS Analysis Primary analysis, base calling Control Software FASTQ file reads + quality 2 Stages of NGS Analysis

More information

Mosaic loss of chromosome Y in peripheral blood is associated with shorter survival and higher risk of cancer

Mosaic loss of chromosome Y in peripheral blood is associated with shorter survival and higher risk of cancer Supplementary Information Mosaic loss of chromosome Y in peripheral blood is associated with shorter survival and higher risk of cancer Lars A. Forsberg, Chiara Rasi, Niklas Malmqvist, Hanna Davies, Saichand

More information

Shape-based retrieval of CNV regions in read coverage data. Sangkyun Hong and Jeehee Yoon*

Shape-based retrieval of CNV regions in read coverage data. Sangkyun Hong and Jeehee Yoon* 254 Int. J. Data Mining and Bioinformatics, Vol. 9, No. 3, 2014 Shape-based retrieval of CNV regions in read coverage data Sangkyun Hong and Jeehee Yoon* Department of Computer Engineering, Hallym University

More information

Computational Identification and Prediction of Tissue-Specific Alternative Splicing in H. Sapiens. Eric Van Nostrand CS229 Final Project

Computational Identification and Prediction of Tissue-Specific Alternative Splicing in H. Sapiens. Eric Van Nostrand CS229 Final Project Computational Identification and Prediction of Tissue-Specific Alternative Splicing in H. Sapiens. Eric Van Nostrand CS229 Final Project Introduction RNA splicing is a critical step in eukaryotic gene

More information

CNV detection. Introduction and detection in NGS data. G. Demidov 1,2. NGSchool2016. Centre for Genomic Regulation. CNV detection. G.

CNV detection. Introduction and detection in NGS data. G. Demidov 1,2. NGSchool2016. Centre for Genomic Regulation. CNV detection. G. Introduction and detection in NGS data 1,2 1 Genomic and Epigenomic Variation in Disease group, Centre for Genomic Regulation 2 Universitat Pompeu Fabra NGSchool2016 methods: methods Outline methods: methods

More information

Most severely affected will be the probe for exon 15. Please keep an eye on the D-fragments (especially the 96 nt fragment).

Most severely affected will be the probe for exon 15. Please keep an eye on the D-fragments (especially the 96 nt fragment). SALSA MLPA probemix P343-C3 Autism-1 Lot C3-1016. As compared to version C2 (lot C2-0312) five reference probes have been replaced, one reference probe added and several lengths have been adjusted. Warning:

More information

Understanding DNA Copy Number Data

Understanding DNA Copy Number Data Understanding DNA Copy Number Data Adam B. Olshen Department of Epidemiology and Biostatistics Helen Diller Family Comprehensive Cancer Center University of California, San Francisco http://cc.ucsf.edu/people/olshena_adam.php

More information

Using the Bravo Liquid-Handling System for Next Generation Sequencing Sample Prep

Using the Bravo Liquid-Handling System for Next Generation Sequencing Sample Prep Using the Bravo Liquid-Handling System for Next Generation Sequencing Sample Prep Tom Walsh, PhD Division of Medical Genetics University of Washington Next generation sequencing Sanger sequencing gold

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION doi:10.1038/nature11396 A total of 2078 samples from a large sequencing project at decode were used in this study, 219 samples from 78 trios with two grandchildren who were not also members of other trios,

More information

AVENIO family of NGS oncology assays ctdna and Tumor Tissue Analysis Kits

AVENIO family of NGS oncology assays ctdna and Tumor Tissue Analysis Kits AVENIO family of NGS oncology assays ctdna and Tumor Tissue Analysis Kits Accelerating clinical research Next-generation sequencing (NGS) has the ability to interrogate many different genes and detect

More information

MEDICAL GENOMICS LABORATORY. Next-Gen Sequencing and Deletion/Duplication Analysis of NF1 Only (NF1-NG)

MEDICAL GENOMICS LABORATORY. Next-Gen Sequencing and Deletion/Duplication Analysis of NF1 Only (NF1-NG) Next-Gen Sequencing and Deletion/Duplication Analysis of NF1 Only (NF1-NG) Ordering Information Acceptable specimen types: Fresh blood sample (3-6 ml EDTA; no time limitations associated with receipt)

More information

Generating Spontaneous Copy Number Variants (CNVs) Jennifer Freeman Assistant Professor of Toxicology School of Health Sciences Purdue University

Generating Spontaneous Copy Number Variants (CNVs) Jennifer Freeman Assistant Professor of Toxicology School of Health Sciences Purdue University Role of Chemical lexposure in Generating Spontaneous Copy Number Variants (CNVs) Jennifer Freeman Assistant Professor of Toxicology School of Health Sciences Purdue University CNV Discovery Reference Genetic

More information

Introduction to genetic variation. He Zhang Bioinformatics Core Facility 6/22/2016

Introduction to genetic variation. He Zhang Bioinformatics Core Facility 6/22/2016 Introduction to genetic variation He Zhang Bioinformatics Core Facility 6/22/2016 Outline Basic concepts of genetic variation Genetic variation in human populations Variation and genetic disorders Databases

More information

Multiplex target enrichment using DNA indexing for ultra-high throughput variant detection

Multiplex target enrichment using DNA indexing for ultra-high throughput variant detection Multiplex target enrichment using DNA indexing for ultra-high throughput variant detection Dr Elaine Kenny Neuropsychiatric Genetics Research Group Institute of Molecular Medicine Trinity College Dublin

More information

Nature Genetics: doi: /ng Supplementary Figure 1. Rates of different mutation types in CRC.

Nature Genetics: doi: /ng Supplementary Figure 1. Rates of different mutation types in CRC. Supplementary Figure 1 Rates of different mutation types in CRC. (a) Stratification by mutation type indicates that C>T mutations occur at a significantly greater rate than other types. (b) As for the

More information

Identification of genomic alterations in cervical cancer biopsies by exome sequencing

Identification of genomic alterations in cervical cancer biopsies by exome sequencing Chapter- 4 Identification of genomic alterations in cervical cancer biopsies by exome sequencing 105 4.1 INTRODUCTION Athough HPV has been identified as the prime etiological factor for cervical cancer,

More information

MRC-Holland MLPA. Description version 19;

MRC-Holland MLPA. Description version 19; SALSA MLPA probemix P6-B2 SMA Lot B2-712, B2-312, B2-111, B2-511: As compared to the previous version B1 (lot B1-11), the 88 and 96 nt DNA Denaturation control fragments have been replaced (QDX2). SPINAL

More information

The Loss of Heterozygosity (LOH) Algorithm in Genotyping Console 2.0

The Loss of Heterozygosity (LOH) Algorithm in Genotyping Console 2.0 The Loss of Heterozygosity (LOH) Algorithm in Genotyping Console 2.0 Introduction Loss of erozygosity (LOH) represents the loss of allelic differences. The SNP markers on the SNP Array 6.0 can be used

More information

SCALPEL MICRO-ASSEMBLY APPROACH TO DETECT INDELS WITHIN EXOME-CAPTURE DATA. Giuseppe Narzisi, PhD Schatz Lab

SCALPEL MICRO-ASSEMBLY APPROACH TO DETECT INDELS WITHIN EXOME-CAPTURE DATA. Giuseppe Narzisi, PhD Schatz Lab SCALPEL MICRO-ASSEMBLY APPROACH TO DETECT INDELS WITHIN EXOME-CAPTURE DATA Giuseppe Narzisi, PhD Schatz Lab November 14, 2013 Micro-Assembly Approach to detect INDELs 2 Outline Scalpel micro-assembly pipeline

More information

Statistical Analysis of Single Nucleotide Polymorphism Microarrays in Cancer Studies

Statistical Analysis of Single Nucleotide Polymorphism Microarrays in Cancer Studies Statistical Analysis of Single Nucleotide Polymorphism Microarrays in Cancer Studies Stanford Biostatistics Workshop Pierre Neuvial with Henrik Bengtsson and Terry Speed Department of Statistics, UC Berkeley

More information

High-throughput transcriptome sequencing

High-throughput transcriptome sequencing High-throughput transcriptome sequencing Erik Kristiansson (erik.kristiansson@zool.gu.se) Department of Zoology Department of Neuroscience and Physiology University of Gothenburg, Sweden Outline Genome

More information

SALSA MLPA KIT P060-B2 SMA

SALSA MLPA KIT P060-B2 SMA SALSA MLPA KIT P6-B2 SMA Lot 111, 511: As compared to the previous version B1 (lot 11), the 88 and 96 nt DNA Denaturation control fragments have been replaced (QDX2). Please note that, in contrast to the

More information

Nature Genetics: doi: /ng Supplementary Figure 1. PCA for ancestry in SNV data.

Nature Genetics: doi: /ng Supplementary Figure 1. PCA for ancestry in SNV data. Supplementary Figure 1 PCA for ancestry in SNV data. (a) EIGENSTRAT principal-component analysis (PCA) of SNV genotype data on all samples. (b) PCA of only proband SNV genotype data. (c) PCA of SNV genotype

More information

MRC-Holland MLPA. Description version 07; 26 November 2015

MRC-Holland MLPA. Description version 07; 26 November 2015 SALSA MLPA probemix P266-B1 CLCNKB Lot B1-0415, B1-0911. As compared to version A1 (lot A1-0908), one target probe for CLCNKB (exon 11) has been replaced. In addition, one reference probe has been replaced

More information

SALSA MLPA probemix P169-C2 HIRSCHSPRUNG-1 Lot C As compared to version C1 (lot C1-0612), the length of one probe has been adjusted.

SALSA MLPA probemix P169-C2 HIRSCHSPRUNG-1 Lot C As compared to version C1 (lot C1-0612), the length of one probe has been adjusted. mix P169-C2 HIRSCHSPRUNG-1 Lot C2-0915. As compared to version C1 (lot C1-0612), the length of one has been adjusted. Hirschsprung disease (HSCR), or aganglionic megacolon, is a congenital disorder characterised

More information

SNP Array NOTE: THIS IS A SAMPLE REPORT AND MAY NOT REFLECT ACTUAL PATIENT DATA. FORMAT AND/OR CONTENT MAY BE UPDATED PERIODICALLY.

SNP Array NOTE: THIS IS A SAMPLE REPORT AND MAY NOT REFLECT ACTUAL PATIENT DATA. FORMAT AND/OR CONTENT MAY BE UPDATED PERIODICALLY. SAMPLE REPORT SNP Array NOTE: THIS IS A SAMPLE REPORT AND MAY NOT REFLECT ACTUAL PATIENT DATA. FORMAT AND/OR CONTENT MAY BE UPDATED PERIODICALLY. RESULTS SNP Array Copy Number Variations Result: LOSS,

More information

Implementation of the DDD/ClinGen OGT (CytoSure v3) Microarray

Implementation of the DDD/ClinGen OGT (CytoSure v3) Microarray Implementation of the DDD/ClinGen OGT (CytoSure v3) Microarray OGT UGM Birmingham 08/09/2016 Dom McMullan Birmingham Women's NHS Trust WM chromosomal microarray (CMA) testing Population of ~6 million (10%)

More information

Illuminating the genetics of complex human diseases

Illuminating the genetics of complex human diseases Illuminating the genetics of complex human diseases Michael Schatz Sept 27, 2012 Beyond the Genome @mike_schatz / #BTG2012 Outline 1. De novo mutations in human diseases 1. Autism Spectrum Disorder 2.

More information

iplex genotyping IDH1 and IDH2 assays utilized the following primer sets (forward and reverse primers along with extension primers).

iplex genotyping IDH1 and IDH2 assays utilized the following primer sets (forward and reverse primers along with extension primers). Supplementary Materials Supplementary Methods iplex genotyping IDH1 and IDH2 assays utilized the following primer sets (forward and reverse primers along with extension primers). IDH1 R132H and R132L Forward:

More information

Personalized Copy-Number and Segmental Duplication Maps using Next-Generation Sequencing

Personalized Copy-Number and Segmental Duplication Maps using Next-Generation Sequencing Supplementary Information Personalized Copy-Number and Segmental Duplication Maps using Next-Generation Sequencing Can Alkan 1,6, Jeffrey M. Kidd 1, Tomas Marques-Bonet 1,2, Gozde Aksay 1, Francesca 1

More information

Genomics 101 (2013) Contents lists available at SciVerse ScienceDirect. Genomics. journal homepage:

Genomics 101 (2013) Contents lists available at SciVerse ScienceDirect. Genomics. journal homepage: Genomics 101 (2013) 134 138 Contents lists available at SciVerse ScienceDirect Genomics journal homepage: www.elsevier.com/locate/ygeno Gene-based copy number variation study reveals a microdeletion at

More information

DNA-seq Bioinformatics Analysis: Copy Number Variation

DNA-seq Bioinformatics Analysis: Copy Number Variation DNA-seq Bioinformatics Analysis: Copy Number Variation Elodie Girard elodie.girard@curie.fr U900 institut Curie, INSERM, Mines ParisTech, PSL Research University Paris, France NGS Applications 5C HiC DNA-seq

More information

MRC-Holland MLPA. Description version 12; 13 January 2017

MRC-Holland MLPA. Description version 12; 13 January 2017 SALSA MLPA probemix P219-B3 PAX6 Lot B3-0915: Compared to version B2 (lot B2-1111) two reference probes have been replaced and one additional reference probe has been added. In addition, one flanking probe

More information

SNP Array NOTE: THIS IS A SAMPLE REPORT AND MAY NOT REFLECT ACTUAL PATIENT DATA. FORMAT AND/OR CONTENT MAY BE UPDATED PERIODICALLY.

SNP Array NOTE: THIS IS A SAMPLE REPORT AND MAY NOT REFLECT ACTUAL PATIENT DATA. FORMAT AND/OR CONTENT MAY BE UPDATED PERIODICALLY. SAMPLE REPORT SNP Array NOTE: THIS IS A SAMPLE REPORT AND MAY NOT REFLECT ACTUAL PATIENT DATA. FORMAT AND/OR CONTENT MAY BE UPDATED PERIODICALLY. RESULTS SNP Array Copy Number Variations Result: GAIN,

More information

November 9, Johns Hopkins School of Medicine, Baltimore, MD,

November 9, Johns Hopkins School of Medicine, Baltimore, MD, Fast detection of de-novo copy number variants from case-parent SNP arrays identifies a deletion on chromosome 7p14.1 associated with non-syndromic isolated cleft lip/palate Samuel G. Younkin 1, Robert

More information

Same Day, Cost-Effective Aneuploidy Detection with Agilent Oligonucleotide array CGH and MDA Single Cell Amplification Method

Same Day, Cost-Effective Aneuploidy Detection with Agilent Oligonucleotide array CGH and MDA Single Cell Amplification Method Same Day, Cost-Effective Aneuploidy Detection with Agilent Oligonucleotide array CGH and MDA Single Cell Amplification Method Presenter: Dr. Ali Hellani, Founder, Viafet Genomic Center, Dubai Wednesday,

More information

SALSA MLPA probemix P315-B1 EGFR

SALSA MLPA probemix P315-B1 EGFR SALSA MLPA probemix P315-B1 EGFR Lot B1-0215 and B1-0112. As compared to the previous A1 version (lot 0208), two mutation-specific probes for the EGFR mutations L858R and T709M as well as one additional

More information

CHROMOSOMAL MICROARRAY (CGH+SNP)

CHROMOSOMAL MICROARRAY (CGH+SNP) Chromosome imbalances are a significant cause of developmental delay, mental retardation, autism spectrum disorders, dysmorphic features and/or birth defects. The imbalance of genetic material may be due

More information

MRC-Holland MLPA. Description version 14; 28 September 2016

MRC-Holland MLPA. Description version 14; 28 September 2016 SALSA MLPA probemix P279-B3 CACNA1A Lot B3-0816. As compared to version B2 (lot B2-1012), one reference probe has been replaced and the length of several probes has been adjusted. Voltage-dependent calcium

More information

A. Incorrect! Cells contain the units of genetic they are not the unit of heredity.

A. Incorrect! Cells contain the units of genetic they are not the unit of heredity. MCAT Biology Problem Drill PS07: Mendelian Genetics Question No. 1 of 10 Question 1. The smallest unit of heredity is. Question #01 (A) Cell (B) Gene (C) Chromosome (D) Allele Cells contain the units of

More information

Pedigree Analysis Why do Pedigrees? Goals of Pedigree Analysis Basic Symbols More Symbols Y-Linked Inheritance

Pedigree Analysis Why do Pedigrees? Goals of Pedigree Analysis Basic Symbols More Symbols Y-Linked Inheritance Pedigree Analysis Why do Pedigrees? Punnett squares and chi-square tests work well for organisms that have large numbers of offspring and controlled mating, but humans are quite different: Small families.

More information

White Paper Estimating Complex Phenotype Prevalence Using Predictive Models

White Paper Estimating Complex Phenotype Prevalence Using Predictive Models White Paper 23-12 Estimating Complex Phenotype Prevalence Using Predictive Models Authors: Nicholas A. Furlotte Aaron Kleinman Robin Smith David Hinds Created: September 25 th, 2015 September 25th, 2015

More information

Global assessment of genomic variation in cattle by genome resequencing and high-throughput genotyping

Global assessment of genomic variation in cattle by genome resequencing and high-throughput genotyping RESEARCH ARTICLE Open Access Global assessment of genomic variation in cattle by genome resequencing and high-throughput genotyping Bujie Zhan 1, João Fadista 1, Bo Thomsen 1, Jakob Hedegaard 1,2, Frank

More information

SALSA MLPA probemix P241-D2 MODY mix 1 Lot D As compared to version D1 (lot D1-0911), one reference probe has been replaced.

SALSA MLPA probemix P241-D2 MODY mix 1 Lot D As compared to version D1 (lot D1-0911), one reference probe has been replaced. mix P241-D2 MODY mix 1 Lot D2-0413. As compared to version D1 (lot D1-0911), one reference has been replaced. Maturity-Onset Diabetes of the Young (MODY) is a distinct form of non insulin-dependent diabetes

More information

Accessing and Using ENCODE Data Dr. Peggy J. Farnham

Accessing and Using ENCODE Data Dr. Peggy J. Farnham 1 William M Keck Professor of Biochemistry Keck School of Medicine University of Southern California How many human genes are encoded in our 3x10 9 bp? C. elegans (worm) 959 cells and 1x10 8 bp 20,000

More information

MRC-Holland MLPA. Description version 29; 31 July 2015

MRC-Holland MLPA. Description version 29; 31 July 2015 SALSA MLPA probemix P081-C1/P082-C1 NF1 P081 Lot C1-0114. As compared to the previous B2 version (lot 0813 and 0912), 11 target probes are replaced or added, and 10 new reference probes are included. P082

More information

CRISPR/Cas9 Enrichment and Long-read WGS for Structural Variant Discovery

CRISPR/Cas9 Enrichment and Long-read WGS for Structural Variant Discovery CRISPR/Cas9 Enrichment and Long-read WGS for Structural Variant Discovery PacBio CoLab Session October 20, 2017 For Research Use Only. Not for use in diagnostics procedures. Copyright 2017 by Pacific Biosciences

More information

SEAMLESS CGH DIAGNOSTIC TESTING

SEAMLESS CGH DIAGNOSTIC TESTING SEAMLESS CGH DIAGNOSTIC TESTING GENETISURE DX POSTNATAL ASSAY Informed decisions start with a complete microarray platform for postnatal analysis For In Vitro Diagnostic Use INTENDED USE: GenetiSure Dx

More information

Identification of both copy number variation-type and constant-type core elements in a large segmental duplication region of the mouse genome

Identification of both copy number variation-type and constant-type core elements in a large segmental duplication region of the mouse genome Identification of both copy number variation-type and constant-type core elements in a large segmental duplication region of the mouse genome Umemori et al. Umemori et al. BMC Genomics 2013, 14:455 Umemori

More information

Whole-genome detection of disease-associated deletions or excess homozygosity in a case control study of rheumatoid arthritis

Whole-genome detection of disease-associated deletions or excess homozygosity in a case control study of rheumatoid arthritis HMG Advance Access published December 21, 2012 Human Molecular Genetics, 2012 1 13 doi:10.1093/hmg/dds512 Whole-genome detection of disease-associated deletions or excess homozygosity in a case control

More information

Vega: Variational Segmentation for Copy Number Detection

Vega: Variational Segmentation for Copy Number Detection Vega: Variational Segmentation for Copy Number Detection Sandro Morganella Luigi Cerulo Giuseppe Viglietto Michele Ceccarelli Contents 1 Overview 1 2 Installation 1 3 Vega.RData Description 2 4 Run Vega

More information

A Multi-Sample Based Method for Identifying Common CNVs in Normal Human Genomic Structure Using High- Resolution acgh Data

A Multi-Sample Based Method for Identifying Common CNVs in Normal Human Genomic Structure Using High- Resolution acgh Data A Multi-Sample Based Method for Identifying Common CNVs in Normal Human Genomic Structure Using High- Resolution acgh Data Chihyun Park 1, Jaegyoon Ahn 1, Youngmi Yoon 2, Sanghyun Park 1 * 1 Department

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION doi:10.1038/nature13908 Supplementary Tables Supplementary Table 1: Families in this study (.xlsx) All families included in the study are listed. For each family, we show: the genders of the probands and

More information

Nature Structural & Molecular Biology: doi: /nsmb.2419

Nature Structural & Molecular Biology: doi: /nsmb.2419 Supplementary Figure 1 Mapped sequence reads and nucleosome occupancies. (a) Distribution of sequencing reads on the mouse reference genome for chromosome 14 as an example. The number of reads in a 1 Mb

More information

MODULE 4: SPLICING. Removal of introns from messenger RNA by splicing

MODULE 4: SPLICING. Removal of introns from messenger RNA by splicing Last update: 05/10/2017 MODULE 4: SPLICING Lesson Plan: Title MEG LAAKSO Removal of introns from messenger RNA by splicing Objectives Identify splice donor and acceptor sites that are best supported by

More information

NEXT GENERATION SEQUENCING. R. Piazza (MD, PhD) Dept. of Medicine and Surgery, University of Milano-Bicocca

NEXT GENERATION SEQUENCING. R. Piazza (MD, PhD) Dept. of Medicine and Surgery, University of Milano-Bicocca NEXT GENERATION SEQUENCING R. Piazza (MD, PhD) Dept. of Medicine and Surgery, University of Milano-Bicocca SANGER SEQUENCING 5 3 3 5 + Capillary Electrophoresis DNA NEXT GENERATION SEQUENCING SOLEXA-ILLUMINA

More information

CITATION FILE CONTENT/FORMAT

CITATION FILE CONTENT/FORMAT CITATION For any resultant publications using please cite: Matthew A. Field, Vicky Cho, T. Daniel Andrews, and Chris C. Goodnow (2015). "Reliably detecting clinically important variants requires both combined

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION CONTENTS A. AUTISM SPECTRUM DISORDER (ASD) SAMPLE AND CONTROL COLLECTIONS 4 ASD samples 4 Control cohorts 4 B. GENOTYPING AND DATA CLEANING 6 SNP quality control 6 Intensity quality control for CNV detection

More information

Transcript reconstruction

Transcript reconstruction Transcript reconstruction Summary I Data types, file formats and utilities Annotation: Genomic regions Genes Peaks bedtools Alignment: Map reads BAM/SAM Samtools Aggregation: Summary files Wig (UCSC) TDF

More information

Optimizing Copy Number Variation Analysis Using Genome-wide Short Sequence Oligonucleotide Arrays

Optimizing Copy Number Variation Analysis Using Genome-wide Short Sequence Oligonucleotide Arrays Optimizing Copy Number Variation Analysis Using Genome-wide Short Sequence Oligonucleotide Arrays The Harvard community has made this article openly available. Please share how this access benefits you.

More information

SALSA MLPA probemix P360-A1 Y-Chromosome Microdeletions Lot A

SALSA MLPA probemix P360-A1 Y-Chromosome Microdeletions Lot A SALSA MLPA probemix P360-A1 Y-Chromosome Microdeletions Lot A1-1011. This SALSA MLPA probemix is for basic research and intended for experienced MLPA users only! This probemix enables you to quantify genes

More information

Nature Genetics: doi: /ng Supplementary Figure 1. Country distribution of GME samples and designation of geographical subregions.

Nature Genetics: doi: /ng Supplementary Figure 1. Country distribution of GME samples and designation of geographical subregions. Supplementary Figure 1 Country distribution of GME samples and designation of geographical subregions. GME samples collected across 20 countries and territories from the GME. Pie size corresponds to the

More information

Supplementary Figures

Supplementary Figures Supplementary Figures Supplementary Figure 1. Heatmap of GO terms for differentially expressed genes. The terms were hierarchically clustered using the GO term enrichment beta. Darker red, higher positive

More information

Ambient temperature regulated flowering time

Ambient temperature regulated flowering time Ambient temperature regulated flowering time Applications of RNAseq RNA- seq course: The power of RNA-seq June 7 th, 2013; Richard Immink Overview Introduction: Biological research question/hypothesis

More information

MRC-Holland MLPA. Description version 06; 23 December 2016

MRC-Holland MLPA. Description version 06; 23 December 2016 SALSA MLPA probemix P417-B2 BAP1 Lot B2-1216. As compared to version B1 (lot B1-0215), two reference probes have been added and two target probes have a minor change in length. The BAP1 (BRCA1 associated

More information

MODULE 3: TRANSCRIPTION PART II

MODULE 3: TRANSCRIPTION PART II MODULE 3: TRANSCRIPTION PART II Lesson Plan: Title S. CATHERINE SILVER KEY, CHIYEDZA SMALL Transcription Part II: What happens to the initial (premrna) transcript made by RNA pol II? Objectives Explain

More information

MRC-Holland MLPA. Description version 18; 09 September 2015

MRC-Holland MLPA. Description version 18; 09 September 2015 SALSA MLPA probemix P090-A4 BRCA2 Lot A4-0715, A4-0714, A4-0314, A4-0813, A4-0712: Compared to lot A3-0710, the 88 and 96 nt control fragments have been replaced (QDX2). This product is identical to the

More information

38 Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'16

38 Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'16 38 Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'16 PGAR: ASD Candidate Gene Prioritization System Using Expression Patterns Steven Cogill and Liangjiang Wang Department of Genetics and

More information