Integrated detection and population-genetic analysis. of SNPs and copy number variation

Size: px
Start display at page:

Download "Integrated detection and population-genetic analysis. of SNPs and copy number variation"

Transcription

1 Integrated detection and population-genetic analysis of SNPs and copy number variation Steven A. McCarroll 1,2,*, Finny G. Kuruvilla 1,2,*, Joshua M. Korn 1,SimonCawley 3, James Nemesh 1, Alec Wysoker 1, Michael H. Shapero 3,PaulI.W.deBakker 4,JulianMaller 6, Andrew Kirby 6, Amanda L. Elliott 1, Melissa Parkin 1, Earl Hubbell 3, Teresa Webster 3, Rui Mei 3, Robert Handsaker 1, Steve Lincoln 3, Marcia Nizzari 1, John Blume 3, Keith Jones 3, Rich Rava 3,MarkJ.Daly 1,5,6,StaceyB.Gabriel 1, and David Altshuler 1,2,6 * These authors contributed equally to this work. Correspondence: altshuler@molbio.mgh.harvard.edu, mccarroll@molbio.mgh.harvard.edu 1 Program in Medical and Population Genetics and Genetic Analysis Platform, The Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA. 2 Department of Molecular Biology, Massachusetts General Hospital, Boston Massachusetts, USA. 3 Affymetrix Inc., Santa Clara, California, USA. 4 Harvard Medical School, Boston, Massachusetts, USA. 5 Department of Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA. 6 Center for Human Genetic Research, Massachusetts General Hospital, Boston, Massachusetts, USA.

2 Abstract Dissecting the genetic basis of disease risk requires accurate measurement of all forms of genetic variation, including SNPs and copy number variants (CNVs), and is enabled by accurate maps of their locations, frequencies, and population-genetic properties. We developed a hybrid genotyping array that simultaneously measures 906,600 SNPs and copy number at 1.8 million genomic locations. By characterizing the 270 HapMap samples on this array, we developed a physical and linkage-disequlibrium map of human copy number variation (at 2 kb physical resolution) that is informed by integer genotypes across 270 individuals for more than 1,500 copy-number polymorphisms (CNPs) that segregate in the population at an appreciable frequency. We find that copy number variation affects a far smaller portion of the genome than previously reported; 85% of the genes in reported copy-number-variant regions (CNVRs) fall outside the CNVs within these regions. The vast majority (> 95%) of copy-number differences between any two individuals are due to a limited universe of ancestral copy number polymorphisms (CNPs), the great majority of which are in linkage disequilibrium with SNPs and also can be queried by linkage disequilibrium-based approaches. The hybrid genotyping arrays and population-genetic map of copy-number polymorphism described here will enable the integrated exploration and analysis of SNPs and CNVs in human disease. 2

3 Whole-genome association studies were made possible by accurate, detailed maps of human sequence variation, and by accurate analysis methods for typing SNPs. Over the same time that first-generation whole-genome association studies were conducted, the human genome has been demonstrated to show extensive copy-number variation 1-9. Important progress has been made in identifying large genomic regions that appear to harbor CNVs 9, and finer-scale descriptions of many CNVs in select individuals 10. However, the ability to assess copy-number variation for a role in disease is severely limited by the lack of techniques for accurately measuring the copynumber level of each CNV in each patient, and the lack of basic enabling knowledge about the precise locations and allele frequencies of most of the copy number polymorphisms (CNPs) that segregate in the population 11. We sought to develop hybrid oligonucleotide microarrays with the ability to accurately analyze SNPs and copy-number variation simultaneously across the genome, and to use these arrays to develop an improved understanding of the physical locations and population-genetic properties of human copy-number polymorphisms, that could inform medical genetics research. Results Development of hybrid SNP/CNV genotyping arrays Our ability to expand content on genotyping arrays was enabled by an insight which increased the efficiency with which genetic variation was captured. The Affymetrix 500K array interrogated each SNP with different 25mer probes, designed to both strands and at different offsets with respect to each SNP; we evaluated the information content of each probe with respect to the genotype at each SNP provided by the International HapMap Project 12,13. Logistic regression revealed that among the different probes used for each SNP, some were far more informative than others (Fig. 1a). Using only the best A/B probe pair for each SNP supported 1

4 genotyping performance only slightly diminished compared to using all probes (Fig. 1b). This result, supported by other studies 14,15, suggested an alternative design, involving multiple replicates of the most informative probe pair rather than a single copy each of many different probe sequences. A simulation of such a design based on technical replicates achieved equivalent or improved performance with dramatically fewer probes (Fig 1c). A prototype microarray (SNP 5.0) was manufactured using the best A/B probe pair each in 4 copies, requiring only 4M probes (on one array) to interrogate the same SNPs as the 13M-probe Affymetrix 500K array set (Table 1). Cluster separation between SNP genotype classes was improved (Supplementary Fig. 1). Encouraged, we extended these principles to design a next-generation array with increased SNP coverage and explicit copy number detection. We screened more than 2 million additional SNPs (chosen from HapMap and dbsnp) on a prototype array, identifying 700 thousand additional SNPs for which individual A/B probe pairs yielded effective cluster separation. A total of 936 thousand SNPs were selected to optimize coverage of common patterns of variation in the three Hapmap analysis panels (Table 1). For each SNP, the best A/B probe pair was placed on the array 3-4 times. We empirically selected 900 thousand additional probes to directly interrogate copy number variation, unrestricted by the locations and sequence properties of SNPs. Thirteen million candidate probe sequences, designed to interrogate sites spread across the genome, were placed on a prototype array. To identify the probe sequences most responsive to copy-number variation, a titration experiment was performed in which the ratio of two reduced representations of the genome 16 (the Nsp and Sty fractions) was varied; each probe was evaluated for its correlation to DNA dosage across the titration series (Methods). To provide uniform coverage across the genome, 800 thousand probes were selected based on technical performance and physical 2

5 spacing. We sought to make the genome-wide coverage as uniform as possible within the constraints of the technology (Fig. 1d), including the majority of low-copy repeat regions (LCRs) in which we were able to find unique 25mer sequences (high-copy LCRs present in six or more copies per diploid genome were not covered). To maximize detection in regions of previously reported copy number variation, 140 thousand additional probes were targeted at high density in 3,700 regions reported to contain CNVs 1-9. Initial performance evaluation These arrays represented a first attempt to try to collect SNP and copy-number information across the genome on a common array platform and with an empirical probe-selection strategy. It was therefore critical to evaluate the quality of the data generated. A novel SNP genotyping algorithm was developed for the arrays (JMK, FGK, et al., accompanying manuscript). Across a variety of samples and laboratories we observed genotyping completeness greater than 99%; concordance with HapMap genotypes exceeded 99.5% (JMK, FGK, et al., accompanying manuscript). Coverage was substantially improved compared to the Affymetrix 500K platform: among the 3 million SNPs on HapMap Phase II with minor allele frequency 5%, the SNP 6.0 array provided a mean maximum r 2 of 0.89 in YRI, 0.95 in CEU, and 0.94 in CHB+JPT using multi-marker haplotype tests 17 (Table 1); still-greater coverage may be possible using imputation 18. We evaluated the array probes performance for copy-number analysis by examining probes on the X chromosome using a signal-to-noise ratio (SNR) to evaluate such probes abillity to distinguish males from females. Individual copy-number probes on the SNP 6.0 array displayed greater SNR (median = 2.0) than SNP probe sets on the 500K or SNP 5.0 arrays (median = 1.4). Of course, an individual BAC probe (150 kb, approximately six thousand times larger) provides 3

6 still-higher SNR (for BAC probes from an earlier study 9, we found that median SNR= 6.9). To evaluate the overall resolution of an array for copy-number analysis, it is necessary to analyze information per fixed span of genome, combining information for adjacent probes. The SNR of the typical BAC probe with respect to large variants (which span an 150 kb BAC probe) was achieved with about 10 probes from the SNP 6.0 array, spanning a mean of 22 kb (Fig. 2a). Of course, both techniques appear to be able to identify CNVs much smaller than 150 kb or 22 kb, as we describe below. Physical extent of human copy-number variation We used this array to develop a high-resolution map of copy-number variation in the 270 HapMap samples. Our goal was to construct a map that was precise and accurate in two dimensions: (1) the boundaries of the regions affected by CNV (what DNA is encompassed); and (2) the measurement of an accurate integer copy-number genotype in each individual. We used two computational approaches to identify CNVs in this dataset: a hidden Markov model Birdseye (JMK et al., accompanying manuscript), and an approach based on correlation between nearby probes across a large population sample (Methods). To maximize the quality of reported findings, we ran duplicate experiments in independent labs, and report only the CNVs that were observed in both experiments, in the same samples and at essentially identical genomic locations. Using these stringent criteria, we identified 3,084 CNV regions among the HapMap samples. To estimate the sensitivity of our approach for identifying CNVs, and its precision for estimating their locations, it was important to compare our results to a completely independent data set. We compared these results to a set of CNVs that had been identified in eight of the HapMap 4

7 individuals by a sequence-based method (fosmid end-sequence-pair [ESP] analysis) and localized by complete fosmid resequencing or targeted 200bp-resolution oligo array CGH (J Kidd, G Cooper, E Eichler, et al., in press). Of the independently-identified CNVs from these eight individuals, our data identified 76% of the CNVs larger than 10 kb and 64% of the CNVs 5-10 kb. We identified only 27% of the ESP-identified CNVs smaller than 5 kb (and believe the true sensitivity of our approach for such small CNVs to be much less than 27%, since both approaches have sharply diminishing senstivity as CNV sizes fall below 4 kb). Our estimates of the locations of CNV regions (whose resolution is limited by the spacing of probes on the array) differed from the sequencing-established breakpoints by a median of 1.6 kb. Based on these results, we estimate that our map describes the majority (70%) of CNVs larger than 5 kb that are present in the 270 HapMap individuals, at approximately 2 kb resolution. We compared this map to two other data sets: a catalog of copy-number-variant regions from the same 270 HapMap samples identified by BAC arraycgh and the 500K array 9, and the full set of structural variants identified in eight of these same individuals by analysis of fosmid ESPs (J Kidd, E Eichler, et al., in press). We identified CNVs within 82% of the regions identified by Redon et al., but the CNVs were generally far smaller (5-10 fold) than the size of the CNV regions previously reported (Fig. 2 b-h). By contrast, our size estimates showed good concordance (Fig. 2i) with estimates based on the apparent size discrepancy of fosmid ESPs. The agreement of our data with a sequence-based method confirms that the physical scale of CNVs is far smaller than implied by the coordinates of the CNV regions previously described in the HapMap samples. This is different from the prevailing interpretation of earlier CNV data, which has generally assumed that such events are large (>100 kb) and are a leading indicator of a far-larger number of intermediate-size ( kb) CNVs yet to be discovered 19. Numerous published analyses of the functional content of CNVs the genes, ultraconserved elements, and 5

8 segmental duplications they contain are based on a literal interpretation of the reported coordinates of CNV regions. Such findings will need to be reevaluated in light of the 5-10 fold downsizing of most of these regions. As just one example, an earlier finding that CNVs disproportionately affect structural proteins is based on a 400-kilboase CNV region that contains a family of 20 structural-protein-encoding genes in the late cornified envelope family 9,20.Our results indicate that only about 45 kb of this 400 kb region is affected (by two distinct, common CNVs), and that only two (rather than twenty) of these genes are copy number variant (Fig. 2f). Moreover, estimates that 12% of the genome is involved in large-scale copy-number variation 9, and that 18% of the genome is involved in copy-number variation at all scales (Database of Genomic Variants), do not appear to be supported; we estimate that less than 3% of the genome is affected by variants larger than 50 kb in these 270 individuals, and that a still-smaller fraction of the genome (0.3%) differs in copy number between any two individuals (see below), at the scale (> 5kb) at which we can characterize CNVs on this platform. Common, rare, and de novo copy number variation Almost half of the CNV regions identified (1,501) were observed in a single family, but not in unrelated individuals; we refer to these here as rare CNVs. (A few may also reflect somatic events or cell-line artifacts.) The other half of the CNV regions (comprising 1,583 nonoverlapping regions) were observed in multiple unrelated individuals, generally across genomic segments that were indistinguishable at the resolution of the array (Fig. 2 b-f). We refer to these as copy number polymorphisms (CNPs), because they appear (at 2kb resolution) to involve the same affected genomic sequence in each individual (an inference that was not possible in BACresolution CNV studies or in CNV catalogs made from isolated individuals) and are therefore consistent with a model of a polymorphism that segregates in the population at an appreciable frequency. 6

9 Because CNPs segregate at an appreciable frequency, they can be heterozygous or homozygous in many individuals genomes, giving rise to three or more potential copy-number levels in a population. An individual s status for a CNP is therefore not well-described by terms such as gain or loss relative to a normal reference. To assess the role of CNPs in disease and population genetics, the integer copy-number level of each CNP locus must be accurately measured in each individual (a CNP genotype ). To genotype these 1,583 CNPs, the intensity measurements of the probes corresponding to each CNP were summarized into a single measurement for each sample (Supplementary Table 1); these quantitative measurements were then clustered into a series of discrete classes corresponding to successive integer copy-number levels (Fig. 3 a-f, Supplementary Fig. 2). A set of heuristics, informed by the relative copynumber intensity measurements across these genotype classes (Methods), was then used to assign a specific integer copy-number level to each class (Fig. 3 a-f, Supplementary Table 1). The results defined two distinct subsets of CNPs, which we here refer to as diallelic and multiallelic. For most CNPs (1,434 of 1,583), we observed two or three diploid copy-number levels, whose population frequencies were consistent with the Mendelian segregation of two copy-number alleles in Hardy-Weinberg equilibrium (Fig. 3 c-d). A minority of CNPs (149 of 1,583, approximately 9%) could not be explained by a simple, two-allele system, because the population data was distributed into four or more diploid copy-number levels (Fig. 3e)orinto three levels for which the intermediate level was overwhelmingly the most common (Fig. 3f). (For an additional 180 regions not counted in the above figures, measurements did not cluster into discrete classes; these could in principle represent more-complex CNVs, or events for which our array lacked the precision to measure the underlying genotypes accurately.) To evaluate the accuracy of these copy-number assignments (and in the absence of a gold- 7

10 standard set of reference genotypes analogous to HapMap SNPs), several approaches were used. To evaluate genotypes for diallelic CNPs, we applied quality-control criteria analogous to those used to assess SNP genotypes. Diallelic CNPs showed low rates of deviation from Mendelian inheritance (0.4%), a rate consistent with a low rate of genotype error (comparable to that of highquality filtered SNP genotypes in HapMap 13,21 ). Genotypes for common diallelic CNPs also conformed to Hardy-Weinberg equilibrium (failing at p < 0.01 with a rate of 0.015). To evaluate genotypes for multiallelic CNPs (Fig. 3 e-f), we used inheritance (Fisher s h) to measure the correlation between parental and offspring copy-number levels. Across 58 multiallelic CNPs that segregated at high frequency among the CEU and YRI trios (with at least 10% of individuals showing a non-modal copy number), Fisher s h was distributed with a mean of 1.0 (the level expected for a perfectly heritable trait), and was significantly (p < 0.01) less than 1.0 for only two (of 58) common, multiallelic CNPs. These genotype data made it possible to assess the nature of human copy-number variation in a manner informed by the frequency of each copy-number class in each population. On average, we estimate that two unrelated individuals from the same population differed in copy number at 225 (in CEU, JPT, CHB) or 290 genomic loci (in YRI), spanning 6.4 Mb of the genome (6.7 Mb in YRI), and overlapping the transcribed regions of 100 genes (120 in YRI). Approximately 95% of the copy-number differences in this comparison result from CNPs (versus only 5% from rare CNVs), and 82% from a subset of CNPs that segregate at high frequency (minor allele frequency > 5%). This suggests that a limited universe of common CNPs captures a very large component of human copy number variation, a pattern of variation similar to that observed previously for SNPs. The fact that we did not detect CNVs in thousands of the CNV regions described in current databases 22 (despite having dozens of probes within each of those regions) suggests that many 8

11 such reported CNVs are rare variants (or false discoveries), consistent with the allelic spectrum we observe (in which fully half of the CNV regions we observe are rare CNVs) and therefore partially explaining the lack of concordance that has been observed among CNV data sets ascertained in different individuals 23. These data were also able to explain an earlier pattern in CNV studies, the identification of a larger number of CNV regions in population samples with African ancestry than in similarlysized population samples without African ancestry. Using the CNP genotypes to parse this result by allele frequency revealed that it is entirely explained by CNPs with allele frequency less than 15% (Fig. 3g). Like a similar relationship for SNPs, this indicates the loss of many lowfrequency alleles in the narrow population bottlenecks that accompanied human migration out of Africa. Some specific CNPs have been observed to segregate at different frequencies in different populations, an observation which could in theory be due to the action of recent selection on CNP alleles 24,25. It is reasonable to speculate that large-scale polymorphisms may be particularly likely to have functional effects that would cause their allele frequencies to be shaped by recent selection, but this hypothesis has not been tested on a genome scale. We used the CNP genotype data to assess the distribution of F st, a measure of population differentiation in allele frequencies, for SNP and CNP genotypes. The distributions of F st for CNPs and SNPs were indistinguishable (Fig. 3h), suggesting that common CNPs are not more influenced by recent selection than are common SNPs. We performed a carefully curated analysis to identify de novo CNVs among the offspring in 60 HapMap father-mother-offspring trios (Methods). All but ten of the rare CNVs observed in trio offspring were also observed in at least one parent. (An unknown fraction of these ten could be 9

12 somatic mutations in the offspring, or cell-line mutations, that are not present in the individual s germ line.) Thus, even when ascertained in a cultured lymphoblastoid cell line, the contribution of de novo CNV formation to an individual s genome appears to be more than 100 times smaller than the contribution of inheritance. Combined with the observed low rate of Mendelian inconsistencies for common CNPs, these results suggest that the copy-number differences observed among normal individuals are overwhelmingly inherited from parents. Linkage disequilibrium properties of CNPs Previous estimates of linkage disequilibrium between SNPs and CNPs have been constrained by the limited number of CNPs for which accurate genotypes could be obtained, by insufficient knowledge of the true locations of CNPs detected by large clones, and by diminished density of effectively typed SNPs (that could serve as potential tags) in the repeat-rich regions in which CNVs are enriched. We assessed linkage disequilibrium in two ways: for common CNPs, by correlation to other genetic markers; for rarer CNPs, by the haplotype diversity of chromosomes that carry the event. A majority of common, diallelic CNPs (with minor allele frequency greater than 5%) were perfectly captured (r 2 = 1.0) by at least one SNP tag from HapMap Phase 2 (Fig. 4a). Common CNPs showed a modest shift in the distribution of r 2 compared to frequency-matched SNPs in the same populations (Fig. 4a). This taggability gap, which has been noted in earlier studies, could in principle be due to recurrent rearrangements at CNV sites 4,9 or to a paucity of SNPs in repeat-rich regions to serve as potential tags 8,9. To distinguish between these potential explanations, we examined the relationship between r 2 and physical distance from the estimated 10

13 CNP breakpoint. Recurrent formation of a CNV would reduce correlations to flanking markers; by contrast, a paucity of SNP tags would would reduce the number of potential tags without affecting the relationship between r 2 and physical distance. Mean r 2 as a function of distance was indistinguishable for SNPs and CNPs (Fig. 4b), suggesting that common, diallelic CNPs are overwhelmingly ancestral mutations. (We note that multiallelic CNPs, and CNPs not genotyped by our platform, may have different properties.) The linkage disequlibrium properties of less-common CNVs have not been previously investigated. In disease studies, a CNV site observed in multiple patients may arise from shared ancestry at a locus (a low-frequency CNP), or alternatively from independent structural mutations at the site (distinct and possibly de novo CNVs, which might offer stronger evidence of causality between the locus and the disease). To evaluate the extent to which shared rare CNVs in normal individuals arise from shared ancestry, we identified 162 CNVs that were observed in exactly two unrelated YRI individuals in HapMap. We used inheritance to phase these CNV alleles onto the SNP haplotypes defined by HapMap SNP genotypes in the same trios. Rare CNVs that were observed in two unrelated individuals were almost always present on the same SNP haplotype in both individuals (Fig. 4c). This haplotype sharing was robustly detectable at distances greater than 200 kb (Fig. 4c), suggesting that sharing of rare CNVs can be a proxy for sharing of muchlarger genomic regions. Dissecting complex CNVs Studies of copy-number variation have noted the complexity of copy number variation. Several CNVs have been identified that cannot be understood in terms of a single mutational event. We explored a number of these and found that they could be explained not by one, but by two or three 11

14 simple mutations. In some cases, the apparent complexity of a CNV (Fig. 5 a, b) resulted from interrogating multiple chromosomes together. For example, a deletion region that was believed to be longer and more complex in one individual than in others 6 appears to be explained by two nearby, non-overlapping deletion polymorphisms (both common) that the individual inherited from different parents (Fig. 5c); both deletions segregate on specific SNP haplotypes and therefore appear to be unique ancestral events. Another CNV that initially appeared to show architectural complexity (Fig. 5b) was readily explained as the combination of two simpler deletion polymorphisms that were segregating separately (Fig. 5d) on specific SNP haplotypes. Many apparently complex CNVs can thus be incorporated into association studies as their molecular and population-genetic nature is elucidated. Copy-number analysis of earlier whole-genome scans While arrays of this sort make it possible to directly interrogate copy-number variation in new disease studies, tens of thousands of samples have already been scanned on first-generation SNP arrays. It was therefore important to assess what information can be gleaned about copy number variation from the SNP assays on earlier arrays. We had hypothesized that the content of firstgeneration SNP arrays was systematically biased against genomic regions affected by common CNPs, because common CNPs cause SNP data to fail. Our map of copy-number polymorphisms and their frequencies confirmed that this hypothesis was largely correct: common CNPs generally corresponded to bald spots in the physical coverage of first-generation SNP arrays (Fig. 6a, b). Of 423 CNPs that we observed to be common (minor allele frequency > 5%) in HapMap, only 24% (103 / 423) were interrogated by two or more SNPs on the Affymetrix 500K or Illumina 650Y arrays, and most (56%) were not interrogated by even a single SNP assay (Fig. 6c). By contrast, low-frequency CNPs (as well as rare CNVs) showed little or no bias (Fig. 6b), and were 12

15 consequently much better captured: 80% (646 / 1075) of low-frequency CNPs (maf < 5%) were interrogated by at least one SNP, and 64% by two or more SNPs (data not shown), with a probe density comparable to coverage of the genome as a whole (Fig. 6b). (Sensitivity for copy number analysis is a function not only of number of probes, but of measurement precision and analysis algorithms; however, a CNP that is not interrogated by any probes on a SNP array platform cannot be measured by any algorithm). We were concerned that this analysis might be confounded by the fact that we had defined these CNPs using the SNP 6.0 array; we therefore analyzed coverage of a completely independent set of 100 CNV regions discovered by fosmid ESP analysis and refined by complete sequencing (J Kidd, E Eichler, et al., in press). These sequencing-defined CNV regions showed a coverage bias that was similar to (though slightly less extreme than) that of common CNPs, reflecting that they are a mixture of common CNPs, lowerfrequency CNPs, and rare CNVs (Fig. 6b). The extent to which earlier SNP arrays are blind to common CNPs has not been previously appreciated, because estimates of CNV coverage by SNP array platforms have utilized the Database of Genomic Variants 22, which appears to include much non-cnv genome in CNV definitions (Fig. 2b-f, h; Fig. 6a) and does not distinguish between common CNPs and rare CNVs. One implication of this observation is that, although there exists an extensive literature on mining copy-number information from SNP array data, efforts to extract copy-number information from earlier whole-genome association studies will be unable to observe most common CNPs directly. Another possibility is that the disease effects of some common CNPs could be captured in wholegenome association studies through linkage disequilibrium to SNPs that were typed. We used the linkage-disequilibrium map of CNPs to assess capture of common CNPs by linkage disequilibrium on five commercial array platforms widely used for genome-wide association 13

16 studies (Fig. 6d). Slightly under half (40-50%) of common CNPs were captured (r 2 >0.8)by markers on first-generation SNP arrays (a larger fraction could in principle be identified by multimarker haplotypes 17 or imputation 18 ). This suggests that at least a partial picture of the contribution of common CNPs to disease might be obtained by analyzing the association of typed markers that are in linkage disequilibrium with CNPs; such an analysis could be supplemented by using the intensity data from the SNP arrays to identify rare and de novo CNVs that span the typed SNPs. Having made an accurate map of common CNPs and their linkage disequilibrium properties, it is now possible to specify which allelic tests can be used to capture each CNP using the data from each array platform. As a resource to support groups wanting to evaluate the potential association of disease phenotypes with common CNPs, we provide a list of allelic tests (Supplementary Data) that can be used to infer the states of CNPs from SNPs typed in whole-genome association studies to date. Discussion We have described the creation of experimental tools to simultaneously interrogate SNPs and CNVs, and their use to develop a physical and linkage disequilibrium map of CNPs in the HapMap cohorts. Our results suggest that the extent of SNP and intermediate-to-large-scale (>10 kb) copy-number variation in each of us is similar, rather than disproportionately due to CNVs. The great majority of genes and genomic regions previously reported to be copy-number variant are not so at an appreciable frequency. The population genetics of SNPs and CNPs (frequency, population differentiation, linkage disequilibrium patterns) are much more similar than previously thought. As both common CNPs and rare CNVs appear to derive overwhelmingly from inheritance and shared ancestry, linkage disequilibrium-based analyses will provide an important 14

17 context for interpreting putative associations of SNPs and CNVs with phenotypes. An integrated understanding of the contribution of SNPs and copy-number variation to human phenotypes will help elucidate the molecular meachanisms of disease. 15

18 Methods Experiments DNA from the 270 HapMap individuals was obtained from Coriell Laboratories, and independently prepared, labeled, and hybridized to the SNP 6.0 arrays (in distinct batch configurations and plate layouts) at Affymetrix and the Broad Institute. This experimental protocol is described elsewhere; briefly, Nsp and Sty restriction enzymes are used to prepare a reduced representation of the genome that is amplified, biotin-labeled, and hybridized to arrays, which are then fluorescently labeled and scanned to yield a measurement of hybridization intensity for each probe. Such experiments are single-channel (one sample per array). To ensure that probe-level intensity measurements from different samples were comparable, the set of 6.9 million probe-specific measurements from each of the 270 individuals was quantile normalized to a common distribution. Probe sets corresponding to the same allele of the same SNP (3-4 probes each) were summarized by median polish. This yielded 2.7 million measurements for each sample (corresponding to 940 thousand individual copy-number probes, and 2 x 900 thousand SNP alleles), which became the input to downstream analyses. Identification of CNVs Identification of CNVs using the HMM Birdseye is described in the accompanying manuscript (JMK et al., accompanying manuscript). We developed an additional method for identifying common CNPs, built around utilizing the pattern of intensity measurements acorss a large 16

19 population. In this approach, we searched the genome for regions across which multiple copynumber probes showed highly correlated patterns of intensity (across the HapMap samples). All sets of 2, 4, and 8 consecutive probes were analyzed. The correlation score for any set of 2, 4, or 8 probes was the median pairwise correlation of their measurements across the 270 samples. To create an empirical null distribution with which we could determine the probability of observing particular correlations by chance, we randomly permuted the locations of all probes and repeated the analysis. To set thresholds (separate for each window size) for determining correlation scores to be significant, we compared the empirical distribution to the null distribution, identified the correlation score at which the height of the empirical distribution was ten times greater then the height of the null distribution, and used this as a threshold. This identified groups of highly correlated neighboring probes across the genome. We agglomeratively clustered overlapping groups of probes. The result was the identification of a series of genomic segments over which particular population copy-number patterns prevailed. To further refine the boundaries of these genomic segments of persistent probe-to-probe correlation, we summarized the measurements from experiment 1 into a single summarized measurement per sample per CNV segment (using the median polish of the log-intensities, normalized to the median intensity of all copy-number probes on the array), then identified (from the data in experiment 2) the individual copy-number probes and SNP probe sets for which intensity measurements were correlated with the (CNP-summarized) measurement from the first experiment. In this way, we defined a potential breakpoint using information across as many samples as possible. To merge the catalogs of CNVs identified by this approach with lower-frequency CNVs discovered by Birdseye, (1) CNVs identified by one or the other approach were maintained intact; (2) for regions identified by both approaches, we maintained the definition of the region obtained 17

20 by the probe-correlation approach, since it was informed by a larger number of observations. We excluded CNVs that potentially represented artifacts in the sample preparation process (for example, for which probe measurements could be influenced by a single restriction site, or by linkage disequilibrium between pairs of cryptic sequence polymorphisms). CNV regions and sizes were defined by the span of the genomic sites interrogated by all probes offering contributing evidence to a CNV (which tends to slightly underestimate CNV size). Genotyping of copy number polymorphisms A typical use of genotyping arrays will involve hybridizing each indvidual sample to a single array and performing automated computational analyses of the data; we describe in the accompanying manuscript (JMK, FGK, et al.,) new algorithms for such an application. To create a reference data set of particular crispness and utility, we utilized here the independent experiments performed on the same samples in labs at Broad and Affymetrix, and further undertook a process of manual curation of the genotypes for each CNP. For each CNP probe set, we generated a scatter plot comparing the summarized intensity measurement (across all probes contributing to the identification of the CNP, summarized using the median polish of the logintensity measurements) of each HapMap sample in experiment 1 versus the summarized intensity measurement for the same sample in experiment 2; we then manually identified each of the discrete copy-number levels present in the population (Fig. 3 a-f). We generated cluster assignments only for those samples belonging to the same class in both experiments. To assign a specific integer copy number to each genotype cluster (Fig. 3 a-f), we utilized an observation and an assumption: (1) the observation that the relationship between DNA copy number and hybridization intensity is slightly non-linear, with intensity increasing not quite 18

21 linearly with the number of DNA copies; (2) the assumption that the series of copy-number classes present in the population represent consecutive integer copy-number levels. The combination of this observation and assumption led to a simple heuristic, in which the ratios of the average intensities for consecutive copy-number classes were used to inform the appropriate set of consecutive integer copy-number levels to assign to each class. For example, for a CNP for which two copy-number classes were observed (Fig 3 c-d), the observation of an intensity ratio greater than 3/2 but less than 2/1 (between the two classes) evidenced that the CNP was a deletion CNP (producing copy number classes of 1 and 2, with an intensity ratio less than 2/1 but greater than 3/2) rather than a duplication CNP (producing copy number classes of 2 and 3, with an intensity ratio less than 3/2) (Fig. 3 a-b). Identification of de novo CNVs We sought to identify any rare CNVs that were present in an individual and in neither of that individual s parents, using the 60 CEU and YRI trios from HapMap. We used the Birdseye CNV-discovery algorithm (JMK et al., accompanying manuscript) to produce a list of putative CNVs in each individual. To minimize false discoveries, we utilized the data from two independent experiments and required that a CNV be observed in both experiments. Any apparent chromosome-scale anomalies were removed from analysis. We joined CNVs across samples if they significantly overlapped across at least 80% of their length. We removed regions of known somatic rearrangement (immunoglobulin regions ) as well as common CNPs (for which inheritance was separately evaluated using clustering-based genotypes and found to conform to Mendelian inheritance, as described in the main text of this manuscript.) Genome-wide CNV discovery suffers from a bias in which, because of the large number of sites tested genome-wide, false negatives are tolerated at a much higher rate than false positives; such arbitrary genomewide statistical thresholds can create the appearance of a de novo CNV that is in fact inherited 19

22 (but simply missed in the parent by the CNV discovery algorithm). We therefore sought to specifically measure without such bias the copy number of each individual in the population at each of these CNV sites (or more precisely, the probability distribution across potential copynumber levels given the observed hybridization data). We used Birdseye to estimate a posterior probability of each copy number level in each sample at each CNV site (JMK et al., accompanying manuscript). We considered a CNV to be de novo if (1) the probability of the altered copy number level in the child was three orders of magnitude greater than the probability of normal copy number level; and (2) the probability of the altered copy number level in each parent was three orders of magnitude smaller than the probability of normal copy number level. Linkage disequilibrium analysis Linkage disequilibrium analysis utilized the CNP genotypes obtained from clustering, for both low- and high-frequency diallelic CNPs (Fig. 3 a-f). Inheritance in the CEU and YRI trios was used to determine phase of each CNP in each trio. These phased CNP genotypes were then compared to the phased SNP haplotypes for the same trios. Phased SNP genotypes were obtained from HapMap. A window of 200 kb around each CNP s start and end was used for analysis. Genome coordinates All physical positions described in the text and figures utilize the hg17 (b35) build of the human genome sequence. Availability of data Raw hybridization data have been deposited into NCBI s dbgap (accession: phs v1.p1) and are publicly available. CNP-summarized intensity data are provided here both in raw form (Supplementary Data), and plotted for each CNP to illustrate cluster separation (Supplementary Figure 1). 20

23 Acknowledgements We wish to thank Jeff Kidd, Greg Cooper, and Evan Eichler for sharing unpublished data on high-resolution breakpoints of select CNVs from sequencing and array CGH; and Eric Lander, Joel Hirschhorn, and Sekar Kathiresan, for thoughtful readings of the manuscript. 21

24 Table 1. Design characteristics of array platforms described in this study. Estimated coverage (average maximum r 2 ) is for common (maf > 5%) SNPs in HapMap Phase II. Multi-marker coverage is computed using two-snp haplotypes as described in K SNP 5.0 SNP 6.0 Number of arrays Total number of probes 13.0 million 4.6 million 6.9 million Interrogation of SNPs Probes per SNP Designed SNP assays 500, , ,000 Validated / activated SNP assays 500, , ,600 Coverage (single-/multi-marker) YRI 0.63 / / / 0.89 CEU 0.78 / / / 0.95 CHB + JPT 0.78 / / / 0.94 Copy number (CN) probes CNVR-targeted CN probes 0 100, ,000 Additional CN probes 0 320, ,000 All CN probes 0 420, ,000 22

25 References 1. Sebat, J. et al. Large-scale copy number polymorphism in the human genome. Science 305, (2004). 2. Iafrate, A.J. et al. Detection of large-scale variation in the human genome. Nat Genet 36, (2004). 3. Tuzun, E. et al. Fine-scale structural variation of the human genome. Nat Genet 37, (2005). 4. Sharp, A.J. et al. Segmental duplications and copy-number variation in the human genome. Am J Hum Genet 77, (2005). 5. Hinds, D.A., Kloek, A.P., Jen, M., Chen, X. & Frazer, K.A. Common deletions and SNPs are in linkage disequilibrium in the human genome. Nat Genet 38, 82-5 (2006). 6. Conrad, D.F., Andrews, T.D., Carter, N.P., Hurles, M.E. & Pritchard, J.K. A high-resolution survey of deletion polymorphism in the human genome. Nat Genet 38, (2006). 7. McCarroll, S.A. et al. Common deletion polymorphisms in the human genome. Nat Genet 38, (2006). 8. Locke, D.P. et al. Linkage disequilibrium and heritability of copy-number polymorphisms within duplicated regions of the human genome. Am J Hum Genet 79, (2006). 9. Redon, R. et al. Global variation in copy number in the human genome. Nature 444, (2006). 10. Korbel, J.O. et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 318, (2007). 11. McCarroll, S.A. & Altshuler, D.M. Copy-number variation and association studies of human disease. Nat Genet 39, S37-42 (2007). 12. A haplotype map of the human genome. Nature 437, (2005). 13. Frazer, K.A. et al. A second generation human haplotype map of over 3.1 million SNPs. Nature 449, (2007). 14. Smemo, S. & Borevitz, J.O. Redundancy in genotyping arrays. PLoS ONE 2, e287 (2007). 15. Antipova, A.A., Tamayo, P. & Golub, T.R. A strategy for oligonucleotide microarray probe reduction. Genome Biol 3, RESEARCH0073 (2002). 16. Altshuler, D. et al. An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 407, (2000). 17. de Bakker, P.I. et al. Efficiency and power in genetic association studies. Nat Genet 37, (2005). 18. Marchini, J., Howie, B., Myers, S., McVean, G. & Donnelly, P. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet 39, (2007). 19. Stranger, B.E. et al. Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 315, (2007). 20. Cooper, G.M., Nickerson, D.A. & Eichler, E.E. Mutational and selective effects on copynumber variants in the human genome. Nat Genet 39, S22-9 (2007). 21. Consortium, I.H. A haplotype map of the human genome. Nature 437, (2005). 22. Zhang, J., Feuk, L., Duggan, G.E., Khaja, R. & Scherer, S.W. Development of bioinformatics resources for display and analysis of copy number and other structural variants in the human genome. Cytogenet Genome Res 115, (2006). 23. Scherer, S.W. et al. Challenges and standards in integrating surveys of structural variation. Nat Genet 39, S7-15 (2007). 24. Kidd, J.M., Newman, T.L., Tuzun, E., Kaul, R. & Eichler, E.E. Population stratification of a common APOBEC gene deletion polymorphism. PLoS Genet 3, e63 (2007). 25. Perry, G.H. et al. Diet and the evolution of human amylase gene copy number variation. Nat Genet 39, (2007). 26. Pe'er, I. et al. Evaluating and improving power in whole-genome association studies using fixed marker sets. Nat Genet 38, (2006). 23

26 Figure legends Figure 1. Design of new microarrays (a) Density plots of information content at a probe level for 10,000 randomly selected SNPs from the Nsp array of the Affymetrix 500K platform. Information is defined in the sense of the C-statistic of logistic regression, that is, how effectively a given probe could be used to correctly genotype Hapmap samples via logistic regression. (b) Using the information metric described above, the best and worst probe pair (in a joint sense) could be defined. A random probe pair was also selected. Using these selections, the Bayesian Robust Linear Model using Mahalanobis Distance (BRLMM) algorithm was run on the 270 Hapmap samples (Sty fraction), but only allowed to use two probes ( blinded performance). The comparison of the normal operation of BRLMM (all pairs) to the blinded performance of the best pair, random pair, and worst pair is shown in terms of call rates and concordances. Error bars designate the 90% confidence intervals (t-test, n=270). (c) A single sample was hyribidized to 21 Affymetrix 500K arrays. Using these empirical data, virtual chips were simulated in which each probe was tiled up to 21 times. For each simulation of the number of probes, the experiment was conducted in triplicate, that is, three random draws from the 21 replicates were performed. Shown are the means and 90% confidence intervals of call rates and concordances (t-test, n=3). Horizontal lines represent empirical performance of the Affymetrix 500K array (Nsp fraction). (d) Physical coverage for copy-number analysis. Shown here is the percent of the genome that has a probe within X base pairs on each side. 1

27 Figure 2. Discovery and sizes of CNV regions (a) Signal-to-noise ratio as a function of physical resolution. Black curve indicates the SNR of WGTP BAC probes from an earlier study used to make an initial map of copy number variation across the HapMap samples (Redon et al.). Red curve indicates the SNR of sets of probes on 6.0 array, as a function of the distance spanned by those probes. Dotted lines indicate that the SNR achieved by a typical BAC (150 kb) is achieved at approximately 22 kb physical resolution by probes on the 6.0 array. (b -f). Revision of CNV regions. Heatmap representation of copy-number measurements from 6.0 array in 90 individuals (HapMap CEU). At each probe, intensity measurements (relative to median sample) are represented by shades of orange, with red corresponding to reduced intensity and yellow to increased intensity. Note that at sites of common CNPs, the median individual can be heterozygous for a CNP allele and therefore have an intermediate copy number. CNV definitions from earlier studies are indicated by blue horizontal lines. In figure f, triangles indicate genes in the late cornified envelope (LCE) gene family. (g). Length distribution of CNPs observed to segregate at an appreciable frequency. (h -i). Comparison of the estimated sizes of CNPs identified in this study, to corresponding size estimates of the same CNPs from studies by Redon et al. (h) and JK, GC, EE, et al., in press (i). 2

28 Figure 3. Classes of copy-number variants; reproducibility and discrete quality of copy-number measurements (a -f) Reproducible population distributions of discrete-valued copy number measurements. Each HapMap sample was analyzed in two different labs; probes across each CNP region (Fig. 2 b-f) were used to derive a summarized measurement for each sample for each CNP, in each experiment. Observed categories of variation include rare CNVs (a, b), in which an altered copy-number level is observed in only a single individual, or in an individual and her offspring; diallelic CNPs (c, d), for which two copy-number alleles each appear to segregate at an appreciable frequency; and multiallelic CNPs (e, f) that are not explained by a simple, two-allele system. (g) Frequency distribution for CNPs in different HapMap populations (red = CEU; green = JPT+CHB; blue = YRI). (h) Quantile-quantile plot comparing F st (for HapMap CEU and YRI populations) for CNP and SNP genotypes. 3

29 Figure 4. Linkage disequilibrium properties of CNPs. (a) Taggability of common CNPs (red) and common SNPs (black), expressed as the maximum correlation (r 2 ) to an individual SNP (data shown for HapMap CEU). (b) Average strength of linkage disequilibrium (mean r 2 ) as a function of distance from a polymorphism, for common (maf > 5%) CNPs (red) and common SNPs (black) in HapMap CEU. (c) Average strength of linkage disequilibrium around low-frequency CNVs, expressed as the homozygosity of SNPs on haplotypes sharing a rare CNV allele (filled circles) or on the other haplotypes (open circles) in the same population. 4

30 Figure 5. Dissection of complex CNVs. Two CNV regions that are shaped by multiple mutational events (a and b). In (a), the appearance of a long, complex deletion results from two nearby, segregating deletion polymorphisms that the individual inherited from different parents (c). In (b), the appearance of an architecturally complex CNV is explained by two simpler deletion polymorphisms that are segregating at the same locus (d). 5

31 Figure 6. Capture of CNPs in whole-genome association studies via direct interrogation and linkage disequilibrium. (a) Locations of sites interrogated on five whole-genome array platforms, in a representative region containing two common CNPs (gray regions). Black tick marks indicate coverage by SNP assays; red tick marks indicate coverage by copy-number probes. The horizontal span of the figure corresponds to the region described as copy-number-variant in the Database of Genomic Variants. (b) Density of interrogated sites (number of sites per kb) in the genome as a whole (red bars), rare CNVs (green), common CNPs (blue), and a set of CNVs identified (without regard to frequency) by an independent, sequence-based method (orange). (c) Capture of common (maf > 5%) CNPs by direct interrogation on genome-wide association platforms. For each number of probe locations, curves indicate the fraction of common CNPs containing at least that number of probes on each array platform (orange, Affymetrix 500K; green, Illumina 317; light blue, Illumina 650Y; blue, Illumina 1M; red Affymetrix SNP 6.0). (d) Capture of common (maf > 5%) CNPs by linkage disequilibrium to SNPs typed on genome-wide association platforms. (black, SNPs in HapMap; other colors, same as in panel c). 6

32 a b c d 1 Percent of Genome Covered SNP 6.0 SNP k Nsp + 250k Sty Distance Between Consecutive Probes For Coverage Figure 1

Integrated detection and population-genetic analysis of SNPs and copy number variation

Integrated detection and population-genetic analysis of SNPs and copy number variation 8 Nature Publishing Group http://www.nature.com/naturegenetics Integrated detection and population-genetic analysis of SNPs and copy number variation Steven A McCarroll 4,, Finny G Kuruvilla 4,, Joshua

More information

Integrated detection and population-genetic analysis of SNPs and copy number variation

Integrated detection and population-genetic analysis of SNPs and copy number variation Integrated detection and population-genetic analysis of SNPs and copy number variation Steven A McCarroll 4,, Finny G Kuruvilla 4,, Joshua M Korn 6, Simon Cawley 7, James Nemesh, Alec Wysoker, Michael

More information

Supplementary note: Comparison of deletion variants identified in this study and four earlier studies

Supplementary note: Comparison of deletion variants identified in this study and four earlier studies Supplementary note: Comparison of deletion variants identified in this study and four earlier studies Here we compare the results of this study to potentially overlapping results from four earlier studies

More information

LTA Analysis of HapMap Genotype Data

LTA Analysis of HapMap Genotype Data LTA Analysis of HapMap Genotype Data Introduction. This supplement to Global variation in copy number in the human genome, by Redon et al., describes the details of the LTA analysis used to screen HapMap

More information

Global variation in copy number in the human genome

Global variation in copy number in the human genome Global variation in copy number in the human genome Redon et. al. Nature 444:444-454 (2006) 12.03.2007 Tarmo Puurand Study 270 individuals (HapMap collection) Affymetrix 500K Whole Genome TilePath (WGTP)

More information

Genomic structural variation

Genomic structural variation Genomic structural variation Mario Cáceres The new genomic variation DNA sequence differs across individuals much more than researchers had suspected through structural changes A huge amount of structural

More information

Identification of regions with common copy-number variations using SNP array

Identification of regions with common copy-number variations using SNP array Identification of regions with common copy-number variations using SNP array Agus Salim Epidemiology and Public Health National University of Singapore Copy Number Variation (CNV) Copy number alteration

More information

Abstract. Optimization strategy of Copy Number Variant calling using Multiplicom solutions APPLICATION NOTE. Introduction

Abstract. Optimization strategy of Copy Number Variant calling using Multiplicom solutions APPLICATION NOTE. Introduction Optimization strategy of Copy Number Variant calling using Multiplicom solutions Michael Vyverman, PhD; Laura Standaert, PhD and Wouter Bossuyt, PhD Abstract Copy number variations (CNVs) represent a significant

More information

Nature Biotechnology: doi: /nbt.1904

Nature Biotechnology: doi: /nbt.1904 Supplementary Information Comparison between assembly-based SV calls and array CGH results Genome-wide array assessment of copy number changes, such as array comparative genomic hybridization (acgh), is

More information

Whole-genome detection of disease-associated deletions or excess homozygosity in a case control study of rheumatoid arthritis

Whole-genome detection of disease-associated deletions or excess homozygosity in a case control study of rheumatoid arthritis HMG Advance Access published December 21, 2012 Human Molecular Genetics, 2012 1 13 doi:10.1093/hmg/dds512 Whole-genome detection of disease-associated deletions or excess homozygosity in a case control

More information

Introduction to the Genetics of Complex Disease

Introduction to the Genetics of Complex Disease Introduction to the Genetics of Complex Disease Jeremiah M. Scharf, MD, PhD Departments of Neurology, Psychiatry and Center for Human Genetic Research Massachusetts General Hospital Breakthroughs in Genome

More information

Nature Genetics: doi: /ng Supplementary Figure 1

Nature Genetics: doi: /ng Supplementary Figure 1 Supplementary Figure 1 Illustrative example of ptdt using height The expected value of a child s polygenic risk score (PRS) for a trait is the average of maternal and paternal PRS values. For example,

More information

Associating Copy Number and SNP Variation with Human Disease. Autism Segmental duplication Neurobehavioral, includes social disability

Associating Copy Number and SNP Variation with Human Disease. Autism Segmental duplication Neurobehavioral, includes social disability Technical Note Associating Copy Number and SNP Variation with Human Disease Abstract The Genome-Wide Human SNP Array 6.0 is an affordable tool to examine the role of copy number variation in disease by

More information

GENOME-WIDE ASSOCIATION STUDIES

GENOME-WIDE ASSOCIATION STUDIES GENOME-WIDE ASSOCIATION STUDIES SUCCESSES AND PITFALLS IBT 2012 Human Genetics & Molecular Medicine Zané Lombard IDENTIFYING DISEASE GENES??? Nature, 15 Feb 2001 Science, 16 Feb 2001 IDENTIFYING DISEASE

More information

CS2220 Introduction to Computational Biology

CS2220 Introduction to Computational Biology CS2220 Introduction to Computational Biology WEEK 8: GENOME-WIDE ASSOCIATION STUDIES (GWAS) 1 Dr. Mengling FENG Institute for Infocomm Research Massachusetts Institute of Technology mfeng@mit.edu PLANS

More information

Nature Methods: doi: /nmeth.3115

Nature Methods: doi: /nmeth.3115 Supplementary Figure 1 Analysis of DNA methylation in a cancer cohort based on Infinium 450K data. RnBeads was used to rediscover a clinically distinct subgroup of glioblastoma patients characterized by

More information

CNV Detection and Interpretation in Genomic Data

CNV Detection and Interpretation in Genomic Data CNV Detection and Interpretation in Genomic Data Benjamin W. Darbro, M.D., Ph.D. Assistant Professor of Pediatrics Director of the Shivanand R. Patil Cytogenetics and Molecular Laboratory Overview What

More information

Agilent s Copy Number Variation (CNV) Portfolio

Agilent s Copy Number Variation (CNV) Portfolio Technical Overview Agilent s Copy Number Variation (CNV) Portfolio Abstract Copy Number Variation (CNV) is now recognized as a prevalent form of structural variation in the genome contributing to human

More information

Supplementary Figures

Supplementary Figures Supplementary Figures Supplementary Fig 1. Comparison of sub-samples on the first two principal components of genetic variation. TheBritishsampleisplottedwithredpoints.The sub-samples of the diverse sample

More information

November 9, Johns Hopkins School of Medicine, Baltimore, MD,

November 9, Johns Hopkins School of Medicine, Baltimore, MD, Fast detection of de-novo copy number variants from case-parent SNP arrays identifies a deletion on chromosome 7p14.1 associated with non-syndromic isolated cleft lip/palate Samuel G. Younkin 1, Robert

More information

The Loss of Heterozygosity (LOH) Algorithm in Genotyping Console 2.0

The Loss of Heterozygosity (LOH) Algorithm in Genotyping Console 2.0 The Loss of Heterozygosity (LOH) Algorithm in Genotyping Console 2.0 Introduction Loss of erozygosity (LOH) represents the loss of allelic differences. The SNP markers on the SNP Array 6.0 can be used

More information

Using GWAS Data to Identify Copy Number Variants Contributing to Common Complex Diseases

Using GWAS Data to Identify Copy Number Variants Contributing to Common Complex Diseases arxiv:1010.5040v1 [stat.me] 25 Oct 2010 Statistical Science 2009, Vol. 24, No. 4, 530 546 DOI: 10.1214/09-STS304 c Institute of Mathematical Statistics, 2009 Using GWAS Data to Identify Copy Number Variants

More information

Statistical Analysis of Single Nucleotide Polymorphism Microarrays in Cancer Studies

Statistical Analysis of Single Nucleotide Polymorphism Microarrays in Cancer Studies Statistical Analysis of Single Nucleotide Polymorphism Microarrays in Cancer Studies Stanford Biostatistics Workshop Pierre Neuvial with Henrik Bengtsson and Terry Speed Department of Statistics, UC Berkeley

More information

White Paper Estimating Complex Phenotype Prevalence Using Predictive Models

White Paper Estimating Complex Phenotype Prevalence Using Predictive Models White Paper 23-12 Estimating Complex Phenotype Prevalence Using Predictive Models Authors: Nicholas A. Furlotte Aaron Kleinman Robin Smith David Hinds Created: September 25 th, 2015 September 25th, 2015

More information

Rare Variant Burden Tests. Biostatistics 666

Rare Variant Burden Tests. Biostatistics 666 Rare Variant Burden Tests Biostatistics 666 Last Lecture Analysis of Short Read Sequence Data Low pass sequencing approaches Modeling haplotype sharing between individuals allows accurate variant calls

More information

Supplementary Information. Supplementary Figures

Supplementary Information. Supplementary Figures Supplementary Information Supplementary Figures.8 57 essential gene density 2 1.5 LTR insert frequency diversity DEL.5 DUP.5 INV.5 TRA 1 2 3 4 5 1 2 3 4 1 2 Supplementary Figure 1. Locations and minor

More information

Introduction to LOH and Allele Specific Copy Number User Forum

Introduction to LOH and Allele Specific Copy Number User Forum Introduction to LOH and Allele Specific Copy Number User Forum Jonathan Gerstenhaber Introduction to LOH and ASCN User Forum Contents 1. Loss of heterozygosity Analysis procedure Types of baselines 2.

More information

Assessing Accuracy of Genotype Imputation in American Indians

Assessing Accuracy of Genotype Imputation in American Indians Assessing Accuracy of Genotype Imputation in American Indians Alka Malhotra*, Sayuko Kobes, Clifton Bogardus, William C. Knowler, Leslie J. Baier, Robert L. Hanson Phoenix Epidemiology and Clinical Research

More information

Introduction to genetic variation. He Zhang Bioinformatics Core Facility 6/22/2016

Introduction to genetic variation. He Zhang Bioinformatics Core Facility 6/22/2016 Introduction to genetic variation He Zhang Bioinformatics Core Facility 6/22/2016 Outline Basic concepts of genetic variation Genetic variation in human populations Variation and genetic disorders Databases

More information

Dan Koller, Ph.D. Medical and Molecular Genetics

Dan Koller, Ph.D. Medical and Molecular Genetics Design of Genetic Studies Dan Koller, Ph.D. Research Assistant Professor Medical and Molecular Genetics Genetics and Medicine Over the past decade, advances from genetics have permeated medicine Identification

More information

Challenges of CGH array testing in children with developmental delay. Dr Sally Davies 17 th September 2014

Challenges of CGH array testing in children with developmental delay. Dr Sally Davies 17 th September 2014 Challenges of CGH array testing in children with developmental delay Dr Sally Davies 17 th September 2014 CGH array What is CGH array? Understanding the test Benefits Results to expect Consent issues Ethical

More information

New Enhancements: GWAS Workflows with SVS

New Enhancements: GWAS Workflows with SVS New Enhancements: GWAS Workflows with SVS August 9 th, 2017 Gabe Rudy VP Product & Engineering 20 most promising Biotech Technology Providers Top 10 Analytics Solution Providers Hype Cycle for Life sciences

More information

Detection of aneuploidy in a single cell using the Ion ReproSeq PGS View Kit

Detection of aneuploidy in a single cell using the Ion ReproSeq PGS View Kit APPLICATION NOTE Ion PGM System Detection of aneuploidy in a single cell using the Ion ReproSeq PGS View Kit Key findings The Ion PGM System, in concert with the Ion ReproSeq PGS View Kit and Ion Reporter

More information

New methods for discovering common and rare genetic variants in human disease

New methods for discovering common and rare genetic variants in human disease Washington University in St. Louis Washington University Open Scholarship All Theses and Dissertations (ETDs) 1-1-2011 New methods for discovering common and rare genetic variants in human disease Peng

More information

BST227 Introduction to Statistical Genetics. Lecture 4: Introduction to linkage and association analysis

BST227 Introduction to Statistical Genetics. Lecture 4: Introduction to linkage and association analysis BST227 Introduction to Statistical Genetics Lecture 4: Introduction to linkage and association analysis 1 Housekeeping Homework #1 due today Homework #2 posted (due Monday) Lab at 5:30PM today (FXB G13)

More information

During the hyperinsulinemic-euglycemic clamp [1], a priming dose of human insulin (Novolin,

During the hyperinsulinemic-euglycemic clamp [1], a priming dose of human insulin (Novolin, ESM Methods Hyperinsulinemic-euglycemic clamp procedure During the hyperinsulinemic-euglycemic clamp [1], a priming dose of human insulin (Novolin, Clayton, NC) was followed by a constant rate (60 mu m

More information

Global variation in copy number in the human genome

Global variation in copy number in the human genome Vol 3 November doi:.38/nature39 Global variation in copy number in the human genome Richard Redon, Shumpei Ishikawa,3, Karen R. Fitch, Lars Feuk,, George H. Perry 7, T. Daniel Andrews, Heike Fiegler, Michael

More information

CHROMOSOMAL MICROARRAY (CGH+SNP)

CHROMOSOMAL MICROARRAY (CGH+SNP) Chromosome imbalances are a significant cause of developmental delay, mental retardation, autism spectrum disorders, dysmorphic features and/or birth defects. The imbalance of genetic material may be due

More information

Mutation Detection and CNV Analysis for Illumina Sequencing data from HaloPlex Target Enrichment Panels using NextGENe Software for Clinical Research

Mutation Detection and CNV Analysis for Illumina Sequencing data from HaloPlex Target Enrichment Panels using NextGENe Software for Clinical Research Mutation Detection and CNV Analysis for Illumina Sequencing data from HaloPlex Target Enrichment Panels using NextGENe Software for Clinical Research Application Note Authors John McGuigan, Megan Manion,

More information

Large-scale identity-by-descent mapping discovers rare haplotypes of large effect. Suyash Shringarpure 23andMe, Inc. ASHG 2017

Large-scale identity-by-descent mapping discovers rare haplotypes of large effect. Suyash Shringarpure 23andMe, Inc. ASHG 2017 Large-scale identity-by-descent mapping discovers rare haplotypes of large effect Suyash Shringarpure 23andMe, Inc. ASHG 2017 1 Why care about rare variants of large effect? Months from randomization 2

More information

Understanding DNA Copy Number Data

Understanding DNA Copy Number Data Understanding DNA Copy Number Data Adam B. Olshen Department of Epidemiology and Biostatistics Helen Diller Family Comprehensive Cancer Center University of California, San Francisco http://cc.ucsf.edu/people/olshena_adam.php

More information

Copy Number Variations and Association Mapping Advanced Topics in Computa8onal Genomics

Copy Number Variations and Association Mapping Advanced Topics in Computa8onal Genomics Copy Number Variations and Association Mapping 02-715 Advanced Topics in Computa8onal Genomics SNP and CNV Genotyping SNP genotyping assumes two copy numbers at each locus (i.e., no CNVs) CNV genotyping

More information

Advance Your Genomic Research Using Targeted Resequencing with SeqCap EZ Library

Advance Your Genomic Research Using Targeted Resequencing with SeqCap EZ Library Advance Your Genomic Research Using Targeted Resequencing with SeqCap EZ Library Marilou Wijdicks International Product Manager Research For Life Science Research Only. Not for Use in Diagnostic Procedures.

More information

Cytogenetics 101: Clinical Research and Molecular Genetic Technologies

Cytogenetics 101: Clinical Research and Molecular Genetic Technologies Cytogenetics 101: Clinical Research and Molecular Genetic Technologies Topics for Today s Presentation 1 Classical vs Molecular Cytogenetics 2 What acgh? 3 What is FISH? 4 What is NGS? 5 How can these

More information

Research Strategy: 1. Background and Significance

Research Strategy: 1. Background and Significance Research Strategy: 1. Background and Significance 1.1. Heterogeneity is a common feature of cancer. A better understanding of this heterogeneity may present therapeutic opportunities: Intratumor heterogeneity

More information

T. R. Golub, D. K. Slonim & Others 1999

T. R. Golub, D. K. Slonim & Others 1999 T. R. Golub, D. K. Slonim & Others 1999 Big Picture in 1999 The Need for Cancer Classification Cancer classification very important for advances in cancer treatment. Cancers of Identical grade can have

More information

Structural Variation and Medical Genomics

Structural Variation and Medical Genomics Structural Variation and Medical Genomics Andrew King Department of Biomedical Informatics July 8, 2014 You already know about small scale genetic mutations Single nucleotide polymorphism (SNPs) Deletions,

More information

and SNPs: Understanding Human Structural Variation in Disease. My

and SNPs: Understanding Human Structural Variation in Disease. My CNVs vs. SNPs: Understanding Human Structural Variation in Disease [0:00:00] Hello and welcome to today s Science/AAAS live webinar entitled, CNVs and SNPs: Understanding Human Structural Variation in

More information

Structural Variants and Susceptibility to Common Human Disorders Dr. Xavier Estivill

Structural Variants and Susceptibility to Common Human Disorders Dr. Xavier Estivill Structural Variants and Susceptibility Genetic Causes of Disease Lab Genes and Disease Program Center for Genomic Regulation (CRG) Barcelona 1 Complex genetic diseases Changes in prevalence (>10 fold)

More information

Nature Genetics: doi: /ng Supplementary Figure 1. Rates of different mutation types in CRC.

Nature Genetics: doi: /ng Supplementary Figure 1. Rates of different mutation types in CRC. Supplementary Figure 1 Rates of different mutation types in CRC. (a) Stratification by mutation type indicates that C>T mutations occur at a significantly greater rate than other types. (b) As for the

More information

Optimizing Copy Number Variation Analysis Using Genome-wide Short Sequence Oligonucleotide Arrays

Optimizing Copy Number Variation Analysis Using Genome-wide Short Sequence Oligonucleotide Arrays Optimizing Copy Number Variation Analysis Using Genome-wide Short Sequence Oligonucleotide Arrays The Harvard community has made this article openly available. Please share how this access benefits you.

More information

Nature Genetics: doi: /ng Supplementary Figure 1. SEER data for male and female cancer incidence from

Nature Genetics: doi: /ng Supplementary Figure 1. SEER data for male and female cancer incidence from Supplementary Figure 1 SEER data for male and female cancer incidence from 1975 2013. (a,b) Incidence rates of oral cavity and pharynx cancer (a) and leukemia (b) are plotted, grouped by males (blue),

More information

Tutorial on Genome-Wide Association Studies

Tutorial on Genome-Wide Association Studies Tutorial on Genome-Wide Association Studies Assistant Professor Institute for Computational Biology Department of Epidemiology and Biostatistics Case Western Reserve University Acknowledgements Dana Crawford

More information

38 Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'16

38 Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'16 38 Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'16 PGAR: ASD Candidate Gene Prioritization System Using Expression Patterns Steven Cogill and Liangjiang Wang Department of Genetics and

More information

AVENIO family of NGS oncology assays ctdna and Tumor Tissue Analysis Kits

AVENIO family of NGS oncology assays ctdna and Tumor Tissue Analysis Kits AVENIO family of NGS oncology assays ctdna and Tumor Tissue Analysis Kits Accelerating clinical research Next-generation sequencing (NGS) has the ability to interrogate many different genes and detect

More information

Supplementary Figure 1. Principal components analysis of European ancestry in the African American, Native Hawaiian and Latino populations.

Supplementary Figure 1. Principal components analysis of European ancestry in the African American, Native Hawaiian and Latino populations. Supplementary Figure. Principal components analysis of European ancestry in the African American, Native Hawaiian and Latino populations. a Eigenvector 2.5..5.5. African Americans European Americans e

More information

Nature Neuroscience: doi: /nn Supplementary Figure 1

Nature Neuroscience: doi: /nn Supplementary Figure 1 Supplementary Figure 1 Illustration of the working of network-based SVM to confidently predict a new (and now confirmed) ASD gene. Gene CTNND2 s brain network neighborhood that enabled its prediction by

More information

DOES THE BRCAX GENE EXIST? FUTURE OUTLOOK

DOES THE BRCAX GENE EXIST? FUTURE OUTLOOK CHAPTER 6 DOES THE BRCAX GENE EXIST? FUTURE OUTLOOK Genetic research aimed at the identification of new breast cancer susceptibility genes is at an interesting crossroad. On the one hand, the existence

More information

On Missing Data and Genotyping Errors in Association Studies

On Missing Data and Genotyping Errors in Association Studies On Missing Data and Genotyping Errors in Association Studies Department of Biostatistics Johns Hopkins Bloomberg School of Public Health May 16, 2008 Specific Aims of our R01 1 Develop and evaluate new

More information

5/2/18. After this class students should be able to: Stephanie Moon, Ph.D. - GWAS. How do we distinguish Mendelian from non-mendelian traits?

5/2/18. After this class students should be able to: Stephanie Moon, Ph.D. - GWAS. How do we distinguish Mendelian from non-mendelian traits? corebio II - genetics: WED 25 April 2018. 2018 Stephanie Moon, Ph.D. - GWAS After this class students should be able to: 1. Compare and contrast methods used to discover the genetic basis of traits or

More information

UNIVERSITY OF CALIFORNIA, LOS ANGELES

UNIVERSITY OF CALIFORNIA, LOS ANGELES UNIVERSITY OF CALIFORNIA, LOS ANGELES BERKELEY DAVIS IRVINE LOS ANGELES MERCED RIVERSIDE SAN DIEGO SAN FRANCISCO UCLA SANTA BARBARA SANTA CRUZ DEPARTMENT OF EPIDEMIOLOGY SCHOOL OF PUBLIC HEALTH CAMPUS

More information

Supplementary Figure 1: Attenuation of association signals after conditioning for the lead SNP. a) attenuation of association signal at the 9p22.

Supplementary Figure 1: Attenuation of association signals after conditioning for the lead SNP. a) attenuation of association signal at the 9p22. Supplementary Figure 1: Attenuation of association signals after conditioning for the lead SNP. a) attenuation of association signal at the 9p22.32 PCOS locus after conditioning for the lead SNP rs10993397;

More information

Pirna Sequence Variants Associated With Prostate Cancer In African Americans And Caucasians

Pirna Sequence Variants Associated With Prostate Cancer In African Americans And Caucasians Yale University EliScholar A Digital Platform for Scholarly Publishing at Yale Public Health Theses School of Public Health January 2015 Pirna Sequence Variants Associated With Prostate Cancer In African

More information

Interactive analysis and quality assessment of single-cell copy-number variations

Interactive analysis and quality assessment of single-cell copy-number variations Interactive analysis and quality assessment of single-cell copy-number variations Tyler Garvin, Robert Aboukhalil, Jude Kendall, Timour Baslan, Gurinder S. Atwal, James Hicks, Michael Wigler, Michael C.

More information

CNV detection. Introduction and detection in NGS data. G. Demidov 1,2. NGSchool2016. Centre for Genomic Regulation. CNV detection. G.

CNV detection. Introduction and detection in NGS data. G. Demidov 1,2. NGSchool2016. Centre for Genomic Regulation. CNV detection. G. Introduction and detection in NGS data 1,2 1 Genomic and Epigenomic Variation in Disease group, Centre for Genomic Regulation 2 Universitat Pompeu Fabra NGSchool2016 methods: methods Outline methods: methods

More information

Nature Genetics: doi: /ng Supplementary Figure 1. Country distribution of GME samples and designation of geographical subregions.

Nature Genetics: doi: /ng Supplementary Figure 1. Country distribution of GME samples and designation of geographical subregions. Supplementary Figure 1 Country distribution of GME samples and designation of geographical subregions. GME samples collected across 20 countries and territories from the GME. Pie size corresponds to the

More information

Nature Genetics: doi: /ng Supplementary Figure 1. Mutational signatures in BCC compared to melanoma.

Nature Genetics: doi: /ng Supplementary Figure 1. Mutational signatures in BCC compared to melanoma. Supplementary Figure 1 Mutational signatures in BCC compared to melanoma. (a) The effect of transcription-coupled repair as a function of gene expression in BCC. Tumor type specific gene expression levels

More information

White Paper Guidelines on Vetting Genetic Associations

White Paper Guidelines on Vetting Genetic Associations White Paper 23-03 Guidelines on Vetting Genetic Associations Authors: Andro Hsu Brian Naughton Shirley Wu Created: November 14, 2007 Revised: February 14, 2008 Revised: June 10, 2010 (see end of document

More information

Genome-Wide Analysis of Copy Number Variations in Normal Population Identified by SNP Arrays

Genome-Wide Analysis of Copy Number Variations in Normal Population Identified by SNP Arrays 54 The Open Biology Journal, 2009, 2, 54-65 Open Access Genome-Wide Analysis of Copy Number Variations in Normal Population Identified by SNP Arrays Jian Wang 1,2, Tsz-Kwong Man 1,3,4, Kwong Kwok Wong

More information

Genome-wide Association Analysis Applied to Asthma-Susceptibility Gene. McCaw, Z., Wu, W., Hsiao, S., McKhann, A., Tracy, S.

Genome-wide Association Analysis Applied to Asthma-Susceptibility Gene. McCaw, Z., Wu, W., Hsiao, S., McKhann, A., Tracy, S. Genome-wide Association Analysis Applied to Asthma-Susceptibility Gene McCaw, Z., Wu, W., Hsiao, S., McKhann, A., Tracy, S. December 17, 2014 1 Introduction Asthma is a chronic respiratory disease affecting

More information

Lecture 20. Disease Genetics

Lecture 20. Disease Genetics Lecture 20. Disease Genetics Michael Schatz April 12 2018 JHU 600.749: Applied Comparative Genomics Part 1: Pre-genome Era Sickle Cell Anaemia Sickle-cell anaemia (SCA) is an abnormality in the oxygen-carrying

More information

DETECTING HIGHLY DIFFERENTIATED COPY-NUMBER VARIANTS FROM POOLED POPULATION SEQUENCING

DETECTING HIGHLY DIFFERENTIATED COPY-NUMBER VARIANTS FROM POOLED POPULATION SEQUENCING DETECTING HIGHLY DIFFERENTIATED COPY-NUMBER VARIANTS FROM POOLED POPULATION SEQUENCING DANIEL R. SCHRIDER * Department of Biology and School of Informatics and Computing, Indiana University, 1001 E Third

More information

Chromatin marks identify critical cell-types for fine-mapping complex trait variants

Chromatin marks identify critical cell-types for fine-mapping complex trait variants Chromatin marks identify critical cell-types for fine-mapping complex trait variants Gosia Trynka 1-4 *, Cynthia Sandor 1-4 *, Buhm Han 1-4, Han Xu 5, Barbara E Stranger 1,4#, X Shirley Liu 5, and Soumya

More information

DNA-seq Bioinformatics Analysis: Copy Number Variation

DNA-seq Bioinformatics Analysis: Copy Number Variation DNA-seq Bioinformatics Analysis: Copy Number Variation Elodie Girard elodie.girard@curie.fr U900 institut Curie, INSERM, Mines ParisTech, PSL Research University Paris, France NGS Applications 5C HiC DNA-seq

More information

Genetics and Genomics in Medicine Chapter 8 Questions

Genetics and Genomics in Medicine Chapter 8 Questions Genetics and Genomics in Medicine Chapter 8 Questions Linkage Analysis Question Question 8.1 Affected members of the pedigree above have an autosomal dominant disorder, and cytogenetic analyses using conventional

More information

Multimarker Genetic Analysis Methods for High Throughput Array Data

Multimarker Genetic Analysis Methods for High Throughput Array Data Multimarker Genetic Analysis Methods for High Throughput Array Data by Iuliana Ionita A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy Department

More information

Supplementary Material to. Genome-wide association study identifies new HLA Class II haplotypes strongly protective against narcolepsy

Supplementary Material to. Genome-wide association study identifies new HLA Class II haplotypes strongly protective against narcolepsy Supplementary Material to Genome-wide association study identifies new HLA Class II haplotypes strongly protective against narcolepsy Hyun Hor, 1,2, Zoltán Kutalik, 3,4, Yves Dauvilliers, 2,5 Armand Valsesia,

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION CONTENTS A. AUTISM SPECTRUM DISORDER (ASD) SAMPLE AND CONTROL COLLECTIONS 4 ASD samples 4 Control cohorts 4 B. GENOTYPING AND DATA CLEANING 6 SNP quality control 6 Intensity quality control for CNV detection

More information

Introduction to linkage and family based designs to study the genetic epidemiology of complex traits. Harold Snieder

Introduction to linkage and family based designs to study the genetic epidemiology of complex traits. Harold Snieder Introduction to linkage and family based designs to study the genetic epidemiology of complex traits Harold Snieder Overview of presentation Designs: population vs. family based Mendelian vs. complex diseases/traits

More information

Introduction to Genetics and Genomics

Introduction to Genetics and Genomics 2016 Introduction to enetics and enomics 3. ssociation Studies ggibson.gt@gmail.com http://www.cig.gatech.edu Outline eneral overview of association studies Sample results hree steps to WS: primary scan,

More information

Association for Molecular Pathology Promoting Clinical Practice, Basic Research, and Education in Molecular Pathology

Association for Molecular Pathology Promoting Clinical Practice, Basic Research, and Education in Molecular Pathology Association for Molecular Pathology Promoting Clinical Practice, Basic Research, and Education in Molecular Pathology 9650 Rockville Pike, Bethesda, Maryland 20814 Tel: 301-634-7939 Fax: 301-634-7990 Email:

More information

CURRENT GENETIC TESTING TOOLS IN NEONATAL MEDICINE. Dr. Bahar Naghavi

CURRENT GENETIC TESTING TOOLS IN NEONATAL MEDICINE. Dr. Bahar Naghavi 2 CURRENT GENETIC TESTING TOOLS IN NEONATAL MEDICINE Dr. Bahar Naghavi Assistant professor of Basic Science Department, Shahid Beheshti University of Medical Sciences, Tehran,Iran 3 Introduction Over 4000

More information

Cost effective, computer-aided analytical performance evaluation of chromosomal microarrays for clinical laboratories

Cost effective, computer-aided analytical performance evaluation of chromosomal microarrays for clinical laboratories University of Iowa Iowa Research Online Theses and Dissertations Summer 2012 Cost effective, computer-aided analytical performance evaluation of chromosomal microarrays for clinical laboratories Corey

More information

Genomics 101 (2013) Contents lists available at SciVerse ScienceDirect. Genomics. journal homepage:

Genomics 101 (2013) Contents lists available at SciVerse ScienceDirect. Genomics. journal homepage: Genomics 101 (2013) 134 138 Contents lists available at SciVerse ScienceDirect Genomics journal homepage: www.elsevier.com/locate/ygeno Gene-based copy number variation study reveals a microdeletion at

More information

Relationship between genomic features and distributions of RS1 and RS3 rearrangements in breast cancer genomes.

Relationship between genomic features and distributions of RS1 and RS3 rearrangements in breast cancer genomes. Supplementary Figure 1 Relationship between genomic features and distributions of RS1 and RS3 rearrangements in breast cancer genomes. (a,b) Values of coefficients associated with genomic features, separately

More information

Mosaic loss of chromosome Y in peripheral blood is associated with shorter survival and higher risk of cancer

Mosaic loss of chromosome Y in peripheral blood is associated with shorter survival and higher risk of cancer Supplementary Information Mosaic loss of chromosome Y in peripheral blood is associated with shorter survival and higher risk of cancer Lars A. Forsberg, Chiara Rasi, Niklas Malmqvist, Hanna Davies, Saichand

More information

Multiplex target enrichment using DNA indexing for ultra-high throughput variant detection

Multiplex target enrichment using DNA indexing for ultra-high throughput variant detection Multiplex target enrichment using DNA indexing for ultra-high throughput variant detection Dr Elaine Kenny Neuropsychiatric Genetics Research Group Institute of Molecular Medicine Trinity College Dublin

More information

Statistical power and significance testing in large-scale genetic studies

Statistical power and significance testing in large-scale genetic studies STUDY DESIGNS Statistical power and significance testing in large-scale genetic studies Pak C. Sham 1 and Shaun M. Purcell 2,3 Abstract Significance testing was developed as an objective method for summarizing

More information

Large multi-allelic copy number variations in humans

Large multi-allelic copy number variations in humans Large multi-allelic copy number variations in humans The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters Citation Handsaker, Robert

More information

Role of Genomics in Selection of Beef Cattle for Healthfulness Characteristics

Role of Genomics in Selection of Beef Cattle for Healthfulness Characteristics Role of Genomics in Selection of Beef Cattle for Healthfulness Characteristics Dorian Garrick dorian@iastate.edu Iowa State University & National Beef Cattle Evaluation Consortium Selection and Prediction

More information

ChIP-seq data analysis

ChIP-seq data analysis ChIP-seq data analysis Harri Lähdesmäki Department of Computer Science Aalto University November 24, 2017 Contents Background ChIP-seq protocol ChIP-seq data analysis Transcriptional regulation Transcriptional

More information

Unit 1 Exploring and Understanding Data

Unit 1 Exploring and Understanding Data Unit 1 Exploring and Understanding Data Area Principle Bar Chart Boxplot Conditional Distribution Dotplot Empirical Rule Five Number Summary Frequency Distribution Frequency Polygon Histogram Interquartile

More information

Nature Genetics: doi: /ng Supplementary Figure 1. PCA for ancestry in SNV data.

Nature Genetics: doi: /ng Supplementary Figure 1. PCA for ancestry in SNV data. Supplementary Figure 1 PCA for ancestry in SNV data. (a) EIGENSTRAT principal-component analysis (PCA) of SNV genotype data on all samples. (b) PCA of only proband SNV genotype data. (c) PCA of SNV genotype

More information

Case Studies on High Throughput Gene Expression Data Kun Huang, PhD Raghu Machiraju, PhD

Case Studies on High Throughput Gene Expression Data Kun Huang, PhD Raghu Machiraju, PhD Case Studies on High Throughput Gene Expression Data Kun Huang, PhD Raghu Machiraju, PhD Department of Biomedical Informatics Department of Computer Science and Engineering The Ohio State University Review

More information

SNPrints: Defining SNP signatures for prediction of onset in complex diseases

SNPrints: Defining SNP signatures for prediction of onset in complex diseases SNPrints: Defining SNP signatures for prediction of onset in complex diseases Linda Liu, Biomedical Informatics, Stanford University Daniel Newburger, Biomedical Informatics, Stanford University Grace

More information

Supplementary Methods

Supplementary Methods Supplementary Methods Populations ascertainment and characterization Our genotyping strategy included 3 stages of SNP selection, with individuals from 3 populations (Europeans, Indian Asians and Mexicans).

More information

MULTIFACTORIAL DISEASES. MG L-10 July 7 th 2014

MULTIFACTORIAL DISEASES. MG L-10 July 7 th 2014 MULTIFACTORIAL DISEASES MG L-10 July 7 th 2014 Genetic Diseases Unifactorial Chromosomal Multifactorial AD Numerical AR Structural X-linked Microdeletions Mitochondrial Spectrum of Alterations in DNA Sequence

More information

Unit 5 Review Name: Period:

Unit 5 Review Name: Period: Unit 5 Review Name: Period: 1 4 5 6 7 & give an example of the following. Be able to apply their meanings: Homozygous Heterozygous Dominant Recessive Genotype Phenotype Haploid Diploid Sex chromosomes

More information

AVENIO ctdna Analysis Kits The complete NGS liquid biopsy solution EMPOWER YOUR LAB

AVENIO ctdna Analysis Kits The complete NGS liquid biopsy solution EMPOWER YOUR LAB Analysis Kits The complete NGS liquid biopsy solution EMPOWER YOUR LAB Analysis Kits Next-generation performance in liquid biopsies 2 Accelerating clinical research From liquid biopsy to next-generation

More information

0.1% variance attributed to scattered single base-pair changes SNPs

0.1% variance attributed to scattered single base-pair changes SNPs April 2003, human genome project completed: 99.9% of genome identical in all humans 0.1% variance attributed to scattered single base-pair changes SNPs It has been long recognized that variation in the

More information