cnvhap: an integrative population and haplotype based multiplatform model of SNPs and CNVs

1 cnvhap: an integrative population and haplotype based multiplatform model of SNPs and CNVs Lachlan J M Coin 1, Julian E Asher, Robin G Walters, Julia S El-Sayed Moustafa, Adam J de Smith, Rob Sladek 3, David J Balding 4, Philippe Froguel,5 & Alexandra I F Blakemore 1 Nature America, Inc. All rights reserved. Although genome-wide association studies have uncovered single-nucleotide polymorphisms (SNPs) associated with complex disease, these variants account for a small portion of heritability. Some contribution to this missing heritability may come from copy-number variants (CNVs), in particular rare CNVs; but assessment of this contribution remains challenging because of the difficulty in accurately genotyping CNVs, particularly small variants. We report a population-based approach for the identification of CNVs that integrates data from multiple samples and platforms. Our algorithm, cnvhap, jointly learns a chromosome-wide haplotype model of CNVs and cluster-based models of allele intensity at each probe. Using data for 5 French individuals assayed on four separate platforms, we found that cnvhap correctly detected at least 14% more deleted and 5% more amplified genotypes than PennCNV or QuantiSNP, with an 8% and 1% improvement for aberrations containing <1 probes. Combining data from multiple platforms additionally improved sensitivity. Copy-number variants (CNVs) have been proposed as a substantial source of phenotypic variation in human populations, particularly as single-nucleotide polymorphism (SNP)-based genome-wide association studies have not identified variants with sufficient genetic effect to account for the observed heritability of complex diseases 1 3. Rare CNVs have been associated with neuropsychiatric conditions 4 and obesity 5, whereas common CNVs have been associated with several complex disorders 6 9 and have been shown to affect long-range gene regulation 1 and gene expression 11. 
Personalized whole-genome sequencing has underscored the importance of CNVs as a source of individual genetic variation and suggests that a substantial number of small (<1 kb) structural variants remain to be identified 1. Whereas two recently published studies 13,14 have claimed that common CNVs are unlikely to account for the missing heritability in complex disease, it was recognized that current approaches do not reliably identify smaller or multiallelic CNVs, which are difficult to assay and may be poorly tagged by SNPs. This emphasizes the importance of directly interrogating copynumber information in association studies. Although early CNV studies were largely conducted using microarray-based comparative genome hybridization (acgh), there is now a shift to simultaneous SNP and CNV profiling using high-density SNP arrays, whose dense coverage and high throughput is ideal for exploring the role of CNVs in complex disease, for which large samples are needed to detect genetic effects. Moreover, as many genome-wide SNP association studies have already been completed for complex diseases such as obesity 1 and type- diabetes, there is an opportunity to reanalyze these datasets for CNV associations. Algorithms for inferring integer CNV genotypes from SNP and acgh arrays can be divided into two general classes. Wide methods, such as ADM, PennCNV 16 and QuantiSNP 17, build a one-dimensional spatial model of copy-number variation (often using a hidden Markov model (HMM)) but process each sample independently with fixed allele signal intensity clusters. Deep methods, such as TriTyper 18 or SNP-conditional outlier detection (SCOUT) 19, use the distribution of allele intensities across all samples but focus on a few probes at a time. Some packages, such as SNP-conditional mixture modeling (SCIMM) and BirdSuite 1, have separate wide and deep components. 
Here we use the following nomenclature: CNV genotype for numerical copy number in a sample at a given probe position; SNP genotype for allelic composition within a fixed CNV genotype at a biallelic probe (for example, AAB); and CNV-SNP genotype for combined CNV and SNP genotype (for example, 3, AAB). Here we describe cnvhap, an integrated approach to CNV detection and genotyping that is both wide and deep as it builds a spatial model of copy-number variation across the genome and updates its model of intensity clusters at each probe position using the population distribution of intensities, respectively. To achieve this, cnvhap builds a unified sample population and haplotypic model of copynumber variation, which can exploit large sample sizes and multiple assays to provide a step forward in CNV genotyping accuracy. cnvhap is available as Supplementary Software. 1 Department of Epidemiology and Biostatistics, School of Public Health, Imperial College London, St. Mary s Hospital, London, UK. Department of Genomics of Common Disease, School of Public Health, Imperial College London, Hammersmith Hospital, London, UK. 3 Departments of Medicine and Human Genetics, McGill University and Genome Quebec Innovation Centre, Montreal, Quebec, Canada. 4 Institute of Genetics, University College London, London, UK. 5 Centre National de la Recherche Scientifique 89, Institute of Biology, Pasteur Institute, Lille, France. Correspondence should be addressed to L.J.M.C. (l.coin@imperial.ac.uk). Received 1 March; accepted 5 May; published online 3 May 1; doi:1.138/nmeth.1466 nature methods VOL.7 NO.7 JULY 1 541

Figure 1 Schematic flow chart showing cnvhap operation. (a) Three types of input files are needed for the analysis: data files for each dataset, an HMM parameter file and a data-specific parameter file. (b) cnvhap constructs a haplotype HMM, with each column in the model representing a probe (shown as a colored circle) in one of the datasets, which are interwoven according to genomic location such that linear sequence corresponds to genomic sequence. Each row in the model corresponds to an unobserved copy number state (0, red; 1, gray; and 2, blue), each with its own probability distribution over alleles with the corresponding copy number. Thus, for a given sample, each probe (column) can have copy number of 0 (deleted, which represents only the single deletion allele), 1 (haploid normal, which represents A and B alleles) or 2 (duplicated, which represents the AA, AB and BB alleles). In this example, arrow widths represent the probability of transitioning out of a copy number = 1 state at position zero (left), and a copy number = state at position one (middle; representing the next probe along the genome). (c) cnvhap constructs a separate set of cluster positions for each probe (column). Crosses indicate trained cluster means, with colors corresponding to the most likely assigned copy-number state. The first and third probe illustrate Illumina SNP probes, for which both LRR and BAF are defined; the second probe illustrates an Agilent aCGH probe for which only the LRR is defined. Parameters of the model are trained using the expectation-maximization algorithm, and then the trained model is used to obtain the final copy-number annotation of the population. (d) In this copy-number annotation schematic, the annotated segment of a chromosome is shown for multiple individuals, in which each line represents a segment from one individual. Red, deletion; gray, haploid normal; and blue, duplication.
[Figure 1a panel labels: Data files (build file with probe positions and wave correction; plate file; intensity files with sample intensity by probe). HMM parameters (transition model; maximum copy number). Data-specific parameters (data type; ploidy; starting cluster positions).]

RESULTS
Genotyping CNVs using a copy-number haplotype model
cnvhap integrates intensity information from multiple platforms and cohorts into a single probabilistic model of copy-number variation. Three types of input files are provided: input data for each dataset, HMM parameters and data-specific parameters (Fig. 1a). Copy-number variation is modeled using an HMM at the haplotypic (single chromosome) level (Fig. 1b). An HMM for modeling N-ploid (for example, diploid) genomes is obtained by pairing N copies of the haploid model. The observed intensity data are modeled separately at the population level for each probe; cnvhap constructs a separate set of cluster positions for each probe modeled as linear combinations of fixed nonlinear functions of the underlying CNV-SNP genotype (Fig. 1c). The full model, incorporating the HMM and the cluster position coefficients, is trained using the expectation-maximization algorithm, a two-step procedure in which the expected CNV-SNP annotation is calculated in each iteration (default setting is at iterations) given the current model parameters, followed by parameter optimization given this annotation. The final CNV-SNP annotation is then calculated from the trained model (Fig. 1d).

cnvhap incorporates user-specified fixed ploidy and maximum copy number per haplotype; however, the model's complexity of (number of states)^ploidy makes it infeasible to run genome-wide with both high ploidy and high copy number. We routinely ran cnvhap on cohorts of up to 5, diploid samples with three copy-number states as well as with up to six copy-number states on 5 diploid samples or with three copy-number states on up to 5 tetraploid samples.
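As a rough illustration of how the state space grows with ploidy and maximum copy number, the sketch below enumerates haploid CNV-SNP alleles for a biallelic probe and then builds genotypes as unordered lists of those alleles, as the Online Methods describe. The function names are hypothetical and this is not cnvhap's own code; the empty string stands in for the deletion allele.

```python
from itertools import combinations_with_replacement


def cnv_snp_alleles(max_copy_number, snp_alleles=("A", "B")):
    """Enumerate haploid CNV-SNP alleles up to max_copy_number.

    The empty string represents the deletion allele (copy number 0).
    """
    alleles = []
    for copy_number in range(max_copy_number + 1):
        for combo in combinations_with_replacement(snp_alleles, copy_number):
            alleles.append("".join(combo))
    return alleles


def cnv_snp_genotypes(alleles, ploidy):
    """All unordered lists of length `ploidy` drawn from the haploid alleles."""
    return list(combinations_with_replacement(alleles, ploidy))


def paired_state_count(n_haploid_states, ploidy):
    """Size of the paired HMM state space: (number of states) ** ploidy."""
    return n_haploid_states ** ploidy


alleles = cnv_snp_alleles(2)               # biallelic probe, up to 2 copies per haplotype
genotypes = cnv_snp_genotypes(alleles, 2)  # diploid genotypes
print(alleles)
print(len(genotypes), paired_state_count(3, 2), paired_state_count(3, 4))
```

With three haploid copy-number states the paired state space has 9 states per probe for a diploid sample and 81 for a tetraploid sample, which is why high ploidy and high copy number together become expensive.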
Here we describe the results generated by cnvhap for a region on chromosome 1, integrating data from an Illumina Human 1M chip and a custom Agilent 44k aCGH array (Fig. 2). The trained HMM consists of one state per haploid copy number (0, 1 or 2) (Fig. 2a). The corresponding diploid HMM used to model this autosomal region accommodates up to four copies (two per haplotype). At a given position, the trained A and B allele probabilities for the copy number = 1 state correspond to the allele proportions in the population, and the trained AA, AB and BB probabilities for the copy number = 2 state are the corresponding Hardy-Weinberg proportions. Transition probabilities are controlled by a single global transition rate matrix, which expresses the average amount of transition between copy-number states per base pair across the entire region, and by a position-dependent scalar transition rate, which captures transition-rate hot spots (Fig. 2b). This approach is analogous to the use of a global evolutionary rate matrix describing mutational events modulated by site-specific rates in phylogenetics.
[Figure 1b,c panel labels: Haplotype HMM; Cluster positions. Axes: LRR, BAF.]

cnvhap also reclusters the transformed allele intensity measurements (for example, on Illumina chips, the log ratio of observed to expected fluorescence signal intensity (log R ratio; LRR) and the proportion of observed signal intensity owing to the B-allele (B-allele frequency; BAF)) at each position using a regularized regression framework as part of the model training. cnvhap fits a linear regression model in which cluster positions in the two-dimensional LRR-BAF space are expressed as linear combinations of fixed nonlinear functions of CNV-SNP genotype. The coefficients of these functions are updated in the maximization step using ridge regression 3. We found this to be particularly useful for reducing the impact of normalization-induced artifacts.
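The reclustering step can be pictured with a minimal sketch of a weighted ridge-regression update. Everything specific here is an assumption made for illustration: the three basis functions in `basis`, the choice to penalize the intercept, and the use of one row per genotype with unit weights (in an EM setting the weights would plausibly be posterior genotype probabilities). It is not cnvhap's actual parameterization.

```python
import numpy as np


def basis(copy_number, n_b):
    """Hypothetical fixed nonlinear features of a CNV-SNP genotype."""
    # Log2 copy-number ratio relative to diploid; 0.5 stands in for zero
    # copies so that the logarithm stays finite.
    log_ratio = np.log2(max(copy_number, 0.5) / 2.0)
    # Fraction of B alleles; set to 0.5 when there are no copies at all.
    b_fraction = n_b / copy_number if copy_number > 0 else 0.5
    return np.array([1.0, log_ratio, b_fraction])


def ridge_fit(X, Y, weights, lam):
    """Weighted ridge regression.

    Minimizes sum_i w_i * ||y_i - x_i B||^2 + lam * ||B||^2 and returns B.
    """
    W = weights[:, None]
    A = X.T @ (W * X) + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ (W * Y))


# (copy number, number of B alleles): AA, AB, BB, A, AAA
genotypes = [(2, 0), (2, 1), (2, 2), (1, 0), (3, 0)]
X = np.array([basis(cn, nb) for cn, nb in genotypes])

# Synthetic cluster means generated from known coefficients (columns: LRR, BAF).
true_coef = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
Y = X @ true_coef
weights = np.ones(len(genotypes))

coef_weak = ridge_fit(X, Y, weights, lam=1e-9)    # almost unregularized
coef_strong = ridge_fit(X, Y, weights, lam=10.0)  # heavily shrunk
cluster_means = X @ coef_weak                     # one (LRR, BAF) mean per genotype
```

The point of the penalty is that genotypes with few or no supporting samples at a probe still receive a sensible cluster position, because their coefficients are shrunk rather than fitted freely.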
For example, at position 5195 on chromosome 1 (dbSNP accession number rs39748) data normalization would ideally cause diploid samples to have a mean LRR of 0 (Fig. 2c), but cnvhap has correctly identified that for this probe, homozygote AA individuals have a mean LRR of. and heterozygote individuals have a mean LRR of.3, whereas homozygote AAA amplifications are indicated by a mean LRR of.1. cnvhap also uses the trained cluster positions to produce a probability of each CNV-SNP genotype at each position.

cnvhap's framework for integrating multiple datasets in a single HMM enables it to project copy-number state from a set of measured probes onto a set of unmeasured loci. At the unmeasured loci, the cluster-based emission probabilities are replaced with a uniform distribution over all CNV-SNP genotypes. A probability distribution over copy number is reported at these positions, which takes into account both copy-number state at flanking loci and the estimated local copy-number transition rate. We artificially masked 3% and 7% of Illumina 1M probes to demonstrate the feasibility of this procedure and the accuracy of the uncertainty estimates (Supplementary Figs. 1 and 2).

Figure 2 Visualization of cnvhap population model on chromosome 1 for integrated Illumina 1M and Agilent 44k datasets. (a) Haploid HMM, with rows corresponding to copy-number state (0, red; 1, gray; and 2, blue) and columns corresponding to probe locations. Positions are given in megabases (Mb). Bubble size corresponds to the expected number of samples assigned to each state at each position. Text in each bubble is the trained emission probability for SNP haplotypes in each copy-number state. The width of lines between bubbles indicates transition probability. (b) Pointwise transition rate integrated across all copy-number transitions revealed transition hotspots at CNV breakpoints. (c) Cluster plots from an Agilent 44k and an Illumina 1M probe in the CNV arranged by genomic location. Each data point corresponds to a single sample; color denotes most likely assigned copy-number state and symbol denotes the number of B alleles in the most likely SNP genotype. Crosses indicate trained cluster means (cross position) and variance (line width) for each CNV-SNP genotype. BAF for the Agilent probes is randomly assigned between 0 and 1.
[Figure 2 panel labels: Position (Mb); copy-number (CN) state rows; BAF.]

Given the strong linkage disequilibrium reported between certain classes of CNVs and SNPs 14,4, we expect CNV genotyping accuracy to be improved by explicitly modeling CNV-SNP haplotypes. This might be achieved by using data from individuals with a strong intensity signal to identify CNV-SNP haplotypes and then using shared haplotype structure to identify CNVs with weak intensity signal.
Hence, to model CNV-SNP haplotypes, we coupled the cnvhap copy-number transition model with the haplotype models used by fastphase and polyhap 6 to obtain the extended cnvhap+snph model (Supplementary Fig. 3). This was feasible because both algorithms model haplotypic changes. A fixed number (default = 4) of SNP haplotype states with copy number = 1 are specified, which can be thought of as ancestral haplotype states. After model training, the emission probabilities of these states reflect haplotype-specific SNP-allele frequencies. We then constructed the copy number = 2 states as unordered pairs of copy number = 1 states. We also included a fixed number (default = 1) of copy number = 0 states in the model (we used multiple copy number = 0 states to model overlapping deletions on different haplotypes).

Genome-wide benchmarking of CNV calls
To benchmark CNV genotyping accuracy on Illumina platforms, we used data for a cohort of 5 healthy individuals from northern France previously characterized for copy-number variation using a prototype 185k genome-wide aCGH array followed by a focused 44k array custom-designed for CNV validation and mapping 7. The accuracy of the copy-number mapping of this dataset (as characterized by Agilent's ADM algorithm) has been experimentally verified 4,7. To assess CNV genotyping on Illumina arrays, we assayed data for all 5 individuals using the Illumina 1M SNP array and ran cnvhap as well as three widely used CNV prediction algorithms: PennCNV 16, QuantiSNP 17 and cnvpartition 8 (Supplementary Fig. 4). We also assayed data for a subset of 36 individuals on a prototype Illumina 317k array, on which we ran cnvhap only. To evaluate accuracy, we projected Illumina-based copy-number predictions onto Agilent aCGH 44K probes and compared predictions to the copy-number genotypes called by ADM (Supplementary Table 1).
[Figure 2b,c panel labels: Log rate; Relative position; probe A_16_P399 (44k). Axes: LRR, BAF.]
One caveat is that regions detectable on the 44k but not the 1M chip (for example, owing to absence of 1M probes or CNVs in the 44k reference sample) will be reported as false negative predictions for all three algorithms; similarly, genuine CNVs identified by the test algorithm but not identified by ADM will be erroneously scored as false positives, which may have the effect of penalizing highly sensitive algorithms. Copy-number annotation on the previously illustrated region on chromosome 1 showed that all three algorithms (cnvhap, PennCNV and QuantiSNP) run on the 1M chip detected a 65-kb deletion (Fig. 3a-c), although PennCNV missed one instance of this deletion compared to the ADM reference. cnvhap+snph identified (different) shared flanking SNP haplotypes for amplifications and deletions (Fig. 3d). cnvhap also identified four instances of an amplification missed by the other two algorithms, which we confirmed by analysis of the 44k or 185k Agilent arrays with either cnvhap or ADM (Fig. 3e-h).

Figure 3 CNV predictions for chromosome 1.45 Mb.536 Mb. (a-h) Copy-number annotation on each chromosome segment in the population (data for individuals are separated by thin white lines) obtained with the indicated algorithm and datasets (1M is an Illumina dataset, 44k and 185k are Agilent). Yellow, purple, aqua and dark red in d correspond to four different ancestral SNP haplotypes inferred by cnvhap+snph. White rows in g and h indicate samples that were not measured on Agilent 185k array.
[Figure 2c panel label: rs39748 (1M). Axis: LRR.]

We also compared the algorithms' accuracy in correctly calling copy number for each sample at sites of known copy-number variation genome-wide. The cumulative distribution of the squared Pearson's correlation coefficient between predicted copy number (using Illumina 1M data) and benchmark copy number (ADM using 44k aCGH data) at all copy-number aberrant sites in the benchmark, revealed that cnvhap is markedly more accurate (Fig. 4a). At an r² threshold of.5, cnvhap correctly genotyped 44.8% of all copy-number aberrant sites (3.7%, 7.1% and 1.8% more than QuantiSNP, PennCNV and cnvpartition, respectively). Extending cnvhap to include SNP haplotypes added an improvement of.7% of all copy-number aberrant sites; however, for the 317k chip, cnvhap+snph had reduced accuracy, indicating that lower probe density can lead to overfitting of SNP haplotypes. The converse comparison, using PennCNV calls from the 1M chip as a benchmark for assessing CNV calls using aCGH data, indicated that cnvhap outperformed ADM on both the 185k and 44k aCGH chips (although the PennCNV benchmark may be slightly biased in favor of HMM methods such as cnvhap) (Fig. 4b and Supplementary Table 2).

To investigate cnvhap's genotyping accuracy, we calculated receiver operating characteristic (ROC) curves for detection of copy-number variation in each individual for all probes genome-wide, and of the presence or absence of copy-number variation in the population as a whole (Fig. 5). For each algorithm, as the probability threshold for assignment of copy-number aberrant probes decreased, the number of true and false copy-number assignments (as determined by ADM 44k benchmark) increased, thus tracing out the ROC curve.
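The ROC construction described above can be sketched in a few lines: probes are ranked by the probability that they are copy-number aberrant, and lowering the threshold one probe at a time accumulates true and false assignments against the benchmark. The function name and the toy inputs are hypothetical, and ties between probabilities are not treated specially in this simplified version.

```python
def roc_points(probabilities, truth):
    """Trace (false_count, true_count) pairs as the threshold is lowered.

    probabilities: per-probe probability of being copy-number aberrant.
    truth: per-probe benchmark label (True if aberrant in the benchmark).
    """
    # Visit probes from the most to the least confident call.
    order = sorted(range(len(probabilities)), key=lambda i: -probabilities[i])
    true_count = 0
    false_count = 0
    points = [(0, 0)]
    for i in order:
        if truth[i]:
            true_count += 1
        else:
            false_count += 1
        points.append((false_count, true_count))
    return points


# Toy example with four probes.
example = roc_points([0.9, 0.8, 0.4, 0.2], [True, False, True, False])
print(example)
```

Plotting the second coordinate against the first gives the curve; dividing by the total number of benchmark positives and negatives would convert the counts to rates.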
For per-sample calls, cnvhap detected substantially more deleted and amplified probes than other algorithms did (Fig. 5a). cnvhap also detected more copy-number aberrant probes overall in the population, although with less marked improvement compared to per-sample calls (Fig. 5c), indicating that improvement in individual sample detection was the compound effect of gains in overall detection and gains in individual-sample genotyping. cnvhap was also superior in identifying correct copy-number breakpoints (Fig. 5b). For shorter CNVs, notably those with <1 Illumina probes, cnvhap's advantage was even more striking (Supplementary Fig. 5). We observed similar results for the PennCNV benchmark (Supplementary Figs. 6 and 7).

The cnvhap+snph model was less sensitive in detecting amplifications than cnvhap alone. This may be due to the lack of linkage disequilibrium between SNPs flanking the original location and the amplified state when the amplified copy is not in tandem with the original 14. cnvhap+snph also was less specific in its calls for both deletions and amplifications. However, in view of cnvhap+snph's increased genotyping accuracy (Fig. 4a and Supplementary Table 3), we suggest that cnvhap+snph was calling additional genuine, small CNVs on the Illumina 1M array that were not detected on the ADM (44k) reference and which were, therefore, erroneously scored as false positives. SNP genotyping discordance rate (between Illumina 1M and 317k chips) for cnvhap and cnvhap+snph was similar for different copy-number states (Supplementary Fig. 8).
[Figure 3 and Figure 4 panel labels: cnvhap (44k); ADM (44k); cnvhap (185k); ADM (185k); cnvhap+snph (1M+44k), inferred haplotypes; not measured. Axes: Samples; Cumulative frequency (%); r.]
Multiplatform integration
We also evaluated the advantages of cnvhap's ability to integrate multiple data sources in a single probabilistic model by assessing the sensitivity of cnvhap calls on different combinations of the four datasets (Illumina 1M, Illumina 317k, Agilent 44k and Agilent 185k). We anticipated that this would improve detection via greater probe coverage and partial replicate measurements. We used copy-number calls produced by cnvhap for the maximally dense probe set (integrating all four chips) as the benchmark to compare the performance of different subsets. We found that combining information always increased genotyping accuracy relative to individual component datasets, even when one probe set was a subset of the other (for example, 317k+1M outperformed 1M) or when the second-stage probe set was designed to refine predicted CNV boundaries (for example, 185k+44k outperformed 44k) (Fig. 6a). We conclude that if cohorts have been regenotyped on a denser chip, intensity information from the older chip should still be included in the analysis. This improvement is most striking if we consider the accuracy of breakpoint identification (Supplementary Fig. 9). The most accurate breakpoint identification is provided by the combination of Illumina genotyping chips (317k+1M) for deletions, and by the combination of Agilent chips for amplifications. As another test of the increase in genotyping accuracy provided by combining datasets, we examined small (<3 kb) deletions initially detected in at least two samples by ADM in the 44k aCGH data, with genotypes and breakpoints experimentally determined using PCR and sequencing 4. Despite ascertainment bias in favor of the 44k chip, integrating 185k aCGH data or 1M Illumina data boosted sensitivity (Fig. 6b).

Figure 4 Cumulative frequency of squared Pearson's correlation coefficient between predicted and benchmark copy number calls. (a) Comparison of algorithms applied to Illumina 1M and 317K data with a 44k aCGH (ADM) benchmark. (b) Comparison of algorithms applied to 44K and 185K aCGH data with an Illumina 1M (PennCNV) benchmark.

Figure 5 ROC curves for detecting CNVs using Illumina 1M data. (a-c) Each algorithm was run genome-wide on Illumina data and projected to 44k probes, and plotted data for deletions (left) and amplifications (right). We detected copy-number genotypes for each individual (a), copy-number breakpoints for each individual (b) and the presence or absence of copy-number variation in a population (c).

Validation using an independent HapMap dataset
We validated cnvhap's predictions on 118 HapMap samples using CNVs identified from fosmid end sequence pair (fosmid ESP) maps on eight of those samples 9. Of all such fosmid ESP defined CNVs validated by targeted aCGH, cnvhap with default parameters identified 68% of >1 kb CNVs, 6% of 5 1 kb CNVs and 31% of <5 kb CNVs. Of 894 deletions >1 kb identified by cnvhap, 44% overlapped a deletion identified by fosmid ESP mapping (compared with 45% of 8 deletions identified by SCIMM ). Similarly, sequence data validated 18% of 369 cnvhap-identified amplifications (14% of 11 amplifications for SCIMM). This is consistent with our benchmarking results and demonstrated that cnvhap has increased sensitivity while maintaining high specificity.

To assess cnvhap's genotyping accuracy on this dataset, we examined cnvhap predictions at the sites of 18 common deletions previously independently genotyped by PCR and Illumina GoldenGate assays using the correlation coefficient between predicted and benchmark copy-number calls (Supplementary Table 3). cnvhap's genotyping accuracy was comparable to that of SCIMM at most sites. Although cnvhap missed one CNV on chromosome 7, it correctly identified and accurately genotyped a CNV on chromosome not identified by SCIMM. Including haplotype information (cnvhap+snph) further improved genotyping accuracy and enabled identification of a locus on chromosome missed by SCIMM. A simple adjustment to the HMM parameters to relax the initial probability of transitioning into a CNV also improved genotyping accuracy.

Figure 6 Combining datasets improved sensitivity. (a) Cumulative frequency of squared Pearson's correlation coefficient for combinations of datasets analyzed with cnvhap using the maximal probe set (185k+44k+317k+1M) as benchmark. (b) ROC curves for per-sample detection of sequenced deletions for different dataset combinations.
[Figure 6 panel labels: combinations of the 44k, 185k, 317k and 1M datasets. Axes: Cumulative frequency (%); r.]

DISCUSSION
cnvhap is a cross-platform CNV genotyping tool that bridges the gap between CNV discovery (typically carried out sample by sample) and CNV genotyping (typically performed at a few probes at validated CNV locations). cnvhap's increased accuracy results from improved modeling of the underlying structure of aberrations via a haplotype model of copy-number variation combined with more effective synthesis of information across samples to recluster intensity at each probe.
Although here we focused on Illumina and Agilent data, cnvhap has modules for sequence and Affymetrix data and can be readily extended to incorporate modules for other platforms. Our multiplatform benchmarking dataset will be a useful resource for other algorithm comparisons.

The extended model (cnvhap+snph) incorporated SNP haplotypes into the underlying model, improving genotyping of deletions. We anticipate even greater improvement (particularly when using sparser arrays) if dense intensity data from a reference panel, providing a robust source of CNV-SNP haplotypes, are included in the analysis. Additionally, the SNP haplotype model incorporated in cnvhap+snph accurately phases polyploid genomic regions 3, enabling cnvhap+snph to jointly discover and phase CNV regions.

We used cnvhap to detect rare recurrent deletions and amplifications of a region on chromosome 16p11.2, which is strongly associated with obesity in multiple genome-wide association datasets 5. Accurate estimates of the posterior probability distribution over copy-number genotypes obtained by projecting copy-number annotation from one platform to another will be useful for combining effect size estimates in meta-analyses incorporating multiple platforms with partially correlated probe density. Filtering out low-information-content probes and then accounting for copy-number uncertainty in the association analyses should reduce spurious associations.

Our work is particularly topical as researchers seek improved CNV genotyping to facilitate detection of CNV-phenotype associations. The high error rate in detecting certain classes of CNVs, including short deletions (also the most prevalent 1 ) and amplifications, has contributed to the relative paucity of reported CNV-disease associations and highlights the need for more sensitive detection methods 13,14. Many attempts to identify CNV-phenotype associations have been carried out using single studies (usually with modest sample size), leading to low power to detect modest effects or associations with rare CNVs. The difficulty in accurately genotyping CNVs leads to additional loss of power. Thus, there is a real need for a framework for improving CNV genotyping accuracy, integrating data across multiple platforms and pooling results across studies. cnvhap fills this gap in the genome-wide CNV analysis toolkit, and we anticipate that it will lead to improvements in detecting CNV-disease associations.

Methods
Methods and any associated references are available in the online version of the paper.
Note: Supplementary information is available on the Nature Methods website.

Acknowledgments
We thank D. Serre, A. Montpetit and D. Vincent for advice concerning Illumina arrays and D. Peiffer (Illumina) for providing genotype data on HapMap samples. Genome Canada and Genome Quebec funded genotyping on the Illumina Human1M platform. L.J.M.C. is funded by a Research Council UK fellowship. J.E.A. is supported by the Medical Research Council. R.G.W.
is supported by Johnson & Johnson and the South East England Development Agency. J.S.E.-S.M. is supported by an Imperial College Division of Medicine PhD studentship.

AUTHOR CONTRIBUTIONS
L.J.M.C. designed the project with A.I.F.B., developed the cnvhap algorithm and software, analyzed data and wrote the paper. J.E.A. ran cnvpartition, PennCNV and QuantiSNP on the data and helped write the paper. R.G.W. and J.S.E.-S.M. provided critical comments and helped to write the paper. D.J.B. provided statistical advice. R.S. provided SNP genotype data, advised on its interpretation and edited the paper. A.J.d.S. provided aCGH data and advised on its interpretation. P.F. provided the DNA samples and coordinated the SNP genotyping. A.I.F.B. designed the project with L.J.M.C., coordinated the aCGH analysis, contributed to writing the paper and oversaw the project.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Published online at Reprints and permissions information is available online at com/reprintsandpermissions/.

1. Meyre, D. et al. Genome-wide association study for early-onset and morbid adult obesity identifies three new risk loci in European populations. Nat. Genet. 41, 7 9 (9).
2. Sladek, R. et al. A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature 445, (7).
3. Zeggini, E. et al. Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nat. Genet. 4, (8).
4. Cook, E.H. & Scherer, S.W. Copy-number variations associated with neuropsychiatric conditions. Nature 455, (8).
5. Walters, R.G. et al. A new highly penetrant form of obesity due to deletions on chromosome 16p11.2. Nature 463, (1).
6. Aitman, T.J. et al. Copy number polymorphism in Fcgr3 predisposes to glomerulonephritis in rats and humans. Nature 439, (6).
7. Diskin, S.J. et al. Copy number variation at 1q21.1 associated with neuroblastoma. Nature 459, (9).
8. McCarroll, S.A. et al. Deletion polymorphism upstream of IRGM associated with altered IRGM expression and Crohn's disease. Nat. Genet. 4, (8).
9. Willer, C.J. et al. Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nat. Genet. 41, 34 (9).
10. Kleinjan, D.A. & van Heyningen, V. Long-range control of gene expression: emerging mechanisms and disruption in disease. Am. J. Hum. Genet. 76, 8 3 ().
11. Stranger, B.E. et al. Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 3, (7).
12. Bentley, D.R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, (8).
13. Wellcome Trust Case Control Consortium. Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature 464, (1).
14. Conrad, D.F. et al. Origins and functional impact of copy number variation in the human genome. Nature 464, (1).
15. Lipson, D., Aumann, Y., Ben-Dor, A., Linial, N. & Yakhini, Z. Efficient calculation of interval scores for DNA copy number data analysis. J. Comput. Biol. 13, 8 (6).
16. Wang, K. et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 17, (7).
17. Colella, S. et al. QuantiSNP: an objective Bayes hidden-Markov model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res., 13 (7).
18. Franke, L. et al. Detection, imputation, and association analysis of small deletions and null alleles on oligonucleotide arrays. Am. J. Hum. Genet. 8, (8).
19. Mefford, H.C. et al. A method for rapid, targeted CNV genotyping identifies rare variants associated with neurocognitive disease. Genome Res. 19, (9).
20. Cooper, G.M., Zerr, T., Kidd, J.M., Eichler, E.E. & Nickerson, D.A. Systematic assessment of copy-number-variant detection via genome-wide SNP genotyping. Nat. Genet. 4, (8).
21. Korn, J.M. et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat. Genet. 4, (8).
22. Coin, L. & Durbin, R. Improved techniques for the identification of pseudogenes. Bioinformatics (Suppl. 1), i94 i1 (4).
23. Hoerl, A.E. Application of ridge analysis to regression problems. Chem. Eng. Prog. 58, (196).
24. de Smith, A.J. et al. Small deletion variants have stable breakpoints commonly associated with Alu elements. PLoS One 3, e314 (8).
25. Scheet, P. & Stephens, M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. 78, (6).
26. Su, S.-Y., Balding, D.J. & Coin, L.J.M. Inference of haplotypic phase and missing genotypes in polyploid organisms and variable copy number genomic regions. BMC Bioinformatics 9, 513 (8).
27. de Smith, A.J. et al. Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases. Hum. Mol. Genet. 16, (7).
28. Peiffer, D.A. et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res. 16, (6).
29. Kidd, J.M. et al. Mapping and sequencing of structural variation from eight human genomes. Nature 453, (8).
30. Su, S.-Y., Balding, D.J. & Coin, L.J.M. Disease association tests by inferring ancestral haplotypes using a hidden Markov model. Bioinformatics 4, (8).

546 VOL.7 NO.7 JULY 1 nature methods

ONLINE METHODS

Definition of CNV-SNP alleles and genotypes. The CNV-SNP alleles were defined up to a user-specified maximum haploid copy number and are denoted by $h \in \{0, \ldots, H\}$. For example, given a maximum copy number of 2, the space of CNV-SNP alleles is {0, A, B, AA, AB, BB} at a biallelic probe and {0, A, AA} at a monoallelic probe, where 0 denotes a deletion. For a user-specified ploidy $N$, the space of CNV-SNP genotypes was then constructed as all unordered lists of length $N$ of CNV-SNP alleles.

cnvhap haplotype hidden Markov model (HMM). cnvhap builds an HMM at the haplotype level (Fig. 1d). At each probe $m \in \{1, \ldots, M\}$, the HMM comprises one hidden state $s_{m,l}$ per haploid copy number $l$, up to a user-defined maximum, $l \in \{0, \ldots, z - 1\}$. At each probe, each state carries an emission probability distribution over all CNV-SNP alleles $h$ with the corresponding copy number, which we denote by $\theta_{m,l}(h)$. Because each state corresponds to a particular copy number, $\theta_{m,l}(h) = 0$ if $CN(h) \neq CN(s_{m,l})$, in which CN denotes the copy number of a state, CNV-SNP allele or genotype. This means that for the deletion state $\theta_{m,l}(0) = 1$. The emission probability distributions for states with $CN(s_{m,l}) > 1$ are expressed such that they reflect the Hardy-Weinberg equilibrium distributions of the single-copy state (Supplementary Note 1). This encourages the model to find duplication alleles that are consistent with the underlying SNP allele distribution.

Haploid transition probabilities between copy number states. We used continuous-time Markov chain theory to define transition probabilities between CN states. Let $Q$ define a global transition rate matrix between different CN states, so that $Q_{k,l}$ is the instantaneous rate of transition from state $k$ to state $l$. We require that the rows of this matrix sum to zero, that is, $Q_{k,k} = -\sum_{l \neq k} Q_{k,l}$.
We denote by $P_k(d_m)$ the probability that the sample is in state $k$ at genomic distance $d_m$ and by $P(d_m)$ the vector of these probabilities. We also define as $\pi$ the equilibrium probability distribution over copy number states such that $Q\pi = 0$, which can be found using the singular value decomposition of $Q$. The average transition rate for $Q$ is defined as $r_Q = -\sum_k \pi_k Q_{kk}$. The evolution of $P$ over genomic distance $d$ is defined by the differential equation $P'(d) = QPr$, in which $r$ is an arbitrary distance scaling parameter. The solution of this equation is given by the matrix exponential of $Q$ multiplied by the initial probability distribution: $P(d) = P(0)e^{Qrd}$. Thus, we can calculate the transition probability between copy number states at probes $m - 1$ and $m$ as

$$p(s_m = l \mid s_{m-1} = k) = \left\{ e^{Q r_m (d_m - d_{m-1})} \right\}_{k,l} \quad (1)$$

in which $r_m$ is the site-specific transition rate. Details of how the matrix $Q$ and the local transition rate $r_m$ are initialized and updated during training are given in Supplementary Note 1.

cnvhap+snph haplotype HMM. The cnvhap+snph HMM fuses the cnvhap model described above with the haplotype HMM used in polyhap and fastphase. To achieve this, we introduced multiple states per CN (Supplementary Fig. 3). A user-defined number of states with CN = 1 is included in the model. The CN = 2 states are defined as all unordered pairs of CN = 1 states. A user-defined number of CN = 0 states is also added to the model. As above, the emission probability distributions $\theta_{m,l}(h)$ are only defined for CN = 1 states, with higher copy-number states defined in terms of the CN = 1 states they comprise (Supplementary Note 1). To fuse the polyhap transition model with the cnvhap model, we separately define three kinds of transition probability: between CN, within the same CN and within different CN. The within-CN transition model is the same as that used in polyhap.
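The between-copy-number transition of equation 1 can be illustrated with a small numerical sketch. This is not the cnvhap implementation: the rate matrix `Q`, the rate and the probe spacing below are made-up values, the helper names are hypothetical, and the matrix exponential is computed with a simple scaling-and-squaring Taylor series that is only intended for the tiny matrices used here. Rows of `Q` index the state at probe m − 1 and columns the state at probe m.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(a, squarings=10, terms=12):
    """exp(a) via scaling and squaring with a truncated Taylor series."""
    n = len(a)
    scale = 2.0 ** squarings
    x = [[v / scale for v in row] for row in a]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term holds x^k / k!
        term = [[v / k for v in row] for row in mat_mul(term, x)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def transition_matrix(q, rate, distance):
    """Entry [k][l] is p(s_m = l | s_{m-1} = k) for probes `distance` apart."""
    return mat_exp([[v * rate * distance for v in row] for row in q])


# Hypothetical rate matrix over copy numbers 0, 1, 2; each row sums to zero.
Q = [[-1.0, 1.0, 0.0],
     [0.05, -0.1, 0.05],
     [0.0, 1.0, -1.0]]

# Hypothetical site-specific rate and a probe spacing of 2,000 bases.
P = transition_matrix(Q, rate=1e-5, distance=2000.0)
```

Because the rows of `Q` sum to zero, each row of `P` sums to one, and a spacing of zero gives the identity matrix, so adjacent probes at the same position cannot change state.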
The between-CN model is likewise the same as the cnvhap transition model described above, with the exception that we model haplotype-specific transition hotspots by allowing the site-specific transition rate $r_m$ to vary between different copy-number states. The within different CN transition probabilities capture joint haplotypes between states with different CN. Denote by $ci(k)$ the index of state $k$ among all states with copy number $CN(k)$. The fused transition probability is the product of the within and between copy-number transitions:

$$p(s_m = l \mid s_{m-1} = k) = p\big(ci(s_m) = ci(l) \mid ci(s_{m-1}) = ci(k),\ CN(s_m) = CN(l),\ CN(s_{m-1}) = CN(k)\big) \times p\big(CN(s_m) = CN(l) \mid CN(s_{m-1}) = CN(k)\big) \quad (2)$$

We set the within same CN transition to zero if $CN(l) = CN(k) = 0$ (to disallow haplotype state switches in a deletion); to the polyhap probability if $CN(l) = CN(k) = 1$; and to a transition probability defined in terms of CN = 1 polyhap probabilities if $CN(l) = CN(k) > 1$ (Supplementary Note 1). The within different CN transition rate is a parameter $\tau_{CN(k),CN(l),ci(k),ci(l)}$, which was updated during training.

Modeling intensity data. For Illumina genotyping arrays, the observed data we model are the log R intensity ratio and B-allele frequency at each marker: $(LRR_m, BAF_m)$. For Agilent arrays the observed data consist of LRR only. We followed the work of others in assuming that $LRR_m$ and $BAF_m$ are uncorrelated. We assumed that, given an (unobserved) genotype $g$, $LRR_m$ is distributed normally with an unobserved mean and variance, that is, $LRR_m \sim N(\mu_{rm}(g), \sigma_{rm}(g))$; and that $BAF_m$ is distributed according to a normal distribution truncated to lie in $0 \le BAF_m \le 1$, that is, $BAF_m \sim TN(\mu_{bm}(g), \sigma_{bm}(g))$, unless $CN(g) = 0$, in which case we assume that $BAF_m$ is uniformly distributed between 0 and 1.
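The emission model just described can be written down directly. The following is a minimal sketch rather than the cnvhap code: the function names are hypothetical, the spread parameters are treated as standard deviations (an assumption made here for convenience), and a missing BAF value (`None`) stands for an LRR-only platform such as the Agilent arrays.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mean, sd):
    """Cumulative distribution of N(mean, sd^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def truncated_normal_pdf(x, mean, sd, lo=0.0, hi=1.0):
    """Normal density renormalised to the interval [lo, hi]."""
    if x < lo or x > hi:
        return 0.0
    mass = normal_cdf(hi, mean, sd) - normal_cdf(lo, mean, sd)
    return normal_pdf(x, mean, sd) / mass


def emission_density(lrr, baf, copy_number, lrr_mean, lrr_sd,
                     baf_mean=None, baf_sd=None):
    """Joint density of (LRR, BAF) given a genotype, assuming independence."""
    lrr_term = normal_pdf(lrr, lrr_mean, lrr_sd)
    if baf is None:
        # LRR-only platform: no BAF factor.
        return lrr_term
    if copy_number == 0:
        # Homozygous deletion: BAF is uniform on [0, 1], so its density is 1.
        return lrr_term
    return lrr_term * truncated_normal_pdf(baf, baf_mean, baf_sd)
```

For a heterozygous two-copy genotype with a BAF cluster centred on 0.5, an observation at BAF = 0.5 scores higher than one at BAF = 0.9, whereas for a homozygous deletion the BAF value does not change the density at all.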
The mean and variance for BAF and LRR for all $g$ with $CN(g) > 0$ were parameterised via a linear model:

$$\begin{pmatrix} \mu_{rm}(g) \\ \sigma_{rm}(g) \\ \mu_{bm}(g) \\ \sigma_{bm}(g) \end{pmatrix} = \beta \begin{pmatrix} f_0(g) \\ f_1(g) \\ f_2(g) \\ f_3(g) \\ f_4(g) \\ f_5(g) \end{pmatrix}, \qquad \begin{aligned} f_0(g) &= 1 \\ f_1(g) &= \log(CN(g)/2) \\ f_2(g) &= (f_1(g))^2 \\ f_3(g) &= \mathrm{bfrac}(g) \\ f_4(g) &= f_3(g)\,(1 - f_3(g)) \\ f_5(g) &= f_3(g)\,(f_3(g) - 0.5)\,(f_3(g) - 1) \end{aligned} \quad (3)$$

in which $\beta$ is a 4 by 6 matrix of parameters, the functions $f$ are basis functions, and bfrac($g$) is defined as the proportion of B alleles in the genotype $g$. We also restricted $\sigma_{rm}(g) > 0$ and $\sigma_{bm}(g) > 0$. For $CN(g) = 0$, we simply allowed $\mu_{rm}(g)$ and $\sigma_{rm}(g)$ to be free parameters of our model. The first row of equation 3 expresses the LRR mean as a function of the theoretical log R ratio (given by $f_1(g)$) and of its square. It also allows a dependence on terms involving the theoretical B-allele fraction of genotype $g$ (given by $f_3(g)$); in terms of the cluster plots, this means that the expected LRR value can increase (or decrease) as we go from homozygote AA to heterozygote to homozygote BB.

doi:10.1038/nmeth.1466 nature methods
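As an illustration of equation 3, the basis functions and the linear map can be evaluated for a few genotypes. This is a sketch under stated assumptions, not the cnvhap code: genotypes are represented as strings of A and B alleles pooled over all chromosomes (so the copy number is the string length), the logarithm is taken to base 2, and the parameter matrix `beta` below contains made-up values chosen only to make the output easy to read.

```python
import math


def bfrac(genotype):
    """Proportion of B alleles in a genotype string such as 'AAB'."""
    return genotype.count("B") / len(genotype)


def basis(genotype):
    """Basis functions f_0..f_5 of equation 3 for a genotype with CN > 0."""
    cn = len(genotype)
    f1 = math.log2(cn / 2.0)          # theoretical log R ratio (base 2 assumed)
    f3 = bfrac(genotype)              # theoretical B-allele fraction
    return [1.0, f1, f1 * f1, f3, f3 * (1 - f3), f3 * (f3 - 0.5) * (f3 - 1)]


def intensity_parameters(beta, genotype):
    """Return [mu_r, sigma_r, mu_b, sigma_b] as beta times the basis vector."""
    f = basis(genotype)
    return [sum(b * x for b, x in zip(row, f)) for row in beta]


# Hypothetical 4 x 6 parameter matrix: the LRR mean follows f_1, the BAF mean
# follows f_3, and both spreads are constant.
beta = [
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],   # mu_r
    [0.04, 0.0, 0.0, 0.0, 0.0, 0.0],  # sigma_r
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],   # mu_b
    [0.01, 0.0, 0.0, 0.0, 0.0, 0.0],  # sigma_b
]

params_aab = intensity_parameters(beta, "AAB")
```

With this choice of `beta`, the three-copy genotype AAB has an expected LRR of log2(3/2) and an expected BAF of 1/3. Non-zero entries in the $f_4$ and $f_5$ columns would let the cluster centres bend away from these idealised positions.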

Note that although these basis functions seem to work well in practice, any function of the genotype $g$ could be used.

Creating a multiplatform HMM and mapping between different chips. We describe here how to create a multiplatform HMM that can integrate information from multiple sources. The positions of all of the probes on each of the platforms that have been used (on any individual) are interleaved according to their genomic location. The total number of positions across all of the chips forms the total number of probes $M$. Then, for each individual, the emission state probability at each position is given by one of the likelihoods described above and in Supplementary Note 1, depending on which platform was used at that position. If data for an individual were not collected on a particular platform, then a uniform genotype distribution was used at that position to allow the possibility of any CNV genotype. If two platforms assayed the same genomic location, then the likelihood is the product of the two platform-specific likelihoods. This also provides a way to project CNV predictions from one array to another: we simply included all of the probe positions on the second array but used a uniform CNV-SNP genotype distribution at these positions rather than the actual data.

Estimating genotypes from the model together with uncertainties. Details of constructing an N-ploid HMM from the haploid HMM are given in Supplementary Note 1. We ran the Baum-Welch training algorithm to estimate the parameters of the haploid HMM, which approximate a local mode of the posterior probability distribution.
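The multiplatform construction described earlier on this page (interleaved probe positions, a uniform distribution where a sample has no data, and a product of likelihoods where platforms overlap) can be sketched as follows. The function names, platform labels, genotype labels and likelihood values are all hypothetical, and the uniform distribution is left unnormalised as a constant 1.0, which does not affect posterior probabilities.

```python
def merge_probe_positions(platform_positions):
    """Interleave probe positions from all platforms by genomic coordinate."""
    return sorted({pos
                   for positions in platform_positions.values()
                   for pos in positions})


def combined_emission(position, genotypes, sample_likelihoods):
    """Per-genotype likelihood for one sample at one merged position.

    sample_likelihoods maps platform -> {position -> {genotype -> likelihood}}
    and contains only the platforms on which this sample was assayed.
    """
    # Start from a flat (uninformative) distribution over genotypes.
    combined = {g: 1.0 for g in genotypes}
    for per_position in sample_likelihoods.values():
        if position in per_position:
            # Multiply in each platform that has a probe and data here.
            for g in genotypes:
                combined[g] *= per_position[position][g]
    return combined


# Hypothetical probe maps for two platforms; position 250 is shared.
platforms = {"snp_array": [100, 250, 400], "cgh_array": [250, 600]}
positions = merge_probe_positions(platforms)

genotypes = ["CN1", "CN2", "CN3"]
sample = {
    "snp_array": {250: {"CN1": 0.1, "CN2": 0.8, "CN3": 0.1}},
    "cgh_array": {250: {"CN1": 0.2, "CN2": 0.5, "CN3": 0.3}},
}
at_shared = combined_emission(250, genotypes, sample)
at_missing = combined_emission(600, genotypes, sample)
```

At the shared position the two platform likelihoods multiply, which sharpens the call toward `CN2`, while at position 600 the sample has no data and every genotype remains equally likely, so the call there is driven entirely by the neighbouring probes through the HMM.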
For data from each individual, we ran the forward-backward algorithm and, using standard dynamic programming techniques, calculated the probability distribution (conditional on the model parameters) over unordered lists of hidden states; using this together with Supplementary Note 1 equation 3, we calculated the probability distribution over CNV-SNP genotypes. This procedure can be repeated for a user-specified number of training repetitions, and the CNV-SNP probability distribution can be averaged over these repetitions. We have previously reported that the gain in phasing accuracy from using ten repetitions instead of one with polyhap is modest, and even smaller for inferring missing genotype data (ref. 26). To minimize the computational burden, we used only a single repetition in this work.

Samples and genotyping. DNA was isolated from peripheral blood samples from 5 unrelated, apparently healthy Caucasian males of northern French origin. The study protocol followed the standards laid out in the Declaration of Helsinki with full Ethics Committee approval as detailed in reference 6. The 185k CGH array consisted of 185,000 probes (17,1 autosomal) with an average spacing of 16 kb and a bias toward genes. The 44k CGH array consisted of 44,000 probes (3,81 autosomal) designed to provide high-resolution coverage of CNV regions discovered on the 185k CGH array and other previously identified CNVs. A reference consisting of pooled DNA from all 5 subjects was used on the 185k CGH array, whereas a reference sample from a single Caucasian individual from the Coriell Cell Repository (NA51) was used for the 44k array. Data were acquired on 5 samples on the 44k CGH array and samples on the 185k CGH arrays (ref. 27). Illumina genotyping was performed using the Infinium Human 1M chip (1,7,8 SNPs, of which 1,9,591 were autosomal) and a prototype 317k chip (based on the Hap3k BeadArray; 317,53 SNPs, of which 38,33 were autosomal).
All 5 samples were measured on the Human 1M chip, and a subset of 36 samples were measured on the 317k BeadArray. Intensity data for HapMap samples using the Illumina 1M BeadArray were obtained from Illumina. Details of fosmid-ESP defined CNVs and aCGH validation status were obtained from reference 8. SCIMM genotyping results for 18 experimentally defined CNV loci were obtained from reference 19.

Sample quality control and normalization. Samples with an LRR variance greater than .3 on the Illumina Human 1M or 317k BeadArrays were excluded from the analysis. As previously observed, array-based technologies in general exhibit localized wave-like changes in LRR that correlate with G+C content (ref. 31). For each chromosome separately, we first calculated the median LRR for each sample and subtracted this value from each point-wise LRR value so that each sample had a median LRR of zero. Next, for each sample, we calculated a Loess curve using only LRR values within the normal range of variation (.3 to .3) and a window of 5 kb. We then averaged the Loess curves across all samples. This averaged Loess value was then subtracted from the point-wise LRR values in each sample.

Parameters for cnvpartition, PennCNV and QuantiSNP. CNV Partition 1.. (ref. 8) was run using the default settings (confidence threshold, probe gap size threshold 1 Mb). The latter parameter prevents the calling of CNVs over probe gaps greater than this size (for example, over centromeres). PennCNV (ref. 16; March 8 version) was run using the HMM, population frequency for B allele (PFB) and wavemodel correction files for the Illumina 1M array supplied as part of the software package. The data for these default files were derived from a set of 1 HapMap samples. The PFB file contains the positions and the population frequency of the B allele for the SNP markers (PennCNV uses only LRR information when analyzing monomorphic markers).
Although the creation of a custom PFB file can increase CNV calling accuracy in large, ethnically homogeneous samples, the sample used for our analysis was considered too small to enable the construction of a representative PFB file. The only change from the default settings was the inclusion of the wavemodel adjustment procedure. This adjustment, based on a large set of training data with varying degrees of waviness, was designed to compensate for the known fluctuations ('GC waves') in signal intensity caused by uneven G+C content in different regions of the genome. QuantiSNP version 1.1 (ref. 17) was run using default settings, a log Bayes factor threshold of 5, and the build 36 wavemodel (GC) correction files supplied as part of the software package. The wavemodel correction in QuantiSNP is derived using a simple linear model incorporating local genomic G+C content. The inclusion of this correction was the only deviation from the default settings.

Software. cnvhap software, documentation and example data files are available as Supplementary Software (also at imperial.ac.uk/medicine/people/l.coin/).

31. Marioni, J.C. et al. Breaking the waves: improved detection of copy number variation from microarray-based comparative genomic hybridization. Genome Biol. 8, R8 (7).


More information

Supplementary Figure 1. Principal components analysis of European ancestry in the African American, Native Hawaiian and Latino populations.

Supplementary Figure 1. Principal components analysis of European ancestry in the African American, Native Hawaiian and Latino populations. Supplementary Figure. Principal components analysis of European ancestry in the African American, Native Hawaiian and Latino populations. a Eigenvector 2.5..5.5. African Americans European Americans e

More information

Theta sequences are essential for internally generated hippocampal firing fields.

Theta sequences are essential for internally generated hippocampal firing fields. Theta sequences are essential for internally generated hippocampal firing fields. Yingxue Wang, Sandro Romani, Brian Lustig, Anthony Leonardo, Eva Pastalkova Supplementary Materials Supplementary Modeling

More information

Nature Genetics: doi: /ng Supplementary Figure 1. Mutational signatures in BCC compared to melanoma.

Nature Genetics: doi: /ng Supplementary Figure 1. Mutational signatures in BCC compared to melanoma. Supplementary Figure 1 Mutational signatures in BCC compared to melanoma. (a) The effect of transcription-coupled repair as a function of gene expression in BCC. Tumor type specific gene expression levels

More information

White Paper. Copy number variant detection. Sample to Insight. August 19, 2015

White Paper. Copy number variant detection. Sample to Insight. August 19, 2015 White Paper Copy number variant detection August 19, 2015 Sample to Insight CLC bio, a QIAGEN Company Silkeborgvej 2 Prismet 8000 Aarhus C Denmark Telephone: +45 70 22 32 44 Fax: +45 86 20 12 22 www.clcbio.com

More information

Sum of Neurally Distinct Stimulus- and Task-Related Components.

Sum of Neurally Distinct Stimulus- and Task-Related Components. SUPPLEMENTARY MATERIAL for Cardoso et al. 22 The Neuroimaging Signal is a Linear Sum of Neurally Distinct Stimulus- and Task-Related Components. : Appendix: Homogeneous Linear ( Null ) and Modified Linear

More information

Generating Spontaneous Copy Number Variants (CNVs) Jennifer Freeman Assistant Professor of Toxicology School of Health Sciences Purdue University

Generating Spontaneous Copy Number Variants (CNVs) Jennifer Freeman Assistant Professor of Toxicology School of Health Sciences Purdue University Role of Chemical lexposure in Generating Spontaneous Copy Number Variants (CNVs) Jennifer Freeman Assistant Professor of Toxicology School of Health Sciences Purdue University CNV Discovery Reference Genetic

More information

Rare Variant Burden Tests. Biostatistics 666

Rare Variant Burden Tests. Biostatistics 666 Rare Variant Burden Tests Biostatistics 666 Last Lecture Analysis of Short Read Sequence Data Low pass sequencing approaches Modeling haplotype sharing between individuals allows accurate variant calls

More information

White Paper Guidelines on Vetting Genetic Associations

White Paper Guidelines on Vetting Genetic Associations White Paper 23-03 Guidelines on Vetting Genetic Associations Authors: Andro Hsu Brian Naughton Shirley Wu Created: November 14, 2007 Revised: February 14, 2008 Revised: June 10, 2010 (see end of document

More information

STATISTICAL METHODS FOR THE DETECTION AND ANALYSES OF STRUCTURAL VARIANTS IN THE HUMAN GENOME. Shu Mei, Teo

STATISTICAL METHODS FOR THE DETECTION AND ANALYSES OF STRUCTURAL VARIANTS IN THE HUMAN GENOME. Shu Mei, Teo Department of Medical Epidemiology and Biostatistics Karolinska Institutet, Stockholm, Sweden & Saw Swee Hock School of Public Health National University of Singapore, Singapore STATISTICAL METHODS FOR

More information

Structural Variation and Medical Genomics

Structural Variation and Medical Genomics Structural Variation and Medical Genomics Andrew King Department of Biomedical Informatics July 8, 2014 You already know about small scale genetic mutations Single nucleotide polymorphism (SNPs) Deletions,

More information

Optimizing Copy Number Variation Analysis Using Genome-wide Short Sequence Oligonucleotide Arrays

Optimizing Copy Number Variation Analysis Using Genome-wide Short Sequence Oligonucleotide Arrays Optimizing Copy Number Variation Analysis Using Genome-wide Short Sequence Oligonucleotide Arrays The Harvard community has made this article openly available. Please share how this access benefits you.

More information

False Discovery Rates and Copy Number Variation. Bradley Efron and Nancy Zhang Stanford University

False Discovery Rates and Copy Number Variation. Bradley Efron and Nancy Zhang Stanford University False Discovery Rates and Copy Number Variation Bradley Efron and Nancy Zhang Stanford University Three Statistical Centuries 19th (Quetelet) Huge data sets, simple questions 20th (Fisher, Neyman, Hotelling,...

More information

Cytogenetics 101: Clinical Research and Molecular Genetic Technologies

Cytogenetics 101: Clinical Research and Molecular Genetic Technologies Cytogenetics 101: Clinical Research and Molecular Genetic Technologies Topics for Today s Presentation 1 Classical vs Molecular Cytogenetics 2 What acgh? 3 What is FISH? 4 What is NGS? 5 How can these

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION CONTENTS A. AUTISM SPECTRUM DISORDER (ASD) SAMPLE AND CONTROL COLLECTIONS 4 ASD samples 4 Control cohorts 4 B. GENOTYPING AND DATA CLEANING 6 SNP quality control 6 Intensity quality control for CNV detection

More information

Imputation of Missing Genotypes from Sparse to High Density using Long-Range Phasing

Imputation of Missing Genotypes from Sparse to High Density using Long-Range Phasing Genetics: Published Articles Ahead of Print, published on June July 29, 24, 2011 as 10.1534/genetics.111.128082 1 2 Imputation of Missing Genotypes from Sparse to High Density using Long-Range Phasing

More information

Tutorial on Genome-Wide Association Studies

Tutorial on Genome-Wide Association Studies Tutorial on Genome-Wide Association Studies Assistant Professor Institute for Computational Biology Department of Epidemiology and Biostatistics Case Western Reserve University Acknowledgements Dana Crawford

More information

cn.mops - Mixture of Poissons for CNV detection in NGS data Günter Klambauer Institute of Bioinformatics, Johannes Kepler University Linz

cn.mops - Mixture of Poissons for CNV detection in NGS data Günter Klambauer Institute of Bioinformatics, Johannes Kepler University Linz Software Manual Institute of Bioinformatics, Johannes Kepler University Linz cn.mops - Mixture of Poissons for CNV detection in NGS data Günter Klambauer Institute of Bioinformatics, Johannes Kepler University

More information

Supplementary Information. Supplementary Figures

Supplementary Information. Supplementary Figures Supplementary Information Supplementary Figures.8 57 essential gene density 2 1.5 LTR insert frequency diversity DEL.5 DUP.5 INV.5 TRA 1 2 3 4 5 1 2 3 4 1 2 Supplementary Figure 1. Locations and minor

More information

Nature Genetics: doi: /ng Supplementary Figure 1. PCA for ancestry in SNV data.

Nature Genetics: doi: /ng Supplementary Figure 1. PCA for ancestry in SNV data. Supplementary Figure 1 PCA for ancestry in SNV data. (a) EIGENSTRAT principal-component analysis (PCA) of SNV genotype data on all samples. (b) PCA of only proband SNV genotype data. (c) PCA of SNV genotype

More information

MultiPhen: Joint Model of Multiple Phenotypes Can Increase Discovery in GWAS

MultiPhen: Joint Model of Multiple Phenotypes Can Increase Discovery in GWAS MultiPhen: Joint Model of Multiple Phenotypes Can Increase Discovery in GWAS Paul F. O Reilly 1 *, Clive J. Hoggart 2, Yotsawat Pomyen 3,4, Federico C. F. Calboli 1, Paul Elliott 1,5, Marjo- Riitta Jarvelin

More information

ChIP-seq data analysis

ChIP-seq data analysis ChIP-seq data analysis Harri Lähdesmäki Department of Computer Science Aalto University November 24, 2017 Contents Background ChIP-seq protocol ChIP-seq data analysis Transcriptional regulation Transcriptional

More information

T. R. Golub, D. K. Slonim & Others 1999

T. R. Golub, D. K. Slonim & Others 1999 T. R. Golub, D. K. Slonim & Others 1999 Big Picture in 1999 The Need for Cancer Classification Cancer classification very important for advances in cancer treatment. Cancers of Identical grade can have

More information

Figure S2. Distribution of acgh probes on all ten chromosomes of the RIL M0022

Figure S2. Distribution of acgh probes on all ten chromosomes of the RIL M0022 96 APPENDIX B. Supporting Information for chapter 4 "changes in genome content generated via segregation of non-allelic homologs" Figure S1. Potential de novo CNV probes and sizes of apparently de novo

More information

Supplementary Materials for

Supplementary Materials for www.sciencetranslationalmedicine.org/cgi/content/full/7/283/283ra54/dc1 Supplementary Materials for Clonal status of actionable driver events and the timing of mutational processes in cancer evolution

More information

Reveal Relationships in Categorical Data

Reveal Relationships in Categorical Data SPSS Categories 15.0 Specifications Reveal Relationships in Categorical Data Unleash the full potential of your data through perceptual mapping, optimal scaling, preference scaling, and dimension reduction

More information

Research Strategy: 1. Background and Significance

Research Strategy: 1. Background and Significance Research Strategy: 1. Background and Significance 1.1. Heterogeneity is a common feature of cancer. A better understanding of this heterogeneity may present therapeutic opportunities: Intratumor heterogeneity

More information

38 Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'16

38 Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'16 38 Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'16 PGAR: ASD Candidate Gene Prioritization System Using Expression Patterns Steven Cogill and Liangjiang Wang Department of Genetics and

More information

Quality Control Analysis of Add Health GWAS Data

Quality Control Analysis of Add Health GWAS Data 2018 Add Health Documentation Report prepared by Heather M. Highland Quality Control Analysis of Add Health GWAS Data Christy L. Avery Qing Duan Yun Li Kathleen Mullan Harris CAROLINA POPULATION CENTER

More information

AVENIO family of NGS oncology assays ctdna and Tumor Tissue Analysis Kits

AVENIO family of NGS oncology assays ctdna and Tumor Tissue Analysis Kits AVENIO family of NGS oncology assays ctdna and Tumor Tissue Analysis Kits Accelerating clinical research Next-generation sequencing (NGS) has the ability to interrogate many different genes and detect

More information

Risk-prediction modelling in cancer with multiple genomic data sets: a Bayesian variable selection approach

Risk-prediction modelling in cancer with multiple genomic data sets: a Bayesian variable selection approach Risk-prediction modelling in cancer with multiple genomic data sets: a Bayesian variable selection approach Manuela Zucknick Division of Biostatistics, German Cancer Research Center Biometry Workshop,

More information

Supplementary Figures

Supplementary Figures Supplementary Figures Supplementary Figure 1. Heatmap of GO terms for differentially expressed genes. The terms were hierarchically clustered using the GO term enrichment beta. Darker red, higher positive

More information

Challenges of CGH array testing in children with developmental delay. Dr Sally Davies 17 th September 2014

Challenges of CGH array testing in children with developmental delay. Dr Sally Davies 17 th September 2014 Challenges of CGH array testing in children with developmental delay Dr Sally Davies 17 th September 2014 CGH array What is CGH array? Understanding the test Benefits Results to expect Consent issues Ethical

More information

MEA DISCUSSION PAPERS

MEA DISCUSSION PAPERS Inference Problems under a Special Form of Heteroskedasticity Helmut Farbmacher, Heinrich Kögel 03-2015 MEA DISCUSSION PAPERS mea Amalienstr. 33_D-80799 Munich_Phone+49 89 38602-355_Fax +49 89 38602-390_www.mea.mpisoc.mpg.de

More information

A Multi-Sample Based Method for Identifying Common CNVs in Normal Human Genomic Structure Using High- Resolution acgh Data

A Multi-Sample Based Method for Identifying Common CNVs in Normal Human Genomic Structure Using High- Resolution acgh Data A Multi-Sample Based Method for Identifying Common CNVs in Normal Human Genomic Structure Using High- Resolution acgh Data Chihyun Park 1, Jaegyoon Ahn 1, Youngmi Yoon 2, Sanghyun Park 1 * 1 Department

More information

Identifying Mutations Responsible for Rare Disorders Using New Technologies

Identifying Mutations Responsible for Rare Disorders Using New Technologies Identifying Mutations Responsible for Rare Disorders Using New Technologies Jacek Majewski, Department of Human Genetics, McGill University, Montreal, QC Canada Mendelian Diseases Clear mode of inheritance

More information

2) Cases and controls were genotyped on different platforms. The comparability of the platforms should be discussed.

2) Cases and controls were genotyped on different platforms. The comparability of the platforms should be discussed. Reviewers' Comments: Reviewer #1 (Remarks to the Author) The manuscript titled 'Association of variations in HLA-class II and other loci with susceptibility to lung adenocarcinoma with EGFR mutation' evaluated

More information

Statistical power and significance testing in large-scale genetic studies

Statistical power and significance testing in large-scale genetic studies STUDY DESIGNS Statistical power and significance testing in large-scale genetic studies Pak C. Sham 1 and Shaun M. Purcell 2,3 Abstract Significance testing was developed as an objective method for summarizing

More information

CURRENT GENETIC TESTING TOOLS IN NEONATAL MEDICINE. Dr. Bahar Naghavi

CURRENT GENETIC TESTING TOOLS IN NEONATAL MEDICINE. Dr. Bahar Naghavi 2 CURRENT GENETIC TESTING TOOLS IN NEONATAL MEDICINE Dr. Bahar Naghavi Assistant professor of Basic Science Department, Shahid Beheshti University of Medical Sciences, Tehran,Iran 3 Introduction Over 4000

More information

Genomics 101 (2013) Contents lists available at SciVerse ScienceDirect. Genomics. journal homepage:

Genomics 101 (2013) Contents lists available at SciVerse ScienceDirect. Genomics. journal homepage: Genomics 101 (2013) 134 138 Contents lists available at SciVerse ScienceDirect Genomics journal homepage: www.elsevier.com/locate/ygeno Gene-based copy number variation study reveals a microdeletion at

More information

Supplementary materials for: Executive control processes underlying multi- item working memory

Supplementary materials for: Executive control processes underlying multi- item working memory Supplementary materials for: Executive control processes underlying multi- item working memory Antonio H. Lara & Jonathan D. Wallis Supplementary Figure 1 Supplementary Figure 1. Behavioral measures of

More information

Title: A robustness study of parametric and non-parametric tests in Model-Based Multifactor Dimensionality Reduction for epistasis detection

Title: A robustness study of parametric and non-parametric tests in Model-Based Multifactor Dimensionality Reduction for epistasis detection Author's response to reviews Title: A robustness study of parametric and non-parametric tests in Model-Based Multifactor Dimensionality Reduction for epistasis detection Authors: Jestinah M Mahachie John

More information

Genetics and Genomics in Medicine Chapter 8 Questions

Genetics and Genomics in Medicine Chapter 8 Questions Genetics and Genomics in Medicine Chapter 8 Questions Linkage Analysis Question Question 8.1 Affected members of the pedigree above have an autosomal dominant disorder, and cytogenetic analyses using conventional

More information

Multimarker Genetic Analysis Methods for High Throughput Array Data

Multimarker Genetic Analysis Methods for High Throughput Array Data Multimarker Genetic Analysis Methods for High Throughput Array Data by Iuliana Ionita A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy Department

More information

Statistical Tests for X Chromosome Association Study. with Simulations. Jian Wang July 10, 2012

Statistical Tests for X Chromosome Association Study. with Simulations. Jian Wang July 10, 2012 Statistical Tests for X Chromosome Association Study with Simulations Jian Wang July 10, 2012 Statistical Tests Zheng G, et al. 2007. Testing association for markers on the X chromosome. Genetic Epidemiology

More information

CS2220 Introduction to Computational Biology

CS2220 Introduction to Computational Biology CS2220 Introduction to Computational Biology WEEK 8: GENOME-WIDE ASSOCIATION STUDIES (GWAS) 1 Dr. Mengling FENG Institute for Infocomm Research Massachusetts Institute of Technology mfeng@mit.edu PLANS

More information

OncoPhase: Quantification of somatic mutation cellular prevalence using phase information

OncoPhase: Quantification of somatic mutation cellular prevalence using phase information OncoPhase: Quantification of somatic mutation cellular prevalence using phase information Donatien Chedom-Fotso 1, 2, 3, Ahmed Ashour Ahmed 1, 2, and Christopher Yau 3, 4 1 Ovarian Cancer Cell Laboratory,

More information

Vega: Variational Segmentation for Copy Number Detection

Vega: Variational Segmentation for Copy Number Detection Vega: Variational Segmentation for Copy Number Detection Sandro Morganella Luigi Cerulo Giuseppe Viglietto Michele Ceccarelli Contents 1 Overview 1 2 Installation 1 3 Vega.RData Description 2 4 Run Vega

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014 Exam policy: This exam allows two one-page, two-sided cheat sheets (i.e. 4 sides); No other materials. Time: 2 hours. Be sure to write

More information

Complex Traits Activity INSTRUCTION MANUAL. ANT 2110 Introduction to Physical Anthropology Professor Julie J. Lesnik

Complex Traits Activity INSTRUCTION MANUAL. ANT 2110 Introduction to Physical Anthropology Professor Julie J. Lesnik Complex Traits Activity INSTRUCTION MANUAL ANT 2110 Introduction to Physical Anthropology Professor Julie J. Lesnik Introduction Human variation is complex. The simplest form of variation in a population

More information