cnvhap: an integrative population and haplotype based multiplatform model of SNPs and CNVs


Lachlan J M Coin 1, Julian E Asher 2, Robin G Walters 2, Julia S El-Sayed Moustafa 2, Adam J de Smith 2, Rob Sladek 3, David J Balding 4, Philippe Froguel 2,5 & Alexandra I F Blakemore 2

Nature America, Inc. All rights reserved.

Although genome-wide association studies have uncovered single-nucleotide polymorphisms (SNPs) associated with complex disease, these variants account for only a small portion of heritability. Some contribution to this missing heritability may come from copy-number variants (CNVs), in particular rare CNVs, but assessing this contribution remains challenging because CNVs, particularly small variants, are difficult to genotype accurately. We report a population-based approach for the identification of CNVs that integrates data from multiple samples and platforms. Our algorithm, cnvhap, jointly learns a chromosome-wide haplotype model of CNVs and cluster-based models of allele intensity at each probe. Using data for 5 French individuals assayed on four separate platforms, we found that cnvhap correctly detected at least 14% more deleted and 5% more amplified genotypes than PennCNV or QuantiSNP, with an 8% and 1% improvement for aberrations containing <1 probes. Combining data from multiple platforms additionally improved sensitivity.

Copy-number variants (CNVs) have been proposed as a substantial source of phenotypic variation in human populations, particularly as single-nucleotide polymorphism (SNP)-based genome-wide association studies have not identified variants with sufficient genetic effect to account for the observed heritability of complex diseases 1–3. Rare CNVs have been associated with neuropsychiatric conditions 4 and obesity 5, whereas common CNVs have been associated with several complex disorders 6–9 and have been shown to affect long-range gene regulation 10 and gene expression 11.
Personalized whole-genome sequencing has underscored the importance of CNVs as a source of individual genetic variation and suggests that a substantial number of small (<1 kb) structural variants remain to be identified 12. Although two recently published studies 13,14 have claimed that common CNVs are unlikely to account for the missing heritability in complex disease, it was recognized that current approaches do not reliably identify smaller or multiallelic CNVs, which are difficult to assay and may be poorly tagged by SNPs. This emphasizes the importance of directly interrogating copy-number information in association studies.

Although early CNV studies were largely conducted using microarray-based comparative genome hybridization (aCGH), there is now a shift to simultaneous SNP and CNV profiling using high-density SNP arrays, whose dense coverage and high throughput are ideal for exploring the role of CNVs in complex disease, for which large samples are needed to detect genetic effects. Moreover, as many genome-wide SNP association studies have already been completed for complex diseases such as obesity 1 and type 2 diabetes, there is an opportunity to reanalyze these datasets for CNV associations.

Algorithms for inferring integer CNV genotypes from SNP and aCGH arrays can be divided into two general classes. Wide methods, such as ADM 15, PennCNV 16 and QuantiSNP 17, build a one-dimensional spatial model of copy-number variation (often using a hidden Markov model (HMM)) but process each sample independently with fixed allele signal intensity clusters. Deep methods, such as TriTyper 18 or SNP-conditional outlier detection (SCOUT) 19, use the distribution of allele intensities across all samples but focus on a few probes at a time. Some packages, such as SNP-conditional mixture modeling (SCIMM) 20 and BirdSuite 21, have separate wide and deep components.
Here we use the following nomenclature: CNV genotype for the numerical copy number in a sample at a given probe position; SNP genotype for the allelic composition within a fixed CNV genotype at a biallelic probe (for example, AAB); and CNV-SNP genotype for the combined CNV and SNP genotype (for example, 3, AAB).

Here we describe cnvhap, an integrated approach to CNV detection and genotyping that is both wide and deep: it builds a spatial model of copy-number variation across the genome, and it updates its model of intensity clusters at each probe position using the population distribution of intensities. To achieve this, cnvhap builds a unified sample population and haplotypic model of copy-number variation, which can exploit large sample sizes and multiple assays to provide a step forward in CNV genotyping accuracy. cnvhap is available as Supplementary Software.

1 Department of Epidemiology and Biostatistics, School of Public Health, Imperial College London, St. Mary's Hospital, London, UK. 2 Department of Genomics of Common Disease, School of Public Health, Imperial College London, Hammersmith Hospital, London, UK. 3 Departments of Medicine and Human Genetics, McGill University and Genome Quebec Innovation Centre, Montreal, Quebec, Canada. 4 Institute of Genetics, University College London, London, UK. 5 Centre National de la Recherche Scientifique 89, Institute of Biology, Pasteur Institute, Lille, France. Correspondence should be addressed to L.J.M.C. (l.coin@imperial.ac.uk).

Received 1 March; accepted 5 May; published online 3 May 1; doi:1.138/nmeth.1466

nature methods VOL.7 NO.7 JULY 1 541


Figure 1 Schematic flow chart showing cnvhap operation. (a) Three types of input files are needed for the analysis: data files for each dataset (build file giving probe positions and wave correction, plate file, and intensity files giving sample intensity by probe), an HMM parameter file (transition model and maximum copy number) and a data-specific parameter file (data type, ploidy and starting cluster positions). (b) cnvhap constructs a haplotype HMM, with each column in the model representing a probe (shown as a colored circle) in one of the datasets, which are interwoven according to genomic location such that linear sequence corresponds to genomic sequence. Each row in the model corresponds to an unobserved copy-number state (0, red; 1, gray; and 2, blue), each with its own probability distribution over alleles with the corresponding copy number. Thus, for a given sample, each probe (column) can have a copy number of 0 (deleted, which represents only the single deletion allele), 1 (haploid normal, which represents the A and B alleles) or 2 (duplicated, which represents the AA, AB and BB alleles). In this example, arrow widths represent the probability of transitioning out of a copy number = 1 state at position zero (left), and out of a different copy-number state at position one (middle; representing the next probe along the genome). (c) cnvhap constructs a separate set of cluster positions for each probe (column). Crosses indicate trained cluster means, with colors corresponding to the most likely assigned copy-number state. The first and third probes illustrate Illumina SNP probes, for which both LRR and BAF are defined; the second probe illustrates an Agilent aCGH probe, for which only the LRR is defined. Parameters of the model are trained using the expectation-maximization algorithm, and the trained model is then used to obtain the final copy-number annotation of the population.
(d) In this copy-number annotation schematic, the annotated segment of a chromosome is shown for multiple individuals, in which each line represents a segment from one individual. Red, deletion; gray, haploid normal; and blue, duplication.

RESULTS

Genotyping CNVs using a copy-number haplotype model

cnvhap integrates intensity information from multiple platforms and cohorts into a single probabilistic model of copy-number variation. Three types of input files are provided: input data for each dataset, HMM parameters and data-specific parameters (Fig. 1a). Copy-number variation is modeled using an HMM at the haplotypic (single chromosome) level (Fig. 1b). An HMM for modeling N-ploid (for example, diploid) genomes is obtained by pairing N copies of the haploid model. The observed intensity data are modeled separately at the population level for each probe; cnvhap constructs a separate set of cluster positions for each probe, modeled as linear combinations of fixed nonlinear functions of the underlying CNV-SNP genotype (Fig. 1c).

The full model, incorporating the HMM and the cluster position coefficients, is trained using the expectation-maximization algorithm. This is a two-step procedure: in each iteration (a default number of iterations is preset), the expected CNV-SNP annotation is calculated given the current model parameters, and the parameters are then optimized given this annotation. The final CNV-SNP annotation is then calculated from the trained model (Fig. 1d).

cnvhap incorporates user-specified fixed ploidy and maximum copy number per haplotype; however, the model's complexity of (number of states)^ploidy makes it infeasible to run genome-wide with both high ploidy and high copy number. We routinely ran cnvhap on cohorts of up to 5, diploid samples with three copy-number states as well as with up to six copy-number states on 5 diploid samples or with three copy-number states on up to 5 tetraploid samples.
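The growth in model size with ploidy can be illustrated with a small enumeration. The following is a minimal sketch, not code from cnvhap: the function name `nploid_states` and its arguments are hypothetical, and it simply treats a joint state as one haploid copy-number state per chromosome copy.

```python
from itertools import product


def nploid_states(max_copy_per_haplotype: int, ploidy: int):
    """Enumerate joint states formed by combining `ploidy` copies of a haploid model.

    Hypothetical helper for illustration only; each joint state is a tuple with
    one haploid copy-number state (0 .. max_copy_per_haplotype) per chromosome copy.
    """
    haploid_states = range(max_copy_per_haplotype + 1)  # e.g. 0, 1, 2
    return list(product(haploid_states, repeat=ploidy))


# Three haploid states (0, 1, 2) in a diploid model.
diploid = nploid_states(max_copy_per_haplotype=2, ploidy=2)
print(len(diploid))                   # (number of states) ** ploidy
print(max(sum(s) for s in diploid))   # largest total copy number in a sample

# The same three haploid states in a tetraploid model.
tetraploid = nploid_states(max_copy_per_haplotype=2, ploidy=4)
print(len(tetraploid))
```

With three haploid states, the diploid model has 3^2 joint states and reaches a total copy number of four, whereas the tetraploid model already has 3^4 joint states, which is why high ploidy and high copy number are hard to combine.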
Here we describe the results generated by cnvhap for a region on chromosome 1, integrating data from an Illumina Human 1M chip and a custom Agilent 44k aCGH array (Fig. 2). The trained HMM consists of one state per haploid copy number (0, 1 or 2) (Fig. 2a). The corresponding diploid HMM used to model this autosomal region accommodates up to four copies (two per haplotype). At a given position, the trained A and B allele probabilities for the copy number = 1 state correspond to the allele proportions in the population, and the trained AA, AB and BB probabilities for the copy number = 2 state are the corresponding Hardy-Weinberg proportions.

Transition probabilities are controlled by a single global transition rate matrix, which expresses the average amount of transition between copy-number states per base pair across the entire region, and by a position-dependent scalar transition rate, which captures transition-rate hot spots (Fig. 2b). This approach is analogous to the use of a global evolutionary rate matrix describing mutational events modulated by site-specific rates in phylogenetics.

cnvhap also reclusters the transformed allele intensity measurements at each position using a regularized regression framework as part of the model training. On Illumina chips, for example, these measurements are the log ratio of observed to expected fluorescence signal intensity (log R ratio; LRR) and the proportion of observed signal intensity owing to the B allele (B-allele frequency; BAF). cnvhap fits a linear regression model in which cluster positions in the two-dimensional LRR-BAF space are expressed as linear combinations of fixed nonlinear functions of CNV-SNP genotype. The coefficients of these functions are updated in the maximization step using ridge regression 23. We found this to be particularly useful for reducing the impact of normalization-induced artifacts.
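The ridge-regression update of cluster positions can be sketched for a single probe and a single intensity dimension. This is an illustrative simplification rather than the cnvhap implementation: the feature log2(copy number / 2), the unweighted fit, the example intensities and the function name `ridge_line` are all assumptions made here, whereas the method described above uses several fixed nonlinear functions of the CNV-SNP genotype and expected genotype assignments from the expectation step.

```python
from math import log2


def ridge_line(x, y, penalty):
    """Fit y ~ b0 + b1 * x by ridge regression (closed form for two coefficients).

    Solves (X'X + penalty * I) b = X'y, where X has a column of ones and a column x.
    """
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    a11, a12, a22 = n + penalty, sx, sxx + penalty
    det = a11 * a22 - a12 * a12
    b0 = (a22 * sy - a12 * sxy) / det
    b1 = (a11 * sxy - a12 * sy) / det
    return b0, b1


# Hypothetical per-sample copy numbers and LRR values at one probe.
copy_numbers = [1, 1, 2, 2, 2, 3, 3]
feature = [log2(cn / 2) for cn in copy_numbers]   # assumed nonlinear function of genotype
lrr = [0.05 + 0.6 * f for f in feature]           # synthetic, noise-free intensities

unpenalised = ridge_line(feature, lrr, penalty=0.0)
penalised = ridge_line(feature, lrr, penalty=5.0)

# Predicted cluster mean (LRR) for a three-copy genotype under each fit.
for b0, b1 in (unpenalised, penalised):
    print(b0 + b1 * log2(3 / 2))
```

The penalty shrinks the coefficients toward zero, which keeps cluster positions stable when few samples carry a given genotype; a probe-specific offset such as the intercept here is one way a model of this kind can absorb normalization artifacts.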
For example, at position 5195 on chromosome 1 (dbSNP accession number rs39748), data normalization would ideally cause diploid samples to have a mean LRR of 0 (Fig. 2c), but cnvhap has correctly identified that for this probe, homozygote AA individuals have a mean LRR of. and heterozygote individuals have a mean LRR of.3, whereas homozygote AAA amplifications are indicated by a mean LRR of.1. cnvhap also uses the trained cluster positions to produce a probability of each CNV-SNP genotype at each position.

cnvhap's framework for integrating multiple datasets in a single HMM enables it to project copy-number state from a set of measured probes onto a set of unmeasured loci. At the unmeasured loci, the cluster-based emission probabilities are replaced with a uniform distribution over all CNV-SNP genotypes. A probability distribution over copy number is


reported at these positions, which takes into account both copy-number state at flanking loci and the estimated local copy-number transition rate. We artificially masked 3% and 7% of Illumina 1M probes to demonstrate the feasibility of this procedure and the accuracy of the uncertainty estimates (Supplementary Figs. 1 and 2).

Given the strong linkage disequilibrium reported between certain classes of CNVs and SNPs 14,24, we expect CNV genotyping accuracy to be improved by explicitly modeling CNV-SNP haplotypes. This might be achieved by using data from individuals with a strong intensity signal to identify CNV-SNP haplotypes and then using shared haplotype structure to identify CNVs with weak intensity signal.

Figure 2 Visualization of cnvhap population model on chromosome 1 for integrated Illumina 1M and Agilent 44k datasets. (a) Haploid HMM, with rows corresponding to copy-number state (0, red; 1, gray; and 2, blue) and columns corresponding to probe locations. Positions are given in megabases (Mb). Bubble size corresponds to the expected number of samples assigned to each state at each position. Text in each bubble is the trained emission probability for SNP haplotypes in each copy-number state. The width of lines between bubbles indicates transition probability. (b) Pointwise transition rate integrated across all copy-number transitions revealed transition hotspots at CNV breakpoints. (c) Cluster plots from an Agilent 44k and an Illumina 1M probe in the CNV, arranged by genomic location. Each data point corresponds to a single sample; color denotes the most likely assigned copy-number state and symbol denotes the number of B alleles in the most likely SNP genotype. Crosses indicate trained cluster means (cross position) and variance (line width) for each CNV-SNP genotype. BAF for the Agilent probes is randomly assigned between 0 and 1.
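The projection of copy-number state onto unmeasured loci can be sketched with a forward-backward pass over a small chain in which an unmeasured locus receives a uniform emission distribution. This is a minimal sketch under simplifying assumptions, not the cnvhap model: it uses a single chain of three copy-number states, a constant `stay` probability in place of the distance- and position-dependent transition rates, and hypothetical emission likelihoods.

```python
def posterior_copy_number(emissions, stay=0.9, n_states=3):
    """Posterior over copy-number states at each locus via forward-backward.

    `emissions` holds one likelihood list per locus, or None for an unmeasured
    locus, which is given a uniform emission distribution.
    """
    uniform = [1.0 / n_states] * n_states
    em = [e if e is not None else uniform for e in emissions]
    switch = (1.0 - stay) / (n_states - 1)
    trans = [[stay if i == j else switch for j in range(n_states)]
             for i in range(n_states)]

    def normalise(v):
        total = sum(v)
        return [x / total for x in v]

    # Forward pass, starting from a uniform prior over states.
    fwd = [normalise([uniform[s] * em[0][s] for s in range(n_states)])]
    for t in range(1, len(em)):
        prev = fwd[-1]
        fwd.append(normalise([
            em[t][j] * sum(prev[i] * trans[i][j] for i in range(n_states))
            for j in range(n_states)
        ]))

    # Backward pass.
    bwd = [[1.0] * n_states for _ in em]
    for t in range(len(em) - 2, -1, -1):
        bwd[t] = normalise([
            sum(trans[i][j] * em[t + 1][j] * bwd[t + 1][j] for j in range(n_states))
            for i in range(n_states)
        ])

    return [normalise([f * b for f, b in zip(fwd[t], bwd[t])])
            for t in range(len(em))]


# Two measured loci that favour copy number 0, flanking one unmeasured locus.
emissions = [[0.9, 0.05, 0.05], None, [0.9, 0.05, 0.05]]
posterior = posterior_copy_number(emissions)
print(posterior[1])
```

Because the unmeasured locus contributes no intensity information, its posterior is determined entirely by the flanking loci and the transition model, so a lower `stay` probability (a higher local transition rate) yields a flatter, more uncertain distribution.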
Hence, to model CNV-SNP haplotypes, we coupled the cnvhap copy-number transition model with the haplotype models used by fastPHASE 25 and polyHap 26 to obtain the extended cnvhap+snph model (Supplementary Fig. 3). This was feasible because both algorithms model haplotypic changes. A fixed number (default = 4) of SNP haplotype states with copy number = 1 are specified, which can be thought of as ancestral haplotype states. After model training, the emission probabilities of these states reflect haplotype-specific SNP-allele frequencies. We then constructed the copy number = 2 states as unordered pairs of copy number = 1 states. We also included a fixed number (default = 1) of copy number = 0 states in the model (we used multiple copy number = 0 states to model overlapping deletions on different haplotypes).

Genome-wide benchmarking of CNV calls

To benchmark CNV genotyping accuracy on Illumina platforms, we used data for a cohort of 5 healthy individuals from northern France previously characterized for copy-number variation using a prototype 185k genome-wide aCGH array followed by a focused 44k array custom-designed for CNV validation and mapping 27. The accuracy of the copy-number mapping of this dataset (as characterized by Agilent's ADM algorithm) has been experimentally verified 24,27. To assess CNV genotyping on Illumina arrays, we assayed data for all 5 individuals using the Illumina 1M SNP array and ran cnvhap as well as three widely used CNV prediction algorithms: PennCNV 16, QuantiSNP 17 and cnvpartition 28 (Supplementary Fig. 4). We also assayed data for a subset of 36 individuals on a prototype Illumina 317k array, on which we ran cnvhap only. To evaluate accuracy, we projected Illumina-based copy-number predictions onto Agilent aCGH 44k probes and compared predictions to the copy-number genotypes called by ADM (Supplementary Table 1).
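For algorithms that report hard segment calls, projecting predictions onto the probes of another platform amounts to an interval lookup by genomic position. The sketch below illustrates that idea only; it is an assumption about how such a comparison could be set up, not the procedure used in the study, and the function name `project_calls`, the default copy number of 2 and the coordinates are hypothetical.

```python
from bisect import bisect_right


def project_calls(segments, probe_positions, default_cn=2):
    """Assign each target probe the copy number of the segment that contains it.

    `segments` is a list of non-overlapping (start, end, copy_number) tuples,
    sorted by start, with inclusive end coordinates. Probes outside every
    segment receive `default_cn`.
    """
    starts = [start for start, _, _ in segments]
    projected = []
    for pos in probe_positions:
        i = bisect_right(starts, pos) - 1   # last segment starting at or before pos
        if i >= 0 and pos <= segments[i][1]:
            projected.append(segments[i][2])
        else:
            projected.append(default_cn)
    return projected


# Hypothetical calls from one platform: a deletion and an amplification.
segments = [(100, 200, 1), (500, 600, 3)]
# Hypothetical probe positions on the benchmark platform.
probes = [50, 100, 150, 200, 250, 550, 700]
print(project_calls(segments, probes))
```

A probabilistic caller can instead report a posterior distribution at each target position, which retains the uncertainty at probes far from any measured locus.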
One caveat is that regions detectable on the 44k but not the 1M chip (for example, owing to absence of 1M probes or CNVs in the 44k reference sample) will be reported as false negative predictions for all three algorithms; similarly, genuine CNVs identified by the test algorithm but not identified by ADM will be erroneously scored as false positives, which may have the effect of penalizing highly sensitive algorithms.

Copy-number annotation on the previously illustrated region on chromosome 1 showed that all three algorithms (cnvhap, PennCNV and QuantiSNP) run on the 1M chip detected a 65-kb deletion (Fig. 3a–c), although PennCNV missed one instance of this deletion compared to the ADM reference. cnvhap+snph identified (different) shared flanking SNP haplotypes for amplifications and deletions (Fig. 3d). cnvhap also identified four instances of an amplification missed by the other two algorithms, which we confirmed by analysis of the 44k or 185k Agilent arrays with either cnvhap or ADM (Fig. 3e–h).

We also compared the algorithms' accuracy in correctly calling copy number for each sample at sites of known copy-number variation genome-wide. The cumulative distribution of the squared Pearson's correlation coefficient between predicted copy number (using Illumina 1M data) and benchmark copy number


(ADM using 44k aCGH data) at all copy-number aberrant sites in the benchmark revealed that cnvhap is markedly more accurate (Fig. 4a). At an r² threshold of.5, cnvhap correctly genotyped 44.8% of all copy-number aberrant sites (3.7%, 7.1% and 1.8% more than QuantiSNP, PennCNV and cnvpartition, respectively). Extending cnvhap to include SNP haplotypes added an improvement of.7% of all copy-number aberrant sites; however, for the 317k chip, cnvhap+snph had reduced accuracy, indicating that lower probe density can lead to overfitting of SNP haplotypes. The converse comparison, using PennCNV calls from the 1M chip as a benchmark for assessing CNV calls using aCGH data, indicated that cnvhap outperformed ADM on both the 185k and 44k aCGH chips (although the PennCNV benchmark may be slightly biased in favor of HMM methods such as cnvhap) (Fig. 4b and Supplementary Table 2).

Figure 3 CNV predictions for chromosome 1.45 Mb.536 Mb. (a–h) Copy-number annotation on each chromosome segment in the population (data for individuals are separated by thin white lines) obtained with the indicated algorithm and datasets (1M is an Illumina dataset; 44k and 185k are Agilent). Yellow, purple, aqua and dark red in d correspond to four different ancestral SNP haplotypes inferred by cnvhap+snph. White rows in g and h indicate samples that were not measured on the Agilent 185k array.

To investigate cnvhap's genotyping accuracy, we calculated receiver operating characteristic (ROC) curves for detection of copy-number variation in each individual for all probes genome-wide, and of the presence or absence of copy-number variation in the population as a whole (Fig. 5). For each algorithm, as the probability threshold for assignment of copy-number aberrant probes decreased, the number of true and false copy-number assignments (as determined by the ADM 44k benchmark) increased, thus tracing out the ROC curve.
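The two evaluation measures used above, the squared Pearson's correlation between predicted and benchmark copy number at a site and the ROC curve traced by lowering a probability threshold, can be sketched as follows. This is an illustrative sketch with hypothetical function names and data, not the evaluation code used in the study; for simplicity, tied scores are processed one at a time rather than as a single threshold step.

```python
def squared_pearson(predicted, benchmark):
    """Squared Pearson correlation between two equal-length lists of copy numbers."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_b = sum(benchmark) / n
    cov = sum((p - mean_p) * (b - mean_b) for p, b in zip(predicted, benchmark))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_b = sum((b - mean_b) ** 2 for b in benchmark)
    if var_p == 0 or var_b == 0:
        return 0.0   # no variation in one of the call sets: treat as uncorrelated
    return cov * cov / (var_p * var_b)


def roc_points(scores, is_aberrant):
    """Trace (false positives, true positives) counts as the threshold decreases.

    `scores` are per-probe probabilities of being copy-number aberrant and
    `is_aberrant` are the corresponding benchmark labels.
    """
    ranked = sorted(zip(scores, is_aberrant), key=lambda pair: -pair[0])
    true_pos = false_pos = 0
    points = [(0, 0)]
    for _, label in ranked:
        if label:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos, true_pos))
    return points


# Hypothetical copy-number calls across four samples at one site.
print(squared_pearson([2, 1, 2, 3], [2, 1, 2, 3]))
# Hypothetical per-probe scores and benchmark labels.
print(roc_points([0.9, 0.8, 0.3, 0.1], [True, False, True, False]))
```

Computing the squared correlation per site and then plotting its cumulative distribution across sites gives curves of the kind shown in Figure 4, and the threshold sweep gives curves of the kind shown in Figure 5.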
For per-sample calls, cnvhap detected substantially more deleted and amplified probes than other algorithms did (Fig. 5a). cnvhap also detected more copy-number aberrant probes overall in the population, although with less marked improvement compared to per-sample calls (Fig. 5c), indicating that improvement in individual sample detection was the compound effect of gains in overall detection and gains in individual-sample genotyping. cnvhap was also superior in identifying correct copy-number breakpoints (Fig. 5b). For shorter CNVs, notably those with <1 Illumina probes, cnvhap's advantage was even more striking (Supplementary Fig. 5). We observed similar results for the PennCNV benchmark (Supplementary Figs. 6 and 7).

The cnvhap+snph model was less sensitive in detecting amplifications than cnvhap alone. This may be due to the lack of linkage disequilibrium between SNPs flanking the original location and the amplified state when the amplified copy is not in tandem with the original 14. cnvhap+snph also was less specific in its calls for both deletions and amplifications. However, in view of cnvhap+snph's increased genotyping accuracy (Fig. 4a and Supplementary Table 3), we suggest that cnvhap+snph was calling additional genuine, small CNVs on the Illumina 1M array that were not detected on the ADM (44k) reference and which were, therefore, erroneously scored as false positives. SNP genotyping discordance rate (between Illumina 1M and 317k chips) for cnvhap and cnvhap+snph was similar for different copy-number states (Supplementary Fig. 8).
Figure 4 Cumulative frequency of squared Pearson's correlation coefficient between predicted and benchmark copy-number calls. (a) Comparison of algorithms applied to Illumina 1M and 317k data with a 44k aCGH (ADM) benchmark. (b) Comparison of algorithms applied to 44k and 185k aCGH data with an Illumina 1M (PennCNV) benchmark.

Multiplatform integration

We also evaluated the advantages of cnvhap's ability to integrate multiple data sources in a single probabilistic model by assessing the sensitivity of cnvhap calls on different combinations of the four datasets (Illumina 1M, Illumina 317k, Agilent 44k and Agilent 185k). We anticipated that this would improve detection via greater probe coverage and partial replicate measurements. We used copy-number calls produced by cnvhap for the maximally dense probe set (integrating all four chips) as the benchmark to compare the performance of different subsets. We found that combining information always increased genotyping accuracy relative to individual component datasets, even when one probe set was a subset of the other (for example, 317k+1M outperformed 1M) or when the second-stage probe set was designed to refine predicted CNV boundaries (for example, 185k+44k outperformed 44k) (Fig. 6a). We conclude that if


cohorts have been regenotyped on a denser chip, intensity information from the older chip should still be included in the analysis. This improvement is most striking if we consider the accuracy of breakpoint identification (Supplementary Fig. 9). The most accurate breakpoint identification is provided by the combination of Illumina genotyping chips (317k+1M) for deletions, and by the combination of Agilent chips for amplifications.

As another test of the increase in genotyping accuracy provided by combining datasets, we examined small (<3 kb) deletions initially detected in at least two samples by ADM in the 44k aCGH data, with genotypes and breakpoints experimentally determined using PCR and sequencing 24. Despite ascertainment bias in favor of the 44k chip, integrating 185k aCGH data or 1M Illumina data boosted sensitivity (Fig. 6b).

Figure 5 ROC curves for detecting CNVs using Illumina 1M data. (a–c) Each algorithm was run genome-wide on Illumina data and projected to 44k probes, and data are plotted for deletions (left) and amplifications (right). We detected copy-number genotypes for each individual (a), copy-number breakpoints for each individual (b) and the presence or absence of copy-number variation in a population (c).

Validation using an independent HapMap dataset

We validated cnvhap's predictions on 118 HapMap samples using CNVs identified from fosmid end sequence pair (fosmid ESP) maps on eight of those samples 29. Of all such fosmid ESP-defined CNVs validated by targeted aCGH, cnvhap with default parameters identified 68% of >1 kb CNVs, 6% of 5 1 kb CNVs and 31% of <5 kb CNVs. Of 894 deletions >1 kb identified by cnvhap, 44% overlapped a deletion identified by fosmid ESP mapping (compared with 45% of 8 deletions identified by SCIMM 20). Similarly, sequence data validated 18% of 369 cnvhap-identified amplifications (14% of 11 amplifications for SCIMM).
This is consistent with our benchmarking results and demonstrated that cnvhap has increased sensitivity while maintaining high specificity.

To assess cnvhap's genotyping accuracy on this dataset, we examined cnvhap predictions at the sites of 18 common deletions previously independently genotyped by PCR and Illumina GoldenGate assays, using the correlation coefficient between predicted and benchmark copy-number calls (Supplementary Table 3). cnvhap's genotyping accuracy was comparable to that of SCIMM at most sites. Although cnvhap missed one CNV on chromosome 7, it correctly identified and accurately genotyped a CNV on chromosome not identified by SCIMM. Including haplotype information (cnvhap+snph) further improved genotyping accuracy and enabled identification of a locus on chromosome missed by SCIMM. A simple adjustment to the HMM parameters to relax the initial probability of transitioning into a CNV also improved genotyping accuracy.

Figure 6 Combining datasets improved sensitivity. (a) Cumulative frequency of squared Pearson's correlation coefficient for combinations of datasets analyzed with cnvhap using the maximal probe set (185k+44k+317k+1M) as benchmark. (b) ROC curves for per-sample detection of sequenced deletions for different dataset combinations.

DISCUSSION

cnvhap is a cross-platform CNV genotyping tool that bridges the gap between CNV discovery (typically carried out sample by sample) and CNV genotyping (typically performed at a few probes at validated CNV locations). cnvhap's increased accuracy results from improved modeling of the underlying structure of aberrations via a haplotype model of copy-number variation, combined with more effective synthesis of information across samples to recluster intensity at each probe.
Although here we focused on Illumina and Agilent data, cnvhap has modules for sequence and Affymetrix data and can be readily extended to incorporate modules for other platforms. Our multiplatform benchmarking dataset will be a useful resource for other algorithm comparisons.

The extended model (cnvhap+snph) incorporated SNP haplotypes into the underlying model, improving genotyping of deletions. We anticipate even greater improvement (particularly when using sparser arrays) if dense intensity data from a reference panel, providing a robust source of CNV-SNP haplotypes, are included in the analysis. Additionally, the SNP haplotype model incorporated in cnvhap+snph accurately phases polyploid genomic regions 30, enabling cnvhap+snph to jointly discover and phase CNV regions.

We used cnvhap to detect rare recurrent deletions and amplifications of a region on chromosome 16p11.2, which is strongly associated with obesity in multiple genome-wide association datasets 5. Accurate estimates of the posterior probability distribution over copy-number genotypes obtained by projecting copy-number

annotation from one platform to another will be useful for combining effect size estimates in meta-analyses incorporating multiple platforms with partially correlated probe density. Filtering out low-information-content probes and then accounting for copy-number uncertainty in the association analyses should reduce spurious associations.

Our work is particularly topical as researchers seek improved CNV genotyping to facilitate detection of CNV-phenotype associations. The high error rate in detecting certain classes of CNVs, including short deletions (also the most prevalent 1 ) and amplifications, has contributed to the relative paucity of reported CNV-disease associations and highlights the need for more sensitive detection methods 13,14. Many attempts to identify CNV-phenotype associations have been carried out using single studies (usually with modest sample size), leading to low power to detect modest effects or associations with rare CNVs. The difficulty in accurately genotyping CNVs leads to additional loss of power. Thus, there is a real need for a framework for improving CNV genotyping accuracy, integrating data across multiple platforms and pooling results across studies. cnvhap fills this gap in the genome-wide CNV analysis toolkit, and we anticipate that it will lead to improvements in detecting CNV-disease associations.

Methods

Methods and any associated references are available in the online version of the paper.

Note: Supplementary information is available on the Nature Methods website.

Acknowledgments

We thank D. Serre, A. Montpetit and D. Vincent for advice concerning Illumina arrays and D. Peiffer (Illumina) for providing genotype data on HapMap samples. Genome Canada and Genome Quebec funded genotyping on the Illumina Human1M platform. L.J.M.C. is funded by a Research Council UK fellowship. J.E.A. is supported by the Medical Research Council. R.G.W.
is supported by Johnson & Johnson and the South East England Development Agency. J.S.E.-S.M. is supported by an Imperial College Division of Medicine PhD studentship.

AUTHOR CONTRIBUTIONS
L.J.M.C. designed the project with A.I.F.B., developed the cnvhap algorithm and software, analyzed data and wrote the paper. J.E.A. ran cnvpartition, PennCNV and QuantiSNP on the data and helped write the paper. R.G.W. and J.S.E.-S.M. provided critical comments and helped to write the paper. D.J.B. provided statistical advice. R.S. provided SNP genotype data, advised on its interpretation and edited the paper. A.J.d.S. provided aCGH data and advised on its interpretation. P.F. provided the DNA samples and coordinated the SNP genotyping. A.I.F.B. designed the project with L.J.M.C., coordinated the aCGH analysis, contributed to writing the paper and oversaw the project.

COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.

Reprints and permissions information is available online at com/reprintsandpermissions/.

1. Meyre, D. et al. Genome-wide association study for early-onset and morbid adult obesity identifies three new risk loci in European populations. Nat. Genet. 41, 7 9 (9).
2. Sladek, R. et al. A genome-wide association study identifies novel risk loci for type diabetes. Nature 445, (7).
3. Zeggini, E. et al. Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type diabetes. Nat. Genet. 4, (8).
4. Cook, E.H. & Scherer, S.W. Copy-number variations associated with neuropsychiatric conditions. Nature 455, (8).
5. Walters, R.G. et al. A new highly penetrant form of obesity due to deletions on chromosome 16p11.. Nature 463, (1).
6. Aitman, T.J. et al. Copy number polymorphism in Fcgr3 predisposes to glomerulonephritis in rats and humans. Nature 439, (6).
7. Diskin, S.J. et al. Copy number variation at 1q1.1 associated with neuroblastoma. Nature 459, (9).
8. McCarroll, S.A. et al. Deletion polymorphism upstream of IRGM associated with altered IRGM expression and Crohn's disease. Nat. Genet. 4, (8).
9. Willer, C.J. et al. Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nat. Genet. 41, 34 (9).
10. Kleinjan, D.A. & van Heyningen, V. Long-range control of gene expression: emerging mechanisms and disruption in disease. Am. J. Hum. Genet. 76, 8 3 ().
11. Stranger, B.E. et al. Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 3, (7).
12. Bentley, D.R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, (8).
13. Wellcome Trust Case Control Consortium. Genome-wide association study of CNVs in 16, cases of eight common diseases and 3, shared controls. Nature 464, (1).
14. Conrad, D.F. et al. Origins and functional impact of copy number variation in the human genome. Nature 464, (1).
15. Lipson, D., Aumann, Y., Ben-Dor, A., Linial, N. & Yakhini, Z. Efficient calculation of interval scores for DNA copy number data analysis. J. Comput. Biol. 13, 8 (6).
16. Wang, K. et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 17, (7).
17. Colella, S. et al. QuantiSNP: an objective Bayes hidden-Markov model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res., 13 (7).
18. Franke, L. et al. Detection, imputation, and association analysis of small deletions and null alleles on oligonucleotide arrays. Am. J. Hum. Genet. 8, (8).
19. Mefford, H.C. et al. A method for rapid, targeted CNV genotyping identifies rare variants associated with neurocognitive disease. Genome Res. 19, (9).
20. Cooper, G.M., Zerr, T., Kidd, J.M., Eichler, E.E. & Nickerson, D.A. Systematic assessment of copy-number-variant detection via genome-wide SNP genotyping. Nat. Genet. 4, (8).
21. Korn, J.M. et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat. Genet. 4, (8).
22. Coin, L. & Durbin, R. Improved techniques for the identification of pseudogenes. Bioinformatics (Suppl. 1), i94 i1 (4).
23. Hoerl, A.E. Application of ridge analysis to regression problems. Chem. Eng. Prog. 58, (196).
24. de Smith, A.J. et al. Small deletion variants have stable breakpoints commonly associated with Alu elements. PLoS One 3, e314 (8).
25. Scheet, P. & Stephens, M. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. 78, (6).
26. Su, S.-Y., Balding, D.J. & Coin, L.J.M. Inference of haplotypic phase and missing genotypes in polyploid organisms and variable copy number genomic regions. BMC Bioinformatics 9, 513 (8).
27. de Smith, A.J. et al. Array CGH analysis of copy number variation identifies 184 new genes variant in healthy white males: implications for association studies of complex diseases. Hum. Mol. Genet. 16, (7).
28. Peiffer, D.A. et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res. 16, (6).
29. Kidd, J.M. et al. Mapping and sequencing of structural variation from eight human genomes. Nature 453, (8).
30. Su, S.-Y., Balding, D.J. & Coin, L.J.M. Disease association tests by inferring ancestral haplotypes using a hidden Markov model. Bioinformatics 4, (8).

546 VOL.7 NO.7 JULY 1 nature methods

ONLINE METHODS

Definition of CNV-SNP alleles and genotypes. The CNV-SNP alleles were defined up to a user-specified maximum haploid copy number and are denoted by $h \in \{0, \ldots, H\}$. For example, given a maximum copy number of 2, the space of CNV-SNP alleles is $\{-, A, B, AA, AB, BB\}$ at a biallelic probe and $\{-, A, AA\}$ at a monoallelic probe, where $-$ denotes a deletion. For user-specified ploidy $N$, the space of CNV-SNP genotypes was then constructed as all unordered lists of length $N$ of CNV-SNP alleles.

cnvhap haplotype hidden Markov model (HMM). cnvhap builds an HMM at the haplotype level (Fig. 1d). At each probe $m \in \{1, \ldots, M\}$, the HMM comprises one hidden state $s_{m,l}$ per haploid copy number $l$, up to a user-defined maximum, $l \in \{0, \ldots, z - 1\}$. Each state has an emission probability distribution over all CNV-SNP alleles $h$ with the corresponding copy number, which we denote by $\theta_{m,l}(h)$. Because each state corresponds to a particular copy number, $\theta_{m,l}(h) = 0$ if $\mathrm{CN}(h) \neq \mathrm{CN}(s_{m,l})$, where CN denotes the copy number of a state, CNV-SNP allele or genotype. This means that for the deletion state $\theta_{m,l}(-) = 1$. The emission probability distributions for states with $\mathrm{CN}(s_{m,l}) > 1$ are expressed such that they reflect the Hardy-Weinberg equilibrium distributions of the single-copy state (Supplementary Note 1). This encourages the model to find duplication alleles that are consistent with the underlying SNP allele distribution.

Haploid transition probabilities between copy number states. We used continuous-time Markov chain theory to define transition probabilities between CN states. Let $Q$ define a global transition rate matrix between different CN states, so that $Q_{k,l}$ is the instantaneous rate of transition from state $k$ to state $l$. We require that the rows of this matrix sum to zero, that is, $Q_{k,k} = -\sum_{l \neq k} Q_{k,l}$.
We denote by $P_k(d_m)$ the probability that the sample is in state $k$ at genomic distance $d_m$ and by $P(d_m)$ the vector of these probabilities. We also define $\pi$ as the equilibrium probability distribution over copy number states, such that $\pi Q = 0$ (with $\pi$ written as a row vector), which can be found using the singular value decomposition of $Q$. The average transition rate for $Q$ is defined as $r_Q = -\sum_k \pi_k Q_{kk}$. The evolution of $P$ over genomic distance $d$ is defined by the differential equation $P'(d) = P(d)\,Q\,r$, in which $r$ is an arbitrary distance scaling parameter. The solution of this equation is given by the initial probability distribution multiplied by a matrix exponential: $P(d) = P(0)\,e^{Qrd}$. Thus, we can calculate the transition probability between copy number states at probes $m - 1$ and $m$ as

$$p(s_m = l \mid s_{m-1} = k) = \left\{ e^{Q r_m (d_m - d_{m-1})} \right\}_{k,l} \qquad (1)$$

in which $r_m$ is the site-specific transition rate. Details of how the matrix $Q$ and the local transition rate $r_m$ are initialized and updated during training are available in Supplementary Note 1.

cnvhap+snph haplotype HMM. The cnvhap+snph HMM fuses the cnvhap model described above with the haplotype HMM used in polyhap and fastphase. To achieve this, we introduced multiple states per CN (Supplementary Fig. 3). A user-defined number of states with CN = 1 is included in the model. The CN = 2 states are defined as all unordered pairs of CN = 1 states. A user-defined number of CN = 0 states is also added to the model. As above, the emission probability distributions $\theta_{m,l}(h)$ are only defined for CN = 1 states, with higher copy-number states defined in terms of the CN = 1 states they comprise (Supplementary Note 1). To fuse the polyhap transition model with the cnvhap model, we separately define between-CN, within-same-CN and within-different-CN transition probabilities. The within-CN transition model is the same as that used in polyhap.
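Equation 1 and the equilibrium distribution can be illustrated numerically. The following is a minimal sketch rather than the cnvhap implementation: the function names and the example rates are hypothetical, the equilibrium is taken from the SVD of the transposed rate matrix, and the matrix exponential is computed by eigendecomposition under the assumption that the rate matrix is diagonalisable (a library routine such as `scipy.linalg.expm` would be the more robust choice).

```python
import numpy as np

def rate_matrix(off_diagonal):
    """Return Q with the given off-diagonal rates and rows summing to zero."""
    Q = np.array(off_diagonal, dtype=float)
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q

def equilibrium(Q):
    """Equilibrium distribution pi (pi @ Q = 0), from the SVD of Q transposed."""
    _, _, vt = np.linalg.svd(Q.T)
    pi = np.abs(vt[-1])  # right singular vector for the zero singular value
    return pi / pi.sum()

def matrix_exponential(A):
    """exp(A) by eigendecomposition; assumes A is diagonalisable."""
    w, V = np.linalg.eig(A)
    return np.real(V @ np.diag(np.exp(w)) @ np.linalg.inv(V))

def transition_probabilities(Q, r_m, d_prev, d_curr):
    """Entry [k, l] is p(s_m = l | s_{m-1} = k) for probes at d_prev and d_curr."""
    return matrix_exponential(Q * r_m * (d_curr - d_prev))

# Hypothetical per-base-pair rates between copy-number states 0, 1 and 2.
Q = rate_matrix([[0.0, 2e-5, 1e-6],
                 [2e-6, 0.0, 1e-6],
                 [3e-6, 1e-5, 0.0]])
pi = equilibrium(Q)
P = transition_probabilities(Q, r_m=1.0, d_prev=10_000, d_curr=15_000)
P_far = transition_probabilities(Q, r_m=1.0, d_prev=0, d_curr=10_000_000)
```

Each row of `P` is a probability distribution over the copy-number state at the next probe, and for widely separated probes (`P_far`) every row approaches `pi`, so distant probes become effectively independent.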
Similarly, the between-CN model is the same as the cnvhap transition model described above, except that we model haplotype-specific transition hotspots by allowing the site-specific transition rate $r_m$ to vary between different copy-number states. The within-different-CN transition probabilities capture joint haplotypes between states with different CN. Denote by $\mathrm{ci}(k)$ the index of state $k$ among all states with copy number $\mathrm{CN}(k)$. The fused transition probability is the product of within and between copy-number transitions:

$$
\begin{aligned}
p(s_m = l \mid s_{m-1} = k) ={}& p\big(\mathrm{ci}(s_m) = \mathrm{ci}(l) \mid \mathrm{ci}(s_{m-1}) = \mathrm{ci}(k),\ \mathrm{CN}(s_m) = \mathrm{CN}(l),\ \mathrm{CN}(s_{m-1}) = \mathrm{CN}(k)\big) \\
&\times p\big(\mathrm{CN}(s_m) = \mathrm{CN}(l) \mid \mathrm{CN}(s_{m-1}) = \mathrm{CN}(k)\big)
\end{aligned} \qquad (2)
$$

We set the within-same-CN transition to zero if $\mathrm{CN}(l) = \mathrm{CN}(k) = 0$ (to disallow haplotype state switches in a deletion); to the polyhap probability if $\mathrm{CN}(l) = \mathrm{CN}(k) = 1$; and to a transition probability defined in terms of CN = 1 polyhap probabilities if $\mathrm{CN}(l) = \mathrm{CN}(k) > 1$ (Supplementary Note 1). The within-different-CN transition rate is a parameter $\tau_{\mathrm{CN}(k),\mathrm{CN}(l),\mathrm{ci}(k),\mathrm{ci}(l)}$, which was updated during training.

Modeling intensity data. For Illumina genotyping arrays, the observed data we model are the log R intensity ratio and B-allele frequency at each marker: $(\mathrm{LRR}_m, \mathrm{BAF}_m)$. For Agilent arrays the observed data consist of LRR only. We followed the work of others in assuming that $\mathrm{LRR}_m$ and $\mathrm{BAF}_m$ are uncorrelated. We assumed that, given an (unobserved) genotype $g$, $\mathrm{LRR}_m$ is distributed normally with an unobserved mean and variance, that is, $\mathrm{LRR}_m \sim N(\mu_{rm}(g), \sigma_{rm}(g))$; and that $\mathrm{BAF}_m$ is distributed according to a normal distribution truncated to lie in the range $0 \le \mathrm{BAF}_m \le 1$, that is, $\mathrm{BAF}_m \sim TN(\mu_{bm}(g), \sigma_{bm}(g))$, unless $\mathrm{CN}(g) = 0$, in which case we assume that $\mathrm{BAF}_m$ is uniformly distributed between 0 and 1.
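The emission model just described can be written down in a few lines. This is a sketch under stated assumptions, not the cnvhap code: the function names are hypothetical, the second parameter of each distribution is treated as a standard deviation, and the joint density is taken as the product of the LRR and BAF densities, which treats the two as independent rather than merely uncorrelated.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def truncated_normal_pdf(x, mu, sigma, lo=0.0, hi=1.0):
    """Normal density renormalised to the interval [lo, hi]; zero outside it."""
    if x < lo or x > hi:
        return 0.0
    mass = normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)
    return normal_pdf(x, mu, sigma) / mass

def emission_density(lrr, baf, copy_number, mu_r, sigma_r, mu_b=None, sigma_b=None):
    """Density of (LRR, BAF) given a genotype with the supplied cluster parameters.

    baf=None models an LRR-only platform; copy_number == 0 uses a uniform
    BAF density on [0, 1], which is 1 everywhere on that interval.
    """
    density = normal_pdf(lrr, mu_r, sigma_r)
    if baf is None or copy_number == 0:
        return density
    return density * truncated_normal_pdf(baf, mu_b, sigma_b)

# Hypothetical heterozygote (AB) cluster: LRR centred on 0, BAF centred on 0.5.
near_cluster = emission_density(0.0, 0.5, 2, 0.0, 0.2, 0.5, 0.03)
far_from_cluster = emission_density(0.0, 0.9, 2, 0.0, 0.2, 0.5, 0.03)
```

A probe whose BAF sits at the heterozygote cluster centre receives a much higher density than one at 0.9, which is what lets the HMM distinguish genotypes with the same copy number.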
The mean and variance for BAF and LRR for all $g$ with $\mathrm{CN}(g) > 0$ were parameterised via a linear model:

$$
\begin{pmatrix} \mu_{rm}(g) \\ \sigma_{rm}(g) \\ \mu_{bm}(g) \\ \sigma_{bm}(g) \end{pmatrix}
= \beta \begin{pmatrix} f_0(g) \\ f_1(g) \\ \vdots \\ f_5(g) \end{pmatrix},
\qquad
\begin{aligned}
f_0(g) &= 1 \\
f_1(g) &= \log(\mathrm{CN}(g)/2) \\
f_2(g) &= (f_1(g))^2 \\
f_3(g) &= \mathrm{bfrac}(g) \\
f_4(g) &= f_3(g)\,(1 - f_3(g)) \\
f_5(g) &= f_3(g)\,(f_3(g) - 0.5)\,(f_3(g) - 1)
\end{aligned}
\qquad (3)
$$

in which $\beta$ is a 4 by 6 matrix of parameters, the functions $f$ are basis functions, and $\mathrm{bfrac}(g)$ is defined as the proportion of B alleles in the genotype $g$. We also restricted $\sigma_{rm}(g) > 0$ and $\sigma_{bm}(g) > 0$. For $\mathrm{CN}(g) = 0$, we simply allowed $\mu_{rm}(g)$ and $\sigma_{rm}(g)$ to be free parameters of our model. Focusing on the first row of equation 3, this model expresses the LRR mean as a function of the theoretical log R ratio (given by $f_1(g)$) and of its square. It also allows a dependence on terms involving the theoretical B-allele fraction of genotype $g$ (given by $f_3(g)$), which, in terms of the cluster plots, means that the expected LRR value can increase (or decrease) as we go from homozygote AA to heterozygote to homozygote BB.

doi:1.138/nmeth.1466 nature methods
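The linear model in equation 3 can be sketched as follows. Several details here are assumptions made for illustration: a genotype is represented as a string of its A and B alleles (so its length is its copy number), the logarithm is taken as the natural log because the text does not state a base, and the parameter values in `beta` are invented.

```python
import math

def bfrac(genotype):
    """Proportion of B alleles in a genotype string such as 'AAB'."""
    return genotype.count("B") / len(genotype)

def basis(genotype):
    """Basis functions f_0..f_5 for a genotype with copy number > 0."""
    copy_number = len(genotype)
    f1 = math.log(copy_number / 2.0)  # theoretical log R ratio relative to two copies
    f3 = bfrac(genotype)              # theoretical B-allele fraction
    return [1.0, f1, f1 * f1, f3, f3 * (1.0 - f3), f3 * (f3 - 0.5) * (f3 - 1.0)]

def cluster_parameters(beta, genotype):
    """(mu_r, sigma_r, mu_b, sigma_b) as the 4 x 6 matrix beta times the basis vector."""
    f = basis(genotype)
    return tuple(sum(b * x for b, x in zip(row, f)) for row in beta)

# Hypothetical parameters: LRR mean follows f_1, BAF mean follows f_3,
# and both spreads are constant across genotypes.
beta = [
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],   # mu_r
    [0.2, 0.0, 0.0, 0.0, 0.0, 0.0],   # sigma_r
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],   # mu_b
    [0.03, 0.0, 0.0, 0.0, 0.0, 0.0],  # sigma_b
]
het = cluster_parameters(beta, "AB")
single_copy = cluster_parameters(beta, "A")
```

Because every genotype cluster at a probe is generated from the same small parameter matrix, clusters for rarely observed genotypes (for example AAB) are positioned by extrapolation from the well-populated ones instead of being estimated independently.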

Note that although these basis functions seem to work well in practice, any function of the genotype $g$ could be used.

Creating a multiplatform HMM and mapping between different chips. We describe here how to create a multiplatform HMM that can integrate information from multiple sources. The positions of all of the probes on each of the platforms that have been used (on any individual) are interwoven according to their genomic location. The total number of positions from all of the chips forms the total number of probes $M$. Then, for each individual, the emission probability at each position is given by one of the likelihoods described above and in Supplementary Note 1, depending on which platform was used at that position. If data for an individual were not collected on a particular platform, then a uniform genotype distribution was used at that position to allow the possibility of any CNV genotype. If two platforms assayed the same genomic location, then the likelihood is the product of the two platform-specific likelihoods. This also provides a way to project CNV predictions from one array to another: we simply included all of the probe positions on the second array but used a uniform CNV-SNP genotype distribution at these positions rather than the actual data.

Estimating genotypes from the model together with uncertainties. Details of constructing an N-ploid HMM from the haploid HMM are given in Supplementary Note 1. We ran the Baum-Welch training algorithm to estimate the parameters of the haploid HMM, which approximate a local mode of the posterior probability distribution.
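The multiplatform emission rule described earlier in this section (interleave probe positions, multiply likelihoods where platforms overlap, fall back to a uniform distribution where an individual has no data) can be sketched as below. The data layout, function names and platform labels are hypothetical, and the uniform distribution is represented by a constant likelihood of 1 for every genotype, which differs from a normalised uniform only by a factor that cancels in the HMM posteriors.

```python
def merge_probes(platform_positions):
    """Interleave probe positions from all platforms, ordered by genomic location."""
    return sorted({pos for positions in platform_positions.values() for pos in positions})

def combined_likelihood(position, genotypes, sample_likelihoods):
    """Per-genotype emission likelihood for one individual at one position.

    sample_likelihoods maps platform -> position -> genotype -> likelihood and
    contains only the platforms on which this individual was assayed.
    """
    combined = {g: 1.0 for g in genotypes}  # uniform when no platform has data here
    for per_position in sample_likelihoods.values():
        observed = per_position.get(position)
        if observed is None:
            continue
        for g in genotypes:
            combined[g] *= observed[g]      # product of platform-specific likelihoods
    return combined

genotypes = ["A", "AA"]
probes = merge_probes({"snp_array": [100, 250, 400], "cgh_array": [250, 300]})
sample = {
    "snp_array": {250: {"A": 0.1, "AA": 0.9}},
    "cgh_array": {250: {"A": 0.5, "AA": 0.8}, 300: {"A": 0.7, "AA": 0.2}},
}
at_shared = combined_likelihood(250, genotypes, sample)
at_missing = combined_likelihood(100, genotypes, sample)
```

Projecting calls onto a second array corresponds to the `at_missing` case: the second array's positions are present in `probes`, but since no data are supplied there, the genotype posterior at those positions is driven entirely by the neighbouring probes through the transition model.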
For data from each individual, we ran the forward-backward algorithm and, using standard dynamic programming techniques, calculated the probability distribution (conditional on the model parameters) over unordered lists of hidden states. Using this together with Supplementary Note 1, equation 3, we calculated the probability distribution over CNV-SNP genotypes. This procedure can be repeated for a user-specified number of training repetitions, and the CNV-SNP probability distribution can be averaged over these repetitions. We have previously reported that the gain in phasing accuracy from using ten repetitions instead of one with polyhap is modest, and even smaller for inferring missing genotype data 6. To minimize the computational burden, we used only one iteration in this work.

Samples and genotyping. DNA was isolated from peripheral blood samples from 5 unrelated, apparently healthy Caucasian males of northern French origin. The study protocol followed the standards laid out in the Declaration of Helsinki with full Ethics Committee approval as detailed in reference 6. The 185k CGH array consisted of 185, probes (17,1 autosomal) with an average spacing of 16 kb and a bias toward genes. The 44k CGH array consisted of 44, probes (3,81 autosomal) designed to provide high-resolution coverage of CNV regions discovered on the 185k CGH array and other previously identified CNVs. A reference consisting of pooled DNA from all 5 subjects was used on the 185k CGH array, whereas a reference sample for a single Caucasian individual from the Coriell Cell Repository (NA51) was used for the 44k array. Data were acquired on 5 samples on the 44k CGH array and samples on the 185k CGH arrays 7. Illumina genotyping was performed using the Infinium Human 1M chip (1,7,8 SNPs, of which 1,9,591 were autosomal) and a prototype 317k chip (based on the Hap3k BeadArray; 317,53 SNPs, of which 38,33 were autosomal).
All 5 samples were measured on the Human 1M chip, and a subset of 36 samples were measured on the 317k BeadArray. Intensity data for HapMap samples using the Illumina 1M BeadArray were obtained from Illumina. Details of fosmid-ESP-defined CNVs and aCGH validation status were obtained from reference 8. SCIMM genotyping results for 18 experimentally defined CNV loci were obtained from reference 19.

Sample quality control and normalization. Samples with an LRR variance greater than .3 on the Illumina Human 1M or 317k BeadArrays were excluded from the analysis. As previously observed, array-based technologies in general exhibit localized wave-like changes in LRR that correlate with G+C content 31. For each chromosome separately, we first calculated the median LRR for each sample and subtracted this value from each point-wise LRR value so that each sample had a median LRR of zero. Next, for each sample, we calculated a Loess curve using only normal LRR variation of −.3 to .3 and a window of 5 kb. We then averaged the Loess curves across all samples. This averaged Loess value was then subtracted from the point-wise LRR values in each sample.

Parameters for cnvpartition, PennCNV and QuantiSNP. CNV Partition 1.. (ref. 8) was run using the default settings (confidence threshold, probe gap size threshold 1 Mb). The latter parameter prevents the calling of CNVs over probe gaps greater than this size (for example, over centromeres). PennCNV 16 (March 8 version) was run using the HMM, population frequency for B allele (PFB) and wavemodel correction files for the Illumina 1M array supplied as part of the software package. The data for these default files were derived from a set of 1 HapMap samples. The PFB file contains the positions and the population frequency of the B allele for the SNP markers (PennCNV uses only LRR information when analyzing monomorphic markers).
Although the creation of a custom PFB file can increase CNV calling accuracy in large, ethnically homogeneous samples, the sample used for our analysis was considered too small to enable the construction of a representative PFB file. The only change from the default settings was the inclusion of the wavemodel adjustment procedure. This adjustment, based on a large set of training data with varying degrees of waviness, was designed to compensate for the known fluctuations ('GC waves') in signal intensity caused by uneven G+C content in different regions of the genome. QuantiSNP version 1.1 (ref. 17) was run using default settings, a log Bayes factor threshold of 5, and the build 36 wavemodel (GC) correction files supplied as part of the software package. The wavemodel correction in QuantiSNP is derived using a simple linear model incorporating local genomic G+C content. The inclusion of this correction was the only deviation from the default settings.

Software. cnvhap software, documentation and example data files are available as Supplementary Software (also at imperial.ac.uk/medicine/people/l.coin/).

31. Marioni, J.C. et al. Breaking the waves: improved detection of copy number variation from microarray-based comparative genomic hybridization. Genome Biol. 8, R8 (7).
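The median centring and wave correction described under "Sample quality control and normalization" can be sketched as follows. This is an illustration only: the smoother is a plain tricube-weighted local linear fit standing in for the Loess routine the authors used (no robustness iterations), the function names are hypothetical, and the window and the bounds on "normal" LRR are passed in as parameters with example values chosen for the synthetic data, not taken from the paper.

```python
import numpy as np

def center_by_median(lrr):
    """Subtract each sample's median LRR. lrr has shape (samples, probes)."""
    return lrr - np.median(lrr, axis=1, keepdims=True)

def local_linear_smooth(pos, y, window, lo, hi):
    """Tricube-weighted local linear fit at each probe, using only lo <= y <= hi."""
    keep = (y >= lo) & (y <= hi)
    x_fit, y_fit = pos[keep], y[keep]
    half = window / 2.0
    fitted = np.zeros(len(pos))
    for i, x0 in enumerate(pos):
        u = (x_fit - x0) / half
        near = np.abs(u) < 1.0
        if near.sum() < 2:
            continue  # too few neighbours: leave the correction at zero
        sw = np.sqrt((1.0 - np.abs(u[near]) ** 3) ** 3)
        X = np.column_stack([np.ones(near.sum()), u[near]])
        coef, *_ = np.linalg.lstsq(X * sw[:, None], y_fit[near] * sw, rcond=None)
        fitted[i] = coef[0]  # intercept is the fitted value at x0
    return fitted

def remove_waves(pos, lrr, window, lo, hi):
    """Centre each sample, average the per-sample curves, subtract the average."""
    centered = center_by_median(lrr)
    curves = np.array([local_linear_smooth(pos, row, window, lo, hi) for row in centered])
    return centered - curves.mean(axis=0)

# Synthetic chromosome: a shared wave, per-sample offsets, and one deletion.
pos = np.arange(200) * 1000.0
wave = 0.1 * np.sin(pos / 20000.0)
lrr = np.vstack([wave + 0.05, wave - 0.02, wave])
lrr[1, 50:56] -= 1.0
corrected = remove_waves(pos, lrr, window=20000.0, lo=-0.3, hi=0.3)
```

Restricting the fit to values inside the normal range keeps the deletion in the second sample out of the fitted curve, so the wave is removed while the deletion signal is left intact in `corrected`.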

Nature Biotechnology: doi: /nbt.1904

Supplementary Information

Comparison between assembly-based SV calls and array CGH results

Genome-wide array assessment of copy number changes, such as array comparative genomic hybridization (aCGH), is