Comparing CNV detection methods for SNP arrays Laura Winchester, Christopher Yau and Jiannis Ragoussis


BRIEFINGS IN FUNCTIONAL GENOMICS AND PROTEOMICS. VOL 8. NO ^366 doi: /bfgp/elp017

Advance Access publication date 8 September 2009

Abstract

Data from whole genome association studies can now be used for dual purposes: genotyping and copy number detection. In this review we discuss some of the methods for using SNP data to detect copy number events. We examine a number of algorithms designed to detect copy number changes through the use of signal-intensity data and consider methods to evaluate the changes found. We describe the use of several statistical models in copy number detection in germline samples. We also present a comparison of data using these methods to assess accuracy of prediction and detection of changes in copy number.

Keywords: copy number; SNP array

INTRODUCTION

Structural variation in the human genome has been intensely studied in recent years [1-5]. Publications have shown rare copy number variations (CNV) with a relationship to certain diseases, and much has also been done to study copy number polymorphisms (CNP) in the population, their contribution to structural variation and their possible association with complex disease. Multiple methods for the detection of these structural variants exist [6, 7], but we focus on methods designed to interpret results from SNP arrays. The most prominent SNP array types are available from Affymetrix and Illumina. Both companies sell competing arrays and continue to offer increased coverage for detecting copy number events and SNP assays simultaneously. Assay techniques for the arrays differ [8, 9], but the signal-intensity output from both platforms presents similar analysis and interpretation problems. Successful application of these technologies has yielded a number of interesting individual CNVs with relationships to complex disease. 
For example, rare CNVs have been linked to schizophrenia [10] in a study where microdeletions and duplications were shown to be responsible for disrupting genes involved in neurodevelopment. The UGT2B17 gene on chromosome 4q13.2 was linked to osteoporosis in a case-control study of 727 CNV regions in a Chinese sample set [11]. One approach to copy number event detection has been to investigate common events. Studies such as that of McCarroll et al. [12] involved the characterization of deletion variations in the genome, while Redon et al. [2] have mapped the location of events found in multiple samples. Information about identified copy number events is recorded in databases such as the Database of Genomic Variants (DGV) [1]. Using the prior information about CNP location we can investigate copy number events as we would use SNP information in genotyping. Known CNPs can be genotyped in case-control populations with methods similar to those of the SNP-based association study. With the diversity of approaches and analysis options it is important to decide on the method most suited to the particular experimental needs. This review presents methods suggested for germline CNV analysis, including both CNP analysis and the detection of rare CNVs.

Corresponding author. Jiannis Ragoussis, Genomics, Wellcome Trust Centre For Human Genetics, Roosevelt Drive, Oxford, OX3 7BN, UK. Tel: (01865) ; Fax: (01865) ; ioannisr@well.ox.ac.uk

Laura Winchester is a DPhil student at Oxford University where her research involves detection of copy number events in genetic disorders, in particular, Specific Language Impairment. Christopher Yau is a Postdoctoral Research Fellow in the Department of Statistics at Oxford University. Jiannis Ragoussis is Head of Genomics at WTCHG. Interests: gene expression regulation in hypoxia and inflammation, genotyping and sequencing technology, identification of chromosomal aneuploidies and CNVs associated with disease. 
© The Author. Published by Oxford University Press. For permissions, please e-mail: journals.permissions@oxfordjournals.org

CNV DISCOVERY AND DETECTION USING SNP CHIPS

The use of SNP arrays in copy number event detection has a number of advantages. As well as serving two applications, SNP genotyping and copy number analysis, the arrays have other properties that promote their use over other techniques. SNP arrays use less sample per experiment than other techniques such as comparative genomic hybridization (CGH) arrays. Cost is also an important factor in the selection of the method. The SNP array is a cost-effective technique which allows the user to increase the number of samples tested on a limited budget. Although the advances in high-throughput sequencing technology have made copy number discovery much easier, the application of known CNP information means that we can target structural variation in a sample using cheaper techniques such as the SNP array without a large reduction in genome-wide coverage.

One important consideration, however, is the bias of the SNP chip coverage towards known CNVs [13]. Historically, when SNPs are selected for genotyping arrays, certain factors are considered which may decrease the number of copy number variants or polymorphisms typed [14]. Studies have found CNPs to be most common in regions containing high levels of segmental duplication [2], which are areas of low SNP coverage compared to other areas of the genome due to the difficulties of assay design and implementation. Common CNPs may cause assays to fail standard inheritance checks and Hardy-Weinberg tests. For example, in a situation where the father is (A, B) and the mother is (B, -), where "-" denotes a deleted allele, the child could be (A, B) or (A, -) or (B, -). However, in SNP genotyping results, the mother would appear to be called (B, B) and the child would be called either (A, B) or (A, A) or (B, B). If the child is really (A, -) then an (A, A) call would seem to violate Mendelian inheritance patterns and often cause the SNP to be rejected. 
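The inheritance example above can be worked through in a few lines of Python. This is only an illustrative sketch: the `NULL` symbol, the function names and the rule that a hemizygous genotype is reported as homozygous are assumptions made for the example, not part of any genotyping package. Note that a plain enumeration of one allele from each parent also yields a true (B, B) child.

```python
from itertools import product

NULL = "-"  # hypothetical symbol for a deleted (null) allele


def apparent_call(genotype):
    """Genotype a standard SNP caller would report if it cannot see null alleles."""
    present = [allele for allele in genotype if allele != NULL]
    if not present:
        return None  # homozygous deletion: no call possible
    if len(present) == 1:
        return (present[0], present[0])  # one copy looks like a homozygote
    return tuple(sorted(present))


def child_genotypes(father, mother):
    """All true child genotypes: one allele drawn from each parent."""
    return {tuple(sorted(pair)) for pair in product(father, mother)}


def mendelian_consistent(child_call, father_call, mother_call):
    """Check a called child genotype against the called parental genotypes."""
    allowed = {tuple(sorted(pair)) for pair in product(father_call, mother_call)}
    return child_call in allowed


father, mother = ("A", "B"), ("B", NULL)
father_call, mother_call = apparent_call(father), apparent_call(mother)

for true_child in sorted(child_genotypes(father, mother)):
    call = apparent_call(true_child)
    ok = mendelian_consistent(call, father_call, mother_call)
    print(true_child, "called as", call, "consistent" if ok else "apparent Mendelian error")
```

Only the child that truly carries one A allele and one deletion is flagged, because its (A, A) call cannot be produced from parents called (A, B) and (B, B).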
Assays were also often selected and tested on the basis of their use in SNP genotyping, meaning the final result may produce a noisy signal which, although it does not in itself affect the ability to genotype, is a major problem for accurate copy number detection. For instance, SNP data is typically standardized against a reference population in order to reduce the effect of factors including between-array variation and probe-specific hybridization effects. In doing so, normalization routines implicitly assume that all members (or the large majority) of the reference population have the same copy number but, at locations of common CNV, this assumption is clearly no longer appropriate. At these genomic locations, the process of SNP data normalization and the derivation of copy number estimates should be integrated for optimal performance and the correct derivation of normalization parameters.

Several of the new array assay selections have taken copy number detection into account; for example, Illumina includes "unSNPable" genome probes on some of its products. These markers were picked to cover events recorded in the Database of Genomic Variants (DGV) and some additional regions highlighted by experimental work. The Affymetrix SNP 6.0 chip was developed with the aim of assessing SNPs and CNVs simultaneously. McCarroll et al. [15] studied 270 HapMap samples to design probes for their hybrid array. With these changes in assay selection techniques the SNP array has become more appealing for copy number detection, and reliable interpretation of these results increases in importance.

ILLUMINA PROPRIETARY SOFTWARE FOR COPY NUMBER DETECTION

Illumina data can be initially viewed, checked and exported using the proprietary software BeadStudio. As well as providing quality checking and genotype-calling functions, the software calculates a number of other values for the signal-intensity data. The normalized R value is used as a representation of intensity on individual SNP plots. 
The log R ratio value is then calculated as the logarithm (base 2) of the ratio between the observed normalized intensity of a sample and the expected normalized intensity. The B allele frequency (BAF) is calculated from the difference between the expected position of the cluster group and the actual value. BAF and log R ratio are used by a number of the copy number event detection algorithms. Detection of copy number events within BeadStudio uses simple algorithms which can be run rapidly for an overview of larger events in a sample. The Loss of Heterozygosity (LOH) score is calculated using heterozygote frequency. The CNVPartition plug-in uses the log R ratio and BAF and compares the data to 14 different Gaussian distribution models to assess copy number level. Values can be plotted in the Chromosome Browser allowing the

user to compare predicted events with BAF or log R ratio at the location for event confirmation (Figure 1).

Figure 1: BeadStudio Chromosome Viewer. Image from BeadStudio Chromosome Browser showing copy number values for Sample NA Chromosome 22 shown with an event at ^ confirmed by all statistics. CNV value produced by the CNVPartition algorithm.

AFFYMETRIX PROPRIETARY SOFTWARE FOR COPY NUMBER DETECTION

Affymetrix SNP array data can be analysed with specially designed proprietary software. Within the Genotyping Console, samples are grouped into In Bounds (good samples) and Out of Bounds (problematic samples) after initial quality checks; other quality control metrics allow the user to investigate probe mismatching and individual SNP clustering. LOH scores can be calculated and the software contains a Chromosome Copy Number Analysis Tool (CNAT), which compares the experiment signal-intensity values against a reference set of data and evaluates copy number changes. Results are processed by the segment reporting tool to produce a basic output of larger detected CNV events. Tools for analysis of the different chip types vary, but the Human Genome SNP Array 6.0 utilizes two externally developed algorithms from the Birdsuite package [16], which dramatically improve detection. Birdseed is used for SNP genotyping and Canary genotypes the known CNPs on the chip. Each CNP has a number of targeted probes;

data from these are summarized and then compared to a reference set to produce the final call. Results can be viewed in the Integrated Genome Browser (IGB) (Figure 2).

Figure 2: Genotyping Console Genome Viewer. Image from Genotyping Console showing sample NA10861. Event on chromosome 22 confirmed by CNAT algorithm (third plot) and the segmentation report (red mark) showing the single event.

HIDDEN MARKOV MODELS (HMMs) IN COPY NUMBER EVENT DETECTION

Limitations of available copy number analyses within proprietary software led to the use of other methods to analyse data. The HMM assumes that observed intensities are related to an unobserved copy number state at each locus via an emission distribution (often assumed to be Gaussian). The copy number states are assumed to have a dependence structure such that neighbouring loci tend to have similar copy number states. Transitions between copy number states are determined by a transition matrix which describes the probability of moving from one state to another. The probabilistic structure of the HMM allows parameters in the model to be efficiently learnt from data, in both Bayesian and non-Bayesian frameworks, by using dynamic programming-based algorithms, such as the expectation maximization (EM) algorithm. When applied to event detection, each copy number possibility is assigned a state and the Viterbi algorithm is used to predict the state for each observation value.
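To make the decoding step concrete, the following is a minimal sketch of the Viterbi algorithm for an HMM with Gaussian emissions, written in plain Python. Everything numeric here is an assumption chosen for illustration: the three states, their mean log R ratios, the shared standard deviation and the probability of staying in the same state are not fitted values and are not taken from any of the packages discussed in this review.

```python
import math

# Hypothetical three-state model; all parameters are illustrative assumptions.
STATES = ["loss", "normal", "gain"]
MEANS = {"loss": -0.6, "normal": 0.0, "gain": 0.4}  # assumed mean log R ratio per state
SD = 0.2     # assumed shared standard deviation of the emission distribution
STAY = 0.98  # assumed probability that neighbouring loci share a state


def log_emission(x, state):
    """Log density of the Gaussian emission distribution for observation x."""
    z = (x - MEANS[state]) / SD
    return -0.5 * z * z - math.log(SD * math.sqrt(2 * math.pi))


def log_transition(prev, cur):
    """Log entry of the transition matrix: stay with STAY, else split the rest evenly."""
    if prev == cur:
        return math.log(STAY)
    return math.log((1 - STAY) / (len(STATES) - 1))


def viterbi(observations):
    """Return the most probable copy number state for each observation."""
    # Uniform initial distribution over states.
    score = {s: math.log(1 / len(STATES)) + log_emission(observations[0], s) for s in STATES}
    backpointers = []
    for x in observations[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_transition(p, cur))
            pointers[cur] = best_prev
            new_score[cur] = score[best_prev] + log_transition(best_prev, cur) + log_emission(x, cur)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Invented log R ratios: a run of four low values flanked by values near zero.
log_r_ratios = [0.02, -0.05, 0.01, -0.55, -0.62, -0.58, -0.60, 0.03, -0.01, 0.04]
path = viterbi(log_r_ratios)
print(path)
```

With these assumed parameters the four low values are decoded as a single "loss" segment. The high stay probability is what keeps one noisy probe from being called as an event on its own.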

With prior knowledge of modelling statistics there are a multitude of options for copy number detection. HMMSeg [17] is a command-line algorithm designed to apply an HMM to genomic data. Application of correct modelling procedures is not an obvious process to non-statisticians. For these reasons software has been developed which allows guided application of these types of advanced methods.

GUIDED APPLICATION OF THE HMM

A number of solutions for guided, accurate CNV detection for SNP array data have been published, but these are often platform specific. QuantiSNP [18] and PennCNV [19] are academically developed and freely available for prediction purposes. They use the HMM and assist users in applying it to their own data. The standard output from these tools is a list of detected events and brief summary statistics used for quality checking. Checking the quality of data is extremely important in accurate event prediction. Data with high signal noise often causes false positive predictions, and stringent checks at this stage are highly recommended to eliminate any problem data. Signal noise is a strong limitation, particularly with samples prepared by whole genome amplification. Output from QuantiSNP allows the user to plot averages and standard deviations for BAF by chromosome or sample to show outliers (Figure 3). PennCNV has a detailed set of guidelines for identifying and rejecting problem data included on the software's support website. Both can be run from the command line or integrated into Illumina's BeadStudio as a plug-in, and each has unique features to recommend it. The QuantiSNP algorithm output gives a log Bayes factor with its prediction, which allows users to rank events in order of likelihood and place their own cut-off on acceptable events. 
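Ranking events by a log Bayes factor and applying a user-chosen cut-off can be sketched as follows. The `CnvEvent` record, its field names, the example events and the cut-off of 10 are all hypothetical and are used only to illustrate the idea; they do not reproduce the output format of QuantiSNP or any recommended threshold.

```python
from dataclasses import dataclass


@dataclass
class CnvEvent:
    """Hypothetical record for one predicted event (field names are assumptions)."""
    chrom: str
    start: int
    end: int
    copy_number: int
    log_bayes_factor: float


def rank_events(events, min_log_bf):
    """Drop events below the cut-off and order the rest from strongest to weakest evidence."""
    kept = [e for e in events if e.log_bayes_factor >= min_log_bf]
    return sorted(kept, key=lambda e: e.log_bayes_factor, reverse=True)


# Invented example events; coordinates and scores are not taken from any real sample.
events = [
    CnvEvent("22", 17_000_000, 17_250_000, 1, 42.5),
    CnvEvent("4", 69_000_000, 69_120_000, 0, 118.0),
    CnvEvent("7", 5_000_000, 5_004_000, 3, 3.2),
]

ranked = rank_events(events, min_log_bf=10.0)
for event in ranked:
    print(event.chrom, event.start, event.end, event.copy_number, event.log_bayes_factor)
```

Raising the cut-off trades sensitivity for a lower false positive rate, so the value is best chosen with the noise level of the particular dataset in mind.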
Users can modify parameters to suit their own dataset; for example, changing the length parameter can allow more accurate detection of different sized events for a particular sample set.

Figure 3: Graphical representation of quality control data from PennCNV and QuantiSNP algorithms. It is important to use quality control (QC) data from the algorithms to eliminate problem samples which would not be found during standard genotyping analysis. Plot shows BAF score for each chromosome from analysis of sample NA10861; we can see chromosomes 4 and X are outliers. Values produced by PennCNV log file also shown. NB: Values shown relate to Illumina 1M Duo array.

Later versions of QuantiSNP have increased flexibility for data other than the

standard Illumina Infinium array and can be used to process Affymetrix data, and have proven accuracy on Illumina GoldenGate data [20] where SNP coverage is suitable. PennCNV has a number of downstream analysis options. Most important to highlight is the use of family trio data in analysis [21]. The use of trio information in event prediction allows easier detection of events novel to probands. It also integrates a pipeline for data analysis. The PennCNV package also includes a number of options to allow more analysis of event results, such as a script to compare events to known gene libraries or to change the format to suit a viewer such as BeadStudio's Chromosome Browser or the web-based UCSC genome browser.

Dchip SNP [22] was originally developed for Affymetrix data but has been modified to allow the viewing of Illumina data. It produces an LOH score which can be plotted against chromosome, but its functions are best suited to the values generated by the Affymetrix platform, in particular the quality control options. The software also has options to carry out paired analysis for cancer data; major copy proportion analysis [22] uses an HMM to analyse tumour samples.

APPLYING APPROACHES ORIGINALLY USED IN ARRAYCGH

A number of methods for copy number event detection were originally developed for arrayCGH analysis but have been modified for SNP array analysis. The Circular Binary Segmentation (CBS) [23] algorithm is one such method. It was designed to convert noisy intensity values into regions of equal copy number. The algorithm continues to divide a region into segments until it can no longer find a segment that differs from the neighbouring region. This change-point detection is designed to identify all the places which partition the chromosome into segments of the same copy number. An addition to the binary segmentation algorithm was made to allow a single change inside a large segment to be defined. 
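The recursive change-point search can be sketched in Python as follows. This is a deliberately simplified illustration and not the published CBS implementation: the noise standard deviation `sd` is assumed to be known, a fixed `threshold` on the test statistic stands in for a proper significance test, and every arc of probes is searched exhaustively. Treating the region as a circle means an arc (i, j) with two end points can be compared against everything outside it, so a short change inside a long segment can be found in one step.

```python
import math
from statistics import median


def segment(values, sd, threshold=5.0, offset=0):
    """Recursively split `values` into (start, end) index ranges of similar mean.

    Simplified sketch: `sd` (noise level) and `threshold` are assumed inputs.
    """
    n = len(values)
    if n < 2:
        return [(offset, offset + n)]
    # Prefix sums give the sum of any arc values[i:j] in constant time.
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
    total = prefix[-1]
    best_t, best_arc = 0.0, None
    for i in range(n):
        for j in range(i + 1, n + 1):
            k = j - i
            if k == n:
                continue  # the arc must leave something outside it
            inside = prefix[j] - prefix[i]
            diff = inside / k - (total - inside) / (n - k)
            t = abs(diff) / (sd * math.sqrt(1 / k + 1 / (n - k)))
            if t > best_t:
                best_t, best_arc = t, (i, j)
    if best_arc is None or best_t < threshold:
        return [(offset, offset + n)]  # no convincing change-point: keep as one segment
    i, j = best_arc
    pieces = []
    for lo, hi in ((0, i), (i, j), (j, n)):
        if hi > lo:
            pieces.extend(segment(values[lo:hi], sd, threshold, offset + lo))
    return pieces


# Invented log ratios: 25 probes with a five-probe deletion and small alternating noise.
log_ratios = [
    (-0.6 if 10 <= idx < 15 else 0.0) + (0.05 if idx % 2 == 0 else -0.05)
    for idx in range(25)
]
segments = segment(log_ratios, sd=0.1)
summary = [(start, end, median(log_ratios[start:end])) for start, end in segments]
print(summary)
```

Each final segment is summarized here by the median log ratio of its probes, which is one simple way to attach a copy number status to a segment. The exhaustive search is quadratic in the number of probes, so a practical implementation would need a faster search.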
Segment ends were joined to form a circle, allowing a further likelihood ratio test of whether the contents have different means. Final segments are then given a cluster value, which is the median log-ratio value of the probes within the region, and this value is used to define the copy number status.

An alternative to the CBS algorithm was developed by Pique-Regi et al. [24], which can now be applied to SNP arrays. The Genome Alteration Detection Algorithm (GADA) uses sparse Bayesian learning to predict CN changes. For our testing we used a package designed for use in the R environment with helpful processing options and detailed instructions for Affymetrix and Illumina data. The advantage of the speed of data processing was clear and we were able to analyse data within a few minutes. There are many other algorithms that could potentially be applied to SNP array data. Other reviews [6, 25] focused on the arrayCGH format present the reader with a variety of alternative options.

CNV DETECTION USING OTHER METHODS

Approaches which describe different methods to address CN event detection are common in the literature. SNP conditional mixture modelling (SCIMM) was developed by Cooper et al. [13] and is based on the observation that samples with deletions appear to have unique signal-intensity clusters. They applied a mixture-likelihood clustering method within the R statistical package to identify deletions. A secondary algorithm (SCIMM-Search) was developed to help discover probes which detect copy number changes within an array dataset. The algorithms require knowledge of modelling techniques to carry out the analysis correctly. The ITALICS [26] software focuses on the removal of unwanted effects found in data. Rigaill et al. developed ITALICS (Iterative and Alternative normalisation and Copy number calling for Affymetrix SNP arrays) to remove probes with abnormal intensities. 
Each iteration of the algorithm estimates the biological signal and then uses multiple linear regressions to estimate the nonlinear effects on the signal. The algorithm can be run in R and has the potential to analyse the Human Mapping 500K, Genome Wide array 5.0 and 6.0 formats, but was designed to process data from chip formats containing perfect match and mismatch probes.

COMMERCIALLY AVAILABLE SOFTWARE

The strength of the software packages available to purchase lies in a number of traits: the ability

to combine data from other platforms for comparison, graphical user interfaces, integrated pipelines for analysis and workflows, optimized computational speed and technical support. These factors are all extremely useful to labs with limited or no bioinformatics core support. Unfortunately, commercial companies are limited in their use of some of the methods developed in the academic environment. They are often prevented from building user interfaces and other features around academic software by restrictions imposed by free software licences such as the GNU General Public Licence, and so lack access to the latest methods. For our own purposes, we have chosen to look in detail at the Nexus Biodiscovery software. This uses the rank segmentation approach for detection. This approach is based on CBS but has been modified to increase speed of processing. It can be used for Affymetrix, arrayCGH or Illumina data and, although weaker for Illumina event detection, is an extremely useful tool for practically trained scientists.

COMBINING COPY NUMBER PREDICTION AND GENOTYPING

Copy number detection approaches described thus far have looked only at a single aspect of the data. The Birdsuite set developed by Korn et al. [16] combines SNP genotyping and copy number detection as well as independently genotyping common CNPs. It uses four different methods to analyse an Affymetrix dataset. The Canary algorithm, which genotypes common CNPs, and Birdseed, which carries out SNP genotyping, are included in the Genotyping Console. Birdseye is used to discover rare CNVs. This uses the HMM to identify and assess previously unknown CNVs in the data. Fawkes is the final stage of Birdsuite; this merges all the results from the other three stages. Combining data in this way gives a more complete picture of structural variation in a sample and allows the user to proceed with a single stage of association analysis with increased coverage of the data. 
Korn et al. compared their software to commercially available algorithms including Nexus and report higher detection rates for Birdsuite. Franke et al. [27] have also presented a combined approach which focuses on single SNP interpretation. TriTyper uses maximum likelihood estimation to detect deletions in Illumina SNP data in unrelated samples. It incorporates an extra null allele into its genotyping clusters and uses deviations from Hardy-Weinberg equilibrium (HWE) as an indicator of when to use triallelic genotyping. It can also use neighbouring SNP data to impute the success of the caller, which increases the accuracy of the output.

COMPARING THE DETECTION ALGORITHMS

There is a large variety of algorithms and software available for copy number event detection. Table 1 shows a summary of the software discussed in this review. A number of these software packages have been tested during the review and a brief synopsis of the results is presented here.

Assessing Software

To assess the accuracy of the algorithms we compared our data to the results of a well characterized sample. The sample NA12156 is the basis for our comparison (Table 2); it is from the HapMap collection and was sequenced for structural variation by Kidd et al. [28]. We have chosen to record the number of similar events between software and published data. We assume the samples with low numbers of similar events have higher false positive rates; however, we have not experimentally validated the results. While there is no faultless software, we have found that at least 20% of events were confirmed by Kidd et al. for all algorithms. Of the overlapping detected events, 27% were found by more than one algorithm (Supplementary Table 1). Although some algorithms have a lower percentage of overlapping events, it is important to consider the number of events found as well as the proportion: 49% of PennCNV detected events were confirmed but other algorithms have actually detected more in total. 
We carried out a secondary comparison using the CEPH sample NA15510, which has been characterized in a number of publications [2, 7, 28]. Table 3 shows the variation of results between studies. Further investigation of event replication across studies is represented in the Venn diagrams (Figure 4). PennCNV and Illumina show similar patterns of overlap, although we note an increased similarity between the Korbel et al. data and QuantiSNP output. We conclude that although we found a difference between detected events in our data and published results, we found similar variation between different publications, suggesting this is a problem in

all comparisons and not unique to the algorithms we tested. The overlap of algorithm events of the tested software is below 50% for all cases. We used default parameters for all our algorithms for ease of replication, which means some algorithms were not run at their optimal level for our data. We deliberately chose data which did not use an array-based

Table 1: Summary of SNP array detection algorithms

Software | Platform | Related publication | Details | Strengths | Weaknesses
Birdsuite (Birdseye and Canary) | Affymetrix | [15] | Combined tool set to genotype SNPs & CNPs | Unique approach, single association of SNPs and CN | Availability limited to Affymetrix data
CNAT | Affymetrix | Technical notes | Proprietary, run in Genome Console | Integral part of Genome Console | Accuracy of event prediction (missed events)
CNVPartition | Illumina | Technical notes | Proprietary, run in BeadStudio | Integral part of BeadStudio | Accuracy of event prediction (missed events)
Dchip SNP | Affymetrix or Illumina | [22] | Stand alone software | Free viewer for all data | Limited applications for Illumina data
GADA | Affymetrix or Illumina | [24] | Model uses Sparse Bayesian Learning | Speed of processing and application within R | Accuracy on Illumina weaker
HMMSeg | Multiple | [17] | HMM application tool to any genomic data | Flexibility to any dataset | Statistical knowledge required for correct use; not CN specific
ITALICS | Affymetrix | [26] | R package for normalization and CN detection | Focus on removal of nonrelevant effects | Designed to work on 100K + 500K data chip (MM probe format)
Nexus Biodiscovery | Multiple | [23] | Commercial segmentation detection tool | Allows combined data from different platforms; integrated viewer | Freeware alternatives are available
PennCNV | Illumina or Affymetrix | [19] | Perl script based | Multiple downstream tools for output | No way of ranking events due to likelihood
QuantiSNP | Illumina or Affymetrix | [18] | HMM; PC or LINUX command line | Bayes factor score for events, flexibility of run parameters | Limited support for further event analysis
SCIMM and SCIMM-Search | Illumina | [13] | Modelling algorithm applied in R | High detection rates compared to sequence data | Statistical knowledge required for correct use
TriTyper | Illumina | [27] | Identify and genotype SNPs with null allele | Able to interpret single SNPs | Only genotypes deletions

Table 2: Comparison of algorithms

Algorithm | Platform and array | Total of copy number events detected | Number of copy number events confirmed by Kidd et al. [28]
Birdsuite (Birdseye & Canary) | Affymetrix 6.0 |  | (20%)
CNAT (Genome Console 3.0.2) | Affymetrix 6.0 |  | (25%)
GADA (R 0.7-5) | Affymetrix 6.0 |  | (23%)
GADA (R 0.7-5) | Illumina 1M Duo |  | (31%)
PennCNV (2009Jan06) | Affymetrix 6.0 |  | (49%)
PennCNV (2009Jan06) | Illumina 1M Duo |  | (37%)
QuantiSNP v | Affymetrix 6.0 |  | (41%)
QuantiSNP v1.1 | Illumina 1M Duo |  | (31%)

Detected events from CEPH sample NA12156 are compared to events published in sequencing analysis by Kidd et al. [28]. Default parameters are used for each algorithm and any Y chromosome data was omitted. An overlap between software output and confirmed data by Kidd et al. is determined by comparing the start and end points of events. Details of events are shown in Supplementary Table 1. Percentage shows the number of confirmed CN events compared to the total detected by the algorithm.


Table 3: Overlap between events detected by SNP array algorithms using multiple publication data

Algorithm | Total events found in NA15510 by algorithm | Number of copy number events (Kidd) [28] | Number of copy number events (Korbel) [7] | Number of copy number events (Redon) [2]
Events in paper |  |  |  | 
CNVPartition |  | (4%) | 22 (5%) | 9 (4%)
GADA (R 0.7-5) |  | (23%) | 85 (18%) | 42 (19%)
PennCNV (2009Jan06) |  | (6%) | 28 (%) | 30 (14%)
QuantiSNP v |  | (6%) | 41 (9%) | 29 (13%)

Data from CEPH sample NA15510 on 1M array, Illumina platform is used to compare between algorithms and other publications. Default parameters are used for each algorithm and Y chromosome data was omitted. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with a base event in published data; multiple predictions by an algorithm for one published event were counted as one. Value in brackets shows percentage of published events found by algorithm. We note from the GADA analysis that although a high number of overlaps were found, this was due to the prediction of large events that included smaller events found by Kidd et al. and Korbel et al.

Figure 4: Venn diagrams comparing events for NA15510 between different studies. Visual representation of data from CEPH sample NA15510 on 1M array, Illumina platform used to compare between algorithms and other publications [2, 7, 28]. Default parameters are used for each algorithm and Y chromosome data was omitted from count. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with a base event in published data; multiple predictions by an algorithm for one published event were counted as one. 
Each total in the diagram comprises all the events found by the studies, meaning each event in an overlapping pair is counted. Surprisingly, only 43 overlapping events are found for NA15510 in all three studies (A). Results from the PennCNV (D) and QuantiSNP (C) comparisons show that QuantiSNP detects more events in all three software due to the detection of more events overlapping with the Korbel et al. study. Overlap between algorithms is shown in Venn diagram B, where events which are detected by the algorithm and found in at least one of the publications are compared. A large proportion of detected events between PennCNV and QuantiSNP (43) overlap.

technique for our NA12156 comparison to prevent a bias between Affymetrix and Illumina; but in doing so we accepted an increase in the number of differently detected events. Kidd et al. have shown similar data when comparing studies and found only a 12.5% overlap of events larger than 5 kb between their results and CN data generated by the Affymetrix 6.0 array.

Similarities of events detected between different Software

We chose to test a single sample (NA10861) on a range of the available algorithms to compare the similarity between event detection. In all cases we found the academically developed software to be more sensitive and to detect more events than proprietary algorithms (Table 4). The data also shows an increased number of events found from the sample using the SNP 6.0 array; we assume this reflects the increase in the number of CNP probes on the array relative to Illumina's 1M chip. Table 5 shows the amount of overlap in event prediction. We show two results for each comparison, counting the number of events overlapping for each algorithm separately. The difference in values represents the number of smaller events often found in one event by a different algorithm. In general, we found a higher number of overlapping events between algorithms run on Affymetrix 6.0 array data. We expected the low resemblance between data generated on different platforms as a result of the different probe sets; however, we are pleased to find some overlap. We have included a comparison to events published by Redon et al. [2]; although the study does not include a comprehensive list for this sample, it does show that the algorithms are detecting confirmed events.

During our comparison we often saw a difference in the size of the predicted event between algorithms (Figure 5). This was to be expected when using different platforms as probe locations vary, but was also seen when analysing an identical dataset. This kind of effect can even be produced when simply altering algorithm parameters and should be a consideration when looking at breakpoints of detected events. We found that the available software tends to target and support one particular platform for analysis, which, unfortunately, can limit options.

Table 4: Comparison of event numbers detected for a single sample (NA10861)

Algorithm | Platform and array | Number of CN events detected
Birdsuite (Canary & Birdseye) | Affymetrix 6.0 | 
CNAT (Genome Console 3.0.2) | Affymetrix 6.0 | 
CNVPartition | Illumina 1M Duo | 16
GADA (R 0.7-5) | Affymetrix 6.0 | 
GADA (R 0.7-5) | Illumina 1M Duo | 87
Nexus Biodiscovery | Affymetrix 6.0 | 
Nexus Biodiscovery | Illumina 1M Duo | 8
PennCNV (2009Jan06) | Affymetrix 6.0 | 
PennCNV (2009Jan06) | Illumina 1M Duo | 43
QuantiSNP v | Affymetrix 6.0 | 
QuantiSNP v1.1 | Illumina 1M Duo | 60

HapMap samples provided as demonstration data were analysed on both Affymetrix and Illumina platforms to give an easily reproducible comparison of event prediction. Events shown have been detected by the algorithm for this CEPH sample. Default parameters were used for all algorithms and any Y chromosome data was omitted. Data from the Affymetrix array has a higher number of detected events, probably linked to the number of specifically targeted probes. Proprietary software from both Illumina and Affymetrix has a low detection rate.

Recommending algorithms

Comparison of events in a dataset is a good way of assessing the accuracy of detection algorithms, but it is also important to take into account that differing predictions can be informative in showing false positives caused by noisy data and, conversely, that those in agreement are the strongest candidates for events. Multiple predictions from different software for the same event increase confidence in the data and give clearer indications of the event boundaries or of any discrepancy in this information. We would recommend using a second algorithm on a single dataset to produce the most informative results and to utilize the different advantages of each software. 
We also suggest using software designed specifically for the platform which generated the data, as several of the dual-use algorithms have been shown to be weaker in one format. We have selected a range of algorithms to discuss and test, and the list in Table 1 is not exhaustive, only an overview of some of the possibilities. It is also important to state that, even when using different algorithms, one cannot definitively confirm the presence of a CN event without separate biological replication, and it is unlikely that any list of detected events will contain all CNVs in a sample.

Table 5: Comparison of software event predictions

Algorithm | Published results (Redon) | Birdsuite | CNAT | CNVPartition Illumina | GADA | GADA Illumina | Nexus | Nexus Illumina | PennCNV | PennCNV Illumina | QuantiSNP | QuantiSNP Illumina
Published data (Redon) | - | 17 (4%) | 4 (40%) | 3 (19%) | 32 (5%) | 2 (2%) | 11 (10%) | 2 (25%) | 12 (18%) | 7 (16%) | 18 (9%) | 8 (13%)
Birdsuite | 17 (44%) | - | 9 (90%) | 13 (81%) | 135 (22%) | 21 (24%) | 62 (56%) | 6 (75%) | 43 (64%) | 20 (47%) | 97 (50%) | 20 (33%)
CNAT | 4 (10%) | 15 (4%) | - | 4 (25%) | 34 (6%) | 0 | 23 (21%) | 1 (13%) | 13 (19%) | 2 (5%) | 17 (9%) | 5 (8%)
CNVPartition Illumina | 3 (8%) | 16 (4%) | 4 (40%) | - | 37 (6%) | 7 (8%) | 20 (18%) | 7 (88%) | 9 (13%) | 11 (26%) | 16 (8%) | 16 (27%)
GADA | 17 (44%) | 106 (28%) | 9 (90%) | 13 (81%) | - | 32 (37%) | 91 (82%) | 7 (88%) | 58 (87%) | 23 (53%) | 153 (79%) | 27 (45%)
GADA Illumina | 2 (5%) | 96 (25%) | 0 | 13 (81%) | 208 (34%) | - | 25 (23%) | 2 (25%) | 26 (30%) | 17 (40%) | 67 (35%) | 23 (38%)
Nexus | 7 (18%) | 57 (15%) | 10 (100%) | 7 (44%) | 116 (19%) | 8 (9%) | - | 4 (50%) | 45 (67%) | 15 (35%) | 78 (40%) | 17 (28%)
Nexus Illumina | 2 (5%) | 6 (2%) | 1 (10%) | 7 (44%) | 22 (4%) | 2 (2%) | 4 (4%) | - | 6 (9%) | 7 (16%) | 10 (5%) | 9 (15%)
PennCNV | 11 (28%) | 51 (13%) | 10 (100%) | 9 (56%) | 105 (17%) | 10 (11%) | 65 (59%) | 6 (75%) | - | 19 (44%) | 71 (37%) | 21 (35%)
PennCNV Illumina | 6 (15%) | 25 (7%) | 2 (20%) | 11 (69%) | 44 (7%) | 9 (10%) | 23 (21%) | 6 (75%) | 18 (27%) | - | 26 (13%) | 28 (47%)
QuantiSNP | 14 (36%) | 97 (25%) | 10 (100%) | 10 (63%) | 199 (32%) | 18 (21%) | 86 (77%) | 7 (88%) | 65 (97%) | 21 (49%) | - | 24 (40%)
QuantiSNP Illumina | 6 (15%) | 14 (4%) | 5 (50%) | 15 (94%) | 55 (9%) | 10 (11%) | 30 (27%) | 8 (100%) | 23 (34%) | 32 (74%) | 31 (16%) | -

Algorithms were run on demonstration data for sample NA10861 on Affymetrix 6.0 chips and Illumina 1M Duo arrays. Default parameters were used and any Y chromosome data was omitted. For algorithm overall totals see Table 4. Events detected by both software packages are shown. Events are counted as common between algorithms if part of the predicted region overlaps with the other. Each comparison is carried out twice to show cases where smaller events within one algorithm make up one event in the other; the overlap of events therefore depends on analysis orientation. The total value represents the number of events for the software on the horizontal axis found in the other software dataset; the bracketed value shows the percentage of events detected by the same software. We have found that the most similarities are between data from similar platforms or algorithm methods; for example, PennCNV and QuantiSNP are both based on the HMM algorithm and as such event prediction should be very similar. We have also noted a higher number of similar events from algorithms using Affymetrix data.

Figure 5: Image from UCSC Browser showing the detection of a single event using different algorithms. The deletion described is a known CNP and is recorded several times in the DGV. Each track represents a different algorithm or platform. All results for detection algorithms shown used default parameters and test sample NA.

FURTHER ANALYSIS OF DETECTED CNVs

With a number of reliable options available for the detection of copy number events, it becomes increasingly important to be able to summarize and use this data. Initially, we are often interested in looking for novel events in certain genes or regions. Tracks of events can be viewed in databases such as the web-based genome browser, UCSC, and events can be compared to known copy number data in the DGV, as displayed in Figure 3. Importing several tracks of data into a browser simultaneously allows the user to compare different result sets.

Analysis of multiple events per sample is a more complicated procedure. Events and samples can be explored using pathway analysis tools to look for interesting groups or combinations of events in different genes, but methods of confirming the significance of an event are required. A number of publications present ways of applying association study methods to copy number data. Barnes et al. [29] developed an R package, CNVtools, which allows the user to carry out case-control association analysis on a single CNV of interest. The publication tests a series of five alternative modelling methods before recommending a likelihood ratio test which combines CNV calling and association testing into a single model. This method was designed to eliminate problems with signal noise, which is a known trait of SNP assay data. Ionita-Laza et al. [30] suggested a method to apply genome-wide family-based association studies to raw-intensity data. The Birdsuite package includes a pipeline to prepare the data for PLINK analysis. Other sources have suggested similar association study-based strategies, but an agreed approach is a subject of great discussion. Calls have been made by authors such as Scherer et al. [31] to decide on a single technique, but future decisions in the field will be extremely enlightening. As is much commented upon in the literature describing SNP association study techniques, sample size and power of tests are major factors in a successful study [32]. This must also be considered when analysing copy number data.

As we have discussed, there are a number of analysis options available for SNP array CNV detection: pipelines to allow guided analysis and stand-alone options for more flexible analysis. Some of these applications are platform targeted, but we have found that the best outcome is given by using multiple algorithms and comparing data.

SUPPLEMENTARY DATA

Supplementary data are available online at bib.oxfordjournals.org/.

Key Points

A wide variety of software is available for CNV detection from data produced by SNP arrays. This review seeks to discuss options and statistical methods currently available for analysis of signal intensity data. Changes in assay selection techniques for SNP arrays have made them more appealing for copy number detection as well as genotyping. Targeted probe design has made the SNP array a reliable and cheaper option for copy number analysis.
After testing a selection of the available software, comparisons were performed using HapMap samples and published copy number data. Of the events found in our data, 20-49% were replicated in previously published studies, but the results clearly showed variation in data caused by differences in algorithms. An important recommendation when choosing software for analysis is the use of a second algorithm on a dataset to produce more informative results. This enables the user to eliminate false positives not found by both programs and increases confidence in replicated events.

Acknowledgements

The authors thank Dr Helen Butler for her ideas and contributions to the manuscript.

FUNDING

JR and LW are funded by Wellcome Trust Grants. CY is funded by a UK Medical Research Council Special Training Fellowship in Biomedical Informatics (Ref No. G ).

References

1. Iafrate AJ, Feuk L, Rivera MN, et al. Detection of large-scale variation in the human genome. Nat Genet 2004;36(9):
2. Redon R, Ishikawa S, Fitch KR, et al. Global variation in copy number in the human genome. Nature 2006;444(7118):
3. Tuzun E, Sharp AJ, Bailey JA, et al. Fine-scale structural variation of the human genome. Nat Genet 2005;37(7):
4. Sebat J, Lakshmi B, Troge J, et al. Large-scale copy number polymorphism in the human genome. Science 2004;305(5683):
5. de Smith AJ, Tsalenko A, Sampas N, et al. Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases. Hum Mol Genet 2007;16(23):
6. Carter NP. Methods and strategies for analyzing copy number variation using DNA microarrays. Nat Genet 2007;39(7 Suppl):S
7. Korbel JO, Urban AE, Affourtit JP, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 2007;318(5849):
8. Kennedy GC, Matsuzaki H, Dong S, et al. Large-scale genotyping of complex DNA. Nat Biotechnol 2003;21(10):
9. Peiffer DA, Le JM, Steemers FJ, et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 2006;16(9):
10. International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 2008;455(7210):
11. Yang TL, Chen XD, Guo Y, et al. Genome-wide copy-number-variation study identified a susceptibility gene, UGT2B17, for osteoporosis. Am J Hum Genet 2008;83(6):
12. McCarroll SA, Hadnott TN, Perry GH, et al. Common deletion polymorphisms in the human genome. Nat Genet 2006;38(1):
13. Cooper GM, Zerr T, Kidd JM, et al. Systematic assessment of copy number variant detection via genome-wide SNP genotyping. Nat Genet 2008;40(10):
14. McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat Genet 2007;39(7 Suppl):S37-42.
15. McCarroll SA, Kuruvilla FG, Korn JM, et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet 2008;40(10):
16. Korn JM, Kuruvilla FG, McCarroll SA, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 2008;40(10):
17. Day N, Hemmaplardh A, Thurman RE, et al. Unsupervised segmentation of continuous genomic data. Bioinformatics 2007;23(11):
18. Colella S, Yau C, Taylor JM, et al. QuantiSNP: an objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res 2007;35(6):
19. Wang K, Li M, Hadley D, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 2007;17(11):
20. Maestrini E, Pagnamenta AT, Lamb JA, et al. High-density SNP association study and copy number variation analysis of the AUTS1 and AUTS5 loci implicate the IMMP2L-DOCK4 gene region in autism susceptibility. Mol Psychiatry
21. Wang K, Chen Z, Tadesse MG, et al. Modeling genetic inheritance of copy number variations. Nucleic Acids Res 2008;36(21):e
22. Li C, Beroukhim R, Weir BA, et al. Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics 2008;9:
23. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004;5(4):
24. Pique-Regi R, Monso-Varona J, Ortega A, et al. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics 2008;24(3):
25. Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005;21(19):
26. Rigaill G, Hupe P, Almeida A, et al. ITALICS: an algorithm for normalization and DNA copy number calling for SNP arrays. Bioinformatics 2008;24(6):
27. Franke L, de Kovel CG, Aulchenko YS, et al. Detection, imputation, and association analysis of small deletions and null alleles on oligonucleotide arrays. Am J Hum Genet 2008;82(6):
28. Kidd JM, Cooper GM, Donahue WF, et al. Mapping and sequencing of structural variation from eight human genomes. Nature 2008;453(7191):
29. Barnes C, Plagnol V, Fitzgerald T, et al. A robust statistical method for case-control association testing with copy number variation. Nat Genet 2008;40(10):
30. Ionita-Laza I, Perry GH, Raby BA, et al. On the analysis of copy-number variations in genome-wide association studies: a translation of the family-based association test. Genet Epidemiol 2008;32(3):
31. Scherer SW, Lee C, Birney E, et al. Challenges and standards in integrating surveys of structural variation. Nat Genet 2007;39(7 Suppl):S
32. Cardon LR, Bell JI. Association study designs for complex diseases. Nat Rev Genet 2001;2(2):91-9.
