Comparing CNV detection methods for SNP arrays Laura Winchester, Christopher Yau and Jiannis Ragoussis


BRIEFINGS IN FUNCTIONAL GENOMICS AND PROTEOMICS. VOL 8. NO ^366 doi: /bfgp/elp017
Advance Access publication date 8 September 2009

Abstract
Data from whole genome association studies can now be used for dual purposes: genotyping and copy number detection. In this review we discuss some of the methods for using SNP data to detect copy number events. We examine a number of algorithms designed to detect copy number changes through the use of signal-intensity data and consider methods to evaluate the changes found. We describe the use of several statistical models in copy number detection in germline samples. We also present a comparison of data using these methods to assess accuracy of prediction and detection of changes in copy number.

Keywords: copy number; SNP array

INTRODUCTION
Structural variation in the human genome has been intensely studied in recent years [1-5]. Publications have linked rare copy number variations (CNVs) to certain diseases, and much has also been done to study copy number polymorphisms (CNPs) in the population, their contribution to structural variation and their possible association with complex disease. Multiple methods for the detection of these structural variants exist [6, 7], but we focus on methods designed to interpret results from SNP arrays. The most prominent SNP array types are available from Affymetrix and Illumina. Both companies sell competing arrays and continue to offer increased coverage for detecting copy number events and SNP assays simultaneously. Assay techniques for the arrays differ [8, 9], but the signal-intensity output from both platforms presents similar analysis and interpretation problems. Successful application of these technologies has yielded a number of interesting individual CNVs with relationships to complex disease.
For example, rare CNVs have been linked to schizophrenia [10] in a study where microdeletions and duplications were shown to be responsible for disrupting genes involved in neurodevelopment. The UGT2B17 gene on chromosome 4q13.2 was linked to osteoporosis in a case-control study of 727 CNV regions in a Chinese sample set [11].

One approach to copy number event detection has been to investigate common events. Studies such as that of McCarroll et al. [12] involved the characterization of deletion variations in the genome, while Redon et al. [2] have mapped the location of events found in multiple samples. Information about identified copy number events is recorded in databases such as the Database of Genomic Variants (DGV) [1]. Using prior information about CNP location, we can investigate copy number events as we would use SNP information in genotyping. Known CNPs can be genotyped in case-control populations with methods similar to those of the SNP-based association study.

With the diversity of approaches and analysis options, it is important to decide on the method most suited to the particular experimental needs. This review presents methods suggested for germline CNV analysis, including both CNP analysis and the detection of rare CNVs.

Corresponding author. Jiannis Ragoussis, Genomics, Wellcome Trust Centre For Human Genetics, Roosevelt Drive, Oxford, OX3 7BN, UK. Tel: (01865) ; Fax: (01865) ; ioannisr@well.ox.ac.uk

Laura Winchester is a DPhil student at Oxford University where her research involves detection of copy number events in genetic disorders, in particular, Specific Language Impairment.
Christopher Yau is a Postdoctoral Research Fellow in the Department of Statistics at Oxford University.
Jiannis Ragoussis is Head of Genomics at WTCHG. Interests: gene expression regulation in hypoxia and inflammation, genotyping and sequencing technology, identification of chromosomal aneuploidies and CNVs associated with disease.
© The Author. Published by Oxford University Press. For permissions, please e-mail: journals.permissions@oxfordjournals.org

CNV DISCOVERY AND DETECTION USING SNP CHIPS
The use of SNP arrays in copy number event detection has a number of advantages. As well as the two applications for the data, SNP genotyping and copy number analysis, there are other aspects that favour their use over other techniques. SNP arrays use less sample per experiment than other techniques such as comparative genomic hybridization (CGH) arrays. Cost is also an important factor in the selection of the method. The SNP array is a cost-effective technique which allows the user to increase the number of samples tested on a limited budget. Although advances in high-throughput sequencing technology have made copy number discovery much easier, the application of known CNP information means that we can target structural variation in a sample using cheaper techniques such as the SNP array without a large reduction in genome-wide coverage.

One important consideration, however, is the bias of the SNP chip coverage towards known CNVs [13]. Historically, when SNPs are selected for genotyping arrays, certain factors are considered which may decrease the number of copy number variants or polymorphisms typed [14]. Studies have found CNPs to be most common in regions containing high levels of segmental duplication [2], which are areas of low SNP coverage compared to other areas of the genome due to the difficulties of assay design and implementation. Common CNPs may cause assays to fail standard inheritance checks and Hardy-Weinberg tests. For example, in a situation where the father is (A, B) and the mother is (B, -), where - denotes a deleted allele, the child could be (A, B) or (A, -) or (B, -). However, in SNP genotyping results, the mother would appear to be called (B, B) and the child would be called either (A, B) or (A, A) or (B, B). If the child is really (A, -) then an (A, A) call would seem to violate Mendelian inheritance patterns and would often cause the SNP to be rejected.
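The trio example above can be checked mechanically. The following is a minimal sketch, not code from any genotyping package: the function names and the use of "-" to mark a deleted allele are assumptions made for illustration. It shows how a hemizygous child that is truly (A, -) is reported as (A, A) and then fails a naive Mendelian check against the apparent parental calls.

```python
from itertools import product

NULL = "-"  # assumed marker for a deleted (null) allele


def child_genotypes(father, mother):
    """All genotypes a child can inherit: one allele from each parent."""
    return {tuple(sorted(pair)) for pair in product(father, mother)}


def apparent_call(genotype):
    """Call a standard SNP caller would report: the null allele is invisible,
    so a hemizygous genotype looks homozygous."""
    seen = [allele for allele in genotype if allele != NULL]
    if not seen:
        return None  # homozygous deletion: no call
    if len(seen) == 1:
        return (seen[0], seen[0])
    return tuple(sorted(seen))


def mendel_consistent(child_call, father_call, mother_call):
    """Naive inheritance check on apparent calls, ignoring possible deletions."""
    return child_call in child_genotypes(father_call, mother_call)


father = ("A", "B")
mother = ("B", NULL)

# True child genotypes include the three listed in the text; (B, B) is also possible.
true_children = child_genotypes(father, mother)

mother_call = apparent_call(mother)      # reported as (B, B)
child_call = apparent_call(("A", NULL))  # reported as (A, A)
violation = not mendel_consistent(child_call, father, mother_call)
```

Here `violation` is true: the (A, A) call cannot arise from an (A, B) father and an apparent (B, B) mother, which is the pattern that leads to the SNP being rejected.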
Assays were also often selected and tested on the basis of their use in SNP genotyping, meaning the final result may produce a noisy signal which, although it does not per se affect the ability to genotype, is a major problem for accurate copy number detection. For instance, SNP data is typically standardized against a reference population in order to reduce the effect of factors including between-array variation and probe-specific hybridization effects. In doing so, normalization routines implicitly assume that all members (or the large majority) of the reference population have the same copy number but, at locations of common CNV, this assumption is clearly no longer appropriate. At these genomic locations, the process of SNP data normalization and the derivation of copy number estimates should be integrated for optimal performance and the correct derivation of normalization parameters.

Several of the new array assay selections have taken copy number detection into account; for example, Illumina includes 'unSNPable' genome probes on some of its products. These markers were picked to cover events recorded in the Database of Genomic Variants (DGV) and some additional regions highlighted by experimental work. The Affymetrix SNP 6.0 chip was developed with the aim of assessing SNPs and CNVs simultaneously. McCarroll et al. [15] studied 270 HapMap samples to design probes for their hybrid array. With these changes in assay selection techniques the SNP array has become more appealing for copy number detection, and reliable interpretation of these results increases in importance.

ILLUMINA PROPRIETARY SOFTWARE FOR COPY NUMBER DETECTION
Illumina data can be initially viewed, checked and exported using the proprietary software BeadStudio. As well as the software's quality checking and genotype-calling functions, it calculates a number of other values for the signal-intensity data. The normalized R value is used as a representation of intensity on individual SNP plots.
The log R ratio is then calculated as the logarithm (base 2) of the ratio of the observed normalized intensity of a sample to the expected normalized intensity. The B allele frequency (BAF) is calculated from the difference between the expected position of the cluster group and the actual value. BAF and log R ratio are used by a number of the copy number event detection algorithms.

Detection of copy number events within BeadStudio uses simple algorithms which can be run rapidly for an overview of larger events in a sample. The Loss of Heterozygosity (LOH) score is calculated using heterozygote frequency. The CNVPartition plug-in uses the log R ratio and BAF and compares the data to 14 different Gaussian distribution models to assess copy number level. Values can be plotted in the Chromosome Browser allowing the

user to compare predicted events with BAF or log R ratio at the location for event confirmation (Figure 1).

Figure 1: BeadStudio Chromosome Viewer. Image from BeadStudio Chromosome Browser showing copy number values for Sample NA Chromosome 22 shown with an event at ^ confirmed by all statistics. CNV value produced by CNVPartition algorithm.

AFFYMETRIX PROPRIETARY SOFTWARE FOR COPY NUMBER DETECTION
Affymetrix SNP array data can be analysed with specially designed proprietary software. Within the Genotyping Console, samples are grouped into 'In Bounds' (good samples) and 'Out of Bounds' (problematic samples) after initial quality checks. Other quality control metrics allow the user to investigate probe mismatching and individual SNP clustering. LOH scores can be calculated, and the software contains a Chromosome Copy Number Analysis Tool (CNAT), which compares the experimental signal-intensity values against a reference set of data to evaluate copy number changes. Results are processed by the segment reporting tool to produce a basic output of larger detected CNV events. Tools for analysis of the different chip types vary, but the Human Genome SNP Array 6.0 utilizes two externally developed algorithms from the Birdsuite package [16], which dramatically improve detection. Birdseed is used for SNP genotyping and Canary genotypes the known CNPs on the chip. Each CNP has a number of targeted probes,

data from these are summarized and then compared to a reference set to produce the final call. Results can be viewed in the Integrated Genome Browser (IGB) (Figure 2).

Figure 2: Genotyping Console Genome Viewer. Image from Genotyping Console showing sample NA Event on chromosome 22 confirmed by CNAT algorithm (third plot) and the segmentation report (red mark) showing the single event.

HIDDEN MARKOV MODELS (HMMs) IN COPY NUMBER EVENT DETECTION
Limitations of available copy number analyses within proprietary software led to the use of other methods to analyse data. The HMM assumes that observed intensities are related to an unobserved copy number state at each locus via an emission distribution (often assumed to be Gaussian). The copy number states are assumed to have a dependence structure such that neighbouring loci tend to have similar copy number states. Transitions between copy number states are determined by a transition matrix which describes the probability of moving from one state to another. The probabilistic structure of the HMM allows parameters in the model to be efficiently learnt from data, in both Bayesian and non-Bayesian frameworks, by using dynamic programming-based algorithms, such as the expectation maximization (EM) algorithm. When applied to event detection, each copy number possibility is assigned a state and the Viterbi algorithm is used to predict the state for each observation value.
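The decoding step described above can be sketched in a few lines. This is an illustrative toy, not the implementation used by any of the packages reviewed here: the three states, their mean log R ratios, the shared standard deviation and the probability of staying in the same state are all assumed values, and parameter learning by EM is omitted.

```python
import math


def gaussian_logpdf(x, mean, sd):
    """Log density of a Gaussian emission distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi(observations, means, sd, stay_prob):
    """Most likely state path for a Gaussian-emission HMM.

    States are indices into `means` (at least two states). The initial
    distribution is uniform, and leaving a state is equally likely to lead
    to any other state.
    """
    n_states = len(means)
    log_stay = math.log(stay_prob)
    log_move = math.log((1.0 - stay_prob) / (n_states - 1))

    def log_trans(i, j):
        return log_stay if i == j else log_move

    score = [math.log(1.0 / n_states) + gaussian_logpdf(observations[0], m, sd)
             for m in means]
    back_pointers = []
    for x in observations[1:]:
        new_score, pointers = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: score[i] + log_trans(i, j))
            new_score.append(score[best_i] + log_trans(best_i, j)
                             + gaussian_logpdf(x, means[j], sd))
            pointers.append(best_i)
        score = new_score
        back_pointers.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Assumed states: one copy, two copies, three copies (illustrative means only).
copy_numbers = [1, 2, 3]
state_means = [-0.6, 0.0, 0.4]
log_r_ratio = [0.02, -0.05, 0.01, -0.62, -0.55, -0.70, -0.58, 0.03, -0.02]

path = viterbi(log_r_ratio, state_means, sd=0.15, stay_prob=0.99)
calls = [copy_numbers[s] for s in path]
```

With these made-up values the four consecutive low-intensity probes are assigned to the one-copy state, while the high probability of staying in a state discourages calls that flip between neighbouring loci.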

With prior knowledge of modelling statistics there are a multitude of options for copy number detection. HMMSeg [17] is a command-line tool designed to apply HMMs to genomic data. Application of correct modelling procedures is not an obvious process to non-statisticians. For these reasons software has been developed which allows guided application of these types of advanced methods.

GUIDED APPLICATION OF THE HMM
A number of solutions for guided, accurate CNV detection for SNP array data have been published, but these are often platform specific. QuantiSNP [18] and PennCNV [19] are academically developed and freely available for prediction purposes. They use the HMM and assist the user in applying it to their own data. The standard output from these tools is a list of detected events and brief summary statistics used for quality checking. Checking the quality of data is extremely important in accurate event prediction. Data with high signal noise often causes false positive predictions, and stringency with checks at this stage is highly recommended to eliminate any problem data. Signal noise is a strong limitation, particularly with samples prepared by whole genome amplification. Output from QuantiSNP allows the user to plot averages and standard deviations for BAF by chromosome or sample to show outliers (Figure 3). PennCNV has a detailed set of guidelines for identifying and rejecting problem data included on the software's support website.

Both can be run from the command line or as plug-ins integrated into Illumina's BeadStudio, and each has unique features to recommend it. The QuantiSNP algorithm output gives a log Bayes factor with its prediction, which allows the user to rank events in order of likelihood and place their own cut-off on acceptable events.
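The per-chromosome quality check described above can be approximated as follows. This is a hedged sketch rather than the calculation performed by QuantiSNP or PennCNV: the restriction to a heterozygous band of BAF values (0.25 to 0.75), the function names and the outlier rule (distance from the median in units of the median absolute deviation) are all assumptions chosen for illustration.

```python
from statistics import median, pstdev


def baf_sd_by_chromosome(records, low=0.25, high=0.75):
    """Standard deviation of BAF per chromosome for one sample.

    `records` is an iterable of (chromosome, baf) pairs. Only values inside
    the assumed heterozygous band (low, high) are used, so that homozygous
    clusters near 0 and 1 do not dominate the spread.
    """
    by_chrom = {}
    for chrom, baf in records:
        if low < baf < high:
            by_chrom.setdefault(chrom, []).append(baf)
    return {chrom: pstdev(values) for chrom, values in by_chrom.items()}


def flag_outliers(sd_by_chrom, n_mads=5.0):
    """Chromosomes whose BAF SD lies more than `n_mads` median absolute
    deviations from the median SD across chromosomes."""
    sds = list(sd_by_chrom.values())
    centre = median(sds)
    mad = median([abs(s - centre) for s in sds])
    if mad == 0:
        return sorted(c for c, s in sd_by_chrom.items() if s != centre)
    return sorted(c for c, s in sd_by_chrom.items()
                  if abs(s - centre) / mad > n_mads)


# Made-up BAF values: chromosome "4" is much noisier than the others.
records = [
    ("1", 0.49), ("1", 0.51), ("1", 0.50), ("1", 0.50), ("1", 0.0), ("1", 1.0),
    ("2", 0.48), ("2", 0.52), ("2", 0.50), ("2", 0.50),
    ("3", 0.49), ("3", 0.50), ("3", 0.51), ("3", 0.50),
    ("4", 0.30), ("4", 0.70), ("4", 0.35), ("4", 0.65),
]

sd_by_chrom = baf_sd_by_chromosome(records)
flagged = flag_outliers(sd_by_chrom)
```

A chromosome flagged in this way would be inspected, or the sample rejected, before event calls from it are trusted.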
Users can modify parameters to suit their own dataset; for example, changing the length parameter can allow more accurate detection of different sized events for a particular sample set.

Figure 3: Graphical representation of quality control data from PennCNV and QuantiSNP algorithms. It is important to use quality control (QC) data from the algorithms to eliminate problem samples which would not be found during standard genotyping analysis. Plot shows BAF score for each chromosome from analysis of sample NA10861; chromosomes 4 and X are outliers. Values produced by PennCNV log file also shown. NB Values shown relate to Illumina 1M Duo array.

Later versions of QuantiSNP have increased flexibility for data other than the

standard Illumina Infinium array and can be used to process Affymetrix data, and have proven accuracy on Illumina GoldenGate data [20] where SNP coverage is suitable.

PennCNV has a number of downstream analysis options. Most important to highlight is the use of family trio data in analysis [21]. The use of trio information in event prediction allows easier detection of events novel to probands. It also integrates a pipeline for data analysis. The PennCNV package also includes a number of options to allow further analysis of event results, such as a script to compare events to known gene libraries or to convert the output to a format suitable for viewers such as BeadStudio's Chromosome Browser or the web-based UCSC genome browser.

Dchip SNP [22] was originally developed for Affymetrix data but has been modified to allow the viewing of Illumina data. It produces an LOH score which can be plotted against chromosome, but its functions are best suited to the values generated by the Affymetrix platform, in particular the quality control options. The software also has options to carry out paired analysis for cancer data; major copy proportion analysis [22] uses HMM to analyse tumour samples.

APPLYING APPROACHES ORIGINALLY USED IN ARRAYCGH
A number of methods for copy number event detection were originally developed for arrayCGH analysis but have been modified for SNP array analysis. The Circular Binary Segmentation (CBS) [23] algorithm is one such method. It was designed to convert noisy intensity values into regions of equal copy number. The algorithm continues to divide a region into segments until it finds a segment that differs from the neighbouring region. This change-point detection is designed to identify all the places which partition the chromosome into segments of the same copy number. An addition to the binary segmentation algorithm was made to allow a single change inside a large segment to be defined.
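The change-point idea can be illustrated with plain recursive binary segmentation. Note that this sketch is deliberately simpler than CBS as published [23]: it does not join the segment ends into a circle, and it replaces the reference distribution used for significance testing with a fixed, assumed threshold on a scaled difference in means. Function names, the threshold and the example values are all illustrative.

```python
import math


def best_split(values, min_size):
    """Return (statistic, index) for the split of `values` into values[:i]
    and values[i:] with the largest scaled difference in means."""
    n = len(values)
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
    best_stat, best_i = 0.0, None
    for i in range(min_size, n - min_size + 1):
        left_mean = prefix[i] / i
        right_mean = (prefix[n] - prefix[i]) / (n - i)
        # Scale by segment sizes so that short segments need a larger shift.
        stat = abs(left_mean - right_mean) / math.sqrt(1.0 / i + 1.0 / (n - i))
        if stat > best_stat:
            best_stat, best_i = stat, i
    return best_stat, best_i


def segment(values, threshold, min_size=2, start=0):
    """Recursively split until no split exceeds `threshold`.

    Returns half-open (start, end) index pairs covering the input.
    """
    stat, i = best_split(values, min_size)
    if i is None or stat < threshold:
        return [(start, start + len(values))]
    return (segment(values[:i], threshold, min_size, start)
            + segment(values[i:], threshold, min_size, start + i))


# Made-up log ratios: a four-probe loss embedded in a normal region.
log_ratios = [0.0, 0.05, -0.05, 0.02,
              -0.6, -0.55, -0.65, -0.6,
              0.01, -0.02, 0.03, 0.0]

segments = segment(log_ratios, threshold=0.3)
```

In this example the embedded loss only just clears the threshold at the first split, because each candidate split averages the loss together with normal probes on one side. This is the situation in which a change inside a larger segment is hard for simple binary segmentation to detect.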
Segment ends were joined to form a circle, allowing a further likelihood ratio test of whether the content has different means. Final segments are then given a cluster value, which is the median log-ratio value of the probes within the region, and this value is used to define the copy number status.

An alternative to the CBS algorithm was developed by Pique-Regi et al. [24], which can now be applied to SNP arrays. The Genome Alteration Detection Algorithm (GADA) uses sparse Bayesian learning to predict CN changes. For our testing we used a package designed for use in the R environment with helpful processing options and detailed instructions for Affymetrix and Illumina data. The advantage of the speed of data processing was clear and we were able to analyse data within a few minutes.

There are many other algorithms that could potentially be applied to SNP array data. Other reviews [6, 25] focused on the arrayCGH format present the reader with a variety of alternative options.

CNV DETECTION USING OTHER METHODS
Approaches which describe different methods to address CN event detection are common in the literature. SNP conditional mixture modelling (SCIMM), developed by Cooper et al. [13], is based on the observation that samples with deletions appear to have unique signal-intensity clusters. They applied a mixture-likelihood clustering method within the R statistical package to identify deletions. A secondary algorithm (SCIMM-Search) was developed to help discover probes which detect copy number changes within an array dataset. The algorithms require knowledge of modelling techniques to correctly carry out the analysis.

The ITALICS [26] software focuses analysis on the removal of unwanted effects found in data. Rigaill et al. developed ITALICS (ITerative and Alternative normaLIsation and Copy number calling for affymetrix Snp arrays) to remove probes with abnormal intensities.
Each iteration of the algorithm estimates the biological signal and then uses multiple linear regressions to estimate the nonlinear effects on the signal. The algorithm can be run in R and has the potential to analyse the Human Mapping 500K and Genome-Wide 5.0 and 6.0 array formats, but was designed to process data from chip formats containing perfect match and mismatch probes.

COMMERCIALLY AVAILABLE SOFTWARE
The strength of the software packages available to purchase lies in a number of traits: the ability

to combine data from other platforms for comparison, graphical user interfaces, integrated pipelines for analysis and workflows, optimized computational speed and technical support. These factors are all extremely useful to those labs with no or limited bioinformatics core support. Unfortunately, commercial companies are limited in their use of some of the methods developed in the academic environment. They are often prevented from building user interfaces and other features around academic software by restrictions imposed by free software licences such as the GNU General Public Licence, and so are denied access to the latest methods.

For our own purposes, we have chosen to look in detail at the Nexus Biodiscovery software. This uses the rank segmentation approach for detection. This approach is based on CBS but has been modified to increase the speed of processing. It can be used for Affymetrix, arrayCGH or Illumina data and, although weaker for Illumina event detection, is an extremely useful tool for practically trained scientists.

COMBINING COPY NUMBER PREDICTION AND GENOTYPING
The copy number detection approaches described thus far have looked only at a single aspect of the data. The Birdsuite set developed by Korn et al. [16] combines SNP genotyping and copy number detection as well as independently genotyping common CNPs. It uses four different methods to analyse an Affymetrix dataset. The Canary algorithm, which genotypes common CNPs, and Birdseed, which carries out SNP genotyping, are included in the Genotyping Console. Birdseye is used to discover rare CNVs. This uses the HMM to identify and assess previously unknown CNVs in the data. Fawkes is the final stage of Birdsuite; this merges all the results from the other three stages. Combining data in this way gives a more complete picture of structural variation in a sample and allows the user to proceed with a single stage of association analysis with increased coverage on the data.
Korn et al. compared their software to commercially available algorithms including Nexus and report higher detection rates for Birdsuite.

Franke et al. [27] have also presented a combined approach which focuses on single SNP interpretation. TriTyper uses maximum likelihood estimation to detect deletions in Illumina SNP data in unrelated samples. It incorporates an extra null allele into its genotyping clusters and uses deviations from Hardy-Weinberg equilibrium (HWE) as an indicator of when to use triallelic genotyping. It can also use neighbouring SNP data to impute the success of the caller, which increases the accuracy of the output.

COMPARING THE DETECTION ALGORITHMS
There is a large variety of algorithms and software available for copy number event detection. Table 1 shows a summary of the software discussed in this review. A number of these software packages have been tested during the review and a brief synopsis of the results is presented here.

Assessing Software
To assess the accuracy of the algorithms we compared our data to the results of a well-characterized sample. The sample NA12156 is the basis for our comparison (Table 2); it is from the HapMap collection and was sequenced for structural variation by Kidd et al. [28]. We have chosen to record the number of similar events between software and published data. We assume the samples with low numbers of similar events have higher false positive rates; however, we have not experimentally validated the results. While there is no faultless software, we have found that at least 20% of events were confirmed by Kidd et al. in all algorithms. Of the overlapping detected events, 27% were found by more than one algorithm (Supplementary Table 1). Although some algorithms have a lower percentage of overlapping events, it is important to consider the number of events found as well as the proportion: 49% of PennCNV detected events were confirmed, but other algorithms have actually detected more in total.
We carried out a secondary comparison using the CEPH sample NA15510, which has been characterized in a number of publications [2, 7, 28]. Table 3 shows the variation of results between studies. Further investigation of event replication across studies is represented in the Venn diagrams (Figure 4). PennCNV and Illumina show similar patterns of overlap, although we note an increased similarity between the Korbel et al. data and QuantiSNP output. We conclude that although we found a difference between detected events in our data and published results, we found similar variation between different publications, suggesting this is a problem in

Table 1: Summary of SNP array detection algorithms
Software | Platform | Related publication | Details | Strengths | Weaknesses
Birdsuite (Birdseye and Canary) | Affymetrix | [15] | Combined tool set to genotype SNPs & CNPs | Unique approach, single association of SNPs and CN | Availability limited to Affymetrix data
CNAT | Affymetrix | Technical notes | Proprietary; run in Genome Console | Integral part of Genome Console | Accuracy of event prediction (missed events)
CNVPartition | Illumina | Technical notes | Proprietary; run in BeadStudio | Integral part of BeadStudio | Accuracy of event prediction (missed events)
Dchip SNP | Affymetrix or Illumina | [22] | Stand alone software | Free viewer for all data | Limited applications for Illumina data
GADA | Affymetrix or Illumina | [24] | Model uses Sparse Bayesian Learning | Speed of processing and application within R | Accuracy on Illumina weaker
HMMSeg | Multiple | [17] | HMM application tool to any genomic data | Flexibility to any dataset | Statistical knowledge required for correct use; not CN specific
ITALICS | Affymetrix | [26] | R package for normalization and CN detection in chip | Focus on removal of nonrelevant effects | Designed to work on 100K + 500K data (MM probe format)
Nexus Biodiscovery | Multiple | [23] | Commercial segmentation detection tool | Allows combined data from different platforms; integrated viewer | Freeware alternatives are available
PennCNV | Illumina or Affymetrix | [19] | Perl script based | Multiple downstream tools for output | No way of ranking events due to likelihood
QuantiSNP | Illumina or Affymetrix | [18] | HMM; PC or LINUX command line | Bayes factor score for events, flexibility of run parameters | Limited support for further event analysis
SCIMM and SCIMM-Search | Illumina | [13] | Modelling algorithm applied in R | High detection rates compared to sequence data | Statistical knowledge required for correct use
TriTyper | Illumina | [27] | Identify and genotype SNPs with null allele | Able to interpret single SNPs | Only genotypes deletions

Table 2: Comparison of algorithms
Columns: Algorithm; Platform and array; Total of copy number events detected; Number of copy number events confirmed by Kidd et al. [28]
Birdsuite (Birdseye & Canary) (20%)
CNAT (Genome Console 3.0.2) (25%)
GADA (R 0.7-5) (23%)
GADA (R 0.7-5) Illumina 1M Duo (31%)
PennCNV (2009Jan06) (49%)
PennCNV (2009Jan06) Illumina 1M Duo (37%)
QuantiSNP v (41%)
QuantiSNP v1.1 Illumina 1M Duo (31%)
Detected events from CEPH sample NA12156 are compared to events published in sequencing analysis by Kidd et al. [28]. Default parameters are used for each algorithm and any Y chromosome data was omitted. An overlap between software output and confirmed data by Kidd et al. is determined by comparing the start and end points of events. Details of events are shown in Supplementary Table 1. Percentage shows the number of confirmed CN events compared to the total detected by the algorithm.

all comparisons and not unique to the algorithms we tested. The overlap of algorithm events of the tested software is below 50% in all cases. We used default parameters for all our algorithms for ease of replication, which means some algorithms were not run at their optimal level for our data. We deliberately chose data which did not use an array-based

Table 3: Overlap between events detected by SNP array algorithms using multiple publication data
Columns: Algorithm; Total events found in NA15510 by algorithm; Number of copy number events (Kidd) [28]; Number of copy number events (Korbel) [7]; Number of copy number events (Redon) [2]
Events in paper
CNVPartition (4%) 22 (5%) 9 (4%)
GADA (R 0.7-5) (23%) 85 (18%) 42 (19%)
PennCNV (2009Jan06) (6%) 28 (%) 30 (14%)
QuantiSNP v (6%) 41 (9%) 29 (13%)
Data from CEPH sample NA15510 on the 1M array, Illumina platform, is used to compare between algorithms and other publications. Default parameters are used for each algorithm and Y chromosome data was omitted. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with a base event in published data; multiple predictions by an algorithm for one published event were counted as one. Value in brackets shows percentage of published events found by algorithm. We note from the GADA analysis that, although a high number of overlaps were found, this was due to the prediction of large events that included smaller events found by Kidd et al. and Korbel et al.

Figure 4: Venn diagrams comparing events for NA15510 between different studies. Visual representation of data from CEPH sample NA15510 on the 1M array, Illumina platform, used to compare between algorithms and other publications [2, 7, 28]. Default parameters are used for each algorithm and Y chromosome data was omitted from the count. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with a base event in published data; multiple predictions by an algorithm for one published event were counted as one. Each total in the diagram comprises all the events found by the studies, meaning each event in an overlapping pair is counted. Surprisingly, only 43 overlapping events are found for NA15510 in all three studies (A). Results from the PennCNV (D) and QuantiSNP (C) comparisons show that QuantiSNP detects more events in all three, due to the detection of more events overlapping with the Korbel et al. study. Overlap between algorithms is shown in Venn diagram B, where events which are detected by the algorithm and found in at least one of the publications are compared. A large proportion of detected events between PennCNV and QuantiSNP (43) overlap.

technique for our NA12156 comparison to prevent a bias between Affymetrix and Illumina; but in doing so we accepted an increase in the number of differently detected events. Kidd et al. have shown similar data when comparing studies and found only a 12.5% overlap of events larger than 5 kb between their results and CN data generated by the Affymetrix 6.0 array.

Similarities of events detected between different software
We chose to test a single sample (NA10861) on a range of the available algorithms to compare the similarity between event detection. In all cases we found the academically developed software to be more sensitive and to detect more events than proprietary algorithms (Table 4). The data also show an increased number of events found from the sample using the Affymetrix SNP 6.0 array; we assume this reflects the increase in the number of CNP probes on the array relative to Illumina's 1M chip.

Table 5 shows the amount of overlap in event prediction. We show two results for each comparison, counting the number of events overlapping for each algorithm separately. The difference in values represents the number of smaller events often found in one event by a different algorithm. In general, we found a higher number of overlapping events between algorithms run on Affymetrix 6.0 array data. We expected the low resemblance between data generated on different platforms as a result of the different probe sets; however, we are pleased to find some overlap. We have included a comparison to events published by Redon et al. [2]; although the study does not include a comprehensive list for this sample, it does show that the algorithms are detecting confirmed events.

During our comparison we often saw a difference in the size of the predicted event between algorithms (Figure 5). This was to be expected when using different platforms as probe locations vary, but was also seen when analysing an identical dataset. This kind of effect can even be produced by simply altering algorithm parameters and should be a consideration when looking at breakpoints of detected events. We found that the available software tends to target and support one particular platform for analysis, which, unfortunately, can limit options.

Table 4: Comparison of event numbers detected for a single sample (NA10861)
Columns: Algorithm; Platform and array; Number of CN events detected
Birdsuite (Canary & Birdseye)
CNAT (Genome Console 3.0.2)
CNVPartition Illumina 1M Duo 16
GADA (R 0.7-5)
GADA (R 0.7-5) Illumina 1M Duo 87
Nexus Biodiscovery
Nexus Biodiscovery Illumina 1M Duo 8
PennCNV (2009Jan06)
PennCNV (2009Jan06) Illumina 1M Duo 43
QuantiSNP v
QuantiSNP v1.1 Illumina 1M Duo 60
HapMap samples provided as demonstration data were analysed on both Affymetrix and Illumina platforms to give an easily reproducible comparison of event prediction. Events shown have been detected by the algorithm for CEPH sample NA. Default parameters were used for all algorithms and any Y chromosome data was omitted. Data from the Affymetrix array has a higher number of detected events, probably linked to the number of specifically targeted probes. Proprietary software from both Illumina and Affymetrix has a low detection rate.

Recommending algorithms
Comparison of events in a dataset is a good way of assessing the accuracy of detection algorithms, but it is also important to take into account that differing predictions can be informative in showing false positives caused by noisy data and, conversely, that predictions in agreement are the strongest candidates for events. Multiple predictions from different software for the same event increase confidence in the data and give clearer indications of the event boundaries or of any discrepancy in this information. We would recommend using a second algorithm on a single dataset to produce the most informative results and to utilize the different advantages of each software.
We also suggest using software designed specifically for the platform that generated the data, as several of the dual-use algorithms have been shown to be weaker in one format. We have selected a range of algorithms to discuss and test; the list in Table 1 is not exhaustive, only an overview of some of the possibilities. It is also important to state that, even when using different algorithms, one cannot definitively confirm the presence of a CN event without separate biological replication, and it is unlikely that any list of detected events will contain all CNVs in a sample.

Table 5: Comparison of software event predictions

                        Published  Birdsuite  CNAT       CNV        GADA       GADA       Nexus      Nexus      PennCNV    PennCNV    QuantiSNP  QuantiSNP
                        results                          Partition             Illumina              Illumina              Illumina              Illumina
                        (Redon)                          Illumina
Published data (Redon)  -          17 (4%)    4 (40%)    3 (19%)    32 (5%)    2 (2%)     11 (10%)   2 (25%)    12 (18%)   7 (16%)    18 (9%)    8 (13%)
Birdsuite               17 (44%)   -          9 (90%)    13 (81%)   135 (22%)  21 (24%)   62 (56%)   6 (75%)    43 (64%)   20 (47%)   97 (50%)   20 (33%)
CNAT                    4 (10%)    15 (4%)    -          4 (25%)    34 (6%)    0          23 (21%)   1 (13%)    13 (19%)   2 (5%)     17 (9%)    5 (8%)
CNVPartition Illumina   3 (8%)     16 (4%)    4 (40%)    -          37 (6%)    7 (8%)     20 (18%)   7 (88%)    9 (13%)    11 (26%)   16 (8%)    16 (27%)
GADA                    17 (44%)   106 (28%)  9 (90%)    13 (81%)   -          32 (37%)   91 (82%)   7 (88%)    58 (87%)   23 (53%)   153 (79%)  27 (45%)
GADA Illumina           2 (5%)     96 (25%)   0          13 (81%)   208 (34%)  -          25 (23%)   2 (25%)    26 (30%)   17 (40%)   67 (35%)   23 (38%)
Nexus                   7 (18%)    57 (15%)   10 (100%)  7 (44%)    116 (19%)  8 (9%)     -          4 (50%)    45 (67%)   15 (35%)   78 (40%)   17 (28%)
Nexus Illumina          2 (5%)     6 (2%)     1 (10%)    7 (44%)    22 (4%)    2 (2%)     4 (4%)     -          6 (9%)     7 (16%)    10 (5%)    9 (15%)
PennCNV                 11 (28%)   51 (13%)   10 (100%)  9 (56%)    105 (17%)  10 (11%)   65 (59%)   6 (75%)    -          19 (44%)   71 (37%)   21 (35%)
PennCNV Illumina        6 (15%)    25 (7%)    2 (20%)    11 (69%)   44 (7%)    9 (10%)    23 (21%)   6 (75%)    18 (27%)   -          26 (13%)   28 (47%)
QuantiSNP               14 (36%)   97 (25%)   10 (100%)  10 (63%)   199 (32%)  18 (21%)   86 (77%)   7 (88%)    65 (97%)   21 (49%)   -          24 (40%)
QuantiSNP Illumina      6 (15%)    14 (4%)    5 (50%)    15 (94%)   55 (9%)    10 (11%)   30 (27%)   8 (100%)   23 (34%)   32 (74%)   31 (16%)   -

Algorithms were run on demonstration data for sample NA10861 on 6.0 chips and Illumina 1M Duo arrays. Default parameters were used and any Y chromosome data was omitted. For algorithm overall totals see Table 4. Events detected by both software packages are shown. Events are counted as common between algorithms if any part of the predicted region overlaps with the other. Each comparison is carried out twice to show cases where smaller events from one algorithm make up one event in the other; the overlap of events therefore depends on the orientation of the analysis. The total value represents the number of events for the software on the horizontal axis found in the other software's dataset; the bracketed value shows this as a percentage of the events detected by that same software. We have found that the most similarities are between data from similar platforms or algorithm methods; for example, PennCNV and QuantiSNP are both based on the HMM algorithm and as such their event predictions should be very similar. We have also noted a higher number of similar events from algorithms using 6.0 array data.

FURTHER ANALYSIS OF DETECTED CNVs

With a number of reliable options available for the detection of copy number events it becomes increasingly important to be able to summarize and use this data. Initially, we are often interested in looking for novel events in certain genes or regions. Tracks of events can be viewed in databases such as the web-based genome browser, UCSC, and events can be compared to known copy number data in the DGV, as displayed in Figure 3. Importing several tracks of data into a browser simultaneously allows the user to compare different result sets.

Figure 5: Image from UCSC Browser showing the detection of a single event using different algorithms. The deletion described is a known CNP and is recorded several times in the DGV. Each track represents a different algorithm or platform. All results for detection algorithms shown used default parameters and test sample NA

Analysis of multiple events per sample is a more complicated procedure. Events and samples can be explored using pathway analysis tools to look for interesting groups or combinations of events in different genes, but methods of confirming the significance of an event are required. A number of publications present ways of applying association study methods to copy number data. Barnes et al. [29] developed an R package, CNVtools, which allows the user to carry out case-control association
analysis on a single CNV of interest. The publication tests a series of five alternative modelling methods before recommending a likelihood ratio test which combines CNV calling and association testing into a single model. This method was designed to eliminate problems with signal noise, which is a known trait of SNP assay data. Ionita-Laza et al. [30] suggested a method to apply genome-wide family-based association studies to raw-intensity data. The Birdsuite package includes a pipeline to prepare the data for PLINK analysis. Other sources have suggested similar association study-based strategies, but an agreed approach is still a subject of great discussion. Authors such as Scherer et al. [31] have called for a decision on a single technique, but future decisions in the field will be extremely enlightening.

As is often noted in the literature describing SNP association study techniques, sample size and the power of tests are major factors in a successful study [32]. This must also be considered when analysing copy number data.

As we have discussed, there are a number of analysis options available for SNP array CNV detection: pipelines that allow guided analysis and stand-alone options for more flexible analysis. Some of these applications are platform targeted, but we have found that the best outcome is given by using multiple algorithms and comparing the data.

SUPPLEMENTARY DATA

Supplementary data are available online at bib.oxfordjournals.org/.

Key Points

A wide variety of software is available for CNV detection from data produced by SNP arrays. This review seeks to discuss options and statistical methods currently available for analysis of signal intensity data. Changes in assay selection techniques for SNP arrays have made them more appealing for copy number detection as well as genotyping. Targeted probe design has made the SNP array a reliable and cheaper option for copy number analysis. After testing a selection of the available software, comparisons were performed using HapMap samples and published copy number data. Of the events found in our data, 20-49% were replicated in previously published studies, but the results clearly showed variation in data caused by differences in algorithms. An important recommendation when choosing software for analysis is the use of a second algorithm on a dataset to produce more informative results. This enables the user to eliminate false positives not found by both software packages and increases confidence in replicated events.

Acknowledgements

The authors thank Dr Helen Butler for her ideas and contributions to the manuscript.

FUNDING

JR and LW are funded by Wellcome Trust Grants. CY is funded by a UK Medical Research Council Special Training Fellowship in Biomedical Informatics (Ref No. G ).

References

1. Iafrate AJ, Feuk L, Rivera MN, et al. Detection of large-scale variation in the human genome. Nat Genet 2004;36(9).
2. Redon R, Ishikawa S, Fitch KR, et al. Global variation in copy number in the human genome. Nature 2006;444(7118).
3. Tuzun E, Sharp AJ, Bailey JA, et al. Fine-scale structural variation of the human genome. Nat Genet 2005;37(7).
4. Sebat J, Lakshmi B, Troge J, et al. Large-scale copy number polymorphism in the human genome. Science 2004;305(5683).
5. de Smith AJ, Tsalenko A, Sampas N, et al. Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases. Hum Mol Genet 2007;16(23).
6. Carter NP. Methods and strategies for analyzing copy number variation using DNA microarrays. Nat Genet 2007;39(7 Suppl):S
7. Korbel JO, Urban AE, Affourtit JP, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 2007;318(5849).
8. Kennedy GC, Matsuzaki H, Dong S, et al. Large-scale genotyping of complex DNA. Nat Biotechnol 2003;21(10).
9. Peiffer DA, Le JM, Steemers FJ, et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 2006;16(9).
10. International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 2008;455(7210).
11. Yang TL, Chen XD, Guo Y, et al. Genome-wide copy-number-variation study identified a susceptibility gene, UGT2B17, for osteoporosis. Am J Hum Genet 2008;83(6).
12. McCarroll SA, Hadnott TN, Perry GH, et al. Common deletion polymorphisms in the human genome. Nat Genet 2006;38(1).
13. Cooper GM, Zerr T, Kidd JM, et al. Systematic assessment of copy number variant detection via genome-wide SNP genotyping. Nat Genet 2008;40(10).
14. McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat Genet 2007;39(7 Suppl):S37-42.
15. McCarroll SA, Kuruvilla FG, Korn JM, et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet 2008;40(10).
16. Korn JM, Kuruvilla FG, McCarroll SA, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 2008;40(10).
17. Day N, Hemmaplardh A, Thurman RE, et al. Unsupervised segmentation of continuous genomic data. Bioinformatics 2007;23(11).
18. Colella S, Yau C, Taylor JM, et al. QuantiSNP: an objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res 2007;35(6).
19. Wang K, Li M, Hadley D, et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 2007;17(11).
20. Maestrini E, Pagnamenta AT, Lamb JA, et al. High-density SNP association study and copy number variation analysis of the AUTS1 and AUTS5 loci implicate the IMMP2L-DOCK4 gene region in autism susceptibility. Mol Psychiatry.
21. Wang K, Chen Z, Tadesse MG, et al. Modeling genetic inheritance of copy number variations. Nucleic Acids Res 2008;36(21):e
22. Li C, Beroukhim R, Weir BA, et al. Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics 2008;9.
23. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004;5(4).
24. Pique-Regi R, Monso-Varona J, Ortega A, et al. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics 2008;24(3).
25. Lai WR, Johnson MD, Kucherlapati R, Park PJ. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005;21(19).
26. Rigaill G, Hupe P, Almeida A, et al. ITALICS: an algorithm for normalization and DNA copy number calling for SNP arrays. Bioinformatics 2008;24(6).
27. Franke L, de Kovel CG, Aulchenko YS, et al. Detection, imputation, and association analysis of small deletions and null alleles on oligonucleotide arrays. Am J Hum Genet 2008;82(6).
28. Kidd JM, Cooper GM, Donahue WF, et al. Mapping and sequencing of structural variation from eight human genomes. Nature 2008;453(7191).
29. Barnes C, Plagnol V, Fitzgerald T, et al. A robust statistical method for case-control association testing with copy number variation. Nat Genet 2008;40(10).
30. Ionita-Laza I, Perry GH, Raby BA, et al. On the analysis of copy-number variations in genome-wide association studies: a translation of the family-based association test. Genet Epidemiol 2008;32(3).
31. Scherer SW, Lee C, Birney E, et al. Challenges and standards in integrating surveys of structural variation. Nat Genet 2007;39(7 Suppl):S
32. Cardon LR, Bell JI. Association study designs for complex diseases. Nat Rev Genet 2001;2(2):91-9.

Supplementary note: Comparison of deletion variants identified in this study and four earlier studies

Supplementary note: Comparison of deletion variants identified in this study and four earlier studies Supplementary note: Comparison of deletion variants identified in this study and four earlier studies Here we compare the results of this study to potentially overlapping results from four earlier studies

More information

Understanding DNA Copy Number Data

Understanding DNA Copy Number Data Understanding DNA Copy Number Data Adam B. Olshen Department of Epidemiology and Biostatistics Helen Diller Family Comprehensive Cancer Center University of California, San Francisco http://cc.ucsf.edu/people/olshena_adam.php

More information

Identification of regions with common copy-number variations using SNP array

Identification of regions with common copy-number variations using SNP array Identification of regions with common copy-number variations using SNP array Agus Salim Epidemiology and Public Health National University of Singapore Copy Number Variation (CNV) Copy number alteration

More information

Associating Copy Number and SNP Variation with Human Disease. Autism Segmental duplication Neurobehavioral, includes social disability

Associating Copy Number and SNP Variation with Human Disease. Autism Segmental duplication Neurobehavioral, includes social disability Technical Note Associating Copy Number and SNP Variation with Human Disease Abstract The Genome-Wide Human SNP Array 6.0 is an affordable tool to examine the role of copy number variation in disease by

More information

LTA Analysis of HapMap Genotype Data

LTA Analysis of HapMap Genotype Data LTA Analysis of HapMap Genotype Data Introduction. This supplement to Global variation in copy number in the human genome, by Redon et al., describes the details of the LTA analysis used to screen HapMap

More information

Abstract. Optimization strategy of Copy Number Variant calling using Multiplicom solutions APPLICATION NOTE. Introduction

Abstract. Optimization strategy of Copy Number Variant calling using Multiplicom solutions APPLICATION NOTE. Introduction Optimization strategy of Copy Number Variant calling using Multiplicom solutions Michael Vyverman, PhD; Laura Standaert, PhD and Wouter Bossuyt, PhD Abstract Copy number variations (CNVs) represent a significant

More information

Agilent s Copy Number Variation (CNV) Portfolio

Agilent s Copy Number Variation (CNV) Portfolio Technical Overview Agilent s Copy Number Variation (CNV) Portfolio Abstract Copy Number Variation (CNV) is now recognized as a prevalent form of structural variation in the genome contributing to human

More information

Comparative analyses of seven algorithms for copy number variant identification from single nucleotide polymorphism arrays

Comparative analyses of seven algorithms for copy number variant identification from single nucleotide polymorphism arrays Published online 8 February 2010 Nucleic Acids Research, 2010, Vol. 38, No. 9 e105 doi:10.1093/nar/gkq040 Comparative analyses of seven algorithms for copy number variant identification from single nucleotide

More information

Statistical Analysis of Single Nucleotide Polymorphism Microarrays in Cancer Studies

Statistical Analysis of Single Nucleotide Polymorphism Microarrays in Cancer Studies Statistical Analysis of Single Nucleotide Polymorphism Microarrays in Cancer Studies Stanford Biostatistics Workshop Pierre Neuvial with Henrik Bengtsson and Terry Speed Department of Statistics, UC Berkeley

More information

Genome-wide copy-number calling (CNAs not CNVs!) Dr Geoff Macintyre

Genome-wide copy-number calling (CNAs not CNVs!) Dr Geoff Macintyre Genome-wide copy-number calling (CNAs not CNVs!) Dr Geoff Macintyre Structural variation (SVs) Copy-number variations C Deletion A B C Balanced rearrangements A B A B C B A C Duplication Inversion Causes

More information

cnvhap: an integrative population and haplotype based multiplatform model of SNPs and CNVs

cnvhap: an integrative population and haplotype based multiplatform model of SNPs and CNVs cnvhap: an integrative population and haplotype based multiplatform model of SNPs and CNVs Lachlan J M Coin 1, Julian E Asher, Robin G Walters, Julia S El-Sayed Moustafa, Adam J de Smith, Rob Sladek 3,

More information

Introduction to LOH and Allele Specific Copy Number User Forum

Introduction to LOH and Allele Specific Copy Number User Forum Introduction to LOH and Allele Specific Copy Number User Forum Jonathan Gerstenhaber Introduction to LOH and ASCN User Forum Contents 1. Loss of heterozygosity Analysis procedure Types of baselines 2.

More information

Copy Number Variations and Association Mapping Advanced Topics in Computa8onal Genomics

Copy Number Variations and Association Mapping Advanced Topics in Computa8onal Genomics Copy Number Variations and Association Mapping 02-715 Advanced Topics in Computa8onal Genomics SNP and CNV Genotyping SNP genotyping assumes two copy numbers at each locus (i.e., no CNVs) CNV genotyping

More information

The Loss of Heterozygosity (LOH) Algorithm in Genotyping Console 2.0

The Loss of Heterozygosity (LOH) Algorithm in Genotyping Console 2.0 The Loss of Heterozygosity (LOH) Algorithm in Genotyping Console 2.0 Introduction Loss of erozygosity (LOH) represents the loss of allelic differences. The SNP markers on the SNP Array 6.0 can be used

More information

Whole-genome detection of disease-associated deletions or excess homozygosity in a case control study of rheumatoid arthritis

Whole-genome detection of disease-associated deletions or excess homozygosity in a case control study of rheumatoid arthritis HMG Advance Access published December 21, 2012 Human Molecular Genetics, 2012 1 13 doi:10.1093/hmg/dds512 Whole-genome detection of disease-associated deletions or excess homozygosity in a case control

More information

New Enhancements: GWAS Workflows with SVS

New Enhancements: GWAS Workflows with SVS New Enhancements: GWAS Workflows with SVS August 9 th, 2017 Gabe Rudy VP Product & Engineering 20 most promising Biotech Technology Providers Top 10 Analytics Solution Providers Hype Cycle for Life sciences

More information

Detection of aneuploidy in a single cell using the Ion ReproSeq PGS View Kit

Detection of aneuploidy in a single cell using the Ion ReproSeq PGS View Kit APPLICATION NOTE Ion PGM System Detection of aneuploidy in a single cell using the Ion ReproSeq PGS View Kit Key findings The Ion PGM System, in concert with the Ion ReproSeq PGS View Kit and Ion Reporter

More information

CNV Detection and Interpretation in Genomic Data

CNV Detection and Interpretation in Genomic Data CNV Detection and Interpretation in Genomic Data Benjamin W. Darbro, M.D., Ph.D. Assistant Professor of Pediatrics Director of the Shivanand R. Patil Cytogenetics and Molecular Laboratory Overview What

More information

Integrated Analysis of Copy Number and Gene Expression

Integrated Analysis of Copy Number and Gene Expression Integrated Analysis of Copy Number and Gene Expression Nexus Copy Number provides user-friendly interface and functionalities to integrate copy number analysis with gene expression results for the purpose

More information

Integrated detection and population-genetic analysis. of SNPs and copy number variation

Integrated detection and population-genetic analysis. of SNPs and copy number variation Integrated detection and population-genetic analysis of SNPs and copy number variation Steven A. McCarroll 1,2,*, Finny G. Kuruvilla 1,2,*, Joshua M. Korn 1,SimonCawley 3, James Nemesh 1, Alec Wysoker

More information

Genomic structural variation

Genomic structural variation Genomic structural variation Mario Cáceres The new genomic variation DNA sequence differs across individuals much more than researchers had suspected through structural changes A huge amount of structural

More information

During the hyperinsulinemic-euglycemic clamp [1], a priming dose of human insulin (Novolin,

During the hyperinsulinemic-euglycemic clamp [1], a priming dose of human insulin (Novolin, ESM Methods Hyperinsulinemic-euglycemic clamp procedure During the hyperinsulinemic-euglycemic clamp [1], a priming dose of human insulin (Novolin, Clayton, NC) was followed by a constant rate (60 mu m

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION CONTENTS A. AUTISM SPECTRUM DISORDER (ASD) SAMPLE AND CONTROL COLLECTIONS 4 ASD samples 4 Control cohorts 4 B. GENOTYPING AND DATA CLEANING 6 SNP quality control 6 Intensity quality control for CNV detection

More information

November 9, Johns Hopkins School of Medicine, Baltimore, MD,

November 9, Johns Hopkins School of Medicine, Baltimore, MD, Fast detection of de-novo copy number variants from case-parent SNP arrays identifies a deletion on chromosome 7p14.1 associated with non-syndromic isolated cleft lip/palate Samuel G. Younkin 1, Robert

More information

cn.mops - Mixture of Poissons for CNV detection in NGS data Günter Klambauer Institute of Bioinformatics, Johannes Kepler University Linz

cn.mops - Mixture of Poissons for CNV detection in NGS data Günter Klambauer Institute of Bioinformatics, Johannes Kepler University Linz Software Manual Institute of Bioinformatics, Johannes Kepler University Linz cn.mops - Mixture of Poissons for CNV detection in NGS data Günter Klambauer Institute of Bioinformatics, Johannes Kepler University

More information

Optimizing Copy Number Variation Analysis Using Genome-wide Short Sequence Oligonucleotide Arrays

Optimizing Copy Number Variation Analysis Using Genome-wide Short Sequence Oligonucleotide Arrays Optimizing Copy Number Variation Analysis Using Genome-wide Short Sequence Oligonucleotide Arrays The Harvard community has made this article openly available. Please share how this access benefits you.

More information

Using GWAS Data to Identify Copy Number Variants Contributing to Common Complex Diseases

Using GWAS Data to Identify Copy Number Variants Contributing to Common Complex Diseases arxiv:1010.5040v1 [stat.me] 25 Oct 2010 Statistical Science 2009, Vol. 24, No. 4, 530 546 DOI: 10.1214/09-STS304 c Institute of Mathematical Statistics, 2009 Using GWAS Data to Identify Copy Number Variants

More information

Tutorial on Genome-Wide Association Studies

Tutorial on Genome-Wide Association Studies Tutorial on Genome-Wide Association Studies Assistant Professor Institute for Computational Biology Department of Epidemiology and Biostatistics Case Western Reserve University Acknowledgements Dana Crawford

More information

Below, we included the point-to-point response to the comments of both reviewers.

Below, we included the point-to-point response to the comments of both reviewers. To the Editor and Reviewers: We would like to thank the editor and reviewers for careful reading, and constructive suggestions for our manuscript. According to comments from both reviewers, we have comprehensively

More information

Comparison of segmentation methods in cancer samples

Comparison of segmentation methods in cancer samples fig/logolille2. Comparison of segmentation methods in cancer samples Morgane Pierre-Jean, Guillem Rigaill, Pierre Neuvial Laboratoire Statistique et Génome Université d Évry Val d Éssonne UMR CNRS 8071

More information

Golden Helix s End-to-End Solution for Clinical Labs

Golden Helix s End-to-End Solution for Clinical Labs Golden Helix s End-to-End Solution for Clinical Labs Steven Hystad - Field Application Scientist Nathan Fortier Senior Software Engineer 20 most promising Biotech Technology Providers Top 10 Analytics

More information

Association for Molecular Pathology Promoting Clinical Practice, Basic Research, and Education in Molecular Pathology

Association for Molecular Pathology Promoting Clinical Practice, Basic Research, and Education in Molecular Pathology Association for Molecular Pathology Promoting Clinical Practice, Basic Research, and Education in Molecular Pathology 9650 Rockville Pike, Bethesda, Maryland 20814 Tel: 301-634-7939 Fax: 301-634-7990 Email:

More information

Challenges of CGH array testing in children with developmental delay. Dr Sally Davies 17 th September 2014

Challenges of CGH array testing in children with developmental delay. Dr Sally Davies 17 th September 2014 Challenges of CGH array testing in children with developmental delay Dr Sally Davies 17 th September 2014 CGH array What is CGH array? Understanding the test Benefits Results to expect Consent issues Ethical

More information

Integrated detection and population-genetic analysis of SNPs and copy number variation

Integrated detection and population-genetic analysis of SNPs and copy number variation 8 Nature Publishing Group http://www.nature.com/naturegenetics Integrated detection and population-genetic analysis of SNPs and copy number variation Steven A McCarroll 4,, Finny G Kuruvilla 4,, Joshua

More information

Integrated detection and population-genetic analysis of SNPs and copy number variation

Integrated detection and population-genetic analysis of SNPs and copy number variation Integrated detection and population-genetic analysis of SNPs and copy number variation Steven A McCarroll 4,, Finny G Kuruvilla 4,, Joshua M Korn 6, Simon Cawley 7, James Nemesh, Alec Wysoker, Michael

More information

DNA-seq Bioinformatics Analysis: Copy Number Variation

DNA-seq Bioinformatics Analysis: Copy Number Variation DNA-seq Bioinformatics Analysis: Copy Number Variation Elodie Girard elodie.girard@curie.fr U900 institut Curie, INSERM, Mines ParisTech, PSL Research University Paris, France NGS Applications 5C HiC DNA-seq

More information

Structural Variants and Susceptibility to Common Human Disorders Dr. Xavier Estivill

Structural Variants and Susceptibility to Common Human Disorders Dr. Xavier Estivill Structural Variants and Susceptibility Genetic Causes of Disease Lab Genes and Disease Program Center for Genomic Regulation (CRG) Barcelona 1 Complex genetic diseases Changes in prevalence (>10 fold)

More information

Exercises: Differential Methylation

Exercises: Differential Methylation Exercises: Differential Methylation Version 2018-04 Exercises: Differential Methylation 2 Licence This manual is 2014-18, Simon Andrews. This manual is distributed under the creative commons Attribution-Non-Commercial-Share

More information

Global variation in copy number in the human genome

Global variation in copy number in the human genome Global variation in copy number in the human genome Redon et. al. Nature 444:444-454 (2006) 12.03.2007 Tarmo Puurand Study 270 individuals (HapMap collection) Affymetrix 500K Whole Genome TilePath (WGTP)

More information

Computational Analysis of Genome-Wide DNA Copy Number Changes

Computational Analysis of Genome-Wide DNA Copy Number Changes Computational Analysis of Genome-Wide DNA Copy Number Changes Lei Song Thesis submitted to the faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements

More information

2) Cases and controls were genotyped on different platforms. The comparability of the platforms should be discussed.

2) Cases and controls were genotyped on different platforms. The comparability of the platforms should be discussed. Reviewers' Comments: Reviewer #1 (Remarks to the Author) The manuscript titled 'Association of variations in HLA-class II and other loci with susceptibility to lung adenocarcinoma with EGFR mutation' evaluated

More information

A Multi-Sample Based Method for Identifying Common CNVs in Normal Human Genomic Structure Using High- Resolution acgh Data

A Multi-Sample Based Method for Identifying Common CNVs in Normal Human Genomic Structure Using High- Resolution acgh Data A Multi-Sample Based Method for Identifying Common CNVs in Normal Human Genomic Structure Using High- Resolution acgh Data Chihyun Park 1, Jaegyoon Ahn 1, Youngmi Yoon 2, Sanghyun Park 1 * 1 Department

More information

On Missing Data and Genotyping Errors in Association Studies

On Missing Data and Genotyping Errors in Association Studies On Missing Data and Genotyping Errors in Association Studies Department of Biostatistics Johns Hopkins Bloomberg School of Public Health May 16, 2008 Specific Aims of our R01 1 Develop and evaluate new

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introduction 1.1 Motivation and Goals The increasing availability and decreasing cost of high-throughput (HT) technologies coupled with the availability of computational tools and data form a

More information

cn.mops - Mixture of Poissons for CNV detection in NGS data Günter Klambauer Institute of Bioinformatics, Johannes Kepler University Linz

cn.mops - Mixture of Poissons for CNV detection in NGS data Günter Klambauer Institute of Bioinformatics, Johannes Kepler University Linz Software Manual Institute of Bioinformatics, Johannes Kepler University Linz cn.mops - Mixture of Poissons for CNV detection in NGS data Günter Klambauer Institute of Bioinformatics, Johannes Kepler University

More information

Evaluating Classifiers for Disease Gene Discovery

Evaluating Classifiers for Disease Gene Discovery Evaluating Classifiers for Disease Gene Discovery Kino Coursey Lon Turnbull khc0021@unt.edu lt0013@unt.edu Abstract Identification of genes involved in human hereditary disease is an important bioinfomatics

More information

Modeling genetic inheritance of copy number variations

Modeling genetic inheritance of copy number variations Published online 2 October 2008 Nucleic Acids Research, 2008, Vol. 36, No. 21 e138 doi:10.1093/nar/gkn641 Modeling genetic inheritance of copy number variations Kai Wang 1,2, *, Zhen Chen 3, Mahlet G.

More information

Mutation Detection and CNV Analysis for Illumina Sequencing data from HaloPlex Target Enrichment Panels using NextGENe Software for Clinical Research

Mutation Detection and CNV Analysis for Illumina Sequencing data from HaloPlex Target Enrichment Panels using NextGENe Software for Clinical Research Mutation Detection and CNV Analysis for Illumina Sequencing data from HaloPlex Target Enrichment Panels using NextGENe Software for Clinical Research Application Note Authors John McGuigan, Megan Manion,

More information

Nature Genetics: doi: /ng Supplementary Figure 1

Nature Genetics: doi: /ng Supplementary Figure 1 Supplementary Figure 1 Illustrative example of ptdt using height The expected value of a child s polygenic risk score (PRS) for a trait is the average of maternal and paternal PRS values. For example,

More information

Supplementary Figures

Supplementary Figures Supplementary Figures Supplementary Fig 1. Comparison of sub-samples on the first two principal components of genetic variation. TheBritishsampleisplottedwithredpoints.The sub-samples of the diverse sample

More information

38 Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'16

38 Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'16 38 Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'16 PGAR: ASD Candidate Gene Prioritization System Using Expression Patterns Steven Cogill and Liangjiang Wang Department of Genetics and

More information

Vega: Variational Segmentation for Copy Number Detection

Vega: Variational Segmentation for Copy Number Detection Vega: Variational Segmentation for Copy Number Detection Sandro Morganella Luigi Cerulo Giuseppe Viglietto Michele Ceccarelli Contents 1 Overview 1 2 Installation 1 3 Vega.RData Description 2 4 Run Vega

More information

Cost effective, computer-aided analytical performance evaluation of chromosomal microarrays for clinical laboratories

University of Iowa Iowa Research Online Theses and Dissertations Summer 2012 Cost effective, computer-aided analytical performance evaluation of chromosomal microarrays for clinical laboratories Corey

PennCNV: An integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data

Methods PennCNV: An integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data Kai Wang, 1 Mingyao Li, 2 Dexter Hadley, 1,3 Rui Liu,

Using Bayesian Networks to Analyze Expression Data. Xu Siwei, s Muhammad Ali Faisal, s Tejal Joshi, s

Using Bayesian Networks to Analyze Expression Data Xu Siwei, s0789023 Muhammad Ali Faisal, s0677834 Tejal Joshi, s0677858 Outline Introduction Bayesian Networks Equivalence Classes Applying to Expression

Nature Genetics: doi: /ng Supplementary Figure 1. PCA for ancestry in SNV data.

Supplementary Figure 1 PCA for ancestry in SNV data. (a) EIGENSTRAT principal-component analysis (PCA) of SNV genotype data on all samples. (b) PCA of only proband SNV genotype data. (c) PCA of SNV genotype

CHROMOSOMAL MICROARRAY (CGH+SNP)

Chromosome imbalances are a significant cause of developmental delay, mental retardation, autism spectrum disorders, dysmorphic features and/or birth defects. The imbalance of genetic material may be due

Screening for novel oncology biomarker panels using both DNA and protein microarrays. John Anson, PhD VP Biomarker Discovery

Screening for novel oncology biomarker panels using both DNA and protein microarrays John Anson, PhD VP Biomarker Discovery Outline of presentation Introduction to OGT and our approach to biomarker studies

Genomics 101 (2013) Contents lists available at SciVerse ScienceDirect. Genomics. journal homepage:

Genomics 101 (2013) 134-138 Contents lists available at SciVerse ScienceDirect Genomics journal homepage: www.elsevier.com/locate/ygeno Gene-based copy number variation study reveals a microdeletion at

Quality Control Analysis of Add Health GWAS Data

2018 Add Health Documentation Report prepared by Heather M. Highland Quality Control Analysis of Add Health GWAS Data Christy L. Avery Qing Duan Yun Li Kathleen Mullan Harris CAROLINA POPULATION CENTER

Ginkgo Interactive analysis and quality assessment of single-cell CNV data

Ginkgo Interactive analysis and quality assessment of single-cell CNV data @RobAboukhalil Robert Aboukhalil, Tyler Garvin, Jude Kendall, Timour Baslan, Gurinder S. Atwal, Jim Hicks, Michael Wigler, Michael

Supplementary Material to. Genome-wide association study identifies new HLA Class II haplotypes strongly protective against narcolepsy

Supplementary Material to Genome-wide association study identifies new HLA Class II haplotypes strongly protective against narcolepsy Hyun Hor, 1,2, Zoltán Kutalik, 3,4, Yves Dauvilliers, 2,5 Armand Valsesia,

CNV analysis in the Lithuanian population

Urnikyte et al. BMC Genetics (2016) 17:64 DOI 10.1186/s12863-016-0373-6 RESEARCH ARTICLE CNV analysis in the Lithuanian population A. Urnikyte 1*, I. Domarkiene 1, S. Stoma 3, L. Ambrozaityte 1, I. Uktveryte

Generating Spontaneous Copy Number Variants (CNVs) Jennifer Freeman Assistant Professor of Toxicology School of Health Sciences Purdue University

Role of Chemical Exposure in Generating Spontaneous Copy Number Variants (CNVs) Jennifer Freeman Assistant Professor of Toxicology School of Health Sciences Purdue University CNV Discovery Reference Genetic

Analysis of CGH and SNP arrays for the detection of chromosomal aberrations in single cells

Analysis of CGH and SNP arrays for the detection of chromosomal aberrations in single cells Peter Konings 1 Evelyne Vanneste 1,2 Thierry Voet 1 Cédric Le Caignec 1 Michèle Ampe 1 Cindy Melotte 1 Sophie

Microarray Comparative Genomic Hybridisation (array CGH)

Saint Mary's Hospital Manchester Centre for Genomic Medicine Information for Patients Microarray Comparative Genomic Hybridisation (array CGH) An array CGH test looks for small changes in a person's chromosomes,

Investigating rare diseases with Agilent NGS solutions

Investigating rare diseases with Agilent NGS solutions Chitra Kotwaliwale, Ph.D. 1 Rare diseases affect 350 million people worldwide 7,000 rare diseases 80% are genetic 60 million affected in the US, Europe

Assessing Accuracy of Genotype Imputation in American Indians

Assessing Accuracy of Genotype Imputation in American Indians Alka Malhotra*, Sayuko Kobes, Clifton Bogardus, William C. Knowler, Leslie J. Baier, Robert L. Hanson Phoenix Epidemiology and Clinical Research

Genome-Wide Analysis of Copy Number Variations in Normal Population Identified by SNP Arrays

The Open Biology Journal, 2009, 2, 54-65 Open Access Genome-Wide Analysis of Copy Number Variations in Normal Population Identified by SNP Arrays Jian Wang 1,2, Tsz-Kwong Man 1,3,4, Kwong Kwok Wong

Comprehensive performance comparison of high-resolution array platforms for genome-wide Copy Number Variation (CNV) analysis in humans

Haraksingh et al. BMC Genomics (2017) 18:321 DOI 10.1186/s12864-017-3658-x RESEARCH ARTICLE Comprehensive performance comparison of high-resolution array platforms for genome-wide Copy Number Variation

Nature Methods: doi: /nmeth.3115

Supplementary Figure 1 Analysis of DNA methylation in a cancer cohort based on Infinium 450K data. RnBeads was used to rediscover a clinically distinct subgroup of glioblastoma patients characterized by

White Paper. Copy number variant detection. Sample to Insight. August 19, 2015

White Paper Copy number variant detection August 19, 2015 Sample to Insight CLC bio, a QIAGEN Company Silkeborgvej 2 Prismet 8000 Aarhus C Denmark Telephone: +45 70 22 32 44 Fax: +45 86 20 12 22 www.clcbio.com

CNV PCA Search Tutorial

CNV PCA Search Tutorial Release 8.1 Golden Helix, Inc. March 18, 2014 Contents 1. Data Preparation A. Join Log Ratio Data with Phenotype Information B. Activate only

Aspects of Statistical Modelling & Data Analysis in Gene Expression Genomics. Mike West Duke University

Aspects of Statistical Modelling & Data Analysis in Gene Expression Genomics Mike West Duke University Papers, software, many links: www.isds.duke.edu/~mw ABS04 web site: Lecture slides, stats notes, papers,

Nature Biotechnology: doi: /nbt.1904

Supplementary Information Comparison between assembly-based SV calls and array CGH results Genome-wide array assessment of copy number changes, such as array comparative genomic hybridization (aCGH), is

Structural Variation and Medical Genomics

Structural Variation and Medical Genomics Andrew King Department of Biomedical Informatics July 8, 2014 You already know about small scale genetic mutations Single nucleotide polymorphism (SNPs) Deletions,

DOES THE BRCAX GENE EXIST? FUTURE OUTLOOK

CHAPTER 6 DOES THE BRCAX GENE EXIST? FUTURE OUTLOOK Genetic research aimed at the identification of new breast cancer susceptibility genes is at an interesting crossroad. On the one hand, the existence

Supplementary Figure 1

Supplementary Figure 1 An example of the gene-term-disease network automatically generated by Phenolyzer web server for 'autism'. The largest word represents the user's input term, Autism. The pink round

Multimarker Genetic Analysis Methods for High Throughput Array Data

Multimarker Genetic Analysis Methods for High Throughput Array Data by Iuliana Ionita A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy Department

Analysis of aCGH data: statistical models and computational challenges

Analysis of aCGH data: statistical models and computational challenges Ramón Díaz-Uriarte 2007-02-13 Outline 1 Introduction Alternative approaches What

Identifying the Zygosity Status of Twins Using Bayes Network and Estimation- Maximization Methodology

Identifying the Zygosity Status of Twins Using Bayes Network and Estimation- Maximization Methodology Yicun Ni (ID#: 9064804041), Jin Ruan (ID#: 9070059457), Ying Zhang (ID#: 9070063723) Abstract As the

ChIP-seq data analysis

ChIP-seq data analysis Harri Lähdesmäki Department of Computer Science Aalto University November 24, 2017 Contents Background ChIP-seq protocol ChIP-seq data analysis Transcriptional regulation Transcriptional

New methods for discovering common and rare genetic variants in human disease

Washington University in St. Louis Washington University Open Scholarship All Theses and Dissertations (ETDs) 1-1-2011 New methods for discovering common and rare genetic variants in human disease Peng

Copy Number Variations

Copy Number Variations Illumina Seminar - Milan June 18, 2009 Untangling the complexity of mendelian and complex diseases Federica Torri Dept of Science & Biomedical Technologies Fondazione Filarete, University

DETECTING HIGHLY DIFFERENTIATED COPY-NUMBER VARIANTS FROM POOLED POPULATION SEQUENCING

DETECTING HIGHLY DIFFERENTIATED COPY-NUMBER VARIANTS FROM POOLED POPULATION SEQUENCING DANIEL R. SCHRIDER * Department of Biology and School of Informatics and Computing, Indiana University, 1001 E Third

Introduction to Genetics and Genomics

2016 Introduction to Genetics and Genomics 3. Association Studies ggibson.gt@gmail.com http://www.cig.gatech.edu Outline General overview of association studies Sample results Three steps to GWAS: primary scan,

Introduction to the Genetics of Complex Disease

Introduction to the Genetics of Complex Disease Jeremiah M. Scharf, MD, PhD Departments of Neurology, Psychiatry and Center for Human Genetic Research Massachusetts General Hospital Breakthroughs in Genome

Accessing and Using ENCODE Data Dr. Peggy J. Farnham

1 William M Keck Professor of Biochemistry Keck School of Medicine University of Southern California How many human genes are encoded in our 3x10^9 bp? C. elegans (worm) 959 cells and 1x10^8 bp 20,000

White Paper Estimating Complex Phenotype Prevalence Using Predictive Models

White Paper 23-12 Estimating Complex Phenotype Prevalence Using Predictive Models Authors: Nicholas A. Furlotte Aaron Kleinman Robin Smith David Hinds Created: September 25th, 2015

Relationship between genomic features and distributions of RS1 and RS3 rearrangements in breast cancer genomes.

Supplementary Figure 1 Relationship between genomic features and distributions of RS1 and RS3 rearrangements in breast cancer genomes. (a,b) Values of coefficients associated with genomic features, separately

AVENIO ctDNA Analysis Kits The complete NGS liquid biopsy solution EMPOWER YOUR LAB

Analysis Kits The complete NGS liquid biopsy solution EMPOWER YOUR LAB Analysis Kits Next-generation performance in liquid biopsies 2 Accelerating clinical research From liquid biopsy to next-generation

Systematic Analysis for Identification of Genes Impacting Cancers

Systematic Analysis for Identification of Genes Impacting Cancers Arpita Singhal Stanford University Saint Francis High School ABSTRACT Currently, vast amounts of molecular information involving genomic

SNP Array NOTE: THIS IS A SAMPLE REPORT AND MAY NOT REFLECT ACTUAL PATIENT DATA. FORMAT AND/OR CONTENT MAY BE UPDATED PERIODICALLY.

SAMPLE REPORT SNP Array NOTE: THIS IS A SAMPLE REPORT AND MAY NOT REFLECT ACTUAL PATIENT DATA. FORMAT AND/OR CONTENT MAY BE UPDATED PERIODICALLY. RESULTS SNP Array Copy Number Variations Result: GAIN,

Shape-based retrieval of CNV regions in read coverage data. Sangkyun Hong and Jeehee Yoon*

Int. J. Data Mining and Bioinformatics, Vol. 9, No. 3, 2014 Shape-based retrieval of CNV regions in read coverage data Sangkyun Hong and Jeehee Yoon* Department of Computer Engineering, Hallym University

Probability-Based Protein Identification for Post-Translational Modifications and Amino Acid Variants Using Peptide Mass Fingerprint Data

Probability-Based Protein Identification for Post-Translational Modifications and Amino Acid Variants Using Peptide Mass Fingerprint Data Tong WW, McComb ME, Perlman DH, Huang H, O'Connor PB, Costello

STATISTICAL METHODS FOR THE DETECTION AND ANALYSES OF STRUCTURAL VARIANTS IN THE HUMAN GENOME. Shu Mei, Teo

Department of Medical Epidemiology and Biostatistics Karolinska Institutet, Stockholm, Sweden & Saw Swee Hock School of Public Health National University of Singapore, Singapore STATISTICAL METHODS FOR

and SNPs: Understanding Human Structural Variation in Disease. My

CNVs vs. SNPs: Understanding Human Structural Variation in Disease [0:00:00] Hello and welcome to today's Science/AAAS live webinar entitled, CNVs and SNPs: Understanding Human Structural Variation in

Implementation of the DDD/ClinGen OGT (CytoSure v3) Microarray

Implementation of the DDD/ClinGen OGT (CytoSure v3) Microarray OGT UGM Birmingham 08/09/2016 Dom McMullan Birmingham Women's NHS Trust WM chromosomal microarray (CMA) testing Population of ~6 million (10%)

Identification of Tissue Independent Cancer Driver Genes

Identification of Tissue Independent Cancer Driver Genes Alexandros Manolakos, Idoia Ochoa, Kartik Venkat Supervisor: Olivier Gevaert Abstract Identification of genomic patterns in tumors is an important

UNIVERSITI TEKNOLOGI MARA COPY NUMBER VARIATIONS OF ORANG ASLI (NEGRITO) FROM PENINSULAR MALAYSIA

UNIVERSITI TEKNOLOGI MARA COPY NUMBER VARIATIONS OF ORANG ASLI (NEGRITO) FROM PENINSULAR MALAYSIA SITI SHUHADA MOKHTAR Thesis submitted in fulfillment of the requirements for the degree of Master of Science