Detection of copy number variations in PCR-enriched targeted sequencing data German Demidov Parseq Lab, Saint-Petersburg University of Russian Academy of Sciences, current: Center for Genomic Regulation german.demidov@crg.eu July 29, 2016 German Demidov (CRG) CNVs detection July 29, 2016 1 / 22
Overview
1. Problem Formulation
   Neonatal Screening
   Data
2. Methods
   Counting of coverages
   Quality Control
   Unsupervised Algorithm
   Supervised Algorithm
3. Results
   Validation
   Results
4. Open Questions
Neonatal Screening
Cystic fibrosis, phenylketonuria, galactosemia and others.
We are interested only in Mendelian disorders.
They are rare and treatable if the patient is diagnosed early; otherwise the damage is irreversible.
(image from progenity.com)
Neonatal Screening for CF, PKU, GALT
Pipeline (CF):
1. Immunoreactive Trypsin Test
2. Immunoreactive Trypsin Test 2
3. Sweat Probe
4. Panel for approx. 10 common mutations. Time, Sensitivity?
Alternative:
1. Immunoreactive Trypsin Test
2. Immunoreactive Trypsin Test 2
3. NGS for more than 300 mutations for 3 disorders
4. Sweat Probe
Problem
CNVs can be as short as one exon plus small intronic regions.
From 1% up to 5% of samples have CNVs [for particular disorders].
Alternatives? FISH, MLPA, qPCR, SNVs?
The goal: detect germline CNVs in multiplex PCR-enriched amplicon sequencing data using only coverages.
Multiplex PCR
Image source: http://rosalind.info
Multiplex PCR
Image source: unpublished, Bushmanova et al., bioinformaticsinstitute.ru
Description of Data
Multiplex PCR (primers divided into 2 pools) + IonTorrent sequencing.
128 amplicons covering 3 genes and several intronic regions.
One run: 48 samples.
Average coverage ranges from 10 to 1200 reads per amplicon (samples from dried blood spots).
Overview
Counting of coverages
2 pools: because the primers were divided into 2 pools that generate non-overlapping amplicons within each pool, we can count coverages more efficiently.
Mapping
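A minimal sketch of per-amplicon read counting, assuming reads and amplicons are given as half-open coordinate intervals on one reference. The `Amplicon` tuple, the `count_coverages` function and the "largest covered fraction" rule are illustrative assumptions, not the pipeline used in the talk.

```python
from collections import namedtuple

# Hypothetical record: amplicon name, primer pool, half-open interval [start, end).
Amplicon = namedtuple("Amplicon", "name pool start end")


def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def count_coverages(reads, amplicons):
    """Assign each read (start, end) to the amplicon it covers best.

    Amplicons within one pool do not overlap, so the ambiguity is mostly
    between amplicons of different pools. It is resolved here (an assumed
    rule) by the fraction of the amplicon that the read covers.
    """
    counts = {a.name: 0 for a in amplicons}
    for r_start, r_end in reads:
        best, best_score = None, 0.0
        for a in amplicons:
            score = overlap(r_start, r_end, a.start, a.end) / (a.end - a.start)
            if score > best_score:
                best, best_score = a, score
        if best is not None:  # reads that hit no amplicon are ignored
            counts[best.name] += 1
    return counts


amplicons = [
    Amplicon("A", 1, 100, 200),
    Amplicon("B", 2, 150, 250),
    Amplicon("C", 1, 220, 320),
]
reads = [(100, 200), (150, 250), (225, 320), (1000, 1100)]
coverage = count_coverages(reads, amplicons)
```

The pool label is carried along only as an annotation here; a real implementation could use it to restrict the candidate amplicons per read.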
Counting of coverages
Chimeric Sequences. We have found that:
1. A substantial fraction of reads (1 to 5 percent) have unexpected soft-clipped parts.
2. Using BLAST, we found that these parts mostly come from the targeted regions. We realign them.
Mapping
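The share of reads with soft-clipped parts can be estimated directly from the CIGAR strings of the alignments. The sketch below assumes the CIGAR strings have already been extracted from the alignment file; the function names and the `min_clip` threshold of 20 bases are illustrative assumptions.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_bases(cigar):
    """Total number of soft-clipped (S) bases in one CIGAR string."""
    return sum(int(length) for length, op in CIGAR_OP.findall(cigar) if op == "S")


def fraction_soft_clipped(cigars, min_clip=20):
    """Fraction of reads with at least `min_clip` soft-clipped bases."""
    if not cigars:
        return 0.0
    flagged = sum(1 for c in cigars if soft_clipped_bases(c) >= min_clip)
    return flagged / len(cigars)


cigars = ["100M", "30S70M", "5S95M", "60M40S"]
share = fraction_soft_clipped(cigars)
```

Reads flagged this way are the candidates whose clipped parts could be realigned separately.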
Counting of coverages
Quality Control
Samples arrive at the lab from other labs or hospitals.
In some samples the DNA was poorly extracted; other samples carry CNVs. These two categories must not be confused.
We developed an algorithm that filters out poorly extracted samples before the analysis.
We have 3 genes and can assume that at most one of them carries a CNV, so one of the 3 genes is allowed to fail QC.
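The "at most one gene may fail" rule can be sketched as follows. The noise metric (median absolute deviation of per-amplicon log-ratios), the threshold `max_noise` and the gene labels are assumptions chosen for illustration; the talk does not specify the actual QC statistic.

```python
from statistics import median


def gene_noise(log_ratios):
    """Median absolute deviation of the amplicon log-ratios of one gene."""
    centre = median(log_ratios)
    return median(abs(v - centre) for v in log_ratios)


def passes_qc(sample, max_noise=0.3, max_failed_genes=1):
    """Accept a sample when at most `max_failed_genes` genes look noisy.

    `sample` maps a gene label to the log-ratios of its amplicons
    (coverage relative to a reference; assumed to be precomputed).
    """
    failed = [g for g, lr in sample.items() if gene_noise(lr) > max_noise]
    return len(failed) <= max_failed_genes, failed


# Hypothetical samples: one noisy gene is tolerated, two are not.
good = {
    "gene_A": [0.0, 0.05, -0.05, 0.02],
    "gene_B": [0.8, -1.0, 0.0, 1.2, -1.1],
    "gene_C": [0.01, -0.02, 0.03],
}
bad = dict(good, gene_C=[1.0, -0.9, 0.0, 0.7, -1.2])

good_ok, good_failed = passes_qc(good)
bad_ok, bad_failed = passes_qc(bad)
```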
General Idea
Typical approaches: there are several sources of variation in coverages. We can normalise on GC-content, length, etc.
Our alternative: amplicon-based sequencing has many sources of variation that cannot be inferred. Some amplicons in a panel should show similar efficiency. We can use clusters of correlated amplicons for normalisation.
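One simple way to obtain clusters of correlated amplicons is a greedy grouping on the Pearson correlation of coverages across samples. This is a sketch under assumptions: the greedy strategy, the `min_r` cutoff and the toy coverage values are illustrative and are not taken from the talk.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long coverage vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no correlation signal
    return sxy / sqrt(sxx * syy)


def correlated_clusters(coverage, min_r=0.9):
    """Greedily group amplicons whose coverages correlate with a seed.

    `coverage` maps an amplicon name to its coverages across samples.
    """
    unassigned = list(coverage)
    clusters = []
    while unassigned:
        seed = unassigned.pop(0)
        cluster = [seed]
        for name in unassigned[:]:  # iterate over a copy while removing
            if pearson(coverage[seed], coverage[name]) >= min_r:
                cluster.append(name)
                unassigned.remove(name)
        clusters.append(cluster)
    return clusters


# Hypothetical coverages of four amplicons in four samples.
coverage = {
    "a1": [10, 20, 30, 40],
    "a2": [21, 39, 61, 80],
    "a3": [40, 10, 35, 5],
    "a4": [82, 19, 70, 11],
}
clusters = correlated_clusters(coverage)
```

Within each cluster, the coverage of one amplicon can then be normalised against the coverages of its cluster mates.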
Unsupervised Algorithm
Figure: Regression and prediction intervals (Right figure source: novayagazeta.ru)
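The regression-and-prediction-interval idea from the figure can be sketched with ordinary least squares: regress the coverage of one amplicon on a reference coverage from its cluster and flag samples that fall outside the prediction interval. The use of simple linear regression, the normal quantile in place of a Student t quantile, and all numbers below are assumptions for illustration only.

```python
from math import sqrt
from statistics import NormalDist


def fit_line(x, y):
    """Least-squares fit y = intercept + slope * x with residual std. dev."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    rss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    s = sqrt(rss / (n - 2))
    return {"slope": slope, "intercept": intercept, "s": s,
            "mx": mx, "sxx": sxx, "n": n}


def prediction_interval(model, x_new, level=0.99):
    """Approximate prediction interval (normal quantile, an assumption)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    spread = sqrt(1 + 1 / model["n"] + (x_new - model["mx"]) ** 2 / model["sxx"])
    centre = model["intercept"] + model["slope"] * x_new
    half = z * model["s"] * spread
    return centre - half, centre + half


def is_outlier(model, x_new, y_new, level=0.99):
    """True when the observed coverage lies outside the interval."""
    lo, hi = prediction_interval(model, x_new, level)
    return not (lo <= y_new <= hi)


# Hypothetical training data: reference coverage vs. target amplicon coverage.
reference = [100, 200, 300, 400, 500, 600]
target = [52, 98, 151, 203, 248, 301]
model = fit_line(reference, target)

normal_like = is_outlier(model, 400, 200)    # close to the fitted line
deletion_like = is_outlier(model, 400, 100)  # about half the expected coverage
```

With few samples a Student t quantile would give wider, more honest intervals than the normal approximation used here.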
Supervised Algorithm
We can use the output of the Unsupervised Algorithm or a pre-defined Control Dataset.
So far we tried to detect whether a single amplicon shows the presence of a CNV; now we want to detect CNV sites.
Idea: use the Mahalanobis distance and classify each exon.
Image source: dzone.com
Supervised Algorithm
Data whitening: normalise coverages within the clusters of correlated amplicons and calculate the (squared) Mahalanobis distance, which should (in theory) be χ²-distributed.
Three models: taking H_0 = M_Normal, we can construct H_a = (M_HetDel, M_HetDup).
Three questions:
1. Can each data point from the region be generated by H_a?
2. If so, is it highly probable that it was generated by H_0?
3. If so, is H_a the most probable explanation for the region?
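A simplified sketch of the whitening and classification step. It assumes normalised coverages with expected value 1.0 for a normal exon, 0.5 for a heterozygous deletion and 1.5 for a heterozygous duplication, and a single covariance matrix shared by all three models; these choices, the function names and the `cutoff` value are assumptions, and the three-question test from the slide is reduced here to "nearest model within a cutoff".

```python
from math import sqrt


def cholesky(cov):
    """Lower-triangular L with L * L^T = cov (cov must be positive definite)."""
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L


def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance via whitening: solve L z = x - mean."""
    L = cholesky(cov)
    z = []
    for i in range(len(x)):
        s = sum(L[i][k] * z[k] for k in range(i))
        z.append((x[i] - mean[i] - s) / L[i][i])
    return sum(v * v for v in z)


# Assumed expected coverage ratios for the three models.
MODEL_RATIOS = {"Normal": 1.0, "HetDel": 0.5, "HetDup": 1.5}


def classify_exon(x, cov, cutoff):
    """Pick the closest model; report 'Unexplained' beyond the cutoff."""
    dists = {m: mahalanobis_sq(x, [r] * len(x), cov)
             for m, r in MODEL_RATIOS.items()}
    best = min(dists, key=dists.get)
    return (best if dists[best] <= cutoff else "Unexplained"), dists


# Two amplicons per exon; 9.21 is the 0.99 quantile of chi-square with 2 d.f.
cov = [[0.01, 0.0], [0.0, 0.01]]
label_del, _ = classify_exon([0.52, 0.47], cov, cutoff=9.21)
label_norm, _ = classify_exon([1.02, 0.97], cov, cutoff=9.21)
label_odd, _ = classify_exon([0.75, 1.25], cov, cutoff=9.21)
```

Whitening through the Cholesky factor avoids an explicit matrix inverse and makes the distance a plain sum of squares of the transformed coordinates.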
Supervised Algorithm
Validation
More than 500 samples, more than 1000 sequencing results.
16 de novo discovered variants. One of them was novel (PAHdele4).
810 samples were negative for each algorithm.

        Unsupervised   Supervised
Sens    90.36%         90.36%
Spec    94.97%         94.62%

Figure: Unsupervised algorithm
Figure: Supervised algorithm
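For reference, sensitivity and specificity are computed from confusion-matrix counts as shown below. The counts in the example are hypothetical and are not the validation counts behind the table above.

```python
def sensitivity(tp, fn):
    """Share of true CNV carriers that the algorithm detects."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of CNV-free samples that the algorithm reports as negative."""
    return tn / (tn + fp)


# Hypothetical counts, for illustration only.
sens = sensitivity(tp=9, fn=1)
spec = specificity(tn=95, fp=5)
```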