SVIM: Structural variant identification with long reads
DAVID HELLER
MAX PLANCK INSTITUTE FOR MOLECULAR GENETICS, BERLIN
JUNE 2018, SMRT LEIDEN

Structural variation (SV)
  Variants larger than 50 bp
  Affect more base pairs than SNVs and indels
  Large influence on phenotype and disease

SVIM
  Structural variant caller for PacBio reads
  Detects six different classes of SVs with high precision and sensitivity
  [Figure: SVIM workflow in three phases (COLLECT, CONFIRM, COMBINE)]
  Input: long read alignments (.bam)
  COLLECT
    (1) Collect SV evidences
        Evidences in read alignments (DEL, INS)
        Evidences between split alignments (DEL, INS, INV, BRK, DUP)
    (2) Cluster SV evidences by their genomic location and span
  CONFIRM
    (3) Confirm / genotype evidence clusters: count local reads supporting ref/alt allele
        Read alignment analysis
        Dotplot analysis
  COMBINE
    (4) Merge and classify SV evidence clusters
  Output: deletions, inversions, tandem duplications, interspersed duplications, cut&paste insertions, novel insertions

1. Collect SV evidences
  [Figure: COLLECT phase of the workflow, steps (1) and (2)]
  Collect evidences for SVs from each individual read
  Search
    in alignments for long gaps in reference or read (CIGAR string)
    between alignments for discordant positions and orientations that indicate deleted, inserted, inverted, chimeric or duplicated pieces
  Collected SV evidences are mere hints to SVs
  Multiple hints need to be combined to pinpoint the exact type and location of the event
  -> Clustering of SV evidences
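The search for long gaps inside a single alignment can be sketched as a walk over the CIGAR string. This is a minimal illustration, not SVIM's actual code: the function name, the tuple layout and the 50 bp cutoff (taken from the SV definition earlier in the talk) are assumptions made here, and for insertions the end coordinate is simply start plus inserted length.

```python
import re

# Assumption: evidence cutoff set to the SV size definition (larger than 50 bp).
MIN_SV_SIZE = 50

# One CIGAR operation: a length followed by an operation character.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def collect_intra_alignment_evidence(ref_start, cigar, min_size=MIN_SV_SIZE):
    """Return (type, start, end) tuples for long gaps in one alignment.

    ref_start is the 0-based reference position where the alignment begins.
    """
    evidences = []
    ref_pos = ref_start
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op == "D" and length > min_size:
            # Long gap in the read: reference bases missing from the read.
            evidences.append(("DEL", ref_pos, ref_pos + length))
        elif op == "I" and length > min_size:
            # Long gap in the reference: extra bases present in the read.
            evidences.append(("INS", ref_pos, ref_pos + length))
        if op in "MDN=X":
            # Only these operations consume reference positions.
            ref_pos += length
    return evidences


if __name__ == "__main__":
    print(collect_intra_alignment_evidence(1000, "100M60D200M5I50M80I20M"))
```

Short indels (such as the `5I` above) are skipped because they are far more likely to be sequencing errors than structural variants.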

Clustering of SVs
  Merges evidences from multiple reads
  Helps to distinguish correct evidences from errors (e.g. sequencing error, alignment error)
  Span-position distance consists of position distance (difference in location) and span distance (difference in length)
  [Figure: deletion evidences along the genome]
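One way to combine the two components into a single distance is sketched below. The slide only states that the distance consists of a position part and a span part; the normalisation, the `position_scale` constant, the `max_distance` threshold and the greedy clustering strategy are all assumptions chosen for illustration.

```python
def span_position_distance(a, b, position_scale=1000.0):
    """Distance between two evidences given as (start, end) tuples.

    Assumed form: relative span difference plus centre distance divided
    by a hypothetical scale constant.
    """
    span_a = a[1] - a[0]
    span_b = b[1] - b[0]
    span_distance = abs(span_a - span_b) / max(span_a, span_b, 1)
    center_a = (a[0] + a[1]) / 2
    center_b = (b[0] + b[1]) / 2
    position_distance = abs(center_a - center_b) / position_scale
    return span_distance + position_distance


def cluster_evidences(evidences, max_distance=0.7):
    """Greedy clustering: add an evidence to the first cluster that
    contains a member within max_distance, else open a new cluster."""
    clusters = []
    for evidence in sorted(evidences):
        for cluster in clusters:
            if any(span_position_distance(evidence, member) <= max_distance
                   for member in cluster):
                cluster.append(evidence)
                break
        else:
            clusters.append([evidence])
    return clusters


if __name__ == "__main__":
    deletions = [(1000, 2000), (1010, 2010), (5000, 5100), (990, 1995)]
    for cluster in cluster_evidences(deletions):
        print(cluster)
```

Evidences that end up alone in a cluster are candidates for sequencing or alignment errors, while clusters supported by several reads are more likely to reflect a real event.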

2. Confirm SVs
  [Figure: CONFIRM phase of the workflow, step (3)]
  Count number of reads that support / contradict each SV for:
    Confirmation (many contradicting reads indicate a false positive)
    Genotyping (~50% supporting reads indicate a heterozygous event, ~100% indicate a homozygous event)
  Two orthogonal approaches
    Read alignment analysis (alignment-based)
    Dotplot analysis (k-mer based)
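The read-counting idea can be illustrated with a small classifier on the fraction of supporting reads. The slide gives only the approximate targets (~50% and ~100%); the cutoffs `het_min` and `hom_min` below are hypothetical values picked for this sketch, not thresholds used by SVIM.

```python
def genotype(supporting, contradicting, het_min=0.2, hom_min=0.8):
    """Classify an SV candidate from local read counts.

    het_min and hom_min are hypothetical cutoffs on the fraction of
    reads supporting the alternative allele.
    """
    total = supporting + contradicting
    if total == 0:
        return "unknown"  # no local reads, nothing to decide on
    fraction = supporting / total
    if fraction >= hom_min:
        return "homozygous"
    if fraction >= het_min:
        return "heterozygous"
    # Mostly contradicting reads: treat the candidate as unconfirmed.
    return "likely false positive"


if __name__ == "__main__":
    for counts in [(10, 0), (5, 5), (1, 19), (0, 0)]:
        print(counts, genotype(*counts))
```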

3. Combine SVs
  [Figure: COMBINE phase of the workflow, step (4)]
  Combine multiple SVs to classify higher-order events
    Deletion of sequence S + insertion of sequence S somewhere else -> Cut&Paste insertion
    Deletion of sequence S -> Deletion
    Insertion of sequence S -> Duplication
    Insertion of novel sequence N -> Novel insertion
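The four classification rules can be written down as a small rule-based function. This is a sketch under simplifying assumptions: the data layout (`dest`, `source`) is invented here, and an insertion is matched to a deletion only when the regions are exactly equal, whereas a real implementation would need a tolerant comparison.

```python
def classify_events(insertions, deletions):
    """Apply the combine rules to confirmed insertions and deletions.

    insertions: list of dicts with 'dest' (chrom, pos) and 'source',
                a (chrom, start, end) region or None for novel sequence.
    deletions:  list of (chrom, start, end) regions.
    Returns a list of (class, location, source) tuples.
    """
    remaining = list(deletions)
    calls = []
    for insertion in insertions:
        source = insertion["source"]
        if source is None:
            # Inserted sequence does not occur in the reference.
            calls.append(("novel insertion", insertion["dest"], None))
        elif source in remaining:
            # Source region is also deleted: the sequence was moved.
            remaining.remove(source)
            calls.append(("cut&paste insertion", insertion["dest"], source))
        else:
            # Source region is still present: the sequence was copied.
            calls.append(("duplication", insertion["dest"], source))
    for region in remaining:
        # Deletions without a matching insertion stay plain deletions.
        calls.append(("deletion", region, None))
    return calls


if __name__ == "__main__":
    deletions = [("chr1", 100, 200), ("chr1", 500, 600)]
    insertions = [
        {"dest": ("chr2", 50), "source": ("chr1", 100, 200)},
        {"dest": ("chr3", 10), "source": ("chr1", 900, 950)},
        {"dest": ("chr4", 5), "source": None},
    ]
    for call in classify_events(insertions, deletions):
        print(call)
```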

Results

Results on 6x simulated dataset
  [Figure: precision (y-axis) against recall (x-axis) for deletions, insertions, inversions and tandem duplications, shown separately for homozygous and heterozygous SVs. Tools: PBHoney Spots, PBHoney Tails, SVIM, Sniffles]

Results on public 53x real dataset
  Genome in a Bottle dataset for NA12878 individual
  53x coverage PacBio data (SRR3197748)
  2676 / 68 high-confidence deletion / insertion calls
    Parikh, Hemang, et al. "svclassify: a method to establish benchmark structural variant calls." BMC Genomics 17.1 (2016): 64.
  [Figure: precision (y-axis) against recall (x-axis) for deletions and insertions at 53x and 6x coverage. Tools: PBHoney Spots, PBHoney Tails, SVIM, Sniffles]

Results on public 53x real dataset
  Implant SVs into the reference genome
  Align reads to this altered reference genome
  This simulates the inverse of the implanted SV:
    100 deletions are simulated by inserting sequence into the reference genome.
    100 inversions are simulated by inverting regions in the reference.
    100 insertions are simulated by moving regions in the reference.
  [Figure: precision (y-axis) against recall (x-axis) for deletions, inversions and cut&paste insertions at 53x and 6x coverage. Tools: PBHoney Spots, PBHoney Tails, SVIM, Sniffles]
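The implanting step can be illustrated on a toy reference string. The helper names and coordinate conventions (0-based, half-open) are assumptions for this sketch; the talk does not say how the reference was actually edited.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    return sequence.translate(COMPLEMENT)[::-1]


def implant_insertion(reference, position, sequence):
    """Insert sequence into the reference. Unaltered reads aligned to the
    result appear to carry a deletion at this position."""
    return reference[:position] + sequence + reference[position:]


def implant_inversion(reference, start, end):
    """Invert reference[start:end]. Unaltered reads appear inverted there."""
    inverted = reverse_complement(reference[start:end])
    return reference[:start] + inverted + reference[end:]


def implant_move(reference, start, end, destination):
    """Move reference[start:end] to destination (original coordinates).
    Unaltered reads appear to carry a cut&paste insertion."""
    segment = reference[start:end]
    if destination >= end:
        return (reference[:start] + reference[end:destination]
                + segment + reference[destination:])
    if destination <= start:
        return (reference[:destination] + segment
                + reference[destination:start] + reference[end:])
    raise ValueError("destination lies inside the moved segment")


if __name__ == "__main__":
    ref = "AAAACCCCGGGGTTTT"
    print(implant_insertion(ref, 4, "TT"))
    print(implant_inversion("AACGTT", 1, 4))
    print(implant_move(ref, 4, 8, 12))
```

Because the reads themselves are real, this setup keeps the error profile of the 53x dataset while still providing known SV positions for evaluation.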

Conclusion
  SVIM is a tool for SV detection from PacBio reads
  It detects and distinguishes six different SV classes
  Determines genomic origin and destination of insertions and duplications
  Improved recall and precision compared to competing methods
  Large improvement on low-coverage datasets
  github.com/eldariont/svim

Acknowledgements
  Martin Vingron
  NGMLR: Fritz Sedlazeck and Philipp Rescheneder
Thanks for your attention! Questions?
  heller_d@molgen.mpg.de
  github.com/eldariont/svim

Runtime comparison
  Tool            Threads   CPU time (min)   Wall clock time (min)
  PBHoney-Spots         1              951                     958
  PBHoney-Spots        10              943                     124
  PBHoney-Tails         1               76                      77
  Sniffles              1              238                     240
  Sniffles             10             1712                     233
  SVIM                  1              231                     235

Simulation protocol
  Simulated genome (chr21 and chr22 only) with 100 deletions, 100 insertions, 100 tandem duplications, 100 interspersed duplications and 100 inversions between 100 bp and 10 kbp in size (RSVSim)
  Simulated long reads to 6-fold coverage (SimLoRD)
  Run different SV callers on simulated reads
  Compare detected SVs with simulated (correct) SVs (require 90% reciprocal overlap)
  References
    Bartenhagen, C., & Dugas, M. (2013). RSVSim: an R/Bioconductor package for the simulation of structural variations. Bioinformatics, 29(13), 1679-1681.
    Stöcker, B. K., Köster, J., & Rahmann, S. (2016). SimLoRD: Simulation of long read data. Bioinformatics, 32(17), 2704-2706.
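The 90% reciprocal overlap criterion and the resulting precision and recall can be sketched as follows. Intervals are assumed to be half-open (start, end) tuples on the same chromosome, and the function names are chosen for this example; the evaluation scripts used for the talk may differ in detail.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions between intervals a and b."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))


def evaluate(calls, truth, min_overlap=0.9):
    """Return (precision, recall) of calls against the simulated SVs."""
    correct_calls = sum(
        1 for call in calls
        if any(reciprocal_overlap(call, t) >= min_overlap for t in truth)
    )
    recovered_truth = sum(
        1 for t in truth
        if any(reciprocal_overlap(call, t) >= min_overlap for call in calls)
    )
    precision = correct_calls / len(calls) if calls else 0.0
    recall = recovered_truth / len(truth) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    truth = [(100, 200), (1000, 2000)]
    calls = [(105, 200), (1000, 1500), (5000, 5100)]
    print(evaluate(calls, truth))
```

In this example the second call covers only half of the simulated SV, so it counts as a false positive even though it overlaps a true event.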

Dotplot analysis
  [Figure: dotplot of a read against the reference showing a deletion]