SVIM: Structural variant identification with long reads
DAVID HELLER
MAX PLANCK INSTITUTE FOR MOLECULAR GENETICS, BERLIN
JUNE 2018, SMRT LEIDEN

Structural variation (SV)
- Variants larger than 50 bp
- Affect more base pairs than SNVs and indels
- Large influence on phenotype and disease

SVIM: Structural variant caller for PacBio reads
- Detects six different classes of SVs with high precision and sensitivity
- Input: long read alignments (.bam)
- COLLECT
  (1) Collect SV evidences: evidences in read alignments (DEL, INS) and evidences between split alignments (DEL, INS, INV, BRK, DUP)
  (2) Cluster SV evidences by their genomic location and span
- CONFIRM
  (3) Confirm / genotype evidence clusters: count local reads supporting the ref/alt allele (read alignment analysis, dotplot analysis)
- COMBINE
  (4) Merge and classify SV evidence clusters
- Output classes: deletions, interspersed duplications, cut&paste insertions, novel insertions, inversions, tandem duplications

1. Collect SV evidences
- Collect evidences for SVs from each individual read
- Search
  - in alignments for long gaps in reference or read (CIGAR string)
  - between alignments for discordant positions and orientations that indicate deleted, inserted, inverted, chimeric or duplicated pieces
- Collected SV evidences are mere hints to SVs
- Multiple hints need to be combined to pinpoint the exact type and location of the event → clustering of SV evidences
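The within-alignment search can be illustrated with a small sketch. It is not SVIM's actual code: the function name `gaps_from_cigar`, the tuple layout and the 50 bp default threshold (borrowed from the SV size definition above) are assumptions made for illustration, and the input is a plain CIGAR string rather than a BAM record.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def gaps_from_cigar(ref_start, cigar, min_length=50):
    """Return (type, reference position, length) for every long gap.

    A long 'D' (gap in the read) hints at a deletion, a long 'I'
    (gap in the reference) hints at an insertion. min_length is an
    assumed threshold, not a value taken from the talk.
    """
    evidences = []
    ref_pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "D" and length >= min_length:
            evidences.append(("DEL", ref_pos, length))
        elif op == "I" and length >= min_length:
            evidences.append(("INS", ref_pos, length))
        # Only these operations consume reference bases.
        if op in "MDN=X":
            ref_pos += length
    return evidences


print(gaps_from_cigar(1000, "500M120D300M60I200M5D100M"))
```

Short gaps such as the `5D` in the example are ignored, since gaps of that size are more likely to reflect sequencing or alignment noise than a structural variant.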

Clustering of SVs
- Merges evidences from multiple reads
- Helps to distinguish correct evidences from errors (e.g. sequencing error, alignment error)
- Span-position distance consists of position distance (difference in location) and span distance (difference in length)
- Figure: deletion evidences along the genome
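A distance of this kind might look as follows. The slide only states that the distance combines a position term and a span term; the normalisation constant, the equal weighting, the greedy clustering and the cutoff below are all assumptions for illustration, not SVIM's published parameters.

```python
def span_position_distance(a, b, position_normalizer=900.0):
    """Distance between two evidences given as (start, end) intervals.

    position distance: difference of the interval centres, scaled by
                       an assumed constant
    span distance:     difference in length, relative to the longer one
    """
    span_a, span_b = a[1] - a[0], b[1] - b[0]
    center_a, center_b = (a[0] + a[1]) / 2, (b[0] + b[1]) / 2
    position_distance = abs(center_a - center_b) / position_normalizer
    span_distance = abs(span_a - span_b) / max(span_a, span_b, 1)
    return position_distance + span_distance


def cluster_evidences(evidences, max_distance=0.7):
    """Greedy clustering: join the first cluster whose newest member is close."""
    clusters = []
    for evidence in sorted(evidences):
        for cluster in clusters:
            if span_position_distance(cluster[-1], evidence) <= max_distance:
                cluster.append(evidence)
                break
        else:
            clusters.append([evidence])
    return clusters


deletions = [(1000, 1500), (1010, 1490), (1005, 1510), (5000, 5200)]
for cluster in cluster_evidences(deletions):
    print(len(cluster), cluster)
```

In this toy input the three deletion evidences near position 1000 end up in one cluster, while the lone evidence at position 5000 stays on its own and would be a candidate for rejection as an error.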

2. Confirm SVs
- Count the number of reads that support / contradict each SV for:
  - Confirmation (many contradicting reads indicate a false positive)
  - Genotyping (~50% supporting reads indicate a heterozygous event, ~100% indicate a homozygous event)
- Two orthogonal approaches:
  - Read alignment analysis (alignment-based)
  - Dotplot analysis (k-mer based)
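The counting rule can be sketched as a simple threshold function. The cut-offs (`min_reads`, `het_min`, `hom_min`) are hypothetical values chosen only so that roughly 50% support maps to heterozygous and roughly 100% to homozygous; the talk does not give the thresholds SVIM uses.

```python
def genotype(supporting, contradicting, min_reads=3, het_min=0.2, hom_min=0.8):
    """Derive a genotype from read counts around one SV candidate.

    All thresholds are assumptions for illustration.
    Returns a VCF-style genotype string.
    """
    total = supporting + contradicting
    if total < min_reads:
        return "./."          # too few local reads to decide
    fraction = supporting / total
    if fraction >= hom_min:
        return "1/1"          # nearly all reads support the SV
    if fraction >= het_min:
        return "0/1"          # about half of the reads support the SV
    return "0/0"              # mostly contradicting reads: likely false positive


for counts in [(10, 10), (19, 1), (1, 19), (1, 1)]:
    print(counts, genotype(*counts))
```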

3. Combine SVs
- Combine multiple SVs to classify higher-order events
  - Deletion of sequence S + insertion of sequence S somewhere else → Cut&Paste Insertion
  - Deletion of sequence S → Deletion
  - Insertion of sequence S → Duplication
  - Insertion of novel sequence N → Novel Insertion
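The four rules can be written down as a small decision procedure. This is a sketch under simplifying assumptions: the data layout (`dest`, `source`) is hypothetical, and a deletion is matched to an insertion source by exact region equality, whereas a real caller would have to tolerate imprecise breakpoints.

```python
def classify(deletions, insertions):
    """Combine deletion and insertion calls into higher-order events.

    deletions:  list of (chrom, start, end) regions
    insertions: list of dicts with 'dest' (chrom, pos) and 'source',
                a (chrom, start, end) region or None for novel sequence
    """
    calls = []
    moved = set()
    for insertion in insertions:
        source = insertion["source"]
        if source is None:
            calls.append(("novel_insertion", insertion["dest"], None))
        elif source in deletions:
            # Sequence S is gone at its origin and present elsewhere.
            calls.append(("cut_paste_insertion", insertion["dest"], source))
            moved.add(source)
        else:
            # Sequence S is still present at its origin: a copy was inserted.
            calls.append(("duplication", insertion["dest"], source))
    for region in deletions:
        if region not in moved:
            calls.append(("deletion", region, None))
    return calls


deletions = [("chr21", 100, 600), ("chr21", 9000, 9400)]
insertions = [
    {"dest": ("chr22", 5000), "source": ("chr21", 100, 600)},
    {"dest": ("chr22", 7000), "source": ("chr21", 2000, 2300)},
    {"dest": ("chr21", 4000), "source": None},
]
for call in classify(deletions, insertions):
    print(call)
```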

Results

Results on 6x simulated dataset
- Figure: precision vs. recall for deletions, insertions, inversions and tandem duplications, shown separately for homozygous and heterozygous SVs
- Tools compared: PBHoney Spots, PBHoney Tails, SVIM, Sniffles

Results on public 53x real dataset
- Genome in a Bottle dataset for NA12878 individual
- 53x coverage PacBio data (SRR3197748)
- 2676 / 68 high-confidence deletion / insertion calls
- Figure: precision vs. recall for deletions and insertions at 53x and 6x coverage
- Tools compared: PBHoney Spots, PBHoney Tails, SVIM, Sniffles
Parikh, Hemang, et al. "svclassify: a method to establish benchmark structural variant calls." BMC Genomics 17.1 (2016): 64.

Results on public 53x real dataset
- Implant SVs into the reference genome
- Align reads to this altered reference genome
- This simulates the inverse of the implanted SV:
  - 100 deletions are simulated by inserting sequence into the reference genome.
  - 100 inversions are simulated by inverting regions in the reference.
  - 100 insertions are simulated by moving regions in the reference.
- Figure: precision vs. recall for deletions, inversions and cut&paste insertions at 53x and 6x coverage
- Tools compared: PBHoney Spots, PBHoney Tails, SVIM, Sniffles
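The three reference edits can be sketched on a toy sequence. The helper names are hypothetical and the talk does not say which tool performed the implanting; the sketch only illustrates why reads from the unaltered genome, aligned to the altered reference, show the inverse event.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def implant_insertion(reference, position, sequence):
    """Extra reference sequence: unaltered reads will show a deletion here."""
    return reference[:position] + sequence + reference[position:]


def implant_inversion(reference, start, end):
    """Reverse-complement reference[start:end]: reads will show an inversion."""
    segment = reference[start:end].translate(COMPLEMENT)[::-1]
    return reference[:start] + segment + reference[end:]


def move_region(reference, start, end, dest):
    """Cut reference[start:end] and paste it at dest (original coordinates).

    Reads will show the region moved back: a cut&paste insertion.
    """
    segment = reference[start:end]
    if dest >= end:
        return reference[:start] + reference[end:dest] + segment + reference[dest:]
    if dest <= start:
        return reference[:dest] + segment + reference[dest:start] + reference[end:]
    raise ValueError("destination lies inside the moved region")


reference = "AAAACCCCGGGGTTTT"
print(implant_insertion(reference, 4, "TAG"))
print(implant_inversion(reference, 4, 8))
print(move_region(reference, 4, 8, 12))
```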

Conclusion
- SVIM is a tool for SV detection from PacBio reads
- It detects and distinguishes six different SV classes
- Determines genomic origin and destination of insertions and duplications
- Improved recall and precision compared to competing methods
- Large improvement on low-coverage datasets
github.com/eldariont/svim

Acknowledgements
- Martin Vingron
- NGMLR: Fritz Sedlazeck and Philipp Rescheneder
Thanks for your attention! Questions?
heller_d@molgen.mpg.de
github.com/eldariont/svim

Runtime comparison

Tool           Threads  CPU time (min)  Wall clock time (min)
PBHoney-Spots  1        951             958
PBHoney-Spots  10       943             124
PBHoney-Tails  1        76              77
Sniffles       1        238             240
Sniffles       10       1712            233
SVIM           1        231             235

Simulation protocol
- Simulated genome (chr21 and chr22 only) with 100 deletions, 100 insertions, 100 tandem duplications, 100 interspersed duplications and 100 inversions between 100 bp and 10 kbp in size (RSVSim)
- Simulated long reads to 6-fold coverage (SimLoRD)
- Run different SV callers on simulated reads
- Compare detected SVs with simulated (correct) SVs (require 90% reciprocal overlap)
Bartenhagen, C., & Dugas, M. (2013). RSVSim: an R/Bioconductor package for the simulation of structural variations. Bioinformatics, 29(13), 1679-1681.
Stöcker, B. K., Köster, J., & Rahmann, S. (2016). SimLoRD: Simulation of long read data. Bioinformatics, 32(17), 2704-2706.
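The 90% reciprocal overlap criterion in the last step can be made concrete with a short sketch. Intervals are assumed to be half-open (start, end) pairs on the same chromosome, and the function names are illustrative, not taken from the evaluation scripts behind the talk.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions of intervals a and b."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))


def precision_recall(calls, truth, min_overlap=0.9):
    """A call is correct if it matches some simulated SV, and vice versa."""
    correct_calls = sum(
        any(reciprocal_overlap(c, t) >= min_overlap for t in truth) for c in calls
    )
    found_truth = sum(
        any(reciprocal_overlap(c, t) >= min_overlap for c in calls) for t in truth
    )
    precision = correct_calls / len(calls) if calls else 0.0
    recall = found_truth / len(truth) if truth else 0.0
    return precision, recall


calls = [(100, 200), (500, 600)]
truth = [(105, 200), (900, 1000)]
print(reciprocal_overlap(calls[0], truth[0]))
print(precision_recall(calls, truth))
```

Requiring the overlap in both directions prevents a very long call from counting as a match for a short simulated SV that it merely contains.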

Dotplot analysis
- Figure: dotplot of a read against the reference, illustrating a deletion