Reducing INDEL calling errors in whole genome and exome sequencing data.

Similar documents
SCALPEL MICRO-ASSEMBLY APPROACH TO DETECT INDELS WITHIN EXOME-CAPTURE DATA. Giuseppe Narzisi, PhD Schatz Lab

TOWARDS ACCURATE GERMLINE AND SOMATIC INDEL DISCOVERY WITH MICRO-ASSEMBLY. Giuseppe Narzisi, PhD Bioinformatics Scientist

The next 10 years of quantitative biology

Illuminating the genetics of complex human diseases

Algorithms for studying the structure and function of genomes

Big Data Meets DNA How Biological Data Science is improving our health, foods, and energy needs

Algorithms for the analysis of complex genomes

Comprehensive Genome and Transcriptome Structural Analysis of a Breast Cancer Cell Line using PacBio Long Read Sequencing

Big Data Meets DNA How Biological Data Science is improving our health, foods, and energy needs

Algorithms for de novo genome assembly and disease analytics

Human Genetics and Plant Genomics: The long and the short of it

Algorithms for de novo genome assembly and disease analytics

SUPPLEMENTARY INFORMATION

Algorithms for de novo genome assembly and disease analytics

Dr Rick Tearle Senior Applications Specialist, EMEA Complete Genomics Complete Genomics, Inc.

Genome Wide Variant Analysis of Simplex Autism Families with an Integrative Clinical-Bioinformatics Pipeline

SAPLING: A Tool for Gene Network Analysis focusing on Psychiatric Genetics

Advance Your Genomic Research Using Targeted Resequencing with SeqCap EZ Library

Clinical Genomics of Neuropsychiatric Illnesses. Gholson Lyon, M.D. Ph.D.

Transcriptome and isoform reconstruc1on with short reads. Tangled up in reads

Lecture 20. Disease Genetics

Abstract. Optimization strategy of Copy Number Variant calling using Multiplicom solutions APPLICATION NOTE. Introduction

Sequencing studies implicate inherited mutations in autism

Nature Biotechnology: doi: /nbt.1904

Interactive analysis and quality assessment of single-cell copy-number variations

De Novo Viral Quasispecies Assembly using Overlap Graphs

De novo assembly of complex genomes

CRISPR/Cas9 Enrichment and Long-read WGS for Structural Variant Discovery

De Novo Gene Disruptions in Children on the Autistic Spectrum

Ginkgo Interactive analysis and quality assessment of single-cell CNV data

Performance Characteristics BRCA MASTR Plus Dx

Analysis of Massively Parallel Sequencing Data Application of Illumina Sequencing to the Genetics of Human Cancers

Rare Variant Burden Tests. Biostatistics 666

Characterisation of structural variation in breast. cancer genomes using paired-end sequencing on. the Illumina Genome Analyser

Analysis with SureCall 2.1

Multiplex target enrichment using DNA indexing for ultra-high throughput variant detection

Evaluation of MIA FORA NGS HLA test and software. Lisa Creary, PhD Department of Pathology Stanford Blood Center Research & Development Group

2/10/2016. Evaluation of MIA FORA NGS HLA test and software. Disclosure. NGS-HLA typing requirements for the Stanford Blood Center

Frequency(%) KRAS G12 KRAS G13 KRAS A146 KRAS Q61 KRAS K117N PIK3CA H1047 PIK3CA E545 PIK3CA E542K PIK3CA Q546. EGFR exon19 NFS-indel EGFR L858R

Lecture 8 Understanding Transcription RNA-seq analysis. Foundations of Computational Systems Biology David K. Gifford

Calling DNA variants SNVs, CNVs, and SVs. Steve Laurie Variant Effect Predictor Training Course Prague, 6 th November 2017

Victor Guryev. European Research Institute for the Biology of Ageing

Colorspace & Matching

Data mining with Ensembl Biomart. Stéphanie Le Gras

SVIM: Structural variant identification with long reads DAVID HELLER MAX PLANCK INSTITUTE FOR MOLECULAR GENETICS, BERLIN JUNE 2O18, SMRT LEIDEN

Single-strand DNA library preparation improves sequencing of formalin-fixed and paraffin-embedded (FFPE) cancer DNA

Mutation Detection and CNV Analysis for Illumina Sequencing data from HaloPlex Target Enrichment Panels using NextGENe Software for Clinical Research

Introduction to LOH and Allele Specific Copy Number User Forum

RNA-Seq guided gene therapy for vision loss. Michael H. Farkas

Golden Helix s End-to-End Solution for Clinical Labs

CITATION FILE CONTENT/FORMAT

Below, we included the point-to-point response to the comments of both reviewers.

TADA: Analyzing De Novo, Transmission and Case-Control Sequencing Data

Nature Genetics: doi: /ng Supplementary Figure 1. PCA for ancestry in SNV data.

Transform genomic data into real-life results

Genetics and Genomics in Medicine Chapter 8 Questions

Alper Sarikaya 1, Michael Correll 2, Jorge M. Dinis 1, David H. O Connor 1,3, and Michael Gleicher 1

AVENIO family of NGS oncology assays ctdna and Tumor Tissue Analysis Kits

The contribution of de novo coding mutations to autism spectrum disorder

Fluxion Biosciences and Swift Biosciences Somatic variant detection from liquid biopsy samples using targeted NGS

NGS in tissue and liquid biopsy

White Paper. Copy number variant detection. Sample to Insight. August 19, 2015

Detection of copy number variations in PCR-enriched targeted sequencing data

A Likelihood-Based Framework for Variant Calling and De Novo Mutation Detection in Families

VARIANT PRIORIZATION AND ANALYSIS INCORPORATING PROBLEMATIC REGIONS OF THE GENOME ANIL PATWARDHAN

To test the possible source of the HBV infection outside the study family, we searched the Genbank

SUPPLEMENTARY INFORMATION

ACE ImmunoID Biomarker Discovery Solutions ACE ImmunoID Platform for Tumor Immunogenomics

How many disease-causing variants in a normal person? Matthew Hurles

Case Study. Overview. Deleterious MLH1 mutation detected on sequencing 10/16/2014

Future applications of full length virus genome sequencing

Summer Institute in Statistical Genetics University of Washington, Seattle Forensic Genetics Module

The feasibility of circulating tumour DNA as an alternative to biopsy for mutational characterization in Stage III melanoma patients

Strength of functional signature correlates with effect size in autism

Using the Bravo Liquid-Handling System for Next Generation Sequencing Sample Prep

Supplementary Information to

Lynch Syndrome and COLARIS Testing

NGS for Cancer Predisposition

Daehwan Kim September 2018

New Enhancements: GWAS Workflows with SVS

NGS ONCOPANELS: FDA S PERSPECTIVE

MEDICAL GENOMICS LABORATORY. Next-Gen Sequencing and Deletion/Duplication Analysis of NF1 Only (NF1-NG)

ACE ImmunoID. ACE ImmunoID. Precision immunogenomics. Precision Genomics for Immuno-Oncology

Transcript reconstruction

Review: Genome assembly Reads

Detecting Identity by Descent and Homozygosity Mapping in Whole-Exome Sequencing Data

Entering the era of mega-genomics

An ultrasensitive method for quantitating circulating tumor DNA with broad patient coverage

NGS IN ONCOLOGY: FDA S PERSPECTIVE

The Autism Sequencing Consortium: Large-Scale, High-Throughput Sequencing in Autism Spectrum Disorders

The Importance of Coverage Uniformity Over On-Target Rate for Efficient Targeted NGS

No mutations were identified.

Accel-Amplicon Panels

Genetic Approaches to Alcoholism, Alcohol Abuse Susceptibility, and Therapeutic Response. Ray White Verona March

5/2/18. After this class students should be able to: Stephanie Moon, Ph.D. - GWAS. How do we distinguish Mendelian from non-mendelian traits?

Long non-coding RNAs

CNV Detection and Interpretation in Genomic Data

Breast and ovarian cancer in Serbia: the importance of mutation detection in hereditary predisposition genes using NGS

Cytogenetics 101: Clinical Research and Molecular Genetic Technologies

Transcription:

Reducing INDEL calling errors in whole genome and exome sequencing data. Han Fang November 8, 2014 CSHL Biological Data Science Meeting

Acknowledgments Lyon Lab Yiyang Wu Jason O Rawe Laura J Barron Max Doerfel Constantine Hartofilis Schatz Lab Giuseppe Narzisi Hayan Lee James Gurtowski Tyler Garvin Maria Nattestad Srividya Ramakrishnan Gholson Lyon Michael Schatz Colleagues from Cold Spring Harbor Laboratory Sara Ballouz Wim Verleyen Jesse Gillis Shane McCarthy Eric Antoniou Elena Ghiban Melissa Kramer Stephanie Muller Senem M Eskipehlivan Michael Wigler Ivan Iossifov Michael Ronemus Julie Rosenbaum Rob Aboukhalil IT department

Significantly higher rates of de novo frame-shifts & LGDs in the affected vs. unaffected siblings The contribution of de novo coding mutations to autism spectrum disorder. Iossifov I, O Roak BJ, Sanders SJ, Ronemus M, et al. (2014) Nature. doi:10.1038/nature13908

Sources of INDEL calling errors?

Scalpel: Haplotype Microassembly Extract reads mapping within the exon including (1) well-mapped reads, (2) softclipped reads, and (3) anchored pairs Decompose reads into overlapping k-mers and construct de Bruijn graph from the reads. Find end-to-end haplotype paths spanning the region. Align assembled sequences to reference to detect mutations. Accurate de novo and transmitted indel detection in exome-capture data using microassembly. Narzisi G, O'Rawe JA, Iossifov I, Fang H, Lee YH, Wang Z, Wu Y, Lyon GJ, Wigler M, Schatz MC (2014) Nature Methods. doi: 10.1038/nmeth.3069

Scalpel INDEL Validation 1000 INDELs selected for validation 200 Scalpel-specific 200 GATK HapCaller-specific 200 SOAPindel-specific 200 within the intersection 200 long indels (>30bp)

Concordance between WGS and WES data. Reducing INDEL errors in whole genome and exome sequencing data. Fang H, Wu Y, Narzisi G, O'Rawe JA, Jimenez Barrón LT, Rosenbaum J, Ronemus M, Iossifov I, Schatz MC*, Lyon, GJ* (2014) Genome Medicine. doi: 10.1186/s13073-014-0089-z

Validation results The validation rate of WGS-WES intersection INDELs was in fact very high (95%). Accuracy of INDEL detection with WES is much lower than that with WGS. The WES-specific set had a much smaller fraction of large INDELs. WGS-WES intersection! INDELs! Valid! PPV! INDELs (>5bp)! Valid (>5bp)! PPV (>5bp)! 160! 152! 95.0%! 18! 18! 100%! WGS-specific! 145! 122! 84.1%! 33! 25! 75.8%! WES-specific! 161! 91! 56.5%! 1! 1! 100%!

Example of WES missing a large INDEL WGS WES

Coverage distributions (WGS-specific INDELs regions) C v =75.3% C v =281.5%

Introducing the k-mer Chi-Square scores in Scalpel a) b) In a), C!!"# = 52, C!!"# = 48, so χ! = (!"!!")!!" + (!"!!")! =0.16!!" In b), C!!"# = 90, C!!"# = 10, so χ! = (!"!!")!!" + (!"!!")!!" = 64! Figures are customized from http://cdn.vanillaforums.com/gatk.vanillaforums.com/fileupload/a4/5ac06fc8af4b1b0c474f03e45f9017.png

Benchmarking Effectively distinguish behaviours of problematic INDEL calls from likely true-positives. Can be easily applied to screen INDEL calls and understand their characteristics. 35 30 False Discovery Rate (%) 25 20 15 10 5 0 0 2 4 6 8 10 12 Chi-Square Threshold (<T) 5x 10x 20x 30x 40x 50x 8 < : High quality INDELs (low error-rate - 7%): Low quality INDELs (high error-rate - 51%): 2 apple 2.0 if Co Alt apple 5 2 apple 4.5 if Co Alt apple 10 2 apple 10.8 if Co Alt > 10 2 10.8 if C Alt o apple 10

WGS yielded more high-quality INDELs than WES. Poly-A/T is a major contributor to the low quality INDELs, which gives rise to much more errors in the WES-specific set.

Concordance between standard WGS & PCR-free data

PCR-free data yielded more high-quality INDELs. PCR amplification induced many error-prone poly-a/t INDELs to the library; reducing the rate of amplification could effectively increase calling quality.

60X WGS is needed to recover 95% of INDEL. Detection of het INDELs requires higher coverage.

60X WGS is needed to recover 95% of INDEL. Detection of het INDELs requires higher coverage.

Summary Discussed: 1) Introducing a highly accurate & open-source algorithm, Scalpel (http://scalpel.sourceforge.net/) 2) Higher accuracy of INDEL detection with WGS data than that with WES data. 3) WES data has more false-positives, and misses a lot of large INDELs. 4) STR regions: major sources of INDEL errors, especially near A/T homopolymers. 5) Identify the errors introduced by PCR amplifications and caution about them. Implications: 1) Recommend WGS data for INDEL analysis (60X PCR-free). 2) Classification scheme of INDEL calls based off of Chi-Square scores and alternative allele coverage.