De Novo Viral Quasispecies Assembly using Overlap Graphs

Similar documents
arxiv: v2 [q-bio.pe] 21 Jan 2008

Reliable reconstruction of HIV-1 whole genome haplotypes reveals clonal interference and genetic hitchhiking among immune escape variants

Dr Rick Tearle Senior Applications Specialist, EMEA Complete Genomics Complete Genomics, Inc.

DETECTION OF LOW FREQUENCY CXCR4-USING HIV-1 WITH ULTRA-DEEP PYROSEQUENCING. John Archer. Faculty of Life Sciences University of Manchester

Towards Personalized Medicine: An Improved De Novo Assembly Procedure for Early Detection of Drug Resistant HIV Minor Quasispecies in Patient Samples

VirusDetect pipeline - virus detection with small RNA sequencing

The Open Bioinformatics Journal, 2014, 8, 1-5 1

Analysis of Massively Parallel Sequencing Data Application of Illumina Sequencing to the Genetics of Human Cancers

Review: Genome assembly Reads

To test the possible source of the HBV infection outside the study family, we searched the Genbank

Multiple sequence alignment

Viral genome sequencing: applications to clinical management and public health. Professor Judy Breuer

Table S1: Kinetic parameters of drug and substrate binding to wild type and HIV-1 protease variants. Data adapted from Ref. 6 in main text.

Future applications of full length virus genome sequencing

Alper Sarikaya 1, Michael Correll 2, Jorge M. Dinis 1, David H. O Connor 1,3, and Michael Gleicher 1

Chapter 1. Introduction

Nature Biotechnology: doi: /nbt.1904

Reducing INDEL calling errors in whole genome and exome sequencing data.

CONSTRUCTION OF PHYLOGENETIC TREE USING NEIGHBOR JOINING ALGORITHMS TO IDENTIFY THE HOST AND THE SPREADING OF SARS EPIDEMIC

Inferring genetic diversity from nextgeneration

VIP: an integrated pipeline for metagenomics of virus

Supplementary Figure 1. ALVAC-protein vaccines and macaque immunization. (A) Maximum likelihood

OncoPhase: Quantification of somatic mutation cellular prevalence using phase information

TOWARDS ACCURATE GERMLINE AND SOMATIC INDEL DISCOVERY WITH MICRO-ASSEMBLY. Giuseppe Narzisi, PhD Bioinformatics Scientist

Relative activity (%) SC35M

Applications of Causal Discovery Methods in Biomedicine

SVIM: Structural variant identification with long reads DAVID HELLER MAX PLANCK INSTITUTE FOR MOLECULAR GENETICS, BERLIN JUNE 2O18, SMRT LEIDEN

Characterizing intra-host influenza virus populations to predict emergence

Clinical utility of NGS for the detection of HIV and HCV resistance

Study the Evolution of the Avian Influenza Virus

COMPUTATIONAL OPTIMISATION OF TARGETED DNA SEQUENCING FOR CANCER DETECTION

Inter-country mixing in HIV transmission clusters: A pan-european phylodynamic study

Hands-On Ten The BRCA1 Gene and Protein

Unsupervised Identification of Isotope-Labeled Peptides

GENOME-WIDE DETECTION OF ALTERNATIVE SPLICING IN EXPRESSED SEQUENCES USING PARTIAL ORDER MULTIPLE SEQUENCE ALIGNMENT GRAPHS

Statistical analysis of RIM data (retroviral insertional mutagenesis) Bioinformatics and Statistics The Netherlands Cancer Institute Amsterdam

SCALPEL MICRO-ASSEMBLY APPROACH TO DETECT INDELS WITHIN EXOME-CAPTURE DATA. Giuseppe Narzisi, PhD Schatz Lab

Supplemental Materials and Methods Plasmids and viruses Quantitative Reverse Transcription PCR Generation of molecular standard for quantitative PCR

Analysis with SureCall 2.1

Sequence Analysis of Human Immunodeficiency Virus Type 1

Mapping evolutionary pathways of HIV-1 drug resistance using conditional selection pressure. Christopher Lee, UCLA

Mapping Evolutionary Pathways of HIV-1 Drug Resistance. Christopher Lee, UCLA Dept. of Chemistry & Biochemistry

Supplementary information for: Human micrornas co-silence in well-separated groups and have different essentialities

Shape-based retrieval of CNV regions in read coverage data. Sangkyun Hong and Jeehee Yoon*

Supplementary Methods

Illuminating the genetics of complex human diseases

Introduction to Bioinformatics

Supplementary Appendix

Supplementary Figure 1. Schematic diagram of o2n-seq. Double-stranded DNA was sheared, end-repaired, and underwent A-tailing by standard protocols.

Victor Guryev. European Research Institute for the Biology of Ageing

How to Standardise and Assemble Raw Data into Sequences: What Does it Mean for a Laboratory to Use Such Technologies?"

Antiretroviral Prophylaxis and HIV Drug Resistance. John Mellors University of Pittsburgh

Global variation in copy number in the human genome

Patterns of hemagglutinin evolution and the epidemiology of influenza

A Bioinformatics Method for Identifying RNA Structures within Human Cells

SUPPLEMENTARY INFORMATION In format provided by Javier DeFelipe et al. (MARCH 2013)

Colorspace & Matching

INTERfering and Co-Evolving Prevention and Therapy (INTERCEPT) Proposers Day

Distinguishing epidemiological dependent from treatment (resistance) dependent HIV mutations: Problem Statement

Project PRACE 1IP, WP7.4

PSSV User Manual (V2.1)

Bayesian (Belief) Network Models,

Modeling the Antigenic Evolution of Influenza Viruses from Sequences

Advance Your Genomic Research Using Targeted Resequencing with SeqCap EZ Library

Introduction to LOH and Allele Specific Copy Number User Forum

Evaluation of MIA FORA NGS HLA test and software. Lisa Creary, PhD Department of Pathology Stanford Blood Center Research & Development Group

MEDICAL GENOMICS LABORATORY. Next-Gen Sequencing and Deletion/Duplication Analysis of NF1 Only (NF1-NG)

2/10/2016. Evaluation of MIA FORA NGS HLA test and software. Disclosure. NGS-HLA typing requirements for the Stanford Blood Center

Citation for published version (APA): Von Eije, K. J. (2009). RNAi based gene therapy for HIV-1, from bench to bedside

HIV-1 acute infection: evidence for selection?

Learning Convolutional Neural Networks for Graphs

DNA Analysis Techniques for Molecular Genealogy. Luke Hutchison Project Supervisor: Scott R. Woodward

Host Genomics of HIV-1

VL Network Analysis ( ) SS2016 Week 3

Influenza hemagglutinin protein stability correlates with fitness and seasonal evolutionary dynamics

AVENIO ctdna Analysis Kits The complete NGS liquid biopsy solution EMPOWER YOUR LAB

Mutation Detection and CNV Analysis for Illumina Sequencing data from HaloPlex Target Enrichment Panels using NextGENe Software for Clinical Research

Parametric Inference using Persistence Diagrams: A Case Study in Population Genetics

Contents. Just Classifier? Rules. Rules: example. Classification Rule Generation for Bioinformatics. Rule Extraction from a trained network

Nature Methods: doi: /nmeth.3115

Below, we included the point-to-point response to the comments of both reviewers.

Sabin vaccine reversion in the field: a comprehensive analysis of Sabin-like poliovirus

IMPLEMENTATION OF GHOST FOR HAV OUTBREAK DETECTION

Abstract. Optimization strategy of Copy Number Variant calling using Multiplicom solutions APPLICATION NOTE. Introduction

Comparing Multifunctionality and Association Information when Classifying Oncogenes and Tumor Suppressor Genes

DeconRNASeq: A Statistical Framework for Deconvolution of Heterogeneous Tissue Samples Based on mrna-seq data

New Enhancements: GWAS Workflows with SVS

Using Network Flow to Bridge the Gap between Genotype and Phenotype. Teresa Przytycka NIH / NLM / NCBI

A Network Partition Algorithm for Mining Gene Functional Modules of Colon Cancer from DNA Microarray Data

Mapping by recurrence and modelling the mutation rate

Structural Variation and Medical Genomics

An Analysis of Genital Tract Derived HIV from Heterosexual Transmission Pairs. Debrah Boeras Emory University October 14, 2008

Rare Variant Burden Tests. Biostatistics 666

Name: Due on Wensday, December 7th Bioinformatics Take Home Exam #9 Pick one most correct answer, unless stated otherwise!

Exploring HIV Evolution: An Opportunity for Research Sam Donovan and Anton E. Weisstein

The Structure of Social Contact Graphs and their impact on Epidemics

NEXT GENERATION SEQUENCING OPENS NEW VIEWS ON VIRUS EVOLUTION AND EPIDEMIOLOGY. 16th International WAVLD symposium, 10th OIE Seminar

Analysis of the genetic diversity of influenza A viruses using next-generation DNA sequencing

Identification of Mutation(s) in. Associated with Neutralization Resistance. Miah Blomquist

Transcription:

De Novo Viral Quasispecies Assembly using Overlap Graphs Alexander Schönhuth joint with Jasmijn Baaijens, Amal Zine El Aabidine, Eric Rivals Milano 18th of November 2016

Viral Quasispecies Assembly: HaploClique Reference A. Töpfer, T. Marschall, R.A. Bull, F. Luciani, A. Schönhuth, N. Beerenwinkel Viral Quasispecies Assembly via Maximal Clique Enumeration PLoS Computational Biology, 10(3), e1003515, 2014 Joint last authorship

QUASISPECIES ASSEMBLY WORKFLOW

VIRAL VS. HUMAN HAPLOTYPING Human Virus # Haplotypes 2? (1 1000) Polymorphic locus biallelic up to all 4 nucleotides Low-frequency variants no yes Genome size 3 Giga 1 kilo 1 Mega Diversity 0.1 %? (1 10 %) Coverage 30-50x 20 000x

VIRAL QUASISPECIES ASSEMBLY: CHALLENGES Strains show at different frequencies Distinguish between closely related strains Low-frequency variants sequencing errors Exploit deep coverage

OVERLAP GRAPH Nodes: Probability distribution over sequences Edges: Connect two nodes P, Q, iff P(S) Q(S) > δ S {A,C,G,T}

OVERLAP GRAPH Edges: nodes (highly likely) reflect same haplotype Max-Clique: A (locally) maximal group of reads from the same haplotype

HAPLOCLIQUE WORKFLOW (PART I) 1. Construct overlap graph (edges as just shown) 2. Enumerate maximal cliques 3. Transform cliques into contigs (implicit error correction!)

HAPLOCLIQUE WORKFLOW (PART II) 60.0% 10.2% 7.3% 5.0% 3.1% 2.7% 1.0% 0.8% 0.2% Iteration 1. Graph construction to contig formation as just shown 2. Contigs are new nodes in next iteration Result: Contigs grow longer along iterations

FULL-LENGTH HAPLOTYPE RECONSTRUCTION Setting: HXB2 JRCSF NL4-3 YU2 89.6 20% 20% 20% 20% 20% gene-wise Hamming distance: 1-17% 600x data set Method Error rate Number of Error-free per bp haplotypes HaploClique 0.012 % 49 99.7 % ShoRAH 1.99 % 196 0 % PredictHaplo 1.65 % 95 0 % QuRe 2.95 % 15 0 %

Metagenome Gene Assembly: Snowball Reference I. Gregor, A. Schönhuth, A.C. McHardy Snowball: Strain Aware Gene Assembly of Metagenomes Bioinformatics, 32(17), i649-i657, 2016 Joint last authorship

METAGENOME GENE ASSEMBLY 3 genes (red, green, blue) for 3 strains Gene assembly reconstructs gene sequence from metagenome short read data

CONCLUSION Mutatis mutandis (different clustering algorithm): analogous framework yields first strain-aware (and not only species-aware) gene assembler

ASSEMBLING POLYPLOID GENOMES MANTRA 1. Construct an overlap graph: edges identical haplotypes 2. Identify suitable collections of cliques 3. Transform cliques into contigs 4. Back to 1.: contigs become nodes Outcome Contigs grow along iterations Contigs reflect individual haplotypes

VIRUSES EVOLVE RAPIDLY

VIRUSES EVOLVE RAPIDLY

MOTIVATION SUMMARY Virus infection through a cloud of mutant strains Viral quasispecies assembly virus pan-genome Mixed sample; need to construct individual haplotypes first Existing methods depend on high-quality reference What to do when there is a virus outbreak and there is no reference genome or the virus has diverged too far from the reference?

De Novo Viral Quasispecies Assembly: SAVAGE Reference J. Baaijens, A.Z. El Aabidine, E. Rivals, A. Schönhuth De novo viral quasispecies assembly using overlap graphs biorxiv:080341 Joint last authors

OVERLAP GRAPH BASED ASSEMBLY 1. Sequencing reads true mutation sequencing error 2. Overlap graph - full length reads - strong edge constraints 3. Contigs Repeat steps 2 and 3 until convergence!

Challenge 1: De novo overlap graph construction on deep sequencing data without reference alignments? ( 10 6 reads 10 12 potential overlaps...)

OVERLAP GRAPH CONSTRUCTION SAVAGE-de-novo Use FM-index + suffix filters [N. Välimäki, S. Ladra, V. Mäkinen. Approximate all-pairs suffix-prefix overlaps.] SAVAGE-ref Construct ad-hoc reference (VICUNA) [Yang et al. De novo assembly of highly diverse viral populations.] 1. compute read-to-reference alignments 2. get induced read-to-read alignments For all overlap candidates, also check overlap quality!

Challenge 2: Now we have the overlap graph, but it s still huge...

FROM OVERLAP GRAPH TO SUPER-READS Step 1: Read orientations Step 2: Transitive edges (-,+) (+,+) (+,-) u u' v v' w + + - double transitive single transitive non-transitive Step 3: Read clustering Step 4: Super-read construction & integrated error correction Cliques Cluster Consensus Error corrected sequencing error now iterate, back to step 1

Overlap graph-based assembly Overlap graph construction ALGORITHM OVERVIEW Algorithm stages: Reads Contigs Local assembly & error correction Global assembly Maximized contigs Illumina reads 1. Pairwise overlaps Undirected overlap graph Directed overlap graph Simplified overlap graph 2. Overlap quality check 3. Read orientations 4. Transitive edge removal Master strain assembly Master strains Super-reads & overlaps 5. Read clustering 7. Output contigs 6. Graph updating until convergence; Contigs

BENCHMARKING RESULTS Simulated Data: HIV 5-VIRUS-MIX (DI GIALLONARDO ET AL) PredictHaplo # contigs largest MAC ref. mis- 500bp contig length cov. N-rate matches indels HIV-1 ref 4 9710 0 79.5 0.659 1.374 0.091 VICUNA ref 3 9800 100 59.1 0.289 1.628 0.175 ShoRAH HIV-1 ref 39 9526 0 98.0 0.318 0.394 0.087 VICUNA ref 63 9657 1.6 97.2 0.307 1.356 0.151 MLEHaplo 185 9104 0 56.0 0 3.396 0.078 SAVAGE de novo 6 9663 0 97.1 0 0.009 0 HIV-1 ref 7 9633 0 97.0 0.004 0.005 0.041 VICUNA ref 9 9638 0 97.4 0 0.004 0 MAC = misassembled contigs; columns 3-7 = %

BENCHMARKING RESULTS Real Data: HIV 5-VIRUS-MIX (DI GALLIONARDO ET AL.) PredictHaplo # contigs largest MAC ref. mis- 500bp contig length cov. N-rate matches indels HIV-1 ref 5 9642 0 99.2 0.259 0.615 0.104 VICUNA ref 5 11000 100 94.5 0.425 0.011 0.136 ShoRAH HIV-1 ref 160 9581 0 98.9 0.378 3.203 0.113 VICUNA ref 169 10854 89.3 99.0 0.770 0.911 0.165 SAVAGE de novo 1937 3320 0 90.7 0.017 0.039 0.043 HIV-1 ref 1525 2493 0 90.1 0.013 0.145 0.041 VICUNA ref 1508 2849 0 89.7 0.014 0.136 0.037 MAC = misassembled contigs; columns 3-7 = %

EVALUATING ALGORITHM STAGES Reads Largest contig Mismatch rate 492 bp 1.276% Local assembly & error correction Contigs 2078 bp 0.045% Global assembly Maximized contigs 3320 bp 0.039% Master strain assembly Master strains 3320 bp 0.085% Data set: 5-virus-mix (Di Giallonardo et al) - https://github.com/cbg-ethz/5-virus-mix

CONCLUSIONS AND FUTURE WORK Challenges: unknown ploidy, low-frequency mutations and sequencing errors. Overlap graphs: full-length read information, detection of co-occurring mutations SAVAGE needs no prior information or reference genome Future work More efficient de novo overlap computations Extend the method to long reads (model indels!) Constructing virus pan-genomes

STRAIN DIVERGENCE: LIMITS SAVAGE-de-novo 10% Two-strain mixture ratio Two-strain mixture ratio 1:1 1:2 1:5 1:10 1:50 1:100 1:1 1:2 1:5 1:10 1:50 1:100 5% Divergence 2.5% 1% 0.75% 0.5% Total genome fraction (%) Overall mismatch rate (%)

Thanks for your attention! Reference J. Baaijens, A.Z. El Aabidine, E. Rivals, A. Schönhuth De novo viral quasispecies assembly using overlap graphs biorxiv:080341 Joint last authors