Colorspace & Matching

Similar documents
Dr Rick Tearle Senior Applications Specialist, EMEA Complete Genomics Complete Genomics, Inc.

Mutation Detection and CNV Analysis for Illumina Sequencing data from HaloPlex Target Enrichment Panels using NextGENe Software for Clinical Research

The Loss of Heterozygosity (LOH) Algorithm in Genotyping Console 2.0

Supplementary Methods

Exercises: Differential Methylation

ChIP-seq data analysis

P. Tang ( 鄧致剛 ); PJ Huang ( 黄栢榕 ) g( ); g ( ) Bioinformatics Center, Chang Gung University.

Introduction to LOH and Allele Specific Copy Number User Forum

SCALPEL MICRO-ASSEMBLY APPROACH TO DETECT INDELS WITHIN EXOME-CAPTURE DATA. Giuseppe Narzisi, PhD Schatz Lab

Incorporation of Imaging-Based Functional Assessment Procedures into the DICOM Standard Draft version 0.1 7/27/2011

Large-scale identity-by-descent mapping discovers rare haplotypes of large effect. Suyash Shringarpure 23andMe, Inc. ASHG 2017

Analysis with SureCall 2.1

CONNERS K-CPT 2. Conners Kiddie Continuous Performance Test 2 nd Edition C. Keith Conners, Ph.D.

Structural Variation and Medical Genomics

CHR POS REF OBS ALLELE BUILD CLINICAL_SIGNIFICANCE

Abstract. Optimization strategy of Copy Number Variant calling using Multiplicom solutions APPLICATION NOTE. Introduction

Nature Genetics: doi: /ng Supplementary Figure 1

Review: Genome assembly Reads

DNA-seq Bioinformatics Analysis: Copy Number Variation

Frequency(%) KRAS G12 KRAS G13 KRAS A146 KRAS Q61 KRAS K117N PIK3CA H1047 PIK3CA E545 PIK3CA E542K PIK3CA Q546. EGFR exon19 NFS-indel EGFR L858R

ChIP-seq hands-on. Iros Barozzi, Campus IFOM-IEO (Milan) Saverio Minucci, Gioacchino Natoli Labs

COMPLETE DOMINANCE. Autosomal Dominant Inheritance Autosomal Recessive Inheritance

Supplementary Appendix

genomics for systems biology / ISB2020 RNA sequencing (RNA-seq)

Measuring DNA Methylation with the MinION. Winston Timp Department of Biomedical Engineering Johns Hopkins University 12/1/16

Supplementary Figures

Probability-Based Protein Identification for Post-Translational Modifications and Amino Acid Variants Using Peptide Mass Fingerprint Data

Below, we included the point-to-point response to the comments of both reviewers.

Bio 312, Spring 2017 Exam 3 ( 1 ) Name:

Understanding DNA Copy Number Data

Data Mining in Bioinformatics Day 9: String & Text Mining in Bioinformatics

CITATION FILE CONTENT/FORMAT

Nature Biotechnology: doi: /nbt.1904

Computational Identification and Prediction of Tissue-Specific Alternative Splicing in H. Sapiens. Eric Van Nostrand CS229 Final Project

Regression CHAPTER SIXTEEN NOTE TO INSTRUCTORS OUTLINE OF RESOURCES

Supporting Information Identification of Amino Acids with Sensitive Nanoporous MoS 2 : Towards Machine Learning-Based Prediction

Genetic Approaches to Alcoholism, Alcohol Abuse Susceptibility, and Therapeutic Response. Ray White Verona March

Illuminating the genetics of complex human diseases

UNIT 3 GENETICS LESSON #30: TRAITS, GENES, & ALLELES. Many things come in many forms. Give me an example of something that comes in many forms.

Sensory Cue Integration

Erica J. Yoon Introduction

Using the Bravo Liquid-Handling System for Next Generation Sequencing Sample Prep

AVENIO family of NGS oncology assays ctdna and Tumor Tissue Analysis Kits

Computational aspects of ChIP-seq. John Marioni Research Group Leader European Bioinformatics Institute European Molecular Biology Laboratory

Supplementary Figures

Influenza Virus HA Subtype Numbering Conversion Tool and the Identification of Candidate Cross-Reactive Immune Epitopes

Improving Genome Assemblies without Sequencing

Analysis of Massively Parallel Sequencing Data Application of Illumina Sequencing to the Genetics of Human Cancers

What we mean more precisely is that this gene controls the difference in seed form between the round and wrinkled strains that Mendel worked with

SVIM: Structural variant identification with long reads DAVID HELLER MAX PLANCK INSTITUTE FOR MOLECULAR GENETICS, BERLIN JUNE 2O18, SMRT LEIDEN

Global assessment of genomic variation in cattle by genome resequencing and high-throughput genotyping

Breast and ovarian cancer in Serbia: the importance of mutation detection in hereditary predisposition genes using NGS

VirusDetect pipeline - virus detection with small RNA sequencing

Gene-microRNA network module analysis for ovarian cancer

Supplementary Figure 1. Estimation of tumour content

Nature Genetics: doi: /ng Supplementary Figure 1. Rates of different mutation types in CRC.

Studying Alternative Splicing

Cancer Gene Panels. Dr. Andreas Scherer. Dr. Andreas Scherer President and CEO Golden Helix, Inc. Twitter: andreasscherer

Nature Genetics: doi: /ng Supplementary Figure 1

MENDELIAN GENETICS. MENDEL RULE AND LAWS Please read and make sure you understand the following instructions and knowledge before you go on.

9.65 Sept. 12, 2001 Object recognition HANDOUT with additions in Section IV.b for parts of lecture that were omitted.

Genomic Epidemiology of Salmonella enterica Serotype Enteritidis based on Population Structure of Prevalent Lineages

Hands-On Ten The BRCA1 Gene and Protein

Nature Genetics: doi: /ng Supplementary Figure 1. SEER data for male and female cancer incidence from

KPU Registry ID Number: 1005 Date revised, 5 th December 2014 (This revised version was submitted prior to any data collection)

Paul Bennett, Microsoft Research (CLUES) Joint work with Ben Carterette, Max Chickering, Susan Dumais, Eric Horvitz, Edith Law, and Anton Mityagin.

MENDELIAN GENETICS. Punnet Squares and Pea Plants

Cerebral Cortex. Edmund T. Rolls. Principles of Operation. Presubiculum. Subiculum F S D. Neocortex. PHG & Perirhinal. CA1 Fornix CA3 S D

Mental Imagery. What is Imagery? What We Can Imagine 3/3/10. What is nature of the images? What is the nature of imagery for the other senses?

Supplementary Information. Supplementary Figures

Cytogenetics 101: Clinical Research and Molecular Genetic Technologies

A Likelihood-Based Framework for Variant Calling and De Novo Mutation Detection in Families

Northgate StruMap Reader/Writer

Laws of Inheritance. Bởi: OpenStaxCollege

How Many Options? Behavioral Responses to Two Versus Five Alternatives Per Choice. Martin Meissner Harmen Oppewal and Joel Huber

draw and interpret pedigree charts from data on human single allele and multiple allele inheritance patterns; e.g., hemophilia, blood types

Viral genome sequencing: applications to clinical management and public health. Professor Judy Breuer

Natural Scene Statistics and Perception. W.S. Geisler

Supervised Learner for the Prediction of Hi-C Interaction Counts and Determination of Influential Features. Tyler Yue Lab

STATISTICAL METHODS IN BIOLOGY

Incorporation of Imaging-Based Functional Assessment Procedures into the DICOM Standard Draft version 0.2 8/10/2011 I. Purpose

White Paper Estimating Complex Phenotype Prevalence Using Predictive Models

UNIT 6 GENETICS 12/30/16

De Novo Viral Quasispecies Assembly using Overlap Graphs

Gregor Mendel. What is Genetics? the study of heredity

Reducing INDEL calling errors in whole genome and exome sequencing data.

Whole Genome and Transcriptome Analysis of Anaplastic Meningioma. Patrick Tarpey Cancer Genome Project Wellcome Trust Sanger Institute

ACE ImmunoID Biomarker Discovery Solutions ACE ImmunoID Platform for Tumor Immunogenomics

Statistical Applications in Genetics and Molecular Biology

10CS664: PATTERN RECOGNITION QUESTION BANK

AutoOrthoGen. Multiple Genome Alignment and Comparison

BWA alignment to reference transcriptome and genome. Convert transcriptome mappings back to genome space

Neural Coding. Computing and the Brain. How Is Information Coded in Networks of Spiking Neurons?

For all of the following, you will have to use this website to determine the answers:

Nature Genetics: doi: /ng Supplementary Figure 1. Mutational signatures in BCC compared to melanoma.

Breast cancer. Risk factors you cannot change include: Treatment Plan Selection. Inferring Transcriptional Module from Breast Cancer Profile Data

DETECTION OF LOW FREQUENCY CXCR4-USING HIV-1 WITH ULTRA-DEEP PYROSEQUENCING. John Archer. Faculty of Life Sciences University of Manchester

VIP: an integrated pipeline for metagenomics of virus

No mutations were identified.

COMP9444 Neural Networks and Deep Learning 5. Convolutional Networks

Transcription:

Colorspace & Matching

Outline Color space and 2-base-encoding Quality Values and filtering Mapping algorithm and considerations Estimate accuracy Coverage 2 2008 Applied Biosystems

Color Space Properties Of 2-Base 2 Encoding (2BE) 5 3 Second Base 1 3 1 3 1 3 2 3 5 -A C G T A C G A T-3 3 -T G C A T G C T A-5 1 3 1 3 1 3 2 3 First Base 3 5 Two dibases that agree in just one base have different colors color(bd) color(be) and color(bd) color(ed) A dibase and its reverse have the same color color(bd) = color(db) Repeated-base dibases have the same color color(bb) = color(dd) 3 2008 Applied Biosystems

Valid Adjacent Color Substitutions C A T C G T C C T C T T Reverse Colors Other two colors (both orientations) Isolated single color changes are likely due to detection error 4 2008 Applied Biosystems

Invalid Adjacent Color Substitutions C A T A G T C G C T C T C C A G T T C T G Color changes inconsistent with SNP, likely due to detection error 5 2008 Applied Biosystems

Outline Color space and 2-base-encoding Quality Values and filtering Mapping algorithm and considerations Estimate accuracy Coverage 6 2008 Applied Biosystems

Quality Value (QV) For Color Call A score calculated based on the probability of an error call at that base Similar to those generated by phred and the KB Basecaller for capillary electrophoresis sequencing q = 10log10 p p = probability of color call error A QV score of 10 represent 10% error rate, whereas a QV score of 20 represents a 1% error rate 7 2008 Applied Biosystems

QVs How Are They Generated? Algorithm uses a variety of inputs, most important are angle of vector and intensity, and a lookup table (pre-computed from training data sets) to predict QV Each color call (cycle) is computed independently No information from preceding or later cycles, nor any sequence context is used 8 2008 Applied Biosystems

Challenge of QVs QVs can only be trained using color calls from reads matched to reference sequence, and QVs may not be as predictive for nonmatched reads Four main classes of beads Monoclonal beads good data Polyclonal beads - melded signal Failed beads sequence and chemistry specific Beads with no signal algorithm still assigns color Hard to distinguish different classes of bad beads by QV alone 9 2008 Applied Biosystems

Filtering By QV - Two Choices Filter Filter out reads based on QV values and patterns Reduces total amount of reads and raw error rate Increases % of reads mapped Lose some otherwise good matching reads Don t filter Filter the data by matching Low quality reads won t match 10 2008 Applied Biosystems

Outline Color space and 2-base-encoding Quality Values and filtering Mapping algorithm and considerations Estimate accuracy Coverage 11 2008 Applied Biosystems

Mapping Algorithm Challenge: In order to align a read to a reference, an initial hit (seed) is needed before beginning extension With short reads if using a continuous word pattern approach for mapping, then a very small word size is needed. This is more computationally and time intensive. Key feature: Multiple discontinuous word patterns allow for faster searching and guarantee finding all hits up to certain number of mismatches 12 2008 Applied Biosystems

Discontinuous Words Continuous words: searching for a perfect match 8/8 bases (word size 8, e.g. used by BLAST) ATTTTTTGGGTAGCCCCTTGGATGAGT AGGGGTAGCCTGATGATGGT Discontinuous words: searching 8/18 matches (effective word size is also 8) ATTTTTTGGGTAGCCCCTTGGATGAGT TTGACCGGCATGGGGGAT 110000011000001111 13 2008 Applied Biosystems

Mapping to reference with discontinuous words For a read length of 15, we can find all matches with 1 mismatch (15_1) using discontinuous words in the 3 schemas of word size 10 # schema_15_1 111111111100000 000001111111111?? Ref 002321031332122013220 1111111111 Read 321031333122013 Ref 002321031332122013220 1111111111 Read 321031333122013 111110000011111? Ref 002321031332122013220 11111 11111 Read 321031333122013 extend Ref 002321031332122013220 11111111 111111 Read 321031333122013 Using continuous words, word size must be at most 7 to find all matches with 1 mismatch 40 times slower than the three schemas above 14 2008 Applied Biosystems

Schema Set for 25-mer Read, With Word Size = 14 # 14 base index on 25, 0 mismatches 000000000011111111111111 # 14 base index on 25, 1 mismatches 111111111111110000000000 111110000000000111111111 000000000011111111111111 # 14 base index on 25, 2 mismatches 000001101111001110110111 110110001000101010111110 111101100111100100001001 100011110011110011011010 011110010100010111010011 101000011100111101101101 010101111011011001100100 More mismatch longer run time 15 2008 Applied Biosystems

Features Of Mapping Tool (mapreads( mapreads) Aligns in color space, translation of reference sequence done internally Fast, only allows mismatches (no indel) Allows adjacent mismatches to be counted as one, if consistent with SNP Allows masking of certain positions (bad calls) in the whole set of reads Reports all matches (up to the Z # limit) with up to M mismatches For fixed reference sequence, running time is linear with number of reads 16 2008 Applied Biosystems

What Are Unique Matches? Definition Unique read only maps to one location Beware it is only unique for the parameters used in alignment If see a single match when the alignment was carried out at 25_3, you have no information that there isn't a 25_4 hit. How is a unique match determined? Separation (parameter: clear zone) If the closest alternative is more than a defined number of errors away, then accept it When requiring a separation of 2, if a match at 25_2 exists, then no other match can occur before 25_4 Best wins: separation = 1, 25_1 is better than 25_2 Default in SOLiD pipelines: separation = max number mismatch allowed 17 2008 Applied Biosystems

How Many Mismatches To Tolerate? A SNP will generate two color mismatches Consider the SNP frequency in the genome when setting up number of mismatches allowed in a read The recommendation for a 50 base read 6 mismatches or 4 mismatches with valid adjacent mismatches treated as a single mismatch 18 2008 Applied Biosystems

Pairing Up The Mates A good mate pair: Orientation Ordering Distance 5' 3' F3 R3 R3 F3 3' 5' Mate rescue when only one of the tags mapped initially Try to map the other tag in expected region with relaxed parameters (allow more mismatches and indels in the read) window to scan for F3 5' 3' R3 3' 5' 19 2008 Applied Biosystems

Outline Color space and 2-base-encoding Quality Values and filtering Mapping algorithm and considerations Estimate accuracy Coverage 20 2008 Applied Biosystems

Accuracy Accuracy is affected by the matching parameters used Increasing number of mismatches allowed will increase number of reads that map and drive up the error rate The accuracy after applying 2-Base Encoding (2BE) rules improves significantly over raw color accuracy 21 2008 Applied Biosystems

Accuracy Assessment With A Known Genome 50 base read mapped with up to 6 MM (DH10B results) 97.60% 2.17% 0.14% 2.40% 0.09% Total number of correct CS calls Single mismatched calls Invalid Adjacent Valid Adjacent Accuracy = 99.91% Raw accuracy (before corrected by 2BE) 97.6% 22 2008 Applied Biosystems

Accuracy Effect of 2BE correction Base accuracy by position in read 100 QV 10 10 Percent Error 1 raw 20 0.1 corrected 30 0.01 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 Base Position in Read 23 2008 Applied Biosystems

Outline Color space and 2-base-encoding Quality Values and filtering Mapping algorithm and considerations Estimate accuracy Coverage 24 2008 Applied Biosystems

How Much Coverage Do You Need? For SNP 2-base encoding helps to reduce the coverage needed to detect SNP with high confidence Heterozygous SNP will need higher coverage, compared to homozygous, to detect both alleles If the coverage at a heterozygous position is less than 10X, the probability that one of the alleles will not be detected is 1% or more If the sample preparation method is likely to introduce some bias in allele ratio, coverage should be increased 25 2008 Applied Biosystems

SNP Detection At Different Coverages Wheeler et al PLoS Computational Biol. 2005 Oct;1(5):e53. Epub 2005 Oct 28. 26 2008 Applied Biosystems

Coverage - Variance In Different Genomic Regions Ideally, the coverage would follow a binomial distribution Possible reasons for deviations Characteristics of the reference genome (complexity, repeats, etc) The samples being sequenced (structural variation) Sample preparation and sequencing chemistry frequency coverage Mate-pair data (rather than fragment data) can largely, but not completely, overcome this problem Consistent contiguous regions of over/under-coverage may represent copy number variation Detection of SNPs or InDels in these regions should be treated with caution 27 2008 Applied Biosystems

Quiz Recite the SOLiD 2-base-encoding color table ;-) What s the main advantage of using 2BE with the SOLiD platform? For a given isolated SNP, how many colors will differ between the reference and sample read? For a position with QV = 20, what s the likelihood of error in color call at that position? Does the 'mapreads' tool employ discontinuous or continuous words patterns? What s recommended depth of coverage for SNP calling? homozygous? heterozygous? 28 2008 Applied Biosystems

Appendix

Expected Number Of Errors In A Read Given a read with QVs, one can estimate the expected number of errors in the read If the QV of the i-th call is q i, then the expected number of errors in the read is n qi /10 m = = 1.0022 10 i= 1 p i p i Accounts for q values being rounded to integer 30 2008 Applied Biosystems