PSSV User Manual (V1.0)

1. Introduction

PSSV is a pattern-based probabilistic approach for identifying somatic structural variations (SVs) from whole-genome sequencing (WGS) data. Discordant and concordant read counts from paired tumor-normal samples are jointly modeled in a Bayesian framework. Each type of read count at an SV region is assumed to follow a mixture of Poisson distributions with three components, which denote non-mutation, heterozygous SV and homozygous SV in the tumor or normal sample. Each SV is then modeled as a mixture of hidden states representing different somatic and germline mutation patterns. A unique feature of this model is that it differentiates heterozygous from homozygous SVs in each sample, which enables the identification of somatic SVs that are heterozygous in the normal sample and homozygous in the tumor sample. An Expectation-maximization (EM) algorithm is used to iteratively estimate the model parameters and the posterior probability for each somatic SV region.

We present an example of the PSSV workflow using a pair of tumor and normal DNA-seq BAM files (obtained from patient sample TCGA-A2-A04P of the TCGA data set). More details of the workflow can be found in our paper: PSSV: A novel pattern-based probabilistic approach for somatic structural variation identification.

The R script of PSSV has been tested using R 3.2.2 under Ubuntu 12.04 64 bit.

2. DNA-seq data preprocessing

Starting from BAM-format WGS data of a tumor-normal pair, we use a hierarchical clustering approach (embedded in BreakDancer) to generate discordant read clusters in each sample. We then keep only those SVs for which (i) the number of discordant reads in the tumor sample is no less than 4 and (ii) the length of the mutation region is no longer than 2,000 bases. Next, we use GATK to calculate the average concordant read depth within each candidate SV as well as at its two flanks in both samples.
The concordant read depth within mutation regions improves the sensitivity of somatic SV prediction. The read depth at the two flanks is used to normalize local read counts, which reduces GC bias and makes the read counts comparable across all mutation regions. A flowchart of DNA-seq data preprocessing for somatic SV detection is shown in Fig. 1.
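
The flank normalization described above can be illustrated with a short Python sketch. The manual does not state the exact formula, so the ratio used here (region depth divided by the mean depth of the two flanks) is an assumption, and normalize_by_flanks is a hypothetical name that is not part of the PSSV R script.

```python
def normalize_by_flanks(region_depth, left_depth, right_depth):
    """Scale a region's mean read depth by the mean depth of its two flanks.

    Assumption: normalization is a simple ratio; the PSSV script may differ.
    Returns None when the flanks carry no usable signal.
    """
    flank_mean = (left_depth + right_depth) / 2.0
    if flank_mean <= 0:
        return None
    return region_depth / flank_mean


# Illustrative depths only (not taken from the TCGA example).
tumor_ratio = normalize_by_flanks(15.0, 31.0, 29.0)
normal_ratio = normalize_by_flanks(30.0, 31.0, 29.0)
print(tumor_ratio, normal_ratio)
```

Because each region is scaled by its own neighborhood, a ratio from a high-coverage locus can be compared with one from a low-coverage locus, which is the property the preprocessing step relies on.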

Fig. 1. Flowchart of DNA-seq data processing for somatic SV detection. (Tumor.bam and Normal.bam are clustered by BreakDancer into discordant read clusters; GATK computes read depth over the mutation region and its left and right flanks in each sample; the result for each candidate mutation region is the normalized discordant and concordant read counts in the tumor and normal samples.)

2.1 Clustering discordant reads for SV selection

We use BreakDancer to identify SV regions from a pair of tumor and normal whole-genome DNA-seq BAM files with the following commands:

    bam2cfg.pl -g -h TCGA-A2-A04P-01A.bam TCGA-A2-A04P-10A.bam > TCGA-A2-A04P.cfg
    BreakDancerMax.pl -t -q 0 -f -d TCGA-A2-A04P.cfg > TCGA-A2-A04P.ctx

Note that, for each SV, BreakDancer evaluates significance from the sum of discordant reads observed at the region. In our case, we are more interested in the difference between tumor and normal samples at each region, so this significance score cannot be used to select candidate regions for PSSV. Instead, we take all possible regions from the hierarchical clustering of discordant reads in each sample. In the above command, we set the threshold to -q 0 to obtain all regions.

We provide a small tool, breakdancer_seperate_one, to convert TCGA-A2-A04P.ctx into a more concise file, TCGA-A2-A04P-BreakDancer.txt:

    breakdancer_seperate_one TCGA-A2-A04P.ctx TCGA-A2-A04P-BreakDancer.txt

The data format of TCGA-A2-A04P-BreakDancer.txt is as follows:

    chr1  point1   chr2  point2     type  size  score  tumor  normal
    1     10211    1     10296      DEL   96    99     5      8
    1     10306    10    135524589  INS   -161  99     4      8
    1     10449    7     10001      INS   -108  58     0      3
    1     16450    1     16453      INS   -196  48     2      1
    1     99098    12    66451267   DEL   95    33     0      2
    1     136741   1     136875     DEL   127   46     1      1
    1     537692   1     537723     INS   -167  69     4      0
    1     565259   1     565311     INV   -304  99     13     0
    1     569310   1     569405     INS   -154  99     3      4

Now we have a list of SVs with discordant read counts in the tumor and normal samples.

2.2 Concordant read calculation for SVs

Based on the SV locations reported by BreakDancer, we provide a short R script, interval_transfer.r, to extract the SV locations as well as their two flanks. Three files, interval.bed, left.bed and right.bed, are created for each patient sample. The format of these three files meets the GATK requirements for read depth calculation.

    interval.bed            left.bed                right.bed
    1  10212    10296       1  10126    10211       1  10296    10381
    1  537693   537723      1  537642   537692      1  537723   537754
    1  565260   565311      1  565207   565259      1  565311   565363
    1  1034565  1034882     1  1034246  1034564     1  1034882  1034963
    1  1086842  1087027     1  1086655  1086841     1  1087027  1087213
    1  1088680  1088912     1  1088446  1088679     1  1088912  1089145
    1  1131210  1131260     1  1131158  1131209     1  1131260  1131311
    1  1598458  1598747     1  1598167  1598457     1  1598747  1599037

GATK is then used to calculate the average read depth over the intervals in each file, for the tumor and normal samples respectively, using the following commands:

    java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R hg19.fasta -o TCGA-A2-A04P-interval-tumor -I TCGA-A2-A04P-01A.bam -L interval.bed
    java -jar GenomeAnalysisTK.jar -T DepthOfCoverage -R hg19.fasta -o TCGA-A2-A04P-interval-normal -I TCGA-A2-A04P-10A.bam -L interval.bed

In total, six files are generated, containing read depth information at the SV regions and at their two flanks in the tumor and normal samples.

2.3 Detecting somatic SVs

After generating the discordant and concordant read information with the above pipeline, we provide an R script to detect somatic SVs.
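
Before running the R script, it can help to check the preprocessing output by hand. The following Python sketch applies the two selection rules of Section 2 to rows of the concise BreakDancer file and derives an SV interval with two flanks. It is not breakdancer_seperate_one or interval_transfer.r. The function names are hypothetical, and three choices are assumptions: region length is taken as point2 - point1, both breakpoints must lie on the same chromosome, and each flank is as wide as the region itself. The last choice reproduces the first row of the interval.bed, left.bed and right.bed example in Section 2.2 but not every row, so the actual script may use a different rule.

```python
def parse_row(line):
    """Parse one data row of the concise BreakDancer file into a dict."""
    f = line.split()
    return {"chr1": f[0], "point1": int(f[1]), "chr2": f[2], "point2": int(f[3]),
            "type": f[4], "size": int(f[5]), "score": int(f[6]),
            "tumor": int(f[7]), "normal": int(f[8])}


def keep_candidate(sv, min_tumor_reads=4, max_length=2000):
    """Rule (i): enough tumor discordant reads; rule (ii): region not too long."""
    same_chrom = sv["chr1"] == sv["chr2"]          # assumption, see lead-in
    length = sv["point2"] - sv["point1"]           # assumption, see lead-in
    return same_chrom and sv["tumor"] >= min_tumor_reads and 0 < length <= max_length


def flank_intervals(sv):
    """Return the SV interval and equally wide left and right flanks."""
    chrom, p1, p2 = sv["chr1"], sv["point1"], sv["point2"]
    width = p2 - p1                                # assumed flank width
    return {"interval": (chrom, p1 + 1, p2),
            "left": (chrom, max(p1 - width, 0), p1),
            "right": (chrom, p2, p2 + width)}


rows = """1 10211 1 10296 DEL 96 99 5 8
1 10306 10 135524589 INS -161 99 4 8
1 10449 7 10001 INS -108 58 0 3
1 537692 1 537723 INS -167 69 4 0
1 565259 1 565311 INV -304 99 13 0
1 569310 1 569405 INS -154 99 3 4""".splitlines()

candidates = [sv for sv in map(parse_row, rows) if keep_candidate(sv)]
for sv in candidates:
    print(flank_intervals(sv))
```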
After loading the previously generated files into the R workspace, we use the following function to integrate all read information:

    Candidate_regions<-Discordant_concordant_integration(Discordant_reads,
        Tumor_interval_location, Tumor_left_location, Tumor_right_location,
        Normal_interval_location, Normal_left_location, Normal_right_location,
        Tumor_avg_Coverage, Normal_avg_Coverage)
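
The integration step amounts to joining the discordant read counts with the six depth tables, one record per candidate region. The Python sketch below shows that join only. It is not Discordant_concordant_integration: the function name, the table names and the use of a region id as the join key are all assumptions, and the average coverage arguments of the R function are left out because the manual does not say how they are used.

```python
DEPTH_TABLES = ["tumor_interval", "tumor_left", "tumor_right",
                "normal_interval", "normal_left", "normal_right"]


def integrate_reads(discordant, depths):
    """Join per-region discordant counts with the six depth tables.

    discordant: {region_id: (tumor_count, normal_count)}
    depths:     {table_name: {region_id: mean_depth}}
    Regions missing from any depth table are skipped.
    """
    merged = {}
    for region_id, (tumor_disc, normal_disc) in discordant.items():
        if any(region_id not in depths[name] for name in DEPTH_TABLES):
            continue
        row = {"tumor_discordant": tumor_disc, "normal_discordant": normal_disc}
        for name in DEPTH_TABLES:
            row[name] = depths[name][region_id]
        merged[region_id] = row
    return merged


# Illustrative values only.
discordant = {"sv1": (5, 8), "sv2": (4, 0)}
depths = {name: {"sv1": 30.0} for name in DEPTH_TABLES}
print(integrate_reads(discordant, depths))
```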

Then, we detect each type of SV using the proposed PSSV method.

Detecting somatic deletions:

    DEL_results<-Somatic_del_detection(Candidate_regions, Tumor_avg_Coverage,

Fig. 2. Number of deletion regions assigned to each state.

Detecting somatic insertions:

    INS_results<-Somatic_ins_detection(Candidate_regions, Tumor_avg_Coverage,

Fig. 3. Number of insertion regions assigned to each state.

Detecting somatic inversions:

    INV_results<-Somatic_inv_detection(Candidate_regions, Tumor_avg_Coverage,

Fig. 4. Number of inversion regions assigned to each state.
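
The Introduction describes the statistical core of these functions as a three-component Poisson mixture fitted by EM. The Python sketch below fits such a mixture to a single vector of read counts. It is a generic one-dimensional illustration, not the PSSV model, which jointly models discordant and concordant counts from both samples and adds hidden states for mutation patterns. The function names, the starting values and the toy counts are all assumptions made for the example.

```python
import math


def poisson_logpmf(k, lam):
    """Log of the Poisson probability mass function."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def em_poisson_mixture(counts, lambdas, weights, n_iter=100):
    """Fit a Poisson mixture by EM.

    Returns the rates, the mixing weights and the responsibilities
    (posterior component probabilities) from the final E-step.
    """
    lambdas, weights = list(lambdas), list(weights)
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each count.
        resp = []
        for k in counts:
            logp = [math.log(w) + poisson_logpmf(k, lam)
                    for w, lam in zip(weights, lambdas)]
            top = max(logp)
            p = [math.exp(v - top) for v in logp]
            total = sum(p)
            resp.append([v / total for v in p])
        # M-step: re-estimate mixing weights and Poisson rates.
        for j in range(len(lambdas)):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / len(counts)
            lambdas[j] = max(sum(r[j] * k for r, k in zip(resp, counts)) / nj, 1e-6)
    return lambdas, weights, resp


# Toy counts standing in for non-mutation, heterozygous and homozygous regions.
counts = [0, 0, 1, 0, 1, 0, 5, 6, 4, 5, 11, 12, 10, 13]
lambdas, weights, resp = em_poisson_mixture(counts, [0.5, 5.0, 10.0], [1 / 3] * 3)
states = [max(range(3), key=lambda j: r[j]) for r in resp]
print(lambdas, states)
```

Taking the component with the largest responsibility for each region mirrors the way the output reports one state together with its posterior probability.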

Each output data frame (DEL_results, INS_results, INV_results) contains five variables, briefly described as follows:

    locations   #: chr, start location, end location
    reads       #: discordant read count in tumor sample, concordant read count in tumor sample, discordant read count in normal sample, concordant read count in normal sample
    lambda      #: estimated Poisson distribution parameters
    state       #: the one of six candidate states with the highest posterior probability
    probability #: posterior probability for the reported state

Fig. 5. An example of the PSSV results.

Finally, as described in our PSSV paper, we focus mainly on somatic SVs that lie in gene promoter and coding regions. Therefore, with the gene annotation file hg19_refseq.txt, we use the following command to annotate each region with a gene if the two overlap:

    genes_with_svs<-annotate_svs(transcripts_hg19, DEL_results, INS_results, INV_results)

Fig. 6. An example of gene annotation results.
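
The annotation step is an interval overlap test between each SV and each gene. The Python sketch below shows one way to write that test. It is not annotate_svs: the function names, the tuple layout of the transcript records and the promoter window (1,000 bases upstream of the transcription start site) are assumptions, and the gene names and coordinates are invented for illustration.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True when two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def annotate(svs, transcripts, promoter_bp=1000):
    """Pair each SV with every gene whose promoter or body it overlaps.

    svs:         [(chrom, start, end)]
    transcripts: [(gene, chrom, tx_start, tx_end, strand)]
    """
    hits = []
    for chrom, start, end in svs:
        for gene, g_chrom, tx_start, tx_end, strand in transcripts:
            if chrom != g_chrom:
                continue
            # Extend the transcript upstream of its start to cover the promoter.
            if strand == "+":
                lo, hi = tx_start - promoter_bp, tx_end
            else:
                lo, hi = tx_start, tx_end + promoter_bp
            if overlaps(start, end, lo, hi):
                hits.append((gene, chrom, start, end))
    return hits


transcripts = [("GENE_A", "1", 5000, 9000, "+"), ("GENE_B", "1", 20000, 25000, "-")]
svs = [("1", 4200, 4300), ("1", 25500, 25600), ("1", 12000, 12100), ("2", 4200, 4300)]
print(annotate(svs, transcripts))
```

The nested loop is quadratic in the number of records. That is acceptable for a few thousand SVs, but a sorted sweep or an interval tree would be the usual choice for genome-wide annotation.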