DNA-seq Bioinformatics Analysis: Copy Number Variation
Elodie Girard (elodie.girard@curie.fr)
U900, Institut Curie, INSERM, Mines ParisTech, PSL Research University, Paris, France
NGS Applications
- 5C
- Hi-C
- DNA-seq
- ChIP-seq
- RNA-seq
[Figure: ENCODE project]
Whole Genome or Target DNA Sequencing
- Whole genome sequencing vs target sequencing
- Amplicon sequencing: sequencing of a dedicated panel of genes/hotspots
- Library preparation steps shown: DNA fragmentation, hybridization, PCR amplification, separation, elution
- Platforms shown: Illumina technology (HiSeq, MiSeq, NextSeq); mostly IonTorrent technology (PGM/Proton)
DNA-seq: some applications
Detection of:
- Single Nucleotide Variations (SNVs) & Insertions/Deletions (Indels)
  - in germline samples
  - in tumor samples (somatic analysis)
  [Figure: example of a 3 bp deletion and of an SNV]
- Copy Number Variations (CNVs): large duplication or deletion events
  - Tumor/Germline exomes
  - Germline events among a cohort
  - Distinction between Capture & Amplicon sequencing
  [Figure: normalized copy number by genomic position (3-kb window), chr5]
Variant Calling: germline vs somatic
Expectation:
- Homozygous: 0/0 (0% alternative allele) or 1/1 (100% alternative allele)
- Heterozygous: 0/1 (50% reference / 50% alternative)
Reality in normal samples (sequencing data is noisy), by allelic ratio:
- 0/0: [0%, 25%)
- 0/1: [25%, 75%]
- 1/1: (75%, 100%]
Reality in tumor samples (tumor = mixture of tumor clones & normal cells):
- 0/1: (0%, 50%]
- 1/1: (0%, 100%]
[Figure: allelic ratio (%) for 0/0, 0/1 and 1/1; tumor sample composed of normal cells and a major clone]
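The germline thresholds above can be turned into a small genotype caller. This is only a sketch: the function name `call_genotype` and its arguments are hypothetical, and the cut-offs are simply the 25% and 75% limits from the slide, not those of any particular variant caller.

```python
def call_genotype(alt_reads: int, total_reads: int) -> str:
    """Classify a germline genotype from the alternative-allele ratio.

    Thresholds follow the slide: 0/0 for [0%, 25%), 0/1 for [25%, 75%],
    1/1 for (75%, 100%]. They are not suited to tumor samples, where the
    allelic ratio also depends on the mixture of clones and normal cells.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= alt_reads <= total_reads:
        raise ValueError("alt_reads must be between 0 and total_reads")
    ratio = alt_reads / total_reads
    if ratio < 0.25:
        return "0/0"
    if ratio <= 0.75:
        return "0/1"
    return "1/1"
```

For example, `call_genotype(48, 100)` falls in the heterozygous interval, while a position with 2 alternative reads out of 100 is treated as noise on a homozygous reference site.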
CNV detection methods
Different methods:
- FISH: fluorescence in situ hybridization
- aCGH: array comparative genomic hybridization
- SNP array: genome-wide SNP array
- HTS: high-throughput sequencing
Previous Workflow
Inputs: Reference Genome (Fasta), Reads (Fastq), Target regions (bed)
Steps:
- Quality Control (QC1/QC2)
- Mapping
- PCR duplicates marking (MarkDup) /!\ Not for small targets / amplicon design /!\
- QC3
- Target intersection (Intersect Bam)
- [Optional] Preprocess part 1: local realignment around indels
- [Optional] Preprocess part 2: Base Quality Score Recalibration
Output: aligned and preprocessed reads (BAM)
- Marked PCR duplicates
- Intersected on target regions
- Realigned around indels
- Recalibrated
Copy Number Variation
- Germlines: aligned and preprocessed reads (BAM) -> Copy Number Variation
- Tumor + Germline: aligned and preprocessed reads (BAM) -> Copy Number Alteration
Copy Number Variation in Exome-seq
Detection of large-scale variation events: amplification, gain, loss, deletion
Criteria (Somatic, Germline/Somatic):
- %GC/mappability normalization
- Ploidy/cellularity estimation
- LOH detection (allele-specific)
- Sub-clonal event detection
- Absence of a control sample
No tool meets all of these criteria yet:
- Sequenza, TITAN, FACETS: %GC normalization, allele-specific (LOH), cellularity estimation, sub-clonal events (FACETS, TITAN), but they require a normal sample
- CopywriteR: uses off-target reads to estimate CNV without normal samples
- Contra: germline events, using reference germlines to normalize read depth
Metrics
Source: Zhao et al., BMC Bioinformatics 2013
Normalization
Normalization by:
- GC content
- Mappability (uniqueness of the region)
- Matched normal sample
B-Allele Frequency (BAF):
- Homozygous SNPs: BAF at 0 (AA) or 1 (BB)
- Heterozygous SNPs: BAF at 0.5 (AB)
- Allelic imbalance: intermediate values (AAB: 66%/33%)
- Helps assess copy number
- Makes it possible to detect LOH (loss of heterozygosity) in somatic samples
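Both ideas on this slide can be illustrated with a short sketch. The names `expected_baf` and `gc_normalize` are hypothetical, and the GC correction (dividing each window by the median depth of windows in the same 5% GC bin) is one simple, commonly used scheme assumed here for illustration; it is not taken from a specific tool.

```python
from collections import defaultdict
from statistics import median


def expected_baf(allelic_state: str) -> float:
    """Expected B-allele frequency for a state such as 'AA', 'AB' or 'AAB'."""
    state = allelic_state.upper()
    if not state or set(state) - {"A", "B"}:
        raise ValueError("allelic_state must be a non-empty string of A and B")
    return state.count("B") / len(state)


def gc_normalize(depths, gc_fractions, bin_percent=5):
    """Median-based GC correction of per-window read depths.

    Each window is divided by the median depth of the windows that share
    its GC bin, then rescaled by the global median depth.
    """
    if len(depths) != len(gc_fractions):
        raise ValueError("depths and gc_fractions must have the same length")
    if not depths:
        return []
    # Group window depths by GC bin (bins of bin_percent percentage points).
    gc_bins = [int(round(gc * 100)) // bin_percent for gc in gc_fractions]
    depths_by_bin = defaultdict(list)
    for depth, gc_bin in zip(depths, gc_bins):
        depths_by_bin[gc_bin].append(depth)
    bin_median = {b: median(v) for b, v in depths_by_bin.items()}
    global_median = median(depths)
    normalized = []
    for depth, gc_bin in zip(depths, gc_bins):
        m = bin_median[gc_bin]
        normalized.append(depth * global_median / m if m > 0 else 0.0)
    return normalized
```

`expected_baf("AAB")` returns one third, the 33% side of the 66%/33% imbalance listed above. After `gc_normalize`, windows from GC-poor and GC-rich regions are brought to a comparable scale before any copy number is inferred.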
Copy Number Variation & Amplicon-seq [not in Galaxy]
- CNVPanelizer: germline CNV events, using reference samples (other germlines)
- IonCopy: CNA detection without a matched normal sample
Recurrent events [not in Galaxy]
Fragl plot: representation of the frequency of each event among the cohort (red: gain/amplification, green: loss/deletion)
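The quantity plotted for recurrent events is the per-region frequency of gains and losses across the cohort. A minimal sketch of that computation follows; the function name `event_frequencies`, the input layout, and the region labels in the usage note are all hypothetical.

```python
from collections import Counter


def event_frequencies(calls_by_sample):
    """Frequency of gain and loss calls per region across a cohort.

    calls_by_sample maps a sample name to a dict {region: status}, where
    status is "gain", "loss" or anything else (treated as neutral).
    Returns {region: {"gain": fraction, "loss": fraction}}.
    """
    n_samples = len(calls_by_sample)
    if n_samples == 0:
        raise ValueError("at least one sample is required")
    gains, losses = Counter(), Counter()
    regions = set()
    for calls in calls_by_sample.values():
        for region, status in calls.items():
            regions.add(region)
            if status == "gain":
                gains[region] += 1
            elif status == "loss":
                losses[region] += 1
    return {
        region: {
            "gain": gains[region] / n_samples,
            "loss": losses[region] / n_samples,
        }
        for region in sorted(regions)
    }
```

With four samples of which three carry a gain on a region labelled `"chr5:q"`, the function reports a gain frequency of 0.75 for that region, which would be drawn in red on the plot.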
Workflow
- Germline branch: Germline 1 ... Germline N, aligned and preprocessed reads (BAM) -> Germline Copy Number Detection -> [not in Galaxy] cohort comparison, literature comparison
- Somatic branch: Tumor + Germline, aligned and preprocessed reads (BAM) -> Somatic Copy Number Detection -> [not in Galaxy] cohort comparison, literature comparison
Dataset for somatic CNV
Public data:
- Pair of lung adenocarcinoma tumor & matched germline
- Paired-end reads of 100 bp, Illumina HiSeq2000
- Available in EBI-SRA: ERP001071. Corresponding RNA-seq data available (ERP001058, Ju et al., Genome Res, 2012)
Use of Sequenza: uses paired tumor-normal DNA sequencing data to estimate tumor cellularity and ploidy and to calculate allele-specific copy number profiles
Workflow branch: Tumor + Germline, aligned and preprocessed reads (BAM) -> Somatic Copy Number Detection -> [not in Galaxy] cohort comparison, literature comparison
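To see why cellularity and ploidy matter, it helps to compute the signals expected from a mixture of tumor and normal cells. The sketch below uses a standard mixture model with a diploid normal genome. The function names and parameters are hypothetical, and this is an illustration of the principle rather than Sequenza's actual implementation.

```python
def expected_depth_ratio(tumor_cn, cellularity, ploidy, normal_cn=2):
    """Expected tumor/normal depth ratio for a segment.

    tumor_cn: copy number of the segment in tumor cells
    cellularity: fraction of tumor cells in the sample (0 to 1)
    ploidy: average copy number of the tumor genome
    """
    if not 0 <= cellularity <= 1:
        raise ValueError("cellularity must be between 0 and 1")
    segment = cellularity * tumor_cn + (1 - cellularity) * normal_cn
    genome_average = cellularity * ploidy + (1 - cellularity) * normal_cn
    if genome_average <= 0:
        raise ValueError("average copy number must be positive")
    return segment / genome_average


def expected_tumor_baf(tumor_cn, b_copies, cellularity, normal_cn=2, normal_b=1):
    """Expected BAF at a germline heterozygous SNP in a tumor sample.

    b_copies: number of B-allele copies in tumor cells (0 <= b_copies <= tumor_cn)
    """
    if not 0 <= cellularity <= 1:
        raise ValueError("cellularity must be between 0 and 1")
    if not 0 <= b_copies <= tumor_cn:
        raise ValueError("b_copies must be between 0 and tumor_cn")
    b_signal = cellularity * b_copies + (1 - cellularity) * normal_b
    total = cellularity * tumor_cn + (1 - cellularity) * normal_cn
    if total <= 0:
        raise ValueError("total copy number must be positive")
    return b_signal / total
```

Under this model, a copy-neutral LOH segment (two copies, no B allele) in a sample with 50% tumor cells gives a BAF of 0.25 instead of 0, while its depth ratio stays at 1. This is why depth ratio and BAF are interpreted jointly.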
Import Data
1. Go to http://sigenae-workbench.toulouse.inra.fr/galaxy
2. Go to «Shared Data» in the top menu, then «Published Histories»
3. Click on «TP_CNV_SEQUENZA_FILES», then on «Import History»
Create Mpileup files
- In the search bar, find «Mpileup», which creates a per-base description of the alignments.
- Run it on the tumor BAM file, then repeat for the normal BAM file.
- Rename the outputs to distinguish Tumor from Germline.
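To make the "per-base description" concrete, here is a sketch of a parser for one line of samtools-style pileup text (tab-separated columns: chromosome, position, reference base, depth, read bases, base qualities). The function names are hypothetical, and the parser assumes well-formed single-sample lines.

```python
from collections import Counter


def count_pileup_bases(bases: str, ref: str) -> Counter:
    """Count the bases observed in the read-bases column of a pileup line."""
    counts = Counter()
    ref = ref.upper()
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            # Start of a read: the next character encodes mapping quality.
            i += 2
            continue
        if c in "+-":
            # Indel: sign, length in digits, then the inserted/deleted bases.
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            i = j + int(bases[i + 1:j])
            continue
        if c in ".,":
            counts[ref] += 1          # match to the reference base
        elif c.upper() in "ACGTN":
            counts[c.upper()] += 1    # mismatch on either strand
        # '$' (end of read), '*' (deleted base) and other symbols are skipped.
        i += 1
    return counts


def parse_pileup_line(line: str):
    """Return (chrom, position, ref, depth, base counts) for one pileup line."""
    chrom, pos, ref, depth, bases = line.rstrip("\n").split("\t")[:5]
    return chrom, int(pos), ref.upper(), int(depth), count_pileup_bases(bases, ref)
```

Applied to the tumor and germline pileups in turn, such counts give the per-position depth and the number of reads supporting each allele, which are the raw inputs for depth ratio and BAF.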
Call CNV using Sequenza
- In the search bar, find «Sequenza».
- It uses the read depth ratio (tumor/normal) and the BAF extracted from the mpileup files.
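The two observed signals can be sketched from per-position allele counts. The function name `observed_signals`, the input layout, and the default thresholds are assumptions for illustration. The depth ratio here is left unscaled, whereas a real analysis would also correct for library size and GC content.

```python
def observed_signals(tumor, normal, het_range=(0.25, 0.75), min_normal_depth=10):
    """Compute depth ratio and tumor BAF per position.

    tumor, normal: dicts {position: (ref_count, alt_count)}.
    BAF is reported only where the normal sample looks heterozygous,
    since only those positions are informative for allelic imbalance.
    Returns {position: (depth_ratio, tumor_baf or None)}.
    """
    low, high = het_range
    results = {}
    for pos in sorted(set(tumor) & set(normal)):
        t_ref, t_alt = tumor[pos]
        n_ref, n_alt = normal[pos]
        t_depth = t_ref + t_alt
        n_depth = n_ref + n_alt
        if n_depth < min_normal_depth or n_depth == 0:
            continue  # too little germline coverage to trust the ratio
        ratio = t_depth / n_depth
        baf = None
        if low <= n_alt / n_depth <= high and t_depth > 0:
            baf = t_alt / t_depth
        results[pos] = (ratio, baf)
    return results
```

Positions that are homozygous in the germline still contribute to the depth ratio, but their BAF is set to `None` because they cannot reveal a loss of heterozygosity.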
Exome Somatic CNV: Sequenza
[Figure panels: B-Allele Frequency, Depth Ratio, Copy Number, Allele-Specific Copy Number]
Datasets for Germline events
Public data:
- Pair of tumor & matched germline + duplicate germline
- Available in the Contra SourceForge repository
Use of Contra: comparison of base-level log-ratios calculated from read depth between case and control samples
Workflow branch: Germline 1 ... Germline N, aligned and preprocessed reads (BAM) -> Germline Copy Number Detection -> [not in Galaxy] cohort comparison, literature comparison
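The notion of a base-level log-ratio between case and control can be sketched as follows. The function name `base_level_log_ratios`, the library-size scaling, and the pseudocount are assumptions chosen for illustration; this is not Contra's own code.

```python
import math


def base_level_log_ratios(case_depths, control_depths, pseudocount=1.0):
    """log2(case / control) at each base, after library-size scaling.

    case_depths, control_depths: per-base read depths over the same positions.
    The case is rescaled so that both samples have the same total depth,
    and a pseudocount avoids taking the log of zero.
    """
    if len(case_depths) != len(control_depths):
        raise ValueError("case and control must cover the same positions")
    case_total = sum(case_depths)
    control_total = sum(control_depths)
    if case_total == 0 or control_total == 0:
        raise ValueError("both samples need non-zero total depth")
    scale = control_total / case_total
    return [
        math.log2((case * scale + pseudocount) / (control + pseudocount))
        for case, control in zip(case_depths, control_depths)
    ]
```

Values near 0 indicate the same relative depth in case and control, positive values suggest a gain, and negative values suggest a loss. Averaging the per-base values over a target region gives a region-level estimate.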
Import Data
1. Go to http://sigenae-workbench.toulouse.inra.fr/galaxy
2. Go to «Shared Data» in the top menu, then «Published Histories»
3. Click on «TP_CNV_CONTRA_FILES», then on «Import History»
Create baseline from germlines
- In the search bar, find «Baseline : Control files for Contra».
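The idea of a baseline is to pool several germline controls into a single reference depth profile. The sketch below assumes one simple way to do this: scale each control to a mean depth of 1, then take the per-position median. The function name `build_baseline` and this pooling scheme are hypothetical and may differ from what the Galaxy tool does.

```python
from statistics import median


def build_baseline(control_depths):
    """Pool per-position depths from several controls into one baseline.

    control_depths: list of depth lists, one per germline control, all
    covering the same positions. Each control is scaled to a mean depth
    of 1 so that differently sequenced samples are comparable.
    """
    if not control_depths:
        raise ValueError("at least one control is required")
    n_positions = len(control_depths[0])
    if any(len(depths) != n_positions for depths in control_depths):
        raise ValueError("all controls must cover the same positions")
    normalized = []
    for depths in control_depths:
        total = sum(depths)
        if total == 0:
            raise ValueError("a control has zero total depth")
        mean_depth = total / n_positions
        normalized.append([d / mean_depth for d in depths])
    # The median across controls is robust to a CNV present in one control.
    return [
        median(sample[i] for sample in normalized)
        for i in range(n_positions)
    ]
```

Using the median rather than the mean limits the influence of a single control that happens to carry a copy number event at a given position.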
Call CNV using Contra
- In the search bar, find «Contra Copy Number analysis».
- Set «bed» & «large deletion» to true in the optional parameters.