NGS, Cancer and Bioinformatics. 20/10/15 Yannick Boursin


NGS and Clinical Oncology
NGS in hereditary cancer genome testing
- BRCA1/2 (breast/ovary cancer)
- XPC (melanoma)
- ERCC1 (colorectal cancer)
NGS for personalized cancer treatment
- Clinical trials: MOSCATO (GR), SAFIR (GR), SHIVA (Curie), Ipilimumab (anti-CTLA4), Nivolumab (anti-PD1), Trastuzumab (anti-HER2), Cetuximab (anti-EGFR)
Detection of chimeric transcripts
- Chronic Myeloid Leukemia: Philadelphia chromosome (BCR/ABL)
- Non-Small-Cell Lung Cancer: EML4-ALK

NGS and Oncology
NGS is now widely used as:
- A research tool to screen a large number of cancer samples
- A clinical/diagnostic tool in daily practice
These projects require dedicated bioinformatics integration to access and analyze this huge amount of data.

Why do we need computers for NGS?
Figure: Sequencing data size evolution
Needs to address:
- Store petabytes of data (1 PB is 1000 TB)
- Share data around the world through networks
- Analyze huge amounts of data with complex algorithms

Bioinformatics and Oncology
Problem: finding, extracting, and presenting relevant information.
Partial solution: designing workflows in order to ease data analysis.

Interdisciplinary collaboration
Bioinformatics acts as a hub between the different fields. Trust between partners is needed, and training is needed as well for efficient understanding.
- Biology knowledge: knowledge modeling
- Medical staff: clinicians, specialists
- Biological staff: biologists, geneticists
- Technological platforms: sequencing, microarrays, immunochemistry
- Bioinformatics: raw data storage, integration of biological and clinical data, quality control, data analysis, clinical biostatistics, report for biological/medical staff

Standard Workflow for NGS Analysis
Depends on the NGS application
Steps: Sequencing & Primary Analysis, Raw Reads, Reads Cleaning, Reads Mapping, Data Analysis
Quality-control checkpoints: QC 1, QC 2, QC 3
Figure: A typical NGS workflow

Step 1: Quality Check and Improvements

NGS Data: what do they look like?
A raw data file (.fastq, .sff, .fa, .csfasta/.qual) with millions of short reads of the same size (SOLiD, HiSeq) or reads of different sizes (Ion PGM/Proton)
Figure: Enhanced view of the reads in a FASTQ file

FASTQ format
1 sequence = 1 read = 4 lines in the file
First line = sequence identifier
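The four-lines-per-read layout can be sketched as a small reader. This is only an illustration, assuming plain (uncompressed) FASTQ with exactly four lines per record; the names `FastqRecord`, `read_fastq` and the two example reads are hypothetical and not taken from the slides.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # line 1, without the leading "@"
    sequence: str    # line 2
    quality: str     # line 4, still ASCII encoded


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines (assumes no line wrapping)."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        header = header.strip()
        if not header:          # tolerate blank lines between records
            continue
        sequence = handle.readline().strip()
        separator = handle.readline().strip()
        quality = handle.readline().strip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Two made-up reads, for illustration only.
EXAMPLE = "@read_1\nACGTN\n+\nIIII#\n@read_2\nGGCA\n+\nFFFF\n"
records = list(read_fastq(io.StringIO(EXAMPLE)))
for record in records:
    print(record.identifier, record.sequence, record.quality)
```

Reading record by record, rather than loading the whole file, matters because a real FASTQ file holds millions of reads.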

FASTQ format
Fourth line = quality, ASCII encoded (reduces the file size)
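As an illustration of the ASCII encoding, the sketch below converts a quality string to integer scores and back. It assumes the common Phred+33 offset (Sanger / recent Illumina); the slides do not state which offset is used, and older Illumina files use 64 instead. The function names are hypothetical.

```python
from typing import List


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """One character per base: score = ASCII code - offset (offset is an assumption)."""
    return [ord(character) - offset for character in quality]


def encode_quality(scores: List[int], offset: int = 33) -> str:
    """Inverse operation: one character per score."""
    return "".join(chr(score + offset) for score in scores)


print(decode_quality("I#5"))
print(encode_quality([40, 2, 20]))
```

Storing each score as a single character is what keeps the file compact: a two-digit score plus a separator would take three bytes per base instead of one.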

Sequence quality encoding
Phred scores Q: Q scores are defined as a property that is logarithmically related to the base-calling error probability (P).
Q = -10 log10(P)
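The relation Q = -10 log10(P) can be checked numerically. A minimal sketch with hypothetical function names:

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Invert Q = -10 log10(P): P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Q = -10 log10(P), defined for 0 < P <= 1."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(f"Q{q}: 1 error expected in {round(1 / phred_to_error_probability(q))} base calls")
```

So Q10 corresponds to 1 wrong call in 10, Q20 to 1 in 100, and Q30 to 1 in 1000.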

Quality controls on raw reads: let's start after sequencing
A raw read is characterized by three parameters:
- Its length
- Its sequence
- Its per-base quality
Raw reads:
ACTGATTAGTCTGAATTAGANNGATAGGAT
GATCGATGCATAGCGATCAGCATCGATACG
CGGCGCTCCGCTCTCGAAACTAGCACTGAC
AGCATCAGGATCTACGATCTAGCGAACTGAC
ACTACTTACGACATCGAGGTTAGGAGCATCA
ACTAGGCATCGGCATCACGGACNNNNNNNN
ACTAGCTATCGAGCTATCAGCGAGCATCTATC
ACTAGCTACTATCGAGCGAGCGATCATCGAC
CTGACTACTATCGAGCGAGCTACTAACTGAC
ACTATCAGCTAGCGCTTCAGCATTACCGT
ACTANNGACTAGGAATTAGCTACTGAGCTAC
ACTAGCAGCTATATGAGCTACTAGCACTGAC
NNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Why look at sequencing quality?
Quality of data is very important for various downstream analyses:
- Sequence assembly or mapping
- Variant detection
- Gene expression studies...
If the quality of the data is poor:
- Try to find a reason
- Can we correct/improve the quality?
- It may lead to erroneous conclusions

Quality controls on raw reads: which metrics to check?
Mainly:
- Quality score per base and over the reads
But also:
- Read length distribution
- Sequence content per base and % of GC
- K-mer content
- Overrepresented sequences
- Duplicated reads
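A few of these metrics are simple enough to sketch directly. The functions below are illustrative only (hypothetical names, exact-match duplicate detection, reads held in memory); dedicated QC tools compute the same kind of summaries on full-size files.

```python
from collections import Counter
from typing import Dict, List, Sequence


def gc_percent(sequence: str) -> float:
    """% of G or C among called bases (N and other symbols are ignored)."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return 100.0 * sum(base in "GC" for base in called) / len(called)


def length_distribution(reads: Sequence[str]) -> Dict[int, int]:
    """Map read length -> number of reads of that length."""
    return dict(Counter(len(read) for read in reads))


def duplicated_read_count(reads: Sequence[str]) -> int:
    """Number of reads that are exact copies of an earlier read."""
    return sum(count - 1 for count in Counter(reads).values())


def mean_quality_per_position(qualities: Sequence[Sequence[int]]) -> List[float]:
    """Mean Phred score at each read position, over the reads long enough to reach it."""
    longest = max((len(scores) for scores in qualities), default=0)
    means = []
    for position in range(longest):
        column = [scores[position] for scores in qualities if len(scores) > position]
        means.append(sum(column) / len(column))
    return means


reads = ["ACGT", "AC", "GG", "GG"]
print(length_distribution(reads), duplicated_read_count(reads))
print(mean_quality_per_position([[40, 20], [20]]))
```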

Quality scores
- Per base (box-and-whisker plot): to see whether base calls fall into low quality (commonly towards the end of a read)
- Per sequence (mean quality distribution): to see if a subset of your sequences has universally low quality values

Quality scores
Figure: quality score plots for PGM run A and PGM run B

Quality scores
Figure: quality score plots for Illumina run C and Illumina run D

Quality control on raw reads: adapter removal
An adapter is a small piece of known DNA located at the end of the reads.
Adapter roles:
- Attach the read to the sequencer flowcell
- Allow a specific PCR enrichment of reads carrying the adapter
- Used in multiplex sequencing (samples in mix)
Available tools to trim adapters:
- Cutadapt
- SeqPrep
- RmAdapter
Figure legend: in blue, adapters; in orange, the informative part of the read.

Quality controls on raw reads: let's start after sequencing
A first quality control of raw reads is mandatory and can be established according to the application ('N', adapter sequences, barcode, contamination, etc.)
Processed reads (blue parts are to be kept, green and red parts to be removed):
ACTGATTAGTCTGAATTAGANNGATAGGAT
GATCGATGCATAGCGATCAGCATCGATACG
CGGCGCTCCGCTCTCGAAACTAGCATCGAC ACTGAC
AGCATCAGGATCTACGATCTAGCGAACTGAC ACTGAC
ACTACTTACGACATCGAGGTTAGGAGCATCA
ACTAGGCATCGGCATCACGGACNNNNNNNN
ACTAGCTATCGAGCTATCAGCGAGCATCTATC
ACTAGCTACTATCGAGCGAGCGATCATCGAC
CTGACTACTATCGAGCGAGCTACTAACTGAC ACTGAC
ACTATCAGCTAGCGCTTCAGCATTACCGT
ACTANNGACTAGGAATTAGCTACTGAGCTAC
ACTAGCAGCTATATGAGCTACTAGCACTGAC ACTGAC
NNNNNNNNNNNNNNNNNNNNNNNNNNNNN
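The cleaning step can be sketched as follows. This is a deliberately naive illustration, not how Cutadapt or the other tools work (they tolerate sequencing errors and partial adapter matches). The adapter sequence `ACTGAC`, the minimum length of 20, and the function names are assumptions made for the example.

```python
from typing import Optional


def trim_3prime_adapter(read: str, adapter: str) -> str:
    """Remove the adapter only when it matches exactly at the 3' end."""
    if adapter and read.endswith(adapter):
        return read[: -len(adapter)]
    return read


def trim_trailing_n(read: str) -> str:
    """Remove uncalled bases ('N') at the 3' end."""
    return read.rstrip("N")


def clean_read(read: str, adapter: str, min_length: int = 20) -> Optional[str]:
    """Return the cleaned read, or None when too little sequence is left."""
    trimmed = trim_trailing_n(trim_3prime_adapter(read, adapter))
    return trimmed if len(trimmed) >= min_length else None


ADAPTER = "ACTGAC"  # hypothetical adapter, chosen for illustration
for raw in ("ACGTACGTACGTACGTACGTAAACTGAC",
            "ACTAGGCATCGGCATCACGGACNNNNNNNN",
            "NNNNNNNNNNNNNNNNNNNNNNNNNNNNN"):
    print(raw, "->", clean_read(raw, ADAPTER))
```

Discarding reads that become too short avoids feeding the mapper sequences that would align almost anywhere.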

Quality controls: Standard Workflow for NGS Analysis
Depends on the NGS application
Steps: Sequencing & Primary Analysis, Raw Reads, Reads Cleaning, Reads Mapping, Data Analysis
Quality-control checkpoints: QC 1, QC 2, QC 3
Figure: A typical NGS workflow

Step 2: Short Reads Alignment

Reads alignment - Vocabulary
- Alignment (mapping): read alignment aims at transforming the information from single reads into an organized and reduced set of information.
- Mismatch: a difference between two aligned nucleotides.
- Reference genome: a known sequence, supposed to be as close as possible to the input genome, which is used as an anchor to organize the information from single reads.
- Gap: bridge within the read alignment (i.e. a small insertion/deletion).
- Mappability: uniqueness of a region (repeated region = low mappability, unique region = good mappability).
- Indels: insertions/deletions relative to the reference genome.

Reads alignment: two strategies
Read alignment aims at transforming the information from single reads into an organized and reduced set of information. Two strategies can be applied:
- De novo reads assembly: used when no reference genome is available. It aims at reconstructing long scaffolds from single-read information.
- Alignment on a reference genome: the reads are directly compared to a known reference genome.

Alignment on a reference genome
The reference genome is a known sequence, supposed to be as close as possible to the input genome, which is used as an anchor to organize the information from single reads.
Reference genome sequence: ACTACGACTCTACGAGCATCTACGAGCTACTAGCGATCTACGAGCTGCGAGCAACGGCCAAC
Figure: Alignment of reads against the reference genome

Alignment on a reference genome
Figure: Alignment of reads against the reference genome, showing a homozygous polymorphism (T/C)

Alignment on a reference genome - Challenges
New alignment algorithms must address the requirements and characteristics of NGS reads:
- Millions of reads per run (30x genome coverage)
- Reads of different sizes (35 bp - 200 bp)
- Different types of reads (single-end, paired-end, mate-pair, etc.)
- Base-calling quality factors
- Sequencing errors (~1%)
- Repetitive regions
- Sequenced organism vs. reference genome
- Must adjust to evolving sequencing technologies and data formats

Alignment on a reference genome: bioinformatics tools
Figure: Mappers timeline (since 2001)

Finding the best alignment - Rationale
Given a reference and a set of reads, report at least one good local alignment for each read if one exists.
What is good? For now, we concentrate on:
- Fewer mismatches is better (figure: two candidate alignments of a read, the one with fewer mismatches being preferred)
- Failing to align a low-quality base is better than failing to align a high-quality base (figure: two candidate alignments, lower-case letters marking low-quality bases)
Based on a scoring system, i.e. score for a match (1), mismatch penalty (3), gap open penalty (5), gap extension penalty (2). The best alignment is the one with the highest score.
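The scoring system can be made concrete with a small scorer for an already-built pairwise alignment. This is a sketch under one stated assumption: the first position of a gap costs the gap open penalty and every further position of the same gap costs the extension penalty (aligners differ on this convention). The function name and the example sequences are hypothetical.

```python
def alignment_score(reference_row: str, read_row: str, match: int = 1,
                    mismatch: int = 3, gap_open: int = 5, gap_extend: int = 2) -> int:
    """Score two aligned rows of equal length, where '-' marks a gap."""
    if len(reference_row) != len(read_row):
        raise ValueError("aligned rows must have the same length")
    score = 0
    previous_gap = None  # row that carried a gap in the previous column
    for ref_base, read_base in zip(reference_row, read_row):
        if ref_base == "-" and read_base == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if ref_base == "-" or read_base == "-":
            gap_row = "reference" if ref_base == "-" else "read"
            # Assumption: open penalty for the first gap column, extension afterwards.
            score -= gap_extend if previous_gap == gap_row else gap_open
            previous_gap = gap_row
        else:
            score += match if ref_base == read_base else -mismatch
            previous_gap = None
    return score


candidates = [("GATCAT", "GATCAT"), ("GATCAT", "GAGCAT"), ("GATCAT", "GA--AT")]
for reference_row, read_row in candidates:
    print(reference_row, read_row, alignment_score(reference_row, read_row))
```

With these values a single mismatch costs as much as three matches earn, so short reads with several mismatches quickly fall below any reasonable score threshold.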

Alignment key parameters - Repeats
Approximately 50% of the human genome is comprised of repeats.
Treangen T.J. and Salzberg S.L. 2012. Nature Reviews Genetics 13, 36-46

Alignment key parameters - Repeats
Close proximity with genes: intergenic and intragenic positions
Figure: BRCA2, a mosaic of repeated regions

Alignment key parameters - Repeats: 3 strategies
-1- Report only unique alignments
-2- Report best alignments and randomly assign reads across equally good loci
-3- Report all (best) alignments
Figure: the three strategies illustrated on two loci, A and B
Treangen T.J. and Salzberg S.L. 2012. Nature Reviews Genetics 13, 36-46
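The three reporting strategies can be sketched for a read with several candidate hits. This is an interpretation, not a description of any specific mapper: "unique" is taken here to mean a single best-scoring hit, and the `Hit` type, function names, and scores are hypothetical.

```python
import random
from typing import List, Tuple

Hit = Tuple[str, int]  # (locus name, alignment score)


def best_hits(hits: List[Hit]) -> List[Hit]:
    """All hits sharing the highest score."""
    if not hits:
        return []
    top = max(score for _, score in hits)
    return [hit for hit in hits if hit[1] == top]


def report_unique(hits: List[Hit]) -> List[Hit]:
    """Strategy 1: report the read only if its best hit is unique."""
    best = best_hits(hits)
    return best if len(best) == 1 else []


def report_random_best(hits: List[Hit], rng: random.Random) -> List[Hit]:
    """Strategy 2: pick one of the equally good loci at random."""
    best = best_hits(hits)
    return [rng.choice(best)] if best else []


def report_all_best(hits: List[Hit]) -> List[Hit]:
    """Strategy 3: report every best hit."""
    return best_hits(hits)


repeat_read = [("A", 20), ("B", 20)]  # same score on two copies of a repeat
print(report_unique(repeat_read))
print(report_random_best(repeat_read, random.Random(0)))
print(report_all_best(repeat_read))
```

The choice affects downstream results: strategy 1 leaves repeats uncovered, strategy 2 spreads coverage evenly over copies, and strategy 3 counts the same read several times.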

Alignment key parameters - Using single or paired-end reads?
The type of sequencing (i.e. single or paired-end reads) is often driven by the application. Example: finding large indels, genomic rearrangements, ...
However, in most cases, the pair information can improve the mapping specificity.
- Single-end alignment: repeated sequence (read ACGACTC)
- Paired-end alignment: unique sequence (read pair ACGACTC / GGCCAAC)
Reference genome sequence: ACTACGACTCTACGAGCATCTACGAGCTACTAGCGATCTACGAGCTGCGAGCAACGGCCAAC
Figure: Alignment of reads against the reference genome

Key points: alignment on a reference genome
- The alignment is a crucial step of the NGS analysis.
- The reference genome has to be carefully chosen.
- The mappability of the region of interest has to be taken into account (primer design).
- The scoring method has to be chosen according to the sequencing error rate and the quality of the raw reads.
- The alignment parameters have to be set properly.

Limitations of Alignment Tools
Even though we now have good tools to align reads on a reference genome, several issues remain important:
- Homopolymer mapping
- Efficient alignment of small indels
- Alignment on several genomes
- Alignment on repeated sequences
- ...

Alignment formats
Many formats exist:
- SAM
- BAM
- ELAND (Illumina specific)
- MAQ map
SAM and BAM are now the standard for aligned data

SAM format
- SAM stands for Sequence Alignment/Map
- Tab-delimited text file
- 1 line per read
- Each line is composed of at least 11 fields

SAM format
11695_6 0 chr1 3292760 255 20M * 0 0 AAGAGATCTGGAACCATAGA DGDFCDGFFGBEFFGFDEEF XA:i:0 MD:Z:20 NM:i:0 XX:i:3984
9985_1 0 chr1 3292761 255 19M * 0 0 AGAGATCTGGAACCATAGA IIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:19 NM:i:0 XX:i:3990
4226_1 0 chr1 3296594 255 22M * 0 0 TCTGCAAGGCAAAAGACACTGT GHHHHHGHGHHHGHHHHBHBGG XA:i:0 MD:Z:22 NM:i:0 XX:i:4194
7001_1 0 chr1 3328828 255 20M * 0 0 AAGAAAGAGAACTTCAGACC GGGG+GGGGGGIIIIIBHII XA:i:0 MD:Z:20 NM:i:0 XX:i:2357
1042_1 0 chr1 3334731 255 21M * 0 0 GGGACTCAGCAGAACTTAGGA ?@GGGDGGGG>DDGGGGGGDB XA:i:0 MD:Z:21 NM:i:0 XX:i:1027
14647_1 0 chr1 3334756 255 23M * 0 0 AGTCTGAACAGGTTAGAGGGTGC IIIIIIEGIHIGID<DBDGDBGB XA:i:0 MD:Z:23 NM:i:0 XX:i:1910
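A record like those above can be split into its 11 mandatory fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) followed by optional tags. The sketch below is illustrative: `SamRecord` and `parse_sam_line` are hypothetical names, header lines starting with `@` are not handled, and real work would normally go through Samtools or a dedicated library.

```python
from typing import Dict, NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: Dict[str, str]


def parse_sam_line(line: str) -> SamRecord:
    """Split one tab-delimited alignment line into typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    tags = {}
    for tag in fields[11:]:
        name, _value_type, value = tag.split(":", 2)  # e.g. NM:i:0
        tags[name] = value
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[6], int(fields[7]),
                     int(fields[8]), fields[9], fields[10], tags)


# First record of the slide, rebuilt with tab separators.
LINE = "\t".join(["11695_6", "0", "chr1", "3292760", "255", "20M", "*", "0", "0",
                  "AAGAGATCTGGAACCATAGA", "DGDFCDGFFGBEFFGFDEEF",
                  "XA:i:0", "MD:Z:20", "NM:i:0", "XX:i:3984"])
record = parse_sam_line(LINE)
print(record.rname, record.pos, record.cigar, record.tags)
```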

SAM format
The second field (FLAG) can be used to quickly filter the file with Samtools (command line) and its -f and -F options.
Useful webpage: http://broadinstitute.github.io/picard/explain-flags.html
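The FLAG field is a bit mask, which is why it lends itself to quick filtering. The sketch below decodes a flag and mimics the logic of the Samtools `-f` (all listed bits must be set) and `-F` (none of the listed bits may be set) options; the function names and bit labels are hypothetical wording.

```python
from typing import List

# Bit values defined by the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "duplicate",
    0x800: "supplementary alignment",
}


def explain_flag(flag: int) -> List[str]:
    """List the properties encoded in a FLAG value."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


def passes_filter(flag: int, require: int = 0, exclude: int = 0) -> bool:
    """Keep a read when all 'require' bits are set and no 'exclude' bit is set."""
    return (flag & require) == require and (flag & exclude) == 0


print(explain_flag(99))
print(passes_filter(99, require=0x2, exclude=0x4))
```

For example, requiring `0x2` while excluding `0x4` keeps only mapped reads that belong to a properly aligned pair.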

BAM format
- BAM stands for Binary Alignment/Map
- Corresponds to the SAM format compressed as BGZF
- Reduces the size of the alignment file about 5-fold
- Not directly readable, unlike the SAM format: requires Samtools
- Best format for sharing alignment files
- Coupled with an index file (BAI), which avoids a sequential read of the complete file

Quality controls on aligned data: Standard Workflow for NGS Analysis
Depends on the NGS application
Steps: Sequencing & Primary Analysis, Raw Reads, Reads Cleaning, Reads Mapping, Data Analysis
Quality-control checkpoints: QC 1, QC 2, QC 3
Figure: A typical NGS workflow

QC 3: which metrics to check?
In practice, how to validate my alignment?
- Be aware of the mapping strategy used
- Look at simple descriptive statistics:
  - Number of aligned reads
  - Coverage/Depth
  - Mapping quality
  - Number of normal/abnormal pairs for paired-end data
  - Strand bias
  - ...
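Depth and coverage can be illustrated on a toy region. This sketch assumes ungapped alignments described only by a 1-based start position and a length (no CIGAR handling), and all names are hypothetical; on real BAM files these figures would come from Samtools or a similar tool.

```python
from typing import List, Sequence, Tuple


def depth_per_base(alignments: Sequence[Tuple[int, int]],
                   region_start: int, region_end: int) -> List[int]:
    """Number of reads covering each position of [region_start, region_end]."""
    depth = [0] * (region_end - region_start + 1)
    for start, length in alignments:
        first = max(start, region_start)
        last = min(start + length - 1, region_end)
        for position in range(first, last + 1):
            depth[position - region_start] += 1
    return depth


def mean_depth(depth: Sequence[int]) -> float:
    """Average number of reads per position."""
    return sum(depth) / len(depth)


def breadth_of_coverage(depth: Sequence[int], min_depth: int = 1) -> float:
    """Fraction of positions covered by at least min_depth reads."""
    return sum(d >= min_depth for d in depth) / len(depth)


alignments = [(1, 4), (3, 4)]  # (start, read length), made-up values
depth = depth_per_base(alignments, 1, 8)
print(depth, mean_depth(depth), breadth_of_coverage(depth))
```

Mean depth alone can hide gaps: the toy region has a mean depth of 1 while a quarter of its positions are not covered at all.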

Paired-end mapping
Insert-size checking
- % of "All Good": both reads in the pair have aligned and the pair is properly aligned, meaning that they mapped within a proper distance from each other
- % of "All Bad": neither the read nor its mate mapped
- % of "Only one read maps": only one read in a pair is mapped
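These categories can be derived from the SAM FLAG bits (0x2 proper pair, 0x4 read unmapped, 0x8 mate unmapped, 0x40 first in pair). The sketch below is an illustration with hypothetical function names; it adds a fourth label for pairs where both reads map but not at a proper distance, and it counts each pair once by looking only at first-in-pair reads.

```python
from collections import Counter
from typing import Dict, Iterable


def classify_pair(flag: int) -> str:
    """Classify a pair from the FLAG of one of its reads."""
    read_unmapped = bool(flag & 0x4)
    mate_unmapped = bool(flag & 0x8)
    if read_unmapped and mate_unmapped:
        return "all bad"
    if read_unmapped or mate_unmapped:
        return "only one read maps"
    if flag & 0x2:
        return "all good"
    return "both mapped, not properly paired"


def pair_summary(flags: Iterable[int]) -> Dict[str, float]:
    """Percentage of pairs per category, using first-in-pair reads only."""
    counts = Counter(classify_pair(flag) for flag in flags if flag & 0x40)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()} if total else {}


# Made-up flags for four pairs (first and second read of each pair).
flags = [99, 147, 77, 141, 73, 133, 65, 129]
print(pair_summary(flags))
```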

NGS Analysis: how can I work with my NGS data?
Difficult on a personal computer (lack of resources):
- 1 alignment = 4 processors + 15 GB RAM (to multiply by the number of samples)
- Impossible to open the files in software such as a text editor
- Need for a very large storage capacity
- Data backup administration
Applications server connected to a computing cluster and storage array:
- Commercial solutions (CLC Bio, NextGene, ...)
- Galaxy server: https://galaxy.gustaveroussy.fr/galaxyprod

Data analysis: Standard Workflow for NGS Analysis
Depends on the NGS application
Steps: Sequencing & Primary Analysis, Raw Reads, Reads Cleaning, Reads Mapping, Data Analysis
Quality-control checkpoints: QC 1, QC 2, QC 3
Figure: A typical NGS workflow

Data Analyses in Cancer
- Chimeric transcript search
- Alternative transcripts study
- Differential expression study
- Methylation study
- Detection of genomic variants
- Detection of copy-number variation

Chimeric transcripts
Do the tumor cells express any chimeric transcript?
Figure: History of the BCR-ABL fusion

Alternative transcripts

Differential expression
Are there genes that are strongly expressed in one kind of tumor but not in the other?
Can we group tumors according to their expression profiles?
Figure: Clustering differential expression in breast tumours.

Methylome
Is there any difference between DNA methylation in tumors and in normal cells?
How does methylation promote cancer?

Detection of copy-number variations
Are there any copy-number alterations (gain or loss of chromosomal regions, amplifications) that could explain tumorigenesis?
Figure: Copy-number variations in cancer. MYC and KRAS are amplified.

Detection of genomic variants
Are there mutational events that are specific to the tumor genome?
Could the tumorigenesis be explained by those?
Is there any drug targeting those mutations?
Figure: Pancreas adenocarcinoma, from normal cells to tumor cells

Limitations: Detection of genomic variants
Between 1.4 and 8.9% of the variants are technology specific

Limitations: Detection of genomic variants
Figure: Common genomic variants between different variant callers

Conclusion
Nowadays, NGS is widely used in cancer centers to categorize cancers and link patients with personalized treatments (Precision Medicine).
NGS is also used in cancer research to discover new oncogenetic mechanisms, to understand the way a treatment works, and to link biological and genetic traits.
Due to technical and how-the-universe-works-related issues, using NGS might not solve your problems. It is important to know that the technique is limited:
A) By the question you asked at first. If a cancer cannot be explained by mutational events, it might be explained by other mechanisms, but then nothing is to be found in the data.
B) By technical issues. Sequencers and software are prone to errors. Statistically, there will be at least one error in your analysis. You can often reduce the impact of this limitation by making biological and technical replicates.

Galaxy: a web-based genome analysis platform
Galaxy is an open-source framework for integrating various computational tools and databases into a cohesive workspace.
https://main.g2.bx.psu.edu/
- A web-based service that provides and integrates many popular tools and resources for comparative genomics
- A completely self-contained application for building your own Galaxy-style sites
29 January 2015, NGS & Cancer Training - Exome Analyses

Galaxy: the instant web-based tool and data resource integration platform
- Open-source downloadable package that can be deployed in individual labs
- Modularized:
  - Add new tools
  - Integrate new data sources
  - Easy to plug in your own components
- Straightforward to run your own private Galaxy server

Galaxy: the one-stop shop for genome analysis
Analyze
- Retrieve shared data between Galaxy users or upload your own
- Interactively manipulate genomic data with a comprehensive and expanding best-practices toolset
- Galaxy is designed to work with many different datatypes: http://wiki.galaxyproject.org/learn/datatypes
Visualize
- Visual analysis environment for your data and your analysis workflows
Publish and Share
- Results and step-by-step analysis record (Data Libraries and Histories)
- Customizable pipelines (Workflows)
- Complete protocols/documentation (Pages)

https://galaxy.gustaveroussy.fr/galaxyprod

Data libraries
Datasets are accessible from Galaxy or for download.

History
Histories record all the steps in the process and the settings used. Histories can be imported into your session and rerun as is or modified.

Workflows
Workflows specify the steps in a process (an ordered suite of tools). Workflows are analyses that are meant to be rerun, each time with different user-provided datasets.

User account
Galaxy public Main or Test instances:
- An account is not required to access them
- But if one is used, the data quota is increased and full functionality across sessions opens up, such as naming, saving, sharing, and publishing Galaxy objects (Histories, Workflows, Datasets, Pages)
Galaxy @ GR: https://galaxy.gustaveroussy.fr/galaxyprod
- An account is required to access it
- Full functionality across sessions opens up, such as naming, saving, sharing, and publishing Galaxy objects (Histories, Workflows, Datasets, Pages)