RNA SEQUENCING AND DATA ANALYSIS


Length of mRNA transcripts in the human genome
[Figure: histograms of transcript length (x-axis ranges 0-800 and 0-10,000); annotation: insert size ~200 bp]

Overview of RNA sequencing protocol
[Figure: sequencing of an insert, showing the forward read and the reverse read]
Read length: 48-76 bp

Sequencing parameters
Read depth
  Minimum mapped reads: 10 million for quantitative analysis of a mammalian transcriptome
  More reads are needed for splice variant discovery and for differential comparison among samples
Current output: 120-180 million raw reads per lane
Multiplex level: 4-12 libraries per lane recommended
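The multiplexing guidance above follows from simple arithmetic on lane output. A minimal sketch, assuming reads are split evenly across libraries; the `mapped_fraction` parameter is a hypothetical knob, not a figure from the slides:

```python
def reads_per_library(lane_reads: int, libraries_per_lane: int,
                      mapped_fraction: float = 1.0) -> float:
    """Expected mapped reads per library when a lane is shared evenly.

    mapped_fraction is an assumed share of raw reads that map
    (1.0 means every raw read maps, which is optimistic).
    """
    if libraries_per_lane <= 0:
        raise ValueError("libraries_per_lane must be positive")
    return lane_reads * mapped_fraction / libraries_per_lane


def max_multiplex(lane_reads: int, min_reads_per_library: int = 10_000_000) -> int:
    """Largest number of libraries per lane that still meets the minimum depth."""
    return lane_reads // min_reads_per_library


# Lower end of the quoted lane output, highest recommended multiplex level.
per_library = reads_per_library(120_000_000, 12)
```

With 120 million reads and 12 libraries, each library receives about 10 million reads, which is why 12 sits at the upper end of the recommended range.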

Not all RNA is the same
Types of RNA:
  Messenger RNA
  MicroRNA
  Long non-coding RNA
  Ribosomal RNA

Methods for RNA enrichment prior to library construction
Poly(A)-RNA selection
  By hybridization to oligo(dT) beads
  Mature mRNA is highly enriched
  Efficient for quantification of gene expression levels and similar applications
  Limitation: 3' bias that correlates with RNA degradation
rRNA depletion
  By hybridization to bead-bound rRNA probes
  rRNA probes are sequence-dependent and species-specific
  All non-rRNA is retained, including premature mRNA and long non-coding RNA
Small RNA extraction
  Specific kits are required to retain small RNA
  Optional fine size selection by gel or column

Different methods capture different types of RNA
Methods: poly(A)-RNA selection, rRNA depletion, small RNA extraction
RNA types: messenger RNA, microRNA, long non-coding RNA, ribosomal RNA
[Table: RNA type by enrichment method, with X marks per cell]

Read quality: paraffin-embedded versus fresh frozen
[Figure: read quality for paraffin-embedded and fresh frozen samples]

First step: alignment

Or: assembly, then alignment

Alignment versus assembly
Assembly
  Trinity, Cufflinks, ABySS
  Particularly useful when no reference genome is available, as in bacterial transcriptomes
Alignment
  Bowtie, BWA, Mosaic
  Maximum sensitivity, fewer false positives

RNA sequencing applications
  Quantification of transcript expression levels
  Detection of splice variation and different isoforms of the same gene
  Allele-specific expression levels
  Strand-specific expression levels
  Detection of fusion transcripts (such as BCR-ABL in CML)
  Detection of sequence variation (limited application)
  Validation of DNA sequence variants

RNA-seq expression levels are linear where microarrays get saturated or are insensitive
Expression is measured as reads per kilobase per million mapped reads (RPKM) or fragments per kilobase of exon per million fragments mapped (FPKM), which normalizes for gene length and library size
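The RPKM normalization can be written out directly. A minimal sketch; the function and argument names are illustrative and do not come from any specific tool:

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads.

    Dividing by gene length (in kb) corrects for longer genes collecting
    more reads; dividing by library size (in millions) corrects for
    sequencing depth.
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# 1,000 reads on a 2 kb gene in a library of 20 million mapped reads.
example = rpkm(1000, 2000, 20_000_000)
```

FPKM uses the same formula with fragments (read pairs) counted instead of individual reads.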

In GBM, the gene EGFR is frequently targeted by intragenic deletions

The EGFRvIII deletion occurs in the same domain as point mutations

Detecting EGFR transcript variants using RNA-seq data

SpliceSeq can detect splice variants http://bioinformatics.mdanderson.org/main/spliceseq:overview

Allele-/strand-specific RNA-seq
Allele-specific
  Haplotype-specific gene expression by computationally integrating RNA-seq with DNA SNP data
Strand-specific
  Requires a specific library preparation protocol
  Costs more
  Output is more accurate and is useful for analysis in the absence of a reference genome
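One common way to combine RNA-seq with DNA SNP data is to count reads carrying each allele at a heterozygous SNP and test for departure from a 50:50 ratio. The sketch below is an illustration of that idea under stated assumptions, not the method used in the slides: it uses an exact two-sided binomial test and ignores reference mapping bias.

```python
from math import comb


def allelic_imbalance(ref_reads: int, alt_reads: int) -> tuple[float, float]:
    """Return (reference allele fraction, two-sided exact binomial p-value).

    Null hypothesis: both alleles are expressed equally (p = 0.5).
    Intended for modest read depths; very deep sites would need a
    log-space or library implementation.
    """
    n = ref_reads + alt_reads
    if n == 0:
        raise ValueError("no reads cover this SNP")
    probs = [comb(n, k) * 0.5 ** n for k in range(n + 1)]
    observed = probs[ref_reads]
    # Sum the probability of every outcome at least as unlikely as the observed one.
    p_value = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return ref_reads / n, min(p_value, 1.0)
```

A site with 10 reference and 10 alternate reads gives a p-value of 1, while 20 reference and 0 alternate reads gives a very small p-value, suggesting expression from one haplotype only.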

Identification of fusion transcripts
Popular methods search for:
  Read pairs that map to two different genes
    Need to correct for gene homology
  Reads that span the fusion junction
    Split reads in half and align the halves separately
    Make a database of all possible fusion junctions and align full reads
Tools: PRADA, MapSplice, TopHat
http://sourceforge.net/projects/prada/
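The first kind of evidence, read pairs whose two ends map to different genes, reduces to a counting step once each read end has been assigned to a gene. A minimal sketch, assuming the gene assignment has already been done upstream; the input format is hypothetical and gene order is ignored:

```python
from collections import Counter


def count_discordant_pairs(pair_gene_hits):
    """Count read pairs whose ends map to two different genes.

    pair_gene_hits: iterable of (gene_end1, gene_end2) tuples, with None
    for an end that does not map to any gene. Pairs are keyed by the
    sorted gene names, so A-B and B-A are counted together.
    """
    counts = Counter()
    for gene1, gene2 in pair_gene_hits:
        if gene1 is None or gene2 is None:
            continue  # one end unmapped: not a discordant pair
        if gene1 != gene2:
            counts[tuple(sorted((gene1, gene2)))] += 1
    return counts


pairs = [("FGFR3", "TACC3"), ("TACC3", "FGFR3"), ("EGFR", "EGFR"), ("EGFR", None)]
discordant = count_discordant_pairs(pairs)
```

Homologous gene pairs would still need to be removed afterwards, since reads from one gene can mis-map to its paralog and look discordant.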

FGFR3-TACC3 fusion in GBM is the result of a local inversion

Fusion transcripts are often associated with copy number differences and genomic breakpoints
[Figure: copy number profile of two FGFR3-TACC3 cases in TCGA]

6.4% of GBMs harbor transcript fusions involving EGFR
All fusions fall within the area of the EGFR amplification

PRADA fusion module
Discordant read pair: each end of the read pair maps uniquely to distinct protein-coding genes.
Fusion spanning reads: a chimeric read that maps to a putative junction, while the mate read maps to either gene A or gene B.
[Figure: PRADA pipeline diagram with the fusion module highlighted, showing a discordant pair and a junction-spanning read across gene A and gene B]

Structural transcript variants in low-grade glioma
RNA-seq data from 272 TCGA low-grade gliomas
PRADA detected 1,843 fusion transcripts
Fusion detection accuracy affected by:
[Figure: detected # fusion transcripts per sample against # mapped reads per sample]

Validation of predicted transcript fusions
Filtering out artifacts
  Homology E-value larger than 0.01 (column Evalue)
  No mismatches in junction spanning reads
  970/1,843 fusions filtered
  Count the number of partner genes for each individual gene
  Identify genes with fusions mapping to more than 10 different chromosome arms
  509/970 fusions filtered
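The artifact filters above can be sketched as two passes over a candidate list. This is an interpretation rather than PRADA's code: it assumes candidates are kept when the homology E-value exceeds 0.01 and the junction spanning reads have no mismatches, and that genes whose partners fall on more than 10 chromosome arms are treated as artifacts. All field names are hypothetical.

```python
from collections import defaultdict


def filter_candidates(candidates, min_evalue=0.01, max_partner_arms=10):
    """Apply the homology, mismatch, and promiscuous-gene filters.

    Each candidate is a dict with hypothetical keys: gene_a, gene_b,
    evalue, junction_mismatches, arm_a, arm_b.
    """
    # Pass 1: drop homologous partner genes and imperfect junction reads.
    kept = [c for c in candidates
            if c["evalue"] > min_evalue and c["junction_mismatches"] == 0]

    # Pass 2: collect the chromosome arms of each gene's fusion partners.
    partner_arms = defaultdict(set)
    for c in kept:
        partner_arms[c["gene_a"]].add(c["arm_b"])
        partner_arms[c["gene_b"]].add(c["arm_a"])
    promiscuous = {gene for gene, arms in partner_arms.items()
                   if len(arms) > max_partner_arms}

    return [c for c in kept
            if c["gene_a"] not in promiscuous and c["gene_b"] not in promiscuous]
```

Keeping the thresholds as parameters makes it easy to see how many candidates each filter removes on a given cohort.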

Define four tiers of fusion transcripts based on evidence
Tier 1: At least 3 discordant read pairs (DSP), two perfect-match junction spanning reads (JSR), and both partner genes only fused to one other partner gene in the same sample
Tier 2: At least 2 DSP and 1 JSR, with a DNA breakpoint within a 100 kb window
  Use matching DNA copy number profile
Tier 3: At least 2 DSP and 1 JSR, unique partner genes, with predicted junction consistent for all
Tier 4: The rest
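The tier definitions translate into a short decision function. A minimal sketch, assuming each candidate has already been summarized into the evidence fields below; the class and field names are illustrative, and the tiers are checked in order so a candidate receives the best tier it qualifies for:

```python
from dataclasses import dataclass


@dataclass
class FusionEvidence:
    discordant_pairs: int              # DSP
    junction_reads: int                # JSR, any quality
    perfect_junction_reads: int        # JSR with a perfect match
    unique_partners: bool              # each gene fused to one partner in the sample
    dna_breakpoint_within_100kb: bool  # from the matching DNA copy number profile
    junction_consistent: bool          # predicted junction consistent for all reads


def assign_tier(ev: FusionEvidence) -> int:
    """Return the evidence tier (1 is strongest, 4 is weakest)."""
    if (ev.discordant_pairs >= 3 and ev.perfect_junction_reads >= 2
            and ev.unique_partners):
        return 1
    has_minimum = ev.discordant_pairs >= 2 and ev.junction_reads >= 1
    if has_minimum and ev.dna_breakpoint_within_100kb:
        return 2
    if has_minimum and ev.unique_partners and ev.junction_consistent:
        return 3
    return 4
```

For example, a candidate with 5 discordant pairs and 3 perfect junction reads but non-unique partners falls to Tier 2 only if a DNA breakpoint supports it.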

Validation of RNA fusions using output of BreakDancer BreakDancer detects DNA rearrangements in low pass sequencing data


Variant detection
From the TCGA renal clear cell carcinoma project
Approximately 30% of mutations are covered sufficiently to be detected, at a validation rate of ~80%.
The reverse transcriptase step that converts RNA to cDNA complicates detection of RNA edits and mutations

RNA sequencing read alignment in PRADA
  Reads are aligned to all possible transcripts of the same gene
  Reads are also aligned to the genome
  A final, single placement for each read is determined by re-mapping

PRADA alignments: advantages versus disadvantages
Advantages
  Alignment to DNA means mapping of unannotated transcripts
  Alignment to the transcriptome means mapping across exon-exon junctions
Disadvantage
  More conservative alignment than split-read methods

PRADA pipeline overview
http://sourceforge.net/projects/prada/
PRADA focuses on the analysis of paired-end RNA-sequencing data.
Inputs: .bam file [paired end]; .fastq files [END1 & END2]; Config.txt [location of scripts and reference files]
Processing module (preprocessing): read alignment, remap alignments, combine two ends, recalibrate quality scores
Four modules:
1. Processing
2. Expression and Quality Control (RNA-SeQC) [YES / NO / ONLY]
3. Gene fusion [YES / NO / ONLY]
4. GUESS-ft: General User defined Supervised Search for fusion transcripts [YES / NO / ONLY] (-genea, -geneb)
Outputs: RPKM and QC metrics, fusion candidates, supervised search evidence

Expression & QC module
RNA-SeQC process (Java)
RNA-SeQC provides three types of quality control metrics:
  Read counts
  Coverage
  Correlation
RPKM values at transcript level, for the longest transcript
Outputs: RPKM and QC metrics


Implementation results
Samples processed: >400 KIRC, >170 GBM
Works well in the MDACC HPC system
PRADA-fusion module validation rate: ~85% (53 out of 62)

RNA sequencing in The Cancer Genome Atlas
mRNA: poly(A) mRNA is purified from total RNA using poly(T) oligo-attached magnetic beads
miRNA: total RNA is mixed with oligo(dT) MicroBeads and loaded into a MACS column, which is then placed on a MultiMACS separator. From the flow-through, small RNAs, including miRNAs, are recovered by ethanol precipitation.

Detecting fusion transcripts in GBM

KIRC fusion results
We analyzed 416 RNA-seq samples from clear cell renal carcinoma (ccRCC), available through TCGA.
We identified 80 bona fide fusion transcripts, 57 intrachromosomal and 33 interchromosomal, in 62 individual samples
Recurrent fusions
  SFPQ-TFE3 (n=5, chr1-chrX)
  DHX33-NLRP1 (n=2, chr2)
  TRIP12-SLC16A14 (n=2, chr17)
  TFG-GPR128 (n=4, chr3)

KIRC fusion validation
PRADA-fusion module validation rate: ~85% (11 out of 13)
RT-PCR and FISH assays
TFE3-SFPQ was validated in three individual samples

Sample ID                     5' gene   3' gene   Discordant read pairs  Fusion span reads  Fusion junction(s)  5' gene chr  3' gene chr  Validated?
TCGA-AK-3456-01A-02R-1325-07  TFE3      SFPQ      175                    129                1                   chrX         chr1         Yes
TCGA-AK-3456-01A-02R-1325-07  SFPQ      TFE3      116                    81                 1                   chr1         chrX         Yes
TCGA-A3-3313-01A-02R-1325-07  C6orf106  LRRC1     90                     40                 2                   chr6         chr6         Yes
TCGA-A3-3313-01A-02R-1325-07  CYP39A1   LEMD2     37                     9                  1                   chr6         chr6         Yes
TCGA-B2-4101-01A-02R-1277-07  FAM172A   FHIT      17                     4                  1                   chr5         chr3         Yes
TCGA-AK-3445-01A-02R-1277-07  KIAA0802  LRRC41    14                     6                  1                   chr18        chr1         Yes
TCGA-B0-5095-01A-01R-1420-07  GORASP2   WIPF1     14                     2                  1                   chr2         chr2         Yes
TCGA-A3-3313-01A-02R-1325-07  ZNF193    MRPS18A   11                     3                  1                   chr6         chr6         Yes
TCGA-A3-3313-01A-02R-1325-07  FTSJD2    GPX6      9                      8                  1                   chr6         chr6         Yes
TCGA-B0-4945-01A-01R-1420-07  KIAA0427  GRM4      8                      5                  1                   chr18        chr6         No
TCGA-B8-4143-01A-01R-1188-07  SLC36A1   TTC37     5                      5                  1                   chr5         chr5         No

KIRC fusion validation: RT-PCR
Figure 2. RT-PCR results for TFE3 fusion validations for sample TCGA-AK-3456. Panels (a) and (b): SFPQ-TFE3 and TFE3-SFPQ.


TFG-GPR128 has been reported in other cancers
TCGA has thousands of RNA-seq samples. How can we quickly scan many samples for the presence of this fusion?

Supervised search module
GUESS-ft: General User defined Supervised Search for fusion transcripts
Uses high-quality mapping reads only, checks that read orientation fulfills the fusion schema, and allows up to one mismatch.
From the BAM file:
  Reads mapped to A or B
    Discordant reads A-B: the two read ends map to A and B respectively
  Unmapped reads
    Parse unmapped reads with the other end mapping to A or B
    Map parsed reads to a junction database (DB) of all possible exon junctions
    Junction spanning reads: list reads with one end mapping to a junction and the other end mapping to A or B
Output: summary report (supervised search evidence)
The diagram marks one step as time consuming.
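The junction database used in the supervised search can be pictured as every exon end of gene A joined to every exon start of gene B. A minimal sketch under stated assumptions: exon sequences are given as plain strings in transcript orientation, and the `flank` length and function name are hypothetical, not GUESS-ft parameters.

```python
def build_junction_db(exons_a, exons_b, flank=40):
    """Build candidate A-B junction sequences.

    exons_a, exons_b: lists of exon sequences for the 5' gene (A) and the
    3' gene (B). Each junction is the last `flank` bases of an A exon
    followed by the first `flank` bases of a B exon, keyed by the
    1-based exon numbers.
    """
    db = {}
    for i, exon_a in enumerate(exons_a, start=1):
        for j, exon_b in enumerate(exons_b, start=1):
            db[(i, j)] = exon_a[-flank:] + exon_b[:flank]
    return db


# Toy sequences; real exons would come from the reference annotation.
junctions = build_junction_db(["AAAACCCC"], ["GGGGTTTT"], flank=4)
```

Unmapped reads whose mate maps to A or B can then be aligned against these short sequences, which is far cheaper than realigning every read in the sample.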

Identification of TFG-GPR128 fusion
All available normal samples in CGHub
Subset of tumor samples selected based on RPKM expression pattern

Table. Samples across cancer types
Cancer type                                     # of normal samples  # of tumor samples
Bladder Urothelial Carcinoma [BLCA]             0 (0%)               2 (3.6%)
Breast invasive carcinoma [BRCA]                1 (0.94%)            13 (1.6%)
Head and Neck squamous cell carcinoma [HNSC]    0 (0%)               6 (2.3%)
Kidney renal clear cell carcinoma [KIRC]        1 (1.5%)             5 (1.2%)
Kidney renal papillary cell carcinoma [KIRP]    0 (0%)               1 (5.9%)
Liver hepatocellular carcinoma [LIHC]           0 (0%)               1 (5.9%)
Lung adenocarcinoma [LUAD]                      0 (0%)               1 (0.79%)
Lung squamous cell carcinoma [LUSC]             0 (0%)               9 (4%)
Prostate adenocarcinoma [PRAD]                  1 (14.3%)            2 (1.9%)
Thyroid carcinoma [THCA]                        0 (0%)               2 (0.89%)
* All performed by PRADA fusion module.

Tumors with the fusion have higher GPR128 expression levels
[Figure: RPKM expression pattern seen in KIRC tumors; fusion sample(s) show higher expression of GPR128 (activation)]
TCGA-B0-5703: 1 discordant read pair in the tumor sample, 33 discordant read pairs in the matched normal

Thanks. http://sourceforge.net/projects/prada/