RNA SEQUENCING AND DATA ANALYSIS

Length of mRNA transcripts in the human genome
  [Figure: histogram of transcript lengths, 0 to 10,000, with an inset covering 0 to 800]
  Insert size: ~200 bp

Overview of RNA sequencing protocol: sequencing
  [Diagram labels: forward read, reverse read, insert]
  Read length: 48-76 bp

Sequencing parameters
  Read depth
    Minimum mapped reads: 10 million for quantitative analysis of a mammalian transcriptome
    More reads are needed for splice variant discovery and for differential comparison among samples
  Current output: 120-180 million raw reads per lane
  Multiplex level: 4-12 libraries per lane recommended
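The depth and multiplexing figures above can be checked with simple arithmetic. The sketch below estimates mapped reads per library when a lane is split evenly. The function name and the 80% mapping rate are assumptions made for illustration; the slide gives only raw reads per lane and a minimum for mapped reads.

```python
def mapped_reads_per_library(raw_reads_per_lane, libraries_per_lane, mapping_rate):
    """Estimate mapped reads per library, assuming the lane is shared evenly."""
    if libraries_per_lane <= 0:
        raise ValueError("libraries_per_lane must be positive")
    return raw_reads_per_lane / libraries_per_lane * mapping_rate


# Minimum from the slide for quantitative analysis of a mammalian transcriptome.
MIN_MAPPED_READS = 10_000_000

# Assumed mapping rate (illustrative only, not from the slides).
ASSUMED_MAPPING_RATE = 0.8

# Extremes of the ranges on the slide: 120-180 million raw reads, 4-12 libraries.
lowest = mapped_reads_per_library(120_000_000, 12, ASSUMED_MAPPING_RATE)
highest = mapped_reads_per_library(180_000_000, 4, ASSUMED_MAPPING_RATE)

print(f"Mapped reads per library: {lowest:,.0f} to {highest:,.0f}")
print("Lowest case meets the minimum:", lowest >= MIN_MAPPED_READS)
```

Under this assumed mapping rate, the highest multiplex level on a low-output lane can fall short of the minimum, which is one reason to multiplex less when splice variant discovery is the goal.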

Not all RNA is the same
  Types of RNA: messenger RNA, micro RNA, long non-coding RNA, ribosomal RNA

Methods for RNA enrichment prior to library construction
  Poly(A)-RNA selection: by hybridization to oligo(dT) beads
    Mature mRNA is highly enriched
    Efficient for quantification of gene expression levels
    Limitation: 3' bias that correlates with RNA degradation
  rRNA depletion: by hybridization to bead-bound rRNA probes
    rRNA sequence-dependent and species-specific
    All non-rRNA is retained: premature mRNA, long non-coding RNA
  Small RNA extraction
    Specific kits are required to retain small RNA
    Optional fine size selection by gel or column

Different methods capture different types of RNA
  RNA types: messenger RNA, micro RNA, long non-coding RNA, ribosomal RNA
  Methods: poly(A)-RNA selection, rRNA depletion, small RNA extraction

Different methods capture different types of RNA Poly(A)-RNA selection rRNA depletion Messenger RNA X X Small RNA extraction Micro RNA X X Long non-coding RNA Ribosomal RNA X X

READ QUALITY
  Paraffin embedded vs fresh frozen
  [Figure panel: fresh frozen]

First step: alignment

Or: assembly, then alignment

Alignment versus assembly
  Assembly: Trinity, Cufflinks, ABySS
    Particularly useful when no reference genome is available, as with bacterial transcriptomes
  Alignment: Bowtie, BWA, MOSAIK
    Maximum sensitivity, fewer false positives

RNA sequencing applications
  Quantification of transcript expression levels
  Detection of splice variation (different isoforms of the same gene)
  Allele-specific expression levels
  Strand-specific expression levels
  Detection of fusion transcripts (such as BCR-ABL in CML)
  Detection of sequence variation (limited application)
  Validation of DNA sequence variants

RNA-seq expression levels are linear where microarrays get saturated or are insensitive
  Expression is measured as reads per kilobase per million (RPKM) or fragments per kilobase of exon per million fragments mapped (FPKM), which normalizes for gene length and library size
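The RPKM normalization can be written out directly from its definition: reads, divided by gene length in kilobases, divided by mapped reads in millions. This is a minimal sketch; the function name and the example numbers are illustrative and not taken from the slides.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene length per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # reads / (length / 1e3) / (total / 1e6) == reads * 1e9 / (length * total)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Illustrative values: 500 reads on a 2,000 bp gene in a library of 10 million mapped reads.
value = rpkm(read_count=500, gene_length_bp=2_000, total_mapped_reads=10_000_000)
print(f"RPKM = {value:.2f}")
```

FPKM uses the same formula with fragments (read pairs) counted instead of individual reads, so a pair whose two ends both map to the gene contributes once rather than twice.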

In GBM, the gene EGFR is frequently targeted by intragenic deletions

The EGFR vIII deletion occurs in the same domain as point mutations

Detecting EGFR transcript variants using RNA-seq data

SpliceSeq can detect splice variants http://bioinformatics.mdanderson.org/main/spliceseq:overview

Allele-/Strand-specific RNA-seq
  Haplotype-specific gene expression: computationally integrate RNA-seq with DNA SNP data
  Strand-specific RNA-seq requires a specific library preparation protocol
    Costs more
    Output is more accurate, and useful for analysis in the absence of a reference genome
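One common way to integrate RNA-seq with DNA SNP data is to count reads carrying each allele at a heterozygous SNP and ask whether the split departs from 50/50. The sketch below uses an exact two-sided binomial test for that purpose. It is an illustration of the idea under that assumption, not the method used in the slides, and the function name and read counts are hypothetical.

```python
from math import comb


def allelic_imbalance_pvalue(ref_reads, alt_reads):
    """Two-sided exact binomial test against an expected 50/50 allelic ratio."""
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0
    k = min(ref_reads, alt_reads)
    # P(X <= k) for X ~ Binomial(n, 0.5). The distribution is symmetric,
    # so the two-sided p-value is twice this tail, capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts at one heterozygous SNP: 18 reference reads, 2 alternate reads.
p = allelic_imbalance_pvalue(18, 2)
print(f"Allelic ratio = {18 / 20:.2f}, p = {p:.2g}")
```

In practice, reference mapping bias can shift the expected ratio away from 0.5, so real analyses usually adjust the null ratio or filter biased sites first.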

Identification of fusion transcripts
  Popular methods search for:
    Read pairs that map to two different genes
      Need to correct for gene homology
    Reads that span the fusion junction
      Split reads in half and align the halves separately
      Make a database of all possible fusion junctions and align full reads
  Tools: PRADA, MapSplice, TopHat
  http://sourceforge.net/projects/prada/
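The junction-database approach can be sketched as follows: join the end of every exon of gene A to the start of every exon of gene B, then look for reads that cross the join with enough sequence on both sides. Exact substring matching stands in for a real aligner here, and the function names, the `flank` and `min_anchor` parameters, and the toy sequences are all assumptions for illustration.

```python
from itertools import product


def build_junction_db(exons_a, exons_b, flank=40):
    """Build all exon-exon junctions with gene A upstream and gene B downstream.

    exons_a, exons_b: lists of (exon_id, sequence) tuples.
    Returns {junction_name: (junction_sequence, breakpoint_offset)}.
    """
    db = {}
    for (id_a, seq_a), (id_b, seq_b) in product(exons_a, exons_b):
        left = seq_a[-flank:]   # last bases of the upstream exon
        right = seq_b[:flank]   # first bases of the downstream exon
        db[f"{id_a}|{id_b}"] = (left + right, len(left))
    return db


def spanning_hits(read, db, min_anchor=5):
    """Return junctions that the read crosses with min_anchor bases on each side."""
    hits = []
    for name, (seq, breakpoint) in db.items():
        pos = seq.find(read)  # exact match only; a real pipeline would align
        if pos == -1:
            continue
        left_anchor = breakpoint - pos
        right_anchor = pos + len(read) - breakpoint
        if left_anchor >= min_anchor and right_anchor >= min_anchor:
            hits.append(name)
    return hits


# Toy sequences, far shorter than real exons.
db = build_junction_db([("A1", "ACGTACGTAC")], [("B1", "TTGGCCAATT")], flank=6)
print(spanning_hits("GTACTTGG", db, min_anchor=3))
```

To search for the reciprocal fusion as well, the same function can be called a second time with the two genes swapped.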

The FGFR3-TACC3 fusion in GBM is the result of a local inversion

Fusion transcripts are often associated with copy number differences and genomic breakpoints
  [Figure: copy number profile of two FGFR3-TACC3 cases in TCGA]

6.4% of GBM harbor transcript fusions involving EGFR
  All fusions fall within the area of the EGFR amplification

PRADA pipeline: Fusion Module
  Inputs (preprocessing): .bam file [PAIRED END]; .fastq files [END1 & END2]; Config.txt [location of scripts and reference files]
  Processing Module: read alignment, remap alignments, combine two ends, quality scores recalibrated
  Expression & QC Module [YES NO ONLY]: RNA-SeQC
  Fusion Module [YES NO ONLY]
  GUESS-ft [YES NO ONLY] -genea -geneb
  Outputs: RPKM & QC metrics; fusion candidates; supervised search evidence
  Discordant read pair: each end of the read pair maps uniquely to distinct protein-coding genes.
  Fusion spanning reads: a chimeric read maps to a putative junction and the mate read maps to either gene A or gene B.

Structural transcript variants in low grade glioma
  RNA-seq data from 272 TCGA low grade gliomas
  PRADA detected 1,843 fusion transcripts
  Fusion detection accuracy affected by: # mapped reads per sample
  [Figure axes: # mapped reads per sample; detected # fusion transcripts per sample]

Validation of predicted transcript fusions
  Filtering out artifacts
    Homology E value larger than 0.01 (column Evalue)
    No mismatches in junction spanning reads
    970/1,843 fusions filtered
  Count the number of partner genes for each individual gene
    Identify genes with fusions mapping to more than 10 different chromosome arms
    509/970 fusions filtered
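The two filtering passes can be sketched as below. This is one reading of the slide, with assumptions stated: a fusion is kept when its homology E value exceeds 0.01 and its junction spanning reads have no mismatches, and a gene counts as promiscuous when its partners sit on more than 10 different chromosome arms. The dictionary keys (`evalue`, `jsr_mismatches`, `arm_a`, `arm_b`) are hypothetical field names, not PRADA's output columns apart from the E value.

```python
from collections import defaultdict


def filter_fusions(fusions, min_evalue=0.01, max_arms=10):
    """Return (after_pass_1, after_pass_2) lists of fusion records."""
    # Pass 1: drop likely homology artifacts and imperfect junction matches.
    pass1 = [
        f for f in fusions
        if f["evalue"] > min_evalue and f["jsr_mismatches"] == 0
    ]

    # Pass 2: collect, for each gene, the chromosome arms of its partners.
    partner_arms = defaultdict(set)
    for f in pass1:
        partner_arms[f["gene_a"]].add(f["arm_b"])
        partner_arms[f["gene_b"]].add(f["arm_a"])
    promiscuous = {g for g, arms in partner_arms.items() if len(arms) > max_arms}

    pass2 = [
        f for f in pass1
        if f["gene_a"] not in promiscuous and f["gene_b"] not in promiscuous
    ]
    return pass1, pass2


# Hypothetical records: one clean fusion and one homology artifact.
example = [
    {"gene_a": "G1", "gene_b": "G2", "arm_a": "2p", "arm_b": "3q", "evalue": 0.5, "jsr_mismatches": 0},
    {"gene_a": "H1", "gene_b": "H2", "arm_a": "5q", "arm_b": "5q", "evalue": 0.001, "jsr_mismatches": 0},
]
kept_1, kept_2 = filter_fusions(example)
print(len(kept_1), len(kept_2))
```

Returning both intermediate lists makes it easy to report how many candidates each pass removes.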

Define four tiers of fusion transcripts based on evidence
  Tier 1: at least 3 discordant read pairs (DSP), two perfect-match junction spanning reads (JSR), and both partner genes fused to only one other partner gene in the same sample
  Tier 2: at least 2 DSP and 1 JSR, with a DNA breakpoint within a 100 kb window
    Use matching DNA copy number profile
  Tier 3: at least 2 DSP and 1 JSR, unique partner genes, with predicted junction consistent for all
  Tier 4: the rest
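The tier definitions translate naturally into a small decision function. The sketch below assumes the tiers are checked in order and that the evidence for each fusion has already been summarized; the `FusionEvidence` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FusionEvidence:
    discordant_pairs: int             # DSP
    junction_reads: int               # JSR, any quality
    perfect_junction_reads: int       # JSR with no mismatches
    unique_partners: bool             # each partner fused to only one gene in the sample
    dna_breakpoint_within_100kb: bool
    junction_consistent: bool         # predicted junction agrees across the evidence


def assign_tier(ev):
    """Return the tier (1 to 4) for one fusion candidate, checking tiers in order."""
    if ev.discordant_pairs >= 3 and ev.perfect_junction_reads >= 2 and ev.unique_partners:
        return 1
    if ev.discordant_pairs >= 2 and ev.junction_reads >= 1:
        if ev.dna_breakpoint_within_100kb:
            return 2
        if ev.unique_partners and ev.junction_consistent:
            return 3
    return 4


# Hypothetical candidate with strong RNA evidence and unique partners.
candidate = FusionEvidence(5, 3, 2, True, False, True)
print("Tier", assign_tier(candidate))
```

Checking the tiers in order means a candidate that satisfies several definitions is reported at the strongest one.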

Validation of RNA fusions using output of BreakDancer
  BreakDancer detects DNA rearrangements in low pass sequencing data

Variant detection
  From the TCGA clear cell renal cell carcinoma project
  Approximately 30% of mutations are covered sufficiently to be detected, at a validation rate of ~80%.
  The reverse transcriptase step that converts RNA to cDNA complicates detection of RNA edits and mutations

RNA sequencing read alignment in PRADA
  Reads are aligned to all possible transcripts, including multiple transcripts from the same gene
  Reads are also aligned to the genome
  A single, final placement for each read is determined by re-mapping

PRADA alignments: advantages versus disadvantages
  Advantages
    Alignment to DNA means mapping of unannotated transcripts
    Alignment to the transcriptome means mapping across exon-exon junctions
  Disadvantage
    More conservative alignment than split-read

PRADA pipeline [diagram: inputs, Processing Module, Expression & QC Module, Fusion Module, GUESS-ft, outputs]
  http://sourceforge.net/projects/prada/
  PRADA focuses on the analysis of paired-end RNA-sequencing data.
  Four modules:
    1. Processing
    2. Expression and Quality Control
    3. Gene fusion
    4. GUESS-ft: General User defined Supervised Search for fusion transcripts

PRADA pipeline: Expression & QC Module
  RNA-SeQC process (java)
  RNA-SeQC provides three types of quality control metrics: read counts, coverage, correlation
  RPKM values at the transcript level, for the longest transcript

PRADA pipeline: implementation results
  Samples processed: >400 KIRC, >170 GBM
  Works well in the MDACC HPC* system
  PRADA-fusion module validation rate: ~85% (53 out of 62)

RNA sequencing in The Cancer Genome Atlas
  mRNA: poly-A mRNA is purified from total RNA using poly-T oligo-attached magnetic beads.
  miRNA: total RNA is mixed with oligo(dT) MicroBeads and loaded into a MACS column, which is then placed on a MultiMACS separator. From the flow-through, small RNAs, including miRNAs, are recovered by ethanol precipitation.

Detecting fusion transcripts in GBM

KIRC fusion results
  We analyzed 416 RNA-seq samples from clear cell renal carcinoma (ccRCC), available through TCGA.
  We identified 80 bona fide fusion transcripts (57 intrachromosomal, 33 interchromosomal) in 62 individual samples.
  Recurrent fusions:
    SFPQ-TFE3 (n=5, chr1-chrX)
    DHX33-NLRP1 (n=2, chr2)
    TRIP12-SLC16A14 (n=2, chr17)
    TFG-GPR128 (n=4, chr3)

KIRC fusion validation
  PRADA-fusion module validation rate: ~85% (11 out of 13)
  RT-PCR and FISH assays
  TFE3-SFPQ was validated in three individual samples

  Sample ID                     5' Gene   3' Gene   Discordant Read Pairs  Fusion Span Reads  Fusion Junction(s)  5' Gene Chr  3' Gene Chr  Validated?
  TCGA-AK-3456-01A-02R-1325-07  TFE3      SFPQ      175                    129                1                   chrX         chr1         Yes
  TCGA-AK-3456-01A-02R-1325-07  SFPQ      TFE3      116                    81                 1                   chr1         chrX         Yes
  TCGA-A3-3313-01A-02R-1325-07  C6orf106  LRRC1     90                     40                 2                   chr6         chr6         Yes
  TCGA-A3-3313-01A-02R-1325-07  CYP39A1   LEMD2     37                     9                  1                   chr6         chr6         Yes
  TCGA-B2-4101-01A-02R-1277-07  FAM172A   FHIT      17                     4                  1                   chr5         chr3         Yes
  TCGA-AK-3445-01A-02R-1277-07  KIAA0802  LRRC41    14                     6                  1                   chr18        chr1         Yes
  TCGA-B0-5095-01A-01R-1420-07  GORASP2   WIPF1     14                     2                  1                   chr2         chr2         Yes
  TCGA-A3-3313-01A-02R-1325-07  ZNF193    MRPS18A   11                     3                  1                   chr6         chr6         Yes
  TCGA-A3-3313-01A-02R-1325-07  FTSJD2    GPX6      9                      8                  1                   chr6         chr6         Yes
  TCGA-B0-4945-01A-01R-1420-07  KIAA0427  GRM4      8                      5                  1                   chr18        chr6         No
  TCGA-B8-4143-01A-01R-1188-07  SLC36A1   TTC37     5                      5                  1                   chr5         chr5         No

KIRC fusion validation: RT-PCR
  Figure 2. RT-PCR results for TFE3 fusion validations for sample TCGA-AK-3456. Panels (a) and (b); panel labels: SFPQ-TFE3, TFE3-SFPQ.

TFG-GPR128 has been reported in other cancers
  TCGA has thousands of RNA-seq samples. How can we quickly scan many samples for the presence of this fusion?

GUESS-ft: General User defined Supervised Search for fusion transcripts (Supervised Search Module)
  Inputs: gene pair A and B (-genea -geneb); BAM with both ends combined, from the Processing Module
  Discordant reads A-B: two read ends map to A and B respectively
    Use high quality mapping reads only, check that read orientation fulfills the fusion schema, allow up to one mismatch
  Junction spanning reads
    Parse unmapped reads with the other end mapping to A or B
    Map parsed reads to a DB of all possible exon junctions
    List reads with one end mapping to a junction and the other mapping to A or B
  Output: summary report (supervised search evidence)
  [Diagram note: one step in the full pipeline is marked as time consuming]
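The first stage of a supervised search for a given gene pair can be sketched as a per-pair classifier: pairs whose ends map to A and B are discordant evidence, and pairs with one end in A or B and the other unmapped are passed on to the junction search. This is a simplified illustration, not GUESS-ft's implementation. The `ReadEnd` record, the `min_mapq` threshold of 30, and the omission of the read orientation check are assumptions.

```python
from typing import NamedTuple, Optional


class ReadEnd(NamedTuple):
    gene: Optional[str]  # gene this end maps to, or None if unmapped
    mapq: int            # mapping quality
    mismatches: int      # mismatches in the alignment


def classify_pair(end1, end2, gene_a, gene_b, min_mapq=30, max_mismatches=1):
    """Classify one read pair as 'discordant', 'junction_candidate', or 'other'."""

    def is_good(end):
        # High quality mapping reads only, allowing up to one mismatch.
        return (
            end.gene is not None
            and end.mapq >= min_mapq
            and end.mismatches <= max_mismatches
        )

    targets = {gene_a, gene_b}

    # Two read ends map to A and B respectively.
    if is_good(end1) and is_good(end2) and {end1.gene, end2.gene} == targets:
        return "discordant"

    # One end unmapped, the other mapping to A or B: candidate for the junction DB.
    for mapped, other in ((end1, end2), (end2, end1)):
        if other.gene is None and is_good(mapped) and mapped.gene in targets:
            return "junction_candidate"

    return "other"


# Hypothetical read pair supporting a TFG-GPR128 fusion.
pair = (ReadEnd("TFG", 60, 0), ReadEnd("GPR128", 60, 1))
print(classify_pair(*pair, gene_a="TFG", gene_b="GPR128"))
```

Because only reads touching the two named genes are considered, this kind of search is much cheaper than genome-wide fusion discovery, which is what makes scanning many samples practical.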

Identification of TFG-GPR128 fusion
  All available normal samples in CGHub
  Subset of tumor samples selected based on RPKM expression pattern

  Table. Samples across cancer types
  Cancer Type                                   # of normal samples  # of tumor samples
  Bladder Urothelial Carcinoma [BLCA]           0 (0%)               2 (3.6%)
  Breast invasive carcinoma [BRCA]              1 (0.94%)            13 (1.6%)
  Head and Neck squamous cell carcinoma [HNSC]  0 (0%)               6 (2.3%)
  Kidney renal clear cell carcinoma [KIRC]      1 (1.5%)             5 (1.2%)
  Kidney renal papillary cell carcinoma [KIRP]  0 (0%)               1 (5.9%)
  Liver hepatocellular carcinoma [LIHC]         0 (0%)               1 (5.9%)
  Lung adenocarcinoma [LUAD]                    0 (0%)               1 (0.79%)
  Lung squamous cell carcinoma [LUSC]           0 (0%)               9 (4%)
  Prostate adenocarcinoma [PRAD]                1 (14.3%)            2 (1.9%)
  Thyroid carcinoma [THCA]                      0 (0%)               2 (0.89%)
  * All performed by PRADA fusion module.

Tumors with the fusion have higher GPR128 expression levels
  RPKM expression pattern seen in KIRC tumors
  Fusion sample(s): higher expression of GPR128 (activation)
  TCGA-B0-5703: 1 discordant read pair in the tumor sample, 33 discordant read pairs in the matched normal

Thanks. http://sourceforge.net/projects/prada/