Inference of Isoforms from Short Sequence Reads

Similar documents
genomics for systems biology / ISB2020 RNA sequencing (RNA-seq)

RASA: Robust Alternative Splicing Analysis for Human Transcriptome Arrays

Lecture 8 Understanding Transcription RNA-seq analysis. Foundations of Computational Systems Biology David K. Gifford

Transcriptome Analysis

Ambient temperature regulated flowering time

Analysis of Massively Parallel Sequencing Data Application of Illumina Sequencing to the Genetics of Human Cancers

Computational Analysis of UHT Sequences Histone modifications, CAGE, RNA-Seq

RNA-seq Introduction

Introduction. Introduction

RNA SEQUENCING AND DATA ANALYSIS

MODULE 4: SPLICING. Removal of introns from messenger RNA by splicing

Transcript reconstruction

Analyse de données de séquençage haut débit

Effects of UBL5 knockdown on cell cycle distribution and sister chromatid cohesion

Transcriptome and isoform reconstruc1on with short reads. Tangled up in reads

Methods to study splicing from high-throughput RNA Sequencing data

Accessing and Using ENCODE Data Dr. Peggy J. Farnham

Supplemental Data. Integrating omics and alternative splicing i reveals insights i into grape response to high temperature

An Analysis of MDM4 Alternative Splicing and Effects Across Cancer Cell Lines

Global regulation of alternative splicing by adenosine deaminase acting on RNA (ADAR)

Introduction to Systems Biology of Cancer Lecture 2

Supplemental Methods RNA sequencing experiment

Daehwan Kim September 2018

Supplemental Figure S1. Expression of Cirbp mrna in mouse tissues and NIH3T3 cells.

Computational Identification and Prediction of Tissue-Specific Alternative Splicing in H. Sapiens. Eric Van Nostrand CS229 Final Project

Computational Modeling and Inference of Alternative Splicing Regulation.

Mechanisms of alternative splicing regulation

TITLE: The Role Of Alternative Splicing In Breast Cancer Progression

Supplementary Figures

RNA SEQUENCING AND DATA ANALYSIS

Analysis and design of RNA sequencing experiments for identifying isoform regulation

High-Throughput Sequencing Course

Lectures 13: High throughput sequencing: Beyond the genome. Spring 2017 March 28, 2017

Alternative splicing. Biosciences 741: Genomics Fall, 2013 Week 6

Profiles of gene expression & diagnosis/prognosis of cancer. MCs in Advanced Genetics Ainoa Planas Riverola

RNA-Seq guided gene therapy for vision loss. Michael H. Farkas

BIMM 143. RNA sequencing overview. Genome Informatics II. Barry Grant. Lecture In vivo. In vitro.

Nature Structural & Molecular Biology: doi: /nsmb Supplementary Figure 1

MATS: a Bayesian framework for flexible detection of differential alternative splicing from RNA-Seq data

Computational aspects of ChIP-seq. John Marioni Research Group Leader European Bioinformatics Institute European Molecular Biology Laboratory

Variant Classification. Author: Mike Thiesen, Golden Helix, Inc.

BWA alignment to reference transcriptome and genome. Convert transcriptome mappings back to genome space

RNA- seq Introduc1on. Promises and pi7alls

Studying Alternative Splicing

A Practical Guide to Integrative Genomics by RNA-seq and ChIP-seq Analysis

CRISPR/Cas9 Enrichment and Long-read WGS for Structural Variant Discovery

Evaluating Classifiers for Disease Gene Discovery

Supplementary Material for IPred - Integrating Ab Initio and Evidence Based Gene Predictions to Improve Prediction Accuracy

Selective depletion of abundant RNAs to enable transcriptome analysis of lowinput and highly-degraded RNA from FFPE breast cancer samples

Pre-mRNA has introns The splicing complex recognizes semiconserved sequences

Experimental Design For Microarray Experiments. Robert Gentleman, Denise Scholtens Arden Miller, Sandrine Dudoit

A Statistical Framework for Classification of Tumor Type from microrna Data

Hands-On Ten The BRCA1 Gene and Protein

MODULE 3: TRANSCRIPTION PART II

Identification of Prostate Cancer LncRNAs by RNA-Seq

Iso-Seq Method Updates and Target Enrichment Without Amplification for SMRT Sequencing

GENOME-WIDE DETECTION OF ALTERNATIVE SPLICING IN EXPRESSED SEQUENCES USING PARTIAL ORDER MULTIPLE SEQUENCE ALIGNMENT GRAPHS

Biostatistical modelling in genomics for clinical cancer studies

CS 6824: Tissue-Based Map of the Human Proteome

ChIP-seq data analysis

Nature Structural & Molecular Biology: doi: /nsmb.2419

Multi-omics data integration colon cancer using proteogenomics approach

Part [2.1]: Evaluation of Markers for Treatment Selection Linking Clinical and Statistical Goals

Annotation of Chimp Chunk 2-10 Jerome M Molleston 5/4/2009

SCIENCE CHINA Life Sciences

AbSeq on the BD Rhapsody system: Exploration of single-cell gene regulation by simultaneous digital mrna and protein quantification

MapSplice: Accurate Mapping of RNA-Seq Reads for Splice Junction Discovery

High Throughput Sequence (HTS) data analysis. Lei Zhou

DeconRNASeq: A Statistical Framework for Deconvolution of Heterogeneous Tissue Samples Based on mrna-seq data

Methods: Biological Data

Supplementary note: Comparison of deletion variants identified in this study and four earlier studies

Bio 111 Study Guide Chapter 17 From Gene to Protein

SUPPLEMENTARY INFORMATION. Intron retention is a widespread mechanism of tumor suppressor inactivation.

NEXT GENERATION SEQUENCING. R. Piazza (MD, PhD) Dept. of Medicine and Surgery, University of Milano-Bicocca

Ridge regression for risk prediction

The corrected Figure S1J is shown below. The text changes are as follows, with additions in bold and deletions in bracketed italics:

A Quick-Start Guide for rseqdiff

Computer Science, Biology, and Biomedical Informatics (CoSBBI) Outline. Molecular Biology of Cancer AND. Goals/Expectations. David Boone 7/1/2015

RECENT ADVANCES IN THE MOLECULAR DIAGNOSIS OF BREAST CANCER

EST alignments suggest that [secret number]% of Arabidopsis thaliana genes are alternatively spliced

Research Strategy: 1. Background and Significance

Tutorial: RNA-Seq Analysis Part II: Non-Specific Matches and Expression Measures

Small-area estimation of mental illness prevalence for schools

GeneOverlap: An R package to test and visualize

Big Image-Omics Data Analytics for Clinical Outcome Prediction

ncounter Assay Automated Process Immobilize and align reporter for image collecting and barcode counting ncounter Prep Station

Analysis of left-censored multiplex immunoassay data: A unified approach

Antibodies and T Cell Receptor Genetics Generation of Antigen Receptor Diversity

Supplemental Information For: The genetics of splicing in neuroblastoma

Bayesian hierarchical modelling

The Alternative Choice of Constitutive Exons throughout Evolution

Results. Abstract. Introduc4on. Conclusions. Methods. Funding

Breast cancer. Risk factors you cannot change include: Treatment Plan Selection. Inferring Transcriptional Module from Breast Cancer Profile Data

REGULATED SPLICING AND THE UNSOLVED MYSTERY OF SPLICEOSOME MUTATIONS IN CANCER

Gene-microRNA network module analysis for ovarian cancer

Lecture 10: Learning Optimal Personalized Treatment Rules Under Risk Constraint

Trinity: Transcriptome Assembly for Genetic and Functional Analysis of Cancer [U24]

Detection of copy number variations in PCR-enriched targeted sequencing data

Proteogenomic analysis of alternative splicing: the search for novel biomarkers for colorectal cancer Gosia Komor

Finding subtle mutations with the Shannon human mrna splicing pipeline

Transcription:

Inference of Isoforms from Short Sequence Reads Tao Jiang Department of Computer Science and Engineering University of California, Riverside Tsinghua University Joint work with Jianxing Feng and Wei Li

Outline 1 Background and Existing Work 2 Quadratic Program 3 Valid Isoforms 4 IsoInfer: The Basic Algorithm 5 An Improvement by Lasso Regression 6 Experimental Results and Comparison to Cufflinks and Scripture 7 Concluding remarks

Transcription, Splicing and Translation http://en.wikipedia.org/wiki/gene

Alternative Splicing

Alternative Splicing 5 3 Pre-mRNA Isoform 1

Alternative Splicing 5 3 Pre-mRNA Isoform 1 Intron retention Isoform 2

Alternative Splicing 5 3 Pre-mRNA Isoform 1 Intron retention Exon skipping Isoform 2 Isoform 3

Alternative Splicing 5 3 Pre-mRNA Isoform 1 Intron retention Exon skipping Alternate 3 Isoform 2 Isoform 3 Isoform 4

Alternative Splicing 5 3 Pre-mRNA Isoform 1 Intron retention Exon skipping Alternate 3 Alternate 5 Isoform 2 Isoform 3 Isoform 4 Isoform 5

Alternative Splicing 5 3 Pre-mRNA Isoform 1 Intron retention Exon skipping Alternate 3 Alternate 5 Mixed Isoform 2 Isoform 3 Isoform 4 Isoform 5 Isoform 6

Alternative Splicing 5 3 Pre-mRNA Isoform 1 Intron retention Exon skipping Alternate 3 Alternate 5 Mixed Isoform 2 Isoform 3 Isoform 4 Isoform 5 Isoform 6 Widely spread in human genome More than 92% multi-exon genes

Alternative Splicing An example: KLF6 gene in human chromosome 10, a tumor suppressor gene, includes 4 alternative splicing variants. Scale chr10: KLF6 KLF6 KLF6 KLF6 2 kb 3812000 3812500 3813000 3813500 3814000 3814500 3815000 3815500 3816000 3816500 3817000 RefSeq Genes

Alternative Splicing An example: KLF6 gene in human chromosome 10, a tumor suppressor gene, includes 4 alternative splicing variants. Scale chr10: KLF6 KLF6 KLF6 KLF6 2 kb 3812000 3812500 3813000 3813500 3814000 3814500 3815000 3815500 3816000 3816500 3817000 RefSeq Genes Three of the four variants, if expressed, inactivate the tumor suppressor gene and are associated with increased prostate cancer risk (DiFeo et al. Cancer Res. 2005).

Experimental Methods For transcriptome analysis (isoform detection and sequencing, alternative splicing and expression level analysis) Traditional methods: EST (Expressed Sequence Tag) RACE (Rapid Amplification of cdna Ends) SAGE (Serial Analysis of Gene Expression) CAGE (Cap Analysis Gene Expression)... cost ineffective New methods: RACEArray (Djebali et al, Nat Methods, 2008) PCR+ deep-well pooling +sequencing (Salehi-Ashtiani et al, Nat Methods, 2008)... large scale? unclear

RNA-Seq Assumption: uniformly distributed along expressed segments

The Problem Single-end short reads (RNA-Seq) Paired-end short reads (RNA-Seq) Exon-intron boundaries (known or RNA-Seq) (Trapnell et al, Bioinformatics, 2009) TopHat (Au et al, NAR, 2010) SpliceMap TSS/PAS pairs (known or GIS-PET) (Ng et al, Nat Methods, 2005) (Fullwood et al, Genome Res, 2009) PAS profiling (3P-Seq) (Jan et al, Nature 2010)

The Problem Single-end short reads (RNA-Seq) Paired-end short reads (RNA-Seq) Exon-intron boundaries (known or RNA-Seq) (Trapnell et al, Bioinformatics, 2009) TopHat (Au et al, NAR, 2010) SpliceMap TSS/PAS pairs (known or GIS-PET) (Ng et al, Nat Methods, 2005) (Fullwood et al, Genome Res, 2009) PAS profiling (3P-Seq) (Jan et al, Nature 2010) Problem Given the data, find all the isoforms and expression levels of each isoform.

Existing Work Expression level estimation RSAT. (Jiang & Wong, Bioinformatics. 2009) RSEM. (Li et al, Bioinformatics. 2009)...

Existing Work Expression level estimation RSAT. (Jiang & Wong, Bioinformatics. 2009) RSEM. (Li et al, Bioinformatics. 2009)... Isoform reconstruction Scripture. (Guttman et al, Nat Biotechnology. 2010.5) Uses weighting to filter out lowly expressed isoforms. Focuses on ful-length transcripts. Cufflinks. (Trapnell et al, Nat Biotechnology. 2010.5) Uses a minimal path cover algorithm to find a parsimonious solution to explain the read data.

Outline 1 Background and Existing Work 2 Quadratic Program 3 Valid Isoforms 4 IsoInfer: The Basic Algorithm 5 An Improvement by Lasso Regression 6 Experimental Results and Comparison to Cufflinks and Scripture 7 Concluding remarks

Expressed Segments r i : # reads mapped to expressed segments s i. L-1 L-1 L-1 L-1 } Junction sequences }

Quadratic Program For a gene with isoforms F, expressed segments S and junctions J:

Quadratic Program For a gene with isoforms F, expressed segments S and junctions J: r i B(M,p i ), p i = C s i f,f F x fl i

Quadratic Program For a gene with isoforms F, expressed segments S and junctions J: r i B(M,p i ), p i = C s i f,f F x fl i Approximate it with N(µ i,σ 2 i ) µ i = Mp i,σ 2 i = Mp i (1 p i ) Mp i = µ i

Quadratic Program For a gene with isoforms F, expressed segments S and junctions J: r i B(M,p i ), p i = C s i f,f F x fl i Approximate it with N(µ i,σ 2 i ) µ i = Mp i,σ 2 i = Mp i (1 p i ) Mp i = µ i ǫ i = r i µ i, ǫ i σ i obeys N(0,1) approximately.

Quadratic Program For a gene with isoforms F, expressed segments S and junctions J: r i B(M,p i ), p i = C s i f,f F x fl i Approximate it with N(µ i,σ 2 i ) µ i = Mp i,σ 2 i = Mp i (1 p i ) Mp i = µ i ǫ i = r i µ i, ǫ i σ i obeys N(0,1) approximately. min s.t. z = s i S J (ǫ i σ i ) 2 s i f,f F x fl i +ǫ i = d i, s i S J x f 0, f F

Quadratic Program min s.t. z = s i S J (ǫ i σ i ) 2 s i f,f F x fl i +ǫ i = d i, s i S J x f 0, f F

Quadratic Program min s.t. z = s i S J (ǫ i σ i ) 2 s i f,f F x fl i +ǫ i = d i, s i S J x f 0, f F Convex program

Quadratic Program min s.t. z = s i S J (ǫ i σ i ) 2 s i f,f F x fl i +ǫ i = d i, s i S J x f 0, f F Convex program If σ i is knownx f corresponds to the maximum likelihood estimation

Quadratic Program min s.t. z = s i S J (ǫ i σ i ) 2 s i f,f F x fl i +ǫ i = d i, s i S J x f 0, f F Convex program If σ i is knownx f corresponds to the maximum likelihood estimation Replace σ i with d i

Quadratic Program min s.t. z = s i S J (ǫ i σ i ) 2 s i f,f F x fl i +ǫ i = d i, s i S J x f 0, f F Convex program If σ i is knownx f corresponds to the maximum likelihood estimation Replace σ i with d i z χ 2 ( S + J )

Outline 1 Background and Existing Work 2 Quadratic Program 3 Valid Isoforms 4 IsoInfer: The Basic Algorithm 5 An Improvement by Lasso Regression 6 Experimental Results and Comparison to Cufflinks and Scripture 7 Concluding remarks

Single-end Reads Theorem Under the uniform sampling assumption, the probability that an isoform consisting of t exons, with expression level α RPKM, has all its junctions covered by single-end reads is at least P = (1 e αlm/109 ) t 1 RPKM : Reads Per Kilobase of exon model per Million mapped reads.

Single-end Reads Theorem Under the uniform sampling assumption, the probability that an isoform consisting of t exons, with expression level α RPKM, has all its junctions covered by single-end reads is at least P = (1 e αlm/109 ) t 1 RPKM : Reads Per Kilobase of exon model per Million mapped reads. Example M = 40,000,000,L = 30,α = 6 RPKM. If t = 10, then P = 99.3% If t = 100, then P = 92.8%

Paired-end Reads Theorem The probability that there are no paired-end reads with start positions in the first interval and end positions in the third interval is upper bounded by P M,h,α (w 1,w 2,w 3 ) = (1 P 0 ) M e MP 0 where P 0 = 10 9 α 0 i<w 1 u(i) l(i) h(x)dx, l(i) = w 1 i +w 2, and u(i) = w 1 i +w 2 +w 3.

Valid Isoforms Informative pair Segment pair (s i,s j ),i < j, on isoform f is an informative pair if P M,h,α (l i +L 1,g i,j,l j +L 1) < 0.05, where g i,j = i<k<j l kf[k]

Valid Isoforms Informative pair Segment pair (s i,s j ),i < j, on isoform f is an informative pair if P M,h,α (l i +L 1,g i,j,l j +L 1) < 0.05, where g i,j = i<k<j l kf[k] Valid Isoform An isoform f is valid if: 1 All the exon-exon junctions are covered by short reads. 2 All the informative pairs are supported by short reads. α controls this condition. 3 The start-end expressed segment pair appears in the given start-end pair set. (i.e., the TSS/PAS pairs)

Outline 1 Background and Existing Work 2 Quadratic Program 3 Valid Isoforms 4 IsoInfer: The Basic Algorithm 5 An Improvement by Lasso Regression 6 Experimental Results and Comparison to Cufflinks and Scripture 7 Concluding remarks

The Framework

Isoform Inference on Each Cluster 1 Enumerate all valid isoforms.

Isoform Inference on Each Cluster 1 Enumerate all valid isoforms. 2 Generate subinstances with each one of them focusing on a subset of expressed segments.

Isoform Inference on Each Cluster 1 Enumerate all valid isoforms. 2 Generate subinstances with each one of them focusing on a subset of expressed segments. 3 On each subinstance, enumerate all the subsets of valid isoforms, use QP to find the best one.

Isoform Inference on Each Cluster 1 Enumerate all valid isoforms. 2 Generate subinstances with each one of them focusing on a subset of expressed segments. 3 On each subinstance, enumerate all the subsets of valid isoforms, use QP to find the best one. 4 Combine the results of all the subinstances using set cover.

Isoform Inference on Each Cluster 1 Enumerate all valid isoforms. 2 Generate subinstances with each one of them focusing on a subset of expressed segments. 3 On each subinstance, enumerate all the subsets of valid isoforms, use QP to find the best one. 4 Combine the results of all the subinstances using set cover. s 1 s 2 s 3 s 4 s 5 s 6 s 7 s 8 s 9 s 10 s 11 s 12 s 13 s 14 s 15 s 16

Outline 1 Background and Existing Work 2 Quadratic Program 3 Valid Isoforms 4 IsoInfer: The Basic Algorithm 5 An Improvement by Lasso Regression 6 Experimental Results and Comparison to Cufflinks and Scripture 7 Concluding remarks

Lasso Model LASSO: Least Absolute Shrinkage and Selection Operator (Tibshirani. Journal of the Royal Statistical Society, 1996) minf(x) = ( d i x f ) 2 +λ x f (1) l i i s i f,f F f F s.t. x f 0,f F (2) which is equivalent to the following constrained form: minf(x) = ( d i x f ) 2 (3) l i i s i f,f F s.t. x f γ (4) f F x f 0,f F (5)

Lasso Model - Completeness minf(x) = ( d i x f ) 2 (6) l i i s i f,f F s.t. x f γ (7) f F x f 0,f F (8) x f I si f p,if s i has mapped read (9) f F

Outline 1 Background and Existing Work 2 Quadratic Program 3 Valid Isoforms 4 IsoInfer: The Basic Algorithm 5 An Improvement by Lasso Regression 6 Experimental Results and Comparison to Cufflinks and Scripture 7 Concluding remarks

Expression Level Estimation 2 (A) 5 0 5 5 1.5 log(predicted FPKM) Count 2 10 1 0.5 0 0 0.1 0.1 1 1 10 10 100 Isoform expression levels (RPKM) >100 0 5 10 log(true RPKM) 2 (D) Cufflinks R =0.91679 10 5 0 5 5 (C) IsoInfer, R =0.8965 log(predicted RPKM) 2.5 0 2 (B) IsoLasso, R =0.89886 log(predicted RPKM) x 10 log(predicted score) 4 3 0 5 10 log(true RPKM) 10 5 0 5 5 0 5 10 log(true RPKM) 2 (E) Scripture R =0.50213 10 5 0 5 5 0 5 10 log(true RPKM) Figure: The distribution of simulated isoform expression levels (left), and the accuracy of expression level estimation for different methods (right), using 80M 75 2 paired-end reads. Mouse transcriptome is used.

Isoform Reconstruction - Single-end 0.45 0.4 0.35 0.3 IsoLasso IsoInfer without TSS/PAS Cufflinks Scripture 0.9 0.8 0.7 0.6 IsoLasso IsoInfer without TSS/PAS Cufflinks Scripture Sensitivity 0.25 0.2 Precision 0.5 0.4 0.15 0.3 0.1 0.2 0.05 0.1 0 10M 20M 40M 80M 160M Number of single end reads 0 10M 20M 40M 80M 160M Number of single end reads Figure: Sensitivity (left) and precision (right) on single-end reads

Isoform Reconstruction - Paired-end Sensitivity 0.6 0.5 0.4 0.3 0.2 0.1 IsoLasso IsoInfer without TSS/PAS Cufflinks Scripture 0 20M 40M 60M 80M 100M Number of paired end reads Precision 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 IsoLasso 0.2 IsoInfer without TSS/PAS 0.1 Cufflinks Scripture 0 20M 40M 60M 80M 100M Number of paired end reads Figure: Sensitivity (left) and precision (right) on paired-end reads

Isoform Reconstruction - Running time Running time (seconds) 18000 16000 14000 12000 10000 8000 6000 4000 2000 IsoLasso IsoInfer without TSS/PAS Cufflinks Scripture 0 20M 40M 60M 80M 100M Number of paired end reads

Isoform Reconstruction - Real data Figure: The number of known isoforms of mouse (A) and human (B), and the number of predicted isoforms of mouse (C) and human (D), assembled by IsoLasso, Cufflinks and Scripture.

Isoform Reconstruction - Real data Figure: An alternative 5 start isoform of gene Tmem70 in mouse C2C12 myoblast RNA-Seq data

With TSS/PAS information Sensitivity 0.6 0.5 0.4 0.3 0.2 0.1 IsoLasso IsoInfer with TSS/PAS IsoInfer without TSS/PAS Cufflinks Scripture 0 20M 40M 60M 80M 100M Number of paired end reads Precision 1 0.9 0.8 0.7 0.6 0.5 0.4 IsoLasso 0.3 IsoInfer with TSS/PAS 0.2 IsoInfer without TSS/PAS 0.1 Cufflinks Scripture 0 20M 40M 60M 80M 100M Number of paired end reads Mouse transcriptome, junctions predicted by TopHat.

Outline 1 Background and Existing Work 2 Quadratic Program 3 Valid Isoforms 4 IsoInfer: The Basic Algorithm 5 An Improvement by Lasso Regression 6 Experimental Results and Comparison to Cufflinks and Scripture 7 Concluding remarks

Concluding remarks We presented two combinatorial approaches to infer mrna isoforms and estimate their expression levels from RNA-Seq data using convex quadratic programming, set cover and Lasso regression. The programs can be downloaded from: http://www.cs.ucr.edu/~jianxing/isoinfer.html http://www.cs.ucr.edu/~liw/isolasso.html The programs compete well against Cufflinks and Scripture in terms of accuracy and speed. The tools can be further improved by addressing practical issues including sequencing errors, nonuniformly distributed reads, multireads, etc 1 J. Feng, W. Li and T. Jiang. Inference of isoforms from short sequence reads (extended abstract). Proc. 14th RECOMB, April, 2010, pp. 138-157. 2 W. Li, J. Feng and T. Jiang. IsoLasso: A LASSO Regression Approach to RNA-Seq Based Transcriptome Assembly. Accepted by RECOMB 2011.