Inference of Isoforms from Short Sequence Reads

Inference of Isoforms from Short Sequence Reads Tao Jiang Department of Computer Science and Engineering University of California, Riverside Tsinghua University Joint work with Jianxing Feng and Wei Li

Outline 1 Background and Existing Work 2 Quadratic Program 3 Valid Isoforms 4 IsoInfer: The Basic Algorithm 5 An Improvement by Lasso Regression 6 Experimental Results and Comparison to Cufflinks and Scripture 7 Concluding remarks

Transcription, Splicing and Translation http://en.wikipedia.org/wiki/gene

Alternative Splicing

Alternative Splicing 5 3 Pre-mRNA Isoform 1

Alternative Splicing 5 3 Pre-mRNA Isoform 1 Intron retention Isoform 2

Alternative Splicing 5 3 Pre-mRNA Isoform 1 Intron retention Exon skipping Isoform 2 Isoform 3

Alternative Splicing 5 3 Pre-mRNA Isoform 1 Intron retention Exon skipping Alternate 3 Isoform 2 Isoform 3 Isoform 4

Alternative Splicing 5 3 Pre-mRNA Isoform 1 Intron retention Exon skipping Alternate 3 Alternate 5 Isoform 2 Isoform 3 Isoform 4 Isoform 5

Alternative Splicing 5 3 Pre-mRNA Isoform 1 Intron retention Exon skipping Alternate 3 Alternate 5 Mixed Isoform 2 Isoform 3 Isoform 4 Isoform 5 Isoform 6

Alternative Splicing 5 3 Pre-mRNA Isoform 1 Intron retention Exon skipping Alternate 3 Alternate 5 Mixed Isoform 2 Isoform 3 Isoform 4 Isoform 5 Isoform 6 Widely spread in human genome More than 92% multi-exon genes

Alternative Splicing An example: KLF6 gene in human chromosome 10, a tumor suppressor gene, includes 4 alternative splicing variants. Scale chr10: KLF6 KLF6 KLF6 KLF6 2 kb 3812000 3812500 3813000 3813500 3814000 3814500 3815000 3815500 3816000 3816500 3817000 RefSeq Genes Three of the four variants, if expressed, inactivate the tumor suppressor gene and are associated with increased prostate cancer risk (DiFeo et al. Cancer Res. 2005).

Experimental Methods For transcriptome analysis (isoform detection and sequencing, alternative splicing and expression level analysis) Traditional methods: EST (Expressed Sequence Tag) RACE (Rapid Amplification of cdna Ends) SAGE (Serial Analysis of Gene Expression) CAGE (Cap Analysis Gene Expression)... cost ineffective New methods: RACEArray (Djebali et al, Nat Methods, 2008) PCR+ deep-well pooling +sequencing (Salehi-Ashtiani et al, Nat Methods, 2008)... large scale? unclear

RNA-Seq Assumption: uniformly distributed along expressed segments

The Problem Single-end short reads (RNA-Seq) Paired-end short reads (RNA-Seq) Exon-intron boundaries (known or RNA-Seq) (Trapnell et al, Bioinformatics, 2009) TopHat (Au et al, NAR, 2010) SpliceMap TSS/PAS pairs (known or GIS-PET) (Ng et al, Nat Methods, 2005) (Fullwood et al, Genome Res, 2009) PAS profiling (3P-Seq) (Jan et al, Nature 2010)

Existing Work Expression level estimation RSAT. (Jiang & Wong, Bioinformatics. 2009) RSEM. (Li et al, Bioinformatics. 2009)...

Existing Work Expression level estimation RSAT. (Jiang & Wong, Bioinformatics. 2009) RSEM. (Li et al, Bioinformatics. 2009)... Isoform reconstruction Scripture. (Guttman et al, Nat Biotechnology. 2010.5) Uses weighting to filter out lowly expressed isoforms. Focuses on ful-length transcripts. Cufflinks. (Trapnell et al, Nat Biotechnology. 2010.5) Uses a minimal path cover algorithm to find a parsimonious solution to explain the read data.

Expressed Segments r i : # reads mapped to expressed segments s i. L-1 L-1 L-1 L-1 } Junction sequences }

Quadratic Program For a gene with isoforms F, expressed segments S and junctions J:

Quadratic Program For a gene with isoforms F, expressed segments S and junctions J: r i B(M,p i ), p i = C s i f,f F x fl i

Quadratic Program For a gene with isoforms F, expressed segments S and junctions J: r i B(M,p i ), p i = C s i f,f F x fl i Approximate it with N(µ i,σ 2 i ) µ i = Mp i,σ 2 i = Mp i (1 p i ) Mp i = µ i

Quadratic Program For a gene with isoforms F, expressed segments S and junctions J: r i B(M,p i ), p i = C s i f,f F x fl i Approximate it with N(µ i,σ 2 i ) µ i = Mp i,σ 2 i = Mp i (1 p i ) Mp i = µ i ǫ i = r i µ i, ǫ i σ i obeys N(0,1) approximately.

Quadratic Program min s.t. z = s i S J (ǫ i σ i ) 2 s i f,f F x fl i +ǫ i = d i, s i S J x f 0, f F

Quadratic Program min s.t. z = s i S J (ǫ i σ i ) 2 s i f,f F x fl i +ǫ i = d i, s i S J x f 0, f F Convex program

Quadratic Program min s.t. z = s i S J (ǫ i σ i ) 2 s i f,f F x fl i +ǫ i = d i, s i S J x f 0, f F Convex program If σ i is knownx f corresponds to the maximum likelihood estimation

Quadratic Program min s.t. z = s i S J (ǫ i σ i ) 2 s i f,f F x fl i +ǫ i = d i, s i S J x f 0, f F Convex program If σ i is knownx f corresponds to the maximum likelihood estimation Replace σ i with d i

Quadratic Program min s.t. z = s i S J (ǫ i σ i ) 2 s i f,f F x fl i +ǫ i = d i, s i S J x f 0, f F Convex program If σ i is knownx f corresponds to the maximum likelihood estimation Replace σ i with d i z χ 2 ( S + J )

Single-end Reads Theorem Under the uniform sampling assumption, the probability that an isoform consisting of t exons, with expression level α RPKM, has all its junctions covered by single-end reads is at least P = (1 e αlm/109 ) t 1 RPKM : Reads Per Kilobase of exon model per Million mapped reads.

Paired-end Reads Theorem The probability that there are no paired-end reads with start positions in the first interval and end positions in the third interval is upper bounded by P M,h,α (w 1,w 2,w 3 ) = (1 P 0 ) M e MP 0 where P 0 = 10 9 α 0 i<w 1 u(i) l(i) h(x)dx, l(i) = w 1 i +w 2, and u(i) = w 1 i +w 2 +w 3.

Valid Isoforms Informative pair Segment pair (s i,s j ),i < j, on isoform f is an informative pair if P M,h,α (l i +L 1,g i,j,l j +L 1) < 0.05, where g i,j = i<k<j l kf[k]

Valid Isoforms Informative pair Segment pair (s i,s j ),i < j, on isoform f is an informative pair if P M,h,α (l i +L 1,g i,j,l j +L 1) < 0.05, where g i,j = i<k<j l kf[k] Valid Isoform An isoform f is valid if: 1 All the exon-exon junctions are covered by short reads. 2 All the informative pairs are supported by short reads. α controls this condition. 3 The start-end expressed segment pair appears in the given start-end pair set. (i.e., the TSS/PAS pairs)

The Framework

Isoform Inference on Each Cluster 1 Enumerate all valid isoforms.

Isoform Inference on Each Cluster 1 Enumerate all valid isoforms. 2 Generate subinstances with each one of them focusing on a subset of expressed segments.

Isoform Inference on Each Cluster 1 Enumerate all valid isoforms. 2 Generate subinstances with each one of them focusing on a subset of expressed segments. 3 On each subinstance, enumerate all the subsets of valid isoforms, use QP to find the best one. 4 Combine the results of all the subinstances using set cover.

Lasso Model LASSO: Least Absolute Shrinkage and Selection Operator (Tibshirani. Journal of the Royal Statistical Society, 1996) minf(x) = ( d i x f ) 2 +λ x f (1) l i i s i f,f F f F s.t. x f 0,f F (2) which is equivalent to the following constrained form: minf(x) = ( d i x f ) 2 (3) l i i s i f,f F s.t. x f γ (4) f F x f 0,f F (5)

Lasso Model - Completeness minf(x) = ( d i x f ) 2 (6) l i i s i f,f F s.t. x f γ (7) f F x f 0,f F (8) x f I si f p,if s i has mapped read (9) f F

Expression Level Estimation 2 (A) 5 0 5 5 1.5 log(predicted FPKM) Count 2 10 1 0.5 0 0 0.1 0.1 1 1 10 10 100 Isoform expression levels (RPKM) >100 0 5 10 log(true RPKM) 2 (D) Cufflinks R =0.91679 10 5 0 5 5 (C) IsoInfer, R =0.8965 log(predicted RPKM) 2.5 0 2 (B) IsoLasso, R =0.89886 log(predicted RPKM) x 10 log(predicted score) 4 3 0 5 10 log(true RPKM) 10 5 0 5 5 0 5 10 log(true RPKM) 2 (E) Scripture R =0.50213 10 5 0 5 5 0 5 10 log(true RPKM) Figure: The distribution of simulated isoform expression levels (left), and the accuracy of expression level estimation for different methods (right), using 80M 75 2 paired-end reads. Mouse transcriptome is used.

Isoform Reconstruction - Single-end 0.45 0.4 0.35 0.3 IsoLasso IsoInfer without TSS/PAS Cufflinks Scripture 0.9 0.8 0.7 0.6 IsoLasso IsoInfer without TSS/PAS Cufflinks Scripture Sensitivity 0.25 0.2 Precision 0.5 0.4 0.15 0.3 0.1 0.2 0.05 0.1 0 10M 20M 40M 80M 160M Number of single end reads 0 10M 20M 40M 80M 160M Number of single end reads Figure: Sensitivity (left) and precision (right) on single-end reads

Isoform Reconstruction - Paired-end Sensitivity 0.6 0.5 0.4 0.3 0.2 0.1 IsoLasso IsoInfer without TSS/PAS Cufflinks Scripture 0 20M 40M 60M 80M 100M Number of paired end reads Precision 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 IsoLasso 0.2 IsoInfer without TSS/PAS 0.1 Cufflinks Scripture 0 20M 40M 60M 80M 100M Number of paired end reads Figure: Sensitivity (left) and precision (right) on paired-end reads

Isoform Reconstruction - Running time Running time (seconds) 18000 16000 14000 12000 10000 8000 6000 4000 2000 IsoLasso IsoInfer without TSS/PAS Cufflinks Scripture 0 20M 40M 60M 80M 100M Number of paired end reads

Isoform Reconstruction - Real data Figure: The number of known isoforms of mouse (A) and human (B), and the number of predicted isoforms of mouse (C) and human (D), assembled by IsoLasso, Cufflinks and Scripture.

Isoform Reconstruction - Real data Figure: An alternative 5 start isoform of gene Tmem70 in mouse C2C12 myoblast RNA-Seq data

With TSS/PAS information Sensitivity 0.6 0.5 0.4 0.3 0.2 0.1 IsoLasso IsoInfer with TSS/PAS IsoInfer without TSS/PAS Cufflinks Scripture 0 20M 40M 60M 80M 100M Number of paired end reads Precision 1 0.9 0.8 0.7 0.6 0.5 0.4 IsoLasso 0.3 IsoInfer with TSS/PAS 0.2 IsoInfer without TSS/PAS 0.1 Cufflinks Scripture 0 20M 40M 60M 80M 100M Number of paired end reads Mouse transcriptome, junctions predicted by TopHat.

Concluding remarks We presented two combinatorial approaches to infer mrna isoforms and estimate their expression levels from RNA-Seq data using convex quadratic programming, set cover and Lasso regression. The programs can be downloaded from: http://www.cs.ucr.edu/~jianxing/isoinfer.html http://www.cs.ucr.edu/~liw/isolasso.html The programs compete well against Cufflinks and Scripture in terms of accuracy and speed. The tools can be further improved by addressing practical issues including sequencing errors, nonuniformly distributed reads, multireads, etc 1 J. Feng, W. Li and T. Jiang. Inference of isoforms from short sequence reads (extended abstract). Proc. 14th RECOMB, April, 2010, pp. 138-157. 2 W. Li, J. Feng and T. Jiang. IsoLasso: A LASSO Regression Approach to RNA-Seq Based Transcriptome Assembly. Accepted by RECOMB 2011.