Pre-mRNA Secondary Structure Prediction Aids Splice Site Recognition

Similar documents
Gene Expression: Details (Eukaryotes) Pre-mRNA Secondary Structure Prediction Aids Splice Site Recognition

PRE-mRNA SECONDARY STRUCTURE PREDICTION AIDS SPLICE SITE PREDICTION

Gene Finding in Eukaryotes

Computational Identification and Prediction of Tissue-Specific Alternative Splicing in H. Sapiens. Eric Van Nostrand CS229 Final Project

Genetics. Instructor: Dr. Jihad Abdallah Transcription of DNA

MODULE 3: TRANSCRIPTION PART II

Studying Alternative Splicing

MODULE 4: SPLICING. Removal of introns from messenger RNA by splicing

Sebastian Jaenicke. trnascan-se. Improved detection of trna genes in genomic sequences

Variant Classification. Author: Mike Thiesen, Golden Helix, Inc.

Annotation of Drosophila mojavensis fosmid 8 Priya Srikanth Bio 434W

SUPPLEMENTARY FIGURES: Supplementary Figure 1

SpliceDB: database of canonical and non-canonical mammalian splice sites

Annotation of Chimp Chunk 2-10 Jerome M Molleston 5/4/2009

RECAP (1)! In eukaryotes, large primary transcripts are processed to smaller, mature mrnas.! What was first evidence for this precursorproduct

Alternative Splicing and Genomic Stability

Prediction of Alternative Splice Sites in Human Genes

Gene finding. kuobin/

Bioinformatics Laboratory Exercise

White Paper Estimating Complex Phenotype Prevalence Using Predictive Models

Polyomaviridae. Spring

genomics for systems biology / ISB2020 RNA sequencing (RNA-seq)

RNA Processing in Eukaryotes *

Keywords Gene prediction, artificial neural network, donor splice site, acceptor splice site, Markov chain, fuzzy logic

Alternative RNA processing: Two examples of complex eukaryotic transcription units and the effect of mutations on expression of the encoded proteins.

Supplementary Figure S1. Gene expression analysis of epidermal marker genes and TP63.

1. Identify and characterize interesting phenomena! 2. Characterization should stimulate some questions/models! 3. Combine biochemistry and genetics

TITLE: The Role Of Alternative Splicing In Breast Cancer Progression

Mechanisms of alternative splicing regulation

Contents. Just Classifier? Rules. Rules: example. Classification Rule Generation for Bioinformatics. Rule Extraction from a trained network

Bioinformatics. Sequence Analysis: Part III. Pattern Searching and Gene Finding. Fran Lewitter, Ph.D. Head, Biocomputing Whitehead Institute

Lecture 8 Understanding Transcription RNA-seq analysis. Foundations of Computational Systems Biology David K. Gifford

Finding subtle mutations with the Shannon human mrna splicing pipeline

Figure 1: Final annotation map of Contig 9

Pre-mRNA has introns The splicing complex recognizes semiconserved sequences

Molecular Cell Biology - Problem Drill 10: Gene Expression in Eukaryotes

Hands-On Ten The BRCA1 Gene and Protein

Supervised Learner for the Prediction of Hi-C Interaction Counts and Determination of Influential Features. Tyler Yue Lab

Figure mouse globin mrna PRECURSOR RNA hybridized to cloned gene (genomic). mouse globin MATURE mrna hybridized to cloned gene (genomic).

Processing of RNA II Biochemistry 302. February 13, 2006

Applying One-vs-One and One-vs-All Classifiers in k-nearest Neighbour Method and Support Vector Machines to an Otoneurological Multi-Class Problem

Expert-guided Visual Exploration (EVE) for patient stratification. Hamid Bolouri, Lue-Ping Zhao, Eric C. Holland

Global regulation of alternative splicing by adenosine deaminase acting on RNA (ADAR)

Long non-coding RNAs

GENOME-WIDE COMPUTATIONAL ANALYSIS OF SMALL NUCLEAR RNA GENES OF ORYZA SATIVA (INDICA AND JAPONICA)

Processing of RNA II Biochemistry 302. February 18, 2004 Bob Kelm

Molecular Biology (BIOL 4320) Exam #2 April 22, 2002

Transcriptional control in Eukaryotes: (chapter 13 pp276) Chromatin structure affects gene expression. Chromatin Array of nuc

Supplemental Figure S1. Expression of Cirbp mrna in mouse tissues and NIH3T3 cells.

Circular RNAs (circrnas) act a stable mirna sponges

TRANSCRIPTION. DNA à mrna

Ambient temperature regulated flowering time

Processing of RNA II Biochemistry 302. February 14, 2005 Bob Kelm

Recognition of HIV-1 subtypes and antiretroviral drug resistance using weightless neural networks

Computational Analysis of UHT Sequences Histone modifications, CAGE, RNA-Seq

DNA codes for RNA, which guides protein synthesis.

Supplementary Figure 1. CFTR protein structure and domain architecture.

A Machine Learning Model for Discovery of Protein Isoforms as Biomarkers

Epigenetics. Jenny van Dongen Vrije Universiteit (VU) Amsterdam Boulder, Friday march 10, 2017

An Introduction to Genetics. 9.1 An Introduction to Genetics. An Introduction to Genetics. An Introduction to Genetics. DNA Deoxyribonucleic acid

Nature Neuroscience: doi: /nn Supplementary Figure 1. Behavioral training.

CH 9: The Cell Cycle Overview. Cellular Organization of the Genetic Material. Distribution of Chromosomes During Eukaryotic Cell Division

Biology. Lectures winter term st year of Pharmacy study

REGULATED AND NONCANONICAL SPLICING

EMOTION CLASSIFICATION: HOW DOES AN AUTOMATED SYSTEM COMPARE TO NAÏVE HUMAN CODERS?

Evaluating Classifiers for Disease Gene Discovery

Mechanism of splicing

Identification and characterization of multiple splice variants of Cdc2-like kinase 4 (Clk4)

Review: Logistic regression, Gaussian naïve Bayes, linear regression, and their connections

Prediction and Statistical Analysis of Alternatively Spliced Exons

Introduction retroposon

Cross species analysis of genomics data. Computational Prediction of mirnas and their targets

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014

Protein Synthesis

Comparing Multifunctionality and Association Information when Classifying Oncogenes and Tumor Suppressor Genes

Regulation of Gene Expression in Eukaryotes

(A) Cell membrane (B) Ribosome (C) DNA (D) Nucleus (E) Plasmids. A. Incorrect! Both prokaryotic and eukaryotic cells have cell membranes.

Last time we talked about the few steps in viral replication cycle and the un-coating stage:

Colon cancer subtypes from gene expression data

Extensive Search for Discriminative Features of Alternative Splicing. H. Sakai and O. Maruyama. Pacific Symposium on Biocomputing 9:54-65(2004)

BIOINFORMATICS ANALYSES OF ALTERNATIVE SPLICING, EST-BASED AND MACHINE LEARNING-BASED PREDICTION JING XIA

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Midterm, 2016

L I F E S C I E N C E S

Beta Thalassemia Case Study Introduction to Bioinformatics

Prediction of novel precursor mirnas using a. context-sensitive hidden Markov model (CSHMM)

Overview: Conducting the Genetic Orchestra Prokaryotes and eukaryotes alter gene expression in response to their changing environment

Supplemental Figure 1. Small RNA size distribution from different soybean tissues.

Inference of Isoforms from Short Sequence Reads

Computational Biology I LSM5191

BWA alignment to reference transcriptome and genome. Convert transcriptome mappings back to genome space

An Analysis of MDM4 Alternative Splicing and Effects Across Cancer Cell Lines

Dr Rick Tearle Senior Applications Specialist, EMEA Complete Genomics Complete Genomics, Inc.

DEVELOPING BIOINFORMATICS TOOLS FOR THE STUDY OF ALTERNATIVE SPLICING IN EUKARYOTIC GENES LIM YUN PING

Multifactorial Interplay Controls the Splicing Profile of Alu-Derived Exons

Breeding scheme, transgenes, histological analysis and site distribution of SB-mutagenized osteosarcoma.

Comparative analysis detects dependencies among the 5 splice-site positions

Insulin mrna to Protein Kit

38 Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'16

Data complexity measures for analyzing the effect of SMOTE over microarrays

DNA-seq Bioinformatics Analysis: Copy Number Variation

Transcription:

Pre-mRNA Secondary Structure Prediction Aids Splice Site Recognition Donald J. Patterson, Ken Yasuhara, Walter L. Ruzzo January 3-7, 2002 Pacific Symposium on Biocomputing University of Washington Computational Molecular Biology Group 1

Gene Expression: Details (Eukaryotes) DNA pre-mrna mrna Protein nucleus gene Protein cell DNA (chromosome) premrna mrna 3

Architecture of a Gene pre-mrna s transcribed from most genes contain introns, which must be spliced out to form useful mrnas Exons: 1 2 3 4 Introns: a b c Pre-mRNA mrna 1 2 3 4 Encodes a protein 7

Characteristics of human genes (Nature, 2/2001, Table 21) Median Mean Sample (size) Internal exon 122 bp 145 bp RefSeq alignments to draft genome sequence, with confirmed intron boundaries (43,317 exons) Exon number 7 8.8 RefSeq alignments to finished sequence (3,501 genes) Introns 1,023 bp 3,365 bp RefSeq alignments to finished sequence (27,238 introns) 3' UTR 400 bp 770 bp Confirmed by mrna or EST on chromo 22 (689) 5' UTR 240 bp 300 bp Confirmed by mrna or EST on chromo 22 (463) Coding seq 1,100 bp 1340bp Selected RefSeq entries (1,804)* (CDS) 367 aa 447 aa Genomic extent 14 kb 27 kb Selected RefSeq entries (1,804)* * 1,804 selected RefSeq entries were those with fulllength unambiguous alignment to finished sequence 8

Cell, Vol. 92, 315 326, February 6, 1998 Mechanical Devices of the Spliceosome: Motors, Clocks, Springs, and Things Jonathan P. Staley and Christine Guthrie 9

Relevance of Splice Prediction Splice site prediction is critical to eukaryotic gene prediction. Average human gene has 8.8 exons Genes with over 175 exons known Current primary sequence models do not display the same discriminatory power that cells exhibit in vivo Small per-site error rate compounds 10

Hypothesis Secondary structure contains information useful for predicting splice site location. This information is in addition to primary sequence information. Specific instances of secondary structure variation affecting the splicing process. 11

Possible acceptor splice sites Pre-mRNA sequences Secondary Structure Prediction Primary Sequence Model (MFOLD) (WAM) Summary Statistics Secondary Structure Predictions Summary Statistics Threshold Classifier Machine Learner 12

Possible acceptor splice sites Pre-mRNA sequences Secondary Structure Prediction Primary Sequence Model (MFOLD) (WAM) Summary Statistics Secondary Structure Predictions Summary Statistics Threshold Classifier Machine Learner 13

Data Set Drawn from 462 unrelated, annotated, multiexon human genes with standard splicing. (Reese 97) 1,980 acceptor splice sites (3 end of intron) 1,980 non-sites selected randomly Aligned to an AG consensus Located within 100 bases of an annotated acceptor splice site. 14

Possible acceptor splice sites Pre-mRNA sequences Secondary Structure Prediction Primary Sequence Model (MFOLD) (WAM) Summary Statistics Secondary Structure Predictions Summary Statistics Threshold Classifier Machine Learner 15

What's in the Primary Sequence? exon 5 intron exon 16

What's in the Primary Sequence? -4-3 -2-1 +1 +2 +3 A 22 4 100 0 25 25 27 C 33 74 0 0 13 21 27 G 22 0 0 100 52 22 24 T 22 21 0 0 9 32 23 intron exon acceptor splice site Weight Matrix Model (0 th order Markov Model) 17

Sequence-based Metric 1 st order Weight Array Matrix (WAM) / Markov Model P i (N i ={A,C,G,U} N i-1 ={A,C,G,U} ) Training Generate two conditional probability tables for positions ( 21,+3), one from positive examples and one from negative examples. Testing For each sequence, x, calculate its likelihood ratio: & log 10 $ % P P + WAM ' WAM ( x) # ( x)!! " 18

Possible acceptor splice sites Pre-mRNA sequences Secondary Secondary Structure Sequence Prediction Model Primary Sequence Model (MFOLD) (WAM) Summary Statistics Secondary Structure Predictions Summary Statistics Threshold Classifier Machine Learner 19

Acceptor Splice Site Secondary Structure 0 100 20

Possible acceptor splice sites Pre-mRNA sequences Secondary Structure Prediction Primary Sequence Model (MFOLD) (WAM) Summary Statistics Secondary Structure Predictions Summary Statistics Threshold Classifier Machine Learner 21

Secondary Structure Statistics Optimal Folding Energy Max Helix score Neighbor Pairing Correlation Model 22

1. Optimal Folding Energy...CUGCUUUCUCCCCUCUCAGGGACUUACAGUUUGAGAUGC... Secondary Sequence Prediction (MFOLD) Free Energy Free Energy Free Energy -35.2 kcal/mole -34.0 kcal/mole -2.0 kcal/mole 23

2. Max Helix What is the highest probability that a helix will form nearby? Calculate Calculate P HStart, x P HEnd, x Helix MaxHelix i = max ( PHStart, x, PHEnd, x" ( i! 5, i+ 5) x ) 24

3. Neighbor Pairing Correlation Model Change the premrna alphabet from nucleotides to structural symbols O P S Unpaired base Paired base Paired and stacked base 25

3. Neighbor Pairing Correlation Model O O O O O P S S P O O P S S P Change the premrna alphabet from nucleotides to structural symbols O P S Unpaired base Paired base Paired and stacked base 26

3. Neighbor Pairing Correlation Model 2 nd order Markov Model P i (N i ={O,P,S} N i-1 ={O,P,S} ^ N i-2 ={O,P,S} ) Training Generate two conditional probability tables for positions ( 50,+3), one from positive examples and one from negative examples. Testing For each sequence, x, calculate its log likelihood ratio: & log 10 $ % P P + NPCM ' NPCM ( x) # ( x)!! " 27

Possible acceptor splice sites Pre-mRNA sequences Secondary Structure Prediction Primary Sequence Model (MFOLD) (WAM) Summary Statistics Secondary Structure Predictions Summary Statistics Threshold Classifier Machine Learner 28

Machine Learners Decision Trees Quinlan s C4.5 Support Vector Machines Noble s svm 1.1 Radial Basis Kernel degree 2 Both take a vector of statistics and produce a yes/no binary classifier. 29

Possible acceptor splice sites Pre-mRNA sequences Secondary Structure Prediction Primary Sequence Model (MFOLD) (WAM) Summary Statistics Secondary Structure Predictions Summary Statistics Threshold Classifier Machine Learner 31

Features WAM (baseline) WAM,OFE WAM,OFE,NPCM WAM,OFE,MH WAM,OFE,NPCM,MH Results (Decision Trees) Mean Accuracy (%) 92.73 93.13 93.16 93.21 93.13 % Error Reduction 5.5 5.9 6.6 5.5 p 0.066 0.022 0.009 0.016 WAM = Weight Array Matrix (Primary Sequence Method) OFE = Optimal Free Energy MH = Max Helix NPCM = Neighbor Pairing Correlation Matrix Wilcoxon p-value under 10-fold cross-validation 32

LLR of Base Pairing 25% more likely for acceptor splice sites to pair at position -2 33

LLR of Helix Initiation 35% more likely for acceptor splice sites to initiate a helix at position 2 and -1 34

LLR of Helix Results Continuation 45% more likely for acceptor splice sites to continue a helix through the splice site. 35

36

37

Helix Formed at Splice Site Acceptor Non-Acceptor Pr(No Helix) 0.37 0.48 Pr(Helix) 0.63 0.52 Pr(Folds Left) 0.35 0.26 Pr(Folds Right) 0.28 0.26 38

Conclusions Secondary structure statistics correlate with splice site location. Our models (Max Helix, NPCM) can represent some of the relevant secondary structure. These models capture correlations that current primary sequence models don t capture. 39

Future Work Other organisms Oryza sativa (rice) in progress Donor splice sites Other features? More structure models Stochastic Context Free Grammars? 40

Acknowledgements Don Paterson Ken Yasuhara Jeff Stoner Kevin Chu More Info http://www.cs.washington.edu/homes/ruzzo UW CSE Computational Biology Group 41