Mature microrna identification via the use of a Naive Bayes classifier

Similar documents
Prediction of micrornas and their targets

Prediction of novel microrna genes in cancer-associated genomic regions a combined computational and experimental approach

Cross species analysis of genomics data. Computational Prediction of mirnas and their targets

he micrornas of Caenorhabditis elegans (Lim et al. Genes & Development 2003)

Identification of mirnas in Eucalyptus globulus Plant by Computational Methods

Computational Prediction Of Micro-RNAs In Hepatitis B Virus Genome

STAT1 regulates microrna transcription in interferon γ stimulated HeLa cells

Computational Identification and Prediction of Tissue-Specific Alternative Splicing in H. Sapiens. Eric Van Nostrand CS229 Final Project

mirna Dr. S Hosseini-Asl

Supplementary Figure 1

MicroRNA in Cancer Karen Dybkær 2013

Research Article Base Composition Characteristics of Mammalian mirnas

CRS4 Seminar series. Inferring the functional role of micrornas from gene expression data CRS4. Biomedicine. Bioinformatics. Paolo Uva July 11, 2012

MicroRNA and Male Infertility: A Potential for Diagnosis

Small RNAs and how to analyze them using sequencing

38 Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'16

Consumer Review Analysis with Linear Regression

Comparing Multifunctionality and Association Information when Classifying Oncogenes and Tumor Suppressor Genes

High AU content: a signature of upregulated mirna in cardiac diseases

Improved annotation of C. elegans micrornas by deep sequencing reveals structures associated with processing by Drosha and Dicer

omiras: MicroRNA regulation of gene expression

Ch.20 Dynamic Cue Combination in Distributional Population Code Networks. Ka Yeon Kim Biopsychology

Gene-microRNA network module analysis for ovarian cancer

Improved Hepatic Fibrosis Grading Using Point Shear Wave Elastography and Machine Learning

MicroRNA dysregulation in cancer. Systems Plant Microbiology Hyun-Hee Lee

From reference genes to global mean normalization

Supplemental Figure 1. Small RNA size distribution from different soybean tissues.

Eukaryotic small RNA Small RNAseq data analysis for mirna identification

NMF-Density: NMF-Based Breast Density Classifier

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014

Prediction of novel precursor mirnas using a. context-sensitive hidden Markov model (CSHMM)

Transcriptome-wide analysis of microrna expression in the malaria mosquito Anopheles gambiae

Supplementary information

Biomarker adaptive designs in clinical trials

Bi 8 Lecture 17. interference. Ellen Rothenberg 1 March 2016

Overview Detection epileptic seizures

Long non-coding RNAs

Introduction to Computational Neuroscience

Chapter 2. Investigation into mir-346 Regulation of the nachr α5 Subunit

Information Processing During Transient Responses in the Crayfish Visual System

Multi Parametric Approach Using Fuzzification On Heart Disease Analysis Upasana Juneja #1, Deepti #2 *

Algorithms for Design-Automation - Mastering Nanoelectronic Systems Logic Diagnosis with improved resolution

A novel and universal method for microrna RT-qPCR data normalization

Sebastian Jaenicke. trnascan-se. Improved detection of trna genes in genomic sequences

CS229 Final Project Report. Predicting Epitopes for MHC Molecules

MICRORNAS (mirna) are one type of non-coding RNAs

RNA interference (RNAi)

Predicting Breast Cancer Survivability Rates

Circular RNAs (circrnas) act a stable mirna sponges

Obstacles and challenges in the analysis of microrna sequencing data

Stage-Specific Predictive Models for Cancer Survivability

Evaluating Classifiers for Disease Gene Discovery

Using AUC and Accuracy in Evaluating Learning Algorithms

Learning from data when all models are wrong

Classıfıcatıon of Dıabetes Dısease Usıng Backpropagatıon and Radıal Basıs Functıon Network

Supplementary Figure 1

Gene Selection for Tumor Classification Using Microarray Gene Expression Data

Chapter 7 Conclusions

Integrated network analysis reveals distinct regulatory roles of transcription factors and micrornas

A scored AUC Metric for Classifier Evaluation and Selection

Modeling Sentiment with Ridge Regression

RECAP (1)! In eukaryotes, large primary transcripts are processed to smaller, mature mrnas.! What was first evidence for this precursorproduct

A Quick-Start Guide for rseqdiff

RNA Secondary Structures: A Case Study on Viruses Bioinformatics Senior Project John Acampado Under the guidance of Dr. Jason Wang

Feature selection methods for early predictive biomarker discovery using untargeted metabolomic data

Profiling of the Exosomal Cargo of Bovine Milk Reveals the Presence of Immune- and Growthmodulatory Non-coding RNAs (ncrna)

Patnaik SK, et al. MicroRNAs to accurately histotype NSCLC biopsies

Bootstrapped Integrative Hypothesis Test, COPD-Lung Cancer Differentiation, and Joint mirnas Biomarkers

DSB. Double-Strand Breaks causate da radiazioni stress ossidativo farmaci

MicroRNAs and Cancer

Removal of Shelterin Reveals the Telomere End-Protection Problem

Computational Modeling of mirna Biogenesis

Sample Size Considerations. Todd Alonzo, PhD

Outlier Analysis. Lijun Zhang

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Midterm, 2016

Efficient AUC Optimization for Information Ranking Applications

7SK ChIRP-seq is specifically RNA dependent and conserved between mice and humans.

Supplementary Figure S1. Gene expression analysis of epidermal marker genes and TP63.

A DATA MINING APPROACH FOR PRECISE DIAGNOSIS OF DENGUE FEVER

MicroRNAs: novel regulators in skin research

Contents. Just Classifier? Rules. Rules: example. Classification Rule Generation for Bioinformatics. Rule Extraction from a trained network

V16: involvement of micrornas in GRNs

Small RNAs and how to analyze them using sequencing

Review: Logistic regression, Gaussian naïve Bayes, linear regression, and their connections

Simplify the relations between genes and diseases into typed binary relations. PubMed BioText GoPubMed

HALLA KABAT * Outreach Program, mircore, 2929 Plymouth Rd. Ann Arbor, MI 48105, USA LEO TUNKLE *

Risk-prediction modelling in cancer with multiple genomic data sets: a Bayesian variable selection approach

Regulation of Cardiac Cell Fate by micrornas: Implications for Heart Regeneration

COGNITIVE CONTROL IN MEDIA MULTITASKERS

Hybrid HMM and HCRF model for sequence classification

Mutation Profile to Predict Tumor Stage in Lung Adenocarcinoma

Bayesian Models for Combining Data Across Subjects and Studies in Predictive fmri Data Analysis

MicroRNAs, RNA Modifications, RNA Editing. Bora E. Baysal MD, PhD Oncology for Scientists Lecture Tue, Oct 17, 2017, 3:30 PM - 5:00 PM

ARTIFICIAL INTELLIGENCE FOR DIGITAL PATHOLOGY. Kyunghyun Paeng, Co-founder and Research Scientist, Lunit Inc.

MicroRNAs Modulate Hematopoietic Lineage Differentiation

Automatic Lung Cancer Detection Using Volumetric CT Imaging Features

A Statistical Framework for Classification of Tumor Type from microrna Data

Neural Coding. Computing and the Brain. How Is Information Coded in Networks of Spiking Neurons?

Machine Gaydar : Using Facebook Profiles to Predict Sexual Orientation

Transcriptome profiling of the developing male germ line identifies the mir-29 family as a global regulator during meiosis

Transcription:

Mature microrna identification via the use of a Naive Bayes classifier Master Thesis Gkirtzou Katerina Computer Science Department University of Crete 13/03/2009 Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 1 / 44

Outline 1 Introduction 2 Methodology Naive Bayes Classifier Datasets Feature Definition Feature Selection 3 Results Training the Naive Bayes Classifier Finding the Best Mature Candidate Comparison with Other Methods 4 Conclusions Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 2 / 44

Outline 1 Introduction 2 Methodology Naive Bayes Classifier Datasets Feature Definition Feature Selection 3 Results Training the Naive Bayes Classifier Finding the Best Mature Candidate Comparison with Other Methods 4 Conclusions Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 3 / 44

MicroRNAs Definition MicroRNAs (mirnas) are small single stranded RNAs, on average 22nt long, generated from endogenous hairpin shaped transcripts with post transcriptional activity. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 4 / 44

MicroRNAs Definition MicroRNAs (mirnas) are small single stranded RNAs, on average 22nt long, generated from endogenous hairpin shaped transcripts with post transcriptional activity. Significance MicroRNAs have been observed to participate in: Developmental timing. Cell proliferation and cell differentiation. Apoptosis. Diseases, such as diabetes and cancer. Anti viral defense. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 4 / 44

MicroRNA Biogenesis Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 5 / 44

Related Work Disadvantages Hypothesize that only a single mature is produced from every hairpin structure (BayesMiRNAfind 2006, ProMiR 2006) Hypothesize that pri-mirnas are always processed by the Drosha complex, whose cleavage cite determines the start position of the mature (Microprocessor SVM 2006). Mature candidate is provided only for the human precursors, which are expressed in specific cell lines (SSCprofiler 2009). Evaluation of performance is often measured in terms of true positive rate alone, ignoring the false positive rate (ProMiR 2006, BayesMiRNAfind 2006, Tao 2007, mircos 2007, SSCprofiler 2009) Distance distribution of predicted compared to true matures is not provided (BayesMiRNAfind 2006, Tao 2007, mircos 2007, SSCprofiler 2009). Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 6 / 44

Objectives Goal Build a classifier considering biological information of precursor mirna, such as sequence or secondary structure, capable of identifying the mature mirna(s) within a precursor mirna with high accuracy. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 7 / 44

Objectives Goal Build a classifier considering biological information of precursor mirna, such as sequence or secondary structure, capable of identifying the mature mirna(s) within a precursor mirna with high accuracy. Output The model s output is the predicted start position of the mature mirna(s) for each precursor sequence. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 7 / 44

Outline 1 Introduction 2 Methodology Naive Bayes Classifier Datasets Feature Definition Feature Selection 3 Results Training the Naive Bayes Classifier Finding the Best Mature Candidate Comparison with Other Methods 4 Conclusions Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 8 / 44

Outline 1 Introduction 2 Methodology Naive Bayes Classifier Datasets Feature Definition Feature Selection 3 Results Training the Naive Bayes Classifier Finding the Best Mature Candidate Comparison with Other Methods 4 Conclusions Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 9 / 44

Naive Bayes Classifier Advantages of Naive Bayes Achieves strong performance in many real problems, despite its simplified assumptions. Requires small amount of training data. Contribution of features is easily derived. Our decision surface P(C mature x) P(C non mature x) > λ Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 10 / 44

Outline 1 Introduction 2 Methodology Naive Bayes Classifier Datasets Feature Definition Feature Selection 3 Results Training the Naive Bayes Classifier Finding the Best Mature Candidate Comparison with Other Methods 4 Conclusions Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 11 / 44

Datasets Typical two class classification problem Positive data: experimental verified mature mirnas. Negative data: What is a non mature mirna? Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 12 / 44

Datasets Typical two class classification problem Positive data: experimental verified mature mirnas. Negative data: What is a non mature mirna? Observations Known mirna precursors do not produce multiple overlapping mature mirnas from the same arm of the foldback precursor. The search area of the classifier will be a mirna precursor sequence. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 12 / 44

Datasets Typical two class classification problem Positive data: experimental verified mature mirnas. Negative data: What is a non mature mirna? Observations Known mirna precursors do not produce multiple overlapping mature mirnas from the same arm of the foldback precursor. The search area of the classifier will be a mirna precursor sequence. Solution! Create negative data by sliding 1 base pair in both stem arms of the precursor with a window with size the same as the produced mature mirna. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 12 / 44

Produce Negative Data Negative Data Positive Data SP EP SP EP Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 13 / 44

Produce Negative Data Negative Data Positive Data SP EP SP EP 1 22 Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 13 / 44

Produce Negative Data Negative Data Positive Data SP EP SP EP 1 22 2 23 Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 13 / 44

Produce Negative Data Negative Data Positive Data SP EP SP EP 1 22 2 23 3 24 Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 13 / 44

Produce Negative Data Negative Data Positive Data SP EP SP EP 1 22 2 23 3 24.. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 13 / 44

Produce Negative Data Negative Data Positive Data SP EP SP EP 1 22 2 23 3 24.. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 13 / 44

Produce Negative Data Negative Data Positive Data SP EP SP EP 1 22 2 23 3 24.. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 13 / 44

Produce Negative Data Negative Data Positive Data SP EP SP EP 1 22 2 23 3 24.. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 13 / 44

Produce Negative Data Negative Data Positive Data SP EP SP EP 1 22 2 23 3 24.. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 13 / 44

Produce Negative Data Negative Data Positive Data SP EP SP EP 1 22 24 46 2 23 3 24.. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 13 / 44

Produce Negative Data Negative Data Positive Data SP EP SP EP 1 22 24 46 2 23 3 24.. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 13 / 44

Produce Negative Data Negative Data Positive Data SP EP SP EP 1 22 24 46 2 23 3 24.. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 13 / 44

Produce Negative Data Negative Data Positive Data SP EP SP EP 1 22 24 46 2 23 3 24.. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 13 / 44

Produce Negative Data Negative Data Positive Data SP EP SP EP 1 22 24 46 2 23 3 24.. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 13 / 44

Produce Negative Data Negative Data Positive Data SP EP SP EP 1 22 24 46 2 23 3 24.. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 13 / 44

Produce Negative Data Negative Data Positive Data SP EP SP EP 1 22 24 46 2 23 40 62 3 24.. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 13 / 44

Produce Negative Data Negative Data Positive Data SP EP SP EP 1 22 24 46 2 23 40 62 3 24.. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 13 / 44

Produce Negative Data Negative Data Positive Data SP EP SP EP 1 22 24 46 2 23 40 62 3 24.. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 13 / 44

Produce Negative Data Negative Data Positive Data SP EP SP EP 1 22 24 46 2 23 40 62 3 24.. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 13 / 44

Produce Negative Data Negative Data Positive Data SP EP SP EP 1 22 24 46 2 23 40 62 3 24.. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 13 / 44

Produce Negative Data Negative Data Positive Data SP EP SP EP 1 22 24 46 2 23 40 62 3 24.. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 13 / 44

Produce Negative Data Negative Data Positive Data SP EP SP EP 1 22 24 46 2 23 40 62 3 24.. 59 81 Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 13 / 44

Produce Negative Data Negative Data Positive Data SP EP SP EP 1 22 24 46 2 23 40 62 3 24.. 59 81 60 82 Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 13 / 44

Produce Negative Data Negative Data Positive Data SP EP SP EP 1 22 24 46 2 23 40 62 3 24.. 59 81 60 82 Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 13 / 44

Outline 1 Introduction 2 Methodology Naive Bayes Classifier Datasets Feature Definition Feature Selection 3 Results Training the Naive Bayes Classifier Finding the Best Mature Candidate Comparison with Other Methods 4 Conclusions Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 14 / 44

Position oriented features Example of a position oriented features Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 15 / 44

Position oriented features Example of a position oriented features Areas of position oriented features Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 15 / 44

Distance oriented features Distance of Starting Position Distance of Ending Position Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 16 / 44

Outline 1 Introduction 2 Methodology Naive Bayes Classifier Datasets Feature Definition Feature Selection 3 Results Training the Naive Bayes Classifier Finding the Best Mature Candidate Comparison with Other Methods 4 Conclusions Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 17 / 44

Feature Selection Feature Selection Ranking Method 1 For each feature estimate the probability mass functions in both positive and negative data. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 18 / 44

Feature Selection Feature Selection Ranking Method 1 For each feature estimate the probability mass functions in both positive and negative data. 2 Using the symmetric K-L divergence estimate a score for each feature. Symmetric Kullback Leibler divergence The divergence between the positive (P) and negative (N) probability distribution: where D KL (P N) = i P(i) log 2 Sym D KL = 1 2 (D KL(P N) + D KL (N P)) P(i) N(i) Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 18 / 44

Feature Selection Feature Selection Ranking Method 1 For each feature estimate the probability mass functions in both positive and negative data. 2 Using the symmetric K-L divergence estimate a score for each feature. 3 Rank features according to the K-L provided score. Symmetric Kullback Leibler divergence The divergence between the positive (P) and negative (N) probability distribution: where D KL (P N) = i P(i) log 2 Sym D KL = 1 2 (D KL(P N) + D KL (N P)) P(i) N(i) Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 18 / 44

Feature Selection Feature Selection Ranking Method 1 For each feature estimate the probability mass functions in both positive and negative data. 2 Using the symmetric K-L divergence estimate a score for each feature. 3 Rank features according to the K-L provided score. 4 Train the classifier using the top K features. Incoporate features gradually only if it helps increasing the performance of the classifier. Symmetric Kullback Leibler divergence The divergence between the positive (P) and negative (N) probability distribution: where D KL (P N) = i P(i) log 2 Sym D KL = 1 2 (D KL(P N) + D KL (N P)) P(i) N(i) Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 18 / 44

Outline 1 Introduction 2 Methodology Naive Bayes Classifier Datasets Feature Definition Feature Selection 3 Results Training the Naive Bayes Classifier Finding the Best Mature Candidate Comparison with Other Methods 4 Conclusions Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 19 / 44

Datasets Training Dataset Version 10.1 mirbase Organism Precursor True Mature Negative Mature Human 533 729 7290 Mouse 422 530 5300 Test Dataset Version 12 mirbase Organism Precursor True Mature Human 155 160 Mouse 45 48 Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 20 / 44

Implementation Specifications Extra Parameters: tune over a 10 fold cross validation The size of the flanking regions, N. The size of the scanning window, W. The number of features used in the classifier, K. The type of information the position oriented features hold. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 21 / 44

Implementation Specifications Extra Parameters: tune over a 10 fold cross validation The size of the flanking regions, N. The size of the scanning window, W. The number of features used in the classifier, K. The type of information the position oriented features hold. Evaluation Specification The validation sets consisted of true mirna precursors, whose mature mirnas were left out from training in the cross validation procedure. Candidates mature mirnas were produced by sliding 1 base pair in both stem arms of the precursor with a fixed size sliding window, W. Evaluation was estimated based on exact match of the starting position of the predicted compared to the real mature mirna. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 21 / 44

Outline 1 Introduction 2 Methodology Naive Bayes Classifier Datasets Feature Definition Feature Selection 3 Results Training the Naive Bayes Classifier Finding the Best Mature Candidate Comparison with Other Methods 4 Conclusions Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 22 / 44

Information contained in Position Oriented Features Classifier s Description Sensitivity Specificity Sequence Based Naive Bayes Classifiers 0nt flanking region 67.10% 55.10% 5nt flanking region 76.04% 53.34% 7nt flanking region 75.96% 53.20% 10nt flanking region 79.15% 47.01% 12nt flanking region 74.30% 51.33% Structure Based Naive Bayes Classifiers 0nt flanking region 65.70% 54.30% 5nt flanking region 76.34% 52.64% 7nt flanking region 77.85% 54.29% 10nt flanking region 81.01% 56.63% 12nt flanking region 79.89% 55.51% Combined Naive Bayes Classifiers 0nt flanking region 68.50% 62.50% 5nt flanking region 71.32% 65.34% 7nt flanking region 74.26% 66.46% 10nt flanking region 76.50% 65.61% 12nt flanking region 77.81% 64.14% Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 23 / 44

Information contained in Distance Oriented Features Distance Oriented Naive Bayes Classifiers AUC Distance oriented Window Window Window Window Features 18nt 20nt 22nt 24nt HS 0.8181 0.8155 0.8128 0.8147 HS-HE 0.7794 0.7914 0.8099 0.8100 HS-HE-ES 0.7621 0.7803 0.7787 0.7866 HS-HE-ES-EE 0.7587 0.7808 0.7875 0.7839 HS : the distance of the starting position of the mature mirna from the hairpin. HE : the distance of the ending position of the mature mirna from the hairpin. ES : the distance of the starting position of the mature mirna from the ends of the precursor. EE : the distance of the ending position of the mature mirna from the ends of the precursor. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 24 / 44

Searching the Optimun Naive Bayes Classifier HS and Position Oriented Naive Bayes Classifiers AUC Flanking Window Window Window Window Region 18nt 20nt 22nt 24nt 0nt 0.8629 0.8615 0.8621 0.8624 3nt 0.8671 0.8658 0.8675 0.8661 5nt 0.8597 0.8614 0.8662 0.8642 7nt 0.8592 0.8630 0.8716 0.8696 9nt 0.8599 0.8673 0.8771 0.8704 12nt 0.8585 0.8691 0.8745 0.8658 Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 25 / 44

The ROC Curve of the Best Naive Bayes Classifier Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 26 / 44

Outline 1 Introduction 2 Methodology Naive Bayes Classifier Datasets Feature Definition Feature Selection 3 Results Training the Naive Bayes Classifier Finding the Best Mature Candidate Comparison with Other Methods 4 Conclusions Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 27 / 44

Top Scorer in Cross Validation Distance from truth Percent 0 27.89% ±1 48.91% ±2 64.59% ±3 73.92% ±4 81.18% ±5 84.48% ±6 86.88% ±7 89.28% Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 28 / 44

Top Scorer in Test Data Distance from truth Percent 0 18.68% ±1 39.01% ±2 51.61% ±3 59.89% ±4 65.93% ±5 71.98% ±6 76.37% ±7 79.67% Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 29 / 44

MicroRNA Biogenesis Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 30 / 44

Top Scorer and its Duplex in Cross Validation Distance from truth Percent 0 22.89% ±1 48.97% ±2 64.35% ±3 74.71% ±4 82.17% ±5 85.87% ±6 87.83% ±7 90.30% Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 31 / 44

Top Scorer and its Duplex in Test Data Distance from truth Percent 0 14.98% ±1 37.68% ±2 49.76% ±3 61.35% ±4 69.57% ±5 74.40% ±6 78.74% ±7 80.68% Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 32 / 44

Outline 1 Introduction 2 Methodology Naive Bayes Classifier Datasets Feature Definition Feature Selection 3 Results Training the Naive Bayes Classifier Finding the Best Mature Candidate Comparison with Other Methods 4 Conclusions Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 33 / 44

Comparison with ProMiR Dataset Analysis Initial Dataset: 200 experimental human and mouse precursors. ProMiR predicted as precursors: 178/200. ProMiR predicted wrong stem for 78/178. Our Model predicted wrong stem for 94/178 if we consider as computational truth the top scorer of the precursor. Our Model predicted wrong stem for 0/178 if we consider as computational truth the top scorer of the precursor and its duplex. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 34 / 44

Distance Distributions for Correct Stem Prediction Distance from truth Percent ProMiR 0 7% ±1 12% ±2 23% ±3 28% ±4 36% ±5 49% ±6 55% ±7 65% Top Scorer of our Model Distance from truth Percent 0 12.05% ±1 34.94% ±2 45.78% ±3 56.63% ±4 67.47% ±5 72.29% ±6 79.52% ±7 83.13% Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 35 / 44

Distance Distributions for Correct Stem Prediction Distance from truth Percent ProMiR 0 7% ±1 12% ±2 23% ±3 28% ±4 36% ±5 49% ±6 55% ±7 65% Top Scorer and its Duplex of our Model Distance from truth Percent 0 14.59% ±1 39.46% ±2 51.35% ±3 63.24% ±4 71.35% ±5 75.14% ±6 80.00% ±7 81.62% Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 36 / 44

Comparison with BayesMiRNAfind Dataset Analysis Initial Dataset: 200 experimental human and mouse precursors. BayesMiRNAfind predicted as precursors: 101/200. BayesMiRNAfind predicted wrong stem for 45/101. Our Model predicted wrong stem for 53/101 if we consider as computational truth the top scorer of the precursor. Our Model predicted wrong stem for 0/101 if we consider as computational truth the top scorer of the precursor and its duplex. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 37 / 44

Distance Distributions for Correct Stem Prediction Distance from truth Percent BayesMiRNAfind 0 7.14% ±1 14.29% ±2 26.79% ±3 33.93% ±4 41.07% ±5 42.86% ±6 44.64% ±7 55.36% Top Scorer of our Model Distance from truth Percent 0 20.83% ±1 47.92% ±2 58.33% ±3 68.75% ±4 77.08% ±5 81.25% ±6 85.42% ±7 89.58% Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 38 / 44

Distance Distributions for Correct Stem Prediction Distance from truth Percent BayesMiRNAfind 0 7.14% ±1 14.29% ±2 26.79% ±3 33.93% ±4 41.07% ±5 42.86% ±6 44.64% ±7 55.36% Top Scorer and its Duplex of our Model Distance from truth Percent 0 19.63% ±1 51.40% ±2 62.62% ±3 72.90% ±4 78.50% ±5 80.37% ±6 83.18% ±7 85.05% Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 39 / 44

Outline 1 Introduction 2 Methodology Naive Bayes Classifier Datasets Feature Definition Feature Selection 3 Results Training the Naive Bayes Classifier Finding the Best Mature Candidate Comparison with Other Methods 4 Conclusions Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 40 / 44

Conclusions Innovations Feature Selection is based on Kullback Leibler divergence. Performance is estimated based on AUC, in comparison with other methods that their performance are estimated based on sensitivity. Provide distance distributions for true matures. Flexibility to select between top scorer per stem or top scorer and its duplex per precursor. Simple algorithm with quite strong performance. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 41 / 44

Conclusions Innovations Feature Selection is based on Kullback Leibler divergence. Performance is estimated based on AUC, in comparison with other methods that their performance are estimated based on sensitivity. Provide distance distributions for true matures. Flexibility to select between top scorer per stem or top scorer and its duplex per precursor. Simple algorithm with quite strong performance. Comparison Program Percent for ±4nt Program Percent for ±4nt ProMir 36.00% BayesMiRNA 41.07% Top Scorer 67.47% Top Scorer 77.08% Duplex 71.35% Duplex 78.50% Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 41 / 44

Conclusions Conclusion Our findings suggest that position specific sequence and structure information and the distance of the starting position from the hairpin combined with a simple Bayes classifier achieve a good performance on the challenging task of mature mirna identification. Future Work Examine different error costs per class. Use stronger classifier, such as support vector machines (SVM). Use as training input the mirna mirna duplex. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 42 / 44

Publications K. Gkirtzou, P. Tsakalides and P. Poirazi. Mature microrna identification via the use of a Naive Bayes classifier. In proceedings of BIBE, 2008. A. Oulas, A. Boutla, K. Gkirtzou, M. Reczko, K. Kalantidis and P. Poirazi. Prediction of novel microrna genes in cancer associated genomic regions a combined computational and experimental approach. In press, Nucleic Acids Research. K. Gkirtzou, P. Tsakalides and P. Poirazi. MatureFind: a tool for identifying mature mirnas in mammalians precursors. Manuscript in preparation. Gkirtzou K. (CSD UOC) Mature microrna identification 13/03/2009 43 / 44