LIPOPREDICT: Bacterial lipoprotein prediction server

Similar documents
A HLA-DRB supertype chart with potential overlapping peptide binding function

CS229 Final Project Report. Predicting Epitopes for MHC Molecules

Palindromes drive the re-assortment in Influenza A

38 Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'16

Bioinformation Volume 5

Prediction of temperature factors from protein sequence

Bioinformation by Biomedical Informatics Publishing Group

User Guide. Protein Clpper. Statistical scoring of protease cleavage sites. 1. Introduction Protein Clpper Analysis Procedure...

High AU content: a signature of upregulated mirna in cardiac diseases

Data mining with Ensembl Biomart. Stéphanie Le Gras

Evaluating Classifiers for Disease Gene Discovery

SEQUENCE FEATURE VARIANT TYPES

Comparative Analysis of Machine Learning Algorithms for Chronic Kidney Disease Detection using Weka

FurinDB: A Database of 20-Residue Furin Cleavage Site Motifs, Substrates and Their Associated Drugs

OncoPPi Portal A Cancer Protein Interaction Network to Inform Therapeutic Strategies

Mal-Lys: prediction of lysine malonylation sites in proteins integrated sequence-based features with mrmr feature selection

Prediction of Post-translational modification sites and Multi-domain protein structure

Antigenic Epitope Prediction towards SARS Corona Virus using Bioinformatics Analysis *1

Identification and characterization of merozoite surface protein 1 epitope

Deciphering the Role of micrornas in BRD4-NUT Fusion Gene Induced NUT Midline Carcinoma

Module 3. Genomic data and annotations in public databases Exercises Custom sequence annotation

Data Mining in Bioinformatics Day 9: String & Text Mining in Bioinformatics

Selection of epitope-based vaccine targets of HCV genotype 1 of Asian origin: a systematic in silico approach

A two-step drug repositioning method based on a protein-protein interaction network of genes shared by two diseases and the similarity of drugs

Bioinformatics based analysis of Type III secretion system effector protein of Vibrio vulnificus

Rajesh Kannangai Phone: ; Fax: ; *Corresponding author

a. From the grey navigation bar, mouse over Analyze & Visualize and click Annotate Nucleotide Sequences.

ECG Beat Recognition using Principal Components Analysis and Artificial Neural Network

Contents. Just Classifier? Rules. Rules: example. Classification Rule Generation for Bioinformatics. Rule Extraction from a trained network

PROTOCOL FOR INFLUENZA A VIRUS GLOBAL SWINE H1 CLADE CLASSIFICATION

MetaMHC: a meta approach to predict peptides binding to MHC molecules

Identification of mirnas in Eucalyptus globulus Plant by Computational Methods

The Immune Epitope Database Analysis Resource: MHC class I peptide binding predictions. Edita Karosiene, Ph.D.

An Efficient Diseases Classifier based on Microarray Datasets using Clustering ANOVA Extreme Learning Machine (CAELM)

International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 1, Jan Feb 2017

Using SuSPect to Predict the Phenotypic Effects of Missense Variants. Chris Yates UCL Cancer Institute

The role of Ca² + ions in the complex assembling of protein Z and Z-dependent protease inhibitor: A structure and dynamics investigation

Contents. 2 Statistics Static reference method Sampling reference set Statistics Sampling Types...

Identification of N-Glycosylation Sites with Sequence and Structural Features Employing Random Forests

Towards Personalized Medicine: An Improved De Novo Assembly Procedure for Early Detection of Drug Resistant HIV Minor Quasispecies in Patient Samples

Molecular docking analysis of UniProtKB nitrate reductase enzyme with known natural flavonoids

The Influenza Virus Resource at the National Center for Biotechnology Information. Running Title: NCBI Influenza Virus Resource ACCEPTED

Identifying Relevant micrornas in Bladder Cancer using Multi-Task Learning

SMPD 287 Spring 2015 Bioinformatics in Medical Product Development. Final Examination

An Improved Algorithm To Predict Recurrence Of Breast Cancer

Workshop on Analysis and prediction of contacts in proteins

Sebastian Jaenicke. trnascan-se. Improved detection of trna genes in genomic sequences

Binary Classification of Cognitive Disorders: Investigation on the Effects of Protein Sequence Properties in Alzheimer s and Parkinson s Disease

Characterization of Lovastatin biosynthetic cluster proteins in Aspergillus terreus strain ATCC 20542

Differences in structural elements of Bcr-Abl oncoprotein isoforms in Chronic Myelogenous Leukemia

Comparative Molecular Docking Studies with ABCC1 and Aquaporin 9 in the Arsenite Complex Efflux

ISR Process for Internal Service Providers

Supplementary Information

User Guide. Association analysis. Input

Gene Selection for Tumor Classification Using Microarray Gene Expression Data

Investigating the performance of a CAD x scheme for mammography in specific BIRADS categories

CoCoLysis: A Web-Accessible Coiled-Coil Protein Database with Analysis Tools

Confluence: Conformity Influence in Large Social Networks

Section B. Comparative Genomics Analysis of Influenza H5N2 Viruses. Objective

Influenza Virus HA Subtype Numbering Conversion Tool and the Identification of Candidate Cross-Reactive Immune Epitopes

Student Handout Bioinformatics

ANALYSIS AND CLASSIFICATION OF EEG SIGNALS. A Dissertation Submitted by. Siuly. Doctor of Philosophy

Appendix 81. From OPENFLU to OPENFMD. Open Session of the EuFMD: 2012, Jerez de la Frontera, Spain 1. Conclusions and recommendations

ABSTRACT I. INTRODUCTION. Mohd Thousif Ahemad TSKC Faculty Nagarjuna Govt. College(A) Nalgonda, Telangana, India

Detection of Lung Cancer Using Backpropagation Neural Networks and Genetic Algorithm

Keeping Abreast of Breast Imagers: Radiology Pathology Correlation for the Rest of Us

Knowledge networks of biological and medical data An exhaustive and flexible solution to model life sciences domains

Data Mining in Bioinformatics Day 4: Text Mining

Automatic Lung Cancer Detection Using Volumetric CT Imaging Features

AUTOMATIC BRAIN TUMOR DETECTION AND CLASSIFICATION USING SVM CLASSIFIER

Classification and Statistical Analysis of Auditory FMRI Data Using Linear Discriminative Analysis and Quadratic Discriminative Analysis

Classification of normal and abnormal images of lung cancer

CEU MASS MEDIATOR USER'S MANUAL Version 2.0, 31 st July 2017

Top 10 Tips for Successful Searching ASMS 2003

Automatic Classification of Perceived Gender from Facial Images

SUPPLEMENTARY MATERIAL S-1 INTREPID VARIANTS

Variable Features Selection for Classification of Medical Data using SVM

Rohit Miri Asst. Professor Department of Computer Science & Engineering Dr. C.V. Raman Institute of Science & Technology Bilaspur, India

UPF Bioinformatics course projects

Genomes2Drugs: Identifies Target Proteins and Lead Drugs from Proteome Data

Predicting Breast Cancer Recurrence Using Machine Learning Techniques

Types of Modifications

Plasmodium and host glutathione reductase: molecular function and biological process

MRI Image Processing Operations for Brain Tumor Detection

Bacterial Secreted Proteins: Secretory Mechanisms And Role In Pathogenesis READ ONLINE

ParkDiag: A Tool to Predict Parkinson Disease using Data Mining Techniques from Voice Data

Cholera host response gene networks: preliminary studies

Integrated Analysis of Copy Number and Gene Expression

Analysis of Classification Algorithms towards Breast Tissue Data Set

Clinical Decision Support for Immunizations as a Community-drive, Standards-based Activity

Lipid annotation with MS2Analyzer. Yan Ma 10/24/2013

Computational detection of allergenic proteins attains a new level of accuracy with in silico variable-length peptide extraction and machine learning

Assigning B cell Maturity in Pediatric Leukemia Gabi Fragiadakis 1, Jamie Irvine 2 1 Microbiology and Immunology, 2 Computer Science

Release Notes. Medtech32. Version Build 3347 Update HbA1c Laboratory Data Format Change Advanced Forms Licensing Renewal (September 2011)

Predicting clinical outcomes in neuroblastoma with genomic data integration

A Graphic User Interface for the Glyco-MGrid Portal System for Collaboration among Scientists

Web-based tools for Bioinformatics; A (free) introduction to (freely available) NCBI, MUSC and Worldwide.

TABLE OF CONTENTS CHAPTER NO. TITLE PAGE NO. ABSTRACT

A Review on Arrhythmia Detection Using ECG Signal

In silico identification of mirnas and their targets from the expressed sequence tags of Raphanus sativus

Transcription:

www.bioinformation.net Server Volume 8(8) LIPOPREDICT: Bacterial lipoprotein prediction server S Ramya Kumari, Kiran Kadam, Ritesh Badwaik & Valadi K Jayaraman* Centre for Development of Advanced Computing (C-DAC), Pune University Campus, Ganeshkind, Pune-411 007, India; Valadi K Jayaraman Email: jayaramanv@cdac.in; *Corresponding author Received April 11, 2012; Accepted April 16, 2012; Published April 30, 2012 Abstract: Bacterial lipoproteins have many important functions owing to their essential nature and roles in pathogenesis and represent a class of possible vaccine candidates. The prediction of bacterial lipoproteins from sequence is thus an important task for computational vaccinology. A Support Vector Machines (SVM) based module for predicting bacterial lipoproteins, LIPOPREDICT, has been developed. The best performing sequence model were generated using selected dipeptide composition, which gave 97% accuracy of prediction. The results obtained were compared very well with those of previously developed methods. Availability: A web server for bacterial lipoprotein prediction available at www.lipopredict.cdac.in Keywords: Bacterial lipoproteins, Support Vector Machine (SVM), compositional features, prediction server Background: Bacterial lipoproteins have many important functions owing to their essential nature and roles in pathogenesis representing a class of possible vaccine candidates. They are functionally diverse class of membrane-anchored proteins that typically represent approximately 2% of the bacterial proteome [1]. They consist of a large group of proteins and perform many different functions: promote antibiotic resistance, cell signaling and substrate binding in ABC transport systems, protein export, sporulation, germination, bacterial conjugation, and many others are yet to be assigned a function [2]. Lipoproteins are required for virulence in many bacteria. They perform variety of roles in host-pathogen interaction, from surface adhesion and initiation of inflammatory processes through translocation of virulence factors into the host cytoplasm [3]. Several methods have been devised in literature to predict bacterial lipoproteins, using different approaches and data sets. Identification of Gram-positive bacterial lipoproteins has resulted in various servers and databases such as DOLOP [4], LIPO [5], PSORT [6] and ScanProsite [7]. LipoP [8] and Phobius [9] use hidden Markov model. LipPred [10] uses Naive-Bayesian network and SPEPLip [11] uses neural network. LipoP [8] identification of Gram- negative bacterial lipoproteins uses pattern matching methodology. In this work, we present a SVM based method using amino acid composition to identify bacterial lipoproteins. Methodology: Bacterial Lipoprotein Dataset The dataset of bacterial lipoproteins consists of experimentally annotated 222 sequences. It has been derived from distinct bacterial lipoproteins available in the DOLOP [4] database. Bacterial Non Lipoprotein Dataset 222 bacterial non lipoprotein sequences which were obtained from various databases such as NCBI [12], UNIPROT [13] were used for the construction of the dataset. Bioinformation 8(8): 394-398 (2012) 394 2012 Biomedical Informatics

Both the datasets were compiled after performing CD-HIT. The program CD-HIT (Cluster Database at High Identity with Tolerance) [14] [15] removes homologous sequences by clustering the protein dataset at user-defined sequence identity thresholds. Here we employed multiple CD-HIT runs; for example 90%, and then 60% and then 50% generating more efficient non-redundant datasets. Figure 1: Snapshot of the index page of LIPOPREDICT server. Server Implementation The prediction method described in this paper is implemented in the form of a web-server LIPOPREDICT: Bacterial Lipoprotein prediction server (Figure 1). Bacterial lipoproteins are a diverse and functionally important group of proteins that are amenable to bioinformatic analyses because of their unique signal peptide features. They are characterized by the presence of a signal peptide in their N- terminus, followed by presence of a specific cysteine residue [16]. The lipidation motif, represented in PROSITE by the regular expression DERK (6) [LIVMFW STAG] (2) [LIVMFYSTAGCQ] [AGS] C(PS00013), is present in both Gram- positive and Gram-negative bacteria. Signal peptides of bacterial lipoproteins possess many distinctive physio chemical features, along with the presence or absence of specific amino acids [17]. G. von Heijne [18] showed that considerable structural and compositional differences exist between signal peptides of bacterial lipoproteins and bacterial non lipoproteins. The two classes differ from each other in terms of the physio-chemical properties like charge, hydrophobicity, secondary structure propensities and amino acid size. All those differences in signal peptides from lipoproteins and non-lipoproteins can be attributed to the amino acid composition of the signal peptide. Hence compositional features like amino acid and dipeptide composition can be employed for discriminating these signal peptides, which in turn will result in differentiating bacterial lipoproteins and non-lipoproteins. In our work, we analyzed the amino acid sequence of the signal peptides of lipoproteins and bacterial non lipoproteins. The average length of signal peptides in bacteria ranges from 24 amino acids for Gram-negatives and 32 amino acids for Gram- Bioinformation 8(8): 394-398 (2012) 395 2012 Biomedical Informatics

positives [19]. Considering few variations in the lengths, we selected first 35 residues for the analysis. Server description: Prediction Input Interface Users can click on the prediction icon and the prediction interface displays various input type option. In which user can either type or paste sequence (Figure 2) or submit the file using the option upload file (Figure 3). Submitting sequence/s must be in FASTA format, on submission the input sequence is validated and if invalid the errors are reported to the user to rectify the problem. Prediction Output Interface When the run prediction button is clicked, the users are directed to the results page, best model (Selected Dipeptide Composition) with highest cross validation were created and used for prediction. Compositional feature model selected dipeptide is been used to generate prediction results. Support vector machines (SVMs) with probability estimates are calculated and the prediction result with probability estimate of the sequence belonging to the respective class is displayed in the result page. Result page also gives an option to download the prediction results for further use. Discussion: SVM kernel types and kernel parameters were tuned based on 10 fold cross validation accuracy as performance measure. The results are tabulated in Table 1 (see supplementary material). Bacterial lipoprotein prediction problem with feature selection gave the highest accuracy of 97% with selected dipeptide composition feature. We employed information gain feature selection using WEKA software [20] with the view to extracting the subsets of informative features. With feature selection the maximum accuracy increased for selected dipeptide, so we employed 67 selected dipeptide composition features as SVM domain feature input for building the model. Figure 2: Snapshot of query prediction page Type or Paste Sequence. Conclusions: In this study we have presented a prediction server, LIPOPREDICT for identification and classification of bacterial lipoproteins. The server employs Support Vector Machines supervisory learning model, which is rigorously based on statistical learning theory. For prediction of bacterial lipoprotein, selection of most informative features in dipeptide composition further improved model accuracy. Our results indicate that this SVM model can be employed for accurate prediction of bacterial lipoproteins. The prediction models include probability measures in the output, so it can be used to assess the confidence of SVM predictions. Further, our user friendly web server can be readily used for annotation of novel proteins. Acknowledgement: Dr. VKJ gratefully thanks CSIR, New Delhi for support in the form of Emeritus Scientist Grant. The authors also thank NPSF group members of CDAC for help and support for hosting of web server. Figure 3: Snapshot of query prediction page Upload File. References: [1] Hutchings MI et al. Trends Microbiol. 2009 17: 13 [PMID: 19059780] [2] Sutcliffe IC et al. Microbiology. 2002 148: 2065 [PMID: 12101295] [3] Kovacs-Simon A et al. Infect Immun. 2011 79: 548 [PMID: 20974828] [4] Madan Babu M & Sankaran K, Bioinformatics. 2002 18: 641 [PMID: 12016064] [5] Berven FS et al. Arch Microbiol. 2006 184: 362 [PMID: 16311759] [6] Nakai K & Horton P, Trends Biochem Sci. 1999 24: 34 [PMID: 10087920] Bioinformation 8(8): 394-398 (2012) 396 2012 Biomedical Informatics

[7] De Castro E et al. Nucleic Acids Res. 2006 34: W362 [PMID: 16845026] [8] Juncker AS et al. Protein Sci. 2003 12: 1652 [PMID: 12876315] [9] Käll L et al. J Mol Biol. 2004 338: 1027 [PMID: 15111065] [10] Taylor PD et al. Bioinformation. 2006 1: 176 [PMID: 17597883] [11] Fariselli P et al. Bioinformatics. 2003 19: 2498 [PMID: 14668245] [12] http://www.ncbi.nlm.nih.gov/ [13] http://www.uniprot.org/ [14] Li W et al. Bioinformatics. 2002 18: 77 [PMID: 11836214] [15] Li W et al. Bioinformatics. 2001 17: 282 [PMID: 11294794] [16] Hayashi S & Wu HC, J Bioenerg Biomembr. 1990 22: 451 [PMID: 2202727] [17] Klein P et al. Protein Eng. 1988 2: 15 [PMID: 3253732] [18] Von Heijne G, Protein Eng. 1989 2: 531 [PMID: 2664762] [19] Bendtsen JD, J Mol Biol. 2004 340: 783 [PMID: 15223320] [20] http://www.cs.waikato.ac.nz/ml/weka/ Edited by P Kangueane Citation: Kumari et al. Bioinformation 8(8): 394-398 (2012) License statement: This is an open-access article, which permits unrestricted use, distribution, and reproduction in any medium, for non-commercial purposes, provided the original author and source are credited. Bioinformation 8(8): 394-398 (2012) 397 2012 Biomedical Informatics

Supplementary material: Table 1: Performance of SVM models based on various features Bacterial Lipoproteins Kernel 10 Fold Cross Validation Amino acid RBF 90% Selected amino acid RBF 91% Dipeptide RBF 94% Selected Dipeptide RBF 97% - SVM MODEL Amino acid and dipeptide RBF 93% Selected amino acid and dipeptide RBF 96% Bioinformation 8(8): 394-398 (2012) 398 2012 Biomedical Informatics