Data Mining in Bioinformatics Day 9: String & Text Mining in Bioinformatics

Data Mining in Bioinformatics
Day 9: String & Text Mining in Bioinformatics
Karsten Borgwardt
March 1 to March 12, 2010
Machine Learning & Computational Biology Research Group, MPIs Tübingen
Karsten Borgwardt: Data Mining in Bioinformatics, Page 1

Why compare sequences?
Protein sequences:
- Proteins are chains of amino acids.
- 20 different types of amino acids can be found in protein sequences.
- Protein sequences change over time through mutations, deletions, and insertions.
- Different protein sequences may diverge from one common ancestor.
- Their sequences may differ slightly, yet their function is often conserved.

Why compare sequences?
Biological question: Biologists are interested in the reverse direction. Given two protein sequences, is it likely that they originate from the same common ancestor?
Computational challenge: How to measure similarity between two protein sequences, or equivalently, between two strings.
Kernel challenge: How to measure similarity between two strings via a kernel function.
In short: How to define a string kernel.

History of sequence comparison
- First phase: Smith-Waterman, BLAST
- Second phase: Profiles, Hidden Markov Models
- Third phase: PSI-BLAST, SAM-T98
- Fourth phase: Kernels

Sequence comparison: Phase 1
Idea:
- Measure pairwise similarities between sequences with gaps
Methods:
Smith-Waterman
- dynamic programming
- high accuracy
- slow ($O(n^2)$)
BLAST
- faster heuristic alternative with sufficient accuracy
- searches common substrings of fixed length
- extends these in both directions
- performs gapped alignment
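The quadratic cost of Smith-Waterman comes from filling a table with one cell per pair of positions. The following is a minimal sketch of the score computation only, assuming a linear gap penalty and illustrative scoring values (`match`, `mismatch`, `gap` are assumptions, not taken from the slides; real protein alignment would use a substitution matrix and usually affine gaps).

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Uses a linear gap penalty and keeps only two rows of the table,
    so time is O(len(a) * len(b)) and memory is O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The double loop makes the $O(n^2)$ behaviour visible; BLAST avoids it by only extending around short exact seed matches.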

Sequence comparison: Phase 2
Idea:
- Collect aggregate statistics from a family of sequences
- Compare these statistics to a single unlabeled protein
Methods:
Hidden Markov Models (HMMs)
- Markov process with hidden and observable parameters
- The forward algorithm determines the probability that a given sequence is the output of a particular HMM
Profiles
- Profiles of sequence families are derived by multiple sequence alignment
- A given sequence is compared to this profile
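As an illustration of the forward algorithm mentioned on this slide, here is a small sketch for a discrete HMM. The two-state model (`STATES`, `START`, `TRANS`, `EMIT`) is a made-up toy, not a trained protein-family model, and the function assumes a non-empty observation sequence.

```python
def forward_probability(obs, states, start_p, trans_p, emit_p):
    """Probability that the HMM emits the (non-empty) sequence obs.

    alpha[s] holds the probability of the observed prefix with the
    hidden chain currently in state s.
    """
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical toy model over the DNA alphabet (illustrative numbers only).
STATES = ("H", "L")
START = {"H": 0.5, "L": 0.5}
TRANS = {"H": {"H": 0.5, "L": 0.5}, "L": {"H": 0.4, "L": 0.6}}
EMIT = {
    "H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

if __name__ == "__main__":
    print(forward_probability("GGCA", STATES, START, TRANS, EMIT))
```

For long sequences a practical implementation would work in log space or rescale `alpha` at each step to avoid underflow.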

Sequence comparison: Phase 3
Idea:
- Create single models from database collections of homologous sequences
Methods:
PSI-BLAST
- Position-specific iterative BLAST
- Profile from highest-scoring hits in initial BLAST runs
- Position weighting according to degree of conservation
- Iteration of these steps
SAM-T98, now SAM-T02
- database search with HMM from multiple sequence alignment

Phase 4: Kernels and SVMs
General idea:
- Model differences between classes of sequences
- Use SVM classifier to distinguish classes
- Use kernel to measure similarity between strings
Kernels for protein sequences:
- SVM-Fisher kernel
- Composition kernel
- Motif kernel
- String kernel

SVM-Fisher method
General idea:
- Combine HMMs and SVMs for sequence classification
- Won best-paper award at ISMB 1999
Sequence representation:
- fixed-length vector
- components are transition and emission probabilities
- transformation into Fisher score

SVM-Fisher method
Algorithm:
- Model protein family F as an HMM
- Transform query protein X into a fixed-length vector via the HMM
- Compute kernel between X and positive and negative examples of the protein family
Advantages:
- allows prior knowledge to be incorporated
- can deal with missing data
- is interpretable
- outperforms competing methods

Composition kernels
General idea:
- Model sequence by amino acid content
- Bin amino acids w.r.t. physico-chemical properties
Sequence representation:
- feature vector of amino acid frequencies
- physico-chemical properties include predicted secondary structure, hydrophobicity, normalized van der Waals volume, polarity, polarizability
- useful database: AAindex
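A minimal sketch of the simplest composition representation, the vector of amino acid frequencies, with a plain inner product as kernel. The helper names are illustrative, and the binning by physico-chemical properties is deliberately left out because the bin definitions are not given on the slide; binned counts would simply add further coordinates to the vector.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(seq):
    """Relative frequency of each amino acid in a non-empty sequence."""
    counts = Counter(seq)
    n = len(seq)
    return [counts[a] / n for a in AMINO_ACIDS]


def composition_kernel(x, y):
    """Inner product of the two amino acid frequency vectors."""
    return sum(p * q for p, q in zip(composition_vector(x), composition_vector(y)))


if __name__ == "__main__":
    print(composition_kernel("MKVLAA", "MKLLGA"))
```

Because the order of residues is discarded, two sequences with the same composition but different arrangement receive the maximal similarity under this representation.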

Motif kernels
General idea:
- Conserved motifs in amino acid sequences indicate structural and functional relationships
- Model sequence s as a feature vector f representing motifs
- The i-th component of f is 1 if and only if s contains the i-th motif
Motif databases:
- PROSITE
- eMOTIFs
- BLOCKS+ combines several databases
Generated by:
- manual construction
- multiple sequence alignment
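The binary motif representation can be sketched as follows. The three patterns in `TOY_MOTIFS` are hypothetical examples written as regular expressions for illustration; they are not an extract of PROSITE, eMOTIFs, or BLOCKS+, and a real motif kernel would use thousands of database motifs.

```python
import re

# Hypothetical toy motifs as regular expressions (illustration only).
TOY_MOTIFS = [r"N[^P][ST][^P]", r"[ST].[RK]", r"C..C"]


def motif_vector(seq, motifs=TOY_MOTIFS):
    """i-th component is 1 if seq contains the i-th motif, else 0."""
    return [1 if re.search(m, seq) else 0 for m in motifs]


def motif_kernel(x, y, motifs=TOY_MOTIFS):
    """Number of motifs that occur in both sequences."""
    fx, fy = motif_vector(x, motifs), motif_vector(y, motifs)
    return sum(a * b for a, b in zip(fx, fy))


if __name__ == "__main__":
    print(motif_vector("MSAKCAAC"), motif_kernel("MSAKCAAC", "ANGSA"))
```

With binary features the kernel value is just the count of shared motifs, which keeps the result easy to interpret.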

Pairwise comparison kernels
General idea:
- Employ empirical kernel map on Smith-Waterman/BLAST scores
Advantage:
- Utilizes decades of practical experience with BLAST
Disadvantage:
- High computational cost ($O(m^3)$)
Alleviation:
- Employ BLAST instead of Smith-Waterman
- Use vectorization set for empirical map only
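The empirical kernel map represents a sequence by its vector of scores against a fixed set of reference sequences (the vectorization set) and then takes inner products of these vectors. The sketch below assumes a stand-in scoring function `toy_score` (count of agreeing positions), used here only so the example is self-contained; in the setting of the slide the scores would come from Smith-Waterman or BLAST.

```python
def toy_score(a, b):
    """Stand-in pairwise score: number of aligned positions that agree.

    Hypothetical placeholder for a Smith-Waterman or BLAST score.
    """
    return sum(1 for p, q in zip(a, b) if p == q)


def empirical_map(x, vectorization_set, score=toy_score):
    """Represent x by its scores against every reference sequence."""
    return [score(x, v) for v in vectorization_set]


def pairwise_kernel(x, y, vectorization_set, score=toy_score):
    """Inner product of the two empirical feature vectors."""
    fx = empirical_map(x, vectorization_set, score)
    fy = empirical_map(y, vectorization_set, score)
    return sum(a * b for a, b in zip(fx, fy))


if __name__ == "__main__":
    refs = ["AC", "GT"]
    print(empirical_map("AT", refs), pairwise_kernel("AC", "AT", refs))
```

The resulting kernel is positive semidefinite by construction, even when the underlying alignment score is not, and restricting the vectorization set to a subset of sequences reduces the number of pairwise scores that must be computed.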

Phase 4: String Kernels
General idea:
- Count common substrings in two strings
- A substring of length k is a k-mer
Variations:
- Assign weights to k-mers
- Allow for mismatches
- Allow for gaps
- Include substitutions
- Include wildcards

Spectrum Kernel
General idea:
For each $l$-mer $\alpha \in \Sigma^l$, the coordinate indexed by $\alpha$ is the number of times $\alpha$ occurs in sequence $x$. The $l$-spectrum feature map is then
$$\Phi^{\mathrm{Spectrum}}_{l}(x) = \left(\phi_\alpha(x)\right)_{\alpha \in \Sigma^l}$$
Here $\phi_\alpha(x)$ is the number of occurrences of $\alpha$ in $x$.
The spectrum kernel is the inner product in the feature space defined by this map:
$$k^{\mathrm{Spectrum}}(x, x') = \left\langle \Phi^{\mathrm{Spectrum}}_{l}(x), \Phi^{\mathrm{Spectrum}}_{l}(x') \right\rangle$$
The more common substrings two sequences contain, the more similar they are deemed to be.

Spectrum Kernel
Principle:
Spectrum kernel: count the k-mers that the two sequences have in common, requiring exact matches.
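A minimal sketch of the spectrum kernel as defined above, storing only the $l$-mers that actually occur instead of all $|\Sigma|^l$ coordinates. The function names are illustrative.

```python
from collections import Counter


def spectrum(x, l):
    """Sparse l-spectrum of x: maps each l-mer to its number of occurrences."""
    return Counter(x[i:i + l] for i in range(len(x) - l + 1))


def spectrum_kernel(x, y, l):
    """Inner product of the l-spectra of x and y."""
    sx, sy = spectrum(x, l), spectrum(y, l)
    return sum(count * sy[alpha] for alpha, count in sx.items())


if __name__ == "__main__":
    print(spectrum("ABAB", 2))
    print(spectrum_kernel("ABAB", "BABA", 2))
```

For "ABAB" and "BABA" with l = 2, the spectra are {AB: 2, BA: 1} and {BA: 2, AB: 1}, so the kernel value is 2·1 + 1·2 = 4.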

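Mismatch Kernel
General idea:
- Do not enforce strictly exact matches
- Define the mismatch neighborhood of an $l$-mer $\alpha$ with up to $m$ mismatches:
$$\Phi^{\mathrm{Mismatch}}_{(l,m)}(\alpha) = \left(\phi_\beta(\alpha)\right)_{\beta \in \Sigma^l}$$
where $\phi_\beta(\alpha) = 1$ if $\beta$ differs from $\alpha$ in at most $m$ positions, and $\phi_\beta(\alpha) = 0$ otherwise.
- For a sequence $x$ of any length, the map is then extended as
$$\Phi^{\mathrm{Mismatch}}_{(l,m)}(x) = \sum_{l\text{-mers } \alpha \text{ in } x} \Phi^{\mathrm{Mismatch}}_{(l,m)}(\alpha)$$
- The mismatch kernel is the inner product in the feature space defined by this map:
$$k^{\mathrm{Mismatch}}_{(l,m)}(x, x') = \left\langle \Phi^{\mathrm{Mismatch}}_{(l,m)}(x), \Phi^{\mathrm{Mismatch}}_{(l,m)}(x') \right\rangle$$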

Mismatch Kernel
Principle:
Mismatch kernel: count common k-mers with at most m mismatches.
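The mismatch feature map can be sketched by enumerating, for every l-mer in the sequence, all strings within Hamming distance m. This is a brute-force illustration that is only practical for small l and m; it is not the efficient trie-based computation described in the literature, and the function names are illustrative.

```python
from collections import Counter
from itertools import combinations, product


def mismatch_neighborhood(alpha, m, alphabet):
    """All strings over alphabet that differ from alpha in at most m positions."""
    neighbors = set()
    for d in range(m + 1):
        for positions in combinations(range(len(alpha)), d):
            for letters in product(alphabet, repeat=d):
                beta = list(alpha)
                for p, c in zip(positions, letters):
                    beta[p] = c  # may re-insert the original letter; the set removes duplicates
                neighbors.add("".join(beta))
    return neighbors


def mismatch_features(x, l, m, alphabet):
    """Each l-mer of x adds 1 to every coordinate in its mismatch neighborhood."""
    feats = Counter()
    for i in range(len(x) - l + 1):
        feats.update(mismatch_neighborhood(x[i:i + l], m, alphabet))
    return feats


def mismatch_kernel(x, y, l, m, alphabet):
    """Inner product of the (l, m)-mismatch feature vectors of x and y."""
    fx, fy = mismatch_features(x, l, m, alphabet), mismatch_features(y, l, m, alphabet)
    return sum(v * fy[b] for b, v in fx.items())


if __name__ == "__main__":
    print(mismatch_kernel("ACG", "ACT", 3, 1, "ACGT"))
```

With m = 0 the neighborhood of an l-mer contains only the l-mer itself, so the function reduces to the spectrum kernel.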

Gappy Kernel
General idea:
- Allow for gaps in common substrings, that is, move from substrings to subsequences
- A $g$-mer then contributes to all its $l$-mer subsequences:
$$\Phi^{\mathrm{Gap}}_{(g,l)}(\alpha) = \left(\phi_\beta(\alpha)\right)_{\beta \in \Sigma^l}$$
- For a sequence $x$ of any length, the map is then extended as
$$\Phi^{\mathrm{Gap}}_{(g,l)}(x) = \sum_{g\text{-mers } \alpha \text{ in } x} \Phi^{\mathrm{Gap}}_{(g,l)}(\alpha)$$
- The gappy kernel is the inner product in the feature space defined by this map:
$$k^{\mathrm{Gap}}_{(g,l)}(x, x') = \left\langle \Phi^{\mathrm{Gap}}_{(g,l)}(x), \Phi^{\mathrm{Gap}}_{(g,l)}(x') \right\rangle$$

Gappy Kernel
Principle:
Gappy kernel: count common l-subsequences of g-mers.
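A brute-force sketch of the gappy feature map: every window of length g contributes to each distinct length-l subsequence it contains. Counting each distinct subsequence once per window is an assumption about the convention; the slide does not spell it out. Names are illustrative.

```python
from collections import Counter
from itertools import combinations


def gappy_features(x, g, l):
    """Each g-mer of x adds 1 to every distinct l-mer that is a subsequence of it."""
    feats = Counter()
    for i in range(len(x) - g + 1):
        window = x[i:i + g]
        # combinations() keeps the original order, so each tuple is a subsequence.
        feats.update({"".join(sub) for sub in combinations(window, l)})
    return feats


def gappy_kernel(x, y, g, l):
    """Inner product of the (g, l)-gappy feature vectors of x and y."""
    fx, fy = gappy_features(x, g, l), gappy_features(y, g, l)
    return sum(v * fy[b] for b, v in fx.items())


if __name__ == "__main__":
    print(gappy_features("ABCD", 3, 2))
    print(gappy_kernel("ABC", "AXC", 3, 2))
```

In the second call, "ABC" and "AXC" share no 2-mer substring, yet both contain the gapped subsequence "AC", so the kernel value is 1.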

Substitution Kernel
General idea:
- Replace the mismatch neighborhood by a substitution neighborhood
- An $l$-mer $\alpha = a_1 a_2 \ldots a_l$ then contributes to all $l$-mers in its substitution neighborhood
$$M_{(l,\sigma)}(\alpha) = \left\{ \beta = b_1 b_2 \ldots b_l \in \Sigma^l : -\sum_{i=1}^{l} \log P(a_i \mid b_i) < \sigma \right\}$$
- For a sequence $x$ of any length, the map is then extended as
$$\Phi^{\mathrm{Sub}}_{(l,\sigma)}(x) = \sum_{l\text{-mers } \alpha \text{ in } x} \Phi^{\mathrm{Sub}}_{(l,\sigma)}(\alpha)$$
- The substitution kernel is then:
$$k^{\mathrm{Sub}}_{(l,\sigma)}(x, x') = \left\langle \Phi^{\mathrm{Sub}}_{(l,\sigma)}(x), \Phi^{\mathrm{Sub}}_{(l,\sigma)}(x') \right\rangle$$

Substitution Kernel
Principle:
Substitution kernel: count common l-mers in the substitution neighborhood.
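The substitution neighborhood can be illustrated by brute force over a tiny alphabet. The two-letter alphabet and the conditional probabilities in `TOY_PROB` are invented for this sketch (keys are `(a, b)` pairs meaning P(a | b)); for proteins these probabilities would be derived from a substitution matrix, and all of them are assumed to be strictly positive.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical substitution probabilities P(a | b), keyed as (a, b).
TOY_PROB = {("A", "A"): 0.9, ("B", "A"): 0.1, ("A", "B"): 0.2, ("B", "B"): 0.8}


def substitution_neighborhood(alpha, sigma, alphabet, cond_prob):
    """All beta with -sum_i log P(a_i | b_i) < sigma, found by enumeration."""
    result = []
    for beta in product(alphabet, repeat=len(alpha)):
        cost = -sum(math.log(cond_prob[(a, b)]) for a, b in zip(alpha, beta))
        if cost < sigma:
            result.append("".join(beta))
    return result


def substitution_features(x, l, sigma, alphabet, cond_prob):
    """Each l-mer of x adds 1 to every l-mer in its substitution neighborhood."""
    feats = Counter()
    for i in range(len(x) - l + 1):
        feats.update(substitution_neighborhood(x[i:i + l], sigma, alphabet, cond_prob))
    return feats


def substitution_kernel(x, y, l, sigma, alphabet, cond_prob):
    """Inner product of the (l, sigma)-substitution feature vectors."""
    fx = substitution_features(x, l, sigma, alphabet, cond_prob)
    fy = substitution_features(y, l, sigma, alphabet, cond_prob)
    return sum(v * fy[b] for b, v in fx.items())


if __name__ == "__main__":
    print(substitution_neighborhood("AA", 2.0, "AB", TOY_PROB))
```

Unlike the mismatch neighborhood, the size of this neighborhood depends on which letters are involved: likely substitutions are cheap, unlikely ones quickly exceed the threshold σ.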

Wildcard Kernels
General idea:
- Augment the alphabet $\Sigma$ by a wildcard character $*$: $\Sigma \cup \{*\}$
- Given $\alpha$ from $\Sigma^l$ and $\beta$ from $(\Sigma \cup \{*\})^l$ with at most $m$ occurrences of $*$
- The $l$-mer $\alpha$ contributes to the $l$-mer $\beta$ if their non-wildcard characters match
- For a sequence $x$ of any length, the map is then given by
$$\Phi^{\mathrm{Wildcard}}_{(l,m,\lambda)}(x) = \sum_{l\text{-mers } \alpha \text{ in } x} \left(\phi_\beta(\alpha)\right)_{\beta \in W}$$
where $W$ is the set of such patterns $\beta$, $\phi_\beta(\alpha) = \lambda^j$ if $\alpha$ matches pattern $\beta$ containing $j$ wildcards, $\phi_\beta(\alpha) = 0$ if $\alpha$ does not match $\beta$, and $0 < \lambda \le 1$.

Wildcard Kernel
Principle:
Wildcard kernel: count l-mers that match except for wildcards.
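A brute-force sketch of the wildcard feature map: every l-mer is expanded into all patterns obtained by replacing up to m of its positions with `*`, and a pattern with j wildcards receives weight λ^j. Function and parameter names are illustrative, and the sequences are assumed not to contain the character `*` themselves.

```python
from collections import defaultdict
from itertools import combinations


def wildcard_features(x, l, m, lam):
    """Weighted counts of all wildcard patterns matched by the l-mers of x."""
    feats = defaultdict(float)
    for i in range(len(x) - l + 1):
        alpha = x[i:i + l]
        for j in range(m + 1):
            for positions in combinations(range(l), j):
                beta = "".join("*" if p in positions else c for p, c in enumerate(alpha))
                feats[beta] += lam ** j  # pattern with j wildcards gets weight lam^j
    return feats


def wildcard_kernel(x, y, l, m, lam):
    """Inner product of the (l, m, lam)-wildcard feature vectors of x and y."""
    fx, fy = wildcard_features(x, l, m, lam), wildcard_features(y, l, m, lam)
    return sum(v * fy[b] for b, v in fx.items() if b in fy)


if __name__ == "__main__":
    print(dict(wildcard_features("AB", 2, 1, 0.5)))
    print(wildcard_kernel("AB", "AC", 2, 1, 0.5))
```

In the example, "AB" and "AC" share only the pattern "A*", which carries weight 0.5 in each feature vector, so the kernel value is 0.25. Setting m = 0 removes all wildcard patterns and recovers the spectrum kernel.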

References and further reading
[1] C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In PSB, pages 564-575, 2002.
[2] C. Leslie, E. Eskin, J. Weston, and W. S. Noble. Mismatch string kernels for SVM protein classification. In NIPS, 2002. MIT Press.
[3] C. Leslie and R. Kuang. Fast kernels for inexact string matching. In COLT, 2003.
[4] B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology, Chapters 3 and 4. MIT Press, Cambridge, MA, 2004.