Data Mining in Bioinformatics Day 4: Text Mining

Data Mining in Bioinformatics, Day 4: Text Mining. Karsten Borgwardt, February 25 to March 10, Bioinformatics Group, MPIs Tübingen. Karsten Borgwardt: Data Mining in Bioinformatics, Page 1

What is text mining? Definition: Text mining is the use of automated methods for exploiting the enormous amount of knowledge available in the (biomedical) literature. Motivation: Most knowledge is stored in the form of text, both in industry and in academia. This alone makes text mining an integral part of knowledge discovery! Furthermore, to make text machine-readable, one has to solve several recognition (mining) tasks on text. Karsten Borgwardt: Data Mining in Bioinformatics, Page 2

What is text mining? Common tasks: Information retrieval: find documents that are relevant to a user, or to a query, in a collection of documents. Document ranking: rank all documents in the collection. Document selection: classify documents into relevant and irrelevant. Information filtering: search newly created documents for information that is relevant to a user. Document classification: assign a document to a category that describes its content. Keyword co-occurrence: find groups of keywords that co-occur in many documents. Karsten Borgwardt: Data Mining in Bioinformatics, Page 3

Evaluating text mining: Precision and Recall. Let the set of documents that are relevant to a query be denoted as {Relevant} and the set of retrieved documents as {Retrieved}. Precision is the percentage of retrieved documents that are relevant to the query: $\mathrm{precision} = \frac{|\{\mathrm{Relevant}\} \cap \{\mathrm{Retrieved}\}|}{|\{\mathrm{Retrieved}\}|}$. Recall is the percentage of relevant documents that were retrieved by the query: $\mathrm{recall} = \frac{|\{\mathrm{Relevant}\} \cap \{\mathrm{Retrieved}\}|}{|\{\mathrm{Relevant}\}|}$. Karsten Borgwardt: Data Mining in Bioinformatics, Page 4
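To make the two measures concrete, here is a minimal sketch (not from the slides) that computes precision and recall directly from the two document sets; the document identifiers are invented for illustration:

```python
def precision_recall(relevant, retrieved):
    """Precision and recall for a single query, given sets of document IDs."""
    hits = relevant & retrieved                               # relevant AND retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 3 of the 4 retrieved documents are relevant,
# and 3 of the 5 relevant documents were retrieved.
print(precision_recall({"d1", "d2", "d3", "d7", "d9"}, {"d1", "d2", "d3", "d4"}))
# -> (0.75, 0.6)
```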

Text representation: Tokenization. Tokenization is the process of identifying keywords in a document. Not all words in a text are relevant: text mining ignores stop words. The stop words form the stop list, and stop lists are context-dependent. Karsten Borgwardt: Data Mining in Bioinformatics, Page 5
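A minimal tokenization sketch, assuming a simple regular-expression tokenizer and a small, hand-picked stop list (both are illustrative assumptions; real stop lists are context-dependent, as noted above):

```python
import re

STOP_LIST = {"the", "a", "an", "of", "in", "and", "is", "to"}  # illustrative stop list

def tokenize(document: str) -> list[str]:
    """Lower-case the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", document.lower())
    return [t for t in tokens if t not in STOP_LIST]

print(tokenize("Text mining is the use of automated methods"))
# -> ['text', 'mining', 'use', 'automated', 'methods']
```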

Text representation: Vector space model. Given #d documents and #t terms, model each document as a vector v in a t-dimensional space. Weighted term-frequency matrix: a matrix $TF$ of size $\#d \times \#t$ whose entries measure the association of a term and a document. If a term t does not occur in a document d, then $TF(d, t) = 0$; if a term t does occur in a document d, then $TF(d, t) > 0$. Karsten Borgwardt: Data Mining in Bioinformatics, Page 6

Text representation: Common choices for the term frequency $TF(d, t)$: binary, $TF(d, t) = 1$ if term t occurs in document d; absolute frequency, $TF(d, t) = \mathrm{freq}(d, t)$, the frequency of t in d; relative frequency, $TF(d, t) = \frac{\mathrm{freq}(d, t)}{\sum_{t' \in T} \mathrm{freq}(d, t')}$; damped frequency, $TF(d, t) = 1 + \log(1 + \log(\mathrm{freq}(d, t)))$. Karsten Borgwardt: Data Mining in Bioinformatics, Page 7

Text representation: Inverse document frequency. The IDF represents the scaling factor, or importance, of a term; a term that appears in many documents is scaled down: $IDF(t) = \log\frac{1 + |d|}{|d_t|}$, where $|d|$ is the number of all documents and $|d_t|$ is the number of documents containing term t. TF-IDF measure: the product of term frequency and inverse document frequency, $TF\text{-}IDF(d, t) = TF(d, t) \cdot IDF(t)$. Karsten Borgwardt: Data Mining in Bioinformatics, Page 8
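The following sketch assembles the weighted term-frequency matrix entries using the damped TF variant and the IDF formula above; the toy corpus and the helper names are assumptions for illustration:

```python
import math
from collections import Counter

def tf_idf_matrix(documents):
    """Return {(doc_index, term): TF-IDF weight} for a list of tokenized documents."""
    n_docs = len(documents)
    # |d_t|: number of documents containing each term
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = {}
    for i, doc in enumerate(documents):
        freq = Counter(doc)
        for term, f in freq.items():
            tf = 1 + math.log(1 + math.log(f))             # damped term frequency
            idf = math.log((1 + n_docs) / doc_freq[term])  # IDF(t) = log((1+|d|)/|d_t|)
            weights[(i, term)] = tf * idf
    return weights

corpus = [["gene", "expression", "cancer"], ["gene", "mining"], ["text", "mining", "mining"]]
print(tf_idf_matrix(corpus)[(2, "mining")])
```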

Measuring similarity: Cosine measure. Let $v_1$ and $v_2$ be two document vectors. The cosine similarity is defined as $\mathrm{sim}(v_1, v_2) = \frac{v_1 \cdot v_2}{\|v_1\|\,\|v_2\|}$. Kernels: depending on how we represent a document, there are many kernels available for measuring the similarity of these representations. Vectorial representation: vector kernels such as the linear, polynomial or Gaussian RBF kernel. One long string: string kernels that count common k-mers in two strings (more on that later in the course). Karsten Borgwardt: Data Mining in Bioinformatics, Page 9
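A small sketch of the cosine measure on two (dense) document vectors, e.g. rows of the TF-IDF matrix; the example vectors are invented:

```python
import math

def cosine_similarity(v1, v2):
    """sim(v1, v2) = (v1 . v2) / (||v1|| * ||v2||) for equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 1.0, 1.0]))  # ~0.73
```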

Keyword co-occurrence: Problem. Find sets of keywords that often co-occur. This is a common problem in the biomedical literature: find associations between genes, proteins or other entities using a co-occurrence search. Keyword co-occurrence search is an instance of a more general problem in data mining, called association rule mining. Karsten Borgwardt: Data Mining in Bioinformatics, Page 10

Association rules: Definitions. Let $I = \{I_1, I_2, \ldots, I_m\}$ be a set of items (keywords). Let $D$ be the database of transactions $T$ (the collection of documents). A transaction $T \in D$ is a set of items, $T \subseteq I$ (a document is a set of keywords). Let $A$ be a set of items with $A \subseteq T$. An association rule is an implication of the form $A \Rightarrow B$, i.e. $A \subseteq T \Rightarrow B \subseteq T$, where $A, B \subset I$ and $A \cap B = \emptyset$. Karsten Borgwardt: Data Mining in Bioinformatics, Page 11

Association rules: Support and Confidence. The rule $A \Rightarrow B$ holds in the transaction set $D$ with support $s$, where $s$ is the percentage of transactions in $D$ that contain $A \cup B$: $\mathrm{support}(A \Rightarrow B) = \frac{|\{T \in D \mid A \subseteq T \wedge B \subseteq T\}|}{|\{T \in D\}|}$. The rule $A \Rightarrow B$ has confidence $c$ in the transaction set $D$, where $c$ is the percentage of transactions in $D$ containing $A$ that also contain $B$: $\mathrm{confidence}(A \Rightarrow B) = \frac{|\{T \in D \mid A \subseteq T \wedge B \subseteq T\}|}{|\{T \in D \mid A \subseteq T\}|}$. Karsten Borgwardt: Data Mining in Bioinformatics, Page 12
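A short sketch that evaluates support and confidence of a candidate rule over a toy transaction database (documents represented as keyword sets; the data are invented):

```python
def support_confidence(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent => consequent.

    transactions: list of sets of items (e.g. keyword sets per document).
    """
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    support = both / len(transactions)
    confidence = both / ante if ante else 0.0
    return support, confidence

docs = [{"BRCA1", "breast", "cancer"}, {"BRCA1", "cancer"}, {"p53", "cancer"}]
print(support_confidence(docs, {"BRCA1"}, {"cancer"}))  # -> (0.666..., 1.0)
```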

Association rules: Strong rules. Rules that satisfy both a minimum support threshold (minsup) and a minimum confidence threshold (minconf) are called strong association rules, and these are the ones we are after! Finding strong rules: 1. Search for all frequent itemsets (sets of items that occur in at least minsup % of all transactions). 2. Generate strong association rules from the frequent itemsets. Karsten Borgwardt: Data Mining in Bioinformatics, Page 13

Association rules: Apriori algorithm. It makes use of the Apriori property: if an itemset A is frequent, then any subset B of A ($B \subseteq A$) is frequent as well; if B is infrequent, then any superset A of B ($A \supseteq B$) is infrequent as well. Steps: 1. Determine the frequent items = k-itemsets with k = 1. 2. Join all pairs of frequent k-itemsets that differ in at most one item = candidates $C_{k+1}$ for being frequent (k+1)-itemsets. 3. Check the frequency of these candidates $C_{k+1}$: the frequent ones form the frequent (k+1)-itemsets (trick: immediately discard any candidate that contains an infrequent k-itemset). 4. Repeat from Step 2 until no more candidates are frequent. Karsten Borgwardt: Data Mining in Bioinformatics, Page 14
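A compact sketch of the Apriori candidate-generation and counting loop described above, run on a toy list of keyword sets (the data and the helper names are illustrative, not from the slides):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return all itemsets that occur in at least a fraction minsup of the transactions."""
    n = len(transactions)

    def is_frequent(itemset):
        return sum(1 for t in transactions if itemset <= t) / n >= minsup

    # Step 1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    frequent = [{frozenset([i]) for i in items if is_frequent(frozenset([i]))}]
    while frequent[-1]:
        prev = frequent[-1]
        # Step 2: join pairs of frequent k-itemsets into (k+1)-candidates
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == len(a) + 1}
        # Apriori pruning: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, len(c) - 1))}
        # Step 3: keep only the candidates that are actually frequent
        frequent.append({c for c in candidates if is_frequent(c)})
    return [s for level in frequent for s in level]

docs = [{"gene", "cancer"}, {"gene", "cancer", "p53"}, {"gene", "p53"}]
print(apriori(docs, minsup=0.6))
```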

Transduction: Known test set. Classification on text databases often means that we know all the data we will work with before training; hence the test set is known a priori. This setting is called transductive. Can we define classifiers that exploit the known test set? Yes! The Transductive SVM (Joachims, ICML 1999) trains the SVM on both the training and the test set and uses the test data to maximise the margin. Karsten Borgwardt: Data Mining in Bioinformatics, Page 15

Inductive vs. transductive classification. Task: predict label y from features x. Classic inductive setting: Strategy: learn a classifier on (labelled) training data. Goal: the classifier shall generalise to unseen data from the same distribution. Transductive setting: Strategy: learn a classifier on (labelled) training data AND a given (unlabelled) test dataset. Goal: predict class labels for this particular dataset. Karsten Borgwardt: Data Mining in Bioinformatics, Page 16

Why transduction? Is it really necessary? The classic approach works: train on the training dataset, test on the test dataset. That is what we usually do in practice, for instance in cross-validation, and we usually ignore or neglect the fact that the setting is transductive. The benefits of transductive classification: in the inductive setting there are infinitely many potential classifiers; in the transductive setting there is a finite number of equivalence classes of classifiers. Two classifiers f and f' are in the same equivalence class if f and f' classify the points from the training and test dataset identically. Karsten Borgwardt: Data Mining in Bioinformatics, Page 17

Why transduction? The idea of Transductive SVMs: risk on the test data $\leq$ risk on the training data + a confidence interval (which depends on the number of equivalence classes). Theorem by Vapnik (1998): the larger the margin, the lower the number of equivalence classes that contain a classifier with this margin. Hence: find the hyperplane that separates the classes in the training data AND in the test data with maximum margin. Karsten Borgwardt: Data Mining in Bioinformatics, Page 18

Why transduction? Karsten Borgwardt: Data Mining in Bioinformatics, Page 19

Transduction on text Karsten Borgwardt: Data Mining in Bioinformatics, Page 20

Transductive SVM: Linearly separable case. $\min_{w, b, y^*} \frac{1}{2}\|w\|^2$ subject to $y_i [w \cdot x_i + b] \geq 1$ for all $i = 1, \ldots, n$ (training points) and $y_j^* [w \cdot x_j^* + b] \geq 1$ for all $j = 1, \ldots, k$ (test points, whose labels $y_j^*$ are optimisation variables). Karsten Borgwardt: Data Mining in Bioinformatics, Page 21

Transductive SVM: Non-linearly separable case. $\min_{w, b, y^*, \xi, \xi^*} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i + C^* \sum_{j=1}^{k} \xi_j^*$ subject to $y_i [w \cdot x_i + b] \geq 1 - \xi_i$ for all $i = 1, \ldots, n$, $y_j^* [w \cdot x_j^* + b] \geq 1 - \xi_j^*$ for all $j = 1, \ldots, k$, $\xi_i \geq 0$ and $\xi_j^* \geq 0$. Karsten Borgwardt: Data Mining in Bioinformatics, Page 22

Transductive SVM: Optimisation. How to solve this OP? Not so nice: it is a combination of an integer and a convex OP. Joachims' approach: find an approximate solution by iterative application of an inductive SVM. Train an inductive SVM on the training data, predict on the test data and assign these labels to the test data; then retrain on all data, with special slack weights ($C_-^*$, $C_+^*$) for the test data. Outer loop: repeat and slowly increase ($C_-^*$, $C_+^*$). Inner loop: within each repetition, repeatedly switch pairs of misclassified test points. The result is a local search with an approximate solution to the OP. Karsten Borgwardt: Data Mining in Bioinformatics, Page 23
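A heavily simplified, hedged sketch of this iterative scheme: it uses scikit-learn's SVC as the inductive SVM and per-sample weights to emulate the separate test-slack costs, and it collapses Joachims' pairwise label-switching inner loop into a simple re-prediction step, so it illustrates only the outer loop, not his actual algorithm:

```python
import numpy as np
from sklearn.svm import SVC

def transductive_svm_sketch(X_train, y_train, X_test, C=1.0, C_star=1.0):
    """Schematic TSVM: retrain an inductive SVM while slowly increasing the
    slack weight of the (initially self-labelled) test points."""
    svm = SVC(kernel="linear", C=C).fit(X_train, y_train)
    y_test = svm.predict(X_test)                    # initial labelling of the test set
    c_star = 1e-3 * C_star                          # start with a small test influence
    while c_star < C_star:
        c_star = min(10 * c_star, C_star)           # outer loop: increase test slack weight
        X_all = np.vstack([X_train, X_test])
        y_all = np.concatenate([y_train, y_test])
        # weight training points by C and test points by the current c_star
        w = np.concatenate([np.full(len(y_train), C), np.full(len(y_test), c_star)])
        svm = SVC(kernel="linear", C=1.0).fit(X_all, y_all, sample_weight=w)
        y_test = svm.predict(X_test)                # re-label the test data
    return svm, y_test
```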

Inductive SVM for TSVM: Variant of the inductive SVM. $\min_{w, b, \xi, \xi^*} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i + C_-^* \sum_{j: y_j^* = -1} \xi_j^* + C_+^* \sum_{j: y_j^* = +1} \xi_j^*$ subject to $y_i [w \cdot x_i + b] \geq 1 - \xi_i$ for all $i = 1, \ldots, n$ and $y_j^* [w \cdot x_j^* + b] \geq 1 - \xi_j^*$ for all $j = 1, \ldots, k$. Three different penalty costs: $C$ for points from the training dataset, $C_-^*$ for points from the test dataset currently in class $-1$, and $C_+^*$ for points from the test dataset currently in class $+1$. Karsten Borgwardt: Data Mining in Bioinformatics, Page 24

Experiments Average P/R-breakeven point on the Reuters dataset for different training set sizes and a test size of 3,299 Karsten Borgwardt: Data Mining in Bioinformatics, Page 25

Experiments Average P/R-breakeven point on the Reuters dataset for 17 training documents and varying test set size for the TSVM Karsten Borgwardt: Data Mining in Bioinformatics, Page 26

Experiments Average P/R-breakeven point on the WebKB category course for different training set sizes Karsten Borgwardt: Data Mining in Bioinformatics, Page 27

Experiments Average P/R-breakeven point on the WebKB category project for different training set sizes Karsten Borgwardt: Data Mining in Bioinformatics, Page 28

Summary: Results. The transductive version of the SVM maximizes the margin on training and test data. The implementation uses a variant of the classic inductive SVM; the solution is approximate and fast. It works well on text, in particular with small training samples and large test sets. Karsten Borgwardt: Data Mining in Bioinformatics, Page 29

References and further reading. References: [1] T. Joachims. Transductive Inference for Text Classification using Support Vector Machines. ICML, 1999: 200-209. [2] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Elsevier / Morgan Kaufmann Publishers, 2006. Karsten Borgwardt: Data Mining in Bioinformatics, Page 30

The end See you tomorrow! Next topic: Graph Mining Karsten Borgwardt: Data Mining in Bioinformatics, Page 31