Positive and Unlabeled Relational Classification through Label Frequency Estimation

Jessa Bekker and Jesse Davis
Computer Science Department, KU Leuven, Belgium
firstname.lastname@cs.kuleuven.be

Abstract. Many applications, such as knowledge base completion and learning from patient data, only have access to positive examples; they lack the negative examples that are required by standard ILP techniques and therefore suffer under the closed-world assumption. The corresponding propositional problem is known as Positive and Unlabeled (PU) learning. In that field, it is known that using the label frequency (the fraction of true positive examples that are labeled) makes learning easier. This notion has not yet been explored in the relational domain. The goal of this work is twofold: 1) to explore whether using the label frequency is also useful when working with relational data, and 2) to propose a method for estimating the label frequency from relational PU data. Our experiments confirm the usefulness of knowing the label frequency and of our estimate.

1 Introduction

ILP traditionally requires positive and negative examples to learn a theory. In many applications, however, it is only possible to acquire positive examples. A common solution to this issue is to make the closed-world assumption and treat unlabeled examples as belonging to the negative class. In reality, this assumption is often incorrect: for example, diabetics often go undiagnosed (Claesen et al., 2015), and people do not bookmark every page they find interesting. Considering unlabeled cases as negative is therefore suboptimal. To cope with this, several score functions have been proposed that use only positive examples (Muggleton, 1996; McCreath and Sharma; Schoenmackers et al., 2010).

In propositional settings, having training data with only positive and unlabeled examples is known as Positive and Unlabeled (PU) learning. It has been noted that if the class distribution is known, then learning in this setting is greatly simplified. Specifically, knowing the class prior allows calculating the label frequency, which is the probability of a positive example being labeled. The label frequency is crucial because it enables converting standard score functions into PU score functions that can incorporate information about the unlabeled data (Elkan and Noto, 2008). Following this insight, several methods have been proposed to estimate the label frequency from PU data (du Plessis et al., 2015; Jain et al., 2016; Ramaswamy et al., 2016). As far as we know, this notion has not been exploited in relational settings. We propose a method to estimate the label frequency from relational PU data and a way to use this frequency when learning a relational classifier.

Our work goes beyond the existing ILP score functions that only use positive examples, as it incorporates the unlabeled examples in the model evaluation process. Our main contributions are: 1) investigating the helpfulness of the label frequency in relational PU learning by adjusting relational decision trees to incorporate it; 2) proposing a method for estimating the label frequency in relational data through tree induction; and 3) evaluating our approach experimentally.

2 PU Learning and Using the Label Frequency

In PU learning, the labeled positive examples are commonly assumed to be selected completely at random (Elkan and Noto, 2008). This means that the probability c = Pr(s = 1 | y = 1) of a positive example being labeled is constant, i.e. equal for every positive example; c is called the label frequency. Many propositional PU learners exploit c to simplify learning (Denis et al., 2005; Zhang and Lee, 2005). To the best of our knowledge, using the label frequency in relational PU learning has not been investigated yet. We briefly review some propositional methods that can be adapted to the relational domain.

The most basic method is to learn a probabilistic model that treats unlabeled examples as negative and then adjust the predicted probability: Pr(y = 1 | x) = (1/c) Pr(s = 1 | x) (Zhang and Lee, 2005). In the experiments, we apply this approach to the first-order logical decision tree learner TILDE (Blockeel and De Raedt, 1998). Another method is to adapt learning algorithms that make decisions based on counts of positive and negative examples (Elkan and Noto, 2008). For a set of T examples of which L are labeled, the positive and negative counts P and N can be obtained as P = L/c and N = T − P. Decision trees, for example, assign classes to leaves and score splits based on the positive/negative counts in the candidate subsets (Denis et al., 2005). This idea can be applied straightforwardly to relational learners based on counts, such as TILDE or Aleph (Srinivasan).
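To make the count-based adjustment concrete, the following is a minimal propositional sketch in Python; it is an illustration of the idea only, not the TILDE or Aleph implementation used in the experiments, and the function names are illustrative.

```python
# Minimal propositional sketch of the count-based PU adjustment described
# above (illustrative only; not the TILDE/Aleph implementation used in the
# experiments). A subset with L labeled examples out of T total is treated
# as containing P = L/c positives and N = T - P negatives, and those
# adjusted counts are plugged into an ordinary entropy-style split score.
from math import log2


def pu_counts(num_labeled, num_total, c):
    """Expected positive/negative counts in a subset, given label frequency c."""
    p = min(num_labeled / c, num_total)  # expected positives, capped at the subset size
    n = num_total - p                    # the rest are expected negatives
    return p, n


def pu_entropy(num_labeled, num_total, c):
    """Class entropy of a subset computed from the PU-adjusted counts."""
    p, n = pu_counts(num_labeled, num_total, c)
    if p == 0 or n == 0:
        return 0.0
    pp, pn = p / (p + n), n / (p + n)
    return -(pp * log2(pp) + pn * log2(pn))


if __name__ == "__main__":
    # With c = 0.3, a subset holding 12 labeled examples out of 100 is treated
    # as roughly 40 positives and 60 negatives when scoring a candidate split.
    print(pu_counts(12, 100, 0.3))   # (40.0, 60.0)
    print(pu_entropy(12, 100, 0.3))  # ~0.971
```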

3 Label Frequency Estimation

To estimate the label frequency in relational PU data, we use the insights of a propositional label frequency estimator. We first review the original method and then propose a relational version.

3.1 Label Frequency Estimation in Propositional PU Data

The propositional estimator is TIcE (Bekker and Davis, under review). It is based on two main insights: 1) a subset of the data naturally provides a lower bound on the label frequency, and 2) the lower bound of a large enough positive subset approximates the real label frequency. TIcE uses decision tree induction to find likely positive subsets and estimates the label frequency by taking the maximum of the lower bounds implied by all the subsets in the tree.

Because of the selected-completely-at-random assumption, the label frequency is the same in every subset of the data; it can therefore be estimated in a subset. Clearly, the true number of positive examples P in a subset cannot exceed the total number of examples T in that subset. This naively implies a lower bound: c = L/P ≥ L/T, where L is the number of labeled examples in the subset. To take stochasticity into account, this bound is corrected with confidence 1 − δ using the one-sided Chebyshev inequality, which introduces an error term based on the subset size:

    Pr( c ≥ L/T − √((1 − δ)/(2δT)) ) ≥ 1 − δ    (1)

The higher the ratio of positive examples in the subset, the closer the bound gets to the actual label frequency. The ratio of positive examples is unknown, but it is directly proportional to the ratio of labeled examples. Therefore, TIcE aims to find subsets of the data with a high proportion of labeled examples using decision tree induction. To this end, it uses the max-bepp score (Blockeel et al., 2005), which chooses the split that provides the subset with the largest proportion of labels. To avoid overfitting, i.e. finding subsets where L/P > c, k folds are used so that the tree is induced and the label frequency is estimated on different parts of the data.

3.2 Label Frequency Estimation in Relational PU Data

We propose TIcER (Tree Induction for c Estimation in Relational data). The main difference with TIcE is that it learns a first-order logical decision tree using TILDE (Blockeel and De Raedt, 1998), where each internal node splits on a formula. Each node in the tree therefore specifies a subset of the data, and each subset implies a lower bound on the label frequency through Equation 1. The estimate for the label frequency is the maximal lower bound implied by the subsets. Ideally, the splits would be chosen such that subsets with high proportions of labeled examples are found, but we did not do this in this preliminary work. To prevent overfitting, k folds are used so that the tree is induced and the label frequency is estimated on different parts of the data. With relational data, extra care should be taken that the data in different folds are not related to each other.
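The estimation step itself is simple once the tree and its subset counts are available. The following minimal sketch in Python shows it; this is an illustration only, not the TILDE-based implementation, and it assumes Equation 1 in the form reconstructed above.

```python
# Minimal sketch of the TIcER estimation step (illustrative only; not the
# TILDE-based implementation, and it assumes Equation 1 in the form given
# above). Given the (labeled, total) counts of the subsets defined by the
# nodes of an induced tree, it returns the maximal lower bound on the
# label frequency c.
from math import sqrt


def subset_lower_bound(num_labeled, num_total, delta=0.1):
    """Lower bound on c implied by one subset, holding with confidence 1 - delta."""
    if num_total == 0:
        return 0.0
    bound = num_labeled / num_total - sqrt((1 - delta) / (2 * delta * num_total))
    return max(0.0, bound)


def estimate_label_frequency(subset_counts, delta=0.1):
    """Estimate c as the maximal lower bound over all tree subsets.

    subset_counts -- iterable of (num_labeled, num_total) pairs,
                     one pair per node of the induced tree.
    """
    return max(subset_lower_bound(l, t, delta) for l, t in subset_counts)


if __name__ == "__main__":
    # Three hypothetical subsets: the estimate is driven by the subset that
    # is both reasonably large and has a high fraction of labeled examples.
    print(estimate_label_frequency([(30, 300), (45, 100), (8, 10)]))  # ~0.24
```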

4 Experiments

We aim to evaluate whether knowing the label frequency makes learning from relational PU data easier and whether TIcER provides a good estimate of the label frequency.

Datasets. We evaluate our approach on 4 commonly used datasets for relational classification (Table 1). All datasets are available on Alchemy (http://alchemy.cs.washington.edu/data/), except for Mutagenesis (http://www.cs.ox.ac.uk/activities/machlearn/mutagenesis.html). The datasets were converted to PU datasets by selecting some of the positive examples at random to be labeled. The labeling was done with frequencies c in {0.1, 0.3, 0.5, 0.7, 0.9}, with 5 different random labelings for each frequency.

Fig. 1. PU Classifier Comparison. Clearly, taking unlabeled examples as negative is suboptimal. Posonly cannot handle disjunctive concepts well, which is demonstrated by WebKB, where Person contains webpages from students, faculty and staff, and Other contains webpages from departments, courses, and research projects. Adjusting Aleph with the true label frequency c gives consistently good results, but Aleph can suffer from a bad estimate, e.g. on UW-CSE (Prof). TILDE, on the other hand, gives better results with the TIcER estimates. This can be explained by the estimate only adjusting the leaf probabilities: an underestimate makes it more likely that a leaf with at least 1 positive example classifies as positive, while all other leaves always classify as negative.

Methods. To evaluate whether knowing the label frequency makes learning easier, we compared seven PU classifiers: TILDE and Aleph adjusted with the TIcER label frequency estimate, TILDE and Aleph adjusted with the true label frequency, TILDE and Aleph taking unlabeled examples as negative (i.e. c = 1), and Aleph posonly (http://www.cs.ox.ac.uk/activities/machinelearning/aleph/aleph), Muggleton (1996)'s approach. TILDE and Aleph are adjusted by adjusting the leaf probabilities and changing the coverage score function, respectively. Standard settings were used for all classifiers, except for requiring a minimum of two examples per rule in posonly and allowing infinite noise and exploration in Aleph. k-fold cross-validation was applied, i.e. k − 1 folds were used to learn the classifier and the remaining fold to evaluate it. TIcER also needs folds for estimation: it used 1 fold for inducing a tree and the other k − 2 folds for bounding the label frequency. The classifiers are compared using the F1 score, and the average absolute error of the estimated c is reported.
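To make the leaf-probability adjustment used for TILDE concrete, and to illustrate the effect of an underestimated c described in the caption of Figure 1, here is a minimal numeric sketch in Python; it is illustrative only, not the actual TILDE code.

```python
# Small numeric sketch of the leaf-probability adjustment for TILDE
# (illustrative only, not the actual TILDE implementation). A leaf learned
# with unlabeled examples treated as negative predicts Pr(s = 1 | leaf);
# dividing by the (estimated) label frequency converts this into an
# estimate of Pr(y = 1 | leaf).

def adjusted_leaf_probability(num_labeled, num_total, c_estimate):
    """Convert a leaf's labeled fraction into an estimate of Pr(y = 1 | leaf)."""
    pr_labeled = num_labeled / num_total
    return min(1.0, pr_labeled / c_estimate)


if __name__ == "__main__":
    # A leaf with 2 labeled examples out of 20 has Pr(s = 1 | leaf) = 0.1.
    # With c = 0.30 the leaf stays negative (0.33 < 0.5), but an
    # underestimate such as c = 0.15 pushes it over the 0.5 threshold,
    # while a leaf without any labeled example always stays negative.
    print(adjusted_leaf_probability(2, 20, 0.30))  # ~0.33 -> negative
    print(adjusted_leaf_probability(2, 20, 0.15))  # ~0.67 -> positive
    print(adjusted_leaf_probability(0, 20, 0.15))  # 0.0  -> negative
```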

Table 1. Datasets

Dataset       #Examples   Class 1 (#)     Class 2 (#)      #Folds
IMDB          268         Actor (236)     Director (32)    5
Mutagenesis   230         Yes (138)       No (92)          5
UW-CSE        278         Student (216)   Professor (62)   5
WebKB         922         Person (590)    Other (332)      4

Fig. 2. Label Frequency Estimates. The blue line shows the average absolute error of the label frequency estimate. The red dots show the maximal proportion of true positives in the used subsets. Each dot is annotated with the maximum number of examples in a subset whose positive proportion was within 10% of the maximum. The estimate is expected to be good if a large enough subset with a high proportion of positives was found, and this is confirmed by the experiments. For example, the worst results, for UW-CSE (Prof), are explained by the low positive proportions.

Results. Knowing the label frequency indeed makes learning from PU data easier: the adjusted classifiers often outperform posonly, in particular when the positive concept is conjunctive (Figure 1). TIcER gives reasonable results most of the time. It performs worse when it fails to find subsets with a high ratio of positive examples or when the subsets contain few examples (Figure 2).

5 Conclusions

We showed that using the label frequency c in relational PU learning is very promising, and that the TIcER estimate for c approaches the real value well when enough data is available and when it can find highly positive subsets. We only evaluated a naive way to use c during learning: adjusting the output probabilities. We expect that methods that use c internally can further improve the results. We also expect the label frequency estimate to improve if a score function is used that explicitly looks for pure positive subsets.

Bibliography

J. Bekker and J. Davis. Estimating the class prior in positive and unlabeled data through decision tree induction. Under review.

H. Blockeel and L. De Raedt. Top-down induction of first-order logical decision trees. Artificial Intelligence, pages 285–297, 1998.

H. Blockeel, D. Page, and A. Srinivasan. Multi-instance tree learning. In Proceedings of the 22nd International Conference on Machine Learning, 2005.

M. Claesen, F. De Smet, P. Gillard, C. Mathieu, and B. De Moor. Building classifiers to predict the start of glucose-lowering pharmacotherapy using Belgian health expenditure data. arXiv preprint arXiv:1504.07389, 2015.

F. Denis, R. Gilleron, and F. Letouzey. Learning from positive and unlabeled examples. Theoretical Computer Science, pages 70–83, 2005.

M. C. du Plessis, G. Niu, and M. Sugiyama. Class-prior estimation for learning from positive and unlabeled data. Machine Learning, pages 1–30, 2015.

C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 213–220, 2008.

S. Jain, M. White, and P. Radivojac. Estimating the class prior and posterior from noisy positives and unlabeled data. In Advances in Neural Information Processing Systems, 2016.

E. McCreath and A. Sharma. ILP with noise and fixed example size: A Bayesian approach. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, pages 1310–1315.

S. Muggleton. Learning from positive data. In Selected Papers from the 6th International Workshop on Inductive Logic Programming, pages 358–376, 1996.

H. G. Ramaswamy, C. Scott, and A. Tewari. Mixture proportion estimation via kernel embedding of distributions. In Proceedings of the International Conference on Machine Learning, 2016.

S. Schoenmackers, J. Davis, O. Etzioni, and D. S. Weld. Learning first-order Horn clauses from web text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2010.

A. Srinivasan. The Aleph manual.

D. Zhang and W. S. Lee. A simple probabilistic approach to learning from positive and unlabeled examples. In Proceedings of the 5th Annual UK Workshop on Computational Intelligence, pages 83–87, 2005.