Mining first-order frequent patterns in the STULONG database

Jan Blaťák
KD Group at Faculty of Informatics, Masaryk University in Brno, Czech Republic
xblatak@fi.muni.cz

Abstract. We show how RAP can be employed to answer questions concerning relations in the STULONG database. RAP is an inductive logic programming system designed to search for long first-order frequent patterns and association rules. The STULONG database contains information about atherosclerosis in a population of middle-aged men. We define several types of background knowledge and outline the influence of their complexity on the mined patterns. We describe the rules found and explain their meaning.

1 Introduction

The main aim of the STULONG project was to identify atherosclerosis risk factors and examine their impact on health (especially with respect to atherosclerotic cardiovascular diseases), and to investigate the evolution of the risk factors found and their influence on the cardiovascular system.

The STULONG database contains multiple records of long-term observations of each patient. To process these data with attribute-value oriented systems, it is necessary to transform them into a single relation. Such transformations are not trivial and may require special tools. Systems for mining relational data therefore appear attractive for analysing these data. Their attractiveness follows from their flexibility: it is relatively easy to define and integrate background knowledge, which can take the place of separate data-transformation systems.

In this paper we demonstrate the use of the RAP [1] system for mining long first-order frequent patterns. A frequent pattern [3] is defined as a conjunction of literals which covers at least σ examples (records) from the database. The parameter σ is called the minimal frequency threshold and is given by the user. RAP is a system for searching for long frequent patterns in multirelational data; it generates frequent patterns in first-order logic.

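
The definition of a frequent pattern can be illustrated with a minimal propositional sketch. It is not RAP's implementation: literals are modelled as plain Python predicates over a single record, the toy records and their attribute values are hypothetical, and σ is assumed to be given as an absolute count.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Literal = Callable[[Record], bool]

def covers(pattern: List[Literal], record: Record) -> bool:
    """A conjunction covers a record when every literal holds for it."""
    return all(literal(record) for literal in pattern)

def frequency(pattern: List[Literal], records: List[Record]) -> int:
    """Number of records covered by the conjunction."""
    return sum(1 for r in records if covers(pattern, r))

def is_frequent(pattern: List[Literal], records: List[Record], sigma: int) -> bool:
    """True when the pattern covers at least sigma records."""
    return frequency(pattern, records) >= sigma

# Hypothetical toy records; the values are illustrative, not STULONG codes.
records = [
    {"ICO": 1, "STAV": "married", "KOURENI": "none"},
    {"ICO": 2, "STAV": "married", "KOURENI": "heavy"},
    {"ICO": 3, "STAV": "single", "KOURENI": "none"},
    {"ICO": 4, "STAV": "married", "KOURENI": "none"},
]
married = lambda r: r["STAV"] == "married"
non_smoker = lambda r: r["KOURENI"] == "none"
pattern = [married, non_smoker]
```

In this propositional setting, adding a literal can only shrink the set of covered records, so a pattern that is not frequent cannot be extended into a frequent one; search procedures for frequent patterns commonly use this property for pruning.
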
To find frequent patterns, RAP uses several search strategies: depth-first, random, and best-first search. To process numerical attributes, RAP provides built-in discretization of continuous values (the equal-frequency intervals algorithm). The main goal of our work was to explore and describe the influence of background knowledge complexity on the patterns found.

The structure of this paper is as follows. In Section 2 we describe the data and the transformations we used to prepare them for processing with RAP. We defined several types of background knowledge; they are described in Section 3. In Section 4 we explain how RAP can be used to process the STULONG database. The discovered knowledge is described in Section 5. The summary and conclusion are in Section 6.

2 Data

All interesting characteristics related to the Entry data are available at http://euromise.vse.cz/challenge2004. The Control data consist of 10,572 records of long-term observations of 1,226 patients. There are 332 patients who came down with some disease. The maximal number of examinations per patient without any disease being observed is 18. The average number of observations before a disease occurs is 7.60.

For answering the Challenge questions related to the long-term observations, the number of patients from the RGI ("risk group intervened") and RGC ("risk group control") groups who suffer from some disease is important. In the Control data, 136 patients who came down with some disease belong to the RGI group, while 26 belong to the RGC group. Among healthy patients, there are 296 in the RGI group and 89 in the RGC group.

We transformed the original data into a database of Prolog facts. We grouped attributes with similar meaning, and these groups of attributes were then used as arguments of new predicates. For example, for the group of three attributes concerning smoking we generated a new predicate smoking(ICO,KOURENI,DOBAKOUR,BYVKURAK), where ICO is the identification number of a patient, KOURENI is the intensity of smoking, DOBAKOUR says how long the patient has been smoking, and BYVKURAK specifies how long an ex-smoker has not been smoking. In this transformation step we also created the simplest language definition and the corresponding background knowledge (further referred to as B0). It contains only literals which introduce the value of some attribute (e.g. cATTR(ICO,PORADK,#value), where ATTR is an attribute from the Control matrix, PORADK specifies the sequence number of the control examination of the patient identified by ICO, and #value is the introduced value).

3 Background knowledge

The simplest background knowledge, B0, contains only literals that, for a given patient's id (ICO), return the value of some feature from the input data matrix. This background knowledge also contains the predicate age(ICO,Rokvys,Age), which computes the age (variable Age) of the patient with identification number ICO in the given year Rokvys. Frequent patterns constructed over B0 are fully equivalent to patterns found by any propositional system. To analyze the influence of educational level, we defined the attribute EDUC (education reached) with values high and low. Similarly, we defined the attribute SMOKINGYN (smoking yes/no) with values yes and no.
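
The transformation into facts and the B0 literals can be pictured with the following sketch. It is only an illustration in Python rather than the Prolog database the paper describes: facts are modelled as tuples, the sample row values are hypothetical, the "e" prefix for Entry attributes follows the naming pattern of the examples in the text, and the `birth_year` argument of `age` is an assumed input, since the source does not say which attribute holds the year of birth.

```python
from typing import Dict, List, Tuple

Fact = Tuple  # (predicate_name, arg1, arg2, ...)

# The attribute group named in the text; other groups would be added the same way.
GROUPS: Dict[str, List[str]] = {
    "smoking": ["KOURENI", "DOBAKOUR", "BYVKURAK"],
}

def grouped_facts(row: Dict[str, object]) -> List[Fact]:
    """One fact per attribute group, with the patient id ICO as first argument."""
    return [(name, row["ICO"], *(row[a] for a in attrs))
            for name, attrs in GROUPS.items()]

def value_facts(row: Dict[str, object], attrs: List[str], prefix: str = "e") -> List[Fact]:
    """B0-style literals: one fact per attribute, introducing its value."""
    return [(prefix + a, row["ICO"], row[a]) for a in attrs]

def age(birth_year: int, rokvys: int) -> int:
    """Age in the year rokvys, assuming the year of birth is known."""
    return rokvys - birth_year

# Hypothetical row; the numeric codes are placeholders, not real STULONG values.
row = {"ICO": 7, "KOURENI": 3, "DOBAKOUR": 10, "BYVKURAK": 0}
smoking_facts = grouped_facts(row)   # a single smoking(...) fact for patient 7
```

Keeping the group definitions in one table means a new predicate only needs a new dictionary entry, which mirrors the flexibility the text attributes to defining background knowledge.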

We also designed the background knowledge B1, which generates such new constants automatically. In this background knowledge the argument #value is replaced by a list of values from the attribute's domain. For example, for the attribute STAV (marital status) RAP can generate the pattern eSTAV(ICO,[1,4]) with the meaning: the marital status of the patient identified by the value of the argument ICO is married or widower.

Another way to obtain more general patterns is to use negation. We defined the background knowledge B2, which contains literals saying that a patient does not have some property (e.g. he is not married).

We performed experiments with all the mentioned types of background knowledge on the Control data as well. Evidently, it can be rather difficult to interpret the patterns obtained: we are not able to say whether a patient changed his behaviour before or after he came down with some disease. Therefore we also designed background knowledge which respects the order of control examinations. It consists of predicates such as hascontrol, nextcontrol, and beforedisease, and it is further referred to as B3.

4 Experiments

In our experiments we used several settings. These settings differ in the method used for searching the hypothesis space, the definition of background knowledge, and the set of initial constraints. We tested the following settings:

Classification: the user specifies the class. This method is appropriate for answering questions about dependencies between the value of the class attribute and other attributes. In fact, RAP constructs association rules with the class definition in the head. For example, we declare the attribute KOURENI (smoking) as the class and use attributes related to social factors to construct patterns for finding relations between social factors and smoking.

Frequent patterns with constraints: the user gives some frequent query (initial constraint), and RAP appends new literals to the end of this query. In this setting we can easily specify a priori conditions and delimit the subgroup of patients we are interested in. As an initial pattern we can use the set of frequent patterns found by RAP. For example, we can look for dependencies in the subset of married graduated patients.

Classification with constraints: a combination of the first two methods. For example, we can generate patterns for patients belonging to the normal group, classified by values of the attribute KOURENI.

Association rules with constraints.

To process continuous values we used the built-in discretization. The value of the minimal frequency threshold varied between 1% and 10% of the number of all (or selected) patients. We demonstrate the capabilities of RAP by answering questions about relations between social factors, activity after work, and smoking.
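
The two numeric choices mentioned above, equal-frequency discretization and a frequency threshold given as a percentage, can be sketched as follows. This shows one common way to form equal-frequency intervals and is not necessarily what RAP's built-in discretization does; the function names are hypothetical, and rounding the threshold up is an assumption.

```python
import math
from typing import List

def equal_frequency_cuts(values: List[float], bins: int) -> List[float]:
    """Cut points chosen so each interval holds roughly len(values)/bins values."""
    ordered = sorted(values)
    n = len(ordered)
    cuts: List[float] = []
    for k in range(1, bins):
        cut = ordered[(k * n) // bins]
        if not cuts or cut > cuts[-1]:   # skip repeated cut points caused by ties
            cuts.append(cut)
    return cuts

def interval_index(x: float, cuts: List[float]) -> int:
    """Index of the interval containing x; interval i starts at cuts[i - 1]."""
    return sum(1 for c in cuts if x >= c)

def min_frequency(percent: float, n_examples: int) -> int:
    """Absolute threshold sigma for a percentage of examples (rounded up, by assumption)."""
    return math.ceil(percent * n_examples / 100)

cuts = equal_frequency_cuts([1, 2, 3, 4, 5, 6, 7, 8, 9], bins=3)
sigma_5_percent = min_frequency(5, 1226)   # 5% of the 1,226 patients in the Control data
```

With 1,226 patients, a threshold between 1% and 10% corresponds to roughly 13 to 123 covered patients under this rounding.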

5 Results

5.1 Analysis of entry examinations

We found that patients mainly sit at work if they are graduates, were born in or before 1930, and belong to the risk group (81%, 93 of 115), or if they are married managerial workers with a university education and belong to the risk group of patients (92%, 69 of 75). By using background knowledge with the new attribute Age, we found a pattern which says that 80% (44 of 55) of married men with a university education who belong to the normal group of patients and who entered the study before the age of 47 sit in their job. For the class DOPRAVA (the means of transport a patient uses to get to work), 75% (15 of 20) of patients from the pathologic group who are married, were born before 1928, and whose educational level is apprentice school use public transport to get to work.

We also looked for frequent patterns that concern all patients, no matter what the value of the attribute KONSUP (studied group of patients) is. One of them says that 89% (124 of 134) of graduated patients who are managerial workers mainly sit at work.

We observed no significant difference between the patterns found with simple and with complex background knowledge. Nor did we observe any influence of using initial constraints (we used a literal which determines the group of patients, i.e. the value of the attribute KONSUP). When we restricted the search to a selected group of patients, we also found patterns with high confidence and relatively high support.

Although the results achieved with the B0, B1, and B2 background knowledge are comparable, the use of B1 and B2 resulted in finding interesting rules which cannot be found in the original representation. For example, we found a rule which says that 77% (107 of 139) of men who belong to the normal group of patients, whose education level is secondary school or university, who are married, divorced, or widowers, and who are managerial or independent workers mainly sit at work. The subsets of values selected by RAP can be interpreted as a higher educational level reached by a patient who is not single.

We also generated association rules that satisfy a given condition on the left and on the right side. We found relations between attributes that concern physical activity and social factors. From the frequent patterns found by RAP we generated association rules containing physical activity attributes in the head and attributes related to social factors in the body. We found that 75% (18 of 24) of patients from the pathologic group born before 1930 who are married and whose job responsibility is "others" show moderate activity after work and use public transport to get to work.

5.2 Analysis of long-term observations

For mining in the Control matrix we used the classification with constraints setting. The values RGI and RGC of the KONSUP attribute were used as the class attribute (the head of the generated association rule).
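
Figures such as "75% (18 of 24)" report the confidence of a rule together with the counts behind it: 24 examples satisfy the body and 18 of them also satisfy the head. The sketch below computes these quantities for a rule split into body and head literals. It is a propositional illustration with hypothetical toy records, not RAP's rule generator.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Literal = Callable[[Record], bool]

def rule_stats(body: List[Literal], head: List[Literal],
               records: List[Record]) -> Tuple[int, int, float]:
    """Return (support, body_count, confidence) for the rule body -> head."""
    body_count = 0   # records satisfying the body
    support = 0      # records satisfying both body and head
    for r in records:
        if all(lit(r) for lit in body):
            body_count += 1
            if all(lit(r) for lit in head):
                support += 1
    confidence = support / body_count if body_count else 0.0
    return support, body_count, confidence

# Hypothetical toy records; the values are illustrative, not STULONG codes.
records = [
    {"STAV": "married", "DOPRAVA": "public"},
    {"STAV": "married", "DOPRAVA": "public"},
    {"STAV": "married", "DOPRAVA": "public"},
    {"STAV": "married", "DOPRAVA": "car"},
    {"STAV": "single", "DOPRAVA": "public"},
]
body = [lambda r: r["STAV"] == "married"]
head = [lambda r: r["DOPRAVA"] == "public"]
support, body_count, confidence = rule_stats(body, head, records)
```

Reporting the body count next to the percentage, as the results in this section do, makes it visible when a high confidence rests on only a handful of patients.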

RAP generated rules for healthy persons and for patients separately (we used the classification with constraints setting). Then the confidence of a given rule was computed for both classes. We looked for rules for which the confidence values differ significantly. Such rules express a difference between these two groups of men and can be further used, e.g., for classification. We observed that 73% (8 of 11) of patients who came down with some disease, who left their work for partial retirement, and who never changed their diet belong to the risk group intervened (RGI). Of the patients who never came down with any disease and who satisfy the given condition, 48% (10 of 21) belong to the risk group control (RGC).

We also swapped the roles of the class attribute and the initial condition. The patterns were generated separately for patients from the RGI and RGC groups, and the class was defined as yes if the patient came down with some disease and no otherwise. Besides many patterns with very small support, we also found a rule which says that 29% (2 of 7) of patients belonging to the RGC group who changed their diet to the value "he takes medicines for reduction of serum cholesterol" never came down with any disease. In the RGI group, 71% (15 of 21) of patients who changed their diet suffer from some disease.

This rule contains literals which introduce values of some attributes from the Control data without any regard to the order of examinations. Therefore, the correct reading of this pattern is that for each literal there is at least one control examination of the patient which satisfies the given condition. A patient changed the character of his work and also his diet, and this information may appear in different control examinations. It must be stressed that this rule cannot be obtained with propositional systems without creating new features. Nevertheless, these rules do not say anything about the order of the change of diet and the change of the character of work.

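
The difference between these readings can be made concrete with a small sketch. The first function implements the order-free reading described above, the second the stricter single-examination reading, and the third an order-aware check of the kind that background knowledge respecting examination order (B3 in Section 3) is meant to support. This is an illustration only, not the definition of the B3 predicates; PORADK is the examination sequence number from the text, while the fields `diet_changed` and `work_changed` are hypothetical.

```python
from typing import Callable, Dict, List

Exam = Dict[str, object]            # one control examination of a patient
Cond = Callable[[Exam], bool]

def holds_somewhere(exams: List[Exam], conds: List[Cond]) -> bool:
    """Order-free reading: each condition holds in at least one examination."""
    return all(any(c(e) for e in exams) for c in conds)

def holds_in_one_exam(exams: List[Exam], conds: List[Cond]) -> bool:
    """Single-row reading: one examination satisfies all conditions at once."""
    return any(all(c(e) for c in conds) for e in exams)

def first_before(exams: List[Exam], first: Cond, second: Cond) -> bool:
    """Order-aware reading: `first` holds in an examination strictly earlier
    than some examination in which `second` holds."""
    ordered = sorted(exams, key=lambda e: e["PORADK"])
    for i, e in enumerate(ordered):
        if first(e):
            # The earliest match for `first` leaves the most room for `second`.
            return any(second(later) for later in ordered[i + 1:])
    return False

# Hypothetical examinations of one patient.
exams = [
    {"PORADK": 1, "diet_changed": False, "work_changed": True},
    {"PORADK": 2, "diet_changed": True, "work_changed": False},
]
diet = lambda e: e["diet_changed"]
work = lambda e: e["work_changed"]
```

For this patient the order-free reading is satisfied although no single examination records both changes, and only the order-aware check distinguishes "work changed before diet" from the reverse.
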
Therefore we defined the B3 background knowledge, which takes into account temporal relations between events. The patterns obtained now contain information about the order of changes in the patients' behaviour. One such pattern says that 80% (4 of 5) of patients who came down with some disease and who changed their diet without any change of work character in the period of the last two examinations belong to the RGC group. For healthy men the opposite holds: 13% (2 of 15) of healthy men who fulfil the rule condition belong to the RGC group.

6 Conclusion

ILP techniques for mining the STULONG data were also used in last year's Discovery Challenge, when Van Assche et al. [4] applied the ILP system Tilde. The main difference between RAP and Tilde lies in the fact that Tilde constructs regression or classification trees, while RAP generates frequent patterns and association rules. Moreover, RAP provides many options which can significantly affect its behaviour. RAP is also able to find dependencies that hold only for a small subset of the learning data.

Some results of analysing the STULONG database with association rules can be found in [2], but there the attribute-value oriented approach was used.

In this paper we presented the use of the RAP system for mining the STULONG database. We defined different types of background knowledge. We showed that RAP can easily be used for analysing both propositional and multirelational data. We also showed that temporal aspects of the data can be incorporated into background knowledge and then successfully used for mining temporal rules.

Acknowledgments

The study (STULONG) was realized at the 2nd Department of Medicine, 1st Faculty of Medicine of Charles University and Charles University Hospital, U nemocnice 2, Prague 2 (head: Prof. M. Aschermann, MD, SDr, FESC), under the supervision of Prof. F. Boudík, MD, ScD, in collaboration with M. Tomečková, MD, PhD and Ass. Prof. J. Bultas, MD, PhD. The data were transferred to electronic form by the European Centre of Medical Informatics, Statistics and Epidemiology of Charles University and Academy of Sciences (head: Prof. RNDr. J. Zvárová, DrSc). The data resource is on the web pages http://euromise.vse.cz/challenge2004. At present the data analysis is supported by the grant of the Ministry of Education CR Nr LN 00B 107.

I am grateful to Luboš Popelínský for his comments. This work has been partially supported by the Czech Ministry of Education under Grant no. 143300003.

References

1. Blaťák J. and Popelínský L. Feature construction with RAP. In Horváth T. and Yamamoto A., editors, Proceedings of the Work-in-Progress Track at the 13th International Conference on Inductive Logic Programming, pages 1-11. University of Szeged, September 29 - October 1, 2003.
2. Burian J. and Rauch J. Analysis of death causes in the STULONG data set. In Berka P., Rauch J., and Tsumoto S., editors, Proceedings of the ECML/PKDD 2003 Workshop on Discovery Challenge, pages 47-58, 2003.
3. Mannila H. and Toivonen H. An algorithm for finding all interesting sentences. In Proceedings of the 6th International Conference on Database Theory, pages 215-229, 1996.
4. Van Assche A., Verbaeten S., Krzywania D., Struyf J., and Blockeel H. Attribute-value and first order data mining within the STULONG project. In Berka P., Rauch J., and Tsumoto S., editors, Proceedings of the ECML/PKDD 2003 Workshop on Discovery Challenge, pages 108-119, 2003.