Mining first-order frequent patterns in the STULONG database

Jan Blaťák
KD Group at Faculty of Informatics, Masaryk University in Brno, Czech Republic
xblatak@fi.muni.cz

Abstract. We show how RAP can be employed to answer questions concerning relations in the STULONG database. RAP is an inductive logic programming system designed to search for long first-order frequent patterns and association rules. The STULONG database contains information about atherosclerosis in a population of middle-aged men. We define several types of background knowledge and outline the influence of their complexity on the mined patterns. We describe the rules found and explain their meaning.

1 Introduction

The main aims of the STULONG project were to identify atherosclerosis risk factors and examine their impact on health (especially with respect to atherosclerotic cardiovascular diseases), and to investigate the evolution of the identified risk factors and their influence on the cardiovascular system. The STULONG database contains multiple records of long-term observations of each patient. To process these data with attribute-value oriented systems, it is necessary to transform the data into a single relation. Such transformations are not trivial and may require special tools. Systems for mining relational data therefore appear attractive for analysing these data. Their attractiveness follows from their flexibility: it is relatively easy to define and integrate background knowledge, which can take the place of separate data transformation tools.

In this paper we demonstrate the use of the RAP [1] system for mining long first-order frequent patterns. A frequent pattern [3] is defined as a conjunction of literals which covers at least σ examples (records) from the database. The parameter σ is called the minimal frequency threshold and is given by the user. RAP searches for long frequent patterns in multirelational data and generates them in first-order logic.
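The definition of a frequent pattern can be illustrated with a minimal sketch. It is a propositional simplification and not RAP's implementation: each example is reduced to a set of ground literals, a pattern is a set of literals, and the toy patients, the literal names, and the helper names covers, frequency, and is_frequent are all hypothetical.

```python
# Simplified illustration of the frequent-pattern definition.
# Assumption: each example is a set of ground literals. Real first-order
# patterns contain variables and are evaluated against background knowledge.

def covers(pattern, facts):
    """A conjunction covers an example if every literal holds for it."""
    return all(literal in facts for literal in pattern)

def frequency(pattern, database):
    """Number of examples covered by the pattern."""
    return sum(1 for facts in database.values() if covers(pattern, facts))

def is_frequent(pattern, database, sigma):
    """A pattern is frequent if it covers at least sigma examples."""
    return frequency(pattern, database) >= sigma

# Hypothetical toy data: patient id -> literals that hold for the patient.
database = {
    1: {"married", "university", "sits_at_work"},
    2: {"married", "university"},
    3: {"single", "university", "sits_at_work"},
    4: {"married", "sits_at_work"},
}

pattern = {"married", "sits_at_work"}
print(frequency(pattern, database))
print(is_frequent(pattern, database, sigma=2))
```

With σ = 2 the toy pattern is frequent because it covers patients 1 and 4; raising σ to 3 would reject it.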
To find frequent patterns RAP offers several search strategies: depth-first, random, and best-first search. To process numerical attributes RAP provides built-in discretization of continuous values (the equal-frequency intervals algorithm). The main goal of our work was to explore and describe the influence of background knowledge complexity on the patterns found.

The structure of this paper is as follows. In Section 2 we describe the data and the transformations we used to prepare them for processing with RAP. We defined several types of background knowledge; they are described in Section 3. In Section 4 we explain how RAP can be used to process the STULONG database. The discovered knowledge is described in Section 5. The summary and conclusion are in Section 6.

2 Data

All interesting characteristics related to the Entry data are available at http://euromise.vse.cz/challenge2004. The Control data consist of 10,572 records of long-term observations of 1,226 patients. There are 332 patients who came down with some disease. The maximal number of examinations per patient without any disease being observed is 18. The average number of observations before some disease occurs is 7.60. For answering the Challenge questions related to the long-term observations, the numbers of patients from the RGI (risk group intervened) and RGC (risk group control) groups who suffer from some disease are important. In the Control data, 136 patients who came down with some disease belong to the RGI group, while 26 belong to the RGC group. Among healthy patients, 296 are in the RGI group and 89 in the RGC group.

We transformed the original data into a database of Prolog facts. We grouped attributes with similar meaning, and these groups of attributes were then used as arguments of new predicates. E.g. for the group of three attributes concerning smoking we generated a new predicate smoking(ICO,KOURENI,DOBAKOUR,BYVKURAK), where ICO is the identification number of a patient, KOURENI is the intensity of smoking, DOBAKOUR says how long the patient has been smoking, and BYVKURAK specifies how long an ex-smoker has not been smoking. In this transformation step we also created the simplest language definition and the corresponding background knowledge (further referred to as B0). It contains only literals which introduce the value of some attribute (e.g. cATTR(ICO,PORADK,#value), where ATTR is an attribute from the Control matrix, PORADK specifies the sequence number of the control examination of the patient identified by ICO, and #value is the introduced value).

3 Background knowledge

The simplest background knowledge, B0, contains only literals that, for a given patient's id (ICO), return the value of some feature from the input data matrix. This background knowledge also contains the predicate age(ICO,Rokvys,Age), which computes the age (variable Age) of the patient with identification number ICO in the given year Rokvys. Frequent patterns constructed over B0 are fully equivalent to patterns found by any propositional system. To analyze the influence of educational level, we defined the attribute EDUC (education reached) with values high and low. Similarly we defined the attribute SMOKINGYN (smoking yes/no) with values yes and no.
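The grouping of attributes into Prolog facts and the age predicate can be sketched as follows. This is only an illustration under stated assumptions: rows are taken to be dictionaries keyed by attribute name, the attribute values are made up, the helper names to_fact, smoking_fact, and age are hypothetical, and age is assumed to be the difference between the given year and the year of birth.

```python
# Sketch of turning one data row into a Prolog fact.
# Assumptions: dictionary rows, made-up values, hypothetical helper names.

def to_fact(predicate, *args):
    """Render a ground Prolog fact as text."""
    return f"{predicate}({','.join(str(a) for a in args)})."

def smoking_fact(row):
    """Group the three smoking attributes under one predicate."""
    return to_fact("smoking", row["ICO"], row["KOURENI"],
                   row["DOBAKOUR"], row["BYVKURAK"])

def age(birth_year, rokvys):
    """Age of a patient in the year rokvys (assumed: year difference)."""
    return rokvys - birth_year

row = {"ICO": 17, "KOURENI": 3, "DOBAKOUR": 10, "BYVKURAK": 0}
print(smoking_fact(row))
print(age(1928, 1976))
```

Keeping related attributes in one predicate means a single literal in a pattern can constrain all of them at once for the same patient.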

We also designed the background knowledge B1, which generates new constants of this kind (groups of attribute values, such as high and low education) automatically. In this background knowledge the argument #value is replaced by a list of values from the attribute's domain. E.g. for the attribute STAV (marital status) RAP can generate the pattern estav(ICO,[1,4]) with the meaning: the marital status of the patient identified by the value of the argument ICO is married or widower. Another way to obtain more general patterns is to use negation. We defined the background knowledge B2, which contains literals saying that a patient does not have some property (e.g. he is not married).

We performed experiments with all the mentioned types of background knowledge on the Control data as well. Evidently, the patterns obtained can be rather difficult to interpret: we are not able to say whether a patient changed his behaviour before or after he came down with some disease. Therefore we also designed background knowledge which respects the order of control examinations. It consists of predicates such as hascontrol, nextcontrol, and beforedisease, and is further referred to as B3.

4 Experiments

In our experiments we used several settings. These settings differ in the method used for searching the hypothesis space, in the definition of background knowledge, and in the set of initial constraints. We tested the following settings:

Classification: the user specifies the class. This method is appropriate for answering questions on dependencies between the value of the class attribute and other attributes. In fact, RAP constructs association rules with the class definition in the head. For example, we declare the attribute KOURENI (smoking) as the class and use attributes related to social factors to construct patterns for finding relations between social factors and smoking.

Frequent patterns with constraints: the user gives some frequent query (initial constraint), and RAP appends new literals to the end of this query. In this setting we can easily specify a priori conditions and delimit the subgroup of patients we are interested in. As an initial pattern we can use the set of frequent patterns found by RAP. E.g. we can look for dependencies in the subset of married graduated patients.

Classification with constraints: a combination of the first two methods. E.g. we can generate patterns for patients belonging to the normal group classified by values of the attribute KOURENI.

Association rules with constraints.

To process continuous values we used the built-in discretization. The minimal frequency threshold varied between 1% and 10% of the number of all (or selected) patients. We demonstrate the capabilities of RAP by answering questions about relations between social factors, activity after work, and smoking.
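Equal-frequency discretization and a relative frequency threshold can be sketched as below. This is a generic textbook version, not RAP's built-in routine: the function names, the sample ages, the choice of three intervals, and the rounding of the threshold upwards are all assumptions made for illustration.

```python
import math

def equal_frequency_bounds(values, bins):
    """Cut points that split the sorted values into `bins` groups of
    nearly equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // bins] for i in range(1, bins)]

def interval_index(value, bounds):
    """Index of the interval a value falls into (0 .. len(bounds))."""
    return sum(1 for b in bounds if value >= b)

def min_frequency(fraction, n_patients):
    """Absolute threshold sigma for a relative threshold (assumed: ceil)."""
    return math.ceil(fraction * n_patients)

# Hypothetical ages of nine patients, discretized into three intervals.
ages = [38, 40, 41, 43, 44, 45, 47, 48, 49]
bounds = equal_frequency_bounds(ages, bins=3)
print(bounds)
print([interval_index(a, bounds) for a in ages])

# Relative thresholds of 1% and 10% applied to 1,226 patients.
print(min_frequency(0.01, 1226), min_frequency(0.10, 1226))
```

Each interval in the toy example receives three of the nine values, which is the property that distinguishes equal-frequency from equal-width binning.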

5 Results

5.1 Analysis of entry examinations

We found that patients mainly sit at work if they are graduated, were born in or before 1930, and belong to the risk group (81%, 93 of 115), or if they are married managerial workers with a university education and belong to the risk group (92%, 69 of 75). Using the background knowledge with the new attribute Age, we found a pattern which says that 80% (44 of 55) of married men with a university education who belong to the normal group and entered the study before the age of 47 sit at work. For the class DOPRAVA (the means of transport a patient uses to get to work), 75% (15 of 20) of the patients from the pathologic group who are married, were born before 1928, and whose educational level is apprentice school use public transport to get to work.

We also looked for frequent patterns that concern all patients, no matter what the value of the attribute KONSUP (studied group of patients) is. One of them says that 89% (124 of 134) of graduated patients who are managerial workers mainly sit at work.

We observed no significant difference between the patterns found with simple and with complex background knowledge. Nor did we observe any influence of using initial constraints (we used a literal which determines the group of patients, i.e. the value of the attribute KONSUP). When we restricted the search to a selected group of patients, we also found patterns with high confidence and relatively high support.

Although the results achieved with the B0, B1, and B2 background knowledge are comparable, the use of B1 and B2 resulted in finding interesting rules which cannot be found in the original representation. E.g. we found a rule which says that 77% (107 of 139) of men who belong to the normal group, whose education level is secondary school or university, who are married, divorced, or widowers, and who are managerial or independent workers mainly sit at work. The subsets of values selected by RAP can be interpreted as a higher educational level reached by a patient who is not single.

We also generated association rules that satisfy given conditions on their left and right sides. We found relations between attributes that concern physical activity and social factors. From the frequent patterns found by RAP we generated association rules containing physical activity attributes in the head and attributes related to social factors in the body. We found that 75% (18 of 24) of patients from the pathologic group born before 1930 who are married and whose job responsibility is others show moderate activity after work and use public transport to get to work.

5.2 Analysis of long-term observations

For mining in the Control matrix we used the classification with constraints setting. The values RGI and RGC of the KONSUP attribute were used as the class attribute (the head of the generated association rule).
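Figures of the form "77% (107 of 139)" read as rule confidence: the number of patients satisfying both body and head, out of those satisfying the body. The sketch below shows this computation together with B1-style set-valued literals and B2-style negated literals. It is not RAP's code: the helper names, the toy patients, and the "sits" attribute are hypothetical, and only the STAV codes 1 (married) and 4 (widower) are taken from the text.

```python
def in_set(attribute, allowed):
    """B1-style literal: the attribute value is one of the listed values."""
    return lambda patient: patient[attribute] in allowed

def is_not(attribute, value):
    """B2-style literal: the attribute does not have the given value."""
    return lambda patient: patient[attribute] != value

def confidence(body, head, patients):
    """Return (patients matching body and head, patients matching body,
    their ratio)."""
    covered = [p for p in patients if all(lit(p) for lit in body)]
    hits = [p for p in covered if head(p)]
    ratio = len(hits) / len(covered) if covered else 0.0
    return len(hits), len(covered), ratio

# Hypothetical toy data; STAV code 2 stands for some other marital status.
patients = [
    {"STAV": 1, "sits": True},
    {"STAV": 4, "sits": True},
    {"STAV": 1, "sits": False},
    {"STAV": 2, "sits": True},
]

body = [in_set("STAV", {1, 4})]      # married or widower
head = lambda p: p["sits"]           # mainly sits at work
hits, covered, ratio = confidence(body, head, patients)
print(f"{ratio:.0%} ({hits} of {covered})")
```

The same function applies unchanged when the head is a class value such as RGI or RGC, which is how the class attribute appears in the rules of this section.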

RAP generated rules for healthy persons and for patients separately (using the classification with constraints setting). The confidence of a given rule was then computed for both classes. We looked for rules whose confidence values differ significantly between the two. Such rules express a difference between these two groups of men and can be further used, e.g., for classification. We observed that 73% (8 of 11) of patients who came down with some disease, who left their work for partial retirement, and who never changed their diet belong to the risk group intervened (RGI). Of the patients who never came down with any disease and who satisfy the same condition, 48% (10 of 21) belong to the risk group control (RGC).

We also swapped the roles of the class attribute and the initial condition. The patterns were generated separately for patients from the RGI and RGC groups, and the class was defined as yes if the patient came down with some disease and no otherwise. Besides many patterns with very small support, we also found a rule which says that 29% (2 of 7) of patients belonging to the group RGC who changed their diet to the value "he takes medicines for reduction of serum cholesterol" never came down with any disease. In the group RGI, 71% (15 of 21) of patients who changed their diet suffer from some disease.

This rule contains literals which introduce values of some attributes from the Control data without any regard to the order of examinations. Therefore, the correct reading of this pattern is that for each literal there is at least one control examination of the patient which satisfies the given condition. A patient changed the character of his work and also his diet, and this information potentially appears in different control examinations. It must be stressed that this rule cannot be obtained with propositional systems without creating new features. Nevertheless, these rules do not say anything about the order of the change of diet and of the character of work.
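The reading "for each literal there is at least one control examination" can be made concrete with a small sketch. It assumes each patient is represented by a list of examination records; PORADK is the examination sequence number from the data, while the attribute names diet_changed and work_changed and the helper functions are hypothetical.

```python
# Assumption: one dictionary per control examination of a single patient.

def holds_somewhere(examinations, attribute, value):
    """A literal is satisfied if at least one examination has the value."""
    return any(exam.get(attribute) == value for exam in examinations)

def pattern_holds(examinations, literals):
    """Each literal may be satisfied by a different examination."""
    return all(holds_somewhere(examinations, a, v) for a, v in literals)

examinations = [
    {"PORADK": 1, "diet_changed": False, "work_changed": True},
    {"PORADK": 2, "diet_changed": True, "work_changed": False},
]
literals = [("diet_changed", True), ("work_changed", True)]

# The pattern holds although no single examination satisfies both literals.
print(pattern_holds(examinations, literals))
same_exam = any(e["diet_changed"] and e["work_changed"] for e in examinations)
print(same_exam)

# Reversing the examinations gives the same answer: order is not captured.
print(pattern_holds(list(reversed(examinations)), literals))
```

A single-table propositional encoding would need a derived feature such as "diet ever changed" to express the same condition, and neither form distinguishes which change came first.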
Therefore we defined the B3 background knowledge, which takes into account temporal relations between events. The patterns obtained now contain information about the order of changes in the patients' behaviour. One of them says that 80% (4 of 5) of patients who came down with some disease and who changed their diet without any change of work character in the period of the last two examinations belong to the group RGC. For healthy men the opposite holds: 13% (2 of 15) of healthy men that fulfil the rule condition belong to the group RGC.

6 Conclusion

ILP techniques were also used for mining the STULONG data in last year's Discovery Challenge, when Van Assche et al. [4] applied the ILP system Tilde. The main difference between RAP and Tilde is that Tilde constructs regression or classification trees, while RAP generates frequent patterns and association rules. Moreover, RAP provides many options which can significantly affect its behaviour. RAP is also able to find dependencies that hold only for a small subset of the learning data.

Some results of analysing the STULONG database with association rules can be found in [2], but there the attribute-value oriented approach was used.

In this paper we presented the use of the system RAP for mining in the STULONG database. We defined different types of background knowledge. We showed that RAP can easily be used for analysing both propositional and multirelational data. We also showed that temporal aspects of the data can be incorporated into background knowledge and then successfully used for mining temporal rules.

Acknowledgments

The study (STULONG) was realized at the 2nd Department of Medicine, 1st Faculty of Medicine of Charles University and Charles University Hospital, U nemocnice 2, Prague 2 (head: Prof. M. Aschermann, MD, SDr, FESC), under the supervision of Prof. F. Boudík, MD, ScD, with the collaboration of M. Tomečková, MD, PhD and Ass. Prof. J. Bultas, MD, PhD. The data were transferred to electronic form by the European Centre of Medical Informatics, Statistics and Epidemiology of Charles University and Academy of Sciences (head: Prof. RNDr. J. Zvárová, DrSc). The data resource is on the web pages http://euromise.vse.cz/challenge2004. At present the data analysis is supported by the grant of the Ministry of Education CR Nr LN 00B 107.

I am grateful to Luboš Popelínský for his comments. This work has been partially supported by the Czech Ministry of Education under the Grant no. 143300003.

References

1. Blaťák J. and Popelínský L. Feature construction with RAP. In Horváth T. and Yamamoto A., editors, Proceedings of the Work-in-Progress Track at the 13th International Conference on Inductive Logic Programming, pages 1-11. University of Szeged, September 29 - October 1, 2003.
2. Burian J. and Rauch J. Analysis of death causes in the STULONG data set. In Berka P., Rauch J., and Tsumoto S., editors, Proceedings of the ECML/PKDD 2003 Workshop on Discovery Challenge, pages 47-58, 2003.
3. Mannila H. and Toivonen H. An algorithm for finding all interesting sentences. In Proceedings of the 6th International Conference on Database Theory, pages 215-229, 1996.
4. Van Assche A., Verbaeten S., Krzywania D., Struyf J., and Blockeel H. Attribute-value and first order data mining within the STULONG project. In Berka P., Rauch J., and Tsumoto S., editors, Proceedings of the ECML/PKDD 2003 Workshop on Discovery Challenge, pages 108-119, 2003.