Formal Framework for Data Mining with Association Rules and Domain Knowledge
Jan Rauch
Faculty of Informatics and Statistics, University of Economics, Prague, Czech Republic

Principles
o Association rule $\varphi \approx \psi$: a couple of related general Boolean attributes $\varphi$, $\psi$ derived from columns of a data matrix
o Dealing with formalized items of domain knowledge
  o BMI $\uparrow\uparrow$ Diastolic: if BMI increases then Diastolic blood pressure increases too
  o Item $\Omega$ mapped to the set Cons($\Omega$) of rules, the consequences of $\Omega$
o FOFRADAR - FOrmal FRAme for Data mining with Association Rules - an enhanced logical calculus of association rules
o FOFRADAR used to describe the data mining process
o Results of data mining
  o conclusion: There are no consequences of $\Omega$ in given data
  o conclusion: There are a lot of consequences of $\Omega$ in given data
  o conclusion: Start additional analysis with the goal
  o interesting rules

Goal and Acknowledgement
o The goal is to present theoretical aspects of FOFRADAR.
o This paper is based on the logical calculus of association rules and on experiments with the 4ft-Miner procedure.
o The 4ft-Miner procedure is a part of the LISp-Miner system, see http://lispminer.vse.cz/ and References.

Structure
o Data matrices, Boolean attributes and association rules
o Logical calculus of association rules, deduction rules
o FOFRADAR and CRISP-DM
  o FOFRADAR overview
  o Selected details
o Describing data mining process with FOFRADAR

Data matrix and basic Boolean attributes

Data matrix M (columns = attributes) with derived basic Boolean attributes:

object    Sex  Status    BMI   Sex(F)  Status(single, widowed)  BMI(23,24,25)
o_1       F    single    25    1       1                        1
o_2       M    married   36    0       0                        0
...
o_{n-1}   M    married   22    0       0                        0
o_n       F    widowed   28    1       1                        0

o Attribute = column of a data matrix: A, e.g. Sex, Status, BMI
o Category = a possible value of the attribute, e.g. F, M, single, married
o Basic Boolean attribute: $A(\alpha)$, where $\alpha$ is the coefficient of $A(\alpha)$, i.e. a subset of the set of possible categories of A
o $A(\alpha)$ is true for object $o_i$, written $A(\alpha)[o_i] = 1$, iff $A(o_i) \in \alpha$

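The evaluation rule for a basic Boolean attribute can be sketched in a few lines of Python. This is only an illustration: the function name `basic_boolean` and the dictionary layout of a row are assumptions made here, and the three rows are taken from the example data matrix.

```python
from typing import Dict, List, Set

Row = Dict[str, object]


def basic_boolean(attribute: str, coefficient: Set[object], row: Row) -> int:
    """Return 1 if the row's category of `attribute` lies in `coefficient`, else 0."""
    return 1 if row[attribute] in coefficient else 0


# Three rows of the example data matrix (objects o_1, o_2 and o_{n-1}).
rows: List[Row] = [
    {"Sex": "F", "Status": "single", "BMI": 25},
    {"Sex": "M", "Status": "married", "BMI": 36},
    {"Sex": "M", "Status": "married", "BMI": 22},
]

sex_f = [basic_boolean("Sex", {"F"}, r) for r in rows]
status_sw = [basic_boolean("Status", {"single", "widowed"}, r) for r in rows]
bmi_23_25 = [basic_boolean("BMI", {23, 24, 25}, r) for r in rows]
```

Each resulting list is one derived 0/1 column of the data matrix.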
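Additional examples of Boolean attributes
o BMI: 13 categories
o Coefficients can be intervals, i.e. sets of consecutive categories, e.g. BMI(22, 23, 24)
o Derived Boolean attributes are built from basic Boolean attributes with the connectives $\wedge$, $\vee$, $\neg$, for example:
  o Sex(F) $\wedge$ BMI(22,23,24)
  o Status(single, widowed) $\vee$ Education(basic)
  o Status(single, widowed) $\wedge$ (Education(basic) $\vee$ BMI(22,23,24))
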
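Association rule
$\varphi \approx \psi$, where $\varphi$ and $\psi$ are Boolean attributes and $\approx$ is a 4ft-quantifier

4ft-table $4ft(\varphi, \psi, M)$ of $\varphi$ and $\psi$ in M:

M               $\psi$   $\neg\psi$
$\varphi$       a        b
$\neg\varphi$   c        d

Associated function $F_{\approx}(a,b,c,d)$ of $\approx$:
o $F_{\approx}(a,b,c,d) = 1$: $\varphi \approx \psi$ is true in M
o $F_{\approx}(a,b,c,d) = 0$: $\varphi \approx \psi$ is false in M
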
Association rule examples (1)

M               $\psi$   $\neg\psi$
$\varphi$       a        b
$\neg\varphi$   c        d

Founded implication $\Rightarrow_{0.9,50}$:
$F_{\Rightarrow_{0.9,50}}(a,b,c,d) = 1$ iff $\frac{a}{a+b} \geq 0.9 \wedge a \geq 50$

$\varphi \Rightarrow_{0.9,50} \psi$: At least 90 per cent of rows of M satisfying $\varphi$ satisfy also $\psi$, and there are at least 50 rows satisfying both $\varphi$ and $\psi$.

Example: BMI(> 35) $\wedge$ Age(> 75) $\Rightarrow_{0.9,50}$ Infarction(yes) $\vee$ Diabetes(yes)
At least 90 per cent of patients (i.e. rows of M) with BMI > 35 and age > 75 have infarction or diabetes, and in M there are at least 50 patients with BMI > 35 and age > 75 who have infarction or diabetes.

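As a minimal sketch, the 4ft-table and the founded implication test can be computed from two 0/1 columns as follows. The function names and the example columns are illustrative assumptions, not part of 4ft-Miner; the inequality a/(a+b) >= p is written as a >= p(a+b) to avoid division by zero.

```python
from typing import Sequence, Tuple


def four_fold_table(phi: Sequence[int], psi: Sequence[int]) -> Tuple[int, int, int, int]:
    """Count the frequencies (a, b, c, d) of the 4ft-table of phi and psi."""
    a = b = c = d = 0
    for p, s in zip(phi, psi):
        if p and s:
            a += 1
        elif p:
            b += 1
        elif s:
            c += 1
        else:
            d += 1
    return a, b, c, d


def founded_implication(a: int, b: int, c: int, d: int,
                        p: float = 0.9, base: int = 50) -> bool:
    """True iff a/(a+b) >= p and a >= base (c and d are not used)."""
    return a + b > 0 and a >= p * (a + b) and a >= base


# Hypothetical columns: 60 rows satisfy phi, 55 of them also satisfy psi.
phi = [1] * 60 + [0] * 40
psi = [1] * 55 + [0] * 5 + [1] * 10 + [0] * 30
table = four_fold_table(phi, psi)
holds = founded_implication(*table)
```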
Association rule examples (2)

M               $\psi$   $\neg\psi$
$\varphi$       a        b
$\neg\varphi$   c        d

Above average $\Rightarrow^{+}_{0.3,50}$:
$F_{\Rightarrow^{+}_{0.3,50}}(a,b,c,d) = 1$ iff $\frac{a}{a+b} \geq (1 + 0.3)\,\frac{a+c}{a+b+c+d} \wedge a \geq 50$

$\varphi \Rightarrow^{+}_{0.3,50} \psi$: Among the rows satisfying $\varphi$ there are at least 30 per cent more rows satisfying $\psi$ than among all rows of M, and there are at least 50 rows satisfying both $\varphi$ and $\psi$.

Example: Sex(M) $\wedge$ Status(married) $\Rightarrow^{+}_{0.3,50}$ BMI(> 35)
Among the married male patients (i.e. rows of M) there are at least 30 per cent more patients with BMI > 35 than among all patients in M, and there are at least 50 married male patients with BMI > 35.

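The above average quantifier can be sketched in the same way. The function name and the two frequency quadruples are hypothetical; the inequality is cross-multiplied so that no division is needed.

```python
def above_average(a: int, b: int, c: int, d: int,
                  p: float = 0.3, base: int = 50) -> bool:
    """True iff a/(a+b) >= (1+p) * (a+c)/(a+b+c+d) and a >= base."""
    n = a + b + c + d
    if a + b == 0:
        return False
    # Cross-multiplied form of the inequality; a+b and n are positive here.
    return a * n >= (1 + p) * (a + c) * (a + b) and a >= base


# Hypothetical 4ft-tables:
# relative frequency of psi is 0.6 under phi and 0.25 in the whole matrix
strong = above_average(60, 40, 40, 260)
# relative frequency of psi is 0.6 under phi but 0.75 in the whole matrix
weak = above_average(60, 40, 240, 60)
```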
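Logical calculus of association rules
Language: Attributes + Boolean attributes + Association rules + Evaluation of rules in data matrices
o Attributes: $A_1, \ldots, A_K$, e.g. Age, Status, BMI, Diastolic
o Basic Boolean attributes: $A(\alpha)$, $\alpha$ a subset of possible values, e.g. Status(single, widowed)
o Boolean attributes: $A_1(\alpha_1) \wedge A_2(\alpha_2)$, $A_3(\alpha_3) \vee A_4(\alpha_4)$, e.g. BMI$(22;25\rangle$ $\wedge$ Education(basic)
o Association rules: $\varphi \approx \psi$, e.g. Status(married) $\wedge$ Education(basic) $\Rightarrow_{0.9,50}$ BMI(>27)
o $Val(\varphi \approx \psi, M) = F_{\approx}(a,b,c,d)$ (1 = true, 0 = false), where a, b, c, d are the frequencies of the 4ft-table:

M               $\psi$   $\neg\psi$
$\varphi$       a        b
$\neg\varphi$   c        d
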
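Deduction rules in the logical calculus of association rules
o $\frac{\varphi \approx \psi}{\varphi' \approx \psi'}$ is correct if for each data matrix M: if $Val(\varphi \approx \psi, M) = 1$ then $Val(\varphi' \approx \psi', M) = 1$
o There are criteria of correctness of $\frac{\varphi \approx \psi}{\varphi' \approx \psi'}$ for important 4ft-quantifiers

Simple examples:
o Correct deduction rule: from Status(married) $\Rightarrow_{0.9,50}$ BMI(28,29,30) infer Status(married) $\Rightarrow_{0.9,50}$ BMI(28,29,30,31,32)
o Incorrect deduction rule: from Status(married) $\Rightarrow_{0.9,50}$ BMI(28,29,30) infer Status(married) $\wedge$ Education(basic) $\Rightarrow_{0.9,50}$ BMI(28,29,30)
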
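The difference between the two example rules can be illustrated on 4ft-tables. The sketch below is an assumption-laden illustration, not the correctness criterion itself: widening the succedent moves some rows from b to a, which a brute-force check over small tables confirms never breaks a founded implication, while adding a conjunct to the antecedent selects a subset of rows, for which a counterexample with hypothetical frequencies is easy to construct.

```python
from itertools import product


def founded_implication(a: int, b: int, p: float, base: int) -> bool:
    """True iff a/(a+b) >= p and a >= base."""
    return a + b > 0 and a >= p * (a + b) and a >= base


def widening_preserves_truth(max_count: int = 12, p: float = 0.9, base: int = 3) -> bool:
    """Check all small tables: moving k rows from b to a keeps the rule true."""
    for a, b in product(range(max_count), repeat=2):
        if not founded_implication(a, b, p, base):
            continue
        for k in range(b + 1):
            if not founded_implication(a + k, b - k, p, base):
                return False
    return True


# Hypothetical counterexample for the incorrect rule:
# Status(married): 95 rows with BMI in {28,29,30}, 5 rows without.
rule_holds = founded_implication(95, 5, p=0.9, base=50)
# Rows that also satisfy Education(basic): only 40 and 5 of them remain.
narrowed_holds = founded_implication(40, 5, p=0.9, base=50)
```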
FOFRADAR and CRISP-DM (1)

FOFRADAR and CRISP-DM (2)
Languages and procedures enhancing the calculus of association rules

FOFRADAR selected details
o Language of domain knowledge
  o Groups of attributes
  o SI-formulas
o Language of analytical questions
o Analytical questions and association rules
o ASSOC and the 4ft-Miner procedure
o 4ft-Miner application - an example
o 4ft-Miner output and SI-formulas
o FOFRADAR - summary and additional procedures

$L_{DK}$ - Language of domain knowledge (1)
Groups of attributes
o Basic: $G_1, G_2, \ldots, G_L$, $L < K$, with $G_1 \cup G_2 \cup \ldots \cup G_L = \{A_1, \ldots, A_K\}$ and $G_i \cap G_j = \emptyset$ for $i \neq j$, $i, j = 1, \ldots, L$
o Additional: ad hoc

Examples:
o Personal: Marital_Status, Education, Responsibility, BMI
o Problems: Diabetes, Infarction, Hypertension, Hyperlipidemia
o Examinations: Diastolic, Systolic, Cholesterol

$L_{DK}$ - Language of domain knowledge (2)
SI-formulas express simple influence between attributes
o BMI $\uparrow\uparrow$ Diastolic: if BMI increases then Diastolic blood pressure increases too
o BMI $\uparrow\uparrow$ Diastolic X [Status]: the relation BMI $\uparrow\uparrow$ Diastolic does not depend on Status
o Education $\uparrow\downarrow$ BMI: if Education increases then BMI decreases
o BMI $\uparrow^{+}$ Diabetes(yes): if BMI increases then the relative frequency of Diabetes(yes) increases too
o Diabetes(yes) $\rightarrow^{+}$ Infarction(yes): if Diabetes(yes) then the relative frequency of Infarction(yes) increases
o ...

$L_{AQ}$ - Language of analytical questions (1)
Example of a formula of $L_{AQ}$:
[M: Personal, Problems $\approx$? Examinations; BMI $\uparrow\uparrow$ Diastolic]
In M, are there interesting relations between combinations of values of attributes of groups Personal and Problems on one side and combinations of values of attributes of group Examinations on the other side which are not consequences of BMI $\uparrow\uparrow$ Diastolic?

$L_{AQ}$ - Language of analytical questions (2)
o M - data matrix
o $G_1, \ldots, G_U$, $G'_1, \ldots, G'_V$ - groups of attributes
o $\Omega_1, \ldots, \Omega_W$ - SI-formulas

Examples of formulas of $L_{AQ}$:
o [M: $G_1, \ldots, G_U$ $\approx$? $G'_1, \ldots, G'_V$]
  In data matrix M, are there any interesting relations between combinations of values of attributes of groups $G_1, \ldots, G_U$ on one side and combinations of values of attributes of groups $G'_1, \ldots, G'_V$ on the other side?
o [M: $G_1, \ldots, G_U$ $\approx$? $G'_1, \ldots, G'_V$; $\Omega_1, \ldots, \Omega_W$]
  In data matrix M, are there any interesting relations between combinations of values of attributes of groups $G_1, \ldots, G_U$ on one side and combinations of values of attributes of groups $G'_1, \ldots, G'_V$ on the other side? We are, however, not interested in consequences of $\Omega_1, \ldots, \Omega_W$.

Analytical questions and association rules
[M: Personal, Problems $\approx$? Examinations; BMI $\uparrow\uparrow$ Diastolic]
In M, are there interesting relations between combinations of values of attributes of groups Personal and Problems on one side and combinations of values of attributes of group Examinations on the other side which are not consequences of BMI $\uparrow\uparrow$ Diastolic?

We solve:
[M: B(Personal) $\wedge$ B(Problems) $\approx$? B(Examinations); BMI $\uparrow\uparrow$ Diastolic]
In M, are there interesting relations between combinations of Boolean characteristics of attributes of groups Personal and Problems on one side and combinations of Boolean characteristics of attributes of group Examinations on the other side which are not consequences of BMI $\uparrow\uparrow$ Diastolic?

GUHA procedure ASSOC (since the 1960s)
o Input: DATA and a definition of a set of relevant association rules
o Processing: generation and verification of particular rules
o Output: all prime rules

We use the 4ft-Miner procedure, an implementation of the enhanced ASSOC procedure.

4ft-Miner application - an example
[M: B(Personal) $\wedge$ B(Problems) $\approx$? B(Examinations); BMI $\uparrow\uparrow$ Diastolic]
o Input: DATA (M) and a definition of a set of relevant association rules (B(Personal) $\wedge$ B(Problems) $\approx$? B(Examinations))
o Processing: generation and verification of particular rules
o Output: all prime rules
o Filtering out consequences of BMI $\uparrow\uparrow$ Diastolic

FOFRADAR and 4ft-Miner application

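Defining set of relevant association rules - an example (1)
B(Personal) $\wedge$ B(Problems) $\approx$? B(Examinations) is solved as
B(Personal) $\wedge$ B(Problems) $\Rightarrow_{0.85,30}$ B(Examinations)

$\varphi \Rightarrow_{0.85,30} \psi$: At least 85 per cent of rows of M satisfying $\varphi$ satisfy also $\psi$, and there are at least 30 rows satisfying both $\varphi$ and $\psi$.

M               $\psi$   $\neg\psi$
$\varphi$       a        b
$\neg\varphi$   c        d
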
Defining set of relevant association rules - an example (2)
B(Personal): BMI has 13 categories; coefficients are intervals of length 1 to 4

Interval length   Number of intervals
1                 13
2                 12
3                 11
4                 10

46 = 13 + 12 + 11 + 10

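The count of 46 can be reproduced by enumerating the intervals directly. This is a small sketch; the function name `intervals` and the numbering of categories from 1 to 13 are assumptions made for illustration.

```python
from typing import List, Tuple


def intervals(n_categories: int, max_length: int) -> List[Tuple[int, int]]:
    """All intervals (first, last) of 1..max_length consecutive categories."""
    return [
        (start, start + length - 1)
        for length in range(1, max_length + 1)
        for start in range(1, n_categories - length + 2)
    ]


bmi_intervals = intervals(13, 4)
# Number of intervals per length 1, 2, 3, 4.
per_length = [
    sum(1 for first, last in bmi_intervals if last - first + 1 == length)
    for length in range(1, 5)
]
```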
Defining set of relevant association rules - an example (3)
B(Problems): Boolean attributes built from combinations of Diabetes(yes), Hyperlipidemia(yes), Hypertension(yes) and Infarction(yes)
o Diabetes(yes)
o Diabetes(yes), Hyperlipidemia(yes)
o Diabetes(yes), Hyperlipidemia(yes), Hypertension(yes)
o Diabetes(yes), Hyperlipidemia(yes), Hypertension(yes), Infarction(yes)
o Diabetes(yes), Hyperlipidemia(yes), Infarction(yes)
o Diabetes(yes), Hypertension(yes), Infarction(yes)
o ...
o Hypertension(yes), Infarction(yes)
o Infarction(yes)

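Assuming that B(Problems) contains one Boolean attribute for every non-empty combination of the four basic Boolean attributes (an assumption; the slide elides part of the list and the connective joining the members is not shown here), the combinations can be enumerated as follows.

```python
from itertools import combinations

problems = ["Diabetes(yes)", "Hyperlipidemia(yes)", "Hypertension(yes)", "Infarction(yes)"]

# Every non-empty combination of 1 to 4 basic Boolean attributes (assumed definition).
problem_combinations = [
    combo
    for size in range(1, len(problems) + 1)
    for combo in combinations(problems, size)
]
```

Under this assumption there are 2^4 - 1 = 15 combinations.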
Defining set of relevant association rules - an example (4)
B(Examinations): coefficients are intervals of categories

4ft-Miner output
o Input: DATA and a definition of a set of relevant association rules
o Processing: generation and verification of particular rules
o Output: all prime rules
o 112 minutes, $18 \times 10^{6}$ verifications
o 123 rules found

4ft-Miner output detail

4ft-Miner output and SI-formulas (1)
[M: B(Personal) $\wedge$ B(Problems) $\approx$? B(Examinations); BMI $\uparrow\uparrow$ Diastolic]
Consequences of BMI $\uparrow\uparrow$ Diastolic?

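Consequences of SI-formulas
SI-formula $\Omega$, 4ft-quantifier $\approx$
Cons($\Omega$, $\approx$) is a set of rules $\varphi \approx \psi$ which can be considered as consequences of $\Omega$
o AC: Atomic Consequences
o LC: Logical Consequences
o AgC: Agreed Consequences

Cons($\Omega$, $\approx$) = AC($\Omega$, $\approx$) $\cup$ LC($\Omega$, $\approx$) $\cup$ AgC($\Omega$, $\approx$)

Example:
Cons(BMI $\uparrow\uparrow$ Diastolic, $\Rightarrow_{0.85,30}$) = AC(BMI $\uparrow\uparrow$ Diastolic, $\Rightarrow_{0.85,30}$) $\cup$ LC(BMI $\uparrow\uparrow$ Diastolic, $\Rightarrow_{0.85,30}$) $\cup$ AgC(BMI $\uparrow\uparrow$ Diastolic, $\Rightarrow_{0.85,30}$)
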
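Atomic consequences
AC(BMI $\uparrow\uparrow$ Diastolic; $\Rightarrow_{0.85,30}$), with $p \geq 0.85$ and Base $\geq 30$:
o BMI(low) $\Rightarrow_{p,Base}$ Diastolic(low)
o BMI(medium) $\Rightarrow_{p,Base}$ Diastolic(medium)
o BMI(high) $\Rightarrow_{p,Base}$ Diastolic(high)

Examples for BMI(low) $\Rightarrow_{p,Base}$ Diast(low):
o BMI$(16;21\rangle$ $\Rightarrow_{0.85,30}$ Diast$\langle 50;70)$
o BMI$(21;22\rangle$ $\Rightarrow_{0.95,35}$ Diast$(\langle 50;70), \langle 70;80))$
o BMI$((21;22\rangle, (22;23\rangle)$ $\Rightarrow_{0.87,32}$ Diast$\langle 50;70)$
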
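Logical consequence of an atomic consequence - examples
Atomic consequence: BMI$(24;28\rangle$ $\Rightarrow_{0.86,32}$ Diast$\langle 80;100)$

BMI$(24;28\rangle$ $\Rightarrow_{0.86,32}$ Diast$\langle 80;110)$ logically follows from BMI$(24;28\rangle$ $\Rightarrow_{0.86,32}$ Diast$\langle 80;100)$:

Entry                    Diast$\langle 80;100)$   $\neg$Diast$\langle 80;100)$
BMI$(24;28\rangle$       a                        b
$\neg$BMI$(24;28\rangle$ c                        d

Entry                    Diast$\langle 80;110)$   $\neg$Diast$\langle 80;110)$
BMI$(24;28\rangle$       a'                       b'
$\neg$BMI$(24;28\rangle$ c'                       d'

$a' \geq a \wedge b' \leq b$: if $\frac{a}{a+b} \geq 0.86 \wedge a \geq 32$ then $\frac{a'}{a'+b'} \geq 0.86 \wedge a' \geq 32$

BMI$(24;28\rangle$ $\Rightarrow_{0.86,32}$ Diast$\langle 80;110)$ $\vee$ Diabetes(yes) also logically follows from BMI$(24;28\rangle$ $\Rightarrow_{0.86,32}$ Diast$\langle 80;100)$.
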
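The table argument can be checked numerically. In this sketch the frequencies are hypothetical, and `table_condition` is a name introduced here for the condition a' >= a and b' <= b; the founded implication uses the thresholds 0.86 and 32 from the example.

```python
def founded_implication(a: int, b: int, p: float = 0.86, base: int = 32) -> bool:
    """True iff a/(a+b) >= p and a >= base."""
    return a + b > 0 and a >= p * (a + b) and a >= base


def table_condition(a: int, b: int, a_new: int, b_new: int) -> bool:
    """Condition a' >= a and b' <= b relating the two 4ft-tables."""
    return a_new >= a and b_new <= b


# Hypothetical frequencies for the succedent Diast<80;100): 35 of 40 rows.
a, b = 35, 5
# Widening the succedent to Diast<80;110) moves 2 rows from b to a.
a_new, b_new = 37, 3

original_holds = founded_implication(a, b)
condition_met = table_condition(a, b, a_new, b_new)
widened_holds = founded_implication(a_new, b_new)
```

Whenever `original_holds` and `condition_met` are both true, `widened_holds` is true as well, because a grows while a + b does not.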
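Agreed consequences - an example
Atomic consequence: BMI$(24;26\rangle$ $\Rightarrow_{0.875,42}$ Diast$\langle 70;100)$

BMI$(24;26\rangle$ $\wedge$ Status(single) $\Rightarrow_{0.875,42}$ Diast$\langle 70;100)$
o does not logically follow from BMI$(21;24\rangle$ $\Rightarrow_{0.87,42}$ Diast$\langle 50;80)$
o but it says nothing new
o used: BMI $\uparrow\uparrow$ Diastolic X [Status], i.e. BMI $\uparrow\uparrow$ Diastolic does not depend on Status
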
4ft-Miner output and SI-formulas (2)
o 123 rules found
o 97 consequences of BMI $\uparrow\uparrow$ Diast filtered out
o 26 remaining rules
  o 16 consequences of BMI $\uparrow\uparrow$ Syst: start an additional research to confirm BMI $\uparrow\uparrow$ Syst?
  o Interpret the remaining 10 rules

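The filtering arithmetic can be mirrored with set differences. The integer rule identifiers below are purely hypothetical stand-ins for the rules found by 4ft-Miner; only the counts 123, 97 and 16 come from the slide.

```python
# Hypothetical identifiers for the 123 rules in the 4ft-Miner output.
found_rules = set(range(123))

# Hypothetical membership: which rules are consequences of which SI-formula.
consequences_of_bmi_diast = set(range(97))
consequences_of_bmi_syst = set(range(97, 113))

# Filter out consequences of BMI up-up Diast, then set aside those of BMI up-up Syst.
remaining = found_rules - consequences_of_bmi_diast
to_interpret = remaining - consequences_of_bmi_syst
```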
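FOFRADAR - Summary
o Domain knowledge: groups Personal, Problems, ...; SI-formula BMI $\uparrow\uparrow$ Diastolic
o Analytical question: [M: B(Personal) $\wedge$ B(Problems) $\approx$? B(Examinations); BMI $\uparrow\uparrow$ Diastolic]
o 4ft-Miner
o Cons($\Omega$, $\approx$) = AC($\Omega$, $\approx$) $\cup$ LC($\Omega$, $\approx$) $\cup$ AgC($\Omega$, $\approx$)
o Conclusions: 97 consequences of BMI $\uparrow\uparrow$ Diast, 16 consequences of BMI $\uparrow\uparrow$ Syst
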
FOFRADAR - Additional procedures
o DK_AQ: transforms items of domain knowledge (i.e. formulas of language $L_{DK}$) to analytical questions (i.e. formulas of language $L_{AQ}$).
o Dt_AQ: transforms items of data knowledge (i.e. formulas of language $L_{Dt}$) to analytical questions (i.e. formulas of language $L_{AQ}$).
o AQ_RAR: assigns a formula $\Phi$ of language $L_{RAR}$ to each analytical question Q (i.e. a formula of language $L_{AQ}$). $\Phi$ defines the set S($\Phi$) of association rules which are relevant to the analytical question Q.
o CONCL: produces conclusions like "97 consequences of BMI $\uparrow\uparrow$ Diast" and "16 consequences of BMI $\uparrow\uparrow$ Syst".

FOFRADAR - describing data mining process
A sketch of a data mining process description:

START
1  Formulate_Analytical_Question by DK_AQ
   Define_Set_of_Relevant_Rules by AQ_RAR
   Apply ASSOC
   Apply CONCL
   IF there is an additional analytical question THEN GOTO 1
STOP

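The loop can be written as a small driver function, shown here as a hedged sketch. The four procedures are passed in as callables because their real implementations are not given; the stand-ins used in the demo call are hypothetical stubs, and the GOTO is expressed as iteration over the items of domain knowledge.

```python
from typing import Callable, Iterable, List


def run_process(
    items: Iterable[str],
    dk_aq: Callable[[str], str],
    aq_rar: Callable[[str], List[str]],
    assoc: Callable[[List[str]], List[str]],
    concl: Callable[[str, List[str]], str],
) -> List[str]:
    """One pass of steps DK_AQ, AQ_RAR, ASSOC, CONCL per analytical question."""
    conclusions = []
    for item in items:  # "IF there is an additional analytical question THEN GOTO 1"
        question = dk_aq(item)               # Formulate_Analytical_Question
        relevant = aq_rar(question)          # Define_Set_of_Relevant_Rules
        rules = assoc(relevant)              # Apply ASSOC
        conclusions.append(concl(question, rules))  # Apply CONCL
    return conclusions


# Hypothetical stubs standing in for the real procedures.
demo = run_process(
    ["BMI up-up Diastolic"],
    dk_aq=lambda item: f"question about {item}",
    aq_rar=lambda question: ["rule-1", "rule-2"],
    assoc=lambda relevant: relevant[:1],
    concl=lambda question, rules: f"{len(rules)} rule(s) for {question}",
)
```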
Further work
o Finish details
o Run the whole data mining process:

START
1  Formulate_Analytical_Question by DK_AQ
   Define_Set_of_Relevant_Rules by AQ_RAR
   Apply ASSOC
   Apply CONCL
   IF there is an additional analytical question THEN GOTO 1
STOP

References (1)
o Rauch J. (2005) Logic of Association Rules. Applied Intelligence 22, pp. 9-28.
o Rauch J. (2011) Consideration on a Formal Frame for Data Mining. In: Proceedings of Granular Computing 2011, pp. 562-569. IEEE Computer Society
o Rauch J. (2012) Formalizing Data Mining with Association Rules. In: Proceedings of Granular Computing, pp. 485-490. IEEE Computer Society
o Rauch J. (2012) Domain Knowledge and Data Mining with Association Rules - a Logical Point of View. In: Proceedings of ISMIS 2012, pp. 11-20. Springer-Verlag
o Rauch J.: Observational Calculi and Association Rules. To appear at Springer-Verlag, see http://www.springer.com/engineering/computational+intelligence+and+complexity/book/978-3-642-11736-7

References (2)
o Rauch J., Šimůnek M. (2005) An Alternative Approach to Mining Association Rules. In: Lin T. Y. et al. (eds) Data Mining: Foundations, Methods, and Applications, pp. 219-238. Springer-Verlag
o Rauch J., Šimůnek M. (2008) LAREDAM - Considerations on System of Local Analytical Reports from Data Mining. In: An A., Matwin S., Ras Z. W., Slezak D. (eds) Foundations of Intelligent Systems, pp. 143-149. Springer-Verlag
o Rauch J., Šimůnek M. (2009) Dealing with Background Knowledge in the SEWEBAR Project. In: Berendt B. et al. (eds) Knowledge Discovery Enhanced with Semantic and Social Information, pp. 89-106. Springer-Verlag
o Rauch J., Šimůnek M. (2011) Applying Domain Knowledge in Association Rules Mining Process - First Experience. In: Kryszkiewicz M. et al. (eds) Foundations of Intelligent Systems, pp. 113-122. Springer-Verlag

References (3)
o Šimůnek M. (2003) Academic KDD Project LISp-Miner. In: Abraham A. et al. (eds) Advances in Soft Computing - Intelligent Systems Design and Applications, pp. 263-272. Springer-Verlag
o Šimůnek M., Tammisto T. (2010) Distributed Data-Mining in the LISp-Miner System Using Techila Grid. In: Proceedings of Networked Digital Technologies, pp. 15-20. Springer-Verlag

Thank you