Formal Framework for Data Mining with Association Rules and Domain Knowledge

Jan Rauch, Faculty of Informatics and Statistics, University of Economics, Prague, Czech Republic

Principles
o Association rule: a couple of related general Boolean attributes derived from columns of a data matrix
o Dealing with formalized items of domain knowledge
  o BMI ↑↑ Diastolic: if BMI increases then Diastolic blood pressure increases too
  o Item Ω mapped to set Cons(Ω) of rules, the consequences of Ω
o FOFRADAR - FOrmal FRAme for Data mining with Association Rules - an enhanced logical calculus of association rules
o FOFRADAR used to describe the data mining process
o Results of data mining
  o conclusion: There are no consequences of Ω in given data
  o conclusion: There are a lot of consequences of Ω in given data
  o conclusion: Start additional analysis with the goal
  o interesting rules

Goal and Acknowledgement
The goal is to present theoretical aspects of FOFRADAR.
This paper is based on the logical calculus of association rules and on experiments with the 4ft-Miner procedure. The 4ft-Miner procedure is a part of the LISp-Miner system, see http://lispminer.vse.cz/ and References.

Structure
o Data matrices, Boolean attributes and association rules
o Logical calculus of association rules, deduction rules
o FOFRADAR and CRISP-DM
  FOFRADAR Overview
  Selected details
o Describing data mining process with FOFRADAR

Data matrix and basic Boolean attributes
Columns = attributes; the columns to the right of the bar are basic Boolean attributes.
M        Sex  Status   BMI  | Sex(F)  Status(single,widower)  BMI(23,24,25)
o_1      F    single   25   | 1       1                       1
o_2      M    married  36   | 0       0                       0
...
o_{n-1}  M    married  22   | 0       0                       0
o_n      F    widower  28   | 1       1                       0
Attribute = column of a data matrix: A, e.g. Sex, Status, BMI
Category = a possible value of the attribute, e.g. F, M, single, married
Basic Boolean attribute: A(α), where α is the coefficient of A(α), i.e. a subset of the set of possible categories
A(α) is true for object o_i, written A(α)[o_i] = 1, iff A(o_i) ∈ α
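The definition of a basic Boolean attribute can be sketched in a few lines of Python. This is only an illustration and not part of LISp-Miner: the helper name `basic_attribute` and the four sample rows are assumptions chosen to resemble the slide.

```python
# Illustrative rows of a data matrix M (values are assumptions resembling the slide).
rows = [
    {"Sex": "F", "Status": "single", "BMI": 25},
    {"Sex": "M", "Status": "married", "BMI": 36},
    {"Sex": "M", "Status": "married", "BMI": 22},
    {"Sex": "F", "Status": "widower", "BMI": 28},
]


def basic_attribute(attribute, coefficient):
    """Build A(alpha): returns 1 for a row o iff A(o) lies in the coefficient alpha."""
    alpha = frozenset(coefficient)
    return lambda row: int(row[attribute] in alpha)


sex_f = basic_attribute("Sex", {"F"})
status_sw = basic_attribute("Status", {"single", "widower"})
bmi_23_25 = basic_attribute("BMI", {23, 24, 25})

# One 0/1 column per basic Boolean attribute, in row order.
columns = {
    "Sex(F)": [sex_f(r) for r in rows],
    "Status(single,widower)": [status_sw(r) for r in rows],
    "BMI(23,24,25)": [bmi_23_25(r) for r in rows],
}
```

Representing the coefficient as a set keeps the membership test A(o) ∈ α literal; derived Boolean attributes could then be composed from these functions with ordinary Boolean operators.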

Additional examples of Boolean attributes
BMI: 13 categories
Coefficients: intervals, i.e. consecutive categories: BMI(22, 23, 25)
Derived Boolean attributes: Sex(F) BMI(22,23,24) Status(single,widower) Education(basic) Status(single,widower) (Education(basic) BMI(22,23,24))

Association rule φ ≈ ψ
φ: Boolean attribute; ψ: Boolean attribute; ≈: 4ft-quantifier
4ft-table 4ft(φ, ψ, M):
M    ψ   ¬ψ
φ    a   b
¬φ   c   d
Associated function $F_{\approx}(a,b,c,d)$ of ≈:
$F_{\approx}(a,b,c,d) = 1$: φ ≈ ψ is true in M
$F_{\approx}(a,b,c,d) = 0$: φ ≈ ψ is false in M

Association rule examples (1)
4ft-table 4ft(φ, ψ, M):
M    ψ   ¬ψ
φ    a   b
¬φ   c   d
Founded implication ⇒_{0.9,50}: $F_{\Rightarrow_{0.9,50}}(a,b,c,d) = 1$ iff $\frac{a}{a+b} \geq 0.9 \wedge a \geq 50$
φ ⇒_{0.9,50} ψ: At least 90 per cent of rows of M satisfying φ also satisfy ψ, and there are at least 50 rows satisfying both φ and ψ.
Example: BMI(> 35) ∧ Age(> 75) ⇒_{0.9,50} Infarction(yes) ∨ Diabetes(yes)
At least 90 per cent of patients (i.e. rows of M) with BMI > 35 and age > 75 have infarction or diabetes, and in M there are at least 50 patients with BMI > 35 and age > 75 who have infarction or diabetes.
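The evaluation of a founded implication can be sketched as follows. This is a minimal sketch, not the 4ft-Miner implementation: the function names and the synthetic rows are assumptions, and Boolean attributes are modelled as plain Python predicates over rows.

```python
def four_fold_table(rows, phi, psi):
    """Return (a, b, c, d) = 4ft(phi, psi, M) for Boolean attributes given as predicates."""
    a = b = c = d = 0
    for row in rows:
        if phi(row) and psi(row):
            a += 1
        elif phi(row):
            b += 1
        elif psi(row):
            c += 1
        else:
            d += 1
    return a, b, c, d


def founded_implication(p, base):
    """Associated function of the founded implication with parameters p and Base."""
    def quantifier(a, b, c, d):
        # True iff a / (a + b) >= p and a >= Base; an empty antecedent yields 0.
        return int(a + b > 0 and a >= base and a / (a + b) >= p)
    return quantifier


# Synthetic data: 60 rows satisfy phi and 55 of them also satisfy psi.
rows = [{"phi": 1, "psi": 1}] * 55 + [{"phi": 1, "psi": 0}] * 5 + [{"phi": 0, "psi": 1}] * 40
table = four_fold_table(rows, lambda r: r["phi"], lambda r: r["psi"])
holds = founded_implication(0.9, 50)(*table)
```

Note that the quantifier only looks at a and b; the frequencies c and d do not influence a founded implication.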

Association rule examples (2)
4ft-table 4ft(φ, ψ, M):
M    ψ   ¬ψ
φ    a   b
¬φ   c   d
Above average ∼^{+}_{0.3,50}: $F_{\sim^{+}_{0.3,50}}(a,b,c,d) = 1$ iff $\frac{a}{a+b} \geq (1 + 0.3)\,\frac{a+c}{a+b+c+d} \wedge a \geq 50$
φ ∼^{+}_{0.3,50} ψ: Among the rows satisfying φ there are at least 30 per cent more rows satisfying ψ than among all rows of M, and there are at least 50 rows satisfying both φ and ψ.
Example: Sex(M) ∧ Status(married) ∼^{+}_{0.3,50} BMI(> 35)
Among the married men (i.e. rows of M satisfying Sex(M) ∧ Status(married)) there are at least 30 per cent more patients with BMI > 35 than among all patients in M, and there are at least 50 married men with BMI > 35.
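The above average quantifier compares the relative frequency of the succedent among rows satisfying the antecedent with its relative frequency in the whole matrix. A minimal sketch, assuming the four frequencies a, b, c, d are already known (the function name and the numbers are illustrative only):

```python
def above_average(p, base):
    """Associated function of the above average quantifier with parameters p and Base."""
    def quantifier(a, b, c, d):
        n = a + b + c + d
        if a + b == 0:
            return 0  # no row satisfies the antecedent
        # a/(a+b) >= (1+p) * (a+c)/(a+b+c+d)  and  a >= Base
        return int(a >= base and a / (a + b) >= (1 + p) * (a + c) / n)
    return quantifier


q = above_average(0.3, 50)
strong = q(60, 40, 140, 760)   # 0.60 among antecedent rows vs 0.20 overall
weak = q(60, 40, 440, 460)     # 0.60 among antecedent rows vs 0.50 overall
rare = q(40, 10, 0, 950)       # fewer than 50 rows satisfy both sides
```

Unlike the founded implication, this quantifier depends on all four frequencies, because the baseline (a + c)/(a + b + c + d) is computed over the whole data matrix.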

Logical calculus of association rules
Language: Attributes + Boolean attributes + Association rules + Evaluation of rules in data matrices
Attributes: A_1, ..., A_K, e.g. Age, Status, BMI, Diastolic
Basic Boolean attributes: A(α), where α is a subset of possible values, e.g. Status(single,widower)
Boolean attributes: A_1(α_1) ∧ A_2(α_2), A_3(α_3) ∨ A_4(α_4), e.g. BMI(22;25⟩ ∧ Education(basic)
Association rules: φ ≈ ψ, e.g. Status(married) ∧ Education(basic) ⇒_{0.9,50} BMI(> 27)
Val(φ ≈ ψ, M) = $F_{\approx}(a,b,c,d)$ (1 = true, 0 = false), where a, b, c, d are the frequencies of the 4ft-table 4ft(φ, ψ, M):
M    ψ   ¬ψ
φ    a   b
¬φ   c   d

Deduction rules
A deduction rule that infers φ' ≈' ψ' from φ ≈ ψ is correct if for each data matrix M: if Val(φ ≈ ψ, M) = 1 then Val(φ' ≈' ψ', M) = 1.
There are criteria of correctness of such deduction rules for important 4ft-quantifiers.
Simple examples:
Correct deduction rule: from Status(married) ⇒_{0.9,50} BMI(28,29,30) infer Status(married) ⇒_{0.9,50} BMI(28,29,30,31,32)
Incorrect deduction rule: from Status(married) ⇒_{0.9,50} BMI(28,29,30) infer Status(married) ∧ Education(basic) ⇒_{0.9,50} BMI(28,29,30)
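A single data matrix can refute a deduction rule but never prove it, since correctness quantifies over all matrices. The sketch below builds a synthetic matrix (an assumption, not data from the talk) on which the premise holds, the widened succedent still holds, and the rule with the narrowed antecedent fails. The antecedent is assumed to be narrowed by conjunction.

```python
def val_founded(rows, phi, psi, p, base):
    """Val of the founded implication phi => psi with parameters p, Base in the matrix rows."""
    a = sum(1 for r in rows if phi(r) and psi(r))
    b = sum(1 for r in rows if phi(r) and not psi(r))
    return int(a + b > 0 and a >= base and a / (a + b) >= p)


def make_rows(count, status, education, bmi):
    """Create `count` identical synthetic patients."""
    return [{"Status": status, "Education": education, "BMI": bmi} for _ in range(count)]


rows = (make_rows(40, "married", "basic", 29)
        + make_rows(15, "married", "university", 30)
        + make_rows(5, "married", "university", 35)
        + make_rows(20, "single", "basic", 22))


def married(r):
    return r["Status"] == "married"


def married_basic(r):
    return married(r) and r["Education"] == "basic"


def bmi_28_30(r):
    return r["BMI"] in {28, 29, 30}


def bmi_28_32(r):
    return r["BMI"] in {28, 29, 30, 31, 32}


premise = val_founded(rows, married, bmi_28_30, 0.9, 50)         # a = 55, b = 5
widened = val_founded(rows, married, bmi_28_32, 0.9, 50)         # a can only grow
narrowed = val_founded(rows, married_basic, bmi_28_30, 0.9, 50)  # a drops to 40 < Base
```

Widening the succedent coefficient can only move rows from b to a, which is why the first rule is correct for every matrix; narrowing the antecedent can push a below Base, as it does here.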

[Figure: FOFRADAR and CRISP-DM (1)]

[Figure: FOFRADAR and CRISP-DM (2). Languages and procedures enhancing the calculus of association rules]

FOFRADAR selected details
o Language of domain knowledge
  Groups of attributes
  SI-formulas
o Language of analytical questions
o Analytical questions and association rules
o ASSOC and the 4ft-Miner procedure
o 4ft-Miner application - an example
o 4ft-Miner output and SI-formulas
o FOFRADAR - summary and additional procedures

L_DK - Language of domain knowledge (1)
Groups of attributes
Basic: G_1, G_2, ..., G_L, L < K, with $G_1 \cup G_2 \cup \dots \cup G_L = \{A_1, \dots, A_K\}$ and $G_i \cap G_j = \emptyset$ for $i \neq j$, $i, j = 1, \dots, L$
Additional: ad hoc
Examples:
Personal: Marital_Status, Education, Responsibility, BMI
Problems: Diabetes, Infarction, Hypertension, Hyperlipidemia
Examinations: Diastolic, Systolic, Cholesterol

L_DK - Language of domain knowledge (2)
SI-formulas express simple influence between attributes:
o BMI ↑↑ Diastolic: if BMI increases then Diastolic blood pressure increases too
o BMI ↑↑ Diastolic X [Status]: the relation BMI ↑↑ Diastolic does not depend on Status
o Education ↑↓ BMI: if Education increases then BMI decreases
o BMI ↑^{+} Diabetes(yes): if BMI increases then relative frequency of Diabetes(yes) increases too
o Diabetes(yes) →^{+} Infarction(yes): if Diabetes(yes) then relative frequency of Infarction(yes) increases
o ...

L_AQ - Language of analytical questions (1)
Example of a formula of L_AQ:
[M: Personal, Problems ≈? Examinations; BMI ↑↑ Diastolic]
In M, are there interesting relations between combinations of values of attributes of groups Personal and Problems on one side and combinations of values of attributes of group Examinations on the other side which are not consequences of BMI ↑↑ Diastolic?

L_AQ - Language of analytical questions (2)
M: data matrix
G_1, ..., G_U, G'_1, ..., G'_V: groups of attributes
Ω_1, ..., Ω_W: SI-formulas
Examples of formulas of L_AQ:
[M: G_1, ..., G_U ≈? G'_1, ..., G'_V]
In data matrix M, are there any interesting relations between combinations of values of attributes of groups G_1, ..., G_U on one side and combinations of values of attributes of groups G'_1, ..., G'_V on the other side?
[M: G_1, ..., G_U ≈? G'_1, ..., G'_V; Ω_1, ..., Ω_W]
In data matrix M, are there any interesting relations between combinations of values of attributes of groups G_1, ..., G_U on one side and combinations of values of attributes of groups G'_1, ..., G'_V on the other side? We are, however, not interested in consequences of Ω_1, ..., Ω_W.
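An analytical question has a small, fixed shape: a data matrix, two lists of attribute groups and an optional list of SI-formulas whose consequences are to be ignored. One possible in-memory representation is sketched below; the class name, field names and the rendering with "≈?" are assumptions made for illustration, not notation prescribed by FOFRADAR.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AnalyticalQuestion:
    """Hypothetical container for a formula of the language of analytical questions."""
    matrix: str
    left_groups: Tuple[str, ...]
    right_groups: Tuple[str, ...]
    excluded_si_formulas: Tuple[str, ...] = ()

    def render(self) -> str:
        """Format the question in the bracket notation used on the slides."""
        left = ", ".join(self.left_groups)
        right = ", ".join(self.right_groups)
        body = f"{self.matrix}: {left} ≈? {right}"
        if self.excluded_si_formulas:
            body += "; " + ", ".join(self.excluded_si_formulas)
        return f"[{body}]"


question = AnalyticalQuestion(
    matrix="M",
    left_groups=("Personal", "Problems"),
    right_groups=("Examinations",),
    excluded_si_formulas=("BMI ↑↑ Diastolic",),
)
text = question.render()
```

Keeping the excluded SI-formulas as a separate field makes the two question forms on the slide (with and without excluded consequences) instances of the same type.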

Analytical questions and association rules
[M: Personal, Problems ≈? Examinations; BMI ↑↑ Diastolic]
In M, are there interesting relations between combinations of values of attributes of groups Personal and Problems on one side and combinations of values of attributes of group Examinations on the other side which are not consequences of BMI ↑↑ Diastolic?
We solve:
[M: B(Personal) ∧ B(Problems) ≈? B(Examinations); BMI ↑↑ Diastolic]
In M, are there interesting relations between combinations of Boolean characteristics of attributes of groups Personal and Problems on one side and combinations of Boolean characteristics of attributes of group Examinations on the other side which are not consequences of BMI ↑↑ Diastolic?

GUHA procedure ASSOC (since the 1960s)
Input: DATA and a definition of a set of relevant association rules
Processing: generation and verification of particular rules
Output: all prime rules
We use the 4ft-Miner procedure, an implementation of the enhanced ASSOC procedure.

4ft-Miner application - an example
[M: B(Personal) ∧ B(Problems) ≈? B(Examinations); BMI ↑↑ Diastolic]
Input: DATA (M) and a definition of a set of relevant association rules (B(Personal) ∧ B(Problems) ≈? B(Examinations))
Processing: generation and verification of particular rules
Output: all prime rules
Then: filtering out consequences of BMI ↑↑ Diastolic

[Figure: FOFRADAR and 4ft-Miner application]

Defining the set of relevant association rules - an example (1)
B(Personal) ∧ B(Problems) ≈? B(Examinations)
B(Personal) ∧ B(Problems) ⇒_{0.85,30} B(Examinations)
φ ⇒_{0.85,30} ψ: At least 85 per cent of rows of M satisfying φ also satisfy ψ, and there are at least 30 rows satisfying both φ and ψ.
M    ψ   ¬ψ
φ    a   b
¬φ   c   d

Defining the set of relevant association rules - an example (2)
B(Personal): BMI has 13 categories; coefficients are intervals of length 1 to 4
Intervals of length 1: 13
Intervals of length 2: 12
Intervals of length 3: 11
Intervals of length 4: 10
In total 46 = 13 + 12 + 11 + 10 coefficients
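The count of 46 interval coefficients can be checked by enumeration. A minimal sketch, assuming 13 ordered BMI categories with placeholder labels 1 to 13 (the actual category labels are not given on the slide):

```python
def interval_coefficients(categories, min_length, max_length):
    """List all runs of consecutive categories with min_length..max_length elements."""
    return [
        tuple(categories[start:start + length])
        for length in range(min_length, max_length + 1)
        for start in range(len(categories) - length + 1)
    ]


bmi_categories = list(range(1, 14))  # 13 placeholder labels, an assumption
coefficients = interval_coefficients(bmi_categories, 1, 4)

# Number of intervals for each length 1, 2, 3, 4.
counts = [sum(1 for c in coefficients if len(c) == n) for n in range(1, 5)]
total = len(coefficients)
```

For k ordered categories there are k - n + 1 intervals of length n, which gives the 13, 12, 11 and 10 listed above.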

Defining the set of relevant association rules - an example (3)
B(Problems): Diabetes(yes) Diabetes(yes) Hyperlipidemia(yes) Diabetes(yes) Hyperlipidemia(yes) Hypertension(yes) Diabetes(yes) Hyperlipidemia(yes) Hypertension(yes) Infarction(yes) Diabetes(yes) Hyperlipidemia(yes) Infarction(yes) Diabetes(yes) Hypertension(yes) Infarction(yes) Diabetes(yes) Hypertension(yes) Infarction(yes) ... Hypertension(yes) Infarction(yes) Infarction(yes)

Defining the set of relevant association rules - an example (4)
[Figure: B(Examinations), with Interval coefficients]

4ft-Miner output
Input: DATA and a definition of a set of relevant association rules
Processing: generation and verification of particular rules
Output: all prime rules
112 minutes, $18 \times 10^{6}$ verifications, 123 rules found

[Figure: 4ft-Miner output detail]

4ft-Miner output and SI-formulas (1)
[M: B(Personal) ∧ B(Problems) ≈? B(Examinations); BMI ↑↑ Diastolic]
Consequences of BMI ↑↑ Diastolic?

Consequences of SI-formulas
Ω: SI-formula; ≈: 4ft-quantifier
Cons(Ω, ≈) is a set of rules φ ≈ ψ which can be considered as consequences of Ω:
Atomic Consequences, Logical Consequences, Agreed Consequences
$\mathrm{Cons}(\Omega, \approx) = \mathrm{AC}(\Omega, \approx) \cup \mathrm{LC}(\Omega, \approx) \cup \mathrm{AgC}(\Omega, \approx)$
Example: Cons(BMI ↑↑ Diastolic, ⇒_{0.85,30}) = AC(BMI ↑↑ Diastolic, ⇒_{0.85,30}) ∪ LC(BMI ↑↑ Diastolic, ⇒_{0.85,30}) ∪ AgC(BMI ↑↑ Diastolic, ⇒_{0.85,30})

Atomic consequences
AC(BMI ↑↑ Diastolic; ⇒_{0.85,30}), with p ≥ 0.85 and Base ≥ 30:
BMI(low) ⇒_{p,Base} Diastolic(low)
BMI(medium) ⇒_{p,Base} Diastolic(medium)
BMI(high) ⇒_{p,Base} Diastolic(high)
Examples for BMI(low), Diast(low):
BMI(16;21⟩ ⇒_{0.85,30} Diast⟨50;70)
BMI(21;22⟩ ⇒_{0.95,35} Diast(⟨50;70), ⟨70;80))
BMI((21;22⟩, (22;23⟩) ⇒_{0.87,32} Diast⟨50;70)

Logical consequence of an atomic consequence - examples
Atomic consequence: BMI(24,28⟩ ⇒_{0.86,32} Diast⟨80,100)
BMI(24,28⟩ ⇒_{0.86,32} Diast⟨80,110) logically follows from BMI(24,28⟩ ⇒_{0.86,32} Diast⟨80,100):
Entry          Diast⟨80;100)   ¬Diast⟨80;100)
BMI(24,28⟩     a               b
¬BMI(24,28⟩    c               d
Entry          Diast⟨80;110)   ¬Diast⟨80;110)
BMI(24,28⟩     a'              b'
¬BMI(24,28⟩    c'              d'
$a' \geq a \wedge b' \leq b$, so $\frac{a}{a+b} \geq 0.86 \wedge a \geq 32$ implies $\frac{a'}{a'+b'} \geq 0.86 \wedge a' \geq 32$.
BMI(24,28⟩ ⇒_{0.86,32} Diast⟨80,110) ∨ Diabetes(yes) also logically follows from BMI(24,28⟩ ⇒_{0.86,32} Diast⟨80,100).

Agreed consequences - an example
Atomic consequence: BMI(24,26⟩ ⇒_{0.875,42} Diast⟨70,100)
BMI(24,26⟩ ∧ Status(single) ⇒_{0.875,42} Diast⟨70,100) does not logically follow from BMI(21,24⟩ ⇒_{0.87,42} Diast⟨50,80), but it says nothing new.
Used: BMI ↑↑ Diastolic X [Status], i.e. BMI ↑↑ Diastolic does not depend on Status

4ft-Miner output and SI-formulas (2)
123 rules found
97 consequences of BMI ↑↑ Diast filtered out
26 remaining rules
16 consequences of BMI ↑↑ Syst: start an additional research to confirm BMI ↑↑ Syst?
Interpret remaining 10 rules
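The filtering step amounts to partitioning the output twice. The sketch below is only a bookkeeping illustration: the rules are synthetic stand-ins, and the `consequence_of` label is a hypothetical field that assumes membership in Cons has already been decided for each rule. How that decision is made is not shown here.

```python
def split_consequences(rules, si_formula):
    """Partition rules into consequences of si_formula and the remaining rules."""
    consequences = [r for r in rules if r["consequence_of"] == si_formula]
    remaining = [r for r in rules if r["consequence_of"] != si_formula]
    return consequences, remaining


# Synthetic stand-ins for the 123 rules; labels are chosen to reproduce the slide's counts.
labels = ["BMI-Diast"] * 97 + ["BMI-Syst"] * 16 + [None] * 10
rules = [{"id": i, "consequence_of": label} for i, label in enumerate(labels)]

diast, after_diast = split_consequences(rules, "BMI-Diast")
syst, to_interpret = split_consequences(after_diast, "BMI-Syst")
```

The second split corresponds to the observation that 16 of the 26 remaining rules would be consequences of an SI-formula relating BMI and systolic pressure, leaving 10 rules to interpret.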

FOFRADAR Summary
Domain knowledge: groups Personal, Problems; SI-formula BMI ↑↑ Diastolic
Analytical question: [M: B(Personal) ∧ B(Problems) ≈? B(Examinations); BMI ↑↑ Diastolic]
4ft-Miner
Cons(Ω, ≈) = AC(Ω, ≈) ∪ LC(Ω, ≈) ∪ AgC(Ω, ≈)
97 consequences of BMI ↑↑ Diast
16 consequences of BMI ↑↑ Syst

FOFRADAR Additional procedures
DK_AQ: Procedure DK_AQ transforms items of domain knowledge (i.e. formulas of language L_DK) to analytical questions (i.e. formulas of language L_AQ).
Dt_AQ: Procedure Dt_AQ transforms items of data knowledge (i.e. formulas of language L_Dt) to analytical questions (i.e. formulas of language L_AQ).
AQ_RAR: Procedure AQ_RAR assigns a formula of language L_RAR to each analytical question Q (i.e. a formula of language L_AQ). This formula defines the set of association rules which are relevant to the analytical question Q.
CONCL: Procedure CONCL produces conclusions like "97 consequences of BMI ↑↑ Diast" and "16 consequences of BMI ↑↑ Syst".

FOFRADAR describing data mining process
A sketch of data mining process description:
START
1 Formulate_Analytical_Question by DK_AQ
  Define_Set_of_Relevant_Rules by AQ_RAR
  Apply ASSOC
  Apply CONCL
  IF There is additional analytical question THEN GOTO 1
STOP
DK_AQ, AQ_RAR and CONCL are FOFRADAR procedures.
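The control flow of this sketch can be made executable. In the version below the four procedures are passed in as functions and replaced by trivial stubs; the stubs, the function name `run_process` and the queue of knowledge items are all assumptions, since the slides describe the procedures only at the level of their inputs and outputs.

```python
def run_process(knowledge_items, dk_aq, aq_rar, assoc, concl):
    """Repeat the four steps while analytical questions remain."""
    conclusions = []
    pending = list(knowledge_items)
    while pending:  # IF there is an additional analytical question THEN GOTO 1
        item = pending.pop(0)
        question = dk_aq(item)        # Formulate_Analytical_Question by DK_AQ
        relevant = aq_rar(question)   # Define_Set_of_Relevant_Rules by AQ_RAR
        rules = assoc(relevant)       # Apply ASSOC
        conclusions.append(concl(question, rules))  # Apply CONCL
    return conclusions


# Stub procedures, purely hypothetical, to make the control flow runnable.
result = run_process(
    ["BMI-Diastolic", "BMI-Systolic"],
    dk_aq=lambda item: f"question about {item}",
    aq_rar=lambda question: f"rules relevant to {question}",
    assoc=lambda relevant: [relevant + " #1", relevant + " #2"],
    concl=lambda question, rules: (question, len(rules)),
)
```

This version assumes the questions are known up front; in the process described on the slide a conclusion may itself suggest a further question, which would correspond to appending to `pending` inside the loop.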

Further work
Finish details
Run the whole data mining process:
START
1 Formulate_Analytical_Question by DK_AQ
  Define_Set_of_Relevant_Rules by AQ_RAR
  Apply ASSOC
  Apply CONCL
  IF There is additional analytical question THEN GOTO 1
STOP

References (1)
Rauch J.: (2005) Logic of Association Rules. Applied Intelligence 22, pp. 9-28.
Rauch J.: (2011) Consideration on a Formal Frame for Data Mining. In: Proceedings of Granular Computing 2011, pp. 562-569. IEEE Computer Society
Rauch J.: (2012) Formalizing Data Mining with Association Rules. In: Proceedings of Granular Computing, pp. 485-490. IEEE Computer Society
Rauch J.: (2012) Domain Knowledge and Data Mining with Association Rules - a Logical Point of View. In: Proceedings of ISMIS 2012, pp. 11-20. Springer-Verlag
Rauch J.: Observational Calculi and Association Rules. To appear at Springer-Verlag, see http://www.springer.com/engineering/computational+intelligence+and+complexity/book/978-3-642-11736-7

References (2)
Rauch J., Šimůnek M.: (2005) An Alternative Approach to Mining Association Rules. In: Lin T. Y. et al. (eds) Data Mining: Foundations, Methods, and Applications, pp. 219-238. Springer-Verlag
Rauch J., Šimůnek M.: (2008) LAREDAM - Considerations on System of Local Analytical Reports from Data Mining. In: An, A., Matwin, S., Ras, Z.W., Slezak, D. (eds) Foundations of Intelligent Systems, pp. 143-149. Springer-Verlag
Rauch J., Šimůnek M.: (2009) Dealing with Background Knowledge in the SEWEBAR Project. In: Berendt B. et al. (eds) Knowledge Discovery Enhanced with Semantic and Social Information, pp. 89-106. Springer-Verlag
Rauch J., Šimůnek M.: (2011) Applying Domain Knowledge in Association Rules Mining Process - First Experience. In: Kryszkiewicz, M. et al. (eds) Foundations of Intelligent Systems, pp. 113-122. Springer-Verlag

References (3)
Šimůnek M.: (2003) Academic KDD Project LISp-Miner. In: Abraham A. et al. (eds) Advances in Soft Computing - Intelligent Systems Design and Applications, pp. 263-272. Springer-Verlag
Šimůnek M., Tammisto T.: (2010) Distributed Data-Mining in the LISp-Miner System Using Techila Grid. In: Proceedings of Networked Digital Technologies, pp. 15-20. Springer-Verlag

Thank you