Mining Medical Data for Causal and Temporal Patterns

Similar documents
Genetic programming applied to Collagen disease & thrombosis.

Guide to Medical Data on Collagen Disease and Thrombosis

Supplement 2. Use of Directed Acyclic Graphs (DAGs)

International Journal of Software and Web Sciences (IJSWS)

Logistic Regression and Bayesian Approaches in Modeling Acceptance of Male Circumcision in Pune, India

Bayesian Networks in Medicine: a Model-based Approach to Medical Decision Making

Causal Protein Signalling Networks

Outline. What s inside this paper? My expectation. Software Defect Prediction. Traditional Method. What s inside this paper?

Propensity Score Analysis Shenyang Guo, Ph.D.

Doing After Seeing. Seeing vs. Doing in Causal Bayes Nets

Using Bayesian Networks to Direct Stochastic Search in Inductive Logic Programming

P(HYP)=h. P(CON HYP)=p REL. P(CON HYP)=q REP

Representation and Analysis of Medical Decision Problems with Influence. Diagrams

Using Bayesian Networks to Analyze Expression Data. Xu Siwei, s Muhammad Ali Faisal, s Tejal Joshi, s

ERA: Architectures for Inference

Bayesian (Belief) Network Models,

Progress in Risk Science and Causality

Pythia WEB ENABLED TIMED INFLUENCE NET MODELING TOOL SAL. Lee W. Wagenhals Alexander H. Levis

A Bayesian Network Analysis of Eyewitness Reliability: Part 1

Learning with Rare Cases and Small Disjuncts

Learning Deterministic Causal Networks from Observational Data

A Bayesian Network Model of Causal Learning

Evaluating the Causal Role of Unobserved Variables

Application of Bayesian Networks to Quantitative Assessment of Safety Barriers Performance in the Prevention of Major Accidents

Discovering Hidden Dispositions and Situational Factors in Causal Relations by Means of Contextual Independencies

Human Activities: Handling Uncertainties Using Fuzzy Time Intervals

A Bayesian Network Model for Analysis of the Factors Affecting Crime Risk

COHERENCE: THE PRICE IS RIGHT

The Semantics of Intention Maintenance for Rational Agents

The Regression-Discontinuity Design

How to Create Better Performing Bayesian Networks: A Heuristic Approach for Variable Selection

Predicting Breast Cancer Survivability Rates

Chapter 02. Basic Research Methodology

Okayama University, Japan

Decisions and Dependence in Influence Diagrams

Inferring Hidden Causes

The role of sampling assumptions in generalization with multiple categories

Bayesian Belief Network Based Fault Diagnosis in Automotive Electronic Systems

Artificial Intelligence For Homeopathic Remedy Selection

How to describe bivariate data

Data Mining Techniques to Predict Survival of Metastatic Breast Cancer Patients

Root Cause Analysis. December, 9 th, 2008

Center for Biomedical Informatics, Suite 8084, Forbes Tower, Meyran Avenue, University of Pittsburgh, Pittsburgh PA 15213

Causation and Management of Disruptive Behavior in Long-Term Care Homes. Michael Stones Lakehead University

Chapter 1 Introduction to Educational Research

TEACHING YOUNG GROWNUPS HOW TO USE BAYESIAN NETWORKS.

Assignment Question Paper I

PO Box 19015, Arlington, TX {ramirez, 5323 Harry Hines Boulevard, Dallas, TX

A NOVEL VARIABLE SELECTION METHOD BASED ON FREQUENT PATTERN TREE FOR REAL-TIME TRAFFIC ACCIDENT RISK PREDICTION

Biostatistics II

Causality Refined Diagnostic Prediction

A TETRAD-based Approach for Theory Development in Information Systems Research

Plan Recognition through Goal Graph Analysis

Acknowledgements. Section 1: The Science of Clinical Investigation. Scott Zeger. Marie Diener-West. ICTR Leadership / Team

Using directed acyclic graphs to guide analyses of neighbourhood health effects: an introduction

The 29th Fuzzy System Symposium (Osaka, September 9-, 3) Color Feature Maps (BY, RG) Color Saliency Map Input Image (I) Linear Filtering and Gaussian

Analysis A step in the research process that involves describing and then making inferences based on a set of data.

SLAUGHTER PIG MARKETING MANAGEMENT: UTILIZATION OF HIGHLY BIASED HERD SPECIFIC DATA. Henrik Kure

From Description to Causation

A Comparison of Collaborative Filtering Methods for Medication Reconciliation

in Engineering Prof. Dr. Michael Havbro Faber ETH Zurich, Switzerland Swiss Federal Institute of Technology

Using Bayesian Networks to Analyze Expression Data Λ

Lecture 9 Internal Validity

AALBORG UNIVERSITY. Prediction of the Insulin Sensitivity Index using Bayesian Network. Susanne G. Bøttcher and Claus Dethlefsen

Using Probabilistic Reasoning to Develop Automatically Adapting Assistive

Introduction. Gregory M. Provan Rockwell Science Center 1049 Camino dos Rios Thousand Oaks, CA

Prognostic Prediction in Patients with Amyotrophic Lateral Sclerosis using Probabilistic Graphical Models

The Development and Application of Bayesian Networks Used in Data Mining Under Big Data

Careful with Causal Inference

Continuous/Discrete Non Parametric Bayesian Belief Nets with UNICORN and UNINET

Extended Case Study of Causal Learning within Architecture Research (preliminary results)

Positive and Unlabeled Relational Classification through Label Frequency Estimation

Challenges of Automated Machine Learning on Causal Impact Analytics for Policy Evaluation

Empirical function attribute construction in classification learning

Hoare Logic and Model Checking. LTL and CTL: a perspective. Learning outcomes. Model Checking Lecture 12: Loose ends

An Alternative Explanation for Premack & Premack

Plan Recognition through Goal Graph Analysis

Cognitive modeling versus game theory: Why cognition matters

TESTING at any level (e.g., production, field, or on-board)

A Taxonomy of Decision Models in Decision Analysis

Finding Information Sources by Model Sharing in Open Multi-Agent Systems 1

Some Connectivity Concepts in Bipolar Fuzzy Graphs

An Efficient Attribute Ordering Optimization in Bayesian Networks for Prognostic Modeling of the Metabolic Syndrome

Evaluating Classifiers for Disease Gene Discovery

Application of Artificial Neural Network-Based Survival Analysis on Two Breast Cancer Datasets

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Causal Inference Course Syllabus

Doing Quantitative Research 26E02900, 6 ECTS Lecture 6: Structural Equations Modeling. Olli-Pekka Kauppila Daria Kautto

Regression Discontinuity Analysis

Probabilistic Graphical Models: Applications in Biomedicine

BAYESIAN NETWORKS AS KNOWLEDGE REPRESENTATION SYSTEM IN DOMAIN OF RELIABILITY ENGINEERING

Convergence Principles: Information in the Answer

Goal-Oriented Measurement plus System Dynamics A Hybrid and Evolutionary Approach

Questions and Answers

Applying Machine Learning Techniques to Analysis of Gene Expression Data: Cancer Diagnosis

Chapter 3 Tools for Practical Theorizing: Theoretical Maps and Ecosystem Maps

A probabilistic method for food web modeling

Application of Bayesian Network Model for Enterprise Risk Management of Expressway Management Corporation

Numerical Integration of Bivariate Gaussian Distribution

Transcription:

Mining Medical Data for Causal and Temporal Patterns Ahmed Y. Tawfik 1 and Krista Strickland 2 1 School of Computer Science, University of Windsor, Windsor, Ontario N9B 3P4, Canada atawfik@uwindsor.ca 2 Department of Mathematics and Computer Science, University of Prince Edward Island, Charlottetown, PEI C1A 3P4, Canada kstrickland@upei.ca Abstract. The present work presents some causal and temporal patterns found in clinical data collected in Chiba University Hospital, Japan for patients suffering from collagen di s- eases. To extract causal patterns, statistical techniques based on covariance analysis are used. The extraction of temporal patterns uses survival analysis techniques. The results obtained using these techniques are analyzed and discussed. 1 Introduction Probabilistic causality is playing an increasingly important role in intelligent data mining and evidence-based reasoning. Probabilistic causality [6, 3] extends the established notion of conditional dependence to causal interpretation. Bayesian Belief Networks [4] rely on a causal interpretation of conditional dependence to perform probabilistic reasoning based on evidence and causation. Data analysis methods have been developed to extract causal patterns from data including Tetrad [5] which is used in the present study, learning belief nets [2], and structural equ a- tion models. Probabilistic causality aims at developing methodologies that allow us to deduce that X is a probable cause of Y when the probability of X given Y is diffe r- ent from the probability of X. The techniques differ however in how they try to eliminate spurious correlation, identify when a correlation can be attributed to hi d- den common causes, and in other implicit assumptions they make about the data (whether it is discrete or continuous for example) and the data distribution. The degree to which causal analysis techniques allow the integration of background knowledge also differs from one technique to another. The causal analysis perform ed here uses Tetrad [5] to build structural graph representing causal patterns implied by the data. A structural graph consists of a set of nodes (V) and a set of edges (E) such that each node represents an entity (a data field) and an edge is directed from the cause to the effect. Undirected edges suggest a possible common hidden cause. Medicine has traditionally used pr obabilistic causality to measure the e f- fectiveness of a treatment. In such studies, a sample of patients is drawn at random and is then divided into a treatment group and a control group. Statistically signif i-

2 A. Tawfik and K. Strickland cant differences between the treatment and the control group are causally attributed to the treatment. This approach to probabilistic causality cannot be used to study disease causing factor or symptoms associated with diseases for obvious ethical and health problems. Methods of probabilistic causality similar to the one used here allow such studies to be conducted. These methods use the data available from i n- fected individuals only (i.e. without control) by analyzing the covariance of different observed quantities. In addition to causal analysis, t his study considers some temporal aspects related to thrombosis. The time between being diagnosed by an auto-immune di s- ease and developing thrombosis is studied. Here we assume that the description time is the time of diagnosis. If this date is not available, we consider the field labeled first date. This may seem counter-intuitive but this choice produced more logical and acceptable results than using the first date as the time of first diagnosis. The temporal analysis utilizes survival analysis te chniques. All dates are converted to days. The date of diagnosis is considered as the time origin for each individual, and the date of thrombosis is used as the event of interest. Patients who did not develop thrombosis are considered censored. Following this introduction, Section 2 describes the data preparation stage. Section 3 presents the results of analyzing the medical data for the collagen disease patients. Section 4 presents the results of a temporal analysis of the data. Section 5 concludes with a discussion of the limitations of causal analysis observed in the results. 2 Data Preparation The data preparation stage involved two steps: data-cleaning and data transform a- tion. The data-cleaning step eliminated unprintable characters, and inconsistent data. Inconsistencies of two forms have been processed in this step as needed: 1. Temporal inconsistencies: for example, some patients have test results predating their date of birth. 2. Data type inconsistencies: for example, some numeric fields contain alphan u- meric characters. In many cases, additional characters indicating the units used followed a test result. Records containing temporal inconsistencies were discarded only if the i n- consistency is relevant to the quantities analyzed. For example, we ignored the re c- ords corresponding to tests predating birth when the time since infection is consi d- ered. Following the cleaning step, transformation and extraction operations are performed. The transformations aim at converting all the data fields to numeric fields as required by Tetrad input module. It is also necessary to extract a subset of fields and records that do not contain missing values because Tetrad does not allow missing values. Therefore, the number of records decreases at the number of fields increases.

Mining Medical Data for Causal and Temporal Patterns 3 3 Causal Graphs Tetrad starts with a complete graph and uses statistical tests to verify that two data fields are independent or conditionally independent. Edges are eliminated from the graph to reflect the independencies detected. The remaining edges form a graph with three type of edges: 1. Directed edges () indicating a cause-effect relationship: the head of the arrow points to the effect and its tail points to the cause. 2. Bidirected edges () indicating a hidden common cause. 3. Directed edges with a small circle at its tail (o ) indicate a causeeffect or common cause. 4. Edge with circles at both ends (o--o) indicate that either could be causing the other or that there is a common cause. CRE sex UA UN age ALP RBC S (ANA P(ANA aortitis LAC SJS WBC ALB HCT HGB CRP PLT D (ANA ANA KCT RA PM BEHCET SLE RVVT LDH MCTD APS thrombosis GPT acl_igg DM GOT acl_iga acl_igm TP Figure 1. Structural graph (120 patients, tables A, B, and C)

4 A. Tawfik and K. Strickland aortitis sex P(ANA mra ra age FUO KCT dm sjs behcet S(ANA acl IgG MCT SLE RVVT pss LAC aps ANA pm N(ANA pattern) Thrombosis D(ANA pattern) acl IgA acl IgM Figure 2. Structural graph (419 patients from tables A and B) Note that the influence of a measured variable on another can be prop a- gated along a path in the graph. Tetrad allows some background knowledge to be incorporated in a number of ways. In the set of results described here we used te m- poral constraints to avoid having gender and age be caused by other variable. The

Mining Medical Data for Causal and Temporal Patterns 5 variables that temporally precede all others (i.e. cannot be caused by them) are shown ahs diamonds in the figures. For convenience, we distinguish in the graphs between test results and medical conditions such as diagnoses by using a box for a node corresponding to a test and an ellipse for a node corresponding to a medical condition. It is apparent from figures 1 and 2 that some causal patterns remain u n- changed but others changed as the attributes and the sample size change. For exa m- ple, both graphs imply that APS is a cause for thrombosis. However, the causal arc connecting APS and acl IgG got reversed. Moreover, in some situations, the pr o- gram could not find a consistent orientation to satisfy the causal patterns deduced from the data. This can be a limitation as Tetrad generates acyclic directed graphs and some causal patterns are not acyclic. It is worth noting that APS, identified by Tetrad as a cause for thrombosis, is also chosen as the root of a decision tree gene r- ated by C5.0. 4 Temporal Analysis The temporal data in Table C is rather sparse and weakly related to thrombosis and we have not been able to extract useful temporal patterns showing signs or immed i- ate consequences of thrombosis. Instead, we considered the time between infection with an auto-immune disease and developing thrombosis. This study uses survival analysis (or event history analysis) [1] to represent the evolution over time or the rate of occurrence of thrombosis. Initially at the time of diagnosis all the patients diagnosed with an auto-immune disease are considered at risk. As time evolves some patients develop thrombosis, some are known not to have developed (i.e. negative thrombosis results), and for the remaining patients, we do not know (ce n- sored). This temporal analysis has shown that the rate of occurrence of type 1 thrombosis is rather constant over time. Figure 3 shows the survival function for this type. Type 2 results are inclusive and rather problematic. Type 3 thrombosis seems is rare during the first few years after thrombosis but increases after seven years. Another set of temporal analysis tried to find a re lationship between the rate of occurrence of thrombosis of any type following the diagnosis of each autoimmune disease. For many diseases, the number of patients who developed thrombosis was so small that it was not possible to extract a pattern. Table 1. Time to thrombosis for SJS patients Number Number Exposed Survivor Std. Cum. Time Events Censored to Risk Function Error Rate 0.00 0 0 113 1.00000 0.00000 0.00000 1.00 3 0 113 0.97345 0.01512 0.02691 46.00 1 65 45 0.95182 0.02600 0.04938 238.00 1 8 36 0.92538 0.03632 0.07755 247.00 1 0 35 0.89894 0.04386 0.10654 570.00 1 3 31 0.86994 0.05114 0.13933

6 A. Tawfik and K. Strickland 905.00 1 3 27 0.83772 0.05852 0.17707 932.00 1 0 26 0.80550 0.06453 0.21629 Fig. 3. Survival curve for type 1 thrombosis (Number of patients who survived versus time in days) Fig. 4. Survival curve for type 3 thrombosis (Number of patients who survived versus time in days) Table 1 is a sample of the survival data obtained for the risk of thrombosis over time following a particular diagnosis. It is clear that the rate of occurrence increases gradually as well as the standard error as the population decreases over time due to censoring and the occurrence of the event. Note that the survivor function in the

Mining Medical Data for Causal and Temporal Patterns 7 table represents a cumulative probability distribution over time not the rate of occu r- rence of the event usually referred to as the hazard rate. 5 Conclusions Structure equation models can be a useful tool for exploratory data analysis and data mining. Here, we used Tetrad to perform a causal analysis of a medical data set. The results presented show that using such tool may be useful. Tetrad assumes linear normally distributed data. Such assumptions do not necessarily hold here. Moreover, as the data lacks information about interventions on the date of thrombosis, we are concerned that results may be reflective of some interventions rather than the normal progression of the condition. In addition to the causal analysis, this work presented some initial finding in a temporal data analysis of this data set. Both the causal and temporal analysis would have benefited from a larger data set. References 1. Allison, P. (1984). Event History Analysis. Sage, Beverly Hills,CA. 2. Cooper, G. and Herskovits, E. (1992). A Bayesian Method for the Induction of Probabilistic Networks from Data. Machine Learning, Vol. 9, 308-347. 3. Eells,E.{1991}. Probabilistic Causality. Cambridge Studies in Probability, I n- duction and Decision Theory. Cambridge University Press, Cambridge, MA. 4. Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems. Morgan Kaufman, San Mateo, California. 5. Sprites, P., Glymour,C. and Schienes, R. (1993) Causation, Prediction, and Search. Springer- Verlag Lecture Notes in Statistics 81, Springer-Verlag, NY. 6. Suppes, P. (1970) A probabilistic Theory of Causality, North-Holland, Amsterdam.