Discovering Meaningful Cut-points to Predict High HbA1c Variation

Similar documents
AN INFORMATION VISUALIZATION APPROACH TO CLASSIFICATION AND ASSESSMENT OF DIABETES RISK IN PRIMARY CARE

Model reconnaissance: discretization, naive Bayes and maximum-entropy. Sanne de Roever/ spdrnl

A Comparison of Collaborative Filtering Methods for Medication Reconciliation

Methods for Predicting Type 2 Diabetes

Application of Artificial Neural Network-Based Survival Analysis on Two Breast Cancer Datasets

WP9: CVD RISK IN OBESE CHILDREN AND ADOLESCENTS

Positive and Unlabeled Relational Classification through Label Frequency Estimation

Positive and Unlabeled Relational Classification through Label Frequency Estimation

Identifying Parkinson s Patients: A Functional Gradient Boosting Approach

Computational Neuroscience. Instructor: Odelia Schwartz

SUPPLEMENTAL MATERIAL

Conditional Outlier Detection for Clinical Alerting

Introduction to Machine Learning. Katherine Heller Deep Learning Summer School 2018

A Survey on Prediction of Diabetes Using Data Mining Technique

An Improved Algorithm To Predict Recurrence Of Breast Cancer

Keywords Missing values, Medoids, Partitioning Around Medoids, Auto Associative Neural Network classifier, Pima Indian Diabetes dataset.

Using Perceptual Grouping for Object Group Selection

THE data used in this project is provided. SEIZURE forecasting systems hold promise. Seizure Prediction from Intracranial EEG Recordings

Predictive and Similarity Analytics for Healthcare

TITLE: A Data-Driven Approach to Patient Risk Stratification for Acute Respiratory Distress Syndrome (ARDS)

Predicting Sleep Using Consumer Wearable Sensing Devices

A NOVEL VARIABLE SELECTION METHOD BASED ON FREQUENT PATTERN TREE FOR REAL-TIME TRAFFIC ACCIDENT RISK PREDICTION

A Deep Learning Approach to Identify Diabetes

Logistic Regression and Bayesian Approaches in Modeling Acceptance of Male Circumcision in Pune, India

Predicting Diabetes and Heart Disease Using Features Resulting from KMeans and GMM Clustering

A HMM-based Pre-training Approach for Sequential Data

International Journal of Advance Research in Computer Science and Management Studies

Predicting the Effect of Diabetes on Kidney using Classification in Tanagra

Predicting Breast Cancer Survivability Rates

Error Detection based on neural signals

Predicting Breast Cancer Survival Using Treatment and Patient Factors

Evolutionary Programming

Classification of EEG signals in an Object Recognition task

What you should know before you collect data. BAE 815 (Fall 2017) Dr. Zifei Liu

4. Model evaluation & selection

Statistical questions for statistical methods

Hemodynamic Monitoring Using Switching Autoregressive Dynamics of Multivariate Vital Sign Time Series

Graphical Exploration of Statistical Interactions. Nick Jackson University of Southern California Department of Psychology 10/25/2013

A Method for Analyzing Commonalities in Clinical Trial Target Populations

A Fuzzy Improved Neural based Soft Computing Approach for Pest Disease Prediction

Variable Features Selection for Classification of Medical Data using SVM

Diagnosis of Breast Cancer Using Ensemble of Data Mining Classification Methods

Review: Logistic regression, Gaussian naïve Bayes, linear regression, and their connections

SEPTIC SHOCK PREDICTION FOR PATIENTS WITH MISSING DATA. Joyce C Ho, Cheng Lee, Joydeep Ghosh University of Texas at Austin

White Paper Estimating Complex Phenotype Prevalence Using Predictive Models

Prediction of Diabetes Using Bayesian Network

MTH 225: Introductory Statistics

SPRING GROVE AREA SCHOOL DISTRICT. Course Description. Instructional Strategies, Learning Practices, Activities, and Experiences.

On the feasibility of speckle reduction in echocardiography using strain compounding

Chapter 1. Introduction

Prepared by: Assoc. Prof. Dr Bahaman Abu Samah Department of Professional Development and Continuing Education Faculty of Educational Studies

CSE 258 Lecture 2. Web Mining and Recommender Systems. Supervised learning Regression

Mining first-order frequent patterns in the STULONG database

Statistics: A Brief Overview Part I. Katherine Shaver, M.S. Biostatistician Carilion Clinic

Sentiment Analysis of Reviews: Should we analyze writer intentions or reader perceptions?

An empirical evaluation of text classification and feature selection methods

Improved Intelligent Classification Technique Based On Support Vector Machines

Learning to Cook: An Exploration of Recipe Data

MODELING AN SMT LINE TO IMPROVE THROUGHPUT

DPPred: An Effective Prediction Framework with Concise Discriminative Patterns

Supporting Information Identification of Amino Acids with Sensitive Nanoporous MoS 2 : Towards Machine Learning-Based Prediction

Unsupervised Pattern Discovery in Sparsely Sampled Clinical Time Series

EARLY STAGE DIAGNOSIS OF LUNG CANCER USING CT-SCAN IMAGES BASED ON CELLULAR LEARNING AUTOMATE

38 Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'16

Deep Learning Models for Time Series Data Analysis with Applications to Health Care

Bayesian models of inductive generalization

Towards Learning to Ignore Irrelevant State Variables

Reader s Emotion Prediction Based on Partitioned Latent Dirichlet Allocation Model

Supervised Learner for the Prediction of Hi-C Interaction Counts and Determination of Influential Features. Tyler Yue Lab

IJREAS Volume 2, Issue 2 (February 2012) ISSN: LUNG CANCER DETECTION USING DIGITAL IMAGE PROCESSING ABSTRACT

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES

Modelling and Application of Logistic Regression and Artificial Neural Networks Models

ABSTRACT I. INTRODUCTION. Mohd Thousif Ahemad TSKC Faculty Nagarjuna Govt. College(A) Nalgonda, Telangana, India

Data Mining and Knowledge Discovery: Practice Notes

Case-based reasoning using electronic health records efficiently identifies eligible patients for clinical trials

SYNOPSIS. Publications No publications at the time of writing this report.

A Learning Method of Directly Optimizing Classifier Performance at Local Operating Range

ECG Beat Recognition using Principal Components Analysis and Artificial Neural Network

Framework for Comparative Research on Relational Information Displays

CHAPTER 3 PROBLEM STATEMENT AND RESEARCH METHODOLOGY

Computational Cognitive Neuroscience

Project exam in Cognitive Psychology PSY1002. Autumn Course responsible: Kjellrun Englund

Data Mining Diabetic Databases

International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 1, Jan Feb 2017

Classification of ECG Data for Predictive Analysis to Assist in Medical Decisions.

Visualizing Temporal Patterns by Clustering Patients

An Improved Patient-Specific Mortality Risk Prediction in ICU in a Random Forest Classification Framework

Rajiv Gandhi College of Engineering, Chandrapur

Adaptive Treatment of Epilepsy via Batch Mode Reinforcement Learning

AGE IN THE DEVELOPMENT OF CLOSURE ABILITY IN CHILDREN

Prediction of Diabetes Using Probability Approach

Association Rule Mining on Type 2 Diabetes using FP-growth association rule Nandita Rane 1, Madhuri Rao 2

Thought Technology Ltd.

The 29th Fuzzy System Symposium (Osaka, September 9-, 3) Color Feature Maps (BY, RG) Color Saliency Map Input Image (I) Linear Filtering and Gaussian

3. Model evaluation & selection

Reveal Relationships in Categorical Data

Does Machine Learning. In a Learning Health System?

Informative Gene Selection for Leukemia Cancer Using Weighted K-Means Clustering

Effective Values of Physical Features for Type-2 Diabetic and Non-diabetic Patients Classifying Case Study: Shiraz University of Medical Sciences

Transcription:

Proceedings of the 7th INFORMS Workshop on Data Mining and Health Informatics (DM-HI 202) H. Yang, D. Zeng, O. E. Kundakcioglu, eds. Discovering Meaningful Cut-points to Predict High HbAc Variation Si-Chi Chin, University of Iowa W. Nick Street, University of Iowa Ankur Teredesai, University of Washington - Tacoma Abstract Monitoring HbAc, the measurement for the average blood glucose level, is important to diabetic patients and may help improve their treatment. This study aims to train learning algorithms that predict patients with high variation in their HbAc readings. Attributes in clinical data are often continuous, such as age, blood pressure, and lab tests. However, many machine learning algorithms work better or work only with categorical attributes. We propose using discretization processes to identify meaningful cut points for continuous attributes. For example, the study could help understand questions like, What is the range of age and BMI for diabetic patients that have high variation in their HbAc readings? The discretization process finds the number of intervals and the boundaries for the intervals. The process may reveal meaningful cut points in continuous attributes and contribute knowledge to the medical domain. Discretization process also allows us to discover suitable intervals and boundaries for a particular attribute depending on variety of parameters that can be controlled by the experimenter. Keywords: HbAc Monitoring, Discretization algorithm, Decision Tree Introduction HbAc is a measurement for the average blood glucose level and is commonly used to monitor diabetes patients. In this paper, we explore methods to analyze patients who exhibit high variation for their HbAc monitoring. We adopt Time Abstraction (TA) based analysis [] to derive useful summaries of time series. In addition, we employed discretization algorithms to identify meaningful cut-points among continuous features to enhance the analysis. The goal of the proposed methods is to improve predictive generalization and translate the discretization results into interpretable clinical knowledge. In clinical data analysis, discrete features are easier to interpret for both data scientists and clinicians. Discrete features are closer to a knowledge-level representation than continuous ones. Prior research also indicates that discretization makes learning faster and more accurate [2]. Discretization methods determine cut-points (or splitpoints) for continuous features, dividing a range of continuous values into intervals of various lengths. The discovery of meaningful cut-points would enhance knowledge to the question: what is the range of age and BMI for diabetic patients that are more inclined to have high variance in their HbAc readings? Variation for HbAc can be observed from two aspects: the variance and the delta. Variance indicates the usual difference between a measurement and the mean of those measurements, while the delta is the mean of the differences between consecutive readings. Figure a exemplifies ten HbAc readings for two patients, represented separately by a solid red line (Patient A) and a dashed blue line (Patient B). Both patients have the same variance (0.08) for their HbAc time series. However, Patient A has higher delta (0.46) than Patient B (0.5). Although Figure b shows that the variance and the delta are linearly correlated, they represent different concepts. In this paper, we aim to learn methods that explain and predict patients with high variance and high delta for their HbAc. We define that a patient has high variance and delta if the value is higher than one standard deviation away from the mean of the data. This work is done during the author s visit to the Center for Web and Data Science at the University of Washington - Tacoma.

2 4 6 8 0 6.0 6.5 7.0 7.5 8.0 Delta vs. Variance Sequence of HbAc Readings HbAc (a) Illustration of delta and variance 0 5 0 5 0.0 0.5.0.5 2.0 2.5 3.0 3.5 Delta and Variance hbac_var hbac_delta (b) Correlation between delta and variance Figure : Variance vs. Delta Time Series Abstraction Raw Data Input Data Mining Discretized Data Discretization (Cut-point discovery) Predictive Models Visualization Models Input/ Output Procedure Figure 2: Analysis framework The remainder of the paper is organized as follows. In the next section, we propose an analysis framework that incorporates the TA-based analysis and the discretization methods. In Section 3, we summarize the data and detail our experiments. In Section 4, we discuss the experimental results. Section 5 concludes the work and outlines directions for future research. 2 Methods We incorporated TA based analysis and two discretization methods. Figure 2 shows the proposed analysis. The preprocessed input data contains the sequence of measurements for diastolic blood pressure, systolic blood pressure, BMI, and patient age, ordered by the sequence of visits. Each sequence of measurements is a time series feature. As shown in Figure 2, we first summarize the preprocessed time series to obtain abstract descriptions of the data. To provide this higher level description of the time series, we computed the mean, the variance, and delta (i.e. the difference between two consecutive data), and the variance of delta for each time series feature. As described in the proposed framework (see Figure 2), we apply discretization methods on these summaries of time series to discover meaningful cut-points. We then use visualization techniques to examine the cut-points derived 2

from different discretization algorithms. The visualizations of different cut-points support the interpretation of the data. The discretized features is then used for data mining tasks to learn predictive models. The proposed framework aims to investigate how to discover meaningful cut-points through visualizations and whether these summaries could be used to learn some characteristics of the patients. We adopted Minimum Description Length Principle (MDLP) and ChiMerge to discretize continuous features. MDLP is a supervised, top-down (splitting) discretization method. It selects cut-points that maximizes the information entropy. ChiMerge is a supervised, bottom-up (merging) discretization method. It uses the chi-square statistic to determine if the class frequencies of the two intervals are significantly different. Our predictive models include logistic regression model and decisions trees. We compare the performance of logistic regression model on the original continuous features and the discretized features. The logistic regression model over continuous attributes is used as a baseline to compare with a discretized logistic regression. The performance is measured in AUC and F scores, averaging from 0-fold cross validation. 3 Data Description We used real patient data extracted from the patient record system at the University of Iowa Hospitals and Clinics. The data contains the engineered summary of time series clinical data. The data includes,208 patients who have at least 0 recorded HbAc readings. All patients have a diagnosis of Type II diabetes. Table describes the variables used in the data. The processed data include time abstractions (i.e. mean, variance, delta, and variance of delta) for diastolic blood pressure, systolic blood pressure, BMI, and the patient age. Since the class label is determined by the feature hbac var, and hbac delta, we excluded the two features from the data for the classification tasks. Table 2 shows the correlation results of the 4 explanatory variables (x-variables) and 2 predicted variables (yvariables, i.e. hbac var and hbac delta). The results indicate that the four time abstractions provide different information for the predicted variables. As shown in Table 2, although bp sys mean does not have significant correlation to hbac var and hbac delta, bp sys var and bp sys delta exhibit correlations. In addition, although bmi mean is significantly correlated to hbac var and hbac delta, bmi var is not. In this paper, we aim to learn methods that explain and predict patients with high variance and high delta for their HbAc. We define that a patient has high variance and delta if the value is higher than one standard deviation above the mean of the data. The final processed data for classification tasks contains 3 patients are labeled positive for having high variance for HbAC and 65 patients are labeled positive for having high delta for HbAc. Table : Variable description. HbAc related metrics are excluded from predictive models. Index Variable Description bp dia mean Mean of diastolic blood pressure 2 bp dia var Variance of diastolic blood pressure 3 bp dia delta Difference between two consecutive diastolic blood pressure 4 bp dia delta var Variance of the difference between two consecutive diastolic blood pressure 5 bp sys mean Mean of systolic blood pressure 6 bp sys var Variance of systolic blood pressure 7 bp sys delta Difference between two consecutive systolic blood pressure 8 bp sys delta var Variance of the difference between two consecutive systolic blood pressure 9 bmi mean Mean of BMI 0 bmi var Variance of BMI bmi delta Difference between two consecutive BMI 2 bmi delta var Variance of the difference between two consecutive BMI 3 pat min age Lowest recorded age for a patient 4 pat max age Highest recorded age for a patient 5 hbac var Variance of HbAc 6 hbac delta Difference between two consecutive HbAc readings 7 class var Patients with HbAc variance larger than one standard deviation 8 class delta Patients with HbAc delta larger than one standard deviation 3

Table 2: Correlations between variable. * indicates the significance level <0.05. index 2 3 4 5 6 7 8 9 0 2 3 4 hbac var 0.20* 0.24* 0.22* 0.2* 0.02 0.5* 0.* 0.3* 0.09* 0.007 0.06* 0.06* -0.7* -0.9* hbac delta 0.22* 0.23* 0.22* 0.8* 0.04 0.6* 0.3* 0.5* 0.* 0.0 0.08* 0.07* -0.9* -0.2* Table 3: Performance Comparison for AUC and F AUC F Learner Y Raw MDLP ChiMerge Raw MDLP ChiMerge Logistic Variance 0.657 0.735 0.728 0.07 0 0.535 Regression Delta 0.694 0.752 0.8 0.045 0.09 0.579 Decision Tree Delta 0.65 0.527 0.646 0.052 0.066 0. 4 Predictive Models We constructed logistic regression models and decision trees using both the numeric features and the discretized features. Table 3 compares the performances of three models a logistic regression model to predict high variance, a logistic regression model to predict high delta, and a decision tree model to predict high delta. We also compare the results between the original continuous features, the features discretized by MDLP method, and the features discretized by the ChiMerge method. We use the models that constructed with the original continuous features as the baseline to evaluate the performance. The results show that using features discretized by ChiMerge methods consistently enhance the performance of AUC and F score for the three predictive models. Although using features discretized by MDLP method introduces mixed results to the models, it outperforms the baseline on both AUC and F scores for the logistic regression model that predicts high delta. The MDLP method also achieves higher performance on F score for the decision tree model and has higher AUC for the logistic regression model that predicts the high variance. The experimental results show promise using discretized features to enhance predictive models. 5 Cut-points Visualization The choice of cut-points could affect the perceptions and the understanding of the data. Figure 5 shows how the choices of alpha parameters changes the number of selected cut-points. Experts in medical domain may provide input on how the parameters could be tuned. For example, Figure 3c is simpler to interpret than Figure 3b and Figure 3c. However, Figure 3b provides a more granulated view of the class distribution, indicating that the negative correlation observed in Table 2 may mostly exhibit among patients older than 39 years old. This pattern does not exhibit in a simpler discretization results. It requires experiments and collaborations with domain experts to select the most appropriate and useful number of cut-points. Figure 5 shows six examples of cut-points visualization. Each sub-figure in Figure 5 shows the distribution of class label for each discretized interval. The width of each vertical columns reflects the size of intervals the number of patients within the interval. The goal of cut-point visualizations is to provide graphical representations of the data to reveal patterns that are concealed in correlation analysis. In general, MDLP method provides a discretization result that is easier to interpret. However, with different choices of alpha parameter, ChiMerge methods has the flexibility to represent the data with more granularity. For example, Figure 4b suggests that the group of patients between age and 32 has lower probability for high variance in HbAc despite the general linear trend that the variance decrease as the age increases. 6 Conclusion Data mining techniques are increasingly used in clinical domains such as treatment planning. However, simply building a better model traditionally the goal of the data mining community is not enough in these problems. Health care 4

Age and HbAc Delta (ChiMerge, alpha=0.00) Delta Class 0 2 Age and HbAc Delta (ChiMerge, alpha=0.0) Delta Class 0 2 3 4 5 Age and HbAc Delta (ChiMerge, alpha=0.) Delta Class 0 2 3 4 5 6 7 8 9 0 Lowest recorded patient age Lowest recorded patient age Lowest recorded patient age (a) (b) (c) Figure 3: Comparisons between choices of alpha parameter. Figure (a) has the cut-point at 56.5, Figure (b) has the cut-points at 38.5, 46.5, 56.5, 68.5, and Figure (c) has the cut-points at 2.5 9.5, 22.5, 23.5, 27.5, 38.5, 46.5, 56.5, 68.5. Table 4: Cut-points for Figure 5 X Y MDLP ChiMerge pat min age var {5.5} (Fig. 4a) {9.5, 32.5, 5.5} (Fig. 4b) pat min age delta {56.5} (Fig. 4c) {38.5, 46.5, 56.5, 68.5} (Fig. 4d) bmi delta delta {0.94} (Fig.4e) {0.58} (Fig. 4f) professionals are unlikely to trust models which are not understandable. Variable discretization can be an important tool in building models that are both accurate and interpretable. In this paper we explored the use discretization to aid in the prediction and understanding of which diabetic patients will have trouble controlling their blood sugar. Our conclusions are as follows. First, we note that HbAc variation is better measured using the delta measurement, or change in consecutive readings, as opposed to the more traditional variance measure. Clinically, we are looking for patients whose readings show large variations in the short term, while a large variance may indicate a continuous but undramatic trend. Applying discretization methods to continuous features such as age, blood pressure, and BMI provide insight into the profiles of patients with different variation properties, especially when used in conjunction with interpretable predictive models such as decision trees. For example, patients in their 40s show a much greater probability of high variation than those in their 60s and 70s. Finally, the discretization process often leads to improved generalization, as was seen with all of the models we tested. In the future we will expand our list of predictive variables and explore the use of multi-dimensional discretization methods. The goal is to build easily recognizable profiles of patients who may have difficulty controlling their blood sugar in order to help guide treatment planning. Acknowledgement This work was made possible by Grant Number ULRR024979 from the National Center for Research Resources (NCRR), part of the National Institutes of Health (NIH). References [] Riccardo Bellazzi and Ameen Abu-Hanna. Data mining technologies for blood glucose and diabetes management. Journal of diabetes science and technology, 3(3):603 62, May 2009. PMID: 2044300. [2] Huan Liu, Farhad Hussain, Chew Lim Tan, and Manoranjan Dash. Discretization: An enabling technique. Data Mining and Knowledge Discovery, 6(4):393 423, 2002. 5

(a) (b) (c) (d) (e) (f) Figure 4: Visualization examples for cut-points. See Table 4 for the values of cut-points. 6