Predicting Diabetes and Heart Disease Using Features Resulting from KMeans and GMM Clustering


Kunal Sharma
CS 4641 Machine Learning

Abstract

Clustering is a technique commonly used in unsupervised learning to find structure in unlabeled data. A supporting technique used to reduce the dimensionality of high-dimensional data is dimensionality reduction, often performed with a method called Principal Component Analysis (PCA). This paper explores the use of two clustering algorithms, K-Means and GMM, as well as PCA, on data related to the diagnosis of diabetes and heart disease. Both datasets are normalized so that features with higher magnitudes do not weigh considerably more than others and so that training time is reduced. Cross-validation is used to determine the accuracy of the models; each dataset is split randomly into a training set with 80% of the data and a test set with the remaining 20%. Particular attention is given to hyperparameter choices, various performance indicators, and clustering visualizations. The datasets were obtained from kaggle.com, and their web addresses are included in the references section. Python and Jupyter Notebook were used for development. Packages used for data manipulation and visualization include NumPy, Matplotlib, and Pandas; scikit-learn (sklearn) was used to implement the machine learning methods.

This paper has the following sections:

Section 1: Data Exploration
Section 2: Dimensionality Reduction and Clustering
Section 3: Clustering After PCA
Section 4: Neural Network Classification After PCA
Section 5: Neural Network Classification Using Extracted Features from Clustering
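The preprocessing code itself is not listed in the paper; a minimal sketch, assuming the Kaggle CSV is saved locally as diabetes.csv and that min-max scaling was the normalization used, might look like this:

```python
# Minimal preprocessing sketch. The file name, scaler choice, and random
# seed are assumptions; the paper only states that the features were
# normalized and the data split randomly 80/20.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("diabetes.csv")  # hypothetical local copy of the Kaggle CSV
X, y = df.drop(columns="Outcome"), df["Outcome"]

# Scale every feature to [0, 1] so high-magnitude columns do not dominate.
X_scaled = MinMaxScaler().fit_transform(X)

# Random 80/20 train/test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=0)
```

The same steps apply to the heart disease data with its own target column.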

1. Data Exploration

1.1 Diabetes Diagnosis Introduction

This dataset originally comes from the National Institute of Diabetes and Digestive and Kidney Diseases. Its aim is to use a pre-selected set of diagnostic measurements to predict whether a patient has diabetes. The patients in this dataset are females of Pima Indian heritage who are at least 21 years of age. The dataset has a total of 768 rows, a single target column (Outcome), and eight medical predictor columns detailed in the table below.

Feature                       Description
Pregnancies                   Number of times pregnant
Glucose                       Plasma glucose concentration at 2 hours in an oral glucose tolerance test
Blood Pressure                Diastolic blood pressure (mm Hg)
Skin Thickness                Triceps skin fold thickness (mm)
Insulin                       2-hour serum insulin (mu U/ml)
BMI                           Body mass index (weight in kg / (height in m)^2)
Diabetes Pedigree Function    Diabetes pedigree function score
Age                           Patient age (years)
Outcome                       Class variable (0 or 1)

Outcome Distribution

65% of the data represents patients without diabetes and 35% represents patients diagnosed with diabetes. This dataset is therefore unbalanced, and we should expect accuracies above 65% if our models are to be considered predictive.

Feature Data Distribution

Glucose, Blood Pressure, and BMI follow roughly normal distributions, while the remaining features are typically skewed to the right.

[Figure: feature histograms, left to right: Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree Function, Age]

1.2 Heart Disease Diagnosis Introduction

This dataset originally comes from the Cleveland database. Its aim is likewise to use a pre-selected set of diagnostic measurements, in this case to predict whether a patient has heart disease. The original dataset contained 76 attributes; this version contains the 13 most statistically significant attributes. The dataset has a total of 303 rows, a single target column (outcome), and thirteen medical predictor columns detailed in the following table.

Feature                                 Description
Age                                     Patient age
Sex                                     1 = Male, 0 = Female
Chest Pain Type                         4 values associated with chest pain types
Resting Blood Pressure                  Units in mm Hg
Serum Cholesterol                       Units in mg/dl
Fasting Blood Sugar                     >120 mg/dl
Resting Electrocardiographic Results    Values 0, 1, 2
Maximum Heart Rate Achieved             Beats/min
Exercise Induced Angina                 Numerical
Oldpeak                                 ST depression induced by exercise relative to rest
Peak Exercise Slope                     Slope of the ST segment (0-3)
Number of Major Vessels                 Colored by fluoroscopy
Thal                                    3 = normal, 6 = fixed defect, 7 = reversible defect
Heart Disease (Outcome)                 1 = Diagnosed, 0 = Not Diagnosed

Outcome Distribution

46% of the data represents patients diagnosed with heart disease and 54% represents patients without. This dataset is nearly balanced, and we should expect accuracies above 54% if our models are to be considered predictive.

Feature Data Distribution

[Figure: feature histograms, left to right: Age, Sex, CP, Trestbps, Chol, Fbs, Restecg, Thalach, Exang, Oldpeak, Slope, Ca, Thal]

2. Dimensionality Reduction and Clustering

In this section we apply Principal Component Analysis, K-Means clustering, and Gaussian Mixture Model clustering to our diagnostic datasets.

PCA is a dimensionality reduction technique used to summarize multiple features into only a few while retaining as much variance as possible. More specifically, the method uses orthogonal transformations to convert the data into a set of values of linearly uncorrelated variables called principal components.

K-Means clustering attempts to separate the given data into K groups that each have roughly the same amount of variance. Inertia, the within-cluster sum of squared distances, is minimized to ensure that the total distance of all data points to their respective cluster centroids is as small as possible. There are three main steps in the K-Means algorithm. First, the algorithm initializes the cluster centers, known as centroids. Each sample is then assigned to the closest centroid. Once all points have been assigned to a cluster, the centroid positions are recalculated from the new groups. This process repeats until there is little to no change between iterations.

The Gaussian mixture model is a method that attempts to fit Gaussian distributions to clusters of the unlabeled data using the expectation-maximization (EM) algorithm, an iterative procedure that is guaranteed to converge. EM first assumes random components, then calculates the probability of each point being generated by each component of the model, and then adjusts the parameters to maximize the likelihood of the data. This process repeats until convergence.

The elbow method is commonly used to find a good number of clusters. It examines the inertia value for several values of K; one selects the K after which the decrease in inertia is no longer significant. It is important to note that this method is not always reliable in finding the best number of clusters.

Key performance indicators include homogeneity, completeness, and silhouette. Homogeneity is high when each cluster contains only members of a single class. Completeness is high when all members of a given class are assigned to the same cluster. Silhouette measures how similar an instance is to its own cluster compared to other nearby clusters.

2.1 Applying PCA to the Diabetes Dataset

We use principal component analysis to reduce the dimensionality of our feature set from 8 to 2 and from 8 to 3. The data can now be visualized:

[Figure: 2D and 3D PCA projections of the diabetes data]

We can see that the negative instances are clustered mainly to the left and the positive instances to the right. There is, however, significant overlap in the data, which may make it harder for clustering algorithms to separate. It is also important to note that the principal components represent only 47% of the variance in the original data in 2 dimensions, and only 59% in 3 dimensions. This means we must be extra careful when interpreting visualizations of this data.

2.2 Applying PCA to the Heart Disease Dataset

We use principal component analysis to reduce the dimensionality of our feature set from 13 to 2 and from 13 to 3.

[Figure: 2D and 3D PCA projections of the heart disease data]

With this dataset as well, the clusters have significant overlap. The 2D and 3D principal components capture 87% and 96% of the variance in the data respectively. These results are significantly better than those of the diabetes dataset, meaning that fewer features in this dataset contribute significantly more of the overall variance.
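A short sketch of how the retained-variance percentages quoted above can be computed (X_scaled is the normalized feature matrix from the earlier preprocessing sketch):

```python
# Project the normalized features onto 2 and 3 principal components and
# report how much of the original variance each projection retains.
from sklearn.decomposition import PCA

for n in (2, 3):
    pca = PCA(n_components=n).fit(X_scaled)
    print(n, "components retain",
          round(100 * pca.explained_variance_ratio_.sum(), 1), "% of the variance")
```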

2.3 Clustering the Diabetes Dataset using K-Means

After graphing the elbow curve to determine the optimal number of clusters, we find that the answer is not immediately clear. This suggests that the data is not cleanly separable, which is likely given its high dimensionality. While it is not obvious, the elbow seems to be at 3 clusters. We can plot the clusters at nearby values to determine whether 3 is a good choice.

[Figure: K-Means clusterings of the diabetes data at K = 2, 3, 4, 5]

At K = 4 and K = 5 there is a significantly high amount of overlap between the clusters, so K = 3 seems like the best choice. Homogeneity increases as more clusters are added; at around K = 3 or K = 4, all of the performance scores are higher than at the other values.

2.4 Clustering the Heart Disease Dataset using K-Means

Similar to the diabetes data, K = 3 seems to be the best choice. The performance metrics stay more or less constant.

[Figure: K-Means clusterings of the heart disease data at K = 2, 3, 4, 5]

The plot at K = 2 looks most like the original dataset, even though it is divided vertically while the original dataset seemed to be divided horizontally from the center. The accuracy of K = 2 in labelling the original points is very high:

K-Means training accuracy for K = 2: 71.01%
K-Means testing accuracy for K = 2: 70.78%
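A sketch of how the elbow curve, the quality scores, and the quoted cluster-labelling accuracies can be computed; mapping each cluster to its majority training class is an assumption about how the accuracies above were obtained:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_score, completeness_score, silhouette_score

# Elbow curve and clustering quality scores over a range of K.
for k in range(2, 9):
    km = KMeans(n_clusters=k, random_state=0).fit(X_scaled)
    print(k, km.inertia_,                          # within-cluster sum of squares
          homogeneity_score(y, km.labels_),        # one class per cluster?
          completeness_score(y, km.labels_),       # one cluster per class?
          silhouette_score(X_scaled, km.labels_))  # cohesion vs. separation

# Cluster-labelling accuracy at K = 2: assign each cluster the majority
# outcome of its training members (assumed methodology), then score.
y_tr, y_te = y_train.to_numpy(), y_test.to_numpy()
km2 = KMeans(n_clusters=2, random_state=0).fit(X_train)
majority = {c: np.bincount(y_tr[km2.labels_ == c]).argmax() for c in (0, 1)}
train_pred = np.array([majority[c] for c in km2.labels_])
test_pred = np.array([majority[c] for c in km2.predict(X_test)])
print("train acc:", (train_pred == y_tr).mean(),
      "test acc:", (test_pred == y_te).mean())
```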

2.5 Clustering the Diabetes Dataset using GMM

Training accuracy for GMM at K = 2: 66.61%
Testing accuracy for GMM at K = 2: 69.48%

This accuracy is slightly lower than the accuracy for K-Means, which may be due to overlapping clusters. The AIC/BIC curve is lowest at K = 4, and all of the performance indicators jump at K = 5. K = 5 may therefore be the best choice here, even though BIC is increasing at K = 5.

[Figure: visualization of the GMM at K = 3 clusters]

2.6 Clustering the Heart Disease Dataset using GMM

The AIC/BIC curve is lowest at K = 3 and K = 4. The performance indicators stay relatively constant, and there is a small bump in the log likelihood at K = 4.
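The AIC/BIC and log-likelihood curves referenced in Sections 2.5 and 2.6 can be produced along these lines (a sketch; lower AIC/BIC indicates a better model):

```python
from sklearn.mixture import GaussianMixture

# Fit a Gaussian mixture for each candidate K and compare information criteria.
for k in range(2, 9):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X_scaled)
    print(k, "AIC:", gmm.aic(X_scaled), "BIC:", gmm.bic(X_scaled),
          "avg log-likelihood:", gmm.score(X_scaled))
```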

3. Clustering After PCA

3.1 Clustering the Diabetes Dataset using K-Means after PCA

Since the silhouette score decreases after K = 3 and K = 4, we can say that K = 3 or K = 4 are the best K values. The elbow is more clearly at 3 after PCA is conducted. The visualizations look the same.

[Figure: K-Means clusterings of the diabetes data after PCA at K = 2, 3, 4, 5]

It is important to note that all of the performance indicators have higher values after PCA. This is likely because the dimensionality has been reduced, so the points no longer suffer from the curse of dimensionality.

3.2 Clustering the Heart Disease Dataset using K-Means after PCA

Unlike for the diabetes dataset, after PCA the elbow method does not clearly indicate that the best value is 3. The visualizations still look very similar, and the performance evaluation scores are higher after PCA.

[Figure: K-Means clusterings of the heart disease data after PCA at K = 2, 3, 4, 5]

3.3 Clustering the Diabetes Dataset using GMM after PCA

The AIC/BIC curves are increasing. The average log-likelihood loss decreases as the number of clusters increases, meaning greater accuracy (less loss). The silhouette score decreases as K increases while homogeneity goes up, which means that smaller K values are optimal here.
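The "clustering after PCA" experiments in this section amount to reducing the data first and then clustering in the reduced space; a minimal sketch (the pipeline wrapper and the choice of 3 components are conveniences, not taken from the paper):

```python
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Reduce to 3 principal components, then cluster in the reduced space.
pca_kmeans = make_pipeline(PCA(n_components=3),
                           KMeans(n_clusters=3, random_state=0))
labels = pca_kmeans.fit_predict(X_scaled)
```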

3.4 Clustering the Heart Disease Dataset using GMM after PCA

This dataset shows the same results as the diabetes dataset for GMM after PCA.

4. Neural Network Classification After PCA

Instead of using the original data to train our neural network, we can use the data after the original 13-dimensional data is reduced to 3 dimensions using principal component analysis. The data can be visualized below:

[Figure: 3D PCA projection of the heart disease data]

The neural network obtained an accuracy of 62.2%, which is higher than the baseline prediction of 54%, meaning that our model is predictive. Its performance, however, is still far below that of a neural network trained directly on the original data (91% accuracy). This might be because the original data contains 100% of the variance, whereas some variance is lost after PCA.

5. Neural Network Classification Using Extracted Features from Clustering

5.1 Neural Network Classification Using Extracted Features from K-Means Clustering

Instead of using the original data, we can use features derived from our K-Means clustering as a form of dimensionality reduction to train our model. We select the following features: the cluster classification, the distance to the closest cluster, the distance to the farthest cluster, and the average distance to all clusters. These features were chosen because they likely capture the relationships in the data. The feature set is visualized below; we can clearly see three clusters.

[Figure: 2D and 3D visualizations of the K-Means feature set]
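A sketch of how the four K-Means features can be derived with scikit-learn and fed to a neural network; the network architecture is an assumption, as the paper does not specify it:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier

kmeans = KMeans(n_clusters=3, random_state=0).fit(X_train)

def kmeans_features(X):
    d = kmeans.transform(X)   # distance from each sample to every centroid
    return np.column_stack([
        kmeans.predict(X),    # cluster classification
        d.min(axis=1),        # distance to closest cluster
        d.max(axis=1),        # distance to farthest cluster
        d.mean(axis=1),       # average distance to all clusters
    ])

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
clf.fit(kmeans_features(X_train), y_train)
print("test accuracy:", clf.score(kmeans_features(X_test), y_test))
```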

The neural network obtained an accuracy of only 55.7% after training. This is much lower than the baseline prediction accuracy of 65%, so the model can be considered not predictive. This implies that the features extracted by the K-Means algorithm do not adequately represent the original data. It may be worth exploring different cluster values to improve the accuracy; we could also try adding further features from the K-Means model.

5.2 Neural Network Classification Using Extracted Features from GMM Clustering

Similar to what we did with K-Means, instead of using the original data we can use features from our GMM clustering as a form of dimensionality reduction to train our model. We select the following features: the GMM classification, the highest probability of belonging to a cluster, the average probability of belonging to a cluster, and the weighted log probability. These features were chosen because they likely capture the variance and relationships in the data. The feature set is visualized below; again we clearly see clusters. Since 87% of the variance is explained by the 2D PCA, we can be confident that the data is clearly clustered.

[Figure: 2D and 3D visualizations of the GMM feature set]

The neural network obtained an accuracy of 68.8%, which is significantly higher than the baseline prediction of 54%. This implies that the features extracted by the GMM algorithm captured a considerable amount of the variance of the original dataset. This performance is better than when the neural network was trained on the PCA results (only 62.2% accuracy), but it is still far below the 91% accuracy achieved when the neural network is trained on the original data. Again, to improve the accuracy we could explore different cluster values and try adding additional features from the Gaussian Mixture Model.
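The GMM features can be assembled analogously (a sketch; note that because the rows of predict_proba sum to 1, the average membership probability is constant at 1/K):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, random_state=0).fit(X_train)

def gmm_features(X):
    p = gmm.predict_proba(X)  # per-component membership probabilities
    return np.column_stack([
        gmm.predict(X),       # GMM classification
        p.max(axis=1),        # highest probability of belonging to a cluster
        p.mean(axis=1),       # average membership probability (constant 1/K)
        gmm.score_samples(X), # weighted log probability of each sample
    ])

# These features would then train the same classifier as in Section 5.1.
```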

6. Conclusion

Dimensionality reduction using PCA is a powerful way to represent data in lower dimensions. After PCA, performance metrics consistently increased for the clustering methods. However, training with data that has undergone dimensionality reduction results in lower classification accuracy, likely because much of the variance in the original data is lost. While features extracted by clustering algorithms like K-Means and GMM can be used as a form of dimensionality reduction, the resulting classification performance falls short of neural networks trained on the original data by a large margin. Future work could add the features extracted by clustering to the original feature set in an attempt to increase classification accuracy.

7. References

Diabetes Dataset: https://www.kaggle.com/uciml/pima-indians-diabetes-database
Heart Disease Dataset: https://www.kaggle.com/ronitf/heart-disease-uci