Predicting Juvenile Diabetes from Clinical Test Results


Shibendra Pobi and Lawrence O. Hall
2006 International Joint Conference on Neural Networks, Sheraton Vancouver Wall Centre Hotel, Vancouver, BC, Canada, July 16-21, 2006

(This research was partially supported by National Institutes of Health grant 1U54RR019259-01. Lawrence O. Hall and Shibendra Pobi are with the Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620; e-mail: {hall, spobi}@cse.usf.edu.)

Abstract - Two approaches to building models for prediction of the onset of Type 1 diabetes mellitus in juvenile subjects were examined. A set of tests performed immediately before diagnosis was used to build classifiers to predict whether the subject would be diagnosed with juvenile diabetes. A second training set, consisting of differences between test results taken at different times, was used to build classifiers to predict whether a subject would be diagnosed with juvenile diabetes. Neural networks were compared with decision trees and ensembles of both types of classifiers. In both cases, the highest known predictive accuracy was obtained when the data was encoded to explicitly indicate missing attributes. In the latter case, high accuracy was achieved without test results which, by themselves, could indicate diabetes.

I. INTRODUCTION

There has been a lot of study in the area of machine learning on data from the domain of diabetes, especially on the Pima Indian Diabetes dataset [7], [10], [16] from the UCI repository. There has also been tremendous interest in using machine learning algorithms for post-diagnosis care, such as prediction of blood glucose levels to control the dosage of insulin [3] and the use of association rules to predict the occurrence of certain diseases in diabetic patients [4], [5]. However, our study differs from both of these approaches: we use a dataset that is not restricted to a particular ethnicity but to a specific type of diabetes, namely Type 1 diabetes in the juvenile population, and our objective is not to monitor diabetic patients but to learn a model that predicts the occurrence of this type of diabetes from a patient's past medical records.

We used two types of classifiers, C4.5-based decision trees [1] and Cascade Correlation [2] based neural networks, to separate diabetic cases from non-diabetic ones using subject test results. This is not known to be a difficult problem for physicians, but the reader will see that building a good predictive model was not trivial. Next, we used the same base classifiers to predict diabetes, but this time the attributes were the differences in test results between consecutive tests of the same type for a subject. This approach has the promise of allowing a prediction that someone is susceptible to diabetes before any test results indicate they may have it. Both random forests of decision trees [10] and bagged classifiers [9], for both neural networks and decision trees, were used. Surprisingly, it was necessary to explicitly encode missing attributes to achieve over 95% accuracy in diabetes prediction for both decision trees and neural networks. The ensemble classifiers provided the best accuracy. Decision tree classifiers were comparable to cascade correlation neural network classifiers. Of interest was the fact that approximately 80% accuracy can be obtained in predicting diabetes from differences in test results over time, without using data from the last time period before a subject is diagnosed as diabetic.

II. PREVIOUS RESEARCH WORK ON DIABETES DATA

Most of the work related to machine learning in the domain of diabetes diagnosis has concentrated on the study of the Pima Indian Diabetes dataset in the UCI repository. In this context, Shanker [7] used neural networks to predict the onset of diabetes mellitus in Pima Indian women and showed that neural networks are a viable approach for such classification. Using neural networks with one hidden layer, he obtained an overall accuracy of 81.25%, which was higher than the prediction accuracy obtained with a logistic regression method (79.17%) and the ADAP model [12] (76%). Many other papers have reported results on this dataset.

Research on diabetes data (other than the Pima Indian dataset) related to the application of machine learning techniques has mainly focused on trying to predict and monitor the Blood Glucose Levels (BGL) of diabetic patients [3] or possible health risks of such patients ([4] and [5]). In [3], a combination of Artificial Neural Networks (ANN) and a Neuro-Fuzzy Optimizer is used to predict the BGL of a diabetic patient in the near future, and a possible schedule of diet and exercise, as well as the dosage of insulin, is then suggested. Although the BGL predictions were close to the actual readings, the dataset was restricted to only two Type 1 diabetic patients, which raises doubts about its usability for large groups. For experiments concentrating on predicting potential

health risks of diabetic patients, the machine learning algorithm of choice for most researchers is association rule mining. In [4], the authors make a comparative study of association rules and decision trees to predict the occurrence of certain diseases prevalent in diabetic patients. In [5], they deal solely with association rule mining on diabetes patient data, to come up with new rules for the prediction of specific diseases in such patients. A Local Causal Discovery (LCD [11]) algorithm ([8]) is used to study how causal structures can be determined from association rules and to generate rules that map symptoms to diseases. Moreover, exception rule mining leads to more useful rules from a medical point of view.

In [6], the prediction of diabetes from repeated Health Risk Appraisal (HRA) questionnaires of the participants using a neural network model was studied. It used a sequential multilayered perceptron (SMLP) with back propagation and captured the time sensitivity of the risk factors as a tool for prediction of diabetes among the participants. A hierarchy of neural networks was used, where each network outputs the probability of a subject getting diabetes in the following year. This probability value is then fed forward to the next neural network along with the HRA records for the next year. Results show improvement in accuracy over time, i.e., studying the risk factors over time rather than at any particular instant yields better results. With the SMLP approach, the maximum accuracy of prediction obtained was 99.3% for non-diabetics and 83.6% for diabetics at a threshold (of output probability from each neural network in the hierarchy) of 20%. While [6] focuses on the importance of the time sensitivity of the risk factors in diabetes prediction using only neural networks, our study compares decision tree learning methods (including random forests [10]) and an ensemble of neural networks applied to a specific juvenile diabetes dataset. Our study also differs from [6] in the following respects:

1) [6] used HRA records of employees from a manufacturing firm, with ages of the subjects ranging from 45 to 64, while our subjects are all juveniles.
2) The attributes in the dataset in [6] are general health parameters such as Body Mass Index (BMI), alcohol consumption, back pain, cholesterol, etc., which are completely different from the attributes we deal with here, such as intravenous glucose tolerance, C-peptides and other medical tests that are specific to Type 1 diabetes.

In our current study we have used data from the Diabetes Prevention Trial - Type 1 [15], which was the first large-scale trial in North America designed to test whether intervention with antigen-based therapies, parenteral insulin and oral insulin would prevent or delay the onset of diabetes. In [13] it was shown, using the DPT-1 data, that a strong correlation exists between first-phase (1 minute + 3 minute) insulin (FPIR) production during intravenous glucose tolerance tests (IVGTT) and risk factors for developing type 1 diabetes. In [14], the asymptomatic group of cases in the DPT-1 trial whose diabetes could be directly diagnosed by the 2-h criteria of the Oral Glucose Tolerance Test (OGTT) was studied. Both of these studies [13, 14] helped identify the tests we used for our training data.

III. DIABETES DATASET

The DPT-1 group collected data involving first-degree relatives (ages 3-45 years) or second-degree relatives (ages 3-20 years) of persons who had begun insulin therapy before the age of 40 years.

We have used data from a subset of 711 subjects from the entire dataset used in DPT-1. Our labeled set of data consisted of 455 cases (64%) with no diabetes label and 256 subjects (36%) who developed diabetes. The features or attributes consisted of the different types of tests performed on the subjects over time to determine the occurrence of Type 1 diabetes, and also included the subjects' race and sex. The major classes of test results considered in the dataset were:

1) IVGTT: Intravenous Glucose Tolerance Test
2) IAA: Insulin Auto-Antibody
3) ICA: Islet Cell Antibody
4) GAD65 and ICA512 antibodies
5) HbA1C: Hemoglobin A1C test
6) C-Peptide tests

There were 42 attributes or features available, of which we used 37. We left out the nominal attributes ab_ivgtt, co_ivgtt, fpg, ogtt and subtstn_r, which either very rarely had a value or were based on the evaluation of the results of other tests (e.g., co_ivgtt might be normal if the results of some other features fall within a range of values considered normal), and we already include the exact results for those other tests.

IV. LEARNING APPROACHES USED

The machine learning algorithms used in our experiments are cascade correlation neural networks [2] and C4.5 release 8 based decision trees [1].

A. Cascade Correlation

Cascade Correlation [2] has two main features. The first is the cascade architecture, which allows the incremental addition of new hidden units; once a

unit is added to the network, its input weights are frozen. The second aspect is the learning algorithm, whereby the addition of a new hidden unit involves maximizing the correlation between the residual error at the outputs and the output of the new hidden unit (by adjusting its input weights). Cascade Correlation based neural networks are self-configuring: the number of hidden units is not fixed at initialization, but units are added during the training process as and when required to reduce the error rate. In our experiments with the diabetes data, we used the tool usfcascor [16], which is based on the cascade correlation algorithm [2].

B. C4.5

C4.5 uses a divide-and-conquer approach to growing decision trees [1]. In our experiments we used the program usfc4.5, which works on the same principles as C4.5. Decision trees are fast for classification, easy to build, and often provide reasonable accuracy [1]. The results from single trees were obtained with default settings and are from default pruned trees.

C. Ensemble Approaches

We used two ensemble techniques, bagging [9] and random forests [10]. Bagging, which stands for bootstrap aggregation, is a method of generating multiple versions of a predictor and then using them to obtain an aggregated predictor. Training sets of a given size are created by selecting, with replacement, from the original training data set. A test example is given the class predicted by the majority of classifiers, with ties broken by assigning the majority class label. Bagging [9] can be applied to both neural networks and decision trees. Random forests [10], on the other hand, are a combination of decision trees where, at each node, a random selection of n attributes is made and the best attribute test among them is chosen. So at any given node where a test is being chosen, only a subset of the features or attributes is considered. The final prediction is the class predicted by the majority of the decision trees. This approach can only be applied to decision trees, but it has been shown to produce highly accurate predictive classifiers [10].

For all experiments, a 10-fold cross validation was done to get an estimate of the ability of a classifier to predict unseen data. This means that the data was broken up into 10 disjoint sets, each of size 1/10 of the overall data. The breakup was done randomly. Each training set was made up of 90% of the data, with the remaining 10% held out for testing. This was done 10 times to allow the calculation of an average accuracy.
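
For readers who want a concrete picture of this setup, the sketch below shows bagging, a random forest with per-node attribute subsets, and 10-fold cross validation using scikit-learn estimators. This is only an illustrative analogue: the paper's experiments used usfc4.5 and usfcascor, and the feature matrix X, labels y, and library choice here are assumptions, not the original tooling or data.

# Illustrative sketch (NOT the usfc4.5/usfcascor programs used in the paper):
# single tree, bagged trees, and a random forest evaluated with 10-fold CV.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(711, 37)              # placeholder for 711 subjects x 37 test attributes
y = rng.randint(0, 2, size=711)    # placeholder labels: 1 = diabetic, 0 = non-diabetic

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # 10 disjoint folds

single_tree = DecisionTreeClassifier()                      # analogue of one pruned C4.5 tree
bagged_trees = BaggingClassifier(DecisionTreeClassifier(),  # bootstrap aggregation:
                                 n_estimators=100,          # 100 trees, majority vote,
                                 max_samples=1.0,           # bags the size of the training set
                                 bootstrap=True)
random_forest = RandomForestClassifier(n_estimators=100,    # random attribute subset per node,
                                       max_features="log2") # here the log2(n) heuristic of [10]

for name, clf in [("single tree", single_tree),
                  ("bagged trees", bagged_trees),
                  ("random forest", random_forest)]:
    acc = cross_val_score(clf, X, y, cv=cv).mean()           # average accuracy over 10 folds
    print(f"{name}: {acc:.3f}")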

V. EXPERIMENTS AND RESULTS

A. Predicting Type 1 Juvenile Diabetes by Using Test Results

The training data for diabetic and non-diabetic subjects were extracted in different ways. For non-diabetic subjects, all test results throughout their recorded medical history in the dataset were considered; for repeated tests, the results of the first test were used. This reduces the number of missing test values per subject (the training data had an average of 30 data values per subject out of the 37 attributes). Out of the 256 diabetic subjects, 201 had tests in the immediate vicinity of their date of diagnosis. The other 55 diabetic subjects had tests quite distant in time (approximately 3 months prior to their date of diagnosis). So for diabetic subjects the test results from their most recent tests before or after diagnosis were used; these most recent test results were the ones used by a physician in recording the diagnosis. The confusion matrix generated from a 10-fold cross validation using usfc4.5 is shown in Table 1.

                            Classified as
                      Diabetic    Non-diabetic
    Diabetic             188            67
    Non-diabetic          11           444

Table 1: Confusion matrix after 10-fold cross validation using usfc4.5

The average accuracy was 88.05%. This result was a little disappointing, given that physicians accurately make a diagnosis from exactly these test results. Not all tests were performed for every subject when they were seen, so there were many missing test results. This was especially of concern when trying to predict diabetes. For one subject who was diagnosed with diabetes, there was only one test result in the set performed closest to the diagnosis time, and so the test set just before that was also used. We suspected that the 55 subjects who got diabetes but had no tests in the three-month period before diagnosis would be hard to predict, so their prediction results were examined separately. Only 5 out of the 55 subjects were correctly diagnosed as diabetic. The reason why almost all of these 55 subjects were wrongly classified might be attributed to the absence of any test results in the immediate vicinity of their respective dates of diagnosis.

For the neural network approach, the same dataset was used with a few necessary modifications:

1) The missing data values, which in the decision tree approach were denoted by the symbol "?", were handled differently: in the dataset for the neural network approach, each of the 37 attributes (pertaining to the 37 tests) had an indicator attribute associated with it. The indicator attribute has only two possible values, -0.5 and 0.5, where -0.5 denotes that the associated test attribute does not have a value (i.e., a missing data value) and 0.5 denotes that the associated test attribute has a valid value. If the indicator attribute has the value -0.5, then the associated test attribute is also given the value -0.5.

2) The data values for each of the test attributes were scaled to lie between -0.5 and 0.5.

3) The two nominal attributes, Sex and Race, were split up into a total of 7 distinct attributes (5 for Race and 2 for Sex), each indicating an individual nominal value of the attribute. No separate indicator attributes were associated with the nominal attributes; in case of a missing value, each of the attributes associated with that nominal feature (i.e., 5 attributes for Race and 2 for Sex) was given a value of -0.5.

The neural network approach resulted in an average accuracy of 99.72% after 10-fold cross validation, which was quite surprising to us. So we used the decision tree classifier on this modified dataset, which reveals that the absence of a couple of tests in the subjects' records (indicated by the indicator attributes associated with the tests) improves the accuracy of prediction from the 88.05% previously obtained to 99.85%. Fig. 1 shows the decision tree built on all the data. From the decision tree we can see that the absence of PEP-4 and GLU0 plays a vital role in achieving the high accuracy (as shown by the attributes SUBTSTN_PEP_4_IND and SUBTSTN_GLU0_IND). If those tests were not done, it is likely the diagnosis is diabetes.

[Fig. 1: Decision tree generated using the modified data set. The tree tests the indicator attributes Subtstn_Pep_4_Ind and Subtstn_Glu0_Ind together with Sex and Race.]
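
The data representation described above for the cascor networks (a paired indicator attribute per test, values scaled to lie in [-0.5, 0.5], and nominal attributes split into one column per value) can be sketched as follows. This is a hedged illustration only; the function and variable names are invented for the example and numpy is assumed, since the paper does not give its preprocessing code.

# Sketch of the indicator-attribute encoding used for the cascor networks
# (assumed implementation; names are illustrative, not from the paper).
import numpy as np

def encode_test_column(values):
    """values: 1-D array of one test's results, with np.nan for missing entries.
    Returns (scaled_value, indicator) columns, both in [-0.5, 0.5]."""
    present = ~np.isnan(values)
    indicator = np.where(present, 0.5, -0.5)            # 0.5 = valid value, -0.5 = missing
    lo, hi = np.nanmin(values), np.nanmax(values)
    scaled = (values - lo) / (hi - lo) - 0.5 if hi > lo else np.zeros_like(values)
    scaled = np.where(present, scaled, -0.5)             # a missing test value becomes -0.5
    return scaled, indicator

def encode_nominal(values, categories):
    """One column per category, 0.5 for the subject's value; all -0.5 if missing."""
    cols = np.full((len(values), len(categories)), -0.5)
    for i, v in enumerate(values):
        if v in categories:
            cols[i, categories.index(v)] = 0.5
    return cols

# Example: one test attribute and the Sex attribute for three subjects.
glu0 = np.array([92.0, np.nan, 140.0])
sex = ["F", None, "M"]
print(encode_test_column(glu0))
print(encode_nominal(sex, ["M", "F"]))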

B. Type 1 Juvenile Diabetes Prediction from Differences in Test Results

The next phase of our study involved the comparison of decision trees and neural networks for predicting the occurrence of Type 1 diabetes in a set of juvenile subjects by looking at the differences between consecutive test results, over the entire period of study for non-diabetic subjects and before the date of diagnosis for diabetic subjects. The objective was to see whether it was possible to predict, with reasonable precision, that a juvenile subject was potentially at risk of Type 1 diabetes by examining how the different diagnostic test results change over time. Ensemble classifiers were also used to explore increasing the accuracy for this difficult problem.

To build the dataset, we first calculated the difference between consecutive test results of each test type for each subject. For diabetic subjects, we considered all tests before diagnosis, excluding the most recent tests done just before the date of diagnosis, as these results would likely differentiate the diabetic subjects from the normal subjects. Moreover, for the 55 diabetic subjects who did not have any tests in the immediate vicinity of their diagnosis, their most recent tests before diagnosis were considered (since these tests were at least 3 months before their dates of diagnosis). For non-diabetic subjects, all tests throughout the period of study were considered. We called this data set the differential data set.

The attributes in the dataset were the difference values between consecutive tests of the same test type. The maximum number of difference values for a particular test type among all the subjects determines the number of attributes associated with that test type; e.g., if the maximum number of repetitions of the GAD65 test is 13, the attributes would be GAD65_0 through GAD65_12, and for subjects who had the GAD65 test conducted fewer than 13 times, the remaining difference attributes for GAD65 are denoted as missing values. The number of attributes in the dataset became 473, including the nominal attributes Sex and Race.
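
For concreteness, the construction of these differential attributes might look roughly like the sketch below. It assumes each subject's results for a test type are available as a chronologically ordered list; the padding value and function name are invented for the illustration.

# Sketch: turn a subject's consecutive test results into difference attributes,
# padded with None (a missing value) up to the maximum count seen in the data.
def differential_attributes(results, max_diffs):
    """results: chronologically ordered list of one test type's values for a subject.
    Returns max_diffs difference values; unavailable differences are None (missing)."""
    diffs = [b - a for a, b in zip(results, results[1:])]
    return diffs + [None] * (max_diffs - len(diffs))

# Example: a subject with four GAD65 results and a dataset-wide maximum of 13
# difference attributes (GAD65_0 .. GAD65_12); the first three entries are the
# consecutive differences, the remaining ten are None (missing).
gad65 = [0.8, 1.1, 1.9, 2.4]
print(differential_attributes(gad65, 13))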

For the decision tree approach, the missing values in the dataset were denoted by the symbol "?". The 10-fold cross validation resulted in an average accuracy of 66.80%. The confusion matrix after the 10-fold cross validation is shown in Table 2.

                            Classified as
                      Diabetic    Non-diabetic
    Diabetic             129           127
    Non-diabetic         109           346

Table 2: Confusion matrix after 10-fold cross validation: using a single decision tree on the differential dataset

As we can see, the average accuracy obtained from the 10-fold cross validation using the C4.5-based usfc4.5 decision tree classifier is quite low and is close to the accuracy we would have obtained (64%) if all examples were predicted to be the majority class (i.e., non-diabetic).

Next we evaluated random forests [10] with usfc4.5 as the classifier, to see if there was an increase in the accuracy of prediction. The accuracy of prediction was found to be best when using a subset of 50 attributes (an ensemble of 100 trees was used). The 10-fold cross validation results for the random forest [10] approach using subsets of 50 attributes and 100 trees improve the average accuracy to 71.31%. The confusion matrix showing the predictions after a 10-fold cross validation is given in Table 3.

                            Classified as
                      Diabetic    Non-diabetic
    Diabetic              99           157
    Non-diabetic          47           408

Table 3: Confusion matrix after 10-fold cross validation: using random forests with 50 attributes and 100 trees on the differential dataset

Although the overall average accuracy improves, the number of diabetic subjects classified as non-diabetic far outweighs the number of correctly classified diabetic examples, which is a major cause for concern in this case.

Cascade Correlation (using usfcascor) was applied to the differential dataset, for which the following modifications were made:

1) Each attribute now has an indicator attribute associated with it, indicating whether the difference attribute has a valid value (indicator attribute equals 0.5) or the data is missing (indicator attribute equals -0.5). If the difference attribute does not have a valid data value, it is given a value of -0.5.

2) The values of each of the 471 difference attributes were scaled to lie between -0.5 and 0.5.

3) The nominal attributes (Race and Sex) were split up into attributes each representing an individual value (i.e., 5 attributes for Race, each indicating one of the five races, and 2 attributes for Sex, one indicating Male and the other Female). No separate indicator attributes were associated with the two nominal attributes. If a value of a nominal attribute was missing, then all of the associated attributes (i.e., 5 for Race and 2 for Sex) were given a value of -0.5.

The 10-fold cross validation using usfcascor on the modified data set resulted in an average accuracy of 72.71%, which is a marginal improvement on the random forests [10] approach. The confusion matrix of the 10-fold cross validation is given in Table 4.

                            Classified as
                      Diabetic    Non-diabetic
    Diabetic             151           105
    Non-diabetic          89           366

Table 4: Confusion matrix after 10-fold cross validation: using a single neural network on the modified differential dataset

Although there has not been much improvement in the overall average accuracy compared to the random forests [10] approach, the increase in true positives (i.e., correct prediction of diabetic subjects) was good. We used the F-measure to compare performance from the confusion matrices. The F-value is defined as:

    F-measure = 2 x (Precision x Recall) / (Precision + Recall)

where

    Precision = True Positives / (True Positives + False Positives)
    Recall = True Positives / (True Positives + False Negatives)

For random forests [10] with subsets of 50 attributes, the F-value is 0.493, while for the neural network approach the F-value calculated from the confusion matrix is 0.609. So, though the overall accuracies of both approaches are almost the same, neural networks perform better according to the F-value.
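
As a check, these F-values can be reproduced directly from the confusion matrices in Tables 3 and 4. The small helper below is only an illustration (the positive class is "diabetic"):

# Recomputing precision, recall and F-measure from the reported confusion matrices.
def f_measure(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Table 3 (random forests, subsets of 50 attributes): TP=99, FP=47, FN=157
print(round(f_measure(99, 47, 157), 3))   # ~0.493
# Table 4 (single cascade correlation network): TP=151, FP=89, FN=105
print(round(f_measure(151, 89, 105), 3))  # ~0.609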

Next we studied the effect of using an ensemble of bagged neural networks [9] on the differential dataset used above. We used 100% bags (i.e., 100% of the training data was chosen randomly with replacement) and an ensemble of 100 neural networks. The average accuracy obtained after a 10-fold cross validation is 77.21%. Table 5 shows the confusion matrix obtained from the 10-fold cross validation.

                            Classified as
                      Diabetic    Non-diabetic
    Diabetic             162            94
    Non-diabetic          68           387

Table 5: Confusion matrix after 10-fold cross validation: using a bagged ensemble of 100 neural networks on the modified differential dataset

Using an ensemble of neural networks improves the overall accuracy, and the F-value (0.66) calculated from Table 5 is greater than the F-value obtained from the confusion matrix in Table 4 (i.e., using a single neural network).
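
A sketch of such a bagged ensemble of networks follows. Note the assumptions: a standard multilayer perceptron stands in for the cascade correlation (usfcascor) networks actually used, the data is a placeholder, and max_samples controls the bag size (1.0 for the 100% bags used here, 0.9 for the 90% bags tried next).

# Illustrative bagged ensemble of 100 neural networks; MLPs stand in for the
# cascade correlation networks of the paper. Majority vote over the bags.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import BaggingClassifier

rng = np.random.RandomState(0)
X = rng.rand(711, 949)              # placeholder: difference attrs, indicators, nominal columns
y = rng.randint(0, 2, size=711)     # placeholder labels: 1 = diabetic, 0 = non-diabetic

bagged_nets = BaggingClassifier(
    MLPClassifier(hidden_layer_sizes=(10,), max_iter=500),
    n_estimators=100,               # 100 networks
    max_samples=1.0,                # 100% bags; use 0.9 for 90% bags
    bootstrap=True)                 # sample the training data with replacement

bagged_nets.fit(X, y)
print(bagged_nets.predict(X[:5]))   # majority-vote predictions; in the paper the ensemble
                                    # was evaluated with 10-fold cross validation as before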


We also experimented with 90% bags (to have greater variability in the training data) and 100 neural networks, but the average accuracy did not increase when doing a 10-fold cross validation (77.07%).

In the next set of tests, we used ensembles of decision trees (random forests [10] and bagging [9]) with the modified dataset (the dataset used for the neural network approach). The 10-fold cross validation results from the decision tree approaches on the modified dataset were:

1) A single decision tree yields an average accuracy of 72.29%, which is much higher than the previous accuracy of 66.80%.
2) Using bagging [9] with 100% bags and 100 trees, the average accuracy obtained was 79.04%.
3) Using bagging [9] with 100% bags and 100 trees, but this time with unpruned trees, marginally improves the accuracy to 79.76%.
4) Random forests using a heuristic suggested in [10] (i.e., log2(total number of attributes) attributes at each node) with 100 trees gave an average accuracy of 77.78%.
5) Random forests [10] with subsets of 100 attributes at each node and 100 trees yield a better accuracy of 80.31%.

Table 6 shows the confusion matrix from the 10-fold cross validation for the random forest classifier using a subset of 100 attributes at each node. The F-value calculated from the confusion matrix in Table 6 is 0.68, which is just higher than the F-value obtained from the bagged [9] cascor networks (i.e., 0.66).

                            Classified as
                      Diabetic    Non-diabetic
    Diabetic             156           100
    Non-diabetic          40           415

Table 6: Confusion matrix after 10-fold cross validation: using random forests with 100 attributes and 100 trees on the modified differential dataset

We also performed a set of five tests using random forests with 100 attributes and 100 trees but with different random seeds. The minimum accuracy achieved was 78.91% and the maximum 80.59% (close to the accuracy of 80.31% that we got with the default seed), with an average accuracy of 79.75% over the five tests.

The above results show that ensemble approaches with decision trees perform better in terms of overall accuracy than neural networks, but the bagging [9] approach with cascade correlation networks has almost the same F-value as the most accurate random forest [10] approach shown in Table 6. Table 7 presents a summary of the 10-fold cross validation results obtained from all the different approaches used on both the test results dataset and the differential test results dataset.
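
The two random forest settings compared above (the log2(n) heuristic versus a fixed subset of 100 attributes per node) and the repeated runs with different random seeds could be reproduced along the following lines; scikit-learn is again an assumed stand-in for the usfc4.5-based forests, and the data is a placeholder.

# Sketch: random forests with different per-node attribute subset sizes and seeds.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(711, 473)              # placeholder for the 473 differential attributes
y = rng.randint(0, 2, size=711)     # placeholder labels

for max_features in ["log2", 100]:  # log2(n) heuristic vs. subsets of 100 attributes per node
    accs = []
    for seed in range(5):           # five runs with different random seeds
        rf = RandomForestClassifier(n_estimators=100,
                                    max_features=max_features,
                                    random_state=seed)
        accs.append(cross_val_score(rf, X, y, cv=10).mean())
    print(max_features, min(accs), max(accs), sum(accs) / len(accs))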

VI. CONCLUSION AND DISCUSSION

From the results of the experiments conducted, it can be seen that ensemble approaches with decision trees (random forests [10] and bagging [9]) give comparable, and sometimes even better, diabetes prediction accuracy for juvenile subjects (in the context of our dataset) than cascade correlation based neural networks. Another area where decision tree approaches have potential advantages over neural networks is that they are fast and easy to build and understand. However, it may also be noted that a bagged [9] ensemble of cascor networks is able to predict more diabetic subjects correctly than the decision tree based approaches (as can be seen from the confusion matrices discussed earlier). Further, a well-tuned neural network of another type may well provide a higher ensemble accuracy. Another interesting observation that can be made from the results is that the representation of the data has important implications in the current context and affects the accuracy of prediction to a great extent.

In the first series of experiments, where the medical test records were directly used for diabetes prediction, the accuracy improved by over 10% (from 88.05% to 99.85%) in the case of decision trees when the learning set was modified (i.e., using associated indicator attributes for test existence and normalizing the data values to between -0.5 and 0.5). Similarly, when the differential dataset was modified for the neural network approach (with indicator attributes and scaled data values), the accuracy of prediction showed a significant increase even with ensembles of decision trees (random forests [10] and bagging [9]), from 66.8% to a maximum of 80.31% (using random forests [10] with 100 attributes). Decision trees have built-in methods to deal with missing attributes (for example, allowing examples to go down multiple branches with weights, as in our implementation). However, in this case a more explicit representation of missing values was very useful. Moreover, it can also be inferred from the results that the absence of some medical tests reveals important information which helps to predict the occurrence of Type 1 diabetes in

juvenile subjects with better accuracy. Such information needs to be embedded in the learning data, as done here with indicator attributes associated with each test result attribute, to improve prediction accuracy.

ACKNOWLEDGMENT

The authors would like to sincerely thank Jeffrey P. Krischer of the Moffitt Cancer Research Institute, Tampa, FL, for providing the juvenile diabetes dataset used in our current study.

REFERENCES

[1] J. R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., 1993.
[2] S. E. Fahlman and C. Lebiere, "The cascade-correlation learning architecture," in Advances in Neural Information Processing Systems, vol. 2, 1990, pp. 524-532.
[3] W. A. Sandham, D. J. Hamilton, A. Japp, and K. Patterson, "Neural network and neuro-fuzzy systems for improving diabetes therapy," in Proceedings of the 20th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Hong Kong, China, 1998.
[4] M. Zorman et al., "Mining diabetes database with decision trees and association rules," in Proceedings of the 15th IEEE Symposium on Computer-Based Medical Systems (CBMS 2002), 2002.
[5] W. Hsu, M. L. Lee, B. Liu, and T. W. Ling, "Exploration mining in diabetic subjects databases: findings and conclusions," in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000.
[6] J. E. Park and D. W. Edington, "A sequential neural network model for diabetes prediction," Artificial Intelligence in Medicine, vol. 23, no. 3, pp. 277-293, Nov. 2001.
[7] M. S. Shanker, "Using neural networks to predict the onset of diabetes mellitus," Journal of Chemical Information and Computer Sciences, vol. 36, pp. 35-41, 1996.
[8] C. Silverstein, S. Brin, R. Motwani, and J. Ullman, "Scalable techniques for mining causal structures," Data Mining and Knowledge Discovery, vol. 4, no. 2-3, pp. 163-192, 2000.
[9] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[10] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
[11] G. F. Cooper, "A simple constraint-based algorithm for efficiently mining observational databases for causal relationships," Data Mining and Knowledge Discovery, vol. 1, no. 2, pp. 203-224, 1997.
[12] J. W. Smith, J. E. Everhart, W. C. Dickson, W. C. Knowler, and R. S. Johannes, "Using the ADAP learning algorithm to forecast the onset of diabetes mellitus," in Proceedings of the 12th Annual Symposium on Computer Applications in Medical Care, Washington, DC, Nov. 6-9, 1988, pp. 261-265.
[13] H. P. Chase, D. D. Cuthbertson, L. M. Dolan, et al., "First-phase insulin release during the intravenous glucose tolerance test as a risk factor for type 1 diabetes," Journal of Pediatrics, vol. 138, pp. 244-249, 2001.
[14] C. J. Greenbaum, D. Cuthbertson, J. P. Krischer, and the Diabetes Prevention Trial of Type 1 Diabetes Study Group, "Type 1 diabetes manifested solely by 2-h oral glucose tolerance test criteria," Diabetes, vol. 50, pp. 470-476, 2001.
[15] DPT-1 Study Group, "The Diabetes Prevention Trial of Type 1 Diabetes [abstract]," Diabetes, vol. 43, suppl. 1, p. 159A, 1994.
[16] S. Eschrich, "Learning from Less: A Distributed Method for Machine Learning," Ph.D. dissertation, Dept. of Computer Science and Engineering, University of South Florida, 2003.

Table 7: Summary of 10-fold cross validation results (dataset used, classifier, average accuracy after 10-fold cross validation)

Prediction on the test results dataset:
  Dataset with missing values represented as "?"
    - A single usfc4.5 decision tree: 88.05%
  Data representation modified for cascor networks
    - A single neural network (NN): 99.72%
    - A single usfc4.5 tree: 99.85%

Predictions on the differential test results dataset:
  Dataset with missing values represented as "?"
    - A single usfc4.5 decision tree: 66.80%
    - Random forests with subsets of 50 attributes: 71.31%
  Data representation modified for cascor networks
    - A single NN: 72.71%
    - 90% bags and 100 NNs: 77.07%
    - Bagged ensemble of 100 NNs: 77.21%
    - Bagging (100% bags) with 100 decision trees: 79.33%
    - Bagging with 100 unpruned decision trees: 79.76%
    - Random forests (lg(n)) and 100 decision trees: 77.78%
    - Random forests with subsets of 100 attributes and 100 trees: 80.31%