A study on Feature Selection Methods in Medical Decision Support Systems

Rahul Samant, SVKM's NMIMS, Shirpur Campus, India; Srikantha Rao, TIMSCDR, Mumbai University, Kandivali, Mumbai, India

Abstract

Clinical databases often contain a large number of disease markers. For clinical data analysis, some of these markers are not helpful and can even have negative effects. Applying feature selection is therefore necessary: it removes the unimportant disease markers and makes a medical decision support system more effective by reducing its learning time. We evaluated three feature selection methods: Principal Component Analysis (PCA), Factor Analysis (FA), and an Attribute Ranking (AR) method. The promising performance of PCA was validated through a set of experiments on a dataset using a Naïve Bayesian (NB) classifier and a K-nearest neighbor (KNN) classifier.

1. Introduction

Machine learning techniques have been widely used to help medical experts analyze medical data [1]. Large-scale medical databases generally have a large number of attributes (dimensions). This high dimensionality can adversely affect many aspects of the analysis process: it increases the learning system's time in both the training and runtime phases, and it can cause the "curse of dimensionality" problem. Feature reduction is an important technique for handling high-dimensional medical data [4-6]. Researchers and practitioners recognize that data preprocessing is essential to using data mining tools effectively [13].

The idea of feature reduction is to represent the original data with fewer dimensions. Although the number of dimensions is reduced, the discriminative capability should not be hampered. Feature reduction has many benefits: it can avoid over-fitting, reduce the complexity of data analysis, and improve its performance.

Feature reduction can generally be divided into two categories [7]: feature extraction and feature selection. In feature extraction, the original feature space is mapped into a lower-dimensional one, so the resulting features are entirely new and different from the original ones. Popular feature extraction methods include principal component analysis (PCA) and independent component analysis (ICA) [8]. Although feature extraction yields a much smaller dimension, it usually requires high computational overhead, and because the features are new, they are hard for humans to interpret. For example, disease markers are typical features of medical data; after feature extraction, these markers are projected into another space, and a medical doctor can no longer interpret what the new features mean.

Feature selection instead chooses a subset of the original features according to a specified evaluation function, which usually measures the discriminative capability of each feature. Unlike feature extraction, feature selection does not generate any new features, so the meaning of the selected features remains understandable. Because of this advantage, feature selection is more widely used than feature extraction for medical data analysis.

There are three main kinds of feature selection methods [9]: wrapper, filter, and embedded methods. In an embedded method, feature selection is part of the learning algorithm itself; C4.5 is a typical example [10]. Wrappers [11] evaluate the predictive performance of features by employing a specified learning algorithm: the features that give higher classification accuracy are selected. The quality of features selected by wrappers is generally good, since classification accuracy is taken into account directly, but wrappers also have limitations. First, the selected features depend on the learning algorithm and may not suit other learning algorithms. Second, wrappers are time-consuming and often intractable for large-scale problems. Unlike wrappers, filters do not employ any specific learning algorithm; instead, they identify a subset of features according to some evaluation criterion, with different filters using different criteria. Filters are independent of the learning algorithm and computationally efficient, and have therefore been widely applied to medical data [12].
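To make the filter/wrapper contrast concrete, the following is a minimal sketch in Python with scikit-learn (an illustration added here, not part of the original study; the synthetic data and parameter choices are hypothetical):

```python
# Illustrative only: a filter scores features independently of any
# learner, while a wrapper asks a specific classifier to judge subsets.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for a 30-attribute clinical dataset.
X, y = make_classification(n_samples=500, n_features=30,
                           n_informative=8, random_state=0)

# Filter: univariate ANOVA F-test, independent of the learning algorithm.
filt = SelectKBest(score_func=f_classif, k=8).fit(X, y)
print("filter keeps:", np.flatnonzero(filt.get_support()))

# Wrapper: recursive feature elimination driven by one classifier; the
# chosen subset is tied to that classifier and is costlier to compute.
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=8).fit(X, y)
print("wrapper keeps:", np.flatnonzero(wrap.get_support()))
```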

2. Methods for Feature Selection

In this section we briefly describe the three attribute selection methods used in our study, together with a median imputation technique for missing values. An excellent summary of feature selection methods is provided in the book Data Preparation for Data Mining by D. Pyle [13].

2.1 Principal Component Analysis (PCA)

Principal component analysis (PCA) is the best linear dimension reduction technique in the mean-square-error sense. Being based on the covariance matrix of the variables, it is a second-order method. The basic procedure is as follows:

a. The input data is normalized so that attributes with large domains do not dominate attributes with smaller domains.
b. PCA computes K orthogonal vectors that provide a basis for the normalized input data. These vectors are called principal components.
c. The principal components are sorted in order of decreasing significance (strength). They essentially serve as a new set of axes for the data, providing important information about variance.
d. Because the components are sorted by decreasing significance, the size of the data can be reduced by eliminating the weaker components. Using the strongest components, it should be possible to reconstruct a good approximation of the original data.

2.2 Factor Analysis (FA)

Like PCA, factor analysis (FA) is a linear method based on second-order data summaries. FA assumes that the measured variables depend on some unknown, and often unmeasurable, common factors. For example, in much psychiatric data it is not possible to measure a factor of interest (such as "intelligence") directly; however, it is possible to measure other parameters, such as individuals' test scores, that reflect that factor. The goal of FA is to uncover such relations, and it can thus be used to reduce the dimension of datasets that follow the factor model.

2.3 Median Imputation (MDI)

Since the mean is affected by the presence of outliers, it is natural to use the median instead to ensure robustness. Here, the missing value of a given feature is replaced by the median of all known values of that attribute within the class to which the instance with the missing value belongs. This method is also a recommended choice when the distribution of the feature's values is skewed.

2.4 Attribute Ranker (AR)

In subset evaluation, a subset of one or more features is given a goodness score by the evaluator (SubsetEvaluator); in attribute evaluation, the evaluator (AttributeEvaluator) gives each individual attribute a goodness score. When a search algorithm is paired with a SubsetEvaluator, it explores the space of possible subsets and returns the best one with respect to the evaluation. In the latter case, a ranked list of attributes is produced by pairing the attribute evaluator with a special "search" called the Ranker. For a ranked list of attributes, one can simply retain the top M ranked attributes (or, alternatively, set a threshold on the goodness score below which attributes are discarded). It is still possible to obtain the best M attributes with a subset evaluator by pairing it with the GreedyStepwise search method and turning on the option to produce a ranked list. This works by forcing the search to the far side of the search space: attributes keep being added to the best subset even if doing so decreases the overall goodness (the "best" attribute, i.e. the one that decreases the goodness the least, is still added at each stage). The order in which attributes are added forms the ranking.
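As a rough illustration of how the three kinds of per-attribute scores reported later in Table 1 could be produced, here is a hedged scikit-learn sketch. The study itself was coded in MATLAB 2007a, and the FA score (a communality estimate from squared loadings) and ranker score (mutual information) below are our stand-in choices, not necessarily the evaluators the authors used:

```python
# Sketch under stated assumptions; it mirrors the flavour of Table 1,
# not its exact numbers.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=30, random_state=0)
Xs = StandardScaler().fit_transform(X)         # step (a): normalize

pca = PCA().fit(Xs)                            # steps (b)-(d)
pca_scores = pca.explained_variance_           # component strengths, descending

fa = FactorAnalysis(n_components=5).fit(Xs)
fa_scores = (fa.components_ ** 2).sum(axis=0)  # communality per attribute

ranker_scores = mutual_info_classif(Xs, y, random_state=0)

# Retain the top M attributes by ranker score, as a Weka-style Ranker
# search would.
M = 10
print("top attributes:", np.argsort(ranker_scores)[::-1][:M])
```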
3. Experiments

3.1 Dataset Acquisition

The database analyzed in this paper was compiled as part of an earlier study, the Early Detection Project (EDP), conducted at the Hemorheology Laboratory of the erstwhile Inter-Disciplinary Programme in Biomedical Engineering at the School (now Department) of Biosciences and Bioengineering, Indian Institute of Technology Bombay (IITB), Mumbai, India. Spanning the period from Jan. 1990 to Apr. 1996, it consists of 968 records, each with 30 parameters, which capture the biochemical, hemorheological, and clinical status of individuals visiting the hospital for routine checkups or treatment of common ailments. We note that the Hemorheology Laboratory pioneered research in clinical hemorheology by conducting baseline hemorheological studies in the Indian population and correlating various hemorheological parameters with several disease conditions.

3.2 Profile of the sample

In all, 30 parameters were recorded for each respondent. They include age, gender, habits (smoking, alcohol consumption), blood group, disease state, health indicators (e.g., pulse, systolic blood pressure (BP1), diastolic blood pressure (BP2)), and biochemical parameters such as Serum Proteins (SP), Serum Albumin (SALB), Serum Fibrinogen (SFIB), Hematocrit (HCT), Erythrocyte Sedimentation Rate (ESR), Serum Cholesterol (SC), Serum Triglycerides (STG), Hemoglobin (HB), and Platelet Aggregation (PLA), along with various hemorheological (HR) parameters (e.g., Whole Blood Viscosity (WBV), measured over eight different shear rates; Plasma Viscosity (PV), measured over three different shear rates using a Contraves 30 viscometer; and Red Cell Aggregation (RCA)).

The database covers a very wide age range (14-83 years), although the majority of respondents are closer to 40 years of age (average age 41.67 years). The female component of the database is comparable to the male component in most parameters, such as age, blood group distribution, and biochemical values. The average BP of the entire database (127.27/83.89 mm Hg), and also of its male (127.97/85.14 mm Hg) and female (125.46/82.36 mm Hg) components, is close to the normal value of 120/80 mm Hg reported in the literature, signifying a preponderance of normal controls in the sample. About 16% of male subjects indulge in smoking, alcohol consumption, or both, while the corresponding value for females is 1%. The distribution of blood groups in the database is consistent with other reports on the Indian population. The incidence of hypertension (HT) in the database is higher in males than in females (21.45% and 13.10% respectively), while the corresponding figures for Diabetes Mellitus (DM) are 15.24% and 12.46% respectively. Most biochemical parameters lie within the normal range. Plasma Viscosity (PV) shows 100% variation between the maximum and minimum reported values (1.02 to 2.02 cp), in both males and females. The average plasma viscosity in females is slightly higher than in males (1.40 cp against 1.385 cp), but the difference is not statistically significant. In contrast, the Whole Blood Viscosity at high shear rate (WBVh) is significantly higher in males than in females (5.47 cp against 4.48 cp, p < 0.01). The higher PV in females may be attributed to a higher percentage of females reporting elevated PF values and is consistent with earlier reported studies.

Fig. 1. The dataset visualization (attribute values plotted by column number, 1-30).

3.3 Data preprocessing

The database had missing values occurring at random. Records with missing values were dropped entirely, and a cleaned dataset was prepared. We then applied Principal Component Analysis, Factor Analysis, and Attribute Ranking for feature selection. The score given to each attribute by each feature selection method is tabulated in Table 1.

Table 1. Score of selected features

Feature    PCA        Factor Analysis    Attribute Ranker
AGE        37.3121    0.9989              0.3585
BP1        26.1401    0.9776              0.5201
BP2        15.9286    0.9620              1.0605
HB         10.5788    0.9876              0
BSF         3.3159    0.9987              0.9392
BSP         2.1825    0.9995              0.7714
SC          1.6910    0.9993              0.2136
STG         0.9856    0.9778              0.1919
SALB        0.6781    0.9924             -0.1789
SP          0.5222    0.9923              0.3984
CPV1        0.3823    0.9203              0.1644
CPV2        0.1479    0.8966             -0.1247
CPV3        0.0714    0.8874              0.1743
CB1         0.0195    0.1507              0.3943
CB2         0.0182    0.0982              0.3586
CB3         0.0116    0.0383              0.4028
CB4         0.0044    0.0050              0.576
CB5         0.0034    0.0207             -0.0636
CB6         0.0018    0.0523             -0.4214
CB7         0.0017    0.1113             -0.1953
CB8         0.0014    0.2277              0.1153
HCT         0.0005    0.5149              0.1296
RHCT        0.0004    0.9963              0
SF          0.0002    0.9992             -0.4776
RG          0.0002    0.9838              0.0238
PA          0.0001    0.9974             -0.1153
PLA         0.0001    0.9999             -0.0025
ESR4        0.0000    0.9610             -0.3187
ESR_COR     0.0000    0.9902              0.1387
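For illustration, here is a small pandas sketch of the two missing-value strategies discussed in this paper: dropping incomplete records, as was done for this dataset, and the class-conditional median imputation of Section 2.3. The toy frame and its column names are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy records with gaps; "DISEASE" plays the role of the class label.
df = pd.DataFrame({
    "AGE":     [45, 51, np.nan, 38],
    "BP1":     [130, np.nan, 142, 118],
    "DISEASE": [1, 1, 0, 0],
})

# Strategy used in this study: drop every record with a missing value.
cleaned = df.dropna()

# Alternative (Section 2.3, MDI): fill a gap with the median of that
# attribute within the record's own class; robust to outliers and skew.
imputed = df.copy()
for col in ["AGE", "BP1"]:
    imputed[col] = imputed.groupby("DISEASE")[col].transform(
        lambda s: s.fillna(s.median()))
print(cleaned, imputed, sep="\n\n")
```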

4. Results and Discussion

In the experiments, we divided the dataset into two subsets: the first consists of data for hypertensive and normal patients, and the second contains mixed population data for diabetic and normal patients (Table 2). We used a Naïve Bayesian (NB) classifier and a K-nearest neighbor (KNN) classifier to evaluate the effectiveness of the three attribute selection methods. For the classification models we considered four different feature sets: first the full feature set, and then the feature sets selected by the PCA, FA, and AR methods. MATLAB 2007a was used to code and test these classification models, and classifier accuracy was used as the measure of the effectiveness of the feature selection methods.

Table 2. Datasets used in the study

#    Dataset                 Features    Instances
1    Diabetes dataset        30          235
2    Hypertension dataset    30          338

As shown in Table 3 and Fig. 2, PCA yielded the best classification accuracy for both the Naïve Bayesian and the KNN classifier. With PCA, the accuracy of the Naïve Bayesian classifier increased from 62.34% to 84.33% on dataset 1, and from 50.34% to 79.21% on dataset 2, compared with using all features. Similarly, the accuracy of the KNN classifier improved from 59.13% to 80.25% on dataset 1 and from 52.17% to 82.58% on dataset 2. The other feature selection methods also show a promising increase in accuracy compared with building the classification model on all attributes (features), but PCA recorded the best results.

Table 3. Experimental results (% classifier accuracy)

Classifier                   Dataset    All features    PCA      FA       AR
Naïve Bayesian classifier    1          62.34           84.33    80.21    74.43
                             2          50.34           79.21    75.38    65.69
KNN classifier               1          59.13           80.25    76.83    68.25
                             2          52.17           82.58    74.32    66.34

Fig. 2. Effect of attribute selection (ALL, PCA, FA, AR) on classifier accuracy (%) for the NB and KNN classifiers on both datasets.
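A hedged re-creation of this evaluation protocol in scikit-learn is sketched below. The original MATLAB setup, train/test split, and number of retained components are not reported in enough detail to reproduce, so those choices here are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one 30-attribute dataset from Table 2.
X, y = make_classification(n_samples=338, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, clf in [("NB", GaussianNB()), ("KNN", KNeighborsClassifier())]:
    all_feats = make_pipeline(StandardScaler(), clf)
    pca_feats = make_pipeline(StandardScaler(), PCA(n_components=10), clf)
    acc_all = all_feats.fit(X_tr, y_tr).score(X_te, y_te)
    acc_pca = pca_feats.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: all features {acc_all:.2%}, PCA-reduced {acc_pca:.2%}")
```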
5. Conclusion

Medical data usually contains a large number of disease markers, making it hard for humans to analyze. In building medical decision support systems, a small subset of relevant disease markers or symptoms (features) is needed to achieve higher accuracy and lower learning time. Generally, a large number of diagnosed samples is required to achieve good feature selection performance. Feature selection methods clearly improve classification accuracy: in our study, all three methods (PCA, FA, and AR) showed promising improvements in the accuracy of the classification models, and the performance of PCA was the best. Future work will therefore investigate the use of undiagnosed samples with other, more advanced feature selection methods. Moreover, feature selection is only one kind of data preprocessing technique for medical data; other preprocessing techniques, such as detection of noisy medical data, remain to be explored.

6. References

[1] I. Kononenko, "Machine learning for medical diagnosis: history, state of the art and perspective," Artificial Intelligence in Medicine 23 (1), 2001, pp. 89-109.
[2] T. T. Tang, G. Zheng, Y. L. Huang, G. F. Shu, and P. T. Wang, "A comparative study of medical data classification methods based on decision tree and system reconstruction analysis," Journal of EMS 4, 2005, pp. 102-108.
[3] W. Wajs, P. Wais, M. Swiecicki, and H. Wojtowicz, "Artificial immune system for medical data classification," in: Proc. of the 2005 International Conference on Computational Science, 2005, pp. 810-812.
[4] T. H. Cheng, C. P. Wei, and V. S. Tseng, "Feature selection for medical data mining: comparisons of expert judgment and automatic approaches," in: Proc. of the 19th IEEE Symposium on Computer-Based Medical Systems, 2006, pp. 165-170.

[5] K. Polat and S. Gunes, "A new feature selection method on classification of medical datasets: Kernel F-score feature selection," Expert Systems with Applications 36 (7), 2009, pp. 10367-10373.
[6] L. Chuang, H. Chang, C. Tu, and C. Yang, "Improved binary PSO for feature selection using gene expression data," Computational Biology and Chemistry 32 (1), 2008, pp. 29-38.
[7] M. Raymer, W. Punch, E. Goodman, L. Kuhn, and A. Jain, "Dimensionality reduction using genetic algorithms," IEEE Transactions on Evolutionary Computation 4, 2000, pp. 164-171.
[8] A. Hyvarinen and E. Oja, "Independent component analysis: algorithms and applications," Neural Networks 13 (4-5), 2000, pp. 411-430.
[9] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," J. Mach. Learn. Res. 3, 2003, pp. 1157-1182.
[10] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Wadsworth and Brooks, 1984.
[11] R. Kohavi and G. John, "Wrappers for feature subset selection," Artificial Intelligence 97 (1-2), 1997, pp. 273-324.
[12] K. Kira and L. Rendell, "A practical approach to feature selection," in: Proc. of the International Conference on Machine Learning, 1992, pp. 368-377.
[13] D. Pyle, Data Preparation for Data Mining, Morgan Kaufmann Publishers, 1999.