Variable Features Selection for Classification of Medical Data using SVM

Size: px

Start display at page:

Download "Variable Features Selection for Classification of Medical Data using SVM"

Ethel Daniel
5 years ago
Views:

1 Variable Features Selection for Classification of Medical Data using SVM Monika Lamba USICT, GGSIPU, Delhi, India ABSTRACT: The parameters selection in support vector machines (SVM), with regards to accuracy is very important. In this paper using support vector machine classifier, i have used an approach to construct a model that is useful for classification. I have used both 10 cross validation and 5 cross validation for variable selection on input feature vectors. Additionally, the accuracy (), sensiti, specifi and error rate of SVM is evaluated against five well-known medical datasets. SVM performance is much better than the other machine learning classifier. The empirical results demonstrate that the proposed variable feature selection SVM method can obtain much more appropriate model parameters that generates a high classification accuracy. Promisingly, the proposed SVM method can be regarded as a useful clinical tool for medical data classification and medical decision making. Keywords- Support Vector Machine 10-Fold cross validation 5-Fold cross validation Machine Learning feature selection. 1. BACKGROUND KNOWLWDGE Machine Learning (ML) plays an important role in many applications of pattern classification and data mining. Major ML areas, where it can be successfully applied, are regression and classification problems, by improving the efficiency and design of the systems. The features in dataset may be of different kind of dimensions, if features are provided with known labels with corresponding correct outputs, is known as supervised learning. But in case of unsupervised learning, the attributes are unlabeled or outputs are not known. One more kind of ML exists that is reinforcement learning where the training knowledge is provided to the system by the external teacher that constitutes a measure of how well the system works is in the form of reinforcement learning. The learner is never instructed to make any of the desired actions, but rather discovering which actions lead to the best solution, by continuously trying each and every action to improve the efficiency and accuracy (). A. Supervised Learning Algorithms Machine Learning is one of the process of learning a set of rules from instance from a training set, or more specific creating a classifier which may be used to generalize from new features or instances. The learning procedure is as follows: very first step is to collect the dataset, if the collected dataset by any of the arbitrary method used is not directly suitable for induction, it may contain missing and noisy data values, hence requires significant preprocessing (Zhang, Schwartz, Wagner & Miller 2000). The second step is to prepare data, data preprocessing and the feature subset selection is identified as the process of identifying and removing as more redundant and irrelevant features as possible (Yu & Liu 2004). This lead for reducing the dimension of the data and allowing algorithms to perform efficiently and faster. As many features depend on one another and may affect the accuracy of supervised Machine Learning classification Models. B. Algorithm Selection The selection of learning algorithm is one of the critical process. Once at primary stage when testing is judged and it comes out satisfactory, then that classifier is generalized (Marcano-Cedeno, Quintanilla-Dominguez & Andina 2011). The accuracy of the classifier s is generalized based on prediction (that is the ratio of correct prediction divided by the total number of predictions). There are multiple techniques mentioned to find the classifier s accuracy. One of the way is to split the training set by dividing the two-third for training and remaining for estimating performance. Another method used in this paper is known as cross validation in which training set is divided into equally-sized mutually exclusive subsets and every subset of classifier is trained on the union of remaining subsets. The error rate in terms of average of each subset results an estimate of the rate of error for the classifier. If the error (%) is not 202 Monika Lamba

2 tolerable then algorithm will go back to the previous stage of the Machine Learning supervised process. There are two major motives to make use SVMs in the field of computational biology first, many problems have high dimensions as well as noisy data, for which SVM are called to perform well as compared to other statistical or ML methods. Second, in comparison to most ML methods, kernel methods like SVM can easily handle non-vector inputs, like graphs or variable length sequences. These types of data are most common in biology applications. 2. METHODOLY A. Support Vector machines The support vector machine, initially started with a binary classification method developed by Vapnik (Vapnik 1995) et.al at Bell laboratories. For a binary problem, we have training data points: {x i, y i }, i=1 I, y i {-1,1}, x i R d. Suppose we have some hyper-planes that classify the positive label from the negative labels separating with the help of hyper-plane. The points x which is on the hyperplane satisfy w.x + b = 0, where w is normal to the hyper-plane, b / w is the perpendicular distance from the hyper-plane to the origin, and w is the Euclidean norm of w. Let d+ d- be the shortest distance from the separating hyper-plane to the closest positive or negative points. Define the margin of a separating hyper-plane to be d+ + d-. For the classes to be linearly separable, the support vector algorithm looks for the separating hyperplane with the biggest margin. This can be mathematically stated as follows: Assume that all the training data satisfy the following constraints: x i. w + b +1 for y i = +1, (1) x i. w + b -1 for y i = -1, (2) Combining (1) and (2) into one set of in-equalities results: Y i (x i. w + b) -1 0 i (3) Considering, the equal in equation (1) holds that there exist a point which is equivalent to choosing a value for w and b. These points are on the hyperplane H1: xi w + b = 1 with normal w and perpendicular distance from the origin 1-b / w. Similarly, the points for the equal in equation (2) hold to lie on the hyper-plane H2: xi w + b = -1, with normal again w and perpendicular distance from the origin -1-b / w. Hence d+ =d- =1/ w and the margin 2/ w. w =. i α i y i x i (4). i α i y i = 0. (5) Support vector training, hence amounts to maximizing LD with respect to the αi, subject to the constraints defined in equation (5) and positi to the αi, with solution given by given in equation (4). Now we have Lagrange multiplier αi for the every training point. Those points from solution set, where αi > 0, are known as support vectors and therefore are lying on any of the hyper-planes H1, H2. All other training points have αi = 0 and lie either on H1 or H2 as earlier defined in the equal in equation (3) holds, or on other side of H1 or H2 such that it is defined inequal in equation (3) holds. For these kind models the support vectors are major component of the training set. They are located nearest to the decision boundary, if we remove all the remaining training points or moved them around subjected to a condition that they do not cross H1 or H2, and training has repeated and consequently the same hyper-plane is generated then the above algorithm for linearly separable data when applied for the non-separable data does not guarantee a feasible solution. H2: xi w + b = -1, with normal again w and perpendicular distance from the origin -1-b / w. Hence d+ =d- =1/ w and the margin 2/ w. Support vector training (linearly separable) therefore amounts to maximizing LD with respect to the αi, subject to the constraints defined in equation (5) and positi to the αi, with solution given by given in equation (4). Now we have Lagrange multiplier αi for the every training point. Those points from solution set, where αi > 0, are known as support vectors and therefore are lying on any of the hyper-planes H1, H2. All other training points have αi = 0 and lie either on H1 or H2 as earlier defined in the equal in equation (3) holds, or on other side of H1 or H2 such that it is defined inequal in equation (3) holds. For these kind models the support vectors are major component of the training set. They are located nearest to the decision boundary, if we remove all the remaining training points or moved them around subjected to a condition that they do not cross H1 or H2, and training has repeated and consequently the same hyper-plane is generated then the above 203 Monika Lamba

3 algorithm for linearly separable data when applied for the non-separable data does not guarantee a feasible solution. 3. EXPERIMENTAL SETUP In order to evaluate the multi-variant Support Vector Machine, five biomedical datasets were used from the UCI machine learning data repository, including Indians diabetes dataset, Iris dataset, Transfusion service center dataset, dataset and Cancer Wisconsin (Original) dataset. The five medical datasets represent pima diabetes, iris, blood transfusion, ecoli, breast cancer problems respectively. Table 1 contains detailed descriptions of the datasets. Only Cancer Wisconsin dataset contains missing values, and the missing categorical attributes are replace with the mode of the attributes. In this study, the Indian Database, Iris Database, Transfusion Service Center database, Database and Cancer Database, an UCI Machine Learning Repository was analysed. The Indian dataset consists of 768 instances, each record in the database has eight attributes. Therefore, the class has a distribution of 268(34.89% approx) samples and 500 (65.10% approx) not having diabetes samples. The Iris consists of 150 instances, each record in the database has four attributes. Therefore, the class has a distribution of 50(33.33%) Iris-setosa samples, 50(33.33%) Iris-virsicolor samples and 50(33.33%) Iris-virginica samples. Table 1. Description of the five medical datasets used in the experiments. No. s # of classe s # of Instance s # of feature s Mis sing Val ues. 1. Indians No 2. Iris No 3. Transfusion Service Center No No 5. Cancer Wisconsin(ori ginal) Yes Detailed description of Medical s. Table 2 Using Five cross validation on different attributes of Indian Indians with attributes on five cross validation Table 3 Using Five cross validation on different attributes of Iris Attribut e [1,2] [1,2,3] [1,2,3,4] Iris with attributes on five cross validation Table 4 Using Five cross validation on different attributes of Transfusion Service Center Attribut e [1,2] [1,2,3] [1,2,3,4] Transfusion Service Center with attributes on five cross validation Table 5 Using Five cross validation on different attributes of v [2,3] [2,3,4] [2,3,4,5] ] [2,3,4,5, 6,7] [2,3,4,5, 6,7,8] [2,3,4,5, 6,7,8,9] [2,3] [2,3,4] [2,3,4,5] ] ,7] ,7,8] with attributes on five cross validation 204 Monika Lamba

4 The Transfusion service center dataset consists of 748 instances, each record in the database has five attributes. So, the class has a distribution of 177 (24%) donate the blood samples and 570 (76%) cannot donate the blood samples. The dataset consists of 336 instances, each record in the database has eight attributes. This dataset consists of eight classes. The breast Cancer dataset consists of 699 instances, each record has ten attributes. So, the class has a distribution of 458(68.46%) benign samples and 241(36.02%) malignant samples. Table 6 Using Five cross validation on different attributes of Cancer v [2,3] [2,3,4] [2,3,4,5] ] ,7] ,7, 8],7, 8,9],7, 8,9,10] Cancer with attributes on five cross validation Table 7 Using Ten cross validation on different attributes of Indian c [2,3] [2,3,4] [2,3,4,5] ],7],7,8],7,8,9] Indians with attributes on ten cross validation Table 8 Using Ten cross validation on different attributes of Iris Iris with attributes on ten cross validation Table 9 Using Ten cross validation on different attributes of Transfusion Service Center c [1,2] [1,2,3] [1,2,3,4] Transfusion Service Center with attributes on ten cross validation Table 10 Using Ten cross validation on different attributes of ci ty [2,3] [2,3,4] [2,3,4,5] ], 7], 7,8] c [1,2] [1,2,3] [1,2,3,4] with attributes on ten cross validation Table 11 Using Ten cross validation on different attributes of Cancer ci ty [2,3] [2,3,4] [2,3,4,5] ] ,7] ,7,8] ,7,8,9],7,8,9,10] Cancer with attributes on ten cross validation 205 Monika Lamba

5 4. RESULT As in Indian diabetes dataset, the classifier gives the best sensiti with attribute [2,3,4] are selected for training and testing the machine for 5 cross validation as shown in Table 2 and for 10 cross validation best sensiti achieved is with attribute [2,3] shown in Table 7. The best specifi is achieved for attribute,7,8,9] with 5 cross validation as shown in table 2 and for attribute,7,8,9] as shown in Table 7. The best accuracy is achieved with attribute,7] with 5 cross validation as shown in Table 2 and with attribute,7] with 10 cross validation as shown in Table 7. As in Iris dataset, the classifier gives the best sensiti 1.00 with every combination of attributes selected for training and testing the machine for 5 cross validation as shown in Table 3 and for 10 cross validation best sensiti achieved is 1.00 with every combination of attributes shown in Table 8. The best specifi is achieved 0.98 for every combination of attributes for 5 cross validation as shown in table 3 and 0.98 for every combination of attributes for 10 cross validation as shown in Table Figure 1. Graph of five medical s with Accuracy vs 5 fold cross validation. 206 Monika Lamba

6 8. The best accuracy is achieved with every ccombination of attributes for 5 cross validation as shown in Table 3 and with every combination attributes for 10 cross validation as shown in Table 8. As in Transfusion Service Center dataset, the classifier gives the best sensiti with attribute [1,2,3] are selected for training and testing the machine for 5 cross validation as shown in Table 4 and for 10 cross validation best sensiti Figure 2.Graph of five medical s with Accuracy vs 10 fold cross validation. combination of attributes for 5 cross validation as shown in Table 3 and with every combination attributes for 10 cross validation as shown in Table 8. achieved is with attribute [1,2,3,4] as shown in Table 9. The best specifi is achieved with attribute [1,2,3,4] for 5 cross validation as shown in Table 4 and with attribute [1,2] for 10 cross validation as shown in Table 9. The best 207 Monika Lamba

7 accuracy is achieved with attribute [1,2,3,4] for 5 cross validation as shown in Table 4 and with attribute [1,2,3,4] for 10 cross validation as shown in Table 9. As in dataset, the classifier gives the best sensiti with attribute,7] are selected for training and testing the machine for 5 cross validation as shown in Table 5 and for 10 cross validation best sensiti achieved is with attribute,7] as shown in Table 10. The best specifi is achieved with attribute,7] for 5 cross validation as shown in Table 5 and with attribute,7] for 10 cross validation as shown in Table 10. The best accuracy is achieved with attribute,7] for 5 cross validation as shown in Table 5 and with attribute,7] for 10 cross validation as shown in Table 10. As in Cancer dataset, classifier gives the best sensiti with attribute [2,3,4] are selected for training and testing the machine for 5 cross validation as shown in Table 6 and for 10 cross validation best sensiti achieved is with attribute [2,3,4] as shown in Table 11. The best specifi is achieved with attribute,7,8,9,10] for 5 cross validation as shown in Table 6 and with attribute,7,8,9,10] for 10 cross validation as shown in Table 11. The best accuracy is achieved with attribute,7] for 5 cross validation as shown in Table 6 and with attributes [2,3,4,5] and,7] for 10 cross validation as shown in Table 11. Table 12 Accuracy of Five Medical using 5 cross validation Val ue Of fold s Datase t Iris Cancer Transfu sion Indians Table 13 Accuracy of Five Medical dataset using 10 cross validation Value Of Folds Iris cancer Tranfusion Indian The result obtained using the support vector machine classifiers by selecting variables attribute selection is shown in Table 14. Table 14 Summarized result of 5 and 10 cross validation Different types of datasets Indians 5-Cross Validation 10-Cross Validation Accuracy Accuracy Iris Accuracy Accuracy Transfusion Service Center Cancer Accuracy Accuracy Accuracy Accuracy Accuracy Accuracy Monika Lamba

8 5. CONCLUSION This paper describes the potency of SVM in the field of computational biology for which SVM are known to perform well as compared to other machine learning or statistical methods. 5-fold and 10-fold cross validations are approaches for solving medical data classification problems. In this paper, I explore the use of 5-fold and 10-fold cross 1 validations for classification. Based on the results of the current datasets of the UCI database, I conclude that various selection of features results different accuracy. This indicates that the proposed 5-fold and 10-fold cross validation methods help in selecting those variables that will result in highest accuracy. The result may be much better for larger set of real data Iris 0.85 Iris cancer cancer Transfusio n Indains Transfusio n Indian Figure 3. Graph of 5 Different with Accuracy vs 5 fold cross Validation Figure 4. Graph of 5 Different with Accuracy vs 10 fold cross Validation. 209 Monika Lamba

9 6. REFERENCES B.E, I.M, V.N A Training Algorithm for Optimal Margin Classifiers, ACM, New York, NY, USA. Chen, Wang, Tsai, Wang, Adrian, Cheng, Yang, Teng, Tan, & Chang Gene selection for cancer identification: a decision tree model empowered by particle swarm optimization algorithm. BMC Bioinformatics. Chao, Horng The Construction of support vector machine classifier using the firefly algorithm. Comput Intell Neurosci. Cortes, Vapnik Support -Vector Networks. Mach. Learn. 20 (3) Springer Fayyad, Piatetsky-Shapiro, Smyth From Data Mining to knowledge Discovery in Database - AI magazine. aaai.org Ioanna, Evangelos, George Automatic detection of abnormal tissue in mammography ICIP (2) K.S, T.S, H An application of support vector machines in bankruptcy prediction model, Expert Syst. Appl Lei, Huan Efficient Feature Selection via Analysis of Relevance and Redundancy The Journal of Machine Learning Research Volume 5, Pages Marcano-Cedeño, Quintanilla-Domínguez, D WBCD breast cancer database classification P.S,applying artificial metaplasti neural Network Expert Systems with Applications Shen, Chen, Yu, Kang, Zhang, Li, Yang, Liu Evolving support vector machines using fruit fly optimization for medical data classification. Science Direct. R, & J.S Non-Extensive Entropy for CAD Systems of Cancer Images. Pp T. S, Vennila, & S mass classification based on cytological patterns using RBFNN and SVM Expert Syst. Appl. 36(3): Tzanis, Berberidis, Vlahavas Machine Learning and Data Mining in Bioinformatics. Vapnik The Nature of Statistical Learning Theory, Springer, New York. Zhang, Schwartz, Wagner, Miller A greedy algorithm for aligning DNA sequences J Comput Biol Monika Lamba

INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY

IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY A Medical Decision Support System based on Genetic Algorithm and Least Square Support Vector Machine for Diabetes Disease Diagnosis