International Journal of Mechanical Engineering and Technology (IJMET)
Volume 8, Issue 10, October 2017, pp. 741-749, Article ID: IJMET_08_10_080
Available online at http://www.iaeme.com/ijmet/issues.asp?jtype=ijmet&vtype=8&itype=10
ISSN Print: 0976-6340 and ISSN Online: 0976-6359
IAEME Publication, Scopus Indexed

INVESTIGATING ILPD FOR MOST SIGNIFICANT FEATURES

Jothi Lakshmi U, K.Jayanthi and M.Sathya
Assistant Professor, Department of Information Technology, Veltech University, Chennai, India

ABSTRACT
Nowadays people spend a considerable share of their earnings on health issues. According to the World Health Organization, liver disease is now the tenth most common cause of death in India. The factors that contribute to the global spread of hepatitis B virus (HBV) and hepatitis C virus (HCV) infection are immigration, cheap air travel, and globalization [1]. In data mining and machine learning, the curse of dimensionality is a common issue that degrades query accuracy and efficiency. This paper aims to identify the vital attributes for diagnosing liver disease in a patient, which could eventually improve future disease prediction through machine learning. This is done by exploring the Indian Liver Patients data with the machine learning algorithm C4.5. For this study we used a data mining tool called Tanagra.

Keywords: Machine Learning, Feature Selection, C4.5, Liver Disease, ILPD.

Cite this Article: Jothi Lakshmi U, K.Jayanthi and M.Sathya, Investigating ILPD for Most Significant Features, International Journal of Mechanical Engineering and Technology 8(10), 2017, pp. 741-749. http://www.iaeme.com/ijmet/issues.asp?jtype=ijmet&vtype=8&itype=10

1. INTRODUCTION
The liver is the largest internal organ and gland, and it plays a vital role in keeping a person active. Liver disease is one of several fatal diseases, and its symptoms are often not apparent in the early stages.
Broad factors behind the spread of liver disease are immigration, cheap air travel, and globalization [1]. More specifically, alcohol addiction, smoking, consumption of contaminated food, obesity, diabetes and heredity are also major causes of liver disease [5]. Diagnosing such disease is crucial, and there are several clinical methods and procedures for diagnosing liver disorders. The Indian Liver Patients data is a publicly available dataset that is commonly used for finding new ways to predict liver disease through machine learning. Data mining in the health care sector helps in predicting disease by looking for specific patterns in existing patient records. In machine learning, the curse of dimensionality has a negative impact on query accuracy and efficiency. Reducing the number of dimensions is therefore relevant for improving the performance of the classifier.

http://www.iaeme.com/ijmet/index.asp 741 editor@iaeme.com
C4.5 is a classifier that generates a decision tree using entropy-based information gain (normalized as the gain ratio), and it is an extension of ID3. In this work C4.5 has been chosen for exploring the ILPD dataset. The data set was collected from north east of Andhra Pradesh, India.

2. LITERATURE SURVEY

Alkaline Phosphatase Level Test
An alkaline phosphatase level test (ALP test) measures the amount of alkaline phosphatase enzyme in your bloodstream. The test requires a simple blood draw and is often a routine part of other blood tests. Abnormal levels of ALP in your blood most often indicate a problem with your liver, gallbladder, or bones.

Figure 1 Liver Function [2]

Bilirubin Test
1. A bilirubin test is used to help determine the cause of jaundice, a yellowing of your skin and the whites of your eyes.
2. It helps diagnose conditions like liver disease, hemolytic anemia, and blocked bile ducts.
3. Normal adult values are:
   Total bilirubin: 0.3 to 1.0 mg/dl or 5.1 to 17.0 µmol/l
   Direct bilirubin: 0.1 to 0.3 mg/dl or 1.0 to 5.1 µmol/l
   Indirect bilirubin: 0.2 to 0.7 mg/dl or 3.4 to 11.9 µmol/l

Albumin and liver disease
Albumin is a protein made by the liver. A serum albumin test measures the amount of this protein in the clear liquid portion of the blood. Albumin can also be measured in the urine. A normal albumin range is 3.4 to 5.4 g/dl. If you have a lower albumin level, you may have malnutrition. It can also mean that you have liver disease or an inflammatory disease. Higher albumin levels may be caused by acute infections, burns, and stress from surgery or a heart attack [9].
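As a purely illustrative sketch (not a diagnostic tool and not part of the original study), the reference ranges quoted above can be collected into a small lookup and used to flag a measured value as low, normal or high. The dictionary keys and function names are hypothetical; the helper for indirect bilirubin relies on the standard relation that indirect bilirubin is total bilirubin minus direct bilirubin.

```python
# Adult reference ranges as quoted in the text (mg/dl for bilirubin, g/dl for albumin).
# The key names are hypothetical labels chosen for this sketch.
REFERENCE_RANGES = {
    "total_bilirubin_mg_dl": (0.3, 1.0),
    "direct_bilirubin_mg_dl": (0.1, 0.3),
    "indirect_bilirubin_mg_dl": (0.2, 0.7),
    "albumin_g_dl": (3.4, 5.4),
}


def flag_value(test_name, value):
    """Return 'low', 'normal' or 'high' relative to the quoted reference range."""
    low, high = REFERENCE_RANGES[test_name]
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"


def indirect_bilirubin(total, direct):
    """Indirect bilirubin is the difference between total and direct bilirubin."""
    return total - direct


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    print(flag_value("total_bilirubin_mg_dl", 2.4))
    print(flag_value("albumin_g_dl", 3.0))
    print(indirect_bilirubin(1.0, 0.3))
```

Keeping the ranges in one table makes it easy to see which ILPD attributes (TB, DB, ALB) have a quoted normal range and which do not.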
SGPT - Serum Glutamic Pyruvic Transaminase
The SGPT test measures the level of alanine aminotransferase (ALT) in your blood. ALT is an enzyme made by cells in your liver. As mentioned earlier, the important functions of the liver include making proteins, storing vitamins and iron, removing toxins from your blood, and producing bile, which aids in digestion. Proteins called enzymes help the liver break down other proteins so the human body can absorb them more easily. SGPT is one of these enzymes. It plays a crucial role in metabolism, the process that turns food into energy. This enzyme is normally found inside liver cells. When the liver is damaged or inflamed, the enzyme can be released into the bloodstream, which causes ALT levels to rise. Measuring the level of ALT in a person's blood can help doctors evaluate liver function or determine the underlying cause of a liver problem. The ALT test is often part of an initial screening for liver disease.

SGOT - Serum Glutamic-Oxaloacetic Transaminase
This test measures the level of the enzyme aspartate aminotransferase (AST) in the blood. AST is found in red blood cells, liver cells, and muscle cells, including the heart. It is released into the blood when these cells are damaged. The AST level is measured to check the liver, kidneys, heart, pancreas, muscles, and red blood cells. This test is also done to monitor medical treatments that may affect the liver.

TP - Total Protein
If there are symptoms of kidney or liver disease, a total protein test is done. This test checks the levels of protein, specifically albumin and globulin, in the blood. If total protein is abnormal, further testing must be performed to identify which specific protein is abnormally low or high so that a specific diagnosis can be made.

A/G ratio
This is a blood test to measure the levels of protein in your body. Your liver makes most of the proteins that are found in your blood.
Albumin is one major type of protein. Albumin carries many other substances around your system, including medicines and products your body makes. Another kind of protein called globulin has other functions in your body. This test provides information about the amount of albumin you have compared with globulin. This comparison is called the A/G ratio. The test is useful when your healthcare provider suspects you have liver disease. Certain diseases tend to lower your level of albumin and raise your level of one or more types of globulins. A normal range of albumin is 39 to 51 grams per liter (g/l) of blood. The normal range for globulins varies by specific type. A normal range for total globulins is 23 to 35 g/l.

C4.5
Ross Quinlan developed the C4.5 algorithm in 1993. It guards against overfitting by pruning the tree, it can handle continuous attributes and missing data, and it can convert the tree into rules.

Works done with ILPD
A considerable amount of work has already been done with ILPD. In [3], [4], [5] and [6], different machine learning algorithms are applied to this dataset and their performance is evaluated.
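To make the entropy-based splitting criterion behind C4.5 more concrete, the following is a minimal sketch of how a gain ratio could be computed for a binary split on one continuous attribute. It is a simplification under stated assumptions: real C4.5 implementations (including the one in Tanagra) add pruning, missing-value handling and further threshold heuristics that are omitted here. The function names and the toy values are hypothetical and are not taken from ILPD.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split 'value <= threshold' versus 'value > threshold'."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty carries no information
    total = len(labels)
    w_left, w_right = len(left) / total, len(right) / total
    info_gain = entropy(labels) - w_left * entropy(left) - w_right * entropy(right)
    split_info = -(w_left * log2(w_left) + w_right * log2(w_right))
    return info_gain / split_info


def best_threshold(values, labels):
    """Candidate thresholds are the observed values except the largest one."""
    candidates = sorted(set(values))[:-1]
    return max(candidates, key=lambda t: gain_ratio(values, labels, t), default=None)


# Hypothetical toy data: total bilirubin readings and class labels (1 or 2).
TOY_TB = [0.7, 0.9, 1.0, 3.2, 7.3, 10.9]
TOY_CLASS = [2, 2, 2, 1, 1, 1]

if __name__ == "__main__":
    print(best_threshold(TOY_TB, TOY_CLASS))
```

In this simplified form the threshold is chosen directly by gain ratio; an attribute whose best split separates the classes cleanly scores close to 1, while an uninformative attribute scores close to 0.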
Importance of Feature Selection
Finding the relevant feature set in a dataset is important because it may improve the performance of the classifier.

3. EXPERIMENT
The intention of this experiment is to find a small set of attributes with which the accuracy and efficiency of the classifier can be improved. For this purpose the supervised learning algorithm C4.5 is executed in the data mining tool Tanagra.

ILPD
The Indian Liver Patients data is a dataset publicly available through the UCI archive. It consists of 11 attributes, viz. Age, Gender, TB, DB, AlkPhos, Sgpt, Sgot, TP, ALB, A/G ratio, and class. All listed attributes are continuous except Gender, which takes one of two values, M or F. The attribute class (Selector) is the class label that divides the records into two groups, 1 (liver patient) or 2 (non-liver patient). The full data set contains 583 records: 416 liver patient records and 167 non-liver patient records, or equivalently 441 male patient records and 142 female patient records. The experiments reported here use 579 instances. Any patient whose age exceeded 89 is listed as being of age "90".

Table 1 Attribute description

Attribute    Description
Age          Age of the patient
Gender       Gender of the patient
TB           Total Bilirubin
DB           Direct Bilirubin
AlkPhos      Alkaline Phosphatase
Sgpt         Alanine Aminotransferase
Sgot         Aspartate Aminotransferase
TP           Total Protein
ALB          Albumin
A/G Ratio    Albumin Globulin ratio
Class        Selector

Table 2 Attribute category

Attribute    Target    Input
Age          -         yes
Gender       -         yes
TB           -         yes
DB           -         yes
AlkPhos      -         yes
Sgpt         -         yes
Sgot         -         yes
TP           -         yes
ALB          -         yes
A/G          -         yes
class        yes       -

Sampling
The dataset is sampled so that 75% of the instances (434) form the training set and 25% (145) form the test set.

Experiment 1 (with all attributes)
Here the complete set of attributes is taken into account. The attributes Age, Gender, TB, DB, AlkPhos, Sgpt, Sgot, TP, ALB and A/G are set as input attributes and class as the target attribute.

Training result
Error rate: 0.1475

Value    Recall    1-Precision
1        0.9154    0.1125
2        0.6783    0.2571

Confusion matrix (rows: actual class, columns: predicted class)
         1      2      Sum
1        292    27     319
2        37     78     115
Sum      329    105    434

Test result
Error rate: 0.3517

Value    Recall    1-Precision
1        0.8316    0.3070
2        0.3000    0.5161

Confusion matrix (rows: actual class, columns: predicted class)
         1      2      Sum
1        79     16     95
2        35     15     50
Sum      114    31     145

Experiment 2 (with only Age and Gender)
Here only two attributes are taken into account. The attributes Age and Gender are set as input attributes and class as the target attribute. With this setup the classifier's error rate is 0.1111, the recall of selector 1 is 0.9189 and that of selector 2 is 0.7879. This result is a real surprise, though it does not make sense factually.
Training result
Error rate: 0.2327

Value    Recall    1-Precision
1        0.9812    0.2328
2        0.1739    0.2308

Confusion matrix (rows: actual class, columns: predicted class)
         1      2      Sum
1        313    6      319
2        95     20     115
Sum      408    26     434

Test result
Error rate: 0.3379

Value    Recall    1-Precision
1        0.9579    0.3309
2        0.1000    0.4444

Confusion matrix (rows: actual class, columns: predicted class)
         1      2      Sum
1        91     4      95
2        45     5      50
Sum      136    9      145

Experiment 3 (with only AlkPhos)

Training result
Error rate: 0.2442

Value    Recall    1-Precision
1        0.9687    0.2370
2        0.1652    0.3448

Confusion matrix (rows: actual class, columns: predicted class)
         1      2      Sum
1        309    10     319
2        96     19     115
Sum      405    29     434

Test result
Error rate: 0.3586

Value    Recall    1-Precision
1        0.9368    0.3407
2        0.0800    0.6000

Confusion matrix (rows: actual class, columns: predicted class)
         1      2      Sum
1        89     6      95
2        46     4      50
Sum      135    10     145

Experiment 4 (with TB, DB, AlkPhos, Sgpt, Sgot)

Training result
Error rate: 0.1521

Value    Recall    1-Precision
1        0.8840    0.0932
2        0.7478    0.3008

Confusion matrix (rows: actual class, columns: predicted class)
         1      2      Sum
1        282    37     319
2        29     86     115
Sum      311    123    434

Test result
Error rate: 0.3241

Value    Recall    1-Precision
1        0.7789    0.2600
2        0.4800    0.4667

Confusion matrix (rows: actual class, columns: predicted class)
         1      2      Sum
1        74     21     95
2        26     24     50
Sum      100    45     145

Experiment 5 (with TP, ALB, A/G)

Training result
Error rate: 0.2465

Value    Recall    1-Precision
1        0.9969    0.2500
2        0.0783    0.1000

Confusion matrix (rows: actual class, columns: predicted class)
         1      2      Sum
1        318    1      319
2        106    9      115
Sum      424    10     434

Test result
Error rate: 0.4138

Value    Recall    1-Precision
1        0.8632    0.3643
2        0.0600    0.8125

Confusion matrix (rows: actual class, columns: predicted class)
         1      2      Sum
1        82     13     95
2        47     3      50
Sum      129    16     145

Experiment 6 (without Age and Gender)

Training result
Error rate: 0.1406

Value    Recall    1-Precision
1        0.9154    0.1043
2        0.7043    0.2500

Confusion matrix (rows: actual class, columns: predicted class)
         1      2      Sum
1        292    27     319
2        34     81     115
Sum      326    108    434

Test result
Error rate: 0.3241

Value    Recall    1-Precision
1        0.8211    0.2778
2        0.4000    0.4595

Confusion matrix (rows: actual class, columns: predicted class)
         1      2      Sum
1        78     17     95
2        30     20     50
Sum      108    37     145

4. CONCLUSION
In this paper the Indian Liver Patient Data is investigated to find the significance of each feature/attribute for classifier performance. The overall result shows that each attribute makes its own contribution to classifier performance. More specifically, the contribution of TB, DB, AlkPhos, Sgpt and Sgot is higher than that of the other attributes: the test error rate with these five attributes alone (0.3241) is lower than with the complete attribute set (0.3517), with Age and Gender only (0.3379), with AlkPhos only (0.3586) and with TP, ALB and A/G (0.4138). Hence it can be concluded from the results that TB, DB, AlkPhos, Sgpt and Sgot are the most significant attributes in the ILPD dataset. This experiment is only a starting point for future work on the ILPD dataset.
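The error rate, recall and 1-precision values reported in the experiments can be recomputed from the confusion matrices. The sketch below shows one way to do this, assuming that rows hold the actual class and columns the predicted class (an assumption that is consistent with the reported recall values). The function name is hypothetical; the example matrix is the training result of Experiment 1.

```python
def summarize_confusion_matrix(matrix):
    """Compute error rate, per-class recall and per-class 1-precision.

    matrix[i][j] is the number of records of actual class i predicted as class j
    (assumed layout: rows = actual class, columns = predicted class).
    """
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n_classes))
    error_rate = 1 - correct / total
    # Recall: correctly predicted records of a class divided by its row total.
    recall = [matrix[i][i] / sum(matrix[i]) for i in range(n_classes)]
    # 1-precision: share of the predictions for a class that were wrong.
    one_minus_precision = [
        1 - matrix[j][j] / sum(row[j] for row in matrix) for j in range(n_classes)
    ]
    return error_rate, recall, one_minus_precision


# Training confusion matrix of Experiment 1 (all attributes), without the Sum row/column.
EXPERIMENT_1_TRAINING = [
    [292, 27],
    [37, 78],
]

if __name__ == "__main__":
    error, recall, omp = summarize_confusion_matrix(EXPERIMENT_1_TRAINING)
    print(round(error, 4), [round(r, 4) for r in recall], [round(p, 4) for p in omp])
```

Applying the same function to the test matrices makes the gap between training and test error in each experiment easy to compare.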
REFERENCES
[1] Williams R, Global challenges in liver disease, PubMed.gov, NCBI
[2] http://www.medicinenet.com/liver_disease/article.htm
[3] P. Rajeswari and G. Sophia Reena, Analysis of Liver Disorder Using Data mining Algorithm, Global Journal of Computer Science and Technology, Vol. 10, Issue 14 (Ver. 1.0), November 2010
[4] A.S. Aneeshkumar and C. Jothi Venkateswaran, An Approach of Data Mining for Predicting the Chances of Liver Disease in Ectopic Pregnant Groups, Special Issue of International Journal of Computer Applications (0975-8887), The International Conference on Communication, Computing and Information Technology (ICCCMIT) 2012
[5] Dr. S. Vijayarani and Mr. S. Dhayanand, Liver Disease Prediction using SVM and Naïve Bayes Algorithms, International Journal of Science, Engineering and Technology Research (IJSETR), Volume 4, Issue 4, April 2015, ISSN: 2278-7798
[6] D. Sindhuja and R. Jemina Priyadarsini, Liver Disease Analysis And Accuracy Prediction Using Machine Learning Techniques, I J C T A, 9(26), 2016, pp. 379-384, International Science Press
[7] YongSeog Kim, W. Nick Street and Filippo Menczer, Feature Selection in Data Mining, University of Iowa, USA
[8] Harvinder Chauhan and Anu Chauhan, Implementation of decision tree algorithm c4.5, International Journal of Scientific and Research Publications, Volume 3, Issue 10, October 2013, ISSN 2250-3153
[9] https://www.urmc.rochester.edu/encyclopedia/content.aspx?...167...albumin_blood
[10] Er. Harpal, Dr. Gaurav Tejpal and Dr. Sonal Sharma, Machine Learning Techniques for Wormhole Attack Detection Techniques in Wireless Sensor Networks, International Journal of Mechanical Engineering and Technology 8(9), 2017, pp. 337-348
[11] Padmakumari P and Umamakeswari A, Hybrid Statistical and Machine Learning Methods for Failure Prediction in Cloud, International Journal of Mechanical Engineering and Technology 8(8), 2017, pp. 714-719
[12] Taran Singh Bharati and R. Kumar,
Intrusion Detection System for Manet Using Machine Learning and State Transition Analysis, International Journal of Computer Engineering and Technology, 6(12), 2015, pp. 01-08
[13] C.R. Cyril Anthoni and Dr. A. Christy, Integration of Feature Sets with Machine Learning Techniques for Spam Filtering, International Journal of Computer Engineering and Technology (IJCET), Volume 2, Number 1, Jan - April (2011), pp. 47-52
[14] Goverdhan Reddy Jidiga and Dr. P Sammulal, Machine Learning Approach to Anomaly Detection In Cyber Security with A Case Study of Spamming Attack, International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 3, May-June (2013), pp. 113-122