A Classification Model for Imbalanced Medical Data based on PCA and Farther Distance based Synthetic Minority Oversampling Technique

NADIR MUSTAFA
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China

JIAN-PING LI
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China

Engr. Raheel A. Memon
Assistant Professor, Computer Science, Sukkur Institute of Business Administration, Airport Road, Sukkur 65200, Sindh, Pakistan

Mohammed Z. Omer
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China

Abstract

Medical data are used extensively in the diagnosis of human health, so they play a vital role for physicians as well as in medical engineering. Accordingly, much research is under way to improve the prediction of diseases and the quality of diagnosis. However, most researchers work on either the dimensionality of the data or the class imbalance, not both. Because both factors are equally important, addressing only one of them can lead to inaccurate predictions or classifications of malignant diseases. More work is therefore required to address these biomedical challenges by combining both factors. This paper proposes a new and efficient combined algorithm based on FD_SMOTE (Farther Distance based Synthetic Minority Oversampling Technique) and Principal Component Analysis (PCA), which reduces the high dimensionality and balances the minority class. The algorithm has been investigated on biomedical data and gives the desired results in terms of dimensionality and data balancing. The quality of dimensionality reduction and data balancing has been evaluated using assessment metrics such as covariance, Accuracy (ACC) and Area Under the Curve (AUC). The numerical results show that the algorithm achieved the best performance in terms of ACC and AUC.
Keywords: Principal Component Analysis; Information Gain; Farther Distance based Synthetic Minority Oversampling; Correlation based Feature

I. INTRODUCTION

Classification is an important task in machine learning and data mining. Classification modeling learns a function from training data that makes as few errors as possible when applied to previously unseen data. A large number of classification algorithms have been developed and used in medical applications because of their importance to physicians in diagnosis, and much research has discussed the challenges of medical data. Class imbalance is the main challenge that affects the classification of medical data. In many cases medical data follow a skewed distribution: the instances of the majority and minority classes are not equally represented [1, 2]. The data become imbalanced when the majority class has a much larger number of instances. Traditional classification algorithms then obtain higher accuracy on the majority class and lower accuracy on the minority class. For this reason, new techniques and methods for dealing with class imbalance have been proposed [9]. These techniques can be classified into three groups: data-level methods, which amend the data distribution by resampling [11]; algorithm-level methods, which adapt a base classifier to deal with class imbalance; and feature-selection methods, which find an optimal subset among all the features. In this paper we propose a combined solution for classifying imbalanced data, which reduces dimensionality and balances the minority class using a combination of Principal Component Analysis (PCA) and the Synthetic Minority Oversampling Technique. The innovation of this proposal is the joint use of PCA and FD_SMOTE, which achieved superior results in our experiments.
In this paper, the quality of dimensionality reduction and data balancing has been evaluated using assessment metrics such as covariance, Accuracy (ACC), and Area Under the Curve (AUC). The numerical results show that the algorithm achieved the best performance in terms of ACC and AUC. The FD_SMOTE technique has been investigated on biomedical data, and it realized the desired results in terms of dimensionality and data balancing. This paper is organized as follows. Section II presents the background of the study and reviews existing approaches. Section III describes the proposed method. Section IV reports the experimental analysis. Section V concludes the paper.

II. BACKGROUNDS

Imbalanced data is an important issue in real-world applications. Because correct classification of the minority class can have higher priority than that of the majority class, improving the classification precision of the minority class is significant work. This section explains the basic concept of the problem and the associated solutions.

A. Imbalanced Data Problem

Sun et al. stated that the most apparent problem in a data set is the imbalanced data distribution between classes [10]. Nevertheless, earlier studies stated that the imbalanced data distribution is not the only issue that reduces the performance of existing classifiers in identifying rare samples. Other issues that influence classifier performance are small sample size, class separability, and the existence of within-class concepts.

B. Presented Approaches to the Imbalanced Data Problem

Different approaches have been presented to tackle the class imbalance problem [7], [8], [9]. They can be categorized as resampling approaches, algorithm approaches, and feature selection approaches. The resampling (preprocessing) approach combines oversampling and under-sampling techniques. Oversampling adds new samples, while under-sampling removes existing samples. The algorithm approaches mostly fix the imbalance by creating new classification algorithms or updating existing ones. The classification algorithm should include cost sensitivity, recognition-based approaches, and kernel-based learning techniques, which provide an acceptable solution for the imbalanced data problem. The support vector machine (SVM) is one of the most popular algorithms that embed these techniques [9]. Because of the large amount of biomedical data and the class imbalance ratio, applying the algorithm alone is not a good idea. Hence new hybrid approaches are required that combine sampling techniques and algorithms [10].
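The two resampling operations described in this subsection can be illustrated with a minimal sketch. It assumes plain random oversampling (duplicating minority rows) and random under-sampling (dropping majority rows), which are the simplest variants and not the method proposed in this paper; the function names and the toy data are hypothetical.

```python
import numpy as np


def random_oversample(X_min, n_target, rng):
    """Duplicate randomly chosen minority rows until n_target rows exist."""
    n_extra = max(n_target - len(X_min), 0)
    extra = rng.integers(0, len(X_min), size=n_extra)
    return np.vstack([X_min, X_min[extra]])


def random_undersample(X_maj, n_target, rng):
    """Keep a random subset of n_target majority rows (without replacement)."""
    n_keep = min(n_target, len(X_maj))
    keep = rng.choice(len(X_maj), size=n_keep, replace=False)
    return X_maj[keep]


# Hypothetical toy data: 100 majority and 20 minority samples, 3 features.
rng = np.random.default_rng(0)
X_maj = rng.normal(size=(100, 3))
X_min = rng.normal(size=(20, 3))

over = random_oversample(X_min, n_target=100, rng=rng)    # minority grown to 100
under = random_undersample(X_maj, n_target=20, rng=rng)   # majority cut to 20
print(over.shape, under.shape)
```

Either call yields a balanced pair of classes; oversampling keeps all the original information at the cost of duplicates, while under-sampling discards majority samples.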
The algorithm approach is the most popular technique used to fix the imbalanced data problem, namely the bias towards the majority class and the neglect of the minority class. Correct classification of the minority class gives better accuracy, while in many applications misclassification of the minority class results in serious problems [11]. Inaccurate classification of a benign disease leads to additional diagnosis, while inaccurate classification of a malignant disease puts human life at serious risk. Therefore, most machine learning algorithms try to reduce the misclassification of the minority class.

The feature selection approach has been presented as a good solution for large biomedical data. The data can be reduced to a lower-dimensional space using a linear or nonlinear transformation, chosen according to the linearity of the data. Class imbalance and high dimensionality together cause misclassification. Misclassification of entities that have the same attribute values could disturb the diagnosis of diseases. For example, the boundary between a malignant headache and a brain tumor could be vague under some circumstances, which is obviously catastrophic. It is therefore not easy for medical doctors to examine human abnormalities in misclassified data. A hybrid of dimensionality reduction and data balancing is necessary in most biomedical applications in order to recover details that misclassification may hide in the data [3][4].

III. THE PROPOSED METHOD

The proposed method provides an accurate classification model by combining PCA and the SMOTE technique. PCA is used to reduce the high dimensionality of the data by selecting optimal features from the original data set.
PCA generates a new dimension space of the data, to which FD_SMOTE is applied to balance the minority class. The imbalanced data are split into training and test data, and the balanced data are then applied to different classifiers to achieve better classification of the medical data.

A. Principal Component Analysis

In the proposed model, feature selection is the key technique for finding a subset of optimal features from the original data. The extracted features allow the classifier to achieve the best accuracy. Here, PCA reduces high-dimensional points to lower-dimensional points, and filters then order the importance of the selected attributes based on a rule [5]. In this model, the dimensionality reduction is implemented using the mean, covariance, eigenvalues and eigenvectors to compute the principal components. Finally, PCA provides a new transform of principal components (PCs), generated from the correlation matrix of the data, to find the best PCs among all the features. These steps are explained in Algorithm 1.

$$ C = \frac{1}{N}\sum_{j=1}^{N} \phi_j \phi_j^{T} = \frac{1}{N}\rho\rho^{T} \tag{1} $$

$$ \rho = [\phi_1\ \phi_2\ \cdots\ \phi_N] \tag{2} $$

$$ \phi_j = \upsilon_j - \mu \tag{3} $$

$$ \mu = \frac{1}{N}\sum_{j=1}^{N} \upsilon_j \tag{4} $$

Here $\upsilon_j$ is the $j$-th vector of the original data set, $N$ is the number of vectors, $\mu$ is the mean of the vectors, and $\phi_j$ is the deviation of $\upsilon_j$ from the mean. $C$ is the covariance matrix, generated by multiplying the deviations by their transpose, $\phi\phi^{T}$. Finally, the eigenvalues $\lambda$ and the eigenvectors are computed from the covariance matrix $C$ to obtain the new principal components.

Algorithm 1. Principal Component Analysis
Input: Original data set {X_i, i = 1, 2, ..., m}, in which each sample has m attributes, without the decision attribute.
Output: Principal components {Y_i, i = 1, 2, ..., n}
1: Vectorize the data into V_1, ..., V_m
2: for j in n do (j-th of all vectors)
3:   for i in m do (i-th instance of V)
4:     Compute the mean according to Eq. (4)
5:     Subtract the mean from the instances according to Eq. (3)
6:   end for
7:   Form the deviation matrix according to Eq. (2)
8:   Compute the covariance according to Eq. (1)
9: end for
10: Compute the eigenvalues λ of the covariance matrix C
11: Compute the eigenvectors of the covariance matrix C
12: Output the new principal components of the features

B. Farther Distance based SMOTE

The SMOTE technique addresses the imbalanced data distribution problem by oversampling. Its basic assumption concerns how to find the similarities of the features among the minority class instances. FD_SMOTE realizes this by calculating the centroid c of the minority class samples and the distance d between each minority sample and the centroid, and then computing the average avg of the distances. A seed sample is a minority sample whose distance to the class center c is greater than the average distance avg. A new synthetic sample is generated by randomly selecting one of the seed samples, multiplying the difference between the seed sample and the centroid by a random number σ in [0, 1], and adding the result to the original seed. The mathematical steps of the algorithm are as follows:

$$ c = \frac{1}{n}\sum_{i=1}^{n} y_i \tag{5} $$

$$ d_i = \lVert y_i - c \rVert \tag{6} $$

$$ \mathit{avg} = \frac{1}{n}\sum_{i=1}^{n} d_i \tag{7} $$

$$ S_s = \{\, y_i \mid d_i > \mathit{avg} \,\} \tag{8} $$

$$ \mathit{nss} = S_s + (S_s - c)\,\sigma \tag{9} $$

Here $y_i$ is the $i$-th of the $n$ minority samples, $S_s$ is the set of seed samples, and $\mathit{nss}$ is the new synthetic sample.

FD_SMOTE creates new examples instead of duplicating the minority class samples. As shown in Figure 1, the new synthetic examples are created in the neighborhood of the minority class. The synthetic examples are generated by operating in feature space rather than in data space. Each minority class sample is taken in turn, and synthetic examples are introduced along the line segments joining it to its minority class nearest neighbors. The number of synthetic examples required varies from situation to situation, so the number k of minority class nearest neighbors is chosen according to the requirement. Finally, the pseudo code of the proposed method is given in Algorithm 2.

Fig. 1. FD_SMOTE Technique

Algorithm 2. FD_SMOTE resampling
Input: Original minority set D_min = {Y_i, i = 1, 2, ..., n}; the balance factor σ
Output: New minority set D_maj = {Z_i, i = 1, 2, ..., m}
1: Compute c, d and avg according to Eqs. (5), (6) and (7)
2: Create the seed samples according to Eq. (8)
3: for σ do
4:   for m do
5:     Generate a random number γ
6:     Generate a new sample y according to Eq. (9)
7:   end for
8: end for
9: Output the new minority set
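The two algorithms above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `pca_transform` and `fd_smote`, the Euclidean norm used for the distance in Eq. (6), the uniform random choice of a seed, and the toy data are all assumptions made for the example.

```python
import numpy as np


def pca_transform(X, n_components):
    """Project the rows of X onto the leading principal components."""
    mu = X.mean(axis=0)                       # Eq. (4): mean vector
    phi = X - mu                              # Eq. (3): deviations from the mean
    C = phi.T @ phi / X.shape[0]              # Eq. (1): covariance, features x features
    eigvals, eigvecs = np.linalg.eigh(C)      # C is symmetric, so eigh applies
    order = np.argsort(eigvals)[::-1][:n_components]  # largest eigenvalues first
    return phi @ eigvecs[:, order], eigvals[order]


def fd_smote(Y, n_new, rng):
    """Append n_new synthetic samples to the minority set Y."""
    c = Y.mean(axis=0)                        # Eq. (5): centroid of the minority class
    d = np.linalg.norm(Y - c, axis=1)         # Eq. (6): distance to the centroid (assumed Euclidean)
    avg = d.mean()                            # Eq. (7): average distance
    seeds = Y[d > avg]                        # Eq. (8): samples farther than average
    if len(seeds) == 0 or n_new <= 0:         # e.g. all samples equidistant from c
        return Y.copy()
    picks = seeds[rng.integers(0, len(seeds), size=n_new)]  # assumed: uniform seed choice
    sigma = rng.random((n_new, 1))            # random factor in [0, 1)
    synthetic = picks + (picks - c) * sigma   # Eq. (9)
    return np.vstack([Y, synthetic])


# Hypothetical toy data: 45 majority and 15 minority samples, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = np.array([0] * 45 + [1] * 15)

Z, eigvals = pca_transform(X, n_components=2)       # reduce 5 features to 2 PCs
minority = fd_smote(Z[y == 1], n_new=30, rng=rng)   # grow the minority from 15 to 45
print(Z.shape, minority.shape)
```

On the toy data the script prints `(60, 2) (45, 2)`. Because Eq. (9) adds a non-negative multiple of the seed-to-centroid difference to the seed, every synthetic sample lies at least as far from the centroid as its seed, which is what distinguishes this scheme from interpolation between nearest neighbors.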

IV. EXPERIMENTAL ANALYSIS

A. Collected Data

Table I gives the characteristics of the data used in this work: the name, the number of instances, and the number of features of each data set. The data sets differ in size and in level of imbalance. They come from biomedical domains, some of which are proprietary. Pima diabetes, Breast cancer and Thyroid disease (each of which contains a binary class) are all available through the UCI repository [1].

TABLE I. DATA CHARACTERISTICS

| No. | Name | Instances | Features |
|---|---|---|---|
| 1 | Pima diabetes | 768 | 9 |
| 2 | Breast cancer | 699 | 11 |
| 3 | Thyroid disease | 3163 | 27 |

B. ACC Evaluation Measures

The confusion matrix is one of the most powerful tools for assessing the performance of a machine learning algorithm, as shown in Table II. Its columns and rows describe the predicted class and the actual class respectively. The confusion matrix entries are used to show the accuracy of the classification algorithm. The four entries are TN (True Negatives), FP (False Positives), FN (False Negatives) and TP (True Positives). Positive instances are either correctly classified (TP) or incorrectly classified (FN). Likewise, negative instances are either correctly classified (TN) or incorrectly classified (FP). The classification accuracy, or prediction accuracy, is calculated as in Eq. (10):

$$ \mathit{Acc} = \frac{TP + TN}{TP + FP + TN + FN} \tag{10} $$

For imbalanced data, two kinds of metrics are used, for equal error costs and for unequal error costs respectively. The error rate (Er) is the most important tool for investigating performance with these metrics, and is calculated as in Eq. (11):

$$ \mathit{Er} = 1 - \mathit{Acc} \tag{11} $$

For imbalanced data with unequal error costs, the area under the ROC curve is the most suitable metric for tackling the imbalanced data problem.
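As a small worked example of the accuracy and error-rate formulas, the sketch below computes both from the four confusion-matrix counts, together with the true-positive and false-positive rates that serve as ROC coordinates. The function name and the counts are hypothetical and are not results from this paper.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, error rate and ROC coordinates from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total        # Eq. (10)
    return {
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,   # Eq. (11)
        "tpr": tp / (tp + fn),          # ROC Y-axis: TP / (TP + FN)
        "fpr": fp / (fp + tn),          # ROC X-axis: FP / (TN + FP)
    }


# Hypothetical counts for a 100-sample test set.
m = confusion_metrics(tp=40, tn=50, fp=5, fn=5)
print(m)
```

With these counts the accuracy is 90/100 = 0.9 and the error rate is 0.1, while the true-positive rate is 40/45 and the false-positive rate is 5/55. A classifier that labels everything as the majority class can still score high accuracy on skewed data, which is why the ROC coordinates are reported alongside it.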
Similar techniques are presented by (Ling & Li, 1998; Drummond & Holte, 2000; Provost & Fawcett, 2001; Bradley, 1997; Turney, 1996). Many works use ROC analysis, which supports the study of decision boundaries or of the relative costs of TP and FP. The ROC curve is plotted on two axes: the X-axis shows %FP = FP/(TN+FP) and the Y-axis shows %TP = TP/(TP+FN). The ideal point on the ROC curve is (0, 100), where all positive instances are classified correctly and no negative instance is misclassified as positive.

TABLE II. CONFUSION MATRIX

| | Predicted Negative | Predicted Positive |
|---|---|---|
| Actual Negative | TN | FP |
| Actual Positive | FN | TP |

C. AUC Evaluation Measures

The ROC curve can easily be shifted by manipulating the balance of training instances for each class in the training set. The area under the ROC curve (AUC) is a helpful measure of classifier performance because it is independent of the decision criterion selected and of prior probabilities. The AUC comparison can establish a strong relationship between classifiers. If the ROC curves overlap, the total AUC is an average comparison among the models (Lee, 2000). However, for certain cost and class distributions, the classifier with the highest AUC may in reality be sub-optimal. Thus we also calculate the ROC convex hulls, since the points lying on the ROC convex hull are potentially optimal (Provost, Fawcett, & Kohavi, 1998; Provost & Fawcett, 2001).

The class distributions obtained with the FD_SMOTE technique at different percentages can be observed in Tables III, IV and V. The tables show how the classes are represented in each data set: the SMOTE technique analyzes the percentage (%) of the majority and minority class for all three data sets. The majority class represents the patients who are not affected by a disease, whose features need to be modeled. To balance the classes, the minority sample is increased by setting the SMOTE percentage in multiples of 100, as follows:

TABLE III. SMOTE (%) OF PIMA DIABETIC

| SMOTE (%) | Majority Class | Minority Class | Total |
|---|---|---|---|
| 0 | 500 (66%) | 268 (34%) | 768 |
| 100 | 500 (48%) | 536 (52%) | 1036 |
| 200 | 500 (38%) | 723 (62%) | 1305 |

TABLE IV. SMOTE (%) OF BREAST CANCER

| SMOTE (%) | Majority Class | Minority Class | Total |
|---|---|---|---|
| 0 | 458 (65%) | 241 (35%) | 699 |
| 100 | 458 (49%) | 482 (51%) | 940 |
| 200 | 458 (39%) | 723 (61%) | 1181 |

TABLE V. SMOTE (%) OF THYROID DISEASE

| SMOTE (%) | Majority Class | Minority Class | Total |
|---|---|---|---|
| 0 | 2559 (81%) | 604 (19%) | 3163 |
| 100 | 2559 (68%) | 1204 (32%) | 3767 |
| 200 | 2559 (58%) | 1812 (42%) | 4371 |
| 300 | 2559 (58%) | 2416 (49%) | 4975 |

The performance of Pima diabetes data classification using the FD_SMOTE technique can be observed in Tables VI and VII. They show that the ACC and AUC metrics obtained with PCA and the FD_SMOTE technique are better than those obtained with the correlation based feature (CFs) and information gain (InfoGs) techniques for all classifiers. The AUC metric is higher than the other metrics on all biomedical data.

(IJACSA) International Journal of Advanced Computer Science and Applications,

TABLE VI. ACCURACY RESULT OF PIMA DIABETIC

| Classifier | PCA + FD_SMOTE | CFs | InfoGs |
|---|---|---|---|
| MultiPerceptron | 88.1771 | 76.4323 | 76.7375 |
| SVM | 91.0156 | 71.0425 | 75.3906 |
| N Neighbor | 92.9863 | 76.0618 | 73.9583 |
| Bagging | 90.6094 | 74.0885 | 75.6510 |
| Random Forest | 91.8698 | 74.8698 | 72.7865 |
| Naïve Bayes | 89.6094 | 76.3672 | 74.8698 |

TABLE VII. AUC RESULT OF PIMA DIABETIC

| Classifier | PCA + FD_SMOTE | CFs | InfoGs |
|---|---|---|---|
| MultiPerceptron | 0.998 | 0.723 | 0.815 |
| SVM | 0.971 | 0.719 | 0.827 |
| N Neighbor | 0.963 | 0.741 | 0.804 |
| Bagging | 0.989 | 0.805 | 0.820 |
| Random Forest | 0.997 | 0.812 | 0.800 |
| Naïve Bayes | 0.984 | 0.823 | 0.813 |

Figs. 2 and 3 illustrate the ACC and AUC of all classifier algorithms for Pima diabetes classification. The ACC and AUC metrics of PCA combined with the FD_SMOTE technique are better than those of the correlation based feature (CFs) and information gain (InfoGs) techniques.

Fig. 2. ACC result of FD_SMOTE, CFs and InfoGs

Fig. 3. AUC result of FD_SMOTE, CFs and InfoGs

The performance of breast cancer data classification using the FD_SMOTE technique can be observed in Tables VIII and IX. They show that the ACC and AUC metrics obtained with the SMOTE technique are better than those obtained with the correlation based feature (CFs) and information gain (InfoGs) techniques for all classifiers. The AUC metric is higher than the other metrics on all biomedical data.

TABLE VIII. ACC RESULT OF BREAST CANCER

| Classifier | PCA + FD_SMOTE | CFs | InfoGs |
|---|---|---|---|
| MultiPerceptron | 93.8072 | 81.4235 | 74.4206 |
| SVM | 96.6809 | 82.9957 | 86.4235 |
| N Neighbor | 95.6809 | 80.1373 | 85.9943 |
| Bagging | 94.7340 | 86.2804 | 75.9943 |
| Random Forest | 89.8404 | 79.7082 | 75.4220 |
| Naïve Bayes | 92.1184 | 82.1373 | 90.7082 |

TABLE IX. AUC RESULT OF BREAST CANCER

| Classifier | PCA + FD_SMOTE | CFs | InfoGs |
|---|---|---|---|
| MultiPerceptron | 0.847 | 0.555 | 0.555 |
| SVM | 0.795 | 0.577 | 0.551 |
| N Neighbor | 0.759 | 0.581 | 0.535 |
| Bagging | 0.893 | 0.561 | 0.563 |
| Random Forest | 0.881 | 0.595 | 0.566 |
| Naïve Bayes | 0.894 | 0.586 | 0.571 |

Figs. 4 and 5 illustrate the ACC and AUC of all classifier algorithms for breast cancer classification. The ACC and AUC metrics of PCA combined with the FD_SMOTE technique are better than those of the correlation based feature (CFs) and information gain (InfoGs) techniques.

Fig. 4. AUC result of FD_SMOTE, CFs and InfoGs

Fig. 5. AUC result of FD_SMOTE, CFs and InfoGs

The performance of thyroid disease data classification using the FD_SMOTE technique can be observed in Tables X and XI. They show that the ACC and AUC metrics obtained with the SMOTE technique are better than those obtained with the correlation based feature (CFs) and information gain (InfoGs) techniques for all classifiers. The AUC metric is higher than the other metrics on all medical data.

TABLE X. ACC RESULT OF THYROID DISEASE

| Classifier | PCA + FD_SMOTE | CFs | InfoGs |
|---|---|---|---|
| MultiPerceptron | 82.7228 | 56.2500 | 56.2500 |
| SVM | 84.1291 | 62.7315 | 65.2800 |
| N Neighbor | 77.1267 | 62.2685 | 58.7963 |
| Bagging | 84.1146 | 61.3426 | 64.3519 |
| Random Forest | 83.2176 | 66.2037 | 63.4259 |
| Naïve Bayes | 84.1291 | 59.9537 | 65.2778 |

TABLE XI. AUC RESULT OF THYROID DISEASE

| Classifier | PCA + FD_SMOTE | CFs | InfoGs |
|---|---|---|---|
| MultiPerceptron | 0.925 | 0.812 | 0.772 |
| SVM | 0.879 | 0.798 | 0.729 |
| N Neighbor | 0.904 | 0.785 | 0.766 |
| Bagging | 0.935 | 0.853 | 0.733 |
| Random Forest | 0.919 | 0.867 | 0.804 |
| Naïve Bayes | 0.946 | 0.844 | 0.817 |

Figs. 6 and 7 illustrate the ACC and AUC of all classifier algorithms for thyroid disease classification. The ACC and AUC metrics of PCA combined with the FD_SMOTE technique are better than those of the correlation based feature (CFs) and information gain (InfoGs) techniques.

Fig. 6. AUC result of PCA and FD_SMOTE

Fig. 7. AUC result of PCA and FD_SMOTE

V. CONCLUSIONS

In this paper a new algorithm has been proposed for accurate classification of biomedical data. It aims to tackle the skewed data distribution and the high dimensionality problem. The approach has been constructed by combining PCA with FD_SMOTE, which is based on the farther samples.
In the qualitative and quantitative analysis, different classifiers based on PCA and FD_SMOTE have been used, and the results reveal that the new approach increases the AUC and ACC metrics on a variety of data from the biomedical field. The present analysis shows that the combined technique is more effective than other existing approaches such as correlation based feature (CFs) and information gain (InfoGs). The future plan is to investigate the present problem with rough set theory, including the imbalanced data.

ACKNOWLEDGMENTS

This paper was supported by the National Natural Science Foundation of China (Grant No: 61370073) and the National High Technology Research and Development Program of China (Grant No: 2007AA01z423).

REFERENCES

[1] Shuo Wang, Member, and Xin Yao, Multiclass Imbalance Problems: Analysis and Potential Solutions, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Vol. 42, No. 4, August 2012.
[2] Chris Seiffert, Taghi M. Khoshgoftaar, Jason Van Hulse, and Amri Napolitano, RUSBoost: A Hybrid Approach to Alleviating Class Imbalance, IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, Vol. 40, No. 1, January 2010.
[3] Björn Waske, Taghi M. Khoshgoftaar, Jason Van Hulse, and Amri Napolitano, RUSBoost: A Hybrid Approach to Alleviating Class Imbalance, IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, Vol. 40, No. 1, January 2010.
[4] Xinjian Guo, Yilong Yin, Cailing Dong, Gongping Yang, Guangtong Zhou, On the Class Imbalance Problem, Fourth International Conference on Natural Computation, 2008.
[5] Mike Wasikowski, Member, and Xue-wen Chen, Combating the Small Sample Class Imbalance Problem Using Feature Selection, IEEE Transactions on Knowledge and Data Engineering, Vol. 22, No. 10, October 2010.
[6] Rukshan Batuwita and Vasile Palade, Fuzzy Support Vector Machines for Class Imbalance Learning, IEEE Transactions on Fuzzy Systems, Vol. 18, No. 3, June 2010.
[7] Lei Zhu, Shaoning Pang, Gang Chen, and Abdolhossein Sarrafzadeh, Class Imbalance Robust Incremental LPSVM for Data Streams Learning, WCCI 2012 IEEE World Congress on Computational Intelligence, June 10-15, 2012, Australia.
[8] David P. Williams, Member, Vincent Myers, and Miranda Schatten Silvious, Mine Classification With Imbalanced Data, IEEE Geoscience and Remote Sensing Letters, Vol. 6, No. 3, July 2009.
[9] Mikel Galar, Francisco, A Review on Ensembles for the Class Imbalance Problem: Bagging, Boosting and Hybrid-Based Approaches, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, Vol. 42, No. 4, July 2012.
[10] Yuchun Tang, Yan-Qing Zhang, Nitesh V. Chawla, and Sven Krasser, Correspondence: SVMs Modeling for Highly Imbalanced Classification, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Vol. 39, No. 1, February 2009.
[11] Qun Song, Jun Zhang, Qian Chi, Assistant Detection of Skewed Data Streams Classification in Cloud Security, IEEE Transaction, 2010.
[12] Nitesh V. Chawla, Nathalie Japkowicz, Aleksander Kołcz, Special Issue on Learning from Imbalanced Data Sets, Volume 6, Issue 1, pp. 1-6.
[13] Şeyda Ertekin, Jian Huang, Léon Bottou, C. Lee Giles, Active Learning in Imbalanced Data Classification.
[14] Saumil Hukerikar, Ashwin Tumma, Akshay Nikam, Vahida Attar, SkewBoost: An Algorithm for Classifying Imbalanced Datasets, International Conference on Computer Communication Technology (ICCCT), 2011.
[15] Chris Seiffert, Taghi M. Khoshgoftaar, Jason Van Hulse, Improving Learner Performance with Data Sampling and Boosting, 2008 20th IEEE International Conference on Tools with Artificial Intelligence.
[16] Benjamin X. Wang and Nathalie Japkowicz, Boosting Support Vector Machines for Imbalanced Data Sets, Proceedings of the 20th International Conference on Machine Learning, 2009.
[17] http://www.ejpau.media.pl/volume17/issue3/art-03.html (Accessed on Jan 13, 2017).
[18] http://blog.sqrrl.com/an-introduction-to-machine-learning-for-cybersecurity-and-threat-hunting (Accessed on Jan 13, 2017).
[19] Beckmann, M., Ebecken, N.F.F. and de Lima, B.S.L.P. (2015) A KNN Undersampling Approach for Data Balancing. Journal of Intelligent Learning Systems and Applications, 7, 104-116.
[20] Hu, Y., Guo, D.F., Fan, Z.W., Dong, C., Huang, Q.H., Xie, S.K., Lu, G.F., Tan, J., Li, B.P. and Xie, Q.W. (2015) An Improved Algorithm for Imbalanced Data and Small Sample Size Classification. Journal of Data Analysis and Information Processing, 3, 27-33. http://dx.doi.org/10.4236/jdaip.2015.33004
[21] Beckmann, M., Ebecken, N.F.F. and de Lima, B.S.L.P. (2015) A KNN Undersampling Approach for Data Balancing. Journal of Intelligent Learning Systems and Applications, 7, 104-116.
[22] http://www.ejpau.media.pl/volume17/issue3/art-03.html (Accessed on Jan 13, 2017).
[23] http://blog.sqrrl.com/an-introduction-to-machine-learning-for-cybersecurity-and-threat-hunting (Accessed on Jan 13, 2017).