An Improved Algorithm To Predict Recurrence Of Breast Cancer

Umang Agrawal 1, Asst. Prof. Ishan K Rajani 2
1 M.E. Computer Engineer, Silver Oak College of Engineering & Technology, Gujarat, India.
2 Assistant Professor, Computer/IT Department, Silver Oak College of Engineering & Technology, Gujarat, India.

ABSTRACT

In today's fast-growing world, people are becoming more and more prone to disease, whether they live in developed or third-world countries. Advances in technology have made the medical information systems of hospitals and medical institutions ever larger, and extracting useful information from them has in turn become more difficult and time-consuming. Breast cancer is one of the most common cancers in women, so early-stage detection can provide a potential advantage in treating the disease. Early treatment not only helps to cure the cancer but also helps to prevent its recurrence. Data mining algorithms can provide great assistance in predicting early-stage breast cancer, which has always been a challenging research problem. The main objective of this research is to find how precisely these data mining algorithms can predict the probability of recurrence of the disease among patients on the basis of the important stated parameters. An approach to disease prediction is proposed that combines the Support Vector Machine (SVM) with bagging, an ensemble learning method. The research highlights the performance of the SVM when it is used with bagging. Experiments show that the SVM combined with bagging is the best predictor among the classification algorithms considered. The accuracy of the SVM is 81% without ensemble learning; combined with bagging, it reaches 84.61%, the highest accuracy obtained.
Keywords: Data mining, SVM, Bagging, Breast cancer, Classification, Ensemble and machine learning.

1. INTRODUCTION

Breast cancer is the most common cancer among women worldwide, according to the World Health Organization's GLOBOCAN 2012 report [1]. As per the report, Indian women are the most affected by this disease, which is therefore also the most common cause of death among them. Early detection of this cancer increases the chances of survival of patients suffering from it. Many biological techniques can be used for early detection of breast cancer so that preventive measures can be taken. In this paper, we use different data mining algorithms to predict recurrent cases of breast cancer using the Wisconsin Prognostic Breast Cancer (WPBC) dataset from the UCI Machine Learning Repository [2].

Data mining and knowledge discovery from data (KDD) is the process of extracting knowledge from large amounts of data. It has been successfully applied to different classification tasks including, but not limited to, decision making, fault detection, pattern recognition, weather forecasting and image processing. Extracting knowledge from data aims at building a model from the data to predict future behavior.

8279 www.ijariie.com 4188

Fig -1: Most Common Cancer Types Prevalent in Humans

2. BACKGROUND

2.1 Overview of related work

Past and current research reports on medical data using data mining techniques have been studied, and they form the basis for this paper. Ojha and Goel [3] compared various classification and clustering algorithms on the Wisconsin Breast Cancer Prognosis dataset. Their results demonstrate that the Support Vector Machine and C5.0 produce 81% accuracy. Jacob et al. [4] compared various classifier algorithms on the Wisconsin Breast Cancer diagnosis dataset. Their results demonstrate that the Random Tree and C4.5 classification algorithms produce 100% accuracy. However, they used the Time attribute (time to recur / disease-free survival) along with other parameters to predict recurrence or non-recurrence of breast cancer among patients. In this paper, the Time attribute is not relied upon for predicting recurrence of the disease.

Delen et al. [5] used the SEER data (period 1973-2000, 202,932 records) on breast cancer to predict the survivability of a patient using 10-fold cross-validation. The results indicated that the decision tree (C5) is the best predictor with 93.6% accuracy on the dataset; the artificial neural network (ANN) also performed well with 91.2% accuracy, while the logistic regression model was less successful, with 89.2% accuracy. Chih-Lin Chi et al. [6] used an ANN model for breast cancer prognosis on two datasets. They predicted the recurrence probability of breast cancer and grouped patients into good (> 5 years) and bad (< 5 years) prognoses. Falk et al. [7] explored Gaussian Mixture Regression (GMR) on the WPBC dataset and concluded that GMR performs better than Classification and Regression Trees (CART) in predicting breast cancer recurrence in patients.
Pendharkar et al. [8] used several data mining algorithms for discovering patterns in breast cancer. They showed that data mining could be used to discover similar patterns in breast cancer cases, which could be a great help in early detection and prevention of this disease.

2.2 Support Vector Machine (SVM)

The Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression, although it is mostly used for classification. In this algorithm, each data item is plotted as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Classification is then performed by finding the hyperplane that best separates the two classes (see Fig. 2).

Fig -2: Support Vector Machine

SVM works well when there is a clear margin of separation, and it is effective in high-dimensional spaces. Because it uses only a subset of the training points (called support vectors) in the decision function, it is also memory efficient.

2.3 Bagging

Bootstrap aggregating, also known as bagging, is one of the most widely used machine learning techniques for improving the accuracy and stability of algorithms in statistical classification and regression. It works by reducing variance, which helps to avoid overfitting. It is mostly deployed with decision tree methods but can be applied to any machine learning technique. It is a special case of the model averaging approach.

Suppose there is a training dataset D of size n. Bagging creates m new training datasets D_i, each of size n', by sampling uniformly from D with replacement. Because the sampling is with replacement, some observations may be repeated in D_i. When n' = n, for large n each D_i is expected to contain the fraction (1 - 1/e) (about 63.2%) of the unique examples of D, the rest being duplicates.
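The (1 - 1/e) fraction can be checked numerically. Each example is missed by a single draw with probability (1 - 1/n), so it is absent from a whole bootstrap sample of size n with probability (1 - 1/n)^n, which tends to 1/e for large n. The following is a minimal Python sketch of that check; the function names and the fixed seed are illustrative choices, not part of the paper.

```python
import math
import random


def bootstrap_sample(dataset, rng):
    """Draw len(dataset) items uniformly from dataset, with replacement."""
    n = len(dataset)
    return [dataset[rng.randrange(n)] for _ in range(n)]


def unique_fraction(n, rng):
    """Fraction of the n original examples present in one bootstrap sample."""
    sample = bootstrap_sample(list(range(n)), rng)
    return len(set(sample)) / n


def expected_unique_fraction(n):
    """Exact expectation 1 - (1 - 1/n)^n; tends to 1 - 1/e as n grows."""
    return 1.0 - (1.0 - 1.0 / n) ** n


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary fixed seed, for repeatability only
    print("simulated:", unique_fraction(10_000, rng))
    print("expected :", expected_unique_fraction(10_000))
    print("limit    :", 1.0 - 1.0 / math.e)
```

For a dataset of a few hundred records the exact expectation is already very close to the limiting value of about 0.632.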
This type of sampling is known as bootstrap sampling. The m bootstrap samples are used to fit m models, whose outputs are combined by averaging (for regression) or voting (for classification) [9].

3. EXPERIMENTS

3.1 Data Source

To find the best model for predicting recurrent cases of breast cancer, the authentic WPBC dataset has been used. Of its 35 attributes, Outcome is the target attribute (class label), and the other 32 attributes (excluding ID and, as noted in Section 2.1, Time) are the decisive attributes whose values help in predicting recurrence of the disease. The dataset consists of 198 patient records, in 4 of which the value of the attribute Lymph node status is missing. Since lymph node status is an important factor in determining breast cancer status, these 4 records were removed rather than the attribute itself. The final dataset thus contains 194 records, of which 148 are non-recurrent and 46 are recurrent cases.
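The record filtering described in Section 3.1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the column layout (ID, Outcome coded as R or N, Time, then 32 numeric attributes ending with lymph node status) and the use of "?" for missing values are assumptions about the data file, and load_wpbc is a hypothetical helper name, not code from the paper.

```python
def load_wpbc(lines):
    """Parse WPBC-style comma-separated lines into (features, labels).

    Assumed layout (35 fields): ID, Outcome (R or N), Time, then 32
    numeric attributes, the last being lymph node status. Missing
    values are assumed to be written as "?".
    """
    features, labels = [], []
    for line in lines:
        fields = [f.strip() for f in line.strip().split(",")]
        if len(fields) != 35:
            continue  # skip blank or malformed lines
        if "?" in fields:
            continue  # drop the whole record, keep the attribute
        labels.append(1 if fields[1] == "R" else 0)  # 1 = recurrent
        # Skip ID, Outcome and Time; keep the 32 decisive attributes.
        features.append([float(v) for v in fields[3:]])
    return features, labels
```

Passing an open file object (or any iterable of text lines) to load_wpbc returns one 32-value feature row and one 0/1 label per retained record.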

3.2 Evaluation Methodology

In classifying an unknown case, depending on the class predicted by the classifier and the true class of the patient (recurrent or non-recurrent), four types of result are possible:

True positive: the patient is predicted as positive (recurrent) and does have the disease.
False positive: the patient is predicted as positive (recurrent) but does not have the disease.
True negative: the patient is predicted as negative (non-recurrent) and indeed does not have the disease.
False negative: the patient is predicted as negative (non-recurrent) but has the disease.

Let TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively. For each learning and evaluation experiment, accuracy, defined below, is used as the performance indicator of the classification [10], [11].

Accuracy = (TP + TN) / (TP + TN + FP + FN)

In this work, SVM and bagging were applied to the dataset with an 80:20 split. The study found that the proposed ensemble approach gives better accuracy. It creates the members of its ensemble by training each SVM classifier on a random subset of the training set: for a given dataset, k random bootstrap samples are drawn with replacement, an SVM classifier is trained independently on each sample, and the classifiers are combined via an aggregation technique. Each SVM classifier predicts the test set, and the final class labels are determined by aggregating these predictions. Normalized and transformed datasets are used as input to this SVM-bagged classifier.

Fig -3: Proposed system

3.3 Experimental Result

Fig -4: Result of Proposed System
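The bagging procedure and accuracy measure of Section 3.2 can be sketched in Python as below. This is a sketch under assumptions, not the authors' implementation: all function names are hypothetical, and the base learner shown is a simple nearest-centroid classifier used as a stand-in so the example stays self-contained. In the proposed system an SVM trainer would take the place of fit_nearest_centroid.

```python
import random
from collections import Counter


def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN), treating `positive` as the recurrent class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def fit_nearest_centroid(X, y):
    """Stand-in base learner (NOT an SVM): nearest class centroid."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    centroids = {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }

    def predict(x):
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )

    return predict


def bagging_fit(X, y, fit_base, k, rng):
    """Train k base models, each on a bootstrap sample of (X, y)."""
    n = len(X)
    models = []
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_base([X[i] for i in idx], [y[i] for i in idx]))
    return models


def bagging_predict(models, X):
    """Aggregate the base models' predictions by majority vote."""
    votes = [[m(x) for m in models] for x in X]
    return [Counter(v).most_common(1)[0][0] for v in votes]
```

An odd k avoids tied votes in the two-class case. As an arithmetic aside, 33 correct predictions out of 39 test records would give 33/39, or about 0.846, which is in line with the reported 84.61%; the paper does not state the confusion matrix, so this reading is only an assumption.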

4. CONCLUSIONS

Using a prediction model to classify recurrent or non-recurrent cases of breast cancer is research that is statistical in nature. Still, this work can be linked to biomedical evidence. In this paper, the WPBC dataset is used to find an efficient predictor algorithm for the recurring or non-recurring nature of the disease. This might help oncologists to differentiate a good prognosis (non-recurrent) from a bad one (recurrent) and to treat patients more effectively. Hence, this work focuses on improving accuracy through ensemble learning, in which an SVM combined with bagging achieved 84.61% accuracy in classifying recurrence of the disease.

5. REFERENCES

[1] J. Ferlay, Globocan 2012 v1.0, Cancer Incidence and Mortality Worldwide: IARC CancerBase No. 11, 2014. [Online]. Available: http://globocan.iarc.fr.
[2] A. Frank and A. Asuncion, "UCI machine learning repository," 2010. [Online]. Available: http://archive.ics.uci.edu/ml.
[3] U. Ojha and S. Goel, "A study on prediction of breast cancer recurrence using data mining techniques," in Cloud Computing, Data Science & Engineering - Confluence, 2017 7th International Conference on, Jan. 12, 2017, pp. 527-530. IEEE.
[4] Shomona G. Jacob and R. Geetha Ramani, "Efficient Classifier for Classification of Prognosis Breast Cancer Data Through Data Mining Techniques," Proceedings of the World Congress on Engineering and Computer Science 2012, Vol. I, October 2012.
[5] D. Delen, G. Walker, A. Kadam, "Predicting breast cancer survivability: a comparison of three data mining methods," Artificial Intelligence in Medicine, vol. 34, no. 2, pp. 113-127, 2004.
[6] C. L. Chi, W. N. Street, W. H. Wolberg, "Application of artificial neural network-based survival analysis on two breast cancer datasets," American Medical Informatics Association Annual Symposium, pp. 130-134, Nov. 2007.
[7] T. H. Falk, H. Shatkay, and W.-Y. Chan, "Breast cancer prognosis via Gaussian mixture regression," in Canadian Conference on Electrical and Computer Engineering, CCECE'06, 2006.
[8] Breiman, "Bagging predictors," Machine Learning, Vol. 24, 1996, pp. 123-140.
[9] Efron B. and Tibshirani R., An Introduction to the Bootstrap, Chapman & Hall, 1993.
[10] Vapnik V., Statistical Learning Theory. New York: Wiley, 1998.
[11] Yugal K., Sahoo G., "Study of Parametric Performance Evaluation of Machine Learning and Statistical Classifiers," International Journal of Information Technology & Computer Science, Vol. 5, Issue 6, 2013, pp. 57-64.