Data Mining Techniques to Predict Survival of Metastatic Breast Cancer Patients


Abstract

Prognosis for stage IV (metastatic) breast cancer is difficult for clinicians to predict. This study examines the SEER data set from 1988-2003 and selects patients who were initially diagnosed with stage IV breast cancer and who died as a direct result of the cancer. After developing a SEER conversion utility, seer2arff, we create three predictive models that use a supervised, passive, offline technique to classify prognosis (survival time). The accuracies of the algorithms from the Weka toolkit were: Bayes Network, 64.2%; J4.8 Decision Tree, 63.5%; and Artificial Neural Network, 62.9%. The J4.8 Decision Tree selected attributes that confirm the rationale of ongoing clinical studies. This study is the first to apply machine learning techniques to this category of patients with the SEER data set.

Introduction

Breast cancer is one of the most common cancers today. Treatment options vary significantly, from surgery to chemotherapy, depending on many variables such as tumor location, tumor size, and patient characteristics. Once breast cancer is diagnosed, physicians attempt to stage, or classify, the patient's cancer. Stages range from Stage 0 to Stage IV, where Stage IV indicates that the cancer has metastasized beyond the breast and local lymph nodes. The staging classification system also helps to predict prognosis. When breast cancer is diagnosed in Stage IV, the five-year survivability is 16%, whereas if it is detected early, in Stage I, the figure is 97%. Through numerous clinical and research efforts, there are well-known classifiers to aid physicians in categorizing a patient's cancer into the appropriate stage. However, once in stage IV, the cancer is already advanced and the factors affecting prognosis are not as well known.
For example, 50% of stage IV patients survive for two years, but fewer than 20% survive for four years or more. In this paper we present a comparison of three machine learning methods for predicting survival time for stage IV breast cancer patients from the Surveillance, Epidemiology, and End Results (SEER) data set (SEER 2012). SEER is one of the most cited clinical cancer data sets; it is maintained by the National Cancer Institute and has data as far back as 1973. In the SEER data set, survival time is calculated using the date of diagnosis and one of the following: date of death, date last known to be alive, or follow-up cutoff date used for this [data]. This definition is too inclusive, and therefore records are filtered to select patients who were initially diagnosed with stage IV breast cancer and who died as a direct result of the cancer.

Copyright (c) 2013, by Joshua Datko. This work is made available under the terms of the Creative Commons Attribution 3.0 Unported License. Joshua Datko, Advanced Artificial Intelligence (CS610) Project Proposal, Drexel University, jbd65@drexel.edu

Background

The machine learning techniques used in this study are supervised, passive, offline algorithms that seek to predict a classification. Data mining is a subset of machine learning that attempts to gain previously unknown insight from data. In a supervised learning technique, the program is given a series of inputs, x_i, and outputs, y_i, and attempts to find a function f such that f(x_i) = y_i. A passive algorithm is one that does not interfere with the data. Offline indicates that the data has already been collected, as opposed to online, where the data is generated at the time of analysis. Prediction is the process of computing f(x_{i+1}) = y_{i+1}. Finally, classification is selecting an output y from a finite set of values. The three algorithms chosen for this study were a J4.8 Decision Tree, a Bayes Network, and an Artificial Neural Network (Multi-layer Perceptron).
The following sections provide a brief overview of the algorithms.

J4.8 Decision Tree

The J4.8 decision tree algorithm is Weka's implementation of Ross Quinlan's C4.5 decision tree algorithm. Specifically, it is the implementation of C4.5 revision 8 (Witten, Frank, and Hall 2011). It is a recursive algorithm that seeks to split the data to maximize information gain. Information gain quantifies the insight gained from a particular attribute. An attribute that splits the data into the most distinguishing groups has higher information gain than an attribute that results in a uniform distribution. The result of this algorithm is a decision tree that is used to classify future entries.

Bayes Network

A Bayes Network is a directed acyclic graph that encodes a conditional probability table for each node. Once constructed, a Bayes Network can answer questions such as: given the following attributes, what is the probability of being classified as X? To apply a Bayes Network, the network must first be constructed or, in the case of machine learning, learned. We use the K2 Bayesian network learner, which employs a hill-climbing algorithm restricted by an order on the variables.

Artificial Neural Network (Multi-layer Perceptron)

The multi-layer perceptron emulates the current understanding of how neurons in the brain function. In general, a neural network is a collection of nodes, each defined by an activation function whose threshold, when exceeded, causes the neuron to fire, enabling output on an edge leaving the neuron. In a supervised environment, given the input and the known output, the network adjusts the activation function of each neuron to correspond with the output.

Validation Techniques

To quantify how well the model fits the data, several validation mechanisms can be used. The one used in this study is k-fold cross-validation. In this technique, the data set is split into k subsets and k rounds of learning are performed. Each round uses k-1 of the subsets as the training set, from which the model is generated, and the remaining subset as test data. The end result is averaged over the k rounds.

Approach

Data Preparation

Preparing the data for analysis is a non-trivial and time-consuming process. The SEER data set facilitates analysis by independent tools since the data is in ASCII text files. The data set itself is publicly accessible, but only after registration and signing of the user agreement on the SEER website. Included in the distribution is a thorough data dictionary detailing each of the 134 attributes (columns). Out of the SEER data set, only the attributes in Table 1 were selected. Some attributes were not selected due to duplicate data. For example, in breast cancer, Tumor Marker 1 is the same as ER Status Recode, and therefore the more descriptively named column was selected. Some attributes were not selected due to mutual exclusion. Since AJCC Stage 3rd Edition was selected, only data from 1988-2003 were applicable. Some data were also not selected due to lack of statistical relevance. In our data set, less than 1% of the patients were men; therefore sex was not included in the machine learning analysis.

A query to filter the data set further was developed. This query, whose SEER values are shown in Table 2, is described in natural language as follows: select all records of patients who were initially diagnosed with stage IV breast cancer, who have died, and who died as a direct result of the cancer. Selecting based on vital status and cause of death is similar to the query performed in (Bellaachia and Guven 2006); however, we further narrowed the data to stage IV only. This reduced the total of 657,712 records to 8,726. Table 3 shows the breakdown of the query against the total data set.

SEER Attribute
Marital Status at DX
Age at DX
Year of DX
Grade
EOD-Tumor Size
EOD-Lymph Node Involv
Reason for no surgery
Race recode
ER Status Recode
PR Status Recode
AJCC Stage 3rd ed
SEER Cause-Specific Death Classification
Vital Status Recode
Survival time recode
ARFF type (only these column entries are legible): Numeric, Numeric, Numeric
Table 1: List of analyzed attributes

SEER Attribute                              Filter by
AJCC Stage 3rd ed                           40-49
SEER Cause-Specific Death Classification    1
Vital Status Recode                         4
Table 2: Query to filter the data set

Only 1.3% of the available data set is being analyzed, but most of the exclusion is a result of the restriction of the time frame. There are several staging codes used throughout the SEER data set and we chose to use only one (AJCC 3rd edition) for consistency. This restricted the data to 1988-2003. Once filtered, the columns in Table 2 were removed from the analysis since all the records contained the same value.
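The filter in Table 2 can be sketched in Python as follows. This is only an illustration and not the author's seer2arff code: the record layout (one dict per patient) and the field names `ajcc_stage_3rd`, `cause_specific_death`, and `vital_status` are hypothetical, and only the filter values (40-49, 1, and 4) come from Table 2.

```python
# Hypothetical field names; only the filter values come from Table 2.
def is_selected(record):
    """Return True for a stage IV patient who died as a direct result of the cancer."""
    return (
        40 <= record["ajcc_stage_3rd"] <= 49      # AJCC Stage 3rd ed: 40-49
        and record["cause_specific_death"] == 1   # SEER Cause-Specific Death Classification: 1
        and record["vital_status"] == 4           # Vital Status Recode: 4
    )


def filter_records(records):
    """Keep only the records that satisfy every Table 2 condition."""
    return [r for r in records if is_selected(r)]


# Made-up sample records, for illustration only.
sample = [
    {"ajcc_stage_3rd": 40, "cause_specific_death": 1, "vital_status": 4},
    {"ajcc_stage_3rd": 32, "cause_specific_death": 1, "vital_status": 4},
    {"ajcc_stage_3rd": 49, "cause_specific_death": 0, "vital_status": 4},
]
selected = filter_records(sample)
print(len(selected))
```

Because every selected record carries the same values in these three columns, they add no information for a classifier, which is why they are dropped after filtering.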
seer2arff

A conversion utility called seer2arff, written by the author in Python, was developed to transform the SEER data into ARFF for processing by the Weka toolkit. The Weka workbench, developed by the University of Waikato, is a suite of machine learning algorithms implemented in Java and a framework in which one can run multiple data mining experiments. The ARFF format, described in (Witten, Frank, and Hall 2011), requires tagging attributes by data type; the major types are numeric, nominal, and string. A nominal attribute is one with a discrete number of values. As shown in Table 1, most of the analyzed data is nominal. Survival Time Recode (STR) required conversion to a nominal value. The original data was in a format of YYMM and it was converted to a nominal attribute as shown in Table 4. One year was selected because this split produced a reasonably partitioned population.

Query                                     Count      Percent
Total SEER breast cancer patients         657,712    100%
Number of patients (1988-2003)            331,936    50.5%
Diagnosed with Stage IV (1988-2003)       12,296     1.9%
Stage IV and have died                    11,453     1.7%
Stage IV and have died from the cancer    8,726      1.3%
Table 3: Population selection

Survival Time Recode       Class    Percent of Population
Survival Time <= 1 year    1        44.6%
Survival Time > 1 year     2        55.4%
Table 4: Survival Time Recode in nominal form

Not all of the SEER data were complete. For example, ER status information was collected for only 59% of the selected patients. Typically, when data is missing from the SEER data set, it is encoded with 9 or 99. Those instances were replaced with a question mark character, which represents missing data in ARFF.

Machine learning experimentation

Three Weka data mining classification algorithms were chosen for analysis: J4.8 Decision Tree, Bayesian Network, and Multi-layer Perceptron. J4.8 is Weka's implementation of the C4.5 Decision Tree algorithm; it was run with the default parameters, except that the minimum number of objects for a leaf node was set to 100. The default setting is two, which creates a much more complex tree. Bayesian Network and Multi-layer Perceptron were both run with their default parameters. To maximize the use of the data set, cross-validation with 10 folds was chosen for each algorithm.

Evaluation

In this section, we show and analyze the results of the three machine learning algorithms. Table 5 summarizes the results. Accuracy refers to the percentage of correctly classified results across both classes (less than or equal to one year, and greater than one year). Precision is defined as the number of true positives for a given class divided by the sum of true and false positives. Recall is the number of true positives divided by the sum of true positives and false negatives. Finally, F-Measure is a combination of precision (P) and recall (R) given by the equation F = 2PR / (P + R). Overall, the accuracies of the algorithms are within 2% of each other. However, the accuracy is much lower than the 86.7% reported in (Bellaachia and Guven 2006) and other research.
There are two main factors affecting this discrepancy. The first is that we are only analyzing stage IV patients. In (Bellaachia and Guven 2006), the highest ranked attributes from their J4.8 decision tree were extension of tumor, stage of cancer, and lymph node involvement, all three of which are directly correlated to cancer stage. In fact, five of their top ranked attributes are significant factors in determining cancer stage (which is their sixth attribute). Considering these attributes, one should expect greater accuracy. Second, their query resulted in 151,886 records, whereas ours resulted in only 8,726, 94% fewer. Fortunately (for the patients), a significantly smaller number of patients are initially diagnosed in stage IV. While more data is not guaranteed to raise accuracy, our models may be improved with more records.

Algorithm    Acc.     Class    P       R       F-Measure
BayesNet     64.2%    1        .622    .505    .558
                      2        .654    .752    .700
NeuralNet    62.9%    1        .592    .538    .564
                      2        .654    .702    .677
J4.8 Tree    63.5%    1        .618    .476    .538
                      2        .644    .764    .699
Table 5: Combined Results. (Acc. = Accuracy, P = Precision, R = Recall)

Bayes Network

Overall, the Bayes Network was the most accurate algorithm of the three. The learned network made each attribute a child node of survival time. Each node has a probability distribution that can be used to solve for the conditional probability of that attribute. For example, the probability of the patient surviving for greater than one year, given that she had surgery, is 58.6%, but the probability of the patient surviving for less than one year, given that she had surgery, was only 36.7%. This attribute was selected as a distinguishing characteristic in the J4.8 decision tree algorithm, discussed in the following section. The Bayes Network also had the highest F-Measure for classifying patients who survived for greater than a year. However, the true distribution of patients is slightly weighted toward this category, as Table 4 shows.
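As a quick consistency check on Table 5, the F-Measure column can be recomputed from the precision and recall columns using 2PR / (P + R). The sketch below is illustrative only; the function name `f_measure` and the dictionary layout are ours. Because the table reports P and R rounded to three decimals, the recomputed values can differ from the published ones in the last digit.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (algorithm, class) -> (precision, recall, published F-Measure), copied from Table 5.
table5 = {
    ("BayesNet", 1): (0.622, 0.505, 0.558),
    ("BayesNet", 2): (0.654, 0.752, 0.700),
    ("NeuralNet", 1): (0.592, 0.538, 0.564),
    ("NeuralNet", 2): (0.654, 0.702, 0.677),
    ("J4.8 Tree", 1): (0.618, 0.476, 0.538),
    ("J4.8 Tree", 2): (0.644, 0.764, 0.699),
}

for (algorithm, cls), (p, r, published) in table5.items():
    recomputed = f_measure(p, r)
    print(f"{algorithm} class {cls}: recomputed {recomputed:.3f}, published {published:.3f}")
```

All six recomputed values agree with the published column to within rounding, which also confirms the column boundaries of the table.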
J4.8 Decision Tree

The J4.8 decision tree under-performed the Bayes Network in every area except recall for patients in the greater-than-one-year category, in which it fared marginally better. The resulting decision tree provides a human-readable model of the classifier. The full tree is shown in Appendix A. The decision tree algorithm uses a recursive approach, selecting the most discriminating attribute (highest information gain) at each level.

Multi-layer Perceptron (Neural Network)

While the Multi-layer Perceptron had the lowest accuracy, it had the highest recall for class 1 patients. Unfortunately, as is generally the case with neural networks, it is difficult to glean any additional understanding from the model.

Clinical Analysis

The decision tree yielded some interesting observations about the data set. The highest ranked attribute was age at diagnosis, with a cutoff of 76 years. The mean age at diagnosis is 63 years, with a standard deviation of 14.4 years, so the 76-year cutoff lies roughly one standard deviation above the mean. This may be intuitive, as older patients are more likely to have other medical complications and to experience a greater number of side effects from treatment. Age is a complex factor in this study. While studies have shown that age alone is not a dominant factor in prognosis, it is believed that older women are under-treated and are not prescribed chemotherapy as aggressive as that given to younger women. The J4.8 decision tree predicted that a woman over the age of 76 would live longer than one year only if she had surgery and was ER positive.

The next highest ranked attribute, reason for no surgery, is a controversial choice. Retrospective data seems to indicate that surgery is beneficial; however, this may be confounded by other factors that come into play when the decision to perform surgery is made, such as the patient's performance status and the extent of her metastatic disease. To study this prospectively, there is an ongoing Eastern Cooperative Oncology Group study (NCT01242800) that randomizes patients to surgery or no surgery to determine whether breast cancer surgery that removes the primary tumor increases the survival of stage IV patients. The results of the J4.8 decision tree, which indicate that surgery is an important factor in stage IV prognosis, support the rationale of this study.

ER status was determined to be the third most important characteristic. ER status represents whether the estrogen receptor is present in the cancer cell, as determined by an immunohistochemistry test. When the ER result is positive, different treatment options are available to the patient, specifically hormonal treatments, which have fewer side effects than traditional chemotherapy. In the decision tree, in general, when ER was positive the patient was predicted to survive longer than one year.

Marital status at diagnosis was not ranked very high, but interestingly it was used as a final tie breaker in one case. On this branch, patients who were married survived longer than one year by almost 2:1. In all other marital categories (widowed, divorced, not married, or separated), there were 2:1 odds of dying within one year. This supports the findings in (Osborne et al. 2005), which show that older unmarried women are at an increased risk of death from breast cancer.
Oncologist's Predictions

A breast cancer oncologist was consulted and asked what factors she would consider most important in developing a patient's prognosis for a stage IV diagnosis. She listed (in no particular order): ER status, tumor grade, the extent of the spread of metastases, and the patient's performance status. Performance status refers to the Karnofsky Performance Status Scale, which quantifies the patient's well-being. Patients with a rating of 60% or below require some degree of assistance for care and show more symptoms. This data is not available in the SEER data set. The spread of metastasis is available (CS Mets at DX), but only on records from 2004 onward. For example, if the breast cancer spreads to the liver or brain, the clinician predicts survival to be much shorter than if the breast cancer has spread only to the bone or the lymph nodes. Both tumor grade and ER status were selected by the decision tree; however, they were not considered as distinguishing as age and reason for no surgery.

Related Work and Novelty

Several researchers have used the SEER database and Weka to predict survival of breast cancer patients. In (Delen, Walker, and Kadam 2005), three data mining algorithms were tested against the 1973-2000 SEER data set, and a decision tree approach predicted survivability, defined by the authors as surviving longer than sixty months, with 93.6% accuracy. (Bellaachia and Guven 2006) performed data mining on the 1973-2000 SEER database with the Weka toolkit and showed that the extension of the tumor was the most contributing factor to survivability, followed closely by the stage of the cancer. (Endo, Shibata, and Tanaka 2007) used SEER data from 1992-1997 and selected ten independent variables to predict survivability, defined as surviving longer than sixty months. However, there has not been a study investigating only metastatic breast cancer patients. Furthermore, each study above included the stage of breast cancer in its analysis.
Staging is information derived from other empirical patient data and is therefore not an independent factor. Also, the AJCC staging system is designed as an indicator of prognosis and as an aid for clinicians in developing a treatment plan; therefore staging should not be included in machine learning techniques to predict survival. By definition, an analysis restricted to stage IV patients is stageless, since stage IV is the last stage; it is also the most volatile in terms of survivability.

Conclusion

While the accuracy of the model was much lower than the 80% anticipated in the proposal, we consider this initial research encouraging. Three independent algorithms produced a consistent accuracy estimate, and the decision tree confirms important factors in clinical research. In fact, age and reason for no surgery are two factors currently under study. To our knowledge, this analysis is the first to focus only on stage IV breast cancer patients. Other data mining research has shown that machine learning techniques can confirm, with high accuracy, the prognosis assigned by the staging system. Our research, however, used machine learning to gain insight into a question not previously asked.

Future Work

A similar study can be extended to include patients in the year group 2004 and later. These records would include the extent of the metastases, which may be an important prognosis indicator. The 2004 data also includes the HER2 indicator, which oncologists use in addition to ER to guide treatment options.

References

Bellaachia, A., and Guven, E. 2006. Predicting Breast Cancer Survivability Using Data Mining Techniques.

Delen, D.; Walker, G.; and Kadam, A. 2005. Predicting breast cancer survivability: a comparison of three data mining methods. Artificial Intelligence in Medicine 34(2):113-127.

Endo, A.; Shibata, T.; and Tanaka, H. 2007. Comparison of seven algorithms to predict breast cancer survival. International Journal of Biomedical Soft Computing and Human Sciences 13.

Osborne, C.; Ostir, G. V.; Du, X.; Peek, M. K.; and Goodwin, J. S. 2005. The influence of marital status on the stage at diagnosis, treatment, and survival of older women with breast cancer. Breast Cancer Research and Treatment 93(1):41-47.

SEER. 2012. Surveillance, Epidemiology, and End Results (SEER) Program (www.seer.cancer.gov) Research Data (1973-2009). Released April 2012, based on the November 2011 submission.

Witten, I.; Frank, E.; and Hall, M. A. 2011. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.

Appendix A: Complete J4.8 Decision Tree

age-at-dx <= 76
    reason-for-no-surgery = 0: 2 (3596.58/1086.55)
    reason-for-no-surgery = 1
        er-status-recode-breast-cancer = 1
            age-at-dx <= 58: 2 (683.28/260.55)
            age-at-dx > 58
                race-recode = 1
                    grade = 1: 1 (36.48/15.69)
                    grade = 2: 2 (251.76/109.72)
                    grade = 3
                        marital-status-at-dx = 1: 1 (37.86/14.13)
                        marital-status-at-dx = 2: 2 (218.57/96.5)
                        marital-status-at-dx = 3: 1 (5.57/2.66)
                        marital-status-at-dx = 4: 1 (54.46/23.52)
                        marital-status-at-dx = 5: 1 (111.76/50.33)
                    grade = 4: 1 (37.22/15.54)
                race-recode = 2: 1 (106.96/41.69)
                race-recode = 3: 1 (0.68/0.0)
                race-recode = 4: 1 (50.33/20.48)
                race-recode = 7: 1 (0.0)
        er-status-recode-breast-cancer = 2: 1 (764.41/304.12)
        er-status-recode-breast-cancer = 3: 1 (9.01/2.68)
    reason-for-no-surgery = 2: 1 (132.13/52.04)
    reason-for-no-surgery = 6
        er-status-recode-breast-cancer = 1: 2 (518.41/229.21)
        er-status-recode-breast-cancer = 2: 1 (217.76/87.6)
        er-status-recode-breast-cancer = 3: 2 (10.57/3.16)
    reason-for-no-surgery = 7: 1 (152.15/66.04)
    reason-for-no-surgery = 8: 2 (45.04/17.03)
age-at-dx > 76
    reason-for-no-surgery = 0
        er-status-recode-breast-cancer = 1: 2 (485.98/221.5)
        er-status-recode-breast-cancer = 2: 1 (168.34/55.07)
        er-status-recode-breast-cancer = 3: 1 (5.86/2.84)
    reason-for-no-surgery = 1: 1 (644.15/196.38)
    reason-for-no-surgery = 2: 1 (55.1/7.03)
    reason-for-no-surgery = 6: 1 (219.39/57.13)
    reason-for-no-surgery = 7: 1 (104.19/32.06)
    reason-for-no-surgery = 8: 1 (2.0/1.0)

Example: reason-for-no-surgery = 0: 2 (3596.58/1086.55)

When reason for no surgery equals 0 (surgery was performed), patients were classified as a 2 (survival time greater than one year). A total of 3596.58 instances reached this leaf, of which 1086.55 were classified incorrectly; the counts are fractional because instances with missing values are split across branches. Table 6 contains the descriptions of the codes and Table 4 defines the final survival time recode classifications.
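The leaf annotations in the tree can be totalled with a short script. The sketch below assumes Weka's usual J4.8 leaf notation, (N/M), in which N is the weight of training instances reaching the leaf and M is the weight of those instances that the leaf misclassifies; a leaf printed as (N) has no misclassified instances. The helper name `leaf_counts` is ours. Under this reading, the N values should add up to the 8,726 selected records.

```python
import re

# Leaf lines of the J4.8 tree from Appendix A (internal nodes omitted).
LEAVES = """
reason-for-no-surgery = 0: 2 (3596.58/1086.55)
age-at-dx <= 58: 2 (683.28/260.55)
grade = 1: 1 (36.48/15.69)
grade = 2: 2 (251.76/109.72)
marital-status-at-dx = 1: 1 (37.86/14.13)
marital-status-at-dx = 2: 2 (218.57/96.5)
marital-status-at-dx = 3: 1 (5.57/2.66)
marital-status-at-dx = 4: 1 (54.46/23.52)
marital-status-at-dx = 5: 1 (111.76/50.33)
grade = 4: 1 (37.22/15.54)
race-recode = 2: 1 (106.96/41.69)
race-recode = 3: 1 (0.68/0.0)
race-recode = 4: 1 (50.33/20.48)
race-recode = 7: 1 (0.0)
er-status-recode-breast-cancer = 2: 1 (764.41/304.12)
er-status-recode-breast-cancer = 3: 1 (9.01/2.68)
reason-for-no-surgery = 2: 1 (132.13/52.04)
er-status-recode-breast-cancer = 1: 2 (518.41/229.21)
er-status-recode-breast-cancer = 2: 1 (217.76/87.6)
er-status-recode-breast-cancer = 3: 2 (10.57/3.16)
reason-for-no-surgery = 7: 1 (152.15/66.04)
reason-for-no-surgery = 8: 2 (45.04/17.03)
er-status-recode-breast-cancer = 1: 2 (485.98/221.5)
er-status-recode-breast-cancer = 2: 1 (168.34/55.07)
er-status-recode-breast-cancer = 3: 1 (5.86/2.84)
reason-for-no-surgery = 1: 1 (644.15/196.38)
reason-for-no-surgery = 2: 1 (55.1/7.03)
reason-for-no-surgery = 6: 1 (219.39/57.13)
reason-for-no-surgery = 7: 1 (104.19/32.06)
reason-for-no-surgery = 8: 1 (2.0/1.0)
"""

# Matches "(N/M)" or "(N)"; the misclassified part is optional.
LEAF_PATTERN = re.compile(r"\((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)")


def leaf_counts(text):
    """Return a (reached, misclassified) pair of weights for every leaf in the text."""
    return [
        (float(reached), float(wrong) if wrong else 0.0)
        for reached, wrong in LEAF_PATTERN.findall(text)
    ]


pairs = leaf_counts(LEAVES)
total_reached = sum(n for n, _ in pairs)
total_wrong = sum(m for _, m in pairs)
print(f"leaves: {len(pairs)}")
print(f"instances reaching a leaf: {total_reached:.2f}")
print(f"accuracy on the training data: {1 - total_wrong / total_reached:.1%}")
```

The accuracy printed here is measured on the same data the tree was built from, so it is not directly comparable with the 10-fold cross-validated accuracy reported in Table 5.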

SEER Attribute           Code    Description
Reason for no surgery    0       Surgery performed
                         1       Surgery not recommended
                         2       Autopsy only case
                         5       Patient died before recommended surgery
                         6       Unknown reason for no surgery
                         7       Patient or patient's guardian refused
ER status recode         1       Positive
                         2       Negative
                         3       Borderline
Race Recode              1       White
                         2       Black
                         3       American Indian/Alaska Native
                         4       Asian or Pacific Islander
                         7       Other
Grade                    1       Well differentiated
                         2       Moderately differentiated
                         3       Poorly differentiated
                         4       Anaplastic
Marital Status at DX     1       Single (never married)
                         2       Married (including common law)
                         3       Separated
                         4       Divorced
                         5       Widowed
                         6       Unmarried or domestic partner
Table 6: J4.8 Legend