Data Mining Techniques to Predict Survival of Metastatic Breast Cancer Patients

Abstract

Prognosis for stage IV (metastatic) breast cancer is difficult for clinicians to predict. This study examines the SEER data set from 1988-2003 and selects patients who were initially diagnosed with stage IV breast cancer and who died as a direct result of the cancer. After developing a SEER conversion utility, seer2arff, we create three predictive models that use a supervised, passive, offline technique to classify prognosis (survival time). The results of the algorithms from the Weka toolkit were: Bayes Network, 64.2% accurate; J4.8 Decision Tree, 63.5% accurate; and an Artificial Neural Network, 62.9% accurate. The J4.8 Decision Tree selected attributes that confirm the rationale of ongoing clinical studies. This study is the first to apply machine learning techniques to this category of patients with the SEER data set.

Introduction

Breast cancer is one of the most common cancers today. Treatment options vary significantly, from surgery to chemotherapy, depending on many variables such as tumor location, size, and patient characteristics. Once breast cancer is diagnosed, physicians attempt to stage, or classify, the patient's cancer. Stages range from Stage 0 to Stage IV, where Stage IV indicates that the cancer has metastasized beyond the breast and local lymph nodes. The staging classification system also helps to predict prognosis. When breast cancer is diagnosed in Stage IV, the five-year survivability is 16%, whereas if it is detected early, in Stage I, the five-year survivability is 97%. Through numerous clinical and research efforts, there are well-known classifiers to aid physicians in categorizing a patient's cancer into the appropriate stage. However, once in stage IV, the cancer is already advanced and the stage IV factors affecting prognosis are not as well known.
For example, in stage IV, 50% of patients survive two years; however, fewer than 20% of patients survive four years or more. In this paper we present a comparison of three machine learning methods of predicting survival time for stage IV breast cancer patients from the Surveillance, Epidemiology, and End Results (SEER) data set (SEER 2012). SEER is one of the most cited clinical cancer data sets, is maintained by the National Cancer Institute, and has data as far back as 1973. In the SEER data set, survival time is calculated using the date of diagnosis and one of the following: date of death, date last known to be alive, or follow-up cutoff date used for this [data]. This definition is too inclusive, and therefore records are filtered to select patients who were initially diagnosed with stage IV breast cancer and who died as a direct result of the cancer.

Joshua Datko, Advanced Artificial Intelligence (CS610) Project Proposal, Drexel University, jbd65@drexel.edu
Copyright © 2013, by Joshua Datko. This work is made available under the terms of the Creative Commons Attribution 3.0 Unported License.

Background

The machine learning techniques used in this study are supervised, passive, offline algorithms that seek to predict classification. Data mining is a subset of machine learning that attempts to gain insight from data that was previously unknown. In a supervised learning technique, the program is given a series of inputs, x_i, and outputs, y_i, and attempts to find a function f such that f(x) = y. A passive algorithm is one that does not interfere with the data. Offline indicates that the data has already been collected, as opposed to online, where the data is being generated at the time of analysis. Prediction is the process of computing f(x_{i+1}) = y_{i+1}. Finally, classification is selecting an output y from a finite set of values. The three algorithms chosen for this study were a J4.8 Decision Tree, a Bayes Network, and an Artificial Neural Network (Multi-layer Perceptron).
The following sections provide a brief overview of the algorithms.

J4.8 Decision Tree

The J4.8 decision tree algorithm is Weka's implementation of Ross Quinlan's C4.5 decision tree algorithm. Specifically, it is the implementation of C4.5 revision 8 (Witten, Frank, and Hall 2011). It is a recursive algorithm that seeks to split the data to maximize information gain. Information gain quantifies the insight gained from a particular attribute. An attribute that splits the data into the most distinguishing groups has higher information gain than an attribute whose split leaves the classes uniformly distributed. The result of this algorithm is a decision tree that is used to classify future entries.

Bayes Network

A Bayes Network is a directed acyclic graph that encodes a conditional probability table for each node. Once constructed, Bayes Networks can answer questions like: given the following attributes, what is the probability of being classified as X? To apply a Bayes Network, the network must first be constructed or, in the case of machine learning, learned. We use the K2 Bayesian network learner, which employs a hill climbing algorithm restricted by an order on the variables.

Artificial Neural Network (Multi-layer Perceptron)

The multi-layer perceptron emulates our understanding of how neurons in the brain function. In general, a neural network is a collection of nodes, each defined by an activation function that, when its threshold is exceeded, causes the neuron to fire, that is, to enable output on an edge leaving the neuron. In a supervised environment, when given the input and the known output, the network adjusts the weights feeding each neuron so that the network's output corresponds with the known output.

Validation Techniques

In order to quantify how well the model fits the data, several validation mechanisms can be used. The one used in this study is k-fold cross-validation. In this technique, the data set is split into k subsets and learning proceeds in k rounds. Each round holds out one subset (1/k of the data) as test data and generates the model from the remaining k - 1 subsets, which form the training set. The end result is the average over the k rounds.

Approach

Data Preparation

Preparing the data for analysis is a non-trivial and time-consuming process. The SEER data set facilitates analysis by independent tools since the data is in ASCII text files. The data set itself is publicly accessible, but only after registration and signing of the user agreement on the SEER website. Included in the distribution is a thorough data dictionary, detailing each of the 134 attributes (columns). Out of the SEER data set, only the attributes in Table 1 were selected. Some attributes were not selected due to duplicate data.
For example, in breast cancer, Tumor Marker 1 is the same as ER Status Recode, and therefore the more descriptively named column was selected. Some attributes were not selected due to mutual exclusion. Since AJCC Stage 3rd Edition was selected, only data from 1988-2003 were applicable. Some data were also not selected due to lack of statistical relevance. For our data set, less than 1% of the patients were men. Therefore sex was not included in the machine learning analysis.

A query to filter the data set further was developed. This query, whose SEER values are shown in Table 2, is described in natural language as follows: select all records of patients who were initially diagnosed with stage IV breast cancer, who have died, and who died as a direct result of the cancer. Selecting based on vital status and cause of death is similar to the query performed in (Bellaachia and Guven 2006); however, we further narrowed the data to stage IV only. This reduced the total of 657,712 records to 8,726. Table 3 shows the breakdown of the query against the total data set.

SEER Attribute
  Marital Status at DX
  Age at DX
  Year of DX
  Grade
  EOD-Tumor Size
  EOD-Lymph Node Involv
  Reason for no surgery
  Race recode
  ER Status Recode
  PR Status Recode
  AJCC Stage 3rd ed
  SEER Cause-Specific Death Classification
  Vital Status Recode
  Survival time recode

Table 1: List of analyzed attributes. The ARFF type column lists three attributes as Numeric.

SEER Attribute                              Filter by
AJCC Stage 3rd ed                           40-49
SEER Cause-Specific Death Classification    1
Vital Status Recode                         4

Table 2: Query to filter the data set

Only 1.3% of the available data set is being analyzed, but most of the exclusion is a result of the restriction of the time frame. There are several staging codes used throughout the SEER data set and we chose to use only one (AJCC 3rd edition) for consistency. This restricted the data to 1988-2003. Once filtered, the columns in Table 2 were removed from the analysis since all the records contained the same value.
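The Table 2 filter can be sketched as a simple predicate. This is an illustration only, not the seer2arff implementation: the record layout (a dictionary with the hypothetical keys `ajcc_stage_3rd`, `cause_specific_death`, and `vital_status`) is an assumption, while the code values are taken from Table 2. The helper `population_share` is likewise hypothetical and only reproduces the percentage arithmetic of the population breakdown.

```python
def matches_query(record):
    """Return True if a record passes the Table 2 filter.

    `record` is assumed to be a dict of already-parsed integer codes;
    the key names are hypothetical, not SEER field names.
    """
    return (
        40 <= record["ajcc_stage_3rd"] <= 49      # stage IV codes 40-49
        and record["cause_specific_death"] == 1   # death attributed to the cancer
        and record["vital_status"] == 4           # patient has died
    )


def population_share(selected, total):
    """Percentage of the total population, rounded to one decimal place."""
    return round(100.0 * selected / total, 1)


if __name__ == "__main__":
    sample = {"ajcc_stage_3rd": 40, "cause_specific_death": 1, "vital_status": 4}
    print(matches_query(sample))
    print(population_share(8726, 657712))
```

With the counts reported in the text, 8,726 of 657,712 records corresponds to the 1.3% figure quoted above.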
seer2arff

A conversion utility called seer2arff, written by the author in Python, was developed to transform the SEER data into ARFF for processing by the Weka toolkit. The Weka workbench, developed by the University of Waikato, is a suite of machine learning algorithms implemented in Java and a framework in which one can run multiple data mining experiments. The ARFF format, described in (Witten, Frank, and Hall 2011), requires tagging attributes by data type; the major types are numeric, nominal, and string. A nominal attribute is one with a finite set of discrete values. As shown in Table 1, most of the analyzed data is nominal. Survival Time Recode (STR) required conversion to a nominal value. The original data was in a format of YYMM and it was converted to a nominal attribute as shown in Table 4. One year was selected because this split produced a reasonably partitioned population.

Query                                               Count      Percent
Total SEER breast cancer patients                   657,712    100%
Number of patients (1988-2003)                      331,936    50.5%
Diagnosed with Stage IV (1988-2003)                 12,296     1.9%
Stage IV and have died                              11,453     1.7%
Stage IV and have died from the cancer              8,726      1.3%

Table 3: Population selection

Survival Time Recode        Class    Percent of Population
Survival Time <= 1 year     1        44.6%
Survival Time > 1 year      2        55.4%

Table 4: Survival Time Recode in nominal form

Not all of the SEER data were complete. For example, ER status information was collected for only 59% of the patients selected. Typically, when data is missing from the SEER data set, it is encoded with 9 or 99. Those instances were replaced with a question mark character, which represents missing data in ARFF.

Machine learning experimentation

Three Weka data mining classification algorithms were chosen for analysis: J4.8 Decision Tree, Bayesian Network, and Multi-layer Perceptron. J4.8 is Weka's implementation of the C4.5 Decision Tree algorithm; it was run with the default parameters, with the exception of the minimum number of objects for a leaf node, which was set to 100. The default setting is two, which creates a much more complex tree. Bayesian Network and Multi-layer Perceptron were both run with their default parameters. To maximize the use of the data set, cross-validation with 10 folds was chosen for each algorithm.

Evaluation

In this section, we show and analyze the results of the three machine learning algorithms. Table 5 summarizes the results of the algorithms. Accuracy refers to the percentage of correctly classified results across both classes (less than or equal to one year and greater than one year). Precision is defined as the number of true positives for a given class divided by the sum of true and false positives. Recall is the number of true positives divided by the number of true positives plus false negatives. Finally, F-Measure combines precision and recall and is given by the equation F = 2PR / (P + R).

Overall, the accuracies of the algorithms are within two percentage points of each other. However, the accuracy is much less than the 86.7% reported in (Bellaachia and Guven 2006) and other research.
There are two main factors affecting this discrepancy. The first is that we are only analyzing stage IV patients. In (Bellaachia and Guven 2006), the highest ranked attributes from their J4.8 decision tree were Extension of tumor, Stage of cancer, and Lymph node involvement, all three of which are directly correlated with cancer stage. In fact, five of their top ranked attributes are significant factors in determining cancer stage (which is their sixth attribute). Considering these attributes, one should expect greater accuracy. Second, their query returned 151,886 records, whereas ours returned only 8,726, 94% fewer. Fortunately (for the patients), a significantly smaller number of patients are initially diagnosed in stage IV. While more data is not guaranteed to raise accuracy, our models may be improved with more records.

Algorithm    Acc.     Class    P       R       F-Measure
BayesNet     64.2%    1        .622    .505    .558
                      2        .654    .752    .700
NeuralNet    62.9%    1        .592    .538    .564
                      2        .654    .702    .677
J4.8 Tree    63.5%    1        .618    .476    .538
                      2        .644    .764    .699

Table 5: Combined Results. (Acc. = Accuracy, P = Precision, R = Recall)

Bayes Network

Overall, the Bayes Network was the most accurate algorithm of the three. In the learned network, each attribute is a child node of survival time. Each node has a probability distribution that can be used to solve for the conditional probability of that attribute. For example, the probability of the patient surviving for more than one year, given that she had surgery, is 58.6%, but the probability of the patient surviving for less than one year, given that she had surgery, is only 36.7%. This attribute was also selected as a distinguishing characteristic by the J4.8 decision tree algorithm, discussed in the following section. The Bayes Network also had the highest F-Measure for classifying patients who survived for more than one year. However, the true distribution of patients is slightly weighted toward this category, as Table 4 shows.
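As a consistency check on Table 5, the following sketch recomputes each F-Measure from the reported precision and recall using F = 2PR / (P + R), and recomputes accuracy as the class-share-weighted sum of the per-class recalls. The second check rests on an assumption: that the class shares of Table 4 (44.6% and 55.4%) also describe the instances evaluated under cross-validation. The names `TABLE_5`, `CLASS_SHARE`, `f_measure`, and `accuracy_from_recall` are illustrative and do not come from Weka.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    return 2 * precision * recall / (precision + recall)


# Class shares from Table 4 (class 1: <= 1 year, class 2: > 1 year).
CLASS_SHARE = {1: 0.446, 2: 0.554}

# Reported values from Table 5: accuracy (%), then per class (P, R, F).
TABLE_5 = {
    "BayesNet": (64.2, {1: (0.622, 0.505, 0.558), 2: (0.654, 0.752, 0.700)}),
    "NeuralNet": (62.9, {1: (0.592, 0.538, 0.564), 2: (0.654, 0.702, 0.677)}),
    "J4.8 Tree": (63.5, {1: (0.618, 0.476, 0.538), 2: (0.644, 0.764, 0.699)}),
}


def accuracy_from_recall(per_class):
    """Accuracy (%) as the sum over classes of class share times recall."""
    return 100 * sum(CLASS_SHARE[c] * per_class[c][1] for c in per_class)


for name, (reported_acc, per_class) in TABLE_5.items():
    for cls, (p, r, reported_f) in per_class.items():
        print(f"{name} class {cls}: F = {f_measure(p, r):.3f} (reported {reported_f:.3f})")
    print(f"{name}: accuracy = {accuracy_from_recall(per_class):.1f}% (reported {reported_acc}%)")
```

Because the table values are rounded to three decimal places, the recomputed figures can differ from the reported ones in the last digit.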
J4.8 Decision Tree

The J4.8 decision tree under-performed the Bayes Network in every area except recall for patients who survived more than one year, in which it fared marginally better. The resulting decision tree provides a human-readable model of the classifier. The full tree is shown in Appendix A. The decision tree algorithm uses a recursive approach, selecting the most discriminating attribute (highest information gain) at each level.

Multi-layer Perceptron (Neural Network)

While the Multi-layer Perceptron had the lowest accuracy, it had the highest recall for class 1 patients. Unfortunately, as is generally the case with neural networks, it is difficult to glean any additional understanding from the model.

Clinical Analysis

The decision tree yielded some interesting observations about the data set. The highest ranked attribute was age at diagnosis, with a cutoff of 76 years. The mean age at diagnosis is 63 years, with a standard deviation of 14.4 years, so the 76-year cutoff lies roughly one standard deviation above the mean. This may be intuitive, as older patients are more likely to have other medical complications and to experience a greater number of side effects from treatment. Age is a complex factor in this study. While studies have shown that age alone is not a dominant factor affecting prognosis, it is believed that older women are under-treated and are not prescribed as aggressive chemotherapy as younger
women. The only case in which the J4.8 decision tree predicted that a woman over the age of 76 would live longer than one year was when she had surgery and was ER positive.

The next highest ranked attribute, reason for no surgery, is a controversial choice. Retrospective data seem to indicate that surgery is beneficial; however, this may be confounded by other factors present when the decision to perform surgery is made, such as the patient's performance status and the extent of her metastatic disease. To study this prospectively, there is an ongoing Eastern Cooperative Oncology Group study (NCT01242800) that randomizes patients to surgery or no surgery in order to determine whether breast cancer surgery that removes the primary tumor increases the survival of stage IV patients. The results of the J4.8 decision tree, which indicate that surgery is an important factor in stage IV prognosis, support the rationale of this study.

ER status was determined to be the third most important characteristic. ER status represents whether the estrogen receptor is present in the cancer cell, as determined by an immunohistochemistry test. When the ER result is positive, different treatment options are available to the patient, specifically hormonal treatments, which have fewer side effects than traditional chemotherapy. In the decision tree, in general, when ER was positive the patient was predicted to survive longer than one year.

Marital status at diagnosis was not ranked very high, but interestingly it was used as a final tie breaker in one case. On this branch, patients who were married survived longer than one year by almost 2:1. In all other marital categories (widowed, divorced, not married, or separated), there were 2:1 odds of dying within one year. This supports the findings in (Osborne et al. 2005), which show that older unmarried women are at an increased risk of death from breast cancer.
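The attribute ranking discussed in this section follows from information gain, the splitting criterion described in the Background. The following is a minimal sketch of entropy-based information gain on made-up rows; it is not Weka's J4.8 code, the four toy records and the attribute names `surgery` and `grade` are hypothetical, and C4.5 additionally normalizes the gain (gain ratio), which is omitted here.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, label="class"):
    """Entropy of the labels minus the weighted entropy after splitting."""
    base = entropy([row[label] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[label] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical toy records: class 2 = survived > 1 year, class 1 = <= 1 year.
TOY_ROWS = [
    {"surgery": "yes", "grade": "a", "class": 2},
    {"surgery": "yes", "grade": "b", "class": 2},
    {"surgery": "no", "grade": "a", "class": 1},
    {"surgery": "no", "grade": "b", "class": 1},
]

if __name__ == "__main__":
    for attribute in ("surgery", "grade"):
        print(attribute, information_gain(TOY_ROWS, attribute))
```

In this toy data, `surgery` separates the two classes completely while `grade` leaves each subset evenly mixed, so a tree builder that maximizes information gain would split on `surgery` first.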
Oncologist's Predictions

A breast cancer oncologist was consulted and asked what factors she would consider most important in developing a patient's prognosis for a stage IV diagnosis. She listed (in no particular order): ER status, tumor grade, the extent of the spread of metastases, and the patient's performance status. Performance status refers to the Karnofsky Performance Status Scale, which quantifies the patient's well-being. Patients with a rating of 60% and below require some degree of assistance for care and show more symptoms. This data is not available in the SEER data set. The spread of metastasis is available (CS Mets at DX), but only on records from 2004 onward. For example, if the breast cancer spreads to the liver or brain, the clinician predicts survival to be much shorter than if the breast cancer spread to just the bone or the lymph nodes. Both tumor grade and ER status were selected by the decision tree; however, they were not considered as distinguishing as age and reason for no surgery.

Related Work and Novelty

Several researchers have used the SEER database and Weka to predict survival of breast cancer patients. In (Delen, Walker, and Kadam 2005), three data mining algorithms were tested against the 1973-2000 SEER data set, and a decision tree approach predicted survivability, defined by the authors as surviving longer than sixty months, with 93.6% accuracy. (Bellaachia and Guven 2006) performed data mining on the 1973-2000 SEER database with the Weka toolkit and showed that the extension of the tumor was the most contributing factor to survivability, followed closely by the stage of the cancer. (Endo, Shibata, and Tanaka 2007) used SEER data from 1992-1997 and selected ten independent variables to predict survivability, defined as surviving longer than sixty months. However, there has not been a study investigating only metastatic breast cancer patients. Furthermore, each study above included the stage of breast cancer in its analysis.
Staging is information derived from other empirical patient data and is therefore not an independent factor. Also, the AJCC staging system is designed as an indicator of prognosis and is an aid for clinicians in developing a treatment plan; therefore staging should not be included in machine learning techniques to predict survival. By definition, an analysis restricted to stage IV patients is stageless, since stage IV is the last stage; it is also the most volatile stage in terms of survivability.

Conclusion

While the accuracy of the model was much less than the 80% anticipated in the proposal, we consider this initial research encouraging. Three independent algorithms produced a consistent accuracy estimate, and the decision tree confirms important factors in clinical research. In fact, age and reason for no surgery are two factors currently under study. To our knowledge, this analysis is the first to focus on only stage IV breast cancer patients. Other data mining research has shown that machine learning techniques can confirm, with high accuracy, the prognosis assigned by the staging system. Our research, however, used machine learning to gain insight on a new question that had not previously been asked.

Future Work

A similar study can be extended to include patients diagnosed in 2004 and later. These records would include the extent of the metastases, which may be an important prognosis indicator. The 2004 data also includes the HER2 indicator, which oncologists use in addition to ER to guide treatment options.

References

Bellaachia, A., and Guven, E. 2006. Predicting Breast Cancer Survivability Using Data Mining Techniques.

Delen, D.; Walker, G.; and Kadam, A. 2005. Predicting breast cancer survivability: a comparison of three data mining methods. Artificial Intelligence in Medicine 34(2):113-127.

Endo, A.; Shibata, T.; and Tanaka, H. 2007. Comparison of seven algorithms to predict breast cancer survival. International Journal of Biomedical Soft Computing and Human Sciences 13.

Osborne, C.; Ostir, G. V.; Du, X.; Peek, M. K.; and Goodwin, J. S. 2005. The influence of marital status on the stage at diagnosis, treatment, and survival of older women with breast cancer. Breast Cancer Research and Treatment 93(1):41-47.

Surveillance, Epidemiology, and End Results (SEER) Program (www.seer.cancer.gov). 2012. Research data (1973-2009). Released April 2012, based on the November 2011 submission.

Witten, I.; Frank, E.; and Hall, M. A. 2011. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.

Appendix A: Complete J4.8 Decision Tree

age-at-dx <= 76
    reason-for-no-surgery = 0: 2 (3596.58/1086.55)
    reason-for-no-surgery = 1
        er-status-recode-breast-cancer = 1
            age-at-dx <= 58: 2 (683.28/260.55)
            age-at-dx > 58
                race-recode = 1
                    grade = 1: 1 (36.48/15.69)
                    grade = 2: 2 (251.76/109.72)
                    grade = 3
                        marital-status-at-dx = 1: 1 (37.86/14.13)
                        marital-status-at-dx = 2: 2 (218.57/96.5)
                        marital-status-at-dx = 3: 1 (5.57/2.66)
                        marital-status-at-dx = 4: 1 (54.46/23.52)
                        marital-status-at-dx = 5: 1 (111.76/50.33)
                    grade = 4: 1 (37.22/15.54)
                race-recode = 2: 1 (106.96/41.69)
                race-recode = 3: 1 (0.68/0.0)
                race-recode = 4: 1 (50.33/20.48)
                race-recode = 7: 1 (0.0)
        er-status-recode-breast-cancer = 2: 1 (764.41/304.12)
        er-status-recode-breast-cancer = 3: 1 (9.01/2.68)
    reason-for-no-surgery = 2: 1 (132.13/52.04)
    reason-for-no-surgery = 6
        er-status-recode-breast-cancer = 1: 2 (518.41/229.21)
        er-status-recode-breast-cancer = 2: 1 (217.76/87.6)
        er-status-recode-breast-cancer = 3: 2 (10.57/3.16)
    reason-for-no-surgery = 7: 1 (152.15/66.04)
    reason-for-no-surgery = 8: 2 (45.04/17.03)
age-at-dx > 76
    reason-for-no-surgery = 0
        er-status-recode-breast-cancer = 1: 2 (485.98/221.5)
        er-status-recode-breast-cancer = 2: 1 (168.34/55.07)
        er-status-recode-breast-cancer = 3: 1 (5.86/2.84)
    reason-for-no-surgery = 1: 1 (644.15/196.38)
    reason-for-no-surgery = 2: 1 (55.1/7.03)
    reason-for-no-surgery = 6: 1 (219.39/57.13)
    reason-for-no-surgery = 7: 1 (104.19/32.06)
    reason-for-no-surgery = 8: 1 (2.0/1.0)

Example: reason-for-no-surgery = 0: 2 (3596.58/1086.55)

When reason for no surgery equals 0 (surgery was performed), patients were classified as a 2 (survival time greater than one year). In Weka's leaf notation, 3596.58 instances reached this leaf, of which 1086.55 were classified incorrectly; the counts are fractional because instances with missing values are distributed across branches. Table 6 contains the descriptions of the codes and Table 4 defines the final survival time recode classifications.
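The leaf notation in the example above can be unpacked mechanically. The sketch below parses a single leaf line and derives the share of instances at that leaf that were classified correctly. It assumes Weka's usual convention that the first number is the total weight of instances reaching the leaf and the optional second number is the weight of those misclassified; the function `parse_leaf` and its returned keys are hypothetical names introduced for illustration.

```python
import re

# attribute, operator, value, predicted class, (total[/errors])
LEAF = re.compile(
    r"^(?P<attr>[\w-]+) (?P<op>=|<=|>) (?P<value>\S+): "
    r"(?P<cls>\d+) \((?P<total>[\d.]+)(?:/(?P<errors>[\d.]+))?\)$"
)


def parse_leaf(line):
    """Parse one leaf line of the tree into a dict of its parts."""
    match = LEAF.match(line.strip())
    if match is None:
        raise ValueError(f"not a leaf line: {line!r}")
    total = float(match["total"])
    errors = float(match["errors"] or 0.0)  # a leaf without '/E' has no errors
    return {
        "attribute": match["attr"],
        "operator": match["op"],
        "value": match["value"],
        "predicted_class": int(match["cls"]),
        "total": total,
        "errors": errors,
        # Share of instances at this leaf classified correctly, if any reached it.
        "leaf_accuracy": (total - errors) / total if total > 0 else None,
    }


if __name__ == "__main__":
    leaf = parse_leaf("reason-for-no-surgery = 0: 2 (3596.58/1086.55)")
    print(leaf["predicted_class"], round(leaf["leaf_accuracy"], 3))
```

Under that convention, roughly 70% of the instances reaching the example leaf were classified correctly, which is in line with the modest overall accuracy reported for the tree.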

SEER Attribute           Code    Description
Reason for no surgery    0       Surgery performed
                         1       Surgery not recommended
                         2       Autopsy only case
                         5       Patient died before recommended surgery
                         6       Unknown reason for no surgery
                         7       Patient or patient's guardian refused
ER status recode         1       Positive
                         2       Negative
                         3       Borderline
Race Recode              1       White
                         2       Black
                         3       American Indian/Alaska Native
                         4       Asian or Pacific Islander
                         7       Other
Grade                    1       Well differentiated
                         2       Moderately differentiated
                         3       Poorly differentiated
                         4       Anaplastic
Marital Status at DX     1       Single (never married)
                         2       Married (including common law)
                         3       Separated
                         4       Divorced
                         5       Widowed
                         6       Unmarried or domestic partner

Table 6: J4.8 Legend