A combined neural network and decision trees model for prognosis of breast cancer relapse


Artificial Intelligence in Medicine 27 (2003)

José M. Jerez-Aragonés a,*, José A. Gómez-Ruiz a, Gonzalo Ramos-Jiménez a, José Muñoz-Pérez a, Emilio Alba-Conejo b

a Departamento de Lenguajes y Ciencias de la Computación, Complejo Tecnológico de la Información, Campus de Teatinos, University of Malaga, Malaga, Spain
b Servicio de Oncología, Hospital Clínico Universitario, Malaga, Spain

Received 10 January 2002; received in revised form 16 July 2002; accepted 27 September 2002

Abstract

The prediction of clinical outcome of patients after breast cancer surgery plays an important role in medical tasks such as diagnosis and treatment planning. Different prognostic factors for breast cancer outcome appear to be significant predictors for overall survival, but probably form part of a bigger picture comprising many factors. Survival estimations are currently performed by clinicians using the statistical techniques of survival analysis. In this sense, artificial neural networks are shown to be a powerful tool for analysing datasets where there are complicated non-linear interactions between the input data and the information to be predicted. This paper presents a decision support tool for the prognosis of breast cancer relapse that combines a novel TDIDT algorithm (control of induction by sample division method, CIDIM), which selects the most relevant prognostic factors for the accurate prognosis of breast cancer, with a system composed of different neural network topologies that takes the selected variables as input in order to reach a good correct classification probability. In addition, a new method for estimating the Bayes optimal error using the neural network paradigm is proposed. Clinical pathological data were obtained from the Medical Oncology Service of the Hospital Clínico Universitario of Málaga, Spain. The results show that the proposed system is a useful tool for clinicians to search through large datasets seeking subtle patterns in prognostic factors, and that it may further assist the selection of appropriate adjuvant treatments for the individual patient.

© 2002 Elsevier Science B.V. All rights reserved.

Keywords: Back-propagation algorithm; Bayes error; Survival analysis; Breast cancer; Decision trees; Inductive learning

* Corresponding author. E-mail address: jja@lcc.uma.es (J.M. Jerez-Aragonés).

1. Introduction

Prediction tasks are among the most interesting activities in which to implement intelligent systems. Specifically, prediction is an attempt to accurately forecast the outcome of a specific situation, using as input information obtained from a concrete set of variables that potentially describe the situation. A problem often faced in clinical medicine is how to reach a conclusion about the prognosis of cancer patients when presented with complex clinical and prognostic information, since specialists usually make decisions based on a simple dichotomization of variables into a favourable and unfavourable classification [18].

As we enter the new millennium, treatment modalities exist for many solid tumour types and their use is well established. Offset against this, however, is the toxicity of some treatments. As there is a real risk of mortality associated with treatment, it is vital to be able to offer different therapies depending on the patient. In this sense, the likelihood that the patient will suffer a recurrence of her disease is very important, so that the risks and expected benefits of specific therapies can be compared.

This work analyses, on the one hand, the decision-making process that arises when patients with primary breast cancer are to receive a certain therapy to remove the primary tumour. On the other hand, different prognostic factors appear to be significant predictors for overall survival, but probably form part of a bigger picture comprising many inter-related factors [11]. In order to investigate this hypothesis, studies looking at a large number of potential prognostic factors are needed. To further complicate matters, these relationships may well be non-linear in nature. These form the major difficulties in such studies.
Furthermore, the statistical analysis of large datasets using standard methodologies is cumbersome and limited, especially in the case of non-linear relationships.

Among prognostic modelling techniques that induce models from medical data, survival analysis methods are specific both in terms of modelling and the type of data required. Survival models attempt to determine the probability of the event occurring within a specific time, which requires classification models that classify either the occurrence or non-occurrence of the event and optionally model the outcome probabilities. Several tools successfully used in the construction of medical prognosis models have been proposed by the machine learning community [17,34]. Neural networks are a form of artificial intelligence that have found application in a wide range of problems [10,20,24] and have given, in many cases, superior results to standard statistical models [33]. Baxt [4] demonstrated the predictive reliability of an artificial neural network model in medical diagnosis. In this case, we utilise the ability of neural networks to recognise complex and highly non-linear relationships, such as are likely to characterise medical circumstances.

Some authors [14,30] have modelled systems for outcome prediction in post-surgery breast and lung carcinoma patients using neural networks to perform survival analysis. This type of modelling manages the problem of censored data handling that arises when the event related to the censor variable normally included in the survival data (like death or recurrence of a disease) has not occurred during the follow-up period for a patient, although the event may eventually occur. These authors have solved the problem by using different survival estimators to handle censored data for patients. This would imply that prognostic factors (for example, in breast cancer with adjuvant therapy after surgery) are not time-dependent, which is not really true. That is, the strength of the prognostic factor is not the same for different time intervals. Different techniques for survival estimation, such as Kaplan-Meier analysis [15] and Cox regression modelling [6], assume that the strength of a prognostic factor does not change over time. In addition, the existence of a peak of recurrence in the distribution of relapse probability [2] demonstrates that the recurrence probability is not the same over time. In this sense, if these statistical techniques are not appropriate to solve this problem, a possible solution would be to incorporate the whole set of prognostic factors pre-selected by medical experts (Section 3.1) as input to the neural networks system. This would involve removing all the patients with censored data; however, the cardinality of the resulting set of patient data vectors would then become too small to constitute a significant representation of this problem.

This work proposes a new system approach based on: (1) specific topologies of neural networks for different time intervals during the follow-up time of the patients, considering the events occurring in different intervals as different problems; and (2) decision trees, useful in understanding the underlying relationships in breast cancer data, for selecting the most important prognostic factors corresponding to every time interval. This is not the first attempt to combine decision trees and neural networks [1,7], but it does present different ways of integrating them. In addition, we introduce a new decision trees algorithm, control of induction by sample division method (CIDIM), for reducing the number of rules and improving the selection of attributes from the database to become significant prognostic factors.
Furthermore, a new upper-bound estimate of the problem-difficulty level, based on the correct classification Bayes probability, is also proposed.

2. Breast cancer overview

Breast cancer is a malignant tumour that has developed from cells of the breast. Although scientists know some of the risk factors (e.g. ageing, genetic risk factors, family history, menstrual periods, not having children, obesity) that increase a woman's chance of developing breast cancer, they do not yet know what causes most breast cancers or exactly how some of these risk factors cause cells to become cancerous. Research is under way to learn more, and scientists are making great progress in understanding how certain changes in DNA can cause normal breast cells to become cancerous.

Breast cancer is the most common cancer among women, excluding nonmelanoma skin cancers. The American Cancer Society estimated that in 2001 about 192,200 new cases of invasive breast cancer (Stages I-IV) were diagnosed among women in the US. Ductal carcinoma in situ (DCIS) accounts for about 39,900 new cases each year. Breast cancer also occurs in men. In 2001, there were about 40,600 deaths from breast cancer in the US (40,200 among women, and 400 among men). Breast cancer is the second leading cause of cancer death in women, exceeded only by lung cancer, although death rates declined significantly during [...]. These decreases are probably the result of earlier detection and improved treatment. Breast cancer has a very high cure rate, with 97% of women surviving for 5 years if the cancer is diagnosed early.

Staging is the process of gathering information about the tumour from certain examinations and diagnostic tests to determine how widespread the cancer is. The stage of a cancer is one of the most important factors in selecting treatment options. The TNM system is a standardised way in which the cancer care team describes the extent to which the cancer has spread: the letter T followed by a number from 0 to 4 describes the tumour's size and spread to the skin or chest wall under the breast, the letter N followed by a number from 0 to 3 indicates whether the cancer has spread to lymph nodes near the breast, and the letter M followed by a 0 or 1 indicates whether or not the cancer has spread to distant organs. Once a patient's T, N, and M categories have been determined, this information is combined in a process called stage grouping to determine a woman's disease stage. This is expressed in Roman numerals from Stage 0 (the least advanced stage) to Stage IV (the most advanced stage).

3. Methods

3.1. Patient data

Data from 1035 patients with breast cancer from the Medical Oncology Service of the Hospital Clínico Universitario of Málaga, Spain, were collected and recorded during the period [...]. Data corresponding to every patient were structured in 85 fields containing information about post-surgical measurements, personal data, and type of treatment. Part of this information is not relevant for predicting outcome, so only 14 independent input variables, pre-selected from all these data fields and targeted by medical experts as probable risk factors for breast cancer prognosis, were incorporated in the model, becoming inputs to the CIDIM algorithm. Fig. 1 shows the data pre-processing stages from the original database in the system construction phase.
All variables and their units or modes of representation, mean, standard deviation, and median are shown in Table 1, where survival status appears as a supervisory variable to be predicted by the prognosis system. Table 2 shows the underlying medical and statistical meaning of risk factors proposed as important prognostic variables.

Fig. 1. The prognosis system based on neural networks and decision trees.

Table 1
Summary of patient data: range, mean, S.D. and median

Prognostic variables (mnemonic)    Range    Mean    S.D.    Median
Age (Ag)
Menarchy age (Ma)
Menopause age (Mg)
First pregnancy age (Fp)
No. of miscarriages (Mn)
No. of axillary lymph nodes (An)
Grade (Gr)    1, 2, NA
Tumour size (Ts)
No. of pregnancies (Pn)
Estrogen receptors (Er)    1, NA
Progesteron receptors (Pr)    1, NA
P53    1, NA
Ploidy (Pl)    1, 2, NA
S-Phase (Ps)
Supervisory variable
Survival status    0 (non-relapse), 1 (relapse)    NA

Censoring data handling

One of the most common problems in survival analysis is the lack of information in the form of missing data values. To properly address censored data in the modelling process, patients for whom the event did not occur require special treatment. Different methods have been proposed to solve this problem (see review in [23]). The simplest solution is to remove those patient cases with missing values, which would involve the rejection of a large number of them. Another approach is to reject those prognostic factors for which there is no data; however, this approach is very difficult to control, since it could lead the system to make weak predictions if a great number of significant prognostic factors are eliminated. Other authors propose a technique that assigns a distribution of outcomes instead of a single outcome. The distribution would be assessed through the outcome probability estimate based on the Kaplan-Meier method, using weighted examples to implement the schema [31,34], but, as mentioned before, prognostic factors in breast cancer with adjuvant therapy after surgery are time dependent. That is, the strength of the prognostic factor is not the same for the first 10 months as for, for example, the [...] months interval. Techniques for survival estimation, such as Kaplan-Meier analysis [15] and Cox regression modelling [6], assume that the strength of a prognostic factor does not change over time, although this is not the case in the real world. Moreover, the recurrence probability is not the same over time, since the existence of a peak of recurrence in the distribution of relapse probability has been demonstrated empirically [2].

Some authors [16,31] mention that trivial solutions to the problem, such as removing the censored data from the dataset or considering them as examples where the event will not occur, would bias the modelling. However, good results were achieved in [22] by using only complete data cases with no missing data values. In this work, we reject patient cases containing missing data values for each interval of follow-up time through a classification rule analysed below.
Data subsets corresponding to each time interval were selected from the original 1035 patients from the Oncology Service database and classified into relapse and non-relapse classes for each time interval $I_i$. This classification process was performed according to the survival status and time interval variables from each patient's data. Let $C_{ij}$ be the class j of the interval i, where j = 1 identifies the class relapse and j = 2 the class non-relapse. Then, for the interval $I_i$, the patients selected for classes $C_{i1}$ and $C_{i2}$ are chosen according to the following rules:

(a) $C_{i1}$: patients with time interval = i and survival status = relapse.
(b) $C_{i2}$: patients with time interval = j (j < i) and survival status = relapse, and all the patients with time interval = k (k > i).
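Rules (a) and (b) can be sketched in a few lines of Python. This is only an illustration of the rules as stated above, not the authors' implementation: the `Patient` record and the `split_classes` function are hypothetical names, and the time interval is assumed to be stored as an integer index.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # Hypothetical record: index of the follow-up interval and survival status.
    interval: int
    relapse: bool


def split_classes(patients, i):
    """Return (C_i1, C_i2) for interval i, following rules (a) and (b) as stated."""
    # Rule (a): relapse observed in interval i.
    c_i1 = [p for p in patients if p.interval == i and p.relapse]
    # Rule (b): relapse in an earlier interval, or any patient with a later interval.
    c_i2 = [
        p for p in patients
        if (p.interval < i and p.relapse) or p.interval > i
    ]
    return c_i1, c_i2


cohort = [
    Patient(1, True), Patient(2, True), Patient(2, False),
    Patient(3, False), Patient(3, True),
]
relapse_cls, non_relapse_cls = split_classes(cohort, 2)
```

Note that a patient whose time interval equals i but whose status is non-relapse falls into neither class, which matches the rejection of incomplete cases described above.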

Table 2
Medical meaning of risk prognostic factors proposed as important prognostic variables

Age: A woman's risk of developing breast cancer increases with age. About 77% of women with breast cancer are over age 50 at the time of diagnosis. Women younger than 30 years account for only 0.3% of breast cancer cases. Women in their thirties account for about 3.5% of cases.

Menarchy age and menopause age: Women who started menstruating at an early age (before age 12) or who went through menopause at a late age (after age 50) have a slightly higher risk of breast cancer.

First pregnancy age: Women who delay their first pregnancy into their thirties have almost a doubled risk of breast cancer compared to those who have babies in their late teens or early 20s.

No. of miscarriages: Miscarriages (spontaneous abortions) do not seem to increase the risk of breast cancer, and many of the studies concerning induced or spontaneous pregnancy losses and breast cancer are controversial.

No. of axillary nodes: When breast cancer cells reach the axillary lymph nodes, they can continue to grow, often causing swelling of the lymph nodes in the underarm area. If breast cancer cells have grown in the axillary lymph nodes, they are more likely to have spread to other organs of the body as well. This is why finding out whether breast cancer has spread to axillary lymph nodes is important in selecting the best mode of treatment and predicting the patient outcome.

Grade: Histologic tumour grade is based on the arrangement of the cells in relation to each other, as well as features of individual cells. The grade helps predict the patient's prognosis because cancers that closely resemble normal breast tissue tend to grow and spread more slowly. In general, a lower grade number indicates a slower-growing cancer while a higher number indicates a faster-growing cancer.

Tumour size: Tumour size is one of the most important prognostic variables and is related to the breast cancer stage. Stage I: the tumour is 2.0 cm or less. Stage II: the tumour size is between 2.0 and 5 cm. Stage III: the tumour is larger than 5 cm.

No. of pregnancies: Women who have had no children or who had their first child after age 30 have a slightly higher breast cancer risk.

Estrogen and progesteron receptors: Receptors are molecules that are a part of cells. They recognise certain substances such as hormones that circulate in the blood. Normal breast cells and some breast cancer cells have receptors that recognise estrogen and progesterone. Breast cancers that contain estrogen and progesterone receptors tend to have a better prognosis than cancers without these receptors.

p53: Tests to identify other acquired changes in oncogenes or tumour suppressor genes (such as p53) may help doctors more accurately predict the prognosis of some women with breast cancer.

Ploidy: The ploidy of cancer cells refers to the amount of DNA they contain. If there is a normal amount of DNA, the cells are said to be diploid. If the amount is abnormal, then the cells are described as aneuploid. Some studies have found that aneuploid breast cancers tend to be more aggressive.

S-phase: The S-phase test counts the percentage of tumour cells that are making copies of their DNA, and thus provides an estimate of the speed of tumour growth. A high S-phase level would indicate that the tumour is aggressive. Tumours that have normal DNA ploidy levels and are slow growing (low S-phase) indicate a better patient prognosis than a tumour with abnormal DNA ploidy results and a high S-phase.

Selection of prognostic factors

In order to select the most important prognostic factors (from those pre-selected by medical experts as being significant risk factors) for predicting overall survival, several methods have been studied. First, consulting clinicians about their importance is the simplest way, but this would introduce a significant bias in the selected attributes set. Another approach trains neural networks with different sets of input data in order to select the most significant attributes, but the implementation of this method has a high computational cost. In this sense, symbolic induction techniques can help us to understand the underlying relationships in breast cancer data with low computational cost. Decision trees appear to be appropriate methods for these types of problems, because if some parameters can be shown not to be significant in the decision process, then their rejection can be recommended, which would simplify the whole system. In our work, which involves the use of neural networks, the rejection of input parameters diminishes the size of the final network architecture.

Different algorithms, such as ID3 [25-27] and C5 (updated version of C4.5 [28]), were tested in our research, but too many attributes were obtained as significant prognostic factors, which would excessively complicate the architecture of the final neural network system. An appropriate method for selecting prognostic factors is thus necessary. Therefore, a new method called control of induction by sample division method [29,32] has been developed to perform adaptive pruning with predictive control, significantly reducing the number of rules and improving the selection of the attributes that would better explain the patient dataset. CIDIM generates smaller trees than those obtained with other algorithms. This allows the selection of the most important attributes as the neural networks system input.

The main features of the CIDIM algorithm are as follows:

(1) The top down induction decision tree (TDIDT) algorithms [5,19] generally split the experiences set into two sets: the training set and the test set (two-thirds and one-third of the dataset, respectively). The CIDIM algorithm divides the training set into two subsets of identical size: the construction subset (called CNS) and the control subset (called CLS). For every new node of the tree, its expansion is decided by using the CNS and CLS subsets, based on the predictive capacity of the expansion in regard to the CLS set. The final tree is not the best classifier, but it has far fewer rules and generally is as good a predictor as the tree obtained with classical TDIDT algorithms (sometimes better).

(2) An internal bound condition is defined. Usually, the expansion of the tree finishes when all experiences associated with a node belong to the same class, yielding trees that are too large. In order to avoid this overfitting, external conditions are considered by different algorithms (C5 demands that at least two branches have at least two experiences). The CIDIM algorithm uses the following as an internal condition: if the prediction is not improved then the node is not expanded, making the expansion process dependent on the CNS and CLS subsets.

Tree expansion supervision is driven by two indexes: the absolute index $I_A$ and the relative index $I_R$ (see expressions (1) and (2)). For every algorithmic step, a node is expanded only if these indexes are increased. The absolute and relative indexes are defined as

\[
I_A = \frac{\sum_{i=1}^{N} \mathrm{CORRECT}(e_i)}{N} \qquad (1)
\]

\[
I_R = \frac{\sum_{i=1}^{N} P_{C(e_i)}(e_i)}{N} \qquad (2)
\]

where N is the number of experiences, e a single experience, C(e) the class of experience e, $P_m(e)$ the probability of class m for experience e, and CORRECT(e) = 1 if $P_{C(e)}(e) = \max\{P_1(e), P_2(e), \ldots, P_k(e)\}$, or 0 otherwise.

The CIDIM algorithm is presented next.

1. The CNS and CLS subsets are obtained by a random division of the experiences set used to construct the tree.
2. For each non-leaf node do:
2.1. Splitting (as standard TDIDT) by a disorder measure (for example the entropy measure).
2.2. If splitting does not improve the prediction (according to $I_A$ and $I_R$), then the node is a leaf node (even when all the experiences do not belong to the same class).
2.3. If splitting improves the prediction then the node is expanded.

Next, to show the goodness of the CIDIM algorithm, we present some experimental results from a previous work [29]. In order to compare the CIDIM method with the ID3 and C5 algorithms, three standard experiences sets have been used. These sets are ionosphere, pima-diabetes and wdbc, which can be obtained from MLRepository [12]. Table 3 presents a brief summary of their characteristics. Each numerical attribute has been divided into several intervals of similar size according to the range of values. Experiences with unknown values have been omitted. We used tenfold cross-validation in order to avoid bias in the results. The pruning CF parameter was set to four different values for the C5 algorithm. The success index (SI) and the average number of rules for every dataset are shown in Table 4. This table shows how CIDIM always generates fewer rules than the other learning algorithms under comparison, with a similar success index [29].
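The indexes in expressions (1) and (2) can be computed directly from the class probabilities that a candidate tree assigns to each experience. The following is a minimal sketch, not the CIDIM implementation: the function names are hypothetical, and reading "these indexes are increased" as "both indexes strictly increase on the control subset" is an assumption.

```python
def cidim_indexes(class_probs, true_classes):
    """Return (I_A, I_R) for a set of experiences.

    class_probs[i][m] is the probability P_m(e_i) assigned to class m,
    true_classes[i] is the index of the true class C(e_i).
    """
    n = len(true_classes)
    correct = 0
    prob_sum = 0.0
    for probs, c in zip(class_probs, true_classes):
        # CORRECT(e) = 1 when the true class has the maximal probability.
        if probs[c] == max(probs):
            correct += 1
        prob_sum += probs[c]
    return correct / n, prob_sum / n


def should_expand(before, after):
    # Assumed reading of the expansion test: both indexes must strictly increase.
    return after[0] > before[0] and after[1] > before[1]


i_a, i_r = cidim_indexes([[0.8, 0.2], [0.4, 0.6], [0.5, 0.5]], [0, 0, 1])
```

In this toy example two of the three experiences have their true class among the most probable ones, so the absolute index is 2/3, while the relative index averages the probabilities 0.8, 0.4 and 0.5 given to the true classes.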
Table 3
Characteristics of standard sets

Name          Ionosphere    Pima-diabetes    WDBC
Cardinal
Attributes
Types         Symbolic      Numerical        Numerical
Classes
Subject       Ionosphere    Diabetes         Cancer

Table 4
Comparative results of CIDIM and other common TDIDT algorithms

Columns, for each of Ionosphere, Pima-diabetes and WDBC: SI; No. of rules averages
Rows: ID3; C5 0%; C5 10%; C5 20%; C5 30%; CIDIM

On the other hand, the standard multivariate analysis methods, in spite of known disadvantages, are still in use for the variable selection problem. Some of them are standard stepwise regression procedures (forward selection, backward elimination, MINR, MAXR forward selection). In Section 4, the MAXR variable selection procedure is used for selecting the most important prognostic factors for predicting the overall survival, and the results are compared with those found by the CIDIM algorithm proposed in this section.

Approximating Bayes decision rule

The problem presented is: given a patient, will she suffer a post-surgical relapse at any period during her follow-up time? We need a decision rule to solve it. That is, we wish to classify an observation x as belonging to one of two populations, such that if x belongs to the ith population, x occurs according to the density function $p(x/C_i)$. When maximising the correct classification probability is desired, the minimum probability of error decision rule (Bayes rule [8]) is defined by the function

\[
f_i(x) =
\begin{cases}
1 & \text{if } p(C_i/x) \ge p(C_k/x) \;\; \forall k \\
0 & \text{otherwise}
\end{cases}
\]

where $p(C_i/x)$ is the a posteriori density function and $f_i$ is the probability of classifying the pattern x in class $C_i$.

Next, we determine the correct classification probability when the Bayes rule is used for this problem. Let $p(C_i)$ be the a priori probability of class $C_i$, where i = 1 identifies the class relapse and i = 2 the class non-relapse, and let $p_{ii}$ be the conditional probability of correctly classifying a pattern of $C_i$ in $C_i$. Therefore, the correct classification probability is given by

\[
\begin{aligned}
p &= p(C_1)\,p_{11} + p(C_2)\,p_{22}
   = p(C_1)\int_{R^N} f_1(x)\,p(x/C_1)\,dx + p(C_2)\int_{R^N} f_2(x)\,p(x/C_2)\,dx \\
  &= p(C_1)\int_{\bar{A}} p(x/C_1)\,dx + p(C_2)\int_{A} p(x/C_2)\,dx,
     \qquad A = \{x : p(C_1/x) < p(C_2/x)\} \\
  &= p(C_1)\int_{R^N} p(x/C_1)\,dx + \int_{A} \left[\,p(C_2)\,p(x/C_2) - p(C_1)\,p(x/C_1)\right] dx \qquad (3)
\end{aligned}
\]

\[
\begin{aligned}
p &= p(C_1) + \int_{A} \left[\,p(C_2/x) - p(C_1/x)\right] p(x)\,dx
   = p(C_1) + \int_{A} \left[\,1 - 2\,p(C_1/x)\right] p(x)\,dx \\
  &= p(C_1) + \int_{A} \left|\,1 - 2\,p(C_1/x)\right| p(x)\,dx,
     \qquad \text{since } p(C_1/x) \le \tfrac{1}{2}, \;\forall x \in A
\end{aligned}
\]

In an analogous way we have

\[
p = p(C_2) + \int_{\bar{A}} \left[\,2\,p(C_1/x) - 1\right] p(x)\,dx
  = p(C_2) + \int_{\bar{A}} \left|\,2\,p(C_1/x) - 1\right| p(x)\,dx \qquad (4)
\]

From expressions (3) and (4) we obtain $p \ge \max\{p(C_1), p(C_2)\}$ and

\[
p = \frac{1}{2} + \int_{R^N} \left|\,p(C_1/x) - \frac{1}{2}\right| p(x)\,dx \qquad (5)
\]

Note that expression (3) provides an explicit expression for the correct classification probability in terms of the probability of assigning a pattern x to category $C_1$. The a posteriori density function $p(C_i/x)$ is unknown, so we have to estimate it to obtain an approximate correct classification probability. Funahashi [9] proves theoretically that three-layer neural networks with at least 2n hidden units have the capability of approximating the a posteriori probability in the two-category classification problem with arbitrary accuracy, and that the network output tends to the a posteriori probability as back-propagation learning proceeds ideally. Thus, we have

\[
F(x; t, w) \approx p(C_1/x) \qquad (6)
\]

where $F(x; t, w)$ is the network output for an input pattern x, and t and w are the synaptic weight matrices. Hence, the approximate Bayes decision rule is given by the expression

\[
\tilde{f}(x) =
\begin{cases}
1 & \text{if } F(x; t, w) \ge \tfrac{1}{2} \\
0 & \text{if } F(x; t, w) < \tfrac{1}{2}
\end{cases}
\qquad (7)
\]

which gives the probability of classifying the pattern x in class $C_1$. If $\tilde{f}(x) = 1$, pattern x is classified in class $C_1$, and if $\tilde{f}(x) = 0$, it is classified in $C_2$. From expressions (5) and (6), we obtain an estimate $\hat{p}$ of the correct classification probability given by the expression

\[
\hat{p} = \frac{1}{2} + \frac{1}{n}\sum_{i=1}^{n} \left|\,F(x_i; t, w) - \frac{1}{2}\right| \qquad (8)
\]

where n is the number of patients. Note that p is an upper bound for the probability of making a correct classification with any given decision rule, since p has been determined by Bayes rule. Thus, we estimate p using the neural network paradigm, that is, by the outputs of a multi-layer neural network $F(x; t, w)$. Note that the variance of the estimate $\hat{p}$ satisfies $\mathrm{var}(\hat{p}) \le 1/n$, and so we have an accurate estimate. Note also that $\hat{p}$ can be used to check the degree of difficulty of a classification problem. The probabilities $p_{11}$ and $p_{22}$ can also be estimated with the multi-layer neural network as

\[
\hat{p}_{11} = \frac{1}{m} \sum_{\{x \in C_1 \,:\, F(x; t, w) \ge 1/2\}} F(x; t, w)
\]

\[
\hat{p}_{22} = \frac{1}{n - m} \sum_{\{x \in C_2 \,:\, F(x; t, w) < 1/2\}} \left(1 - F(x; t, w)\right)
\]

where m is the number of patients that suffer a relapse.

The prognosis system

Taking into account: (1) the importance of the evolution of the prognostic factors' strength over time, and the existence of a peak of recurrence in the relapse distribution; (2) the CIDIM algorithm analysed in Section 2.3 to select the most significant prognostic factors; and (3) the justification of the proposed decision rule in expression (7), a solution scheme is proposed based on specific topologies of neural networks combined with decision trees for different time intervals during the follow-up time of the patients, and a threshold unit to implement the decision-making process (Fig. 1).

Decision trees

The decision trees unit leads to the selection of the most significant prognostic factors from the patients database for every time interval. These subsets of prognostic factors constitute the kernel of the prognostic factors selector (PFS in Fig. 1).
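As an aside on the estimates defined earlier in this section, expression (8) and the per-class estimates of $p_{11}$ and $p_{22}$ reduce to simple averages over the network outputs. The sketch below is illustrative only: `bayes_estimates` is a hypothetical name, the outputs are assumed to be plain floats in [0, 1], and the thresholds (output at least 1/2 for relapse, below 1/2 for non-relapse) are taken from expression (7).

```python
def bayes_estimates(outputs, relapsed):
    """Estimate the correct classification probability and per-class rates.

    outputs[i] is the network output F(x_i; t, w) for patient i,
    relapsed[i] is True when patient i belongs to class C_1 (relapse).
    """
    n = len(outputs)
    m = sum(relapsed)  # number of patients that suffer a relapse
    # Expression (8): 1/2 plus the mean distance of the outputs from 1/2.
    p_hat = 0.5 + sum(abs(f - 0.5) for f in outputs) / n
    # Per-class estimates, thresholded as in expression (7).
    p11 = sum(f for f, r in zip(outputs, relapsed) if r and f >= 0.5) / m
    p22 = sum(1.0 - f for f, r in zip(outputs, relapsed)
              if not r and f < 0.5) / (n - m)
    return p_hat, p11, p22


p_hat, p11_hat, p22_hat = bayes_estimates(
    [0.9, 0.6, 0.2, 0.1], [True, False, False, True]
)
```

Outputs close to 0 or 1 push the estimate towards 1 (an easy problem), while outputs clustered around 1/2 push it towards 1/2 (a hard problem), which is why the estimate can serve as a difficulty measure.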
Given a new patient for whom predictions have to be made, and the corresponding time interval under study, the PFS extracts the appropriate input subset of prognostic factors for the neural networks system, in order to obtain a good estimate of the correct classification probability of patient relapse after breast cancer surgery.

The neural networks system

The neural networks system takes a set of attributes from the prognostic factors selector and outputs a value corresponding to the a posteriori probability of relapse for the patient under study. The main common characteristics of the networks employed are shown in Table 5. Input layers, corresponding to every neural network selected for the different time intervals under study, have as many elements as the number of selected attributes that appear in Table 6, column #2. The middle or hidden layers have 14, 19, 14, 15, 17, 13, and 10 elements, respectively, with logistic transfer functions. These numbers of elements were determined using a cascade learning constructive process, adding neurons to the hidden layer one at a time until there was no further improvement in network performance. The output layers have one logistic element corresponding to the single dependent variable. The output elements predict the relapse probability by means of their numerical output (ranging from 0 to 1). Connection weights are changed using a Levenberg-Marquardt error back-propagation algorithm [21] and the learning constant was set to [...]. Weight initialisation is crucial in the learning process with artificial neural networks. In order to obtain a realistic estimate of the correct classification probability, 30 weight initialisations were carried out, and the average and standard deviation of the runs are presented.

Table 5
Common characteristics of the ANN models used

Network topology: Multilayer perceptron, full connectivity
Learning algorithm: Levenberg-Marquardt
Learning rule: Generalised delta rule
Input data: Attributes thought to be risk factors
Output data: Relapse probability

Presenting the information to the neural network input layer requires a preprocessing step. First, it is important to normalise the ranges of all the prognostic factors to lie within the central range of the hidden layer transfer function in the neural network (-1.0 and 1.0 for the hyperbolic tangent transfer function), and second, to study the range and distribution of each prognostic variable in order to remove all the missing values and to lessen the impact of outliers at the extremes of the distribution. A crucial aspect of carrying out learning and prediction analysis with a neural network system is to split the database into two independent sets: the training set (80% of the dataset), which is used to train the neural network, and the test set (20% of the dataset), which is used to validate its predictive performance.
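The two preprocessing steps just described, rescaling each prognostic factor and holding out 20% of the cases, might look as follows. This is a sketch under stated assumptions rather than the procedure used in the study: a linear min-max rescaling to [-1, 1] is assumed (the text does not say which normalisation was applied), and both function names are hypothetical.

```python
import random


def scale_to_unit_range(values):
    """Linearly rescale a list of numbers to [-1.0, 1.0] (assumed min-max scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable carries no information; map it to the centre.
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and return (training set, test set)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


scaled_ages = scale_to_unit_range([30, 50, 70])
train_rows, test_rows = train_test_split(list(range(10)))
```

Fixing the random seed keeps the split reproducible, so that repeated weight initialisations are compared on the same held-out cases.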
During training, the data vectors of the training set are repeatedly presented to the network, which attempts to generate a 1 at the output unit when the survival status of the patient is relapse, and a 0 when the status is non-relapse. As the networks were trained, the mean square error between the survival status variable (supervisory variable) and the dependent output variable decreased with an increasing number of epochs: first it decreases rapidly, and then it continues to decrease slowly as the network makes its way to a local minimum. When good generalisation is the goal, the network ends up overfitting the training data if the training session is not stopped at the correct point.

Table 6. Results of the prognosis system based on neural networks and decision trees
Time interval | Selected attributes | No. of patients | PCP | BCP | NNCP
I1 (0-10) | Ag, Ma, Fp, An, Ts, Pn, P53 | | | (0.01) | (0.01)
I2 (10-20) | Ag, Ma, Mg, An, Ts, Gr, Er, P53 | | | (0.02) | (0.02)
I3 (20-30) | Ag, An, Ts, Gr, Er, P53 | | | (0.09) | (0.10)
I4 (30-40) | Ag, An, Ts, Er, Pr, Ps, P53 | | | (0.05) | (0.04)
I5 (40-50) | Ag, An, Ts, Er, Pr, Ps, P53 | | | (0.01) | (0.01)
I6 (50-60) | Ag, An, Ts, Er, Gr | | | (0.06) | (0.07)
I7 (>60) | Ag, An, Ts, Gr | | | (0.02) | (0.02)
Number of patients, selected attributes (prognostic factors) and PCP, averages (and standard deviations) of BCP and NNCP probabilities obtained in all patient follow-up time intervals (in months).

The procedure used to avoid overfitting was the early stopping method of training [3], which identifies the onset of overfitting by means of the hold-out method: the training set is split into an estimation subset (80% of the training set) and a validation subset (20% of the training set). The estimation subset was used to train each network of the system for up to 2000 epochs (the total number of training iterations depended on the number of patients selected for each time interval), but the training sessions were stopped periodically, the weight matrices were saved to files, and the networks were tested on the validation subsets after each training period. The early stopping points were found by plotting together the estimation learning curve, which decreased monotonically, and the validation learning curve, which decreased monotonically to a minimum and then started to increase as training continued. This minimum was reached after a different number of epochs for each time interval under study (285, 224, 192, 179, 168, 130, and 163). The optimally trained neural networks were then tested for their ability to predict breast cancer relapse in the test set.

To evaluate the proposed model, a standard technique of stratified tenfold cross-validation was used [13]. This technique divides the patient dataset into 10 sets of approximately equal size and equal distributions of recurrent and non-recurrent patients. Each of the 10 random subsets of the data serves as a test set for the prognostic model trained with the remaining nine partitions.
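The two procedures above, choosing the early stopping point from periodically saved checkpoints and building stratified folds for tenfold cross-validation, can be sketched as follows. This is a minimal illustration under stated assumptions and not the authors' code: the checkpoint epochs, validation errors, and class counts are hypothetical, and a majority-class scorer stands in for training a neural network.

```python
import random


def early_stopping_index(validation_errors):
    """Index of the checkpoint with the lowest validation error (first on ties)."""
    return min(range(len(validation_errors)), key=lambda i: validation_errors[i])


def stratified_folds(labels, k=10, seed=0):
    """Deal patient indices into k folds, class by class, so that every fold
    keeps roughly the same share of relapse and non-relapse patients."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    dealt = 0  # continue the round-robin across classes to balance fold sizes
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for index in members:
            folds[dealt % k].append(index)
            dealt += 1
    return folds


def cross_validated_score(labels, train_and_score, k=10, seed=0):
    """Average of k experiments, each testing on one fold after training on the rest."""
    folds = stratified_folds(labels, k, seed)
    scores = []
    for test_idx in folds:
        train_idx = [i for fold in folds if fold is not test_idx for i in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / k


# Hypothetical validation errors measured at periodically saved checkpoints.
checkpoint_epochs = [50, 100, 150, 200, 250, 300]
validation_errors = [0.30, 0.22, 0.18, 0.17, 0.19, 0.23]
stop_epoch = checkpoint_epochs[early_stopping_index(validation_errors)]

# Hypothetical cohort: 70 non-relapse (0) and 30 relapse (1) patients.
status = [0] * 70 + [1] * 30


def majority_class_accuracy(train_idx, test_idx):
    """Stand-in for a trained network: always predict the majority training class."""
    relapses = sum(status[i] for i in train_idx)
    majority = 1 if relapses * 2 > len(train_idx) else 0
    return sum(1 for i in test_idx if status[i] == majority) / len(test_idx)


folds = stratified_folds(status, k=10)
mean_accuracy = cross_validated_score(status, majority_class_accuracy, k=10)
```

In a real experiment, `train_and_score` would train one network on the training indices (with its own early stopping) and return its accuracy on the held-out fold.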
The overall prediction accuracy for the system is then assessed as the average of the 10 experiments.

Threshold unit

The threshold unit outputs a class for survival status according to the proposed decision rule in expression (7). To obtain an appropriate classification accuracy, expressed as the percentage of patients in the test set that were classified correctly, a cut-off between 0 and 1 had to be chosen before any output of the network (ranging from 0 to 1) could be interpreted as a prediction of breast cancer relapse.

ROC analysis and Cox regression

For medical applications, classification accuracy is not necessarily the best quality measure of a classifier. Two other measures are more frequently used: sensitivity and specificity. Sensitivity measures the fraction of positive cases that are classified as positive. Specificity measures the fraction of negative cases that are classified as negative. For many medical problems, high classification accuracy is less important than high sensitivity and/or specificity of the classifier system. A receiver operating characteristic (ROC) curve shows the trade-off that can be achieved between the false alarm rate (1 - specificity, plotted on the X-axis), which needs to be minimised, and the detection rate (sensitivity, plotted on the Y-axis), which needs to be maximised.

Although we mentioned in Section 1 that traditional statistical techniques for survival analysis are not suitable for this problem, we think that a comparison of the proposed model against the Cox statistical technique, which is the one actually used by medical experts, is appropriate in order to justify and demonstrate the usefulness and power of the proposed combined model. Cox regression modelling was performed using SPSS statistical software.

4. Results and discussion

Table 6 shows the number of patients and the prognostic factors selected for training the neural networks system in every time interval (in months) of patient follow-up. After the patient database is processed by the decision trees system (CIDIM algorithm), certain attributes appear as the most significant prognostic factors (second column in Table 6) and become the input to the artificial neural networks system. The decision trees system makes the attribute selection process objective, in contrast with the subjective process carried out by experts on clinical data suspected of being risk factors for breast cancer prognosis. The results of applying the MAXR forward selection procedure to the selection of the relevant prognostic factors are compared with those found by the proposed CIDIM algorithm in Table 7. This table shows that the CIDIM algorithm chooses, for each time interval, a larger and more varied set of attributes thought to be significant prognostic factors. This means that CIDIM performs a finer-grained selection of the most important prognostic factors.

Table 6 also shows a comparison among the a priori classification probability (PCP) of the dataset, the estimate of the correct classification probability of Bayes (BCP), and the classification probability obtained by applying the decision rule proposed in expression (4) (NNCP).
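The threshold unit and the ROC quantities defined in the methods can be illustrated with a short sketch that turns network outputs into classes at a given cut-off, computes sensitivity and specificity, and picks the candidate cut-off with the lowest overall error. It is an illustration only: the output probabilities, the true statuses, and the three candidate cut-offs are hypothetical, and the decision rule used here (predict relapse when the output is at least the cut-off) is an assumption that may differ from the paper's expression.

```python
def confusion_counts(outputs, statuses, cutoff):
    """Count (TP, FP, TN, FN) when an output >= cutoff is read as relapse."""
    tp = fp = tn = fn = 0
    for output, status in zip(outputs, statuses):
        predicted = 1 if output >= cutoff else 0
        if predicted == 1 and status == 1:
            tp += 1
        elif predicted == 1 and status == 0:
            fp += 1
        elif predicted == 0 and status == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity_specificity(tp, fp, tn, fn):
    """Fraction of positives found and fraction of negatives found."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


def roc_points(outputs, statuses, cutoffs):
    """One (cutoff, false alarm rate, detection rate, overall error) per cut-off."""
    points = []
    for cutoff in cutoffs:
        tp, fp, tn, fn = confusion_counts(outputs, statuses, cutoff)
        sensitivity, specificity = sensitivity_specificity(tp, fp, tn, fn)
        error = (fp + fn) / len(statuses)
        points.append((cutoff, 1.0 - specificity, sensitivity, error))
    return points


def best_cutoff(points):
    """Cut-off whose ROC point has the lowest overall error (first on ties)."""
    return min(points, key=lambda point: point[3])[0]


# Hypothetical network outputs and true statuses (1 = relapse) for ten patients.
outputs = [0.05, 0.15, 0.2, 0.3, 0.45, 0.55, 0.65, 0.8, 0.85, 0.95]
statuses = [0, 0, 0, 0, 0, 1, 1, 0, 1, 1]

points = roc_points(outputs, statuses, cutoffs=[0.25, 0.5, 0.75])
chosen = best_cutoff(points)
counts = confusion_counts(outputs, statuses, chosen)
sensitivity, specificity = sensitivity_specificity(*counts)
```

Plotting the second element of each point against the third gives the ROC curve; a finer grid of cut-offs gives a smoother curve.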
Because no single network output between 0 and 1 served as a perfect cut-off prediction for breast cancer relapse, the accuracy results for NNCP have been complemented with a ROC analysis. No theoretical work defines how the appropriate cut-off prediction for network processing of a test file should be determined. Thus, 10 equally spaced cut-off predictions were examined in the range [values missing]. The true and false positives and negatives, the sensitivity, the specificity, and the positive and negative predictive values were calculated for each cut-off prediction, and the point on the ROC curve that minimises the overall error was identified for each time interval of patient follow-up (Table 8). The fractional results in Table 8 are a consequence of the 10 repetitions used for each cross-validation partition.

Table 7. Comparison of MAXR procedure against CIDIM algorithm for selecting the most significant prognostic factors for each time interval
Time interval | MAXR forward selection procedure | CIDIM algorithm
I1 (0-10) | Ag, An, Ts | Ag, Ma, Fp, An, Ts, Pn, P53
I2 (10-20) | Ag, An, Ts, Gr | Ag, Ma, Mg, An, Ts, Gr, Er, P53
I3 (20-30) | Ag, An, Ts, Gr | Ag, An, Ts, Gr, Er, P53
I4 (30-40) | Ag, An, Ts | Ag, An, Ts, Er, Pr, Ps, P53
I5 (40-50) | Ag, An, Ts | Ag, An, Ts, Er, Pr, Ps, P53
I6 (50-60) | Ag, An, Ts, Gr | Ag, An, Ts, Er, Gr
I7 (>60) | Ag, An, Ts | Ag, An, Ts, Gr

Table 8. Test results by ROC analysis for the prognosis model based on neural networks and decision trees
Time interval | False negative | False positive | Positive predictive value | Negative predictive value | Sensitivity | Specificity
I1 (0-10) |
I2 (10-20) |
I3 (20-30) |
I4 (30-40) |
I5 (40-50) |
I6 (50-60) |
I7 (>60) |

To provide a better reference for the fitness of the proposed system, the PCP, BCP, and NNCP indexes are plotted together in Fig. 2, whose analysis yields some important results. First, the proposed system (NNCP) always improves on the a priori probability (PCP). It is important to point out how difficult this is, given the high values of PCP for each time interval. Moreover, this improvement is greatest in the most critical interval of patient follow-up (I2 in Table 6) [2]. Second, NNCP is always smaller than BCP and follows the shape of BCP, as expected. In addition, the difference between the two is not significant, which means that the rule proposed in expression (5) is a good estimator of the Bayes decision rule.

Fig. 2. The estimate of the correct classification probability of Bayes (BCP), the correct classification probability obtained with the proposed neural networks system (NNCP) and the a priori correct classification (PCP) for each time interval under study.

Finally, the predictive ability of the neural network system (Fig. 1, prognosis phase) was compared with that of the Cox model. Using Cox's model for prediction, the probability of recurrence in patients was estimated within seven different time intervals from the surgical intervention (Table 9).

Table 9. Comparison of classification accuracy for each time interval between Cox regression model and the prognosis model based on neural networks and decision trees
Prognosis model | I1 (0-10) | I2 (10-20) | I3 (20-30) | I4 (30-40) | I5 (40-50) | I6 (50-60) | I7 (>60)
NNCP |
Cox model |

5. Conclusions

This paper presents a decision-support tool for the prognosis of breast cancer relapse using clinical pathological data. We propose a model that combines a novel TDIDT algorithm (CIDIM) with a system composed of different neural network topologies to approximate the Bayes optimal error for the prediction of patient relapse after breast cancer surgery. The CIDIM algorithm selects the most relevant prognostic factors for the accurate prognosis of breast cancer, while the neural networks system takes these selected variables as input in order to reach a good correct classification probability. We also present a new method for estimating the Bayes optimal error using the neural network paradigm, and a new methodology for processing censored data when the time of patient follow-up is discretised into different time intervals. The proposed method is useful for the medical expert mainly under the following circumstances: (1) when the data contain a large number of attributes with missing values; (2) when not only prediction accuracy but also additional knowledge is required about the most significant prognostic factors for each time interval; and (3) when the significance of the prognostic factors is not the same over the time of patient follow-up, so that the use of survival estimation techniques is not very advisable.
Currently, our research group is working on improving the correct classification probability by different means: (1) introducing a methodology based on genetic algorithms for the automatic induction of appropriate neural network topologies; (2) constructing modular neural network architectures and analysing their generalisation properties; and (3) handling certain attributes (for example, grade, ploidy, estrogen receptors) that have so far been converted into discrete values, although their conceptual vagueness could be quantified by the degree of membership of a numerical value in a fuzzy set. Their values would then form a user-defined finite set of linguistic values, and a fuzzy neural network system would be necessary to work with these special attributes. Based on the results achieved in this work, we hope that clinicians will be able to use artificial neural networks combined with decision trees to search through large datasets for subtle patterns in prognostic factors, and that this may further assist the selection of appropriate adjuvant treatments for the individual patient.

Acknowledgements

We would like to thank the referees for their valuable comments and suggestions, and also the Oncology Service staff of the Hospital Clínico Universitario of Málaga for their comments and collaboration in this work. This work has been partially supported by the FRESCO project, number PB C04-01, of CICYT Spain.

References

[1] Abbass HA, Towsey M, Finn G. C-Net: a method for generating non-deterministic and dynamic multivariate decision trees. Know Inform Syst 2001;5(2):
[2] Alba E, et al. Estructura del patron de recurrencia en el cancer de mama operable (CMO) tras el tratamiento primario. Implicaciones acerca del conocimiento de la historia natural de la enfermedad. In: Proceedings of the 7th Congreso de la Sociedad Española de Oncología Médica, Barcelona, Spain,
[3] Amari S, Murata N, Muller KR, Finke M, Yang H. Statistical theory of overtraining: is cross-validation asymptotically effective? Adv Neural Inform Process Syst 1996;8:
[4] Baxt WG. Application of neural networks to clinical medicine. Lancet 1995;346:
[5] Buntine W, Nibblett T. A further comparison of splitting rules for decision-tree induction. Mach Learn 1992;8:
[6] Cox DR. Regression models and life tables. J R Stat Soc 1972;34:
[7] D'Alché-Buc F, Zwierski D, Nadal J. Trio learning: a new strategy for building hybrid neural trees. Neural Syst 1994;5(4):
[8] Duda RO, Hart PE. Pattern classification and scene analysis. New York: Wiley;
[9] Funahashi K. Multilayer neural networks and Bayes decision theory. Neural Networks 1998;11:
[10] Gorman RP, Sejnowski TJ. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks 1988;1:
[11] Grumett S, Snow P. Artificial neural networks: a new model for assessing prognostic factors. Ann Oncol 2000;11:
[12]
[13] Janssen P, et al. Model structure selection for multivariable systems by cross-validation. Int J Control 1988;47:
[14] Jefferson M, Pendleton N, Lucas B, Horan M. Comparison of a genetic algorithm neural network with logistic regression for predicting outcome after surgery for patients with nonsmall cell lung carcinoma. Am. Cancer Soc. (Atlanta)
[15] Kaplan SA, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc 1958;53:
[16] Kattan MW, Hess KR, Beck JR. Experiments to determine whether recursive partitioning (CART) or an artificial neural network overcomes theoretical limitations of Cox proportional hazards regression. Comput Biomed Res 1998;31(5):
[17] Lucas PJF, Abu-Hanna A. Prognostic methods in medicine. Artif Intell Med 1999;15(2): (editorial).
[18] McGuire WL, Tandom AT, Allred DC, Chamnes GC, Clark GM. How to use prognostic factors in axillary node-negative breast cancer patients. J Natl Cancer Inst 1990;82:
[19] Michalski R, Carbonell JG, Mitchell TM. Machine learning, an artificial intelligence approach. Palo Alto: Tioga Press;
[20] O'Neill M. Training back-propagation neural networks to define and detect DNA-binding sites. Nucl Acids Res 1991;19:
[21] Patterson DW. Artificial neural networks, theory and applications. Singapore: Prentice Hall;
[22] Pesonen E, Eskelinen M, Juhola M. Comparison of different neural networks algorithms in the diagnosis of acute appendicitis. Int J Biomed Comput 1996;40:
