How to Create Better Performing Bayesian Networks: A Heuristic Approach for Variable Selection

Esma Nur Cinicioglu * and Gülseren Büyükuğur

Istanbul University, School of Business, Quantitative Methods Division, Avcilar, 34322, Istanbul, Turkey
esmanurc@istanbul.edu.tr, gulsayici@gmail.com

Abstract. Variable selection in Bayesian networks is necessary to assure the quality of the learned network structure. Cinicioglu & Shenoy (2012) suggested an approach for variable selection in Bayesian networks in which a score, S_j, is developed to assess whether each variable should be included in the final Bayesian network. However, with this method the variables without parents or children are punished, which affects the performance of the learned network. To eliminate that drawback, in this paper we develop a new score, NS_j. We measure the performance of this new heuristic in terms of the prediction capacity of the learned network and its lift over the marginal model, and evaluate its success by comparing it with the results obtained by the previously developed S_j score. For the illustration of the developed heuristic and the comparison of the results, credit score data is used.

Keywords: Bayesian networks, Variable selection, Heuristic.

1 Introduction

The upsurge of popularity of Bayesian networks brings a parallel increase in research on structure learning algorithms for Bayesian networks from data sets. The ability of Bayesian networks to represent the probabilistic relationships between variables is one of the main reasons for the rise in reputation of Bayesian networks as an inference tool. This also generates the major appeal of Bayesian networks for data mining. With the advancement and diversification of structure learning algorithms, more variables may be incorporated into the learning process, bigger data sets may be used for learning, and inference becomes faster even in the presence of continuous variables.
The progress achieved on structure learning algorithms for Bayesian networks is encouraging for the increasing use of Bayesian networks as a general decision support system, a data mining tool and for probabilistic inference. On the other hand, though the quality of a learned network may be evaluated by many different aspects, the performance of the learned network very much depends on the selection of the variables to be included in the network. Depending on the purpose of the application, the characteristics of an application may differ and hence the expectations from a Bayesian network's performance may vary. Therefore, to ensure ending up with a Bayesian

* Corresponding author.

A. Laurent et al. (Eds.): IPMU 2014, Part I, CCIS 442, pp. 527-535, 2014. © Springer International Publishing Switzerland 2014
network of high quality, variable selection in Bayesian networks should constitute an important dimension of the learning process.

There is a considerable literature in statistics on measures like AIC, BIC, Mallows' C_p statistic, etc. that are used for variable selection in statistical models. These measures have been adopted by the machine learning community for evaluating score-based methods for learning Bayesian network models (Scutari, 2010). However, these scores are used as a measure of the relative quality of the learned network and do not assist in the variable selection process. Additionally, as discussed in Cui et al. (2010), traditional methods of stepwise variable selection do not consider the interrelations among variables and may not identify the best subset for model building. Despite the interest in structure learning algorithms and the adaptation of different measures for the evaluation of the resulting Bayesian networks, variable selection in Bayesian networks is a topic which needs further attention from researchers.

Previously, Koller and Sahami (1996) elaborated on the importance of feature selection and stated that the goal should be to eliminate a feature if it gives us little or no additional information. Hruschka et al. (2004) described a Bayesian feature selection approach for classification problems. In their work, first a BN is created from a dataset and then the Markov blanket of the class variable is used for the feature subset selection task. Sun & Shenoy (2007) provided a heuristic method to guide the selection of variables in naïve Bayes models. To achieve this goal, the proposed heuristic relies on correlations and partial correlations among variables. Another heuristic developed for variable selection in Bayesian networks was proposed by Cinicioglu & Shenoy (2012). With this heuristic a score called S_j was developed which helps to determine the variables to be used in the final Bayesian network.
By this heuristic, first an initial Bayesian network is developed with the purpose of learning the conditional probability tables (cpts) of all the variables in the network. The cpts indicate the association of a variable with the other variables in the network. Using the cpt of each variable, its corresponding S_j score is calculated. In their paper Cinicioglu & Shenoy (2012) illustrate that by applying the proposed heuristic the performance of the learned network in terms of prediction capacity may be improved substantially.

In this paper we first discuss the S_j score, and then identify the problem that, though the S_j score demonstrates a sound performance on prediction capacity, its formula leads to the problem that the variables without parents or children in the network are punished, which in turn affects the overall performance of the heuristic. To eliminate that drawback, in this paper we suggest a modified version of the S_j score, called NS_j. We measure the performance of this new score in terms of the prediction capacity of the learned network and its lift compared to the marginal model, and evaluate its success by comparing it with the results obtained by the previously developed S_j score. For the illustration of the developed heuristic and the comparison of the results, credit score data is used.

The outline of the remainder of the paper is as follows: The next section gives details about the credit data set used for the application of the proposed heuristic. In Section 3 the development of the new heuristic is explained, where the S_j and NS_j scores are discussed in detail in Subsections 3.1 and 3.2 respectively. In Section 4, using both of the variable selection scores S_j and NS_j, different Bayesian networks are created. The performance results of these two heuristics are compared in terms of the prediction capacity and the improvement rates obtained compared to the marginal model.
2 Data Set

The data set used in this study is a free data set, called the German credit data, provided by the UCI Machine Learning Repository (Murphy & Aha, 1994). The original form of the data set contains the information of 1000 customers on 20 different attributes, 13 categorical and 7 numerical, giving the information necessary to evaluate a customer's eligibility to get credit. Before the use of the data set for the application of the proposed heuristics, several changes were made to the original data set. In this research, the German credit data set is transformed into a form where the numerical attributes Duration in month, Credit amount, Installment rate in percentage of disposable income, Present residence since, Age in years, Number of existing credits at this bank and Number of people being liable to provide maintenance for are discretized. The variable Personal status and sex is divided into two categorical variables, Personal status and Sex. In the original data set the categorical variable Purpose contains eleven different states. In this paper some of these states are joined together: car and used car as car; furniture, radio and domestic appliances as appliances; and retraining and business as business, resulting in seven different states at the end. The final data set used in this study consists of 21 columns and 1000 rows, referring to the number of variables and cases respectively.

3 Development of the Proposed Heuristic

3.1 S_j Score

The heuristic developed by Cinicioglu & Shenoy (2012) is based on the principle that a good prediction capacity of a Bayesian network depends on the choice of variables that have high associations with each other. A marginal variable present in a network will not have any dependencies with the remaining variables in the network and thus won't have any impact on the overall performance of the network.
In that instance, an arc learned using an existing structure learning algorithm shows the dependency of a child node on its parent node, hence a proof of association. However, not all variables which do not place themselves as marginals can be incorporated into the final Bayesian network. The idea is to develop an efficient heuristic for variable selection where the Bayesian network created using the selected variables will show a superior prediction performance compared to the random inclusion of variables in the network. Besides, though the presence of an arc shows the dependency relationship between two variables in the network, the degree of association is not measured there and may vary quite differently among variables. A natural way to examine the association of a variable with the other variables considered for inclusion in the final Bayesian network is to learn an initial Bayesian network structure first and then use the conditional probability table of each variable as a source of measurement for the degree of association.
Applying the distance measure to the conditional probability table of a variable, the degree of change in the conditional probabilities of a child node depending on the states of its parents may be measured. In that instance a high average distance indicates that the conditional probability of the variable considered changes a great deal depending on the states of its parents. Thus, a high average distance is an indication of the high association of a child node with its parents. The average distance of each variable may be calculated using the formula given below. Here d represents the average distance of the variable of interest with its parents, p and q stand for the conditional probabilities of this variable for two different states of its parents, i stands for the different states of the child node and n stands for the number of states of the set of parent nodes.

d = (1/n) Σ ( Σ_i |p_i − q_i| / 2 )    (1)

However, there may be variables in the network which do not have a high level of association with their parent nodes but do possess a high association with their children. Basing the selection process solely on the average distance of each variable will deteriorate the performance of the network created. Besides, while the average distance obtained from the cpt of a variable shows the degree of association of a child node with its parents, the same average distance also shows the degree of association of a parent node with its child, jointly with the child's other parents. Following this logic Cinicioglu & Shenoy (2012) developed the S_j score given in Equation (2) below. In this formula the S_j score of a variable j is the sum of the average distance of this variable, d_j, and the average of the average distances of its children. Here d_ij denotes the average distance of the child variable i of the variable j and c_j denotes the number of j's children.

S_j = d_j + (1/c_j) Σ_i d_ij    (2)

Consider Table 1 given below. This table is the cpt of the variable Credit Amount.
Using the formula given in Equation (1) the average distance of this variable is calculated as 0.0107. Considering Figure 1 given below, we see that Credit Amount possesses three children. Hence, in order to calculate the S_j score of Credit Amount we need to find the average distances of the children, average them and then add this to the average distance of Credit Amount. A high S_j score is desired as an indication of high association with other variables. Ideally, according to the heuristic, the variable with the lowest S_j score will be excluded from the analysis and a new BN will be created with the remaining variables. This network will include the new cpts which will be the basis for the selection of the next variable to be excluded from the network. This process is repeated until the desired number of variables is obtained. This repeated process is the ideal way of applying the heuristic; however, if not automated, it will require a great deal of time. In the following Subsection 3.2 the shortcomings of the S_j score are discussed. As a modification of the S_j score to handle the problems involved with the old variable selection method, a new score called NS_j is suggested.
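As an illustration, the computations in Equations (1) and (2) can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: it assumes the average in Equation (1) is taken over pairs of parent configurations, and all function names are our own.

```python
from itertools import combinations

def avg_distance(cpt):
    """Average distance d of a variable (Eq. 1, sketch).
    cpt: one row per parent configuration, each row a probability
    distribution over the child's states. Assumption: the average is
    taken over all pairs of parent configurations."""
    pairs = list(combinations(cpt, 2))
    if not pairs:  # no parents: a single marginal row, distance is zero
        return 0.0
    total = sum(sum(abs(p - q) for p, q in zip(row_p, row_q)) / 2
                for row_p, row_q in pairs)
    return total / len(pairs)

def s_score(d_j, child_distances):
    """S_j = d_j + average of the children's average distances (Eq. 2)."""
    if not child_distances:  # variable without children (Eq. 4)
        return d_j
    return d_j + sum(child_distances) / len(child_distances)
```

For example, a two-row cpt [[0.8, 0.2], [0.6, 0.4]] yields an average distance of 0.2 under this sketch.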
Table 1. Cpt of the variable Credit Amount

                      Credit Amount
Telephone   0-4000   4000-8000   8000-12000   12000-16000   16000-20000
None        0.8286   0.1364      0.0300       0.0033        0.0017
Yes         0.6308   0.2347      0.0807       0.0489        0.0049

Fig. 1. Variable Credit Amount with its three children and calculation of S_Credit Amount

3.2 A New Variable Selection Score: NS_j

The heuristic developed by Cinicioglu & Shenoy (2012) tries to identify the variables which possess a high level of association with their parents and children. With that purpose, the variable selection score developed, S_j, is comprised of two parts: S_j is the sum of the average distance of the variable of interest and the average of the average distances of its children. This way, with the S_j score the variable is evaluated by considering both the association with its parents and also with its children. However, this approach also has the drawback that the variables without parents or children are penalized for inclusion in the final Bayesian network. Consider the formula of the S_j score given in Equation (2). A variable without parents will only have a marginal probability distribution, not a cpt, and thus its average distance will be considered as zero. Similarly, for a variable which does not have any children, the S_j score will be equal to its average distance. The resulting S_j scores for a variable without parents and for a variable without children are given in Equations (3) and (4) respectively.

For a variable j without parents:   S_j = (1/c_j) Σ_i d_ij    (3)

For a variable j without children:   S_j = d_j    (4)

As illustrated above, because of the formulation of the S_j score, variables which do not possess parents or children will be punished in the variable selection process. If such a variable which lacks parents or children has a strong association with the present part
(its parents or children, depending on the case), then this selection process may lead to the creation of networks with lower performance. To overcome this problem, in this research a modified version of the S_j score, NS_j, is presented. For variables which lack either parents or children the score will remain the same as the old one. For variables which possess both parents and children, on the other hand, NS_j will be equal to half of the old S_j score. These two cases are formulated in Equations (5) and (6) given below.

For a variable j without parents or children:   NS_j = S_j    (5)

For a variable j with both parents and children:   NS_j = S_j / 2    (6)

The variables which don't have any parents or children will be eliminated from the network. In the following section both of these heuristics will be used to learn BNs from the credit data set introduced in Section 2, and their performance will be evaluated in terms of the prediction capacity and the improvement obtained compared to the marginal model.

4 Evaluation of the Proposed Heuristic

In this section the performances of the variable selection scores S_j and NS_j are compared. The evaluation is made in terms of the prediction capacity and improvement of the BNs created using the suggested scores. For the application of the heuristic, first, it is necessary to learn an initial BN from the data set. For illustration and evaluation of the suggested scores the credit data set given in Section 2 will be used. For learning BNs from the data set WinMine, software developed by Microsoft Research (Heckerman et al., 2000), is used. The main advantage of WinMine is its ability to automatically calculate log-scores and lifts over marginal of the learned BNs. The log-score is a quantitative criterion to compare the quality and performance of the learned BNs. The formula for the log score is given below:

log score = −(1/(n·N)) Σ_k log_2 p(x_k | model)    (7)

where n is the number of variables, and N is the number of cases in the test set.
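The NS_j rule in Equations (5) and (6), together with the elimination of fully disconnected variables, can be sketched as follows; the function and argument names are our own:

```python
def ns_score(s_j, has_parents, has_children):
    """NS_j from S_j (Eqs. 5 and 6, sketch). Returns None for a
    variable with neither parents nor children, which is eliminated
    from the network."""
    if not has_parents and not has_children:
        return None                 # fully disconnected: drop the variable
    if has_parents and has_children:
        return s_j / 2              # Eq. (6): has both parents and children
    return s_j                      # Eq. (5): lacks parents or children
```

Halving the score only for fully connected variables removes the systematic advantage that S_j gives them over variables which contribute only one of the two terms in Equation (2).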
For the calculation of the log-score, the dataset is divided into a 70/30 train and test split (1) and the accuracy of the learned model on the test set is then evaluated using the log score.

(1) In WinMine only the percentage of the test/training data may be determined. Using a different software package in further research, 10-fold cross validation would increase the validity of the results.
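The evaluation step can be sketched as follows, assuming the log-score of Equation (7) is the negated per-variable, per-case log2 likelihood of the test set, so that lower scores are better and the lift is the marginal model's score minus the provided model's score; the names and the sign convention are our own:

```python
import math

def log_score(test_cases, model_prob, n_vars):
    """Eq. (7), sketch: negated average log2 likelihood per variable
    per case. test_cases: complete test cases; model_prob(case): the
    model's joint probability of a case; n_vars: number of variables."""
    N = len(test_cases)
    total = sum(math.log2(model_prob(case)) for case in test_cases)
    return -total / (n_vars * N)

def lift_over_marginal(model_score, marginal_score):
    """Positive lift: the model outperforms the marginal model."""
    return marginal_score - model_score
```

Under this convention, a model assigning probability 0.5 to every case of a single binary variable receives a log score of 1.0.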
Using WinMine the difference between the log scores of the provided model and the marginal model can also be compared, which is called the lift over marginal. A positive difference signifies that the model outperforms the marginal model on the test set. The initial BN learned from the credit data set is given in Figure 2 below.

Fig. 2. The initial BN learned from the credit data set containing all of the variables

Using the cpts obtained through the initial BN we can calculate both the S_j and NS_j scores. Figure 3 given below depicts the graph of both the S_j and NS_j scores for the 21 variables used in the initial BN. The observations made are as follows: For seven variables in the network the corresponding S_j and NS_j scores agree. These seven variables are the ones which either lack parents or children.

Fig. 3. Graph of the S_j and NS_j scores calculated using the cpts obtained from the initial BN

In our analysis we want to compare the performance of these two variable selection scores. With that purpose, two sets of variables are created, one by selecting the variables with the highest S_j scores and the second with the highest NS_j scores. Using the selected variables the corresponding BNs are learned. The performances of the BNs
are compared in terms of the prediction capacity of the provided model and in terms of the improvement obtained. As the next step, the same process is repeated by using the cpts of the new BNs to calculate the new S_j and NS_j scores. Accordingly, the variables to be excluded from the network are decided according to their ranking on the variable selection score considered, S_j or NS_j. In our analysis, we repeated the steps five times and created BNs using 17, 15, 13, 11 and 8 variables, all selected according to their ranking in the corresponding variable selection scores. The results of their performance are listed in Table 2 given below. Both of the variable selection scores obtain better results compared to the marginal model and also the average distance measure. Notice that the results of the BNs created using the average distance d_j are also listed in the same table. This is done for comparison purposes, to illustrate that both of the variable selection scores result in superior performance compared to the average distance measure. Additionally, in almost all the networks considered, except the BN with 17 variables, we obtained better performing networks using the NS_j score, both in terms of the prediction capacity and the improvement obtained.

Table 2.
Performance results of the variable selection scores S_j and NS_j (2)

                    LogScore   Prediction rate   Lift Over Marginal   Improvement obtained
initial BN          0.76       59.13%            0.19                 7.13%
Top 17    d_j       0.83       56.38%            0.17                 6.30%
          S_j       0.73       60.30%            0.21                 8.20%
          NS_j      0.77       58.58%            0.19                 7.36%
Top 15    d_j       0.77       58.46%            0.19                 7.18%
          S_j       0.72       60.87%            0.22                 8.48%
          NS_j      0.72       60.87%            0.22                 8.48%
Top 13    d_j       0.78       58.29%            0.21                 7.84%
          S_j       0.73       60.23%            0.20                 7.65%
          NS_j      0.66       63.27%            0.22                 8.95%
Top 11    d_j       0.73       60.39%            0.19                 7.35%
          S_j       0.72       60.66%            0.19                 7.41%
          NS_j      0.65       63.74%            0.22                 9.06%
Top 8     d_j       0.76       58.87%            0.18                 6.97%
          S_j       0.67       62.65%            0.18                 7.44%
          NS_j      0.66       63.14%            0.22                 9.08%

(2) The results are rounded to two decimal places.

5 Results, Conclusions and Further Research

In order to ensure the prediction capacity of a BN learned from a data set and to be able to discover hidden information inside a big data set, it is necessary to select the right set of variables to be used in the BN to be learned. This problem is especially
apparent when there is a huge set of variables and the provided data is limited. In the last decade the research on structure learning algorithms for BNs has grown substantially. Though there exists a wide body of research on variable selection in statistical models, the research conducted on variable selection in BNs remains limited. The variable selection measures developed for statistical models have been adapted by the machine learning community for evaluating the overall performance of the BN and do not provide guidance in variable selection for creating a good performing BN. The variable selection score S_j (Cinicioglu & Shenoy, 2012) provides a sound performance for the prediction capacity of the resulting network, but has the drawback that the variables without parents or children are punished for inclusion in the network. Motivated by that problem, in this research we suggest a modification of the S_j score, called NS_j, which fixes the problems inherent in its predecessor S_j. A credit score data set is used for applying the proposed heuristic. The performance of the resulting BNs using the proposed heuristic is evaluated using the log-score and the lift over marginal, which provide the prediction capacity of the network and the improvement obtained using the provided model compared to the marginal model. These results are compared with the results obtained using the distance measure and the S_j score. Accordingly, the newly developed NS_j score shows better performance both in terms of prediction capacity and the improvement obtained. For further research, different variable selection scores from statistical models and different data sets may be used to evaluate the results of the proposed heuristic.

Acknowledgements. We are grateful to two anonymous reviewers of IPMU-2014 for comments and suggestions for improvements. This research was funded by Istanbul University Research Fund, project number 27540.

References

1.
Cinicioglu, E.N., Shenoy, P.P.: A new heuristic for learning Bayesian networks from limited datasets: a real-time recommendation system application with RFID systems in grocery stores. Annals of Operations Research, 1–21 (2012)
2. Cui, G., Wong, M.L., Zhang, G.: Bayesian variable selection for binary response models and direct marketing forecasting. Expert Systems with Applications 37, 7656–7662 (2010)
3. Heckerman, D., Chickering, D.M., Meek, C., Rounthwaite, R., Kadie, C.: Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research 1, 49–75 (2000)
4. Hruschka Jr., E.R., Hruschka, E.R., Ebecken, N.F.F.: Feature selection by Bayesian networks. In: Tawfik, A.Y., Goodwin, S.D. (eds.) Canadian AI 2004. LNCS (LNAI), vol. 3060, pp. 370–379. Springer, Heidelberg (2004)
5. Koller, D., Sahami, M.: Toward optimal feature selection. In: Proceedings of the 13th International Conference on Machine Learning (ICML), pp. 284–292 (1996)
6. Murphy, P.M., Aha, D.W.: UCI Repository of Machine Learning Databases. Department of Information and Computer Science, University of California, Irvine, CA (1994)
7. Sun, L., Shenoy, P.P.: Using Bayesian networks for bankruptcy prediction: some methodological issues. European Journal of Operational Research 180(2), 738–753 (2007)