Data transformation and model selection by experimentation and meta-learning

Pavel B. Brazdil
LIACC, FEP - University of Porto
Rua Campo Alegre, Porto, Portugal
pbrazdil@ncc.up.pt

Research in the area of ML/data mining has led to a proliferation of many different algorithms. In the area of classification, Michie et al. (1994), for instance, describe about two dozen such algorithms. Previous work has shown that there does not exist a single best algorithm suited for all tasks. It is thus necessary to have a way of selecting the most promising model type. This process is often referred to as model selection. An interesting question arises as to what kind of method, or methodology, we should adopt to do that. Previous approaches can be divided basically into two groups. The first includes methods based on experimentation, and the second, methods which employ meta-knowledge. Our aim here is to review both of these approaches in some detail and examine how they could be extended to encompass also the data transformation phase which often precedes learning.

1 Model Selection by Experimentation or Using Meta-Knowledge?

1.1 Model Selection by Experimentation

Model selection by experimentation works, as the name suggests, by evaluating the possible alternatives experimentally on the given problem. In the context of classification one would normally consider a set of possible classifiers and try to obtain reliable error estimates, which is usually done using cross-validation (CV) (Schaffer, 1993). This approach has a number of advantages. First, it is quite general and applicable in many different situations. The method is, as Schaffer (1993) has demonstrated, quite reliable. Given a certain confidence level, the approach does indeed identify the best possible candidate
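As an illustration, model selection by experimentation can be sketched as a small cross-validation loop that estimates each candidate's error rate and picks the lowest. The two candidate "classifiers" below (a majority-class rule and a 1-nearest-neighbour rule on a toy one-dimensional problem) are hypothetical stand-ins for illustration, not the algorithms evaluated in the studies cited here.

```python
import random

def cross_val_error(fit, data, k=5, seed=0):
    """Estimate a classifier's error rate by k-fold cross-validation."""
    rng = random.Random(seed)
    data = data[:]
    rng.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    errors, total = 0, 0
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        predict = fit(train)
        errors += sum(1 for x, y in test if predict(x) != y)
        total += len(test)
    return errors / total

def select_model(classifiers, data):
    """Model selection by experimentation: return the candidate
    with the lowest cross-validated error estimate."""
    scores = {name: cross_val_error(fit, data)
              for name, fit in classifiers.items()}
    return min(scores, key=scores.get), scores

def fit_majority(train):
    # Always predict the most frequent class in the training data.
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def fit_1nn(train):
    # Predict the label of the nearest training case.
    return lambda x: min(train, key=lambda t: abs(t[0] - x))[1]

data = [(v, int(v > 5)) for v in range(20)]  # separable toy data
best, scores = select_model({"majority": fit_majority, "1-NN": fit_1nn}, data)
```

On separable data the nearest-neighbour rule wins clearly; on noisier data the CV estimates, not intuition, decide.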

and errs as expected. The disadvantage of this approach is that it is time consuming, due to the fact that it is necessary to evaluate all the algorithms, some of which can be quite slow. Various proposals have been presented for how to speed up this process. One possibility is to pre-select some algorithms using certain criteria and then limit the experimentation to this subset. Some people have suggested that we should preferably use algorithms which behave rather differently from one another. One criterion for deciding this is to examine whether the algorithms lead to uncorrelated errors (Ali and Pazzani, 1996). Another possibility is to try to reduce the number of cycles of cross-validation without affecting the reliability of the result. Moore and Lee (1994) have proposed a technique referred to as racing, which makes it possible to terminate the evaluation of those algorithms which appear to be far behind the others. Yet another option is to exploit meta-knowledge, which will be briefly reviewed in the next section.

1.2 Model Selection Using Meta-knowledge

Meta-knowledge captures our knowledge about which ML algorithms should perform well in which situation. This knowledge can be either theoretical or of experimental origin, or a mixture of both. The rules described by Brodley (1993), for instance, captured the knowledge of experts concerning the applicability of certain classification algorithms. The meta-knowledge of Brazdil et al. (1994) and Gama and Brazdil (1995) was of experimental origin. The objective of the meta-rules generated with the help of learning systems was to capture certain relationships between the measured dataset characteristics (such as the number of attributes, number of cases, skew, kurtosis etc.) and the error rate. As was demonstrated by the authors, this meta-knowledge can be used to predict the errors of individual algorithms with a certain degree of success.
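The racing idea can be sketched as follows: evaluate all candidates fold by fold and eliminate any algorithm whose running mean error falls too far behind the current leader. Real racing (Moore and Lee, 1994) uses statistical bounds to decide what counts as "far behind"; the fixed margin below is a simplification for illustration, and the per-fold error figures are invented.

```python
def race(fold_errors, margin=0.1):
    """Simplified racing over cross-validation folds: after each fold,
    drop every algorithm whose mean error so far trails the best
    surviving mean by more than `margin`."""
    survivors = set(fold_errors)
    totals = {a: 0.0 for a in fold_errors}
    n_folds = len(next(iter(fold_errors.values())))
    for fold in range(n_folds):
        for a in survivors:
            totals[a] += fold_errors[a][fold]
        best = min(totals[a] / (fold + 1) for a in survivors)
        survivors = {a for a in survivors
                     if totals[a] / (fold + 1) <= best + margin}
    return survivors

# Invented per-fold error rates for three candidate algorithms.
fold_errors = {
    "A1": [0.10, 0.12, 0.11, 0.09],
    "A2": [0.15, 0.14, 0.16, 0.15],
    "A3": [0.50, 0.48, 0.52, 0.49],  # clearly behind: eliminated after fold 1
}
remaining = race(fold_errors)
```

The slow, clearly inferior candidate never consumes the remaining folds of evaluation, which is where the time saving comes from.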
One advantage of this approach, when compared to model selection based on experimentation, is that it does not really need extensive experiments. This is because meta-knowledge captures certain regularities of the situations encountered in the past. The disadvantage is that the meta-knowledge acquired need not be fully applicable to a new situation and, in consequence, this method tends to be somewhat less reliable than model selection based on experimentation. As neither solution is ideal, this suggests that we may gain by combining the two approaches. Model selection by meta-knowledge can be used to pre-select a subset of promising algorithms and then experimentation can be used to identify the best candidate. This method requires that we define the criteria for pre-selecting the set of candidate algorithms. A good criterion will somehow strike a balance between the reliability of the outcome and the amount of experimentation we are prepared to undertake. Pre-selecting fewer algorithms has the advantage that there is less work to be done, but on the other hand, we may get a sub-optimal result.

2 Different Approaches to Model Selection by Meta-knowledge

There are many different ways in which we can approach the problem. Our aim in this section is to describe certain options we can take when addressing it. Basically we need to decide:
- Whether the meta-knowledge should express knowledge concerning pairs of algorithms or a larger group;

- What the reference point is for the comparisons of error rates;
- Whether the meta-knowledge should be easily updateable;
- Whether the predictions should be qualitative (e.g. Ai is applicable) or quantitative (the error rate of Ai is E%);
- Whether or not we want to condition the predictions on dataset characteristics.

Let us now analyze each of the points above in some detail.

2.1 Which is the Best Reference Point?

The first important decision is whether we should consider pairs of algorithms or generalize the study to N algorithms. The meta-rules of Aha (1992) were oriented towards pairs of algorithms (e.g. IB1, C4). The objective of the meta-rules was to define conditions under which one algorithm (e.g. IB1) achieves better results and hence is preferable to another (e.g. C4). The first major comparative study of a set of 22 classification algorithms was carried out under the StatLog project (Michie et al., 1994). The fact that a number of algorithms were analyzed together provided a reason to establish a kind of common reference point for all comparisons involving error rates. Gama and Brazdil (1995), for instance, considered three kinds of reference points in their study and evaluated them experimentally:
- the best error rate achieved by one of the algorithms,
- the mean error rate of all algorithms (or a weighted mean),
- the error rate associated with the majority class prediction.

We note that the first two reference points depend on the set of algorithms under consideration. That is, if we introduce new algorithms into the set, or if we eliminate some existing ones from consideration, we have to, at least in principle, repeat all the steps that depend on this reference point. This of course complicates the task of updating the existing meta-knowledge as soon as new algorithms become available. The third reference point mentioned does not suffer from this disadvantage. The error rate associated with the majority class depends entirely on the dataset under consideration.
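A minimal sketch of the third reference point: the majority-class error depends only on the class distribution of the dataset, so it can be computed once per dataset and used to normalize any algorithm's error rate. The class distribution and error value below are invented for illustration.

```python
from collections import Counter

def majority_class_error(labels):
    """Error rate of always predicting the most frequent class -- a
    reference point that depends only on the dataset, not on which
    algorithms are under consideration."""
    counts = Counter(labels)
    return 1 - counts.most_common(1)[0][1] / len(labels)

def normalized_error(error_rate, labels):
    """Express an algorithm's error relative to the majority-class
    baseline; values below 1 mean it beats the default rule."""
    return error_rate / majority_class_error(labels)

labels = ["pos"] * 70 + ["neg"] * 30      # invented class distribution
baseline = majority_class_error(labels)   # default rule errs on the 30 "neg"
ratio = normalized_error(0.15, labels)    # an algorithm at 15% error
```

Because nothing here refers to the algorithm pool, adding or removing algorithms never forces the normalization to be recomputed.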
2.2 Should the Predictions of Meta-Knowledge be Qualitative or Quantitative?

Another important issue is whether we want the prediction concerning error to be qualitative or quantitative. A qualitative prediction would simply divide the algorithms into two groups: those with low error rates, which we could identify as applicable, and the remaining ones, which include both the algorithms with unacceptably high error rates and also the algorithms which failed to run. Quantitative predictions are concerned with predicting the actual error rate (or an error which has been normalized in some way). The question concerning the form of the meta-knowledge is closely related to this issue. If we are interested in obtaining only qualitative predictions, then the meta-knowledge can be represented in the form of rules or cases. If we are interested in quantitative predictions, then we need to use some kind of regression model, although qualitative predictions can also be converted to quantitative ones (i.e. by associating a numeric value with each class).

2.3 Conclusions of a Previous Comparative Analysis

Let us review the results of the experimental analysis carried out by Gama and Brazdil (1995), who collected test results of about 20 algorithms on more than 20 datasets. Each dataset was characterized using 18 different measures (such as the number of attributes, number of cases, skew, kurtosis etc.). The authors considered and evaluated the three reference points discussed earlier. In addition, the following forms of meta-knowledge were considered:
- rules (generated by C4.5 (Quinlan, 1993));
- instances (a version of IB1 (Aha et al., 1991));
- linear regression equations (generated by a linear discriminant procedure);
- piecewise linear regression equations (linear regression equations with restricted applicability, generated by Quinlan's (1993b) M5.1).

A separate experiment was conducted for each of the 3 reference points and each of the 4 forms of meta-knowledge. There were thus 12 separate experiments in total. In each experiment the predictive power of the meta-knowledge was evaluated using a leave-one-out method. Let us analyze one such experiment for the sake of clarity. Suppose the aim is to evaluate, for instance, the scheme involving the normalization method based on majority class prediction and meta-knowledge in the form of piecewise linear regression equations. In each step of the leave-one-out method, one dataset was set aside for evaluation. The remaining data was normalized with respect to the chosen reference point and supplied to the learning system to construct the model (i.e. piecewise linear regression equations in this case). The prediction was then denormalized and stored with the actual value. These pairs of values were used to calculate measures characterizing the quality of the predictions, such as NMSE, after all cycles of the leave-one-out method had terminated. So, essentially, the authors evaluated the possibility of obtaining reliable predictions with the help of meta-level models.
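The leave-one-out scheme described above can be sketched as a short loop. The meta-model here is a deliberately trivial stand-in (it predicts the mean normalized error of the training datasets) rather than M5.1's piecewise linear regression, and the dataset characterizations, error rates and reference points are all invented.

```python
def nmse(pairs):
    """Normalized mean squared error of (predicted, actual) pairs."""
    actual = [a for _, a in pairs]
    mean = sum(actual) / len(actual)
    return (sum((p - a) ** 2 for p, a in pairs)
            / sum((a - mean) ** 2 for a in actual))

def loo_evaluate(datasets, ref_points, learn):
    """One leave-one-out cycle per dataset: normalize the remaining error
    rates by the reference point, fit the meta-model, predict for the
    held-out dataset, then denormalize the prediction."""
    pairs = []
    for i, (chars, err) in enumerate(datasets):
        train = [(c, e / ref_points[j])
                 for j, (c, e) in enumerate(datasets) if j != i]
        model = learn(train)
        pairs.append((model(chars) * ref_points[i], err))
    return pairs

def mean_learner(train):
    """Trivial meta-model: predict the mean normalized training error."""
    m = sum(e for _, e in train) / len(train)
    return lambda chars: m

datasets = [({"n_attrs": 10}, 0.20),   # (characterization, observed error)
            ({"n_attrs": 25}, 0.30),
            ({"n_attrs": 40}, 0.25)]
ref_points = [0.5, 0.5, 0.5]           # e.g. majority-class error per dataset
pairs = loo_evaluate(datasets, ref_points, mean_learner)
quality = nmse(pairs)
```

Replacing `mean_learner` with a real regression over the 18 dataset measures reproduces the structure of one of the 12 experiments.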
This analysis showed that meta-level models were indeed quite useful, although some set-ups were more successful than others. Instance-based models (more exactly 3-NN) provided more reliable predictions than some of the other model types (particularly rules and linear regression equations). Piecewise linear regression equations also achieved quite good predictions overall. The best reference point was the one related to majority class prediction. These results have a quite interesting implication. The method that provides the most reliable predictions (IBL + majority class as the reference point) enables us to construct a system which is easily extensible. The system can easily accommodate new algorithms, which can arise at any time. The new results can simply be added to the existing instances and used immediately afterwards in decision making. There is no need to carry out extensive meta-level learning, which is an advantage. This strategy was incorporated in the system Calg (Gama, 1996). The only disadvantage is that the meta-knowledge in this form does not really provide a comprehensible model.

3 Using Meta-Knowledge to Guide Experimentation

Let us consider the issue of whether meta-knowledge can also be used to guide the process of experimentation. But would this guidance really be useful?
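The extensibility of the instance-based set-up can be sketched directly: the meta-instances are just (dataset characteristics, error) pairs, a 3-NN prediction is an average over the most similar past datasets, and adding results for a new algorithm or dataset is a plain append with no re-learning. The characterization vectors and errors below are invented.

```python
def knn_meta_predict(instances, chars, k=3):
    """Instance-based meta-prediction: estimate the error on a new
    dataset as the mean error over the k most similar previously
    seen datasets (Euclidean distance on the characterizations)."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(instances, key=lambda inst: dist(inst[0], chars))[:k]
    return sum(err for _, err in nearest) / len(nearest)

# Meta-instances: (dataset characterization, observed error of some algorithm).
instances = [((0.0, 0.0), 0.10), ((1.0, 0.0), 0.20),
             ((0.0, 1.0), 0.30), ((5.0, 5.0), 0.90)]

pred = knn_meta_predict(instances, (0.0, 0.0))

# Extending the system needs no meta-level re-learning:
instances.append(((0.1, 0.1), 0.12))  # new test result, usable immediately
```

The price of this convenience is exactly the one noted above: the set of stored instances is not a comprehensible model.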

The answer is affirmative, if we want to avoid unnecessary work. If pre-selection is done on the basis of the performance of the individual algorithms only, we cannot guarantee that the final subset does not include algorithms which are minor variants of one another. For practical reasons it is not really worth trying them all. What kind of meta-knowledge could be useful here? One interesting and practical possibility is to use statements of the form:

pf(Ai(Di) >> Aj(Di) | Di ∈ DatasetPool)

which enable us to describe the frequency with which algorithm Ai performs significantly better than algorithm Aj on the given datasets. Here, "Ai(Di) >> Aj(Di)" is used as a shorthand for "algorithm Ai performs significantly better (considering a given confidence level, say 95%) than algorithm Aj". We can use this representation to express the fact that the algorithm Ltree, for instance, leads to significantly better results than C4.5 in 10 out of 22 cases by:

pf(Ltree(Di) >> C4.5(Di) | Di ∈ UCIdatasets) = 10/22

The algorithm Ltree is a decision tree type algorithm which can introduce new terms with the help of constructive induction (Gama, 1997). The frequency can be used to estimate the probability that one algorithm performs better than another. It can help to resolve the problem we discussed earlier: if Aj' is a variant of Aj which does not really bring any benefits, then presumably the frequency of observing a significant improvement is zero. To express this we can use:

pf(Aj'(Di) >> Aj(Di) | Di ∈ DatasetPool) = 0

4 Using Meta-Knowledge to Guide Pre-Processing and Model Selection

Previous studies have shown that pre-processing, such as the elimination of irrelevant features or the discretization of numeric values etc., can often bring about substantial improvements. Langley and Iba (1993), for instance, have demonstrated that the performance of the IBL classifier can be substantially improved by eliminating irrelevant features.
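The pf(...) statements can be computed mechanically from stored paired results. The sketch below counts, over a dataset pool, how often one algorithm's error is lower than another's by more than roughly z times the combined standard error, a crude approximation of a 95% significance test rather than the test used in the cited studies; all the numbers are invented.

```python
def pf_table(results, z=1.96):
    """Estimate pf(Ai(Di) >> Aj(Di) | Di in pool) for every ordered pair.
    results: {dataset: {algorithm: (mean_error, std_error)}}."""
    algorithms = sorted(next(iter(results.values())))
    n = len(results)
    table = {}
    for a in algorithms:
        for b in algorithms:
            if a == b:
                continue
            # Ai "significantly better" when the error gap exceeds
            # z times the combined standard error (rough 95% test).
            wins = sum(
                1 for res in results.values()
                if res[b][0] - res[a][0]
                   > z * (res[a][1] ** 2 + res[b][1] ** 2) ** 0.5)
            table[(a, b)] = wins / n
    return table

results = {  # invented per-dataset (mean error, standard error) estimates
    "d1": {"Ltree": (0.10, 0.01), "C4.5": (0.20, 0.01)},
    "d2": {"Ltree": (0.15, 0.02), "C4.5": (0.16, 0.02)},
}
table = pf_table(results)
```

A pair whose entry stays at zero in both directions is a candidate minor variant that need not be raced against its sibling.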
Kohavi and John (1997) have verified that similar improvements can also be obtained with the Naive Bayes and ID3 classifiers. Some classification algorithms (e.g. Naive Bayes) achieve better performance if the numeric features are discretized first (Dougherty et al., 1995). A question arises as to whether the system proposed in the previous section can be extended to also cover the pre-processing stage. Our view is that this can indeed be done. Let us consider, for instance, one result presented in (Dougherty et al., 1995): at the 95% confidence level, Naive Bayes with entropy-based discretization is better than C4.5 on five datasets and worse on two (there were 16 datasets in total). This statement can be expressed in the form of the following two meta-level facts:

pf(NaiveBayes(entropy-discr(Di)) >> C4.5(Di) | Di ∈ UCIdata-DKS) = 5/16
pf(NaiveBayes(entropy-discr(Di)) << C4.5(Di) | Di ∈ UCIdata-DKS) = 2/16

The fact that the backward feature selection investigated by Kohavi and John (1997) led to improvements on 4 datasets can be expressed as follows:

pf(C4.5(back-feature-select(Di)) >> C4.5(Di) | Di ∈ UCIdata-KJ) = 4/14
pf(C4.5(back-feature-select(Di)) << C4.5(Di) | Di ∈ UCIdata-KJ) = 0/14
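Extending the representation to pre-processing turns each candidate into a composition of a transformation and a classifier, so the space to be searched is the cross product of the two sets. A minimal sketch, with only illustrative operator names:

```python
from itertools import product

# Candidate pre-processing operators and classifiers (illustrative names).
preprocessors = ["identity", "entropy-discr", "back-feature-select"]
classifiers = ["naive-bayes", "C4.5", "IB1"]

# Each candidate is classifier(preprocessor(D)); meta-level pf facts about
# these compositions can then guide which ones are worth evaluating.
candidates = [f"{c}({p}(D))" for c, p in product(classifiers, preprocessors)]
```

Even this tiny space has nine candidates, which is why meta-knowledge about the compositions, and not just about the classifiers, pays off.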

5 Conclusion

Our proposal is to use IBL meta-knowledge to perform a pre-selection of promising algorithms and then use the representation described above to guide the process of conducting experiments and evaluating the candidate algorithms. The search for the best combination of pre-processing method and model type can be seen as a kind of heuristic search. The meta-knowledge rules capture the results of previous experience and are used to avoid probable pitfalls in the future. Our plan is to evaluate the effectiveness of this method.

Acknowledgments

Gratitude is expressed for the financial support under the PRAXIS XXI project ECO and the Plurianual support attributed to LIACC.

References

[1] Aha D. (1992): Generalizing from Case Studies: A Case Study, in Machine Learning, Proceedings of the 9th Machine Learning Conference (ML-92), D. Sleeman and P. Edwards (eds.), Morgan Kaufmann.
[2] Aha D., Kibler D., Albert M. (1991): Instance-based Learning Algorithms, in Machine Learning, Vol. 6, No. 1, Kluwer Academic Publ.
[3] Ali K. and Pazzani M. (1996): Error Reduction through Learning Multiple Descriptions, in Machine Learning, Vol. 24, Kluwer Academic Publ.
[4] Blum A., Langley P. (1997): Selection of Relevant Features and Examples in Machine Learning, Artificial Intelligence, Vol. 97, Nos. 1-2, Elsevier.
[5] Brodley C. (1993): Addressing the Selective Superiority Problem: Automatic Algorithm/Model Class Selection, in Machine Learning, Proceedings of the 10th Machine Learning Conference, Morgan Kaufmann.
[6] Brazdil P. (1994): Analysis of Results, Chapter 10 in Michie D. et al. (eds.), Machine Learning, Neural and Statistical Classification, Ellis Horwood.
[7] Brazdil P., Gama J. and Henery B. (1994): Characterizing the Applicability of Classification Algorithms, in Machine Learning: ECML-94, Proceedings of the European Conference on Machine Learning, F. Bergadano and L. de Raedt (eds.), Springer-Verlag.
[8] Dougherty J., Kohavi R. and Sahami M. (1995): Supervised and Unsupervised Discretization of Continuous Features, in Machine Learning, Proceedings of the 12th Machine Learning Conference, Morgan Kaufmann.
[9] Gama J. (1997): Probabilistic Linear Tree, in Machine Learning, Proceedings of the 14th Machine Learning Conference (ICML-97), Morgan Kaufmann.
[10] Gama J., Brazdil P. (1995): Characterization of Classification Algorithms, in C. Pinto-Ferreira, N. Mamede (eds.), Progress in Artificial Intelligence, LNAI 990, Springer-Verlag.
[11] Kohavi R. and John G. (1997): Wrappers for Feature Subset Selection, Artificial Intelligence, Vol. 97, Nos. 1-2, Elsevier.
[12] Michie D., Spiegelhalter D., Taylor C. (1994): Machine Learning, Neural and Statistical Classification, Ellis Horwood.
[13] Moore A. and Lee M. (1994): Efficient Algorithms for Minimizing Cross Validation Error, in Machine Learning, Proceedings of the 11th Machine Learning Conference (ML-94), Morgan Kaufmann.
[14] Quinlan R. (1993): C4.5: Programs for Machine Learning, Morgan Kaufmann.
[15] Quinlan R. (1993b): Combining Instance-based and Model-based Learning, in Machine Learning, Proceedings of the 10th Machine Learning Conference, Morgan Kaufmann.
[16] Schaffer C. (1993): Selecting a Classification Method by Cross-Validation, in Machine Learning, Vol. 13, No. 1, Kluwer Academic Publ.


Model reconnaissance: discretization, naive Bayes and maximum-entropy. Sanne de Roever/ spdrnl Model reconnaissance: discretization, naive Bayes and maximum-entropy Sanne de Roever/ spdrnl December, 2013 Description of the dataset There are two datasets: a training and a test dataset of respectively

More information

AN EXPERIMENTAL STUDY ON HYPOTHYROID USING ROTATION FOREST

AN EXPERIMENTAL STUDY ON HYPOTHYROID USING ROTATION FOREST AN EXPERIMENTAL STUDY ON HYPOTHYROID USING ROTATION FOREST Sheetal Gaikwad 1 and Nitin Pise 2 1 PG Scholar, Department of Computer Engineering,Maeers MIT,Kothrud,Pune,India 2 Associate Professor, Department

More information

Mostly Harmless Simulations? On the Internal Validity of Empirical Monte Carlo Studies

Mostly Harmless Simulations? On the Internal Validity of Empirical Monte Carlo Studies Mostly Harmless Simulations? On the Internal Validity of Empirical Monte Carlo Studies Arun Advani and Tymon Sªoczy«ski 13 November 2013 Background When interested in small-sample properties of estimators,

More information

Hybrid HMM and HCRF model for sequence classification

Hybrid HMM and HCRF model for sequence classification Hybrid HMM and HCRF model for sequence classification Y. Soullard and T. Artières University Pierre and Marie Curie - LIP6 4 place Jussieu 75005 Paris - France Abstract. We propose a hybrid model combining

More information

A DATA MINING APPROACH FOR PRECISE DIAGNOSIS OF DENGUE FEVER

A DATA MINING APPROACH FOR PRECISE DIAGNOSIS OF DENGUE FEVER A DATA MINING APPROACH FOR PRECISE DIAGNOSIS OF DENGUE FEVER M.Bhavani 1 and S.Vinod kumar 2 International Journal of Latest Trends in Engineering and Technology Vol.(7)Issue(4), pp.352-359 DOI: http://dx.doi.org/10.21172/1.74.048

More information

CHAPTER 6. Experiments in the Real World

CHAPTER 6. Experiments in the Real World CHAPTER 6 Experiments in the Real World EQUAL TREATMENT FOR ALL SUBJECTS The underlying assumption of randomized comparative experiments is that all subjects are handled equally in every respect except

More information

International Journal of Advance Research in Computer Science and Management Studies

International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 12, December 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

Rational drug design as hypothesis formation

Rational drug design as hypothesis formation Rational drug design as hypothesis formation Alexander P.M. van den Bosch * Department of Philosophy A-weg 30, 9718 CW, Groningen voice: 0031-50-3636946/6161 fax: 0031-50-3636160 email: alexander@philos.rug.nl

More information

Wrapper subset evaluation facilitates the automated detection of diabetes from heart rate variability measures

Wrapper subset evaluation facilitates the automated detection of diabetes from heart rate variability measures Wrapper subset evaluation facilitates the automated detection of diabetes from heart rate variability measures D. J. Cornforth 1, H. F. Jelinek 1, M. C. Teich 2 and S. B. Lowen 3 1 Charles Sturt University,

More information

Logistic Regression and Bayesian Approaches in Modeling Acceptance of Male Circumcision in Pune, India

Logistic Regression and Bayesian Approaches in Modeling Acceptance of Male Circumcision in Pune, India 20th International Congress on Modelling and Simulation, Adelaide, Australia, 1 6 December 2013 www.mssanz.org.au/modsim2013 Logistic Regression and Bayesian Approaches in Modeling Acceptance of Male Circumcision

More information

In Intelligent Information Systems, M. Klopotek, M. Michalewicz, S.T. Wierzchon (eds.), pp , Advances in Soft Computing Series,

In Intelligent Information Systems, M. Klopotek, M. Michalewicz, S.T. Wierzchon (eds.), pp , Advances in Soft Computing Series, In Intelligent Information Systems, M. Klopotek, M. Michalewicz, S.T. Wierzchon (eds.), pp. 303-313, Advances in Soft Computing Series, Physica-Verlag (A Springer-Verlag Company), Heidelberg, 2000 Extension

More information

Statistics Mathematics 243

Statistics Mathematics 243 Statistics Mathematics 243 Michael Stob February 2, 2005 These notes are supplementary material for Mathematics 243 and are not intended to stand alone. They should be used in conjunction with the textbook

More information

Title: A robustness study of parametric and non-parametric tests in Model-Based Multifactor Dimensionality Reduction for epistasis detection

Title: A robustness study of parametric and non-parametric tests in Model-Based Multifactor Dimensionality Reduction for epistasis detection Author's response to reviews Title: A robustness study of parametric and non-parametric tests in Model-Based Multifactor Dimensionality Reduction for epistasis detection Authors: Jestinah M Mahachie John

More information

A STUDY OF AdaBoost WITH NAIVE BAYESIAN CLASSIFIERS: WEAKNESS AND IMPROVEMENT

A STUDY OF AdaBoost WITH NAIVE BAYESIAN CLASSIFIERS: WEAKNESS AND IMPROVEMENT Computational Intelligence, Volume 19, Number 2, 2003 A STUDY OF AdaBoost WITH NAIVE BAYESIAN CLASSIFIERS: WEAKNESS AND IMPROVEMENT KAI MING TING Gippsland School of Computing and Information Technology,

More information

Decisions and Dependence in Influence Diagrams

Decisions and Dependence in Influence Diagrams JMLR: Workshop and Conference Proceedings vol 52, 462-473, 2016 PGM 2016 Decisions and Dependence in Influence Diagrams Ross D. hachter Department of Management cience and Engineering tanford University

More information

Measurement and meaningfulness in Decision Modeling

Measurement and meaningfulness in Decision Modeling Measurement and meaningfulness in Decision Modeling Brice Mayag University Paris Dauphine LAMSADE FRANCE Chapter 2 Brice Mayag (LAMSADE) Measurement theory and meaningfulness Chapter 2 1 / 47 Outline 1

More information

Fast Affinity Propagation Clustering based on Machine Learning

Fast Affinity Propagation Clustering based on Machine Learning www.ijcsi.org 302 Fast Affinity Propagation Clustering based on Machine Learning Shailendra Kumar Shrivastava 1, Dr. J.L. Rana 2 and Dr. R.C. Jain 3 1 Samrat Ashok Technological Institute Vidisha, Madhya

More information

Generative Adversarial Networks.

Generative Adversarial Networks. Generative Adversarial Networks www.cs.wisc.edu/~page/cs760/ Goals for the lecture you should understand the following concepts Nash equilibrium Minimax game Generative adversarial network Prisoners Dilemma

More information

Motivation Empirical models Data and methodology Results Discussion. University of York. University of York

Motivation Empirical models Data and methodology Results Discussion. University of York. University of York Healthcare Cost Regressions: Going Beyond the Mean to Estimate the Full Distribution A. M. Jones 1 J. Lomas 2 N. Rice 1,2 1 Department of Economics and Related Studies University of York 2 Centre for Health

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Midterm, 2016

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Midterm, 2016 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Midterm, 2016 Exam policy: This exam allows one one-page, two-sided cheat sheet; No other materials. Time: 80 minutes. Be sure to write your name and

More information

Event Classification and Relationship Labeling in Affiliation Networks

Event Classification and Relationship Labeling in Affiliation Networks Event Classification and Relationship Labeling in Affiliation Networks Abstract Many domains are best described as an affiliation network in which there are entities such as actors, events and organizations

More information

PREDICTION OF BREAST CANCER USING STACKING ENSEMBLE APPROACH

PREDICTION OF BREAST CANCER USING STACKING ENSEMBLE APPROACH PREDICTION OF BREAST CANCER USING STACKING ENSEMBLE APPROACH 1 VALLURI RISHIKA, M.TECH COMPUTER SCENCE AND SYSTEMS ENGINEERING, ANDHRA UNIVERSITY 2 A. MARY SOWJANYA, Assistant Professor COMPUTER SCENCE

More information

A Pitfall in Determining the Optimal Feature Subset Size

A Pitfall in Determining the Optimal Feature Subset Size A Pitfall in Determining the Optimal Feature Subset Size Juha Reunanen ABB, Web Imaging Systems P.O. Box 94, 00381 Helsinki, Finland Juha.Reunanen@fi.abb.com Abstract. Feature selection researchers often

More information

Machine Learning Statistical Learning. Prof. Matteo Matteucci

Machine Learning Statistical Learning. Prof. Matteo Matteucci Machine Learning Statistical Learning Pro. Matteo Matteucci Statistical Learning Outline o What Is Statistical Learning? Why estimate? How do we estimate? The trade-o between prediction accuracy & model

More information

Towards Learning to Ignore Irrelevant State Variables

Towards Learning to Ignore Irrelevant State Variables Towards Learning to Ignore Irrelevant State Variables Nicholas K. Jong and Peter Stone Department of Computer Sciences University of Texas at Austin Austin, Texas 78712 {nkj,pstone}@cs.utexas.edu Abstract

More information

Dealing with Missing Values in Neural Network-Based Diagnostic Systems

Dealing with Missing Values in Neural Network-Based Diagnostic Systems Dealing with Missing Values in Neural Network-Based Diagnostic Systems P. K. Sharpe 1 & R. J. Solly The Transputer Centre University of the West of England Coldharbour Lane Frenchay Bristol BS16 1QY Abstract

More information

Bayesian Confidence Intervals for Means and Variances of Lognormal and Bivariate Lognormal Distributions

Bayesian Confidence Intervals for Means and Variances of Lognormal and Bivariate Lognormal Distributions Bayesian Confidence Intervals for Means and Variances of Lognormal and Bivariate Lognormal Distributions J. Harvey a,b, & A.J. van der Merwe b a Centre for Statistical Consultation Department of Statistics

More information

Modeling Sentiment with Ridge Regression

Modeling Sentiment with Ridge Regression Modeling Sentiment with Ridge Regression Luke Segars 2/20/2012 The goal of this project was to generate a linear sentiment model for classifying Amazon book reviews according to their star rank. More generally,

More information

AQCHANALYTICAL TUTORIAL ARTICLE. Classification in Karyometry HISTOPATHOLOGY. Performance Testing and Prediction Error

AQCHANALYTICAL TUTORIAL ARTICLE. Classification in Karyometry HISTOPATHOLOGY. Performance Testing and Prediction Error AND QUANTITATIVE CYTOPATHOLOGY AND AQCHANALYTICAL HISTOPATHOLOGY An Official Periodical of The International Academy of Cytology and the Italian Group of Uropathology Classification in Karyometry Performance

More information

A Biased View of Perceivers. Commentary on `Observer theory, Bayes theory,

A Biased View of Perceivers. Commentary on `Observer theory, Bayes theory, A Biased View of Perceivers Commentary on `Observer theory, Bayes theory, and psychophysics,' by B. Bennett, et al. Allan D. Jepson University oftoronto Jacob Feldman Rutgers University March 14, 1995

More information

Improving the Accuracy of Neuro-Symbolic Rules with Case-Based Reasoning

Improving the Accuracy of Neuro-Symbolic Rules with Case-Based Reasoning Improving the Accuracy of Neuro-Symbolic Rules with Case-Based Reasoning Jim Prentzas 1, Ioannis Hatzilygeroudis 2 and Othon Michail 2 Abstract. In this paper, we present an improved approach integrating

More information

Annotation and Retrieval System Using Confabulation Model for ImageCLEF2011 Photo Annotation

Annotation and Retrieval System Using Confabulation Model for ImageCLEF2011 Photo Annotation Annotation and Retrieval System Using Confabulation Model for ImageCLEF2011 Photo Annotation Ryo Izawa, Naoki Motohashi, and Tomohiro Takagi Department of Computer Science Meiji University 1-1-1 Higashimita,

More information

THE USE OF MULTIVARIATE ANALYSIS IN DEVELOPMENT THEORY: A CRITIQUE OF THE APPROACH ADOPTED BY ADELMAN AND MORRIS A. C. RAYNER

THE USE OF MULTIVARIATE ANALYSIS IN DEVELOPMENT THEORY: A CRITIQUE OF THE APPROACH ADOPTED BY ADELMAN AND MORRIS A. C. RAYNER THE USE OF MULTIVARIATE ANALYSIS IN DEVELOPMENT THEORY: A CRITIQUE OF THE APPROACH ADOPTED BY ADELMAN AND MORRIS A. C. RAYNER Introduction, 639. Factor analysis, 639. Discriminant analysis, 644. INTRODUCTION

More information

Feature Selection for Classification of Music According to Expressed Emotion

Feature Selection for Classification of Music According to Expressed Emotion Feature Selection for Classification of Music According to Expressed Emotion Pasi Saari Master s Thesis Music, Mind & Technology December 2009 University of Jyväskylä UNIVERSITY OF JYVÄSKYLÄ !"#$%&"'$()"'*+,*%-+)

More information

BIOC2060: Purication of alkaline phosphatase

BIOC2060: Purication of alkaline phosphatase BIOC2060: Purication of alkaline phosphatase Tom Hargreaves December 2008 Contents 1 Introduction 1 2 Procedure 2 2.1 Lysozyme treatment......................... 2 2.2 Partial purication..........................

More information

From: AAAI Technical Report SS Compilation copyright 1995, AAAI ( All rights reserved.

From: AAAI Technical Report SS Compilation copyright 1995, AAAI (  All rights reserved. From: AAAI Technical Report SS-95-03. Compilation copyright 1995, AAAI (www.aaai.org). All rights reserved. MAKING MULTIPLE HYPOTHESES EXPLICIT: AN EXPLICIT STRATEGY COMPUTATIONAL 1 MODELS OF SCIENTIFIC

More information

Mammogram Analysis: Tumor Classification

Mammogram Analysis: Tumor Classification Mammogram Analysis: Tumor Classification Literature Survey Report Geethapriya Raghavan geeragh@mail.utexas.edu EE 381K - Multidimensional Digital Signal Processing Spring 2005 Abstract Breast cancer is

More information

Technical Specifications

Technical Specifications Technical Specifications In order to provide summary information across a set of exercises, all tests must employ some form of scoring models. The most familiar of these scoring models is the one typically

More information

Subjective randomness and natural scene statistics

Subjective randomness and natural scene statistics Psychonomic Bulletin & Review 2010, 17 (5), 624-629 doi:10.3758/pbr.17.5.624 Brief Reports Subjective randomness and natural scene statistics Anne S. Hsu University College London, London, England Thomas

More information

Review: Logistic regression, Gaussian naïve Bayes, linear regression, and their connections

Review: Logistic regression, Gaussian naïve Bayes, linear regression, and their connections Review: Logistic regression, Gaussian naïve Bayes, linear regression, and their connections New: Bias-variance decomposition, biasvariance tradeoff, overfitting, regularization, and feature selection Yi

More information

Contributions to Brain MRI Processing and Analysis

Contributions to Brain MRI Processing and Analysis Contributions to Brain MRI Processing and Analysis Dissertation presented to the Department of Computer Science and Artificial Intelligence By María Teresa García Sebastián PhD Advisor: Prof. Manuel Graña

More information

Pooling Subjective Confidence Intervals

Pooling Subjective Confidence Intervals Spring, 1999 1 Administrative Things Pooling Subjective Confidence Intervals Assignment 7 due Friday You should consider only two indices, the S&P and the Nikkei. Sorry for causing the confusion. Reading

More information

Doing Quantitative Research 26E02900, 6 ECTS Lecture 6: Structural Equations Modeling. Olli-Pekka Kauppila Daria Kautto

Doing Quantitative Research 26E02900, 6 ECTS Lecture 6: Structural Equations Modeling. Olli-Pekka Kauppila Daria Kautto Doing Quantitative Research 26E02900, 6 ECTS Lecture 6: Structural Equations Modeling Olli-Pekka Kauppila Daria Kautto Session VI, September 20 2017 Learning objectives 1. Get familiar with the basic idea

More information

Approximately as appeared in: Learning and Computational Neuroscience: Foundations. Time-Derivative Models of Pavlovian

Approximately as appeared in: Learning and Computational Neuroscience: Foundations. Time-Derivative Models of Pavlovian Approximately as appeared in: Learning and Computational Neuroscience: Foundations of Adaptive Networks, M. Gabriel and J. Moore, Eds., pp. 497{537. MIT Press, 1990. Chapter 12 Time-Derivative Models of

More information

International Journal of Pharma and Bio Sciences A NOVEL SUBSET SELECTION FOR CLASSIFICATION OF DIABETES DATASET BY ITERATIVE METHODS ABSTRACT

International Journal of Pharma and Bio Sciences A NOVEL SUBSET SELECTION FOR CLASSIFICATION OF DIABETES DATASET BY ITERATIVE METHODS ABSTRACT Research Article Bioinformatics International Journal of Pharma and Bio Sciences ISSN 0975-6299 A NOVEL SUBSET SELECTION FOR CLASSIFICATION OF DIABETES DATASET BY ITERATIVE METHODS D.UDHAYAKUMARAPANDIAN

More information

Confidence Intervals On Subsets May Be Misleading

Confidence Intervals On Subsets May Be Misleading Journal of Modern Applied Statistical Methods Volume 3 Issue 2 Article 2 11-1-2004 Confidence Intervals On Subsets May Be Misleading Juliet Popper Shaffer University of California, Berkeley, shaffer@stat.berkeley.edu

More information

Hebbian Plasticity for Improving Perceptual Decisions

Hebbian Plasticity for Improving Perceptual Decisions Hebbian Plasticity for Improving Perceptual Decisions Tsung-Ren Huang Department of Psychology, National Taiwan University trhuang@ntu.edu.tw Abstract Shibata et al. reported that humans could learn to

More information

Unsupervised Measurement of Translation Quality Using Multi-engine, Bi-directional Translation

Unsupervised Measurement of Translation Quality Using Multi-engine, Bi-directional Translation Unsupervised Measurement of Translation Quality Using Multi-engine, Bi-directional Translation Menno van Zaanen and Simon Zwarts Division of Information and Communication Sciences Department of Computing

More information

Cognitive modeling versus game theory: Why cognition matters

Cognitive modeling versus game theory: Why cognition matters Cognitive modeling versus game theory: Why cognition matters Matthew F. Rutledge-Taylor (mrtaylo2@connect.carleton.ca) Institute of Cognitive Science, Carleton University, 1125 Colonel By Drive Ottawa,

More information

Chapter 1 Data Types and Data Collection. Brian Habing Department of Statistics University of South Carolina. Outline

Chapter 1 Data Types and Data Collection. Brian Habing Department of Statistics University of South Carolina. Outline STAT 515 Statistical Methods I Chapter 1 Data Types and Data Collection Brian Habing Department of Statistics University of South Carolina Redistribution of these slides without permission is a violation

More information

CSE 258 Lecture 1.5. Web Mining and Recommender Systems. Supervised learning Regression

CSE 258 Lecture 1.5. Web Mining and Recommender Systems. Supervised learning Regression CSE 258 Lecture 1.5 Web Mining and Recommender Systems Supervised learning Regression What is supervised learning? Supervised learning is the process of trying to infer from labeled data the underlying

More information

Subgroup Discovery for Test Selection: A Novel Approach and Its Application to Breast Cancer Diagnosis

Subgroup Discovery for Test Selection: A Novel Approach and Its Application to Breast Cancer Diagnosis Subgroup Discovery for Test Selection: A Novel Approach and Its Application to Breast Cancer Diagnosis Marianne Mueller 1,Rómer Rosales 2, Harald Steck 2, Sriram Krishnan 2,BharatRao 2, and Stefan Kramer

More information

Identifying Parkinson s Patients: A Functional Gradient Boosting Approach

Identifying Parkinson s Patients: A Functional Gradient Boosting Approach Identifying Parkinson s Patients: A Functional Gradient Boosting Approach Devendra Singh Dhami 1, Ameet Soni 2, David Page 3, and Sriraam Natarajan 1 1 Indiana University Bloomington 2 Swarthmore College

More information

Scientific Journal of Informatics Vol. 3, No. 2, November p-issn e-issn

Scientific Journal of Informatics Vol. 3, No. 2, November p-issn e-issn Scientific Journal of Informatics Vol. 3, No. 2, November 2016 p-issn 2407-7658 http://journal.unnes.ac.id/nju/index.php/sji e-issn 2460-0040 The Effect of Best First and Spreadsubsample on Selection of

More information

Comparative study of Naïve Bayes Classifier and KNN for Tuberculosis

Comparative study of Naïve Bayes Classifier and KNN for Tuberculosis Comparative study of Naïve Bayes Classifier and KNN for Tuberculosis Hardik Maniya Mosin I. Hasan Komal P. Patel ABSTRACT Data mining is applied in medical field since long back to predict disease like

More information

Pattern Recognition Based Prediction of the Outcome of Radiotherapy in Cervical Cancer Treatment

Pattern Recognition Based Prediction of the Outcome of Radiotherapy in Cervical Cancer Treatment Pattern Recognition Based Prediction of the Outcome of Radiotherapy in Cervical Cancer Treatment. By Mohammad Yasar and Vimala Nunavath Supervisor Ole Christoffer Granmo This Master s Thesis is carried

More information

AN INFORMATION VISUALIZATION APPROACH TO CLASSIFICATION AND ASSESSMENT OF DIABETES RISK IN PRIMARY CARE

AN INFORMATION VISUALIZATION APPROACH TO CLASSIFICATION AND ASSESSMENT OF DIABETES RISK IN PRIMARY CARE Proceedings of the 3rd INFORMS Workshop on Data Mining and Health Informatics (DM-HI 2008) J. Li, D. Aleman, R. Sikora, eds. AN INFORMATION VISUALIZATION APPROACH TO CLASSIFICATION AND ASSESSMENT OF DIABETES

More information

Artificial intelligence and judicial systems: The so-called predictive justice. 20 April

Artificial intelligence and judicial systems: The so-called predictive justice. 20 April Artificial intelligence and judicial systems: The so-called predictive justice 20 April 2018 1 Context The use of so-called artificielle intelligence received renewed interest over the past years.. Stakes

More information

An Integration of Rule Induction and Exemplar-Based Learning for Graded Concepts

An Integration of Rule Induction and Exemplar-Based Learning for Graded Concepts Machine Learning, 21,235-267 (1995) 1995 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands. An Integration of Rule Induction and Exemplar-Based Learning for Graded Concepts JIANPING ZHANG

More information

(CORRELATIONAL DESIGN AND COMPARATIVE DESIGN)

(CORRELATIONAL DESIGN AND COMPARATIVE DESIGN) UNIT 4 OTHER DESIGNS (CORRELATIONAL DESIGN AND COMPARATIVE DESIGN) Quasi Experimental Design Structure 4.0 Introduction 4.1 Objectives 4.2 Definition of Correlational Research Design 4.3 Types of Correlational

More information