Subsampling for Efficient and Effective Unsupervised Outlier Detection Ensembles

Arthur Zimek*, Matthew Gaudet, Ricardo J. G. B. Campello†, Jörg Sander
Department of Computing Science, University of Alberta, Edmonton, AB, Canada

* This work was done while the author was on leave of absence from Ludwig-Maximilians-Universität München, Germany.
† This work was done while the author was on sabbatical leave from University of São Paulo, São Carlos, Brazil.

ABSTRACT

Outlier detection and ensemble learning are well established research directions in data mining, yet the application of ensemble techniques to outlier detection has rarely been studied. Here, we propose and study subsampling as a technique to induce diversity among individual outlier detectors. We show analytically and experimentally that an outlier detector based on a subsample per se, besides inducing diversity, can, under certain conditions, already improve upon the results of the same outlier detector on the complete dataset. Building an ensemble on top of several subsamples improves the results further. While in the literature so far the intuition that ensembles improve over single outlier detectors has just been transferred from the classification literature, here we also justify analytically why ensembles are expected to work in the unsupervised area of outlier detection. As a side effect, running an ensemble of several outlier detectors on subsamples of the dataset is more efficient than ensembles based on other means of introducing diversity and, depending on the sample rate and the size of the ensemble, can be even more efficient than just the single outlier detector on the complete data.

Categories and Subject Descriptors
H.2.8 [Database Applications]: Data mining

Keywords
outlier detection; ensemble

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. KDD'13, August 11-14, 2013, Chicago, Illinois, USA. Copyright 2013 ACM .../13/08...$15.00.

1. INTRODUCTION

An outlier is "an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data" [6]. Detecting outliers is an important task in many practical applications. Some applications of outlier detection, such as detecting measurement errors, are mostly concerned with removing the outliers from the data as a form of noise. Other applications, such as credit card abuse detection, or the identification of unusual measurements in scientific data, are concerned with finding outliers because their deviating behavior from the rest of the data may require specific actions or provide opportunities for new insights.

Various approaches to outlier detection have been proposed, based on different notions of outliers, or targeted towards specific applications that require the identification of outliers. Here, we are interested in unsupervised, non-parametric outlier detection methods that assign a score to each data object and thus allow a ranking of objects according to their degree of outlierness.

Parametric, statistical approaches [6, 35] fit certain distributions to the data by estimating the parameters of these distributions from the given data. A problem with these approaches is that distribution parameters such as mean, standard deviation, and covariances are rather sensitive to the presence of outliers. Possible effects of outliers on the parameter estimation have been termed masking and swamping.
Outliers can mask their own presence by influencing the values of the distribution parameters (resulting in false negatives), or swamp inliers to appear as outlying due to the influenced parameters (resulting in false positives) [6, 9].

Non-parametric approaches do not assume a specific distribution of the data, but estimate (explicitly or implicitly) certain aspects of the probability density. Non-parametric methods include the well-known distance-based and density-based methods. Both distance-based and density-based methods basically aim at providing a rather simple estimate of the density around points, which can be seen as an approximation of statistical kernel density estimates. Distance-based methods such as DB-outlier [25] and its variants are based on the k nearest neighbor (kNN) distances [5, 34], trying to find so-called global outliers as points that are, roughly speaking, far away from the rest of the data. Density-based methods such as LOF [0] and its variants try to find so-called local outliers as points that are, roughly speaking, located in an area of relatively low density compared to their kNN (intended to indicate points that are outliers with respect to the nearest mode in the data distribution). The density around points in these methods is also estimated based on kNN distances.

One problem with distance-based and density-based methods is that they can also suffer from effects similar to masking and swamping, due to the simplicity of (and thus error in) the density estimates. Another problem is the typically high runtime of these approaches, due to the fact that their computation includes at least finding the kNN of each data point (resulting in an at least quadratic complexity w.r.t. the database size).

In this paper, we address both problems of distance-based and density-based methods. We propose and study a general approach to improve both the quality and the performance of such outlier detection methods by combining into an ensemble the results of a base method on subsamples of the data. Previous work on outlier ensembles is very limited and only shows empirically that ensembles of outlier detectors have the potential to improve the quality, compared to that of their base methods [30, 36], at an increased runtime cost. Our work is novel and advances the area of outlier detection in the following respects:

We argue theoretically and demonstrate empirically that it is possible to construct ensemble members for outlier detection methods which individually already perform better than the base method, in general.

Combining those outlier detectors into an ensemble renders the performance gain not only more robust but can improve the performance even further.

At the same time, when using small sample sizes for the ensemble members, we can gain considerable speed-up in runtime compared to running a standard ensemble and, for small ensemble sizes, even compared to running the base method on the whole data set.

The proposed principle is fundamental and flexible. It does not rely on specific data types. It can be combined with various conventional outlier detection techniques.

The rest of the paper is organized as follows: We discuss related work on outlier detection and ensembles for outlier detection (Section 2). We provide theoretical reasoning to support outlier detection ensembles in general and the claimed properties of our method in particular (Section 3).
We provide experimental results to support our claims empirically (Section 4). We conclude the paper in Section 5.

2. RELATED WORK

The distance-based notion of outliers (DB-outlier) [25] was the first database-oriented approach in the area of unsupervised outlier detection, which initiated a new line of research on this topic in the data mining community. Variants of DB-outliers consider the distances to the k nearest neighbors of each object and use these distances to rank the objects [34], or they use the sum of distances to all points within the set of kNN (called the "weight") as an outlier degree [5]. These methods are also called "global" methods in that the computed outlier scores represent global density scores for each point. The so-called "local" methods, e.g. LOF [0], consider instead local density scores, which are ratios between the density around an object and the density around its neighboring objects. Variants of the local outlier model include LoOP [27] and LOCI [33]. The distance-based method LDOF [44] is also related in reasoning about local comparisons. It has been shown recently [37], however, that the differentiation between global and local methods is not strictly dichotomous but that there are degrees of locality.

Much research has aimed at improving the efficiency of unsupervised outlier detection by algorithmic techniques, for example based on approximations or improved pruning techniques for mining the top-n outliers [4, 7, 22, 23, 26, 42]. An analysis of such efficiency-improving techniques for outlier detection algorithms has been provided by Orair et al. [32]. These techniques, however, do not aim at improving the approximations of the underlying statistical notion of outlierness. They only approximate a specific algorithmic model.
Ensemble techniques, on the other hand, have the potential to improve the performance of their components in terms of the quality of the detected outliers, rather than in terms of runtime (but we will show in this paper that it is even possible to gain runtime improvements when constructing certain types of outlier ensembles).

The first approach to improve outlier detection by ensemble techniques, based on feature bagging, was proposed by Lazarevic and Kumar [30], combining different results of the same algorithm (namely LOF [0]) applied to different, randomly selected feature subsets. Feature bagging is a common procedure to induce diversity of ensemble members in ensemble classification [] or ensemble clustering [8, 4, 40]. Subsequent research on outlier detection ensembles focused on the issue of comparability of scores for score combinations, using sigmoid functions and mixture modeling to fit outlier scores, provided by different detectors, into comparable probability values [7], or scaling by standard deviation [3], or statistical reasoning about score distributions [28], enabling the combination of different outlier detection methods into one ensemble. Schubert et al. [36] proposed a similarity measure to appropriately compare different outlier rankings (based on scores) and to allow for the assessment of the diversity of different outlier detectors. As an application, they proposed a greedy ensemble approach, demonstrating the importance of diversity for the performance of an ensemble.

In all these papers, although outlier detection ensembles have been discussed and improved, no new method of inducing diversity has been pursued. Except for feature bagging [30], all other existing ensemble methods for outlier detection [7, 28, 3, 36] are meta-methods and could be used on top of our sample-based method (or on top of feature bagging, as in [28,3,36]). They do not propose original means to induce diversity when using a selected base outlier detection method.
In general, while the motivation for ensemble methods for outlier detection is borrowed from the rich tradition in the literature on supervised ensemble learning [,2,2,4], the theoretical foundation for ensemble learning in the unsupervised setting is far less mature. The same holds true not only for outlier detection ensembles but also for clustering ensembles, despite the far more abundant literature on practical approaches in that area [8].

Although the problem setting is considerably different, let us finally note that sampling has been used in ensemble clustering to induce diversity. Different subsamples of the data set have been clustered and the resulting clusterings were combined into a consensus clustering [3, 6, 20, 39].

3. OUTLIER DETECTION ENSEMBLES BASED ON SUBSAMPLING

In this section, we will discuss the potential benefits of using outlier detection ensembles based on subsampling. Previous approaches using ensemble learning for outlier detection [7, 28, 30, 3, 36] transferred techniques that have a clear theoretical background in supervised learning, without any theoretical foundation for why they should also work in unsupervised outlier detection. Such a view can be loosely argued for when we consider outlier detection methods as classifiers. When assuming that a threshold on outlier scores is used to distinguish between outliers and inliers, we can view the outlier method as classifying all objects into one of these two classes, outliers and inliers, even though no labels are used in the training phase when the model (ranking) is built. If we succeed in constructing sufficiently diverse outlier detectors for the same data set, we can hope to improve the overall performance over the individual members by combining them into an ensemble. The generic argument given is that all the ensemble members are committing errors but on different cases, if the members are independent, i.e., diverse, or, in other words, if the errors are uncorrelated. While such a generic view may potentially explain some of the performance gains, we will show in the following subsections that there are more specific reasons for why (under some general assumptions) an ensemble of outlier detection methods can improve the performance over its individual members.

3.1 Benefits of Ensembles for Outlier Detection Based on Density Estimates

In this paper, we are focusing on distance-based and density-based outlier detection methods, which, as discussed in the introduction, compute outlier scores that are based, implicitly or explicitly, on some form of density estimates. One can view these methods as trying to identify the outliers in a given data set X with respect to an unknown probability density f, which represents the process that has generated the majority of the data set (at least the inliers). The data set X itself can be viewed as a sample drawn from the true, but unknown underlying density distribution, and the methods try to estimate the density f(x) around points x using a more or less rough density estimate $\hat{f}_X(x)$ (in order to compute outlier scores in some way).
Assuming the correctness of the underlying outlier model of the methods, it is clear that the quality of a method's result depends on the quality of the density estimate $\hat{f}_X(x)$ and that the results will improve if the estimate can be improved. For this case, we can show formally that a diverse ensemble of such outlier detectors does in fact show an improved expected performance over the individual ensemble members, under some general conditions.

Given a true, smooth p.d.f. f(x) and a data set X, we can express an estimate $\hat{f}_X(x)$ of f(x) based on X as:

\[ \hat{f}_X(x) = f(x) + v_X(x) \]

where $v_X(x)$ is a random variable describing the error of the estimate due to the finite sample. The quality of the estimate $\hat{f}$ of f decides over success and failure of the outlier detection. However, the density estimates used by the considered outlier detection algorithms may not be reliable and stable in all regions of the data space, due to the natural intrinsic randomness associated with a single sample that the data set represents. If we are able to obtain multiple density estimates for each point x (e.g., as we propose via subsamples), we can obtain more reliable and stable density estimates by averaging the multiple density estimates for each point.

The rationale for this is the following: The output of outlier methods is a ranking of all points x in terms of outlier scores that, in essence, depends on the ranking of the points according to $\hat{f}_X(x)$. Ideally, we want a ranking of the points x according to f(x). If we have multiple density estimates for each point that we average, we can consider the estimate itself as a random variable, and averaging these estimates for each point gives us the expectation of this variable as:

\[ E\{\hat{f}_X(x)\} = E\{f(x)\} + E\{v_X(x)\} = f(x) + E\{v_X(x)\} \]

In this formulation, one can clearly see that the ranking of objects w.r.t. $E\{\hat{f}_X(x)\}$ is the same as the ranking w.r.t. the true density f(x) (the "ideal ranking"), if just the expectation of the error $v_X(x)$ in the individual estimates is the same for every point x.
This is obviously the case when the random variable that describes the error does not depend on x, in which case $E\{v_X(x)\} = E\{v_X\} = \mu_{v_X}$, but one would also obtain the ideal ranking when the error is not independent of x; for instance, when the error varies between points but the expectation is the same for each point, we would also have the same ranking. We can even obtain the same ranking as the ideal ranking if the expectations $E\{v_X(x_1)\}$ and $E\{v_X(x_2)\}$ differ for two points $x_1$ and $x_2$, as long as the difference does not cause an inversion between the actual ranks of $E\{\hat{f}_X(x_1)\}$ and $E\{\hat{f}_X(x_2)\}$, respectively. Furthermore, if we consider that for successful outlier detection, the methods only have to distinguish between outliers and inliers, we can even allow inversions between ranks, as long as rank inversions occur only within outliers or within inliers. Only a rank inversion between an outlier and an inlier would be problematic.

In the next subsection, we will argue that for the proposed ensemble technique using subsamples, the expectation of the error in the density estimate $E\{v_X(x)\}$ does depend on the location x and its surrounding density, but that the method has the desirable property that it can increase the gap in ranks between the outliers and the inliers, making inversions in rank between these groups of points even less likely.

3.2 Additional Benefits of Subsampling

Subsampling is theoretically well suited to introduce diversity into an ensemble of otherwise identical distance-based or density-based outlier detection methods. Every member of the ensemble will determine the outlier score of every object in the database, but only using a small subset of the data to estimate the density around points. Learning density estimates for outlier detection on smaller samples can actually improve the detection rate of outliers, compared to learning these estimates on the whole data set that conceptually represents just a somewhat larger sample of an unknown distribution f.
We will see in the empirical evaluation that in practice, surprisingly small sample sizes (such as 20% or in many cases even just 10%) typically lead not to a deteriorated but to a considerably improved quality of the outlier detection for a sample-based ensemble of outlier detectors.

One reason for the improved performance of an ensemble is, as expected, just the combination of the results of multiple outlier detectors. Compared to using the dataset as the only sample drawn from f, drawing multiple subsamples from this sample can minimize the effect of the randomness associated with a single sample. Note that averaging the scores to build an ensemble has been, heuristically, common practice [7, 28, 30, 3, 36], but now it also finds a theoretical justification.
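The averaging argument can be illustrated with a small simulation. The following is only a sketch under simplifying assumptions that are not taken from the paper: the "true densities" are an arbitrary increasing sequence, the estimation error is zero-mean Gaussian noise whose expectation is the same for every point, and the function names are hypothetical. It counts how many pairs of points are ranked differently from the ideal ranking by a single noisy estimate and by the average of 25 independent noisy estimates.

```python
import random


def count_inversions(true_vals, est_vals):
    """Count pairs of points that the estimate orders differently from the truth."""
    n = len(true_vals)
    inversions = 0
    for i in range(n):
        for j in range(i + 1, n):
            if (true_vals[i] - true_vals[j]) * (est_vals[i] - est_vals[j]) < 0:
                inversions += 1
    return inversions


def noisy_estimate(f, sigma, rng):
    """One density estimate: true value plus zero-mean error (assumed Gaussian)."""
    return [fx + rng.gauss(0.0, sigma) for fx in f]


def averaged_estimate(f, sigma, s, rng):
    """Average of s independent noisy estimates, one per ensemble member."""
    runs = [noisy_estimate(f, sigma, rng) for _ in range(s)]
    return [sum(column) / s for column in zip(*runs)]


rng = random.Random(0)
f = [0.01 * i for i in range(1, 101)]  # hypothetical true densities of 100 points

single = noisy_estimate(f, 0.2, rng)
ensemble = averaged_estimate(f, 0.2, 25, rng)

single_inv = count_inversions(f, single)
ensemble_inv = count_inversions(f, ensemble)
print("inversions, single estimate:  ", single_inv)
print("inversions, averaged estimate:", ensemble_inv)
```

Because the error expectation is identical for all points in this toy setting, averaging shrinks the error variance without shifting the ranking, so the averaged estimate should show far fewer rank inversions than the single one.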

Another, more interesting reason for the improved performance is that the base method applied to a smaller subsample of a given data set often shows an improved outlier detection rate, compared to the same method applied to the whole data set. As we will argue formally in the following, this is due to the fact that distance-based and density-based methods are essentially using simple (not volume normalized) k nearest neighbor distances to estimate density.

To understand the effects of sample-based nearest neighbor distances, consider a sphere of radius r in a d-dimensional Euclidean space, containing n data points uniformly distributed within the sphere. The expected Euclidean distance $d_k$ from a point to its k-th nearest neighbour (kNN) is given by [9]:

\[ E\{d_k\} = r \left(\frac{k}{n}\right)^{1/d} \qquad (1) \]

For a given data set, let r be a constant value small enough so that, for two spheres having the same radius r but lying at different positions of the data space, the data points within both spheres are approximately uniformly distributed. Now, suppose that the number of data points within each of these spheres is different, given by $n_1$ and $n_2$ ($n_1 \neq n_2$), which means that the densities of the data in the respective regions of the space are different (as their volumes are the same). For example, one sphere might be located inside a dense cluster, whereas the other one might lie in a sparse area containing background noise. Then, it follows from (1) that the expected kNN distances in the corresponding regions of the space are given by:

\[ E\{d_k^{(1)}\} = r \left(\frac{k}{n_1}\right)^{1/d}; \qquad E\{d_k^{(2)}\} = r \left(\frac{k}{n_2}\right)^{1/d} \qquad (2) \]

If one randomly removes a fraction $1 - m$ of the data objects with equal probability, the expected numbers of remaining objects within those two spheres are given by $n_1 m$ and $n_2 m$, respectively.
In this case, the expected kNN distances become:

\[ E\{d_k^{(1)}\} = r \left(\frac{k}{n_1 m}\right)^{1/d}; \qquad E\{d_k^{(2)}\} = r \left(\frac{k}{n_2 m}\right)^{1/d} \qquad (3) \]

The differences in the expected distances are therefore:

\[ \Delta_1 = r \left(\frac{k}{n_1 m}\right)^{1/d} - r \left(\frac{k}{n_1}\right)^{1/d} = r \left(\frac{k}{n_1}\right)^{1/d} \left( \left(\frac{1}{m}\right)^{1/d} - 1 \right) \qquad (4) \]

\[ \Delta_2 = r \left(\frac{k}{n_2 m}\right)^{1/d} - r \left(\frac{k}{n_2}\right)^{1/d} = r \left(\frac{k}{n_2}\right)^{1/d} \left( \left(\frac{1}{m}\right)^{1/d} - 1 \right) \qquad (5) \]

In relative terms, if we divide $\Delta_1$ and $\Delta_2$ by the original expected distances (for the full dataset, i.e., before the subsampling), we get:

\[ \frac{\Delta_1}{r \left(\frac{k}{n_1}\right)^{1/d}} = \frac{\Delta_2}{r \left(\frac{k}{n_2}\right)^{1/d}} = \left(\frac{1}{m}\right)^{1/d} - 1 \qquad (6) \]

The result in (6) says that the expected kNN distances within the spheres increase proportionally as a function of the subsampling rate m. This result reflects the intuition that, in relative terms, the contrast between the densities of the spheres is kept constant, which justifies the use of a subsampling procedure with even sampling probabilities. In an ensemble setting, for instance, this means that one can get multiple (sub)samples that exhibit variability (diversity) in terms of their observations, but keep the same expected density profile as the full dataset.

[Figure 1: Behaviour of the expected 5-NN distances for two spheres with radius r = 1, in a 2D Euclidean space, containing 1000m (circles) and 100m (triangles) objects uniformly distributed (m is a fraction of the data). Axes: fraction of data (m) vs. expected kNN distances.]

The above result is important but it does not explain all implications of subsampling when using unnormalized nearest neighbor distances. In absolute terms, Equations (4) and (5) tell us that the expected difference in the kNN distances will be greater for a less dense sphere, i.e., $\Delta_1 > \Delta_2$ if $n_1 < n_2$. This means that the expected kNN distances diverge in absolute terms when the data are downsampled to a fraction m of their original size. In other words, the absolute differences between the expected kNN distances in areas of different densities tend to increase as a function of the subsampling rate. This effect is illustrated in Figure 1 for r = 1, d = 2, k = 5, $n_1 = 100$, $n_2 = 1000$, and m ranging from 0.1 to 1.
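The expected-distance formulas above can be checked numerically. The sketch below assumes the reading of Equation (1) as r (k/n)^(1/d) and uses the Figure 1 setting (r = 1, d = 2, k = 5, spheres with 100 and 1000 points); the function names are hypothetical and chosen only for this illustration.

```python
def expected_knn_distance(r, k, n, d):
    """Expected kNN distance for n uniform points in a d-dimensional sphere of radius r."""
    return r * (k / n) ** (1.0 / d)


def absolute_increase(r, k, n, d, m):
    """Increase of the expected kNN distance after keeping only a fraction m of the points."""
    return expected_knn_distance(r, k, n * m, d) - expected_knn_distance(r, k, n, d)


def relative_increase(d, m):
    """Relative increase, which depends only on m and d, not on the local density."""
    return (1.0 / m) ** (1.0 / d) - 1.0


r, d, k = 1.0, 2, 5
n_sparse, n_dense = 100, 1000

for m in (1.0, 0.5, 0.2, 0.1):
    delta_sparse = absolute_increase(r, k, n_sparse, d, m)
    delta_dense = absolute_increase(r, k, n_dense, d, m)
    print(f"m={m:.1f}  delta_sparse={delta_sparse:.4f}  "
          f"delta_dense={delta_dense:.4f}  relative={relative_increase(d, m):.4f}")
```

For every m below 1 the relative increase is the same for both spheres, while the absolute increase is larger for the sparse sphere, which is the widening gap the text describes.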
Such an effect can be beneficial for outlier detection, since it can make it easier to distinguish between outliers and inliers. Particularly when also using an ensemble as discussed above, the gap in the ranks between outliers and inliers can increase, making inversion of ranks between these two groups less likely.

3.3 Method and Complexity

Note that the implementation of our proposal is not as simple as taking subsamples and then running the outlier detection algorithms on these subsamples. This way we would very likely completely miss information on the outlierness of many objects that are not contained in any subsample, and many objects would get scores only from some of the subsamples. Instead, for each ensemble member, we draw a subsample from the database and compute the neighborhood of each object in the database based on the subsample. This way, using subsample-based ensembles can also lead to a considerable speed-up, compared to other types of ensembles and, for small subsamples and ensemble sizes, even compared to running the base method on the whole data set. We will demonstrate in the experimental evaluation that sample sizes small enough to achieve substantial runtime improvements are good choices in practice, leading to good outlier detection rates. In this subsection, we show the expected runtime improvements by studying the theoretical complexities.
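The procedure described above, where every object is scored against each subsample and the scores are averaged, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses the plain kNN distance as a stand-in base score (the paper's experiments use LOF, LDOF, and LoOP via ELKI), a brute-force neighbor search, and hypothetical function and parameter names. It assumes the data set has more than k + 1 objects.

```python
import math
import random


def subsampling_ensemble_scores(data, k, sample_fraction, n_members, seed=0):
    """Average kNN-distance score of every object over several random subsamples.

    Each ensemble member draws its own subsample, but scores ALL objects:
    the neighbors of an object are searched only among the subsample.
    """
    rng = random.Random(seed)
    n = len(data)
    # Keep at least k + 1 sample points so k neighbors remain after excluding self.
    sample_size = min(n, max(k + 1, int(round(sample_fraction * n))))
    totals = [0.0] * n
    for _ in range(n_members):
        sample_idx = rng.sample(range(n), sample_size)
        for i, x in enumerate(data):
            # Distances to the subsample only, never to the object itself.
            dists = sorted(math.dist(x, data[j]) for j in sample_idx if j != i)
            totals[i] += dists[k - 1]
    return [total / n_members for total in totals]


# Toy data: one Gaussian cluster plus a single far-away point (index 200).
rng = random.Random(1)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
data.append((8.0, 8.0))

scores = subsampling_ensemble_scores(data, k=3, sample_fraction=0.1, n_members=10)
top = max(range(len(data)), key=scores.__getitem__)
print("object with highest averaged score:", top)
```

Each member performs one neighbor search per object over only a fraction of the data, which is where the runtime saving discussed in this subsection comes from.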

While other ensemble methods require a multiple of the computing time of the base learner, a subsample-based ensemble is theoretically faster (and requires fewer resources) than other types of ensembles. The typical complexity of a base method is $O(n^2)$, due to the required kNN queries over a database of n objects. The runtime of a standard ensemble such as feature bagging is essentially s times the runtime of the base method, where s is a factor that is determined by the number of base learners used in the ensemble (i.e., the size of the ensemble). This factor is somewhat reduced in the case of feature bagging, since using only a subset of the dimensions makes individual distance computations faster by some constant factor. For sample-based ensembles, on the other hand, the complete ensemble can even be faster than the base method on the complete dataset, because of the quadratic runtime in n of the base method. While the base method requires kNN queries for each object on the complete database (hence $O(n^2)$), using a subsample of size $m \cdot n$, $0 < m < 1$, reduces this to $O(n^2 m)$. The runtime of a sample-based ensemble is essentially s times the runtime of the base method using a much smaller data set for the neighborhood computation. For an ensemble size of 10 base learners and a sample size of 10%, the sample-based ensemble would require roughly the same runtime as a single base method on the full dataset but 10 times less time than an ensemble with the same number s of ensemble members based on other means of diversity. For larger ensembles, the ensemble requires only a small multiple of the runtime of the base method but still only 10% (or the equivalent of the sample size m) of the runtime of a standard ensemble. For example, if we use 25 ensemble members and sample size 10%, the ensemble will require roughly 2.5 times the runtime of the base method.

4. EVALUATION

4.1 Methods and Parameters

For the reasons discussed in Section 2, the canonical competitor is feature bagging (FB) [30]. As base methods we use LOF [0], LDOF [44], and LoOP [27].
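The runtime arithmetic in this subsection can be written down as a simple cost model. The sketch below only counts distance computations under the stated quadratic assumption and ignores constants, indexing, and the per-dimension savings of feature bagging; the function names are hypothetical.

```python
def base_cost(n):
    """Distance computations of the base method: one kNN query per object over n objects."""
    return n * n


def standard_ensemble_cost(n, s):
    """An ensemble of s members that each run the base method on the full data."""
    return s * base_cost(n)


def subsampling_ensemble_cost(n, s, m):
    """s members; each scores all n objects against a subsample of size m * n."""
    return s * n * (m * n)


n = 10_000
for s, m in ((10, 0.1), (25, 0.1), (25, 0.2)):
    vs_base = subsampling_ensemble_cost(n, s, m) / base_cost(n)
    vs_standard = subsampling_ensemble_cost(n, s, m) / standard_ensemble_cost(n, s)
    print(f"s={s:2d}, m={m:.1f}: {vs_base:.2f}x base method, "
          f"{vs_standard:.2f}x standard ensemble")
```

Under this model the ratio to the base method is simply s times m, and the ratio to a standard ensemble of the same size is m, matching the 10-member and 25-member examples in the text.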
For the setup of experiments, we have to consider various parameters. For both ensemble methods (feature bagging and subsampling), we choose a fixed number of 25 ensemble members. We follow the original setup of the feature bagging method, combining the scores of the ensemble members by computing the average. For the subsampling, we consider various sample sizes. Each of the base methods requires a size k of the neighborhood. Hence we will show experimental results (i) with a fixed choice of k and varying sample size; (ii) with a fixed sample size, varying k; and (iii) with fixed choices of k and sample size, comparing different base methods. When we fix k, we choose a value that gives a reasonable result quality (i.e., better than random) for the base method and compare that to the ensemble variants. Finally (iv), for the synthetic dataset collections, where the individual datasets follow the same general characteristics, we show an average behaviour over all datasets of the collection.

We report the area under the receiver operating characteristic curve (ROC AUC), which plots the true positive rate vs. the false positive rate, a common measure for evaluation of outlier detection methods [7, 28, 30, 3, 36]. The experiments are performed using ELKI [2, 3].

4.2 Datasets

For a statistical assessment, we generate two independent sets of 30 synthetic datasets (batch1 and batch2). For each dataset, we choose randomly values for the following parameters in the given range: dimensionality $d \in [20, \ldots, 40]$, number of clusters $c \in [2, \ldots, 10]$, and, for each cluster independently, the number of points $n_{c_i} \in [600, \ldots, 1000]$. For each cluster, the points are generated following a Gaussian model as follows: For each cluster $c_i$ and each attribute a, we choose a mean $\mu_{c_i,a}$ from a uniform distribution in $[-10, 10]$ and a standard deviation $\sigma_{c_i,a}$ from a uniform distribution in $[0.1, 1]$. Then for the cluster $c_i$, $n_{c_i}$ cluster objects (points) are generated attribute-wise by the Gaussians $N(\mu_{c_i,a}, \sigma_{c_i,a})$.
The resulting cluster is rotated by a series of random rotations and the covariance matrix Σ corresponding to the theoretical model is computed by the corresponding matrix operations [38]. Then, we compute for each point the Mahalanobis distance to its corresponding cluster center, using the covariance matrix Σ of the cluster. For a dataset dimensionality d, the (squared) Mahalanobis distances for each cluster follow a $\chi^2$ distribution with d degrees of freedom. We label as outliers those points that exhibit a distance to their cluster center larger than the theoretical 0.975 quantile, independently of the actually occurring Mahalanobis distances of the sample points. This results in an expected amount of 2.5% outliers per dataset.

As real datasets we use the datasets Satimage, Lymphography, and Segment (used also by Lazarevic and Kumar [30]). Additionally, we chose from the UCI machine learning repository [5]: Wisconsin breast cancer (WBC) and Waveform Database Generator (waveform). While Lazarevic and Kumar consider outlier detection as equivalent to rare class detection, we argue that outliers are bound to be rare, but objects of a rare class are not necessarily outliers. Therefore, we use a different preprocessing for some of the datasets: For Satimage, we combined train and test set and transformed the dataset to an outlier task by taking a sample of 10% from class 2, evaluating the downsampled class as outliers vs. the rest (see footnote 2). For Lymphography, we merged the small classes 1 & 4 as outliers vs. the rest. For Segment, we chose classes GRASS, PATH, and SKY for downsampling, in turn, to 10%, which renders the remaining objects of these classes outliers (resulting in three different datasets). For the datasets WBC and waveform we also select a meaningful outlier class for downsampling ("malignant" and "0", respectively). With this method of using classification data for evaluation of outlier detection methods we conform with the literature [, 24, 29, 43, 44]. Overall, this results in 60 synthetic and 7 real data sets.
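The labeling rule for the synthetic data can be sketched for a single cluster. This is a simplified illustration rather than the authors' generator: it skips the random rotations (so the covariance matrix stays diagonal), it assumes the parameter ranges [-10, 10] for means and [0.1, 1] for standard deviations as read from the description above, and it approximates the chi-square quantile with the Wilson-Hilferty formula instead of an exact inverse CDF. All function names are hypothetical.

```python
import random
from statistics import NormalDist


def chi2_quantile_wh(p, dof):
    """Approximate chi-square quantile via Wilson-Hilferty (an approximation, not exact)."""
    z = NormalDist().inv_cdf(p)
    a = 2.0 / (9.0 * dof)
    return dof * (1.0 - a + z * a ** 0.5) ** 3


def generate_labeled_cluster(n_points, dim, rng, quantile=0.975):
    """Axis-aligned Gaussian cluster; points beyond the theoretical quantile are outliers."""
    means = [rng.uniform(-10.0, 10.0) for _ in range(dim)]
    sigmas = [rng.uniform(0.1, 1.0) for _ in range(dim)]
    threshold = chi2_quantile_wh(quantile, dim)
    points, labels = [], []
    for _ in range(n_points):
        x = [rng.gauss(mu, s) for mu, s in zip(means, sigmas)]
        # Squared Mahalanobis distance for a diagonal covariance matrix.
        d2 = sum(((xi - mu) / s) ** 2 for xi, mu, s in zip(x, means, sigmas))
        points.append(x)
        labels.append(d2 > threshold)
    return points, labels


rng = random.Random(42)
points, labels = generate_labeled_cluster(4000, 20, rng)
outlier_fraction = sum(labels) / len(labels)
print(f"fraction labeled as outliers: {outlier_fraction:.4f}")
```

Because the threshold comes from the theoretical distribution rather than from the sample, the observed outlier fraction fluctuates around the expected 2.5% instead of matching it exactly.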
4.3 Efficiency

For a fair comparison, we use a preprocessing of the neighborhood computation for all methods on equal terms, as facilitated by the framework ELKI [2]. As in our experiments we use 25 ensemble members, we study the runtime of a typical base method (LOF), the subsampling ensemble (10% sample size) and feature bagging, when scaling the number of objects in the database. As demonstrated in Figure 2, the subsampling ensemble is close to the base method while feature bagging requires a multiple of the runtime.

Footnote 2: Lazarevic and Kumar use the smallest class 4 as outlier vs. rest, but this is an example where the rare class does not constitute outliers, as the classes 3-7 are all very similar. Accordingly, they report performance very close to a random result on this dataset.

[Figure 2: Runtime of LOF, subsampling ensemble, and feature bagging when increasing database size. Axes: instances in dataset vs. time (s). Curves: feature bagging ensemble, subsampling ensemble, base method (LOF).]

[Figure 3: Quality with increasing ensemble size. Horizontal axis: no. ensemble members.]

As discussed in Section 3.3, the efficiency depends on the sample size and on the ensemble size. We do not evaluate the ensemble size further; let us just consider an example on one of the synthetic datasets to study the behaviour when adding more ensemble members (Figure 3). We see a strong increase in quality between 2 and 10 ensemble members; then, up to 25 ensemble members, the quality increases further, steadily but slowly. This improved performance comes at moderate runtime cost. Nevertheless, we fix the ensemble size to 25 in the following experiments.

4.4 Effectiveness

For illustration of results with variances we use box plots where the box extends from the lower to upper quartile values of the data, with a line at the median. The whiskers extend from the box to show the range of the data. The whiskers extend to the most extreme data point within 1.5*(75%-25%) data range. Occasionally occurring single data points beyond that range are plotted as flier points past the end of the whiskers. Note however that the source of variance in the plots will differ: in synthetic data, we give the distribution over the 30 datasets; in real data, we give the distribution over the individual ensemble members.

Synthetic Data. First, we show as a statistical assessment the results of the subsample-based ensemble over all the synthetic datasets of batch1. Here the box plots visualize the distribution of the results for the same sample size, the same base method, and the same parametrization of the base method for all datasets in the batch for the subsampling ensemble, the base method (sample size 1), and the feature bagging ensemble (FB).
Figure 4 shows examples for a fixed k = 3 for the base methods LDOF, LOF, and LoOP. The behaviour on batch2 (not shown) follows the same general pattern. We varied k from 2 to 10 and got similar results. The smaller sample size leads to larger improvements.

[Figure 4: ROC AUC for ensembles with different sample sizes as well as feature bagging (FB) and base method (sample size = 1), on the 30 datasets of batch1. Panels: (a) LDOF, k = 3; (b) LOF, k = 3; (c) LoOP, k = 3.]

Real Data. Having shown the ensemble performances over a set of 30 datasets for the synthetic data, we now analyze the behaviour on individual real datasets. Here, the whisker plots show the variance in the ROC AUC achieved by the individual ensemble members based on subsamples of different sample size (zero variance for sample size 1, which reflects the performance of the deterministic base method on the complete data), and feature bagging (FB). The ROC AUC of the ensembles (subsampling and feature bagging) is visualized by a diamond. Figures 5, 6, and 7 show the results for the three base methods on the datasets Lymphography, WBC, and Satimage-2, respectively. We choose the same k for all base methods such that at least some of the base methods get reasonable results. For the larger dataset Satimage-2, k needs to be larger as well. Comparing these plots, we see a different behaviour of the base methods, as some datasets are easy for some base methods while other datasets are relatively hard. In particular, LDOF does not retrieve sensible results on all three datasets. In all cases, however, the subsampling ensemble improves over the base method. Feature bagging does not always perform that convincingly; in some cases it drops to (or below) random quality. Only for LDOF and LoOP on Lymphography (Figures 5(a), 5(c)) can feature bagging recover from the weak performance of the base learner.

[Figure 5: ROC AUC for ensemble members of the subsampling ensemble for different sample sizes (boxes), the base method (sample size = 1), and ensembles (diamonds) on top of subsamples and feature bags (FB) on dataset Lymphography. Panels: (a) LDOF; (b) LOF; (c) LoOP.]

[Figure 6: ROC AUC for ensemble members of the subsampling ensemble for different sample sizes (boxes), the base method (sample size = 1), and ensembles (diamonds) on top of subsamples and feature bags (FB) on dataset WBC. Panels: (a) LDOF; (b) LOF; (c) LoOP.]

As a general picture from these and other results, we see that the smaller sample size actually has the larger potential for improvement. Although the smaller sample does not keep as much information about the dataset (and the unknown underlying density distribution), from the point of view of ensemble learning these findings make sense, as the smaller samples will actually provide the most diverse ensemble members; this also shows the practical applicability of the reasoning we provided in Section 3.2. In most cases, we find the 10% sample to work best. However, the break-even point between too much loss of information and too high similarity of ensemble members differs from dataset to dataset. We also have examples where the 10% sample is already too small, such as in Figure 5(a). That is possibly related to the fact that the Lymphography data are relatively small. However, we fix the sample size to 0.1 for the following experiments and explore the behaviour of base method, subsampling ensemble, and feature bagging ensemble over a range of k. We see, as an example, in Figure 8, a slight but steady increase of the ROC AUC with k for the base methods and the subsampling ensemble, while the feature bagging ensemble appears to be much more unstable.
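To make the evaluated setup concrete, the following is a minimal sketch of a subsampling ensemble together with an ROC AUC computation. Several choices are assumptions for illustration and are not confirmed by this section: the base detector is a plain k-nearest-neighbor distance score (a simple stand-in for LOF, LDOF, or LoOP), member scores are combined by simple averaging, and the toy data and all function names (`knn_scores`, `subsampling_ensemble`, `roc_auc`) are hypothetical. The defaults mirror the settings named in the text (sample size 0.1, 25 ensemble members).

```python
import math
import random


def knn_scores(data, sample_idx, k):
    """Score every object by the distance to its k-th nearest neighbor in the subsample."""
    scores = []
    for i, p in enumerate(data):
        # Neighbors are searched only among the sampled objects (excluding the object itself).
        dists = sorted(math.dist(p, data[j]) for j in sample_idx if j != i)
        scores.append(dists[k - 1])
    return scores


def subsampling_ensemble(data, k=3, sample_rate=0.1, members=25, seed=0):
    """Average the scores of several detectors, each built on a random subsample."""
    rng = random.Random(seed)
    n = len(data)
    # At least k + 1 sampled objects so that every object has k neighbors besides itself.
    size = min(n, max(k + 1, round(sample_rate * n)))
    total = [0.0] * n
    for _ in range(members):
        sample_idx = rng.sample(range(n), size)  # sampling without replacement
        for i, s in enumerate(knn_scores(data, sample_idx, k)):
            total[i] += s
    return [t / members for t in total]


def roc_auc(scores, labels):
    """Fraction of (outlier, inlier) pairs ranked correctly; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: one dense cluster plus three planted, far-away outliers.
rng = random.Random(42)
inliers = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(100)]
outliers = [(10.0, 10.0), (-10.0, 10.0), (10.0, -10.0)]
data = inliers + outliers
labels = [0] * len(inliers) + [1] * len(outliers)

scores = subsampling_ensemble(data, k=3, sample_rate=0.1, members=25, seed=0)
auc = roc_auc(scores, labels)
```

On such well-separated toy data the three planted outliers receive the highest averaged scores. In this sketch each member compares objects only against its subsample, which is one plausible reading of why the runtime stays close to that of the base method in Section 4.3.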
While increasing k does not, in general, increase the quality of the results, we observe the same pattern of stability of the base method and the subsampling ensemble and higher variance of the feature bagging ensemble on other datasets as well. For the three datasets based on segment, for k = 20 (again a selection that gives reasonable results for most of the base methods), we show results for all three base methods in Figure 9. Again, the subsampling ensemble compares favourably against the base method as well as against feature bagging.

5. CONCLUSION

Although we compared the sample-based ensemble against feature bagging [30], let us finally note that these two approaches are not strictly competitors. Feature bagging is likely to be an interesting approach in the context of very
high-dimensional data [45]. Sampling should be helpful when the datasets are growing too large. On the other hand, feature bagging is not meaningful for low-dimensional data, as the ensemble members are bound to be too similar. And sampling on too small data is probably not too promising. However, these two problems (too small datasets with only a few dimensions) are not really problems of today's research. It might be an interesting question for future work to investigate the integration of both techniques, building ensembles on subsets of features and subsets of data objects simultaneously.

[Figure 7: ROC AUC for ensemble members of the subsampling ensemble for different sample sizes (boxes), the base method (sample size = 1), and ensembles (diamonds) on top of subsamples and feature bags (FB) on dataset Satimage-2. Panels: (a) LDOF; (b) LOF; (c) LoOP, k = 50.]

[Figure 8: ROC AUC for base methods and corresponding ensembles, varying k, on dataset Waveform. Panels: (a) LDOF, m = 0.1; (b) LOF, m = 0.1; (c) LoOP, m = 0.1. Curves: subsampling ensemble, base method, feature bagging ensemble.]

[Figure 9: ROC AUC for all methods, k = 20, on different datasets (variants of segment). Panels: segment-sky, segment-path, segment-grass. Methods: KNN, KNNW, LDOF, LOF, LoOP; variants: base, subsampling, FB.]

Acknowledgments

This work has been partially supported by NSERC (Canada), FAPESP (Brazil), and CNPq (Brazil).

6. REFERENCES

[1] N. Abe, B. Zadrozny, and J. Langford. Outlier detection by active learning. In Proc. KDD.
[2] E. Achtert, S. Goldhofer, H.-P. Kriegel, E. Schubert, and A. Zimek. Evaluation of clusterings - metrics and visual support. In Proc. ICDE, 2012.
[3] E. Achtert, H.-P. Kriegel, E. Schubert, and A. Zimek. Interactive data mining with 3D-parallel-coordinate-trees. In Proc. SIGMOD, 2013.
[4] F. Angiulli and F. Fassetti. DOLPHIN: an efficient algorithm for mining distance-based outliers in very large datasets. ACM TKDD, 3(1):4:1–57.
[5] F. Angiulli and C. Pizzuti. Fast outlier detection in high dimensional spaces. In Proc. PKDD, pages 15–26, 2002.
[6] V. Barnett and T. Lewis. Outliers in Statistical Data. John Wiley & Sons, 3rd edition, 1994.
[7] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proc. KDD, pages 29–38.
[8] A. Bertoni and G. Valentini. Ensembles based on random projections to improve the accuracy of clustering algorithms. In WIRN / NAIS, pages 31–37.
[9] M. M. Breunig, H.-P. Kriegel, P. Kröger, and J. Sander. Data Bubbles: Quality preserving performance boosting for hierarchical clustering. In Proc. SIGMOD, pages 79–90, 2001.
[10] M. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proc. SIGMOD, pages 93–104.
[11] G. Brown, J. Wyatt, R. Harris, and X. Yao. Diversity creation methods: a survey and categorisation. Information Fusion, 6:5–20.
[12] T. G. Dietterich. Ensemble methods in machine learning. In Proc. MCS, pages 1–15.
[13] S. Dudoit and J. Fridlyand. Bagging to improve the accuracy of a clustering procedure. Bioinformatics, 19(9).
[14] X. Z. Fern and C. E. Brodley. Random projection for high dimensional data clustering: A cluster ensemble approach. In Proc. ICML, pages 186–193.
[15] A. Frank and A. Asuncion. UCI machine learning repository.
[16] A. L. N. Fred and A. K. Jain. Robust data clustering. In Proc. CVPR, pages 128–136.
[17] J. Gao and P.-N. Tan. Converting output scores from outlier detection algorithms into probability estimates. In Proc. ICDM, pages 212–221.
[18] J. Ghosh and A. Acharya. Cluster ensembles. WIREs DMKD, 1(4):305–315, 2011.
[19] A. S. Hadi, A. H. M. Rahmatullah Imon, and M. Werner. Detection of outliers. WIREs Comp. Stat., 1(1):57–70.
[20] S. T. Hadjitodorov, L. I. Kuncheva, and L. P. Todorova. Moderate diversity for better cluster ensembles. Information Fusion, 7(3).
[21] L. K. Hansen and P. Salamon. Neural network ensembles. IEEE TPAMI, 12(10):993–1001, 1990.
[22] W. Jin, A. Tung, and J. Han. Mining top-n local outliers in large databases. In Proc. KDD, 2001.
[23] W. Jin, A. K. H. Tung, J. Han, and W. Wang. Ranking outliers using symmetric neighborhood relationship. In Proc. PAKDD.
[24] F. Keller, E. Müller, and K. Böhm. HiCS: high contrast subspaces for density-based outlier ranking. In Proc. ICDE, 2012.
[25] E. M. Knorr and R. T. Ng. A unified notion of outliers: Properties and computation. In Proc. KDD, 1997.
[26] G. Kollios, D. Gunopulos, N. Koudas, and S. Berchtold. Efficient biased sampling for approximate clustering and outlier detection in large datasets. IEEE TKDE, 15(5):1170–1187.
[27] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. LoOP: local outlier probabilities. In Proc. CIKM.
[28] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In Proc. SDM, pages 13–24, 2011.
[29] H.-P. Kriegel, M. Schubert, and A. Zimek. Angle-based outlier detection in high-dimensional data. In Proc. KDD.
[30] A. Lazarevic and V. Kumar. Feature bagging for outlier detection. In Proc. KDD, pages 157–166.
[31] H. V. Nguyen, H. H. Ang, and V. Gopalkrishnan. Mining outliers with ensemble of heterogeneous detectors on random subspaces. In Proc. DASFAA, 2010.
[32] G. H. Orair, C. Teixeira, Y. Wang, W. Meira Jr., and S. Parthasarathy. Distance-based outlier detection: Consolidation and renewed bearing. PVLDB, 3(2), 2010.
[33] S. Papadimitriou, H. Kitagawa, P. Gibbons, and C. Faloutsos. LOCI: Fast outlier detection using the local correlation integral. In Proc. ICDE.
[34] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proc. SIGMOD.
[35] P. J. Rousseeuw and M. Hubert. Robust statistics for outlier detection. WIREs DMKD, 1(1):73–79, 2011.
[36] E. Schubert, R. Wojdanowski, A. Zimek, and H.-P. Kriegel. On evaluation of outlier rankings and outlier scores. In Proc. SDM, 2012.
[37] E. Schubert, A. Zimek, and H.-P. Kriegel. Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection. Data Min. Knowl. Disc., 2012.
[38] T. Soler and M. Chin. On transformation of covariance matrices between local Cartesian coordinate systems and commutative diagrams. In ASP-ACSM Convention, 1985.
[39] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res., 3:583–617.
[40] A. Topchy, A. Jain, and W. Punch. Clustering ensembles: Models of consensus and weak partitions. IEEE TPAMI, 27(12):1866–1881.
[41] G. Valentini and F. Masulli. Ensembles of learning machines. In Proc. Neural Nets WIRN, pages 3–22.
[42] N. H. Vu and V. Gopalkrishnan. Efficient pruning schemes for distance-based outlier detection. In Proc. ECML PKDD, pages 160–175.
[43] J. Yang, N. Zhong, Y. Yao, and J. Wang. Local peculiarity factor and its application in outlier detection. In Proc. KDD.
[44] K. Zhang, M. Hutter, and H. Jin. A new local distance-based outlier detection approach for scattered real-world data. In Proc. PAKDD.
[45] A. Zimek, E. Schubert, and H.-P. Kriegel. A survey on unsupervised outlier detection in high-dimensional numerical data. Stat. Anal. Data Min., 5(5), 2012.
