On the Combination of Collaborative and Item-based Filtering


Manolis Vozalis and Konstantinos G. Margaritis
University of Macedonia, Dept. of Applied Informatics
Parallel Distributed Processing Laboratory
Egnatia 156, P.O. 1591, 54006, Thessaloniki, Greece
{mans,kmarg}@uom.gr, URL: {mans,kmarg}

Abstract. In the past, there has been discussion about an approach that combines the best of the Item-based and the User-based (classic Collaborative Filtering) worlds, by first identifying a reasonably large neighborhood of similar users and then using this subset to derive the item-based recommendation model. We have taken this brief outline and developed a fully functioning hybrid method. In this paper we first describe the execution steps of the algorithm and then proceed with extended experiments. The first part of our experiments checks various parameter combinations in order to understand the behavior of the algorithm. The second part compares the hybrid approach mainly to plain Item-based Filtering, as far as utility and performance are concerned.

Keywords: personalization, prediction, machine learning

1 Introduction

Recommender Systems were introduced as a computer-based intelligent technique to deal with the problem of information and product overload. They can be used to provide personalized services efficiently in most e-business domains, benefiting both the customer and the merchant. The two basic entities that appear in any Recommender System are the user and the item. A user is a person who uses the Recommender System, provides opinions about various items, and receives recommendations about new items from the system. The goal of a Recommender System is to generate suggestions about new items for a particular user. The process is based on the input provided, which is most often expressed in the form of that user's ratings, and on the filtering algorithm that is applied to that input.

Let $m$ be the number of users, $U = \{u_1, u_2, \ldots, u_m\}$, and $n$ the number of items, $I = \{i_1, i_2, \ldots, i_n\}$. For the experiments in this work we used the original GroupLens data set. The data set consists of ratings assigned by 943 users to 1682 movies. All ratings follow a numerical scale from 1 (bad) to 5 (excellent), and each user was required to rate at least 20 movies in order to be included. The initial data set was used as the basis to generate five distinct splits into training and test data. For each split, 80% of the original set was included in the training data and 20% in the test data. The test sets were disjoint in all cases. In our experiments those sets are referred to as file=1, file=2, ..., file=5.

2 The Hybrid Algorithm

Karypis [1] briefly discussed an approach that combines the best of the Item-based and the User-based (classic Collaborative Filtering) worlds, by first identifying a reasonably large neighborhood of similar users and then using this subset to derive the item-based recommendation model. We have taken this brief outline and developed a fully functioning hybrid method, whose execution steps are described below.

The first step is the Information Representation, which does not differ from what we know for both Collaborative and Item-based Filtering. Its purpose is simple: to represent the data in an organized manner. To achieve that, we only require an $m \times n$ user-item matrix, $R$, where element $r_{ij}$ holds the rating that user $u_i$ (row $i$ of matrix $R$) gave to item $i_j$ (column $j$ of matrix $R$), or simply a value of 1 if user $u_i$ purchased item $i_j$, and 0 otherwise.

Next comes the User Similarity Computation step, which can be described as the contribution of Collaborative Filtering to this hybrid approach. The aim of this step is to create a neighborhood of the users most similar to a selected active user $u_a$. We can achieve that by applying Pearson Correlation Similarity as follows:

$$
sim_{ai} = corr_{ai} = \frac{\sum_{j=1}^{l} (r_{aj} - \bar{r}_a)(r_{ij} - \bar{r}_i)}{\sqrt{\sum_{j=1}^{l} (r_{aj} - \bar{r}_a)^2 \, \sum_{j=1}^{l} (r_{ij} - \bar{r}_i)^2}}
$$

where $r_{aj}$ and $r_{ij}$ are the ratings that item $i_j$ has received from users $u_a$ and $u_i$, while $\bar{r}_a$ and $\bar{r}_i$ are the averages of the ratings of users $u_a$ and $u_i$. The sums run over the $l$ items that both users have rated.
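The Pearson similarity between the active user and a candidate neighbor can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `pearson_similarity`, the dictionary layout (item id mapped to rating) and the choice to average each user's ratings over all items that user has rated are assumptions made here. The function also returns the number of commonly rated items, since a count of that kind is what a common-item threshold needs.

```python
import math


def pearson_similarity(ratings_a, ratings_i):
    """Pearson correlation between two users.

    Each argument is a dict mapping item id -> rating (hypothetical layout).
    Returns (similarity, number_of_common_items).
    """
    # Only items rated by both users enter the sums.
    common = set(ratings_a) & set(ratings_i)
    if not common:
        return 0.0, 0

    # Assumption: each user's mean is taken over all of that user's ratings.
    mean_a = sum(ratings_a.values()) / len(ratings_a)
    mean_i = sum(ratings_i.values()) / len(ratings_i)

    num = sum((ratings_a[j] - mean_a) * (ratings_i[j] - mean_i) for j in common)
    den_a = sum((ratings_a[j] - mean_a) ** 2 for j in common)
    den_i = sum((ratings_i[j] - mean_i) ** 2 for j in common)
    den = math.sqrt(den_a * den_i)

    # A user with constant ratings on the common items has no defined
    # correlation; this sketch treats that case as "no similarity".
    if den == 0:
        return 0.0, len(common)
    return num / den, len(common)
```

Returning 0.0 when the denominator vanishes is a design choice of this sketch; an implementation could equally skip such user pairs altogether.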
We can now select the $w$ users most similar to the active user, $u_a$, thus generating its neighborhood, AN. The size of this user neighborhood is important, since it is the basis for the Item Similarity Computation executed in the following step. A small user neighborhood would be inadequate for any kind of item similarity computation, leading to poor results. On the other hand, a very wide user neighborhood would make the hybrid filtering approach look very much like plain Item-based Filtering. Still, we do not simply pick the best $w$ correlates, as expressed by the highest Pearson Correlation Similarity values. We also require that each selected user and the active user have rated in common a number of items that is higher than a specified threshold, known as the Common Item Threshold. This threshold ensures that a possibly high correlation between the active user and a second random user is based on an adequate number of commonly rated items.

The following step is the Item Similarity Computation. The basic idea in this step is to first isolate the users who have rated two items $i_j$ and $i_k$ and then apply a similarity computation technique to determine the similarity of those items. Various ways to compute that similarity have been proposed. We use the Adjusted Cosine Similarity approach, which, as shown in previous experiments [2] [3], performs better than Cosine-based Similarity or Correlation-based Similarity. This is the contribution of Item-based Filtering to the Hybrid approach.

Still, there is a small but crucial difference between the way item similarities are computed in the hybrid approach and the way those calculations are carried out in plain Item-based Filtering. The similarity between two items $i_j$ and $i_k$ can be calculated only if there exist users who have rated both items. While in plain Item-based Filtering those users could be extracted from the set of all $m$ available users, in the hybrid approach we search for them only in the active user neighborhood, AN, generated in the previous step. Thus, the formula for the Adjusted Cosine Similarity of items $i_j$ and $i_k$ in the hybrid approach needs to be altered to the following:

$$
sim_{jk} = adjcorr_{jk} = \frac{\sum_{i=1}^{q} (r_{ij} - \bar{r}_i)(r_{ik} - \bar{r}_i)}{\sqrt{\sum_{i=1}^{q} (r_{ij} - \bar{r}_i)^2} \, \sqrt{\sum_{i=1}^{q} (r_{ik} - \bar{r}_i)^2}}
$$

where $r_{ij}$ and $r_{ik}$ are the ratings that items $i_j$ and $i_k$ have received from user $u_i$, while $\bar{r}_i$ is the average of user $u_i$'s ratings. The summations over $i$ are calculated only for those $q$ users, where $q \le w$, who have expressed their opinions on both items. Those users must be selected strictly from the active user neighborhood, AN.
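The two steps just described, forming the user neighborhood AN under a common item threshold and computing Adjusted Cosine Similarity only over users in AN, can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the function names, the nested-dictionary layout of the ratings, and the handling of a zero denominator are hypothetical choices made for illustration.

```python
import math


def select_user_neighborhood(similarities, w, common_item_threshold):
    """Pick the w most similar users that pass the common item threshold.

    similarities: dict user id -> (similarity to active user,
                                   number of items rated in common).
    """
    # Keep only users whose overlap with the active user exceeds the threshold.
    eligible = [(sim, user) for user, (sim, n_common) in similarities.items()
                if n_common > common_item_threshold]
    eligible.sort(key=lambda pair: pair[0], reverse=True)
    return [user for _, user in eligible[:w]]


def adjusted_cosine(ratings, neighborhood, item_j, item_k):
    """Adjusted cosine similarity of two items, restricted to a user neighborhood.

    ratings: dict user id -> dict item id -> rating (hypothetical layout).
    Returns (similarity, q) where q is the number of contributing users.
    """
    num = den_j = den_k = 0.0
    q = 0
    for user in neighborhood:
        user_ratings = ratings[user]
        # Only users in AN who rated both items contribute.
        if item_j in user_ratings and item_k in user_ratings:
            mean = sum(user_ratings.values()) / len(user_ratings)
            dj = user_ratings[item_j] - mean
            dk = user_ratings[item_k] - mean
            num += dj * dk
            den_j += dj * dj
            den_k += dk * dk
            q += 1
    den = math.sqrt(den_j) * math.sqrt(den_k)
    if den == 0:
        return 0.0, q  # assumption: undefined similarity is treated as 0
    return num / den, q
```

Passing the full user set as `neighborhood` reduces `adjusted_cosine` to the plain Item-based computation, which makes the single point of difference between the two methods explicit.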
Once we have calculated the similarities between all items in the initial user-item matrix, $R$, the next step in the procedure is to isolate the $l$ items, $i_k$, with $k = 1, 2, \ldots, l$, that share the greatest similarity with item $i_j$, for which we want a prediction, and to form its neighborhood of items, IN. Again, we do not just pick the best $l$ correlates, expressed by the highest proximity measure values. We also require that each selected item and the active item have been rated by a number of common users that is higher than a specified threshold, known as the Common User Threshold. This threshold ensures that a possibly high correlation between the active item and a second random item is based on an adequate number of commonly rating users.

Now we can proceed with the Prediction Generation, which is the same for both plain Item-based Filtering and the hybrid approach we are discussing. The most common way to achieve it is through a weighted sum. Briefly, this method generates a prediction on item $i_j$ for active user $u_a$ by computing the sum of the ratings given by the active user to items belonging to the neighborhood of $i_j$. Those ratings are weighted by the corresponding similarity, $sim_{jk}$, between item $i_j$ and item $i_k$, with $k = 1, 2, \ldots, l$, taken from neighborhood IN:

$$
pr_{aj} = \frac{\sum_{k=1}^{l} sim_{jk} \, r_{ak}}{\sum_{k=1}^{l} |sim_{jk}|}
$$

3 Experimental Results

In this section we evaluate the utility of the hybrid filtering method. We first provide a brief description of the experiments we executed and then present their results. It is necessary to note that while classic Collaborative Filtering and plain Item-based Filtering each had two free parameters (size of the user/item neighborhood and common item/user threshold, respectively), the hybrid approach has four free parameters, all of which can be altered during experiment execution: user neighborhood size and common item threshold in the stage of user neighborhood formation, and item neighborhood size and common user threshold in the stage of item neighborhood formation. Value combinations of those parameters are used extensively in the following experiments.

3.1 Comparing different User Neighborhood sizes for various Common Item and User Thresholds

As experiments in both Collaborative and Item-based Filtering have shown [3], neighborhood sizes and common threshold values have a serious impact on the utility of the corresponding filtering algorithm. With this experiment we wanted to monitor the impact of the user neighborhood size and of the common item and common user thresholds on the hybrid filtering approach, while also comparing it against the impact of the same factors in Collaborative and Item-based Filtering. For this reason we kept the item neighborhood size fixed at 60 throughout the experiment. Regarding the selection of user neighborhood sizes, we had to keep in mind that an adequate number of users should exist in the user neighborhood for the successful generation of the item neighborhood in the subsequent stage of the hybrid approach. Thus, we avoided very low user neighborhood sizes that would perform poorly.
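To make concrete what the item neighborhood size and the common user threshold control in this experiment, the item neighborhood selection and the weighted-sum prediction from Section 2 can be sketched as below. The sketch is illustrative only. The function names and data layout are hypothetical, the denominator uses absolute similarities (the usual weighted-sum convention, assumed here), and neighbors that the active user has not rated are simply skipped, which is an assumption about a detail the text leaves open.

```python
def select_item_neighborhood(item_similarities, l, common_user_threshold):
    """Pick the l items most similar to the active item.

    item_similarities: dict item id -> (similarity to active item,
                                        number of users who rated both).
    """
    eligible = [(sim, item) for item, (sim, n_common) in item_similarities.items()
                if n_common > common_user_threshold]
    eligible.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in eligible[:l]]


def predict_rating(active_user_ratings, item_neighborhood, item_similarities):
    """Weighted-sum prediction for the active user on the active item.

    active_user_ratings: dict item id -> rating given by the active user.
    Returns None when no neighbor was rated by the active user, i.e. when
    no prediction can be generated.
    """
    num = 0.0
    den = 0.0
    for item in item_neighborhood:
        if item in active_user_ratings:
            sim, _ = item_similarities[item]
            num += sim * active_user_ratings[item]
            den += abs(sim)
    if den == 0:
        return None
    return num / den
```

Returning `None` for pairs without a usable neighbor is what lets a coverage figure be computed later: such pairs count as test cases for which no prediction was produced.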
The results of this experiment for a single data split (file=1) are displayed in the following set of three figures. Figure 1 shows the mean absolute error (MAE) and coverage results for Common Item Threshold = 10. We note that the user neighborhood size (u-n) affects the accuracy of the results: as the number of users in the neighborhood increases, the error decreases, reaching its minimum for u-n=400. Yet for bigger user neighborhoods (u-n>400) the error stays fixed or increases. Coverage starts from very low values when the user neighborhood is small and, as the user neighborhood keeps increasing, reaches adequate values for common user thresholds cut=10 and cut=20, and satisfactory values for cut=30. We can conclude that when the common item threshold is 10, for all common user thresholds tested, the best MAE and coverage values are achieved when the user neighborhood includes 400 users.
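The two quantities reported in these experiments can be computed as in the following sketch. It assumes the common definitions, which the text does not spell out: MAE is the mean absolute difference between predicted and actual ratings over the test cases for which a prediction was generated, and coverage is the percentage of test cases for which a prediction could be generated at all. The function name and the dictionary layout are hypothetical.

```python
def mae_and_coverage(predictions, actual):
    """Compute (MAE, coverage in percent) over a test set.

    predictions: dict (user, item) -> predicted rating, or None / missing
                 when no prediction could be generated.
    actual:      dict (user, item) -> true rating from the test set.
    """
    errors = [abs(predictions[pair] - rating)
              for pair, rating in actual.items()
              if predictions.get(pair) is not None]
    coverage = 100.0 * len(errors) / len(actual) if actual else 0.0
    mae = sum(errors) / len(errors) if errors else None
    return mae, coverage


# Usage with made-up values (not taken from the experiments):
actual = {("u1", "a"): 4, ("u1", "b"): 2, ("u2", "a"): 5, ("u2", "c"): 3}
predictions = {("u1", "a"): 3.5, ("u1", "b"): 3.0, ("u2", "a"): None}
mae, coverage = mae_and_coverage(predictions, actual)
```

Because MAE is averaged only over covered cases, a low error can coexist with poor coverage, which is why the two measures are read together in the discussion.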

Fig. 1. Error and Coverage for Common Item Threshold=10

Figures 2 and 3 show the mean absolute error (MAE) and coverage results for Common Item Threshold = {20, 30}. Both error and coverage behave much as in Figure 1 (cit=10). Specifically, as the user neighborhood size increases, MAE gets lower while coverage gets higher. Still, there are a couple of significant differences. The error reaches low values similar to those for cit=10, but the same is not true for coverage. This time coverage values are lower, ranging from adequate (coverage=75.56% for cut=10, cit=20) to unacceptable (coverage=42% for cut=30, cit=30). Furthermore, the user neighborhood size after which we observe no significant changes in the behavior of MAE and coverage shifts to lower values. Specifically, this user neighborhood size is 300 when cit=20 and drops even lower, to 200, when cit=30.

Fig. 2. Error and Coverage for Common Item Threshold=20

Fig. 3. Error and Coverage for Common Item Threshold=30

Comparing the overall behavior displayed for the tested Common Item Threshold values (cit={10,20,30}), the following conclusions can be reached.

The best MAE value (lowest error=0.7997) is achieved for common item threshold=30, common user threshold=10 and user neighborhood=350. Still, this error value is accompanied by a coverage of 61.33%. Because of this unacceptable coverage value, we have to reject those parameter settings.

For common item threshold=10 the errors are not as low as in the case of common item threshold=30. Specifically, the lowest MAE value observed is 0.8462, for common user threshold=10 and user neighborhood=400. Still, this error value is accompanied by a satisfactory coverage of 87.78%.

For common item threshold=20 the coverage values lie somewhere between the previous two cases. Furthermore, the errors observed do not improve on either of those cases. As a matter of fact, they are very similar to the errors for common item threshold=10, accompanied by lower coverage. Consequently we are forced to reject this parameter setting.

Taking all those points into account, we conclude that the best performing cases are achieved for common item threshold=10. If we wish to single out one best case, that would be common item threshold=10, common user threshold=10 and user neighborhood=400.

3.2 Comparing the Hybrid Approach to Item-based Filtering

The hybrid approach we have been discussing can actually be considered an extension of the plain Item-based Filtering algorithm. We call it an extension since plain Item-based Filtering is enhanced by adding an intelligent way to create the pool from which the users that contribute to the construction of the item neighborhood are selected. Based on this assumption, an experiment that compares the performance of Item-based Filtering against that of the hybrid approach would show us how those related algorithms contrast.

For this experiment, and specifically for the Hybrid approach part, we kept the item neighborhood fixed to 60 items. The user neighborhood size was set to 400, since the previous experiment showed that this size gave the best performance for the combinations of the remaining parameters. The changing parameters were the Common Item Threshold and the Common User Threshold. As for the Item-based part of the experiment, the user neighborhood cannot change (it is fixed and includes all the users) and consequently there is no Common Item Threshold. For the purposes of the experiment, the item neighborhood was fixed to include 60 items, in accordance with the Hybrid approach. The single varying parameter was the Common User Threshold. The results of this experiment for two data splits (file=1,4) can be found in Table 1 and Figure 4.

Table 1. Comparing Accuracy in Hybrid and Item-based Filtering
Columns: hyb cit=10, hyb cit=20, hyb cit=30, item-based. Rows: one per common user threshold (cut). The cell values did not survive in this copy.

Starting from the errors table, we can see that as the Common User Threshold gets bigger, MAE also increases in all cases. Item-based Filtering seems to have an average performance for cut=10, the best performance for cut=20, and the worst performance for cut=30. As for the three cases of the hybrid approach that we tested, once again we seem to get the best overall accuracy when common item threshold=30.
On the other hand, when common item threshold=10 we get the worst accuracy for cut=10, and average performance for cut=20 and cut=30. In conclusion, Item-based Filtering and the Hybrid approach with common item threshold=10 do not have the lowest error values among the cases we tested, but their accuracy values are directly comparable.

Moving to coverage, this is an area where Item-based Filtering is clearly superior to the Hybrid approach, with values above 85% in all cases. The Hybrid approach for common item threshold=30, which performed better than Item-based Filtering where error was concerned, gives coverage results that rank as the worst among the three hybrid cases tested and are also unacceptably low when compared to the coverage values of Item-based Filtering. The best coverage observed for the hybrid method was achieved when common item threshold=10. Those coverage results lead us to the conclusion that the performance of the hybrid approach can be contrasted to that of item-based filtering, for both accuracy and coverage, only when common item threshold=10.

Fig. 4. Hybrid Approach vs. Item-based Filtering: Coverage

Our final comparison concerned item-based filtering and hybrid filtering with common item threshold=10, for which the accuracy and coverage experiments generated comparable results. We also considered a metric that evaluates computational performance: execution time. Our experiments showed that Item-based Filtering clearly has the best performance, its execution time being around 7 minutes for all common user thresholds we tested, cut={10,20,30}. On the other hand, for the same common user thresholds, the hybrid approach with cit=10 was disappointing, with execution times in the range of 25 minutes. There is one factor to consider when attempting to explain this result. In Item-based Filtering there is a single item neighborhood for each item, $i_k$. It is calculated once and used in prediction generation for all {user-item} pairs that involve item $i_k$ as their active item. In the Hybrid approach, on the other hand, there is a different item neighborhood for each {user-item} pair that includes item $i_k$, since each active user, $u_a$, has a different user neighborhood, AN, which affects how the neighborhood of item $i_k$ is generated.

4 Conclusions

In this work we have introduced a Hybrid filtering algorithm that combines ideas from the areas of Collaborative and Item-based Filtering, but can be better viewed as an extension of Item-based Filtering. The basic characteristic of this approach is that it localizes the item-based techniques to a wide user neighborhood that is created by applying collaborative filtering. Our first set of experiments aimed to display the behavior of the hybrid filtering approach for various combinations of its changing parameters and to reach some optimal settings. The next step was to take an implementation of the hybrid approach that uses those optimal parameter settings and contrast it with the item-based filtering algorithm. This comparison gave disappointing results concerning the performance of the hybrid approach. Accuracy results were quite close, but item-based filtering was clearly superior in coverage. The findings were similar when comparing the time requirements of the two approaches.

Based on these results, we can conclude that Item-based Filtering, which uses a global selection of users in its prediction generation, works better than the Hybrid approach, which presumably uses a localized, more personalized selection of users in its prediction generation. Two factors that may contribute to that unexpected difference in performance are: (a) the lack of adequate data that would otherwise assist in the generation of better user neighborhoods, and (b) the effectiveness of the selected similarity metrics, which are probably not able to locate true user relations in sparse data. As a result, in our future experiments we intend to use a number of statistical techniques (e.g. dimensionality reduction methods) or machine learning techniques (e.g. artificial neural networks) and explore how they assist the recommendation process.

References

1. Karypis, G.: Evaluation of item-based top-n recommendation algorithms. In: CIKM (2001)
2. Sarwar, B.M., Karypis, G., Konstan, J.A., Riedl, J.T.: Item-based collaborative filtering recommendation algorithms. In: 10th International World Wide Web Conference (WWW10), Hong Kong (2001)
3. Vozalis, E.G., Margaritis, K.G.: Recommender systems: An experimental comparison of two filtering algorithms. In: Proceedings of the 9th Panhellenic Conference in Informatics - PCI (2003)
4. Breese, J.S., Heckerman, D., Kadie, C.: Empirical analysis of predictive algorithms for collaborative filtering. In: Fourteenth Conference on Uncertainty in Artificial Intelligence, Madison, WI (1998)
5. Sarwar, B.M., Karypis, G., Konstan, J.A., Riedl, J.T.: Analysis of recommendation algorithms for e-commerce. In: Electronic Commerce (2000)
6. Herlocker, J., Konstan, J.A., Borchers, A., Riedl, J.T.: An algorithmic framework for performing collaborative filtering. In: The 1999 Conference on Research and Development in Information Retrieval (1999)
7. Vozalis, E., Margaritis, K.G.: Analysis of recommender systems algorithms. In: Proceedings of the Sixth Hellenic-European Conference on Computer Mathematics and its Applications - HERCMA (2003)
