Towards More Confident Recommendations: Improving Recommender Systems Using Filtering Approach Based on Rating Variance

Similar documents
Overcoming Accuracy-Diversity Tradeoff in Recommender Systems: A Variance-Based Approach

On the Combination of Collaborative and Item-based Filtering

A Comparison of Collaborative Filtering Methods for Medication Reconciliation

Collaborative Filtering with Multi-component Rating for Recommender Systems

Stability of Collaborative Filtering Recommendation Algorithms 1

De-Biasing User Preference Ratings in Recommender Systems

ITEM-LEVEL TRUST-BASED COLLABORATIVE FILTERING FOR RECOMMENDER SYSTEMS

The Long Tail of Recommender Systems and How to Leverage It

A Social Curiosity Inspired Recommendation Model to Improve Precision, Coverage and Diversity

Sawtooth Software. MaxDiff Analysis: Simple Counting, Individual-Level Logit, and HB RESEARCH PAPER SERIES. Bryan Orme, Sawtooth Software, Inc.

Investigating Confidence Displays for Top-N Recommendations

Using Personality Information in Collaborative Filtering for New Users

Recommender Systems in Computer Science and Information Systems - a Landscape of Research

Results & Statistics: Description and Correlation. I. Scales of Measurement A Review

A NOVEL VARIABLE SELECTION METHOD BASED ON FREQUENT PATTERN TREE FOR REAL-TIME TRAFFIC ACCIDENT RISK PREDICTION

How to Create Better Performing Bayesian Networks: A Heuristic Approach for Variable Selection

The Predictive Power of Bias + Likeness Combining the True Mean Bias and the Bipartite User Similarity Experiment to Enhance Predictions of a Rating

Hybridising collaborative filtering and trust-aware recommender systems

Recommender Systems, Consumer Preferences, and Anchoring Effects

The 29th Fuzzy System Symposium (Osaka, September 9-, 3) Color Feature Maps (BY, RG) Color Saliency Map Input Image (I) Linear Filtering and Gaussian

Typicality-based Collaborative Filtering Recommendation

E-MRS: Emotion-based Movie Recommender System

Alleviating the Sparsity in Collaborative Filtering using Crowdsourcing

The Effectiveness of Personalized Movie Explanations: An Experiment Using Commercial Meta-data

Clustering mass spectrometry data using order statistics

Sawtooth Software. The Number of Levels Effect in Conjoint: Where Does It Come From and Can It Be Eliminated? RESEARCH PAPER SERIES

GePuTTIS: General Purpose Transitive Trust Inference System for Social Networks

Determining the Vulnerabilities of the Power Transmission System

COMP329 Robotics and Autonomous Systems Lecture 15: Agents and Intentions. Dr Terry R. Payne Department of Computer Science

Correcting Popularity Bias by Enhancing Recommendation Neutrality

Case Studies of Signed Networks

Group recommender systems: exploring underlying information of the user space

COLLABORATIVE filtering (CF) is an important and popular

Exploiting Implicit Item Relationships for Recommender Systems

IEEE Proof Web Version

Remarks on Bayesian Control Charts

Identity Verification Using Iris Images: Performance of Human Examiners

Framework for Comparative Research on Relational Information Displays

An Efficient Service Rating Forecasting by Exploring Social Mobile User's Geographical Locations

You're special, but it doesn't matter if you're a greenhorn: Social recommender strategies for mere mortals

INTRODUCTION TO STATISTICS SORANA D. BOLBOACĂ

LARS*: An Efficient and Scalable Location-Aware Recommender System

Understanding user preferences and goals in recommender systems

Cannabis Culture Poll. November 2018

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES

A Cooperative Multiagent Architecture for Turkish Sign Tutors

AGENT-BASED SYSTEMS. What is an agent? ROBOTICS AND AUTONOMOUS SYSTEMS. Today. that environment in order to meet its delegated objectives.

Suggest or Sway? Effects of Online Recommendations on Consumers Willingness to Pay

A Scoring Policy for Simulated Soccer Agents Using Reinforcement Learning

Efficiency and design optimization

Understanding Effects of Personalized vs. Aggregate Ratings on User Preferences

Measuring Focused Attention Using Fixation Inner-Density

Predicting Heart Attack using Fuzzy C Means Clustering Algorithm

Recognizing Scenes by Simulating Implied Social Interaction Networks

Cocktail Preference Prediction

A Trade-off Between Number of Impressions and Number of Interaction Attempts

Positive and Unlabeled Relational Classification through Label Frequency Estimation

Valence-arousal evaluation using physiological signals in an emotion recall paradigm. CHANEL, Guillaume, ANSARI ASL, Karim, PUN, Thierry.

On the Use of Opinionated Explanations to Rank and Justify Recommendations

CHAPTER 3 DATA ANALYSIS: DESCRIBING DATA

MCAS Equating Research Report: An Investigation of FCIP-1, FCIP-2, and Stocking and. Lord Equating Methods 1,2

Case-based reasoning using electronic health records efficiently identifies eligible patients for clinical trials

Modeling Uncertainty Driven Curiosity for Social Recommendation

Generalization and Theory-Building in Software Engineering Research

E-MRS: Emotion-based Movie Recommender System. Ai Thanh Ho, Ilusca L. L. Menezes, and Yousra Tagmouti

How should I explain? A comparison of different explanation types for recommender systems

Measuring the Effects of Interruptions on Task Performance in the User Interface

Reliability, validity, and all that jazz

Handling Partial Preferences in the Belief AHP Method: Application to Life Cycle Assessment

5 $3 billion per disease

A scored AUC Metric for Classifier Evaluation and Selection

A Brief Introduction to Bayesian Statistics

Using threat image projection data for assessing individual screener performance

Comparative Study of K-means, Gaussian Mixture Model, Fuzzy C-means algorithms for Brain Tumor Segmentation

CHAPTER 2. MEASURING AND DESCRIBING VARIABLES

Assignment 4: True or Quasi-Experiment

Do Recommender Systems Manipulate Consumer Preferences? A Study of Anchoring Effects

USING AUDITORY SALIENCY TO UNDERSTAND COMPLEX AUDITORY SCENES

Investigating Explanations to Justify Choice

A Cue Imputation Bayesian Model of Information Aggregation

RELYING ON TRUST TO FIND RELIABLE INFORMATION. Keywords Reputation; recommendation; trust; information retrieval; open distributed systems.

On the Use of Opinionated Explanations to Rank and Justify Recommendations

Non-Stationary Phenomenon in Large Team Information Sharing

The Open Access Institutional Repository at Robert Gordon University

Exploring Experiential Learning: Simulations and Experiential Exercises, Volume 5, 1978 THE USE OF PROGRAM BAYAUD IN THE TEACHING OF AUDIT SAMPLING

Positive and Unlabeled Relational Classification through Label Frequency Estimation

Decisions and Dependence in Influence Diagrams

TITLE: A Data-Driven Approach to Patient Risk Stratification for Acute Respiratory Distress Syndrome (ARDS)

Type II Fuzzy Possibilistic C-Mean Clustering

Controlled Experiments

Position Paper: How Certain is Recommended Trust-Information?

PRECISION, CONFIDENCE, AND SAMPLE SIZE IN THE QUANTIFICATION OF AVIAN FORAGING BEHAVIOR

Spatial Cognition for Mobile Robots: A Hierarchical Probabilistic Concept- Oriented Representation of Space

Chapter 20: Test Administration and Interpretation

LANGUAGE TEST RELIABILITY On defining reliability Sources of unreliability Methods of estimating reliability Standard error of measurement Factors

SUPPLEMENTAL MATERIAL

Finding Information Sources by Model Sharing in Open Multi-Agent Systems 1

If you liked Herlocker et al.'s explanations paper, then you might like this paper too

Abstracts. 2. Sittichai Sukreep, King Mongkut's University of Technology Thonburi (KMUTT) Time: 10:30-11:00

Efficient AUC Optimization for Information Ranking Applications


Towards More Confident Recommendations: Improving Recommender Systems Using Filtering Approach Based on Rating Variance

Gediminas Adomavicius (1) gedas@umn.edu, Sreeharsha Kamireddy (2) skamir@cs.umn.edu, YoungOk Kwon (1) ykwon@csom.umn.edu
(1) Information and Decision Sciences, Carlson School of Management, University of Minnesota
(2) Department of Computer Science and Engineering, University of Minnesota

Abstract. In the present age of information overload, it is becoming increasingly hard to find relevant content. Recommender systems have been introduced to help people deal with these vast amounts of information and have been widely used in research as well as in e-commerce applications. In this paper, we propose several new approaches that improve the accuracy of recommender systems by using rating variance to gauge the confidence of recommendations. We empirically demonstrate how these approaches work with various recommendation techniques. We also show how these approaches can generate more personalized recommendations, as measured by the coverage metric. As a result, users can be given better control over whether to receive recommendations with higher coverage or higher accuracy.

1. Introduction and Motivation

In the present age of information overload, it is becoming increasingly hard to find relevant content. This problem is not only widespread but also alarming [9]. Over the last decade, recommender systems have been introduced to help people deal with these vast amounts of information [2]. Recommender systems have been widely used in research as well as in e-commerce applications, such as MovieLens, Amazon, Travelocity, and Netflix. The most common formulation of the recommendation problem relies on the notion of ratings, i.e., recommender systems estimate ratings of items that are yet to be consumed by the users (based on the ratings of items already consumed).
Recommendations to users are typically made based on the strength of the predicted ratings, i.e., the items with the highest predicted ratings are the ones being recommended. However, it has been argued that, in addition to recommendation strength, presenting the confidence of a recommendation (i.e., how certain the system is about the accuracy of the recommendation) to a user may be no less important [7]. Nonetheless, the issue of confidence has not been explored comprehensively, and the vast majority of recommender systems provide recommendations based solely on their strength. High-strength predictions, however, do not always mean good accuracy, since, for example, they are often obtained from small amounts of data. Therefore, beyond simply recommending items with high predicted ratings, we need to provide recommendations that users can rely on with confidence. A confidence measure is important because it can help users decide which movies to watch or which products to buy, and it can also help an e-commerce site decide which recommendations should not be displayed, because an erratic recommendation can diminish the users' trust in the system [7]. Surveys of MovieLens users also indicate the need for a proper confidence display [6], it has been demonstrated that displaying confidence measures helps users make better decisions, and the need for more sophisticated confidence measures has also been expressed [10]. A wide range of recommendation algorithms has been developed, but hardly any of these methods provide a measure of confidence [10]. In this paper, we propose several new approaches to improve the accuracy of recommendations by using rating variance (which, as we show, is inversely related to recommendation accuracy) to gauge the confidence of recommendations. We then empirically show how these approaches work with different recommendation techniques.
We also show how these approaches can generate more personalized recommendations, as measured by the coverage metric (described later in the paper in more detail). As a result, users can be given better control over whether to receive recommendations with higher coverage or higher accuracy.

2. Related Work

Recommender systems are usually classified into three categories based on their approach to

recommendation: content-based, collaborative, and hybrid approaches [2]. Content-based recommender systems recommend items similar to the ones the user preferred in the past. Collaborative (or collaborative filtering) recommender systems recommend items that users with similar preferences have liked in the past. Finally, hybrid approaches can combine content-based and collaborative methods in several different ways [1]. Furthermore, recommender systems can also be classified based on the nature of their algorithmic technique into memory- and model-based approaches [4]. Memory-based techniques usually represent heuristics that calculate recommendations on the fly, directly from previous user activities. In contrast, model-based techniques use previous user activities to first learn a predictive model (typically using statistical or machine-learning methods), which is then used to make recommendations. While numerous recommendation techniques have been developed over the last few years, and several different metrics have been employed for measuring recommendation accuracy, hardly any of these techniques provide a measure of confidence [10]. To address this, [10] used the number of ratings submitted for each item as a simple, non-personalized measure of confidence for predictions in a recommender system. We use this as a starting point to develop richer confidence metrics and suggest ways of giving the user control over confidence. It has also been shown that explaining recommendations is effective in helping users adopt them [6, 10]. In particular, our proposed approaches use rating variance for each user and/or item, since each user/item may exhibit a different rating habit/pattern, which can affect prediction accuracy. This motivation for using variance is shared in the decision science literature [5].
When a decision maker is provided with multiple opinions and needs to aggregate them, the variance of the individuals' opinions is used as a convenient proxy for the decision maker's confidence in the aggregate opinion. A somewhat similar approach in the recommender systems literature is used in [8], where the variance in rating patterns among users with similar interests is captured using a decoupled model. That is, they develop a specialized model-based technique to account for rating patterns. In contrast, our approaches can be used in conjunction with any existing recommendation technique: we simply apply our new rating-variance-based filtering methods to the predictions already obtained from any existing technique, resulting in improved recommendation accuracy. However, while a better understanding of recommendation confidence can improve recommendation accuracy, accuracy alone may not be enough to evaluate the performance of recommender systems; e.g., it is easy to obtain high precision by recommending only very popular items that most users are likely to rate highly. If we recommend items based only on their popularity, then there is no personalization, and the recommender system will be of no use. It has often been suggested that recommender systems must be not only accurate, but also useful [7]. Thus, various coverage-based usefulness measures (e.g., measuring the percentage of items that the recommender system is able to make recommendations for) are used to make sure that the recommendations are personalized with enough item variety [7]. For example, as a variant of the coverage measure, [3] proposed a new metric that computes the average dissimilarity between all pairs of recommended items. Our proposed recommendation approaches will also be evaluated with a coverage measure (in addition to accuracy), and we will demonstrate how to employ the notion of confidence to assure certain levels of both accuracy and coverage.
3. New Rating-Variance-Based Filtering Approaches

We propose several techniques that can exploit the confidence of recommendations while using any existing recommendation algorithm as a black box. We first predict the ratings of unrated items with some existing collaborative filtering technique. We then analyze how the accuracy of the predictions is related to various user/item statistics, such as rating frequency, mean, variance, and predicted rating value for different users and items. Among these statistics, we consistently found that recommendation accuracy monotonically decreases (i.e., the mean absolute error increases) as the rating variance of the user/item increases. Consequently, if we recommend to users only those items whose rating variance is small, the accuracy of recommendations can improve.
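As an illustration of the statistic these approaches rely on, the per-item rating standard deviation can be computed directly from the known ratings. A minimal sketch (the item names and ratings below are hypothetical, not from the paper's experiments):

```python
import statistics

def item_rating_stddev(known_ratings):
    """Standard deviation of the known ratings for each item.

    known_ratings: dict mapping item id -> list of observed ratings.
    Items with a single rating are skipped (no variance estimate).
    """
    return {item: statistics.pstdev(ratings)
            for item, ratings in known_ratings.items()
            if len(ratings) > 1}

# Hypothetical data: item_a has consensus, item_b is contentious.
ratings = {"item_a": [4, 5, 4, 5], "item_b": [1, 5, 2, 5]}
sd = item_rating_stddev(ratings)
assert sd["item_a"] < sd["item_b"]
```

The observed monotone relationship then suggests that predictions for low-variance items such as `item_a` will tend to be more accurate than those for `item_b`.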

Recommender systems typically recommend the most relevant N items to each user, since users in real-world personalization applications are usually interested in looking at only the several highest-ranked item recommendations. We used a movie recommendation dataset in our experiments, and the movie ratings in this data were on a scale of 1 to 5; therefore, we first needed to define what "highly ranked" means. We split the ratings into a binary scale by treating ratings greater than 3.5 (i.e., ratings 4 and 5) as highly ranked and ratings less than 3.5 (i.e., ratings 1, 2, and 3) as non-highly-ranked. Thus, the rating values of all recommended items should be greater than the threshold of 3.5. We then evaluated the accuracy of the recommender systems based on the percentage of truly high ratings among those that were predicted to be the N most relevant items for each user (i.e., using the precision-in-top-N metric).

3.1 Simple Filtering Approach

In general, the top-N items for each user are obtained in two steps: filtering the predicted ratings greater than the pre-defined acceptable threshold (e.g., 3.5 out of 5), and then choosing the N items above the threshold with the highest predicted ratings. In our first, simple filtering approach, in addition to filtering based on the acceptable rating threshold, we also filter the recommendations according to some user-specified rating standard deviation threshold D. In particular, for each rating (of an item for a user) that is predicted by the recommender system, we also assign a variance score which, in our case, is calculated as the standard deviation of all known ratings for that predicted item. In our experiments, we varied the threshold D over a range of values, chosen relative to the average rating std. dev. of items in our dataset. For example, for a given value of D, we recommended to users only items with a rating std. dev. of less than D (and, of course, where the predicted rating was above 3.5).
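The two-step filtering just described can be sketched as follows (function and variable names are ours; the 3.5 acceptable threshold follows the binary split above, and the D values are illustrative):

```python
def simple_filter_top_n(predictions, item_sd, n, d, art=3.5):
    """Simple filtering approach: keep items whose predicted rating clears
    the acceptable rating threshold ART *and* whose known-rating std. dev.
    is below the user-specified threshold D, then take the N items with
    the highest predicted ratings.

    predictions: dict item -> predicted rating for one user.
    item_sd: dict item -> std. dev. of the item's known ratings.
    """
    eligible = [(rating, item) for item, rating in predictions.items()
                if rating >= art and item_sd.get(item, float("inf")) < d]
    eligible.sort(reverse=True)
    return [item for _, item in eligible[:n]]

# Hypothetical predictions for one user.
preds = {"a": 4.6, "b": 4.8, "c": 3.0, "d": 4.1}
sd = {"a": 0.4, "b": 1.5, "c": 0.2, "d": 4.1 and 0.9}
```

A usage example: with a tight threshold the contentious item is dropped, without one it ranks first.

```python
sd = {"a": 0.4, "b": 1.5, "c": 0.2, "d": 0.9}
assert simple_filter_top_n(preds, sd, n=2, d=1.0) == ["a", "d"]
assert simple_filter_top_n(preds, sd, n=2, d=float("inf")) == ["b", "a"]
```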
We also tested the standard recommendation approach, i.e., without any constraints on the rating std. dev., in order to compare it with the proposed approaches. With this simple filtering approach, only highly ranked items with a low rating std. dev. are recommended to users; in turn, we expect that the more the rating std. dev. threshold restricts the recommended items, the better the performance of the recommendations in terms of the precision-in-top-N metric. However, as mentioned above, recommender systems must be not only accurate but also useful [7]. As a measure of usefulness, we use a coverage metric computed as the total number of unique items recommended across all users. This coverage metric is as useful a measurement as precision, because it reflects the level of personalization realized by the system; i.e., it shows whether all users are recommended the same top-N items, every user is given his/her own unique top-N items, or something in between. One can expect a tradeoff between accuracy and coverage: it may be easy to achieve fairly good accuracy (in terms of precision-in-top-N) by recommending the same most popular items to everybody, but the coverage (i.e., the level of personalization) will clearly suffer, and vice versa. Therefore, it is important to maintain a balance between accuracy and coverage. To address this issue, we propose the smart and safe filtering approaches, which can increase the coverage of recommendations while maintaining a relatively high precision-in-top-N.

3.2 Smart and Safe Approaches

The smart and safe approaches directly use the rating std. dev. (of each item) in selecting recommendations by adjusting the predicted rating values instead of using a filtering threshold.
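Both variants, detailed next, subtract one standard deviation of the item's known ratings from the prediction before comparing against the acceptable threshold; they differ only in how the qualifying items are ranked. A minimal sketch under that reading (the names and the 3.5 threshold are our assumptions):

```python
def smart_safe_top_n(predictions, item_sd, n, mode="smart", art=3.5):
    """Smart/safe approaches: an item qualifies only if its predicted rating
    minus one std. dev. of its known ratings still clears ART.  'smart'
    ranks qualifying items by the original prediction (average case);
    'safe' ranks them by the adjusted prediction (worst case)."""
    qualified = []
    for item, rating in predictions.items():
        adjusted = rating - item_sd.get(item, 0.0)
        if adjusted >= art:
            qualified.append((item, rating, adjusted))
    rank_key = (lambda t: t[1]) if mode == "smart" else (lambda t: t[2])
    qualified.sort(key=rank_key, reverse=True)
    return [item for item, _, _ in qualified[:n]]

# Hypothetical predictions: all three items qualify,
# but the two modes rank them differently.
preds = {"a": 4.9, "b": 4.6, "c": 4.8}
sd = {"a": 1.2, "b": 0.3, "c": 1.0}
assert smart_safe_top_n(preds, sd, n=2, mode="smart") == ["a", "c"]
assert smart_safe_top_n(preds, sd, n=2, mode="safe") == ["b", "c"]
```

Note how the safe mode promotes the low-variance item "b" to the top even though its raw prediction is the lowest of the three.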
In particular, in the smart approach, when predicting the rating of item x for a given user, we take the predicted rating generated by any traditional recommender system and subtract one standard deviation of the known ratings of item x from this predicted rating. In other words, we try to model a worst-case scenario, i.e., how low the actual rating could be if the prediction by the traditional recommender system is not very accurate. We then recommend the item if this newly computed predicted rating is greater than the acceptable rating threshold (i.e., 3.5). Specifically, we chose the distance of one standard deviation from the predicted rating as the worst case because it usually accounts for a significant portion of the data (e.g., if the data is normally distributed, one standard deviation around the mean accounts for about 68% of the sample). That is, we reduce the rating by one standard deviation and check whether this adjusted value can still be considered highly ranked. If so, we recommend N items according to the order of the original (not adjusted) predicted ratings; i.e., we rank the top-N items based on the average-case scenario, but make sure that we recommend only those items that would be above the acceptable threshold in the worst-case scenario.

The safe approach is a variation of smart, only slightly more conservative. Here, after subtracting one standard deviation from the predicted ratings, we rank the top-N items based on the newly adjusted ratings (i.e., the worst-case scenario) instead of the original predicted ones (i.e., the average-case scenario). Both the safe and smart approaches ensure that ratings adjusted by one standard deviation still exceed the acceptable rating threshold (i.e., 3.5). Thus, the recommendations should arguably be fairly accurate (in terms of precision-in-top-N), while their coverage does not deteriorate as much as in the simple filtering approach (as we empirically demonstrate below). All three new approaches are summarized in Figure 1.

Figure 1. Three rating-variance-based filtering approaches (ART = Acceptable Rating Threshold = 3.5):
1. PRE-PROCESSING: predict the ratings for each user (known ratings R -> any recommendation technique -> predicted ratings R').
2. FILTERING: filter R' based on rating std. dev.
   - Simple filtering approach: R' >= ART and std. dev. of R < D (for several values of D).
   - Smart approach: R' - (1 std. dev.) >= ART.
   - Safe approach: R' - (1 std. dev.) >= ART; then set R' := R' - (1 std. dev.).
3. POST-PROCESSING: recommend the top N items for each user (sort R' in descending order; choose the N items with the highest R' for each user).

4. Empirical Results

To evaluate the proposed approaches, we used a MovieLens ratings dataset (available at www.grouplens.org) with 100,000 movie ratings by various users. We preprocessed the data in order to obtain enough ratings for the predictions by including only the ratings of users who rated at least 2 movies and of movies rated by at least 2 users. We ended up with 94,44 ratings, including 97 users and 97 movies. As mentioned earlier, we can use any of the existing recommendation techniques in conjunction with our proposed approaches.
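To make the "black box" concrete, one common textbook form of an item-based prediction with an adjusted weighted sum over cosine similarities might look like the sketch below (a generic variant under our assumptions, not necessarily the authors' exact implementation):

```python
import math

def cosine_sim(v1, v2):
    """Cosine similarity over the co-rated entries of two item rating
    vectors (dicts mapping user -> rating)."""
    common = set(v1) & set(v2)
    if not common:
        return 0.0
    dot = sum(v1[u] * v2[u] for u in common)
    n1 = math.sqrt(sum(v1[u] ** 2 for u in common))
    n2 = math.sqrt(sum(v2[u] ** 2 for u in common))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def predict_item_based(user, target, item_ratings):
    """Adjusted weighted sum: the target item's mean rating plus the
    similarity-weighted deviation of the user's ratings from each
    neighbour item's mean."""
    target_mean = sum(item_ratings[target].values()) / len(item_ratings[target])
    num = den = 0.0
    for other, ratings in item_ratings.items():
        if other == target or user not in ratings:
            continue
        sim = cosine_sim(item_ratings[target], ratings)
        other_mean = sum(ratings.values()) / len(ratings)
        num += sim * (ratings[user] - other_mean)
        den += abs(sim)
    return target_mean + num / den if den else target_mean
```

A predictor of this kind supplies the R' values that the variance-based filtering step then post-processes.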
As an example of a memory-based technique, we chose one of the most popular collaborative filtering (CF) techniques, which uses an adjusted weighted sum and a cosine similarity metric [4]. A Naïve Bayesian classifier was chosen as the model-based technique; despite its simplicity, it is competitive with other learning algorithms in many cases [11]. For the Naïve Bayesian classifier, we extended the model mentioned in [4] to multi-class rating data (i.e., the 1-5 scale) in our experiment. Both user-based and item-based approaches with varying neighborhood sizes were examined for both the memory- and model-based techniques, and we obtained similar results in all cases. Here, because of space limitations, we display only the results of the item-based CF technique with the neighborhood of all users. Figure 2 shows the results of the three proposed approaches in terms of both precision (line graph) and coverage (bar chart) for the top-N items for each user, using the memory- and model-based techniques. The simple filtering approach (with several values of the rating std. dev. threshold D) clearly shows that a lower std. dev. threshold leads to higher precision, but lower coverage. In addition, the smart and safe approaches compare favorably to the simple filtering approach in terms of performance. In particular, in conjunction with the memory-based technique, the smart and safe approaches demonstrate the highest precision and have relatively high coverage. Furthermore, in conjunction with

the model-based technique, these two approaches demonstrate very high coverage and have relatively high precision. Note that, in general, the model-based technique produces higher coverage than the memory-based one across all proposed variance-based filtering approaches. This can be explained by the fact that the predicted rating values produced by the model-based technique are discrete, not continuous. That is, many predictions have identical values (e.g., rated as 5), and the top-N items are then chosen randomly from this list of highest-rated items for each user, thus resulting in higher variety. Also, as expected, because of its more conservative sorting procedure based on the worst-case scenario, the safe approach performs better in precision and is similar or slightly worse in coverage than the smart approach. From these results, we conclude that the smart and safe approaches have a more balanced performance across the precision and coverage metrics, as compared to the simple filtering approach.

Figure 2. Precision and coverage in top-N of the three proposed approaches for the memory-based (adjusted weighted sum and cosine similarity metric) and model-based (Naïve Bayesian classifier) techniques (precision-in-top-N: line graph with the right Y-axis; coverage: bar chart with the left Y-axis).

We also tested various portions of one standard deviation, in the range of 0 to 1 in increments of 0.1 (i.e., 0% to 68% confidence intervals), in the smart and safe approaches, since one standard deviation is only one way to account for rating variance. We replaced the "R' - 1 std. dev." formula in Figure 1 with "R' - k * std. dev." (k = 0, 0.1, ..., 1). If k is 0, then there is no rating variance constraint.
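The effect of the fraction k can be seen even in a toy example; in this sketch (illustrative numbers, with the 3.5 acceptable threshold as above), raising k can only shrink the set of qualifying items, trading coverage for precision:

```python
def qualifies(pred, sd, k, art=3.5):
    """An item qualifies if its prediction, reduced by k standard
    deviations of its known ratings, still clears the threshold."""
    return pred - k * sd >= art

# Illustrative (prediction, std. dev.) pairs for a few candidate items.
preds_sd = [(4.0, 1.0), (4.2, 0.6), (4.8, 1.2)]
surviving = {k: sum(qualifies(p, s, k) for p, s in preds_sd)
             for k in (0.0, 0.5, 1.0)}
assert surviving == {0.0: 3, 0.5: 3, 1.0: 2}
```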
Figure 3 shows that precision increases and coverage decreases as we subtract a larger portion of one standard deviation, and that the rates of change of precision and coverage are lower for the model-based technique than for the memory-based technique. In summary, this result is consistent with the result of Figure 2, i.e., the stricter the rating variance constraint, the higher the precision and the lower the coverage obtained. This is especially useful in practical recommendation applications, since users can be given better control over whether to receive recommendations with higher coverage (i.e., more personalized) or higher accuracy, by choosing the portion of the standard deviation used as the filtering threshold.

Figure 3. The smart and safe approaches with various fractions of the standard deviation (k = 0, 0.1, ..., 1); panels show the smart approach with the memory-based technique and the safe approach with the model-based technique.

5. Conclusions, Discussion, and Future Work

In this paper, using a simple filtering approach, we have demonstrated that prediction accuracy can be

significantly improved by filtering out recommendations whose rating std. dev. exceeds a given threshold. However, there was also a corresponding decrease in the coverage of recommendations. We then proposed the smart and safe approaches, which generate recommendations of greater value by providing a good balance of prediction accuracy and coverage. The new approaches are especially useful because they can confidently improve recommendation accuracy and, in addition, let the user control the balance between the accuracy and coverage of recommendations. Furthermore, this research provides some interesting opportunities for future work. In particular, rating variance can be considered when selecting the training data for unknown rating prediction, which is typically referred to as an active learning problem. If, among all the known ratings, we can choose ratings with potentially high accuracy (i.e., ratings with a low average of user rating variance and item rating variance), we may arguably reach the desired accuracy level faster (i.e., with less training data) than when we choose the data randomly. Our preliminary results show that this active learning approach (which uses only a fraction of the available ratings, i.e., the ones with lower variance) increases performance on accuracy metrics such as mean absolute error, precision, and recall in the earlier stages, although a more comprehensive exploration is still needed. Also, we are currently extending our proposed approaches to more advanced model-based techniques, such as the Flexible Mixture Model [14], and plan on testing them on additional datasets, such as the ones from Yahoo! Movies and Netflix, as well as with other evaluation metrics. Finally, the proposed approaches currently can be applied as a post-processing step to any recommendation technique.
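The variance-guided training-data selection idea mentioned above might, under our reading, be sketched as follows (all names and numbers are hypothetical):

```python
def select_low_variance_training(known, user_sd, item_sd, budget):
    """Rank known (user, item, rating) triples by the average of the
    user's and the item's rating std. dev., and keep the lowest-variance
    triples up to the given budget."""
    ranked = sorted(known,
                    key=lambda t: (user_sd[t[0]] + item_sd[t[1]]) / 2)
    return ranked[:budget]

# Hypothetical pool of known ratings and variance statistics.
known = [("u1", "a", 4), ("u2", "b", 5), ("u1", "b", 3)]
user_sd = {"u1": 0.4, "u2": 1.1}
item_sd = {"a": 0.3, "b": 0.9}
assert select_low_variance_training(known, user_sd, item_sd, 2) == \
    [("u1", "a", 4), ("u1", "b", 3)]
```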
However, these approaches could be more directly integrated into some existing recommendation techniques, including traditional neighborhood-based collaborative filtering [4, 13], where the variance of neighborhood ratings (as opposed to all ratings for the same item) could be exploited to further improve the recommendation quality.

Acknowledgments

Work reported in this paper was supported in part by National Science Foundation grant no. 4644.

References

[1] G. Adomavicius, A. Tuzhilin, Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions, IEEE Trans. on Knowledge and Data Engineering, 17(6):734-749, 2005.
[2] M. Balabanovic, Y. Shoham, Fab: Content-Based, Collaborative Recommendation, Comm. ACM, 40(3):66-72, 1997.
[3] K. Bradley, B. Smyth, Improving Recommendation Diversity, Proc. of the 12th Irish Conf. on Artificial Intelligence and Cognitive Science, 2001.
[4] J.S. Breese, D. Heckerman, C. Kadie, Empirical Analysis of Predictive Algorithms for Collaborative Filtering, Proc. of the 14th Conference on Uncertainty in Artificial Intelligence, 1998.
[5] D.V. Budescu, A.K. Rantilla, H.T. Yu, T.M. Karelitz, The effects of asymmetry among advisors on the aggregation of their opinions, Organizational Behavior and Human Decision Processes, 90:178-194, 2003.
[6] J.L. Herlocker, J.A. Konstan, J. Riedl, Explaining collaborative filtering recommendations, Proc. of the Conf. on Computer Supported Cooperative Work, 2000.
[7] J.L. Herlocker, J.A. Konstan, L.G. Terveen, J. Riedl, Evaluating Collaborative Filtering Recommender Systems, ACM Transactions on Information Systems, 22(1):5-53, 2004.
[8] R. Jin, L. Si, C.X. Zhai, J. Callan, Collaborative Filtering with Decoupled Models for Preferences and Ratings, Proc. of the 12th Intl. Conf. on Information and Knowledge Management, 2003.
[9] W. Knight, 'Info-mania' dents IQ more than marijuana, NewScientist.com news, 2005. URL: http://www.newscientist.com/article.ns?id=dn7298.
[10] S.M. McNee, S.K. Lam, C. Guetzlaff, J.A. Konstan, J. Riedl, Confidence Displays and Training in Recommender Systems, Proc. of the IFIP TC13 Intl. Conf. on Human-Computer Interaction (INTERACT), 2003.
[11] T. Mitchell, Machine Learning, McGraw-Hill, 1997.
[12] P. Resnick, H.R. Varian, Recommender systems, Comm. ACM, 40(3):56-58, 1997.
[13] B. Sarwar, G. Karypis, J. Konstan, J. Riedl, Item-Based Collaborative Filtering Recommendation Algorithms, Proc. of the 10th International WWW Conference, 2001.
[14] L. Si, R. Jin, Flexible Mixture Model for Collaborative Filtering, Proc. of the 20th Intl. Conf. on Machine Learning, 2003.