Rating prediction on Amazon Fine Foods Reviews

Chen Zheng, Ye Zhang, Yikun Huang
University of California, San Diego

ABSTRACT
Users' online reviews of products are prominent nowadays and play an increasingly important role in e-commerce. They provide valuable feedback to merchants about which products are popular and which are not, and they give businesses hints for reevaluating their customer policies and making improvements in specific directions. In particular, a user's rating usually reflects their opinion of the product itself or of the customer service they received; either way, it carries worthwhile information that the market can rely on. Widely accepted feature-based models have been introduced to predict users' ratings; however, their predictive performance is specific to the dataset, i.e., there is no universal solution. In addition, users' review text can be utilized as a useful feature because of the emotion it conveys. Among the renowned models, our study adopts several regression models while exploring the role that the review text and summary can play in predicting users' ratings.

1. INTRODUCTION
This study is based on users' reviews of fine foods on Amazon. Every review can be factorized into text features and non-text features. For the non-text features we train classical linear regressors (ElasticNet, Ridge, Lasso) and cluster-based regressors (random forest regressor, KNN regressor), while for the text features, unigram, bigram, and mixed models are used. The readability of the review text is also introduced.

2. DATASET EXPLORATION
The data source used in this report is the Amazon Fine Foods review dataset, obtained from [5] by J. McAuley and J. Leskovec. The essential statistics about the dataset are shown in Table 1. The dataset is distributed as a plain-text file, which we processed and loaded into a pandas DataFrame.

    Number of reviews     568,454
    Number of users       256,059
    Number of products    74,258
    Average rating        –
    Timespan              Oct – Oct 2012
Table 1: Essential statistics about the dataset

The DataFrame consists of the following fields:
- UserID for the review
- ItemID for the review
- The review's submission time
- The user's summary
- The user's review text
- The user's actual rating
- Number of people who think this review is helpful
- Number of people who rated this review

As can be seen from Figure 1, users show clear tendencies when they give reviews: while some users (approximately 24,000) give low 1.0-star ratings, far more users tend to give 5-star ratings. A similar pattern is observed for the item-wise rating distribution in Figure 2, where more items tend to receive positive ratings. This makes intuitive sense, because customers tend to give higher ratings to items that have already received many high ratings.

Figure 1: Average rating given by different users.
Figure 2: Average rating received by different products.

As Martin and Pu [4] mentioned in their study, the emotion that text contains can play an important role in a user's rating. By extracting unigrams from the review text and recording their numbers of occurrences, we can train a regressor on the 1,000 most frequent words. Using a linear regressor with a regularization parameter of 1.0, we obtain the parameter θ corresponding to each word; after sorting the parameter list, we know which words give the most positive or negative indication of the rating. Those words are placed in word clouds.
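As a concrete illustration of this unigram step, the following is a minimal scikit-learn sketch; the CSV path and the column names ('Text', 'Score') are assumptions made for illustration and may not match the exact preprocessing used in the study.

```python
# Minimal sketch: ridge regression on counts of the 1,000 most frequent
# unigrams, then inspect which words push the predicted rating up or down.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

df = pd.read_csv("finefoods_reviews.csv")      # hypothetical path to the processed data
vec = CountVectorizer(max_features=1000)       # 1,000 most frequent words
X = vec.fit_transform(df["Text"])
y = df["Score"]

model = Ridge(alpha=1.0)                       # regularization parameter of 1.0
model.fit(X, y)

# Sort words by their learned coefficient theta.
ranked = sorted(zip(model.coef_, vec.get_feature_names_out()))
print("most negative words:", [w for _, w in ranked[:10]])
print("most positive words:", [w for _, w in ranked[-10:]])
```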
As shown in Figure 3, for instance, words such as "amazing", "thank", "delicious", and "fantastic" have a positive influence on the rating, while words like "terrible", "waste", and "disappointed" drag the rating down. This is intuitively valid. A similar pattern is found in the summaries of the reviews, shown in Figure 4.

Figure 3: Most common sentimental words in review text ((a) positive review, (b) negative review).
Figure 4: Most common sentimental words in summary text ((a) positive review, (b) negative review).
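Word clouds like those in Figures 3 and 4 could be rendered by feeding the learned coefficients to the wordcloud package; this is only an illustrative sketch reusing the `vec` and `model` objects assumed above, not necessarily how the figures were produced.

```python
# Sketch: render the positive and negative words as word clouds, weighting
# each word by the magnitude of its ridge coefficient from the previous sketch.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

weights = dict(zip(vec.get_feature_names_out(), model.coef_))
panels = {"positive review words": {w: c for w, c in weights.items() if c > 0},
          "negative review words": {w: -c for w, c in weights.items() if c < 0}}

for title, freqs in panels.items():
    wc = WordCloud(width=600, height=400).generate_from_frequencies(freqs)
    plt.figure()
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(title)
plt.show()
```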

In addition, comparing Figure 3 and Figure 4, the word cloud built on the Summary is more representative than the one built on the Text. Therefore, we also implemented regression with the text feature taken from the Summary, which is expected to perform better than the Text; the details are discussed in the model section.

3. PREDICTIVE TASK
Our target in this assignment is to build a model that predicts the score from this dataset. Compared with the dataset provided in Assignment 1, no category feature is provided here, and according to our previous experience in Assignment 1, category information is highly correlated with the prediction target. Therefore, as a first step, we extracted tf-idf information from each entry's review text and ran K-means clustering of the reviews with the number of clusters set to 10.

Figure 5 shows the kernel density distribution of the score for the clusters obtained from the K-means clustering. We only include three clusters in Figure 5 because the qualitative difference in kernel density distribution among clusters is what we care about most. From Figure 5, we observe a clear difference in score distribution among the clusters. Based on this observation, later analysis is conducted cluster by cluster; we believe that a model built this way will do better at revealing the true correlation between the score and the remaining features in the dataset.

Figure 5: Score distribution over clusters.

In Figure 6, we also include a bar plot of the number of entries in each cluster. The entries are clearly distributed unevenly across clusters; for example, cluster 1 contains nearly three times as many entries as cluster 5. This uneven distribution further confirms the necessity of dividing the entries into different clusters during training. Therefore, besides the linear regressors, we also implemented a random forest regressor and a KNN regressor, which take cluster-wise differences into account.

Figure 6: Entry amount for each cluster.
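A minimal sketch of this tf-idf plus K-means step, under the same DataFrame assumptions as the earlier snippets; the vocabulary size and random seed are illustrative choices rather than values taken from the study.

```python
# Sketch: cluster the reviews by a tf-idf representation of their text
# (10 clusters), then summarize the rating distribution within each cluster.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words="english", max_features=5000)   # vocabulary size assumed
X_tfidf = tfidf.fit_transform(df["Text"])

km = KMeans(n_clusters=10, random_state=0)
df["cluster"] = km.fit_predict(X_tfidf)

# The paper visualizes per-cluster scores as kernel density plots; a groupby
# summary conveys the same qualitative picture, plus the uneven cluster sizes.
print(df.groupby("cluster")["Score"].describe())
print(df["cluster"].value_counts())
```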

The dataset has 568,454 entries in total. We only used the first 100,000 entries to limit computational cost and overfitting, and split them into a training set and a validation set. We calculated the mean squared error (MSE) on the validation set to evaluate the performance of our models and to investigate whether they are overfitted.

As Martin and Pu [4] mentioned in their study, the emotion that text contains can play an important role in a user's rating, and we expect the text content to have a significant influence on rating prediction. Therefore, our models treat the text content feature in two ways. In the first half of the model section, we discuss different kinds of regressors that do not consider any information from the text content; in the second half, we discuss ridge regressions with the text feature in different kinds of representations. Thus, we can investigate how important the text feature is for rating prediction.

4. MODEL
4.1 Regressors on non-text features
From the class, we know there exists a wide variety of regressors for predicting continuous variables. When implementing the different regressors, the hardest part is parameter tuning. In this project, we use the sklearn.grid_search.GridSearchCV object to perform an exhaustive search over specified parameter values; this method iterates through the defined sets of parameters, and the detailed parameter sets tested are listed in the subsections below. To avoid overfitting during the search for optimal parameters, a cross-validation approach was adopted, and the mean squared error (MSE) was used as the metric for comparing model performance.

In this section, we do not use any information related to the text content itself; however, we do include the word counts of both the summary and the text in our regression models. The number of features used in all of our regression models is six: HelpfulnessNumerator, HelpfulnessDenominator, category, summary word count, text word count, and review time. For the review time feature, we first convert the Unix timestamp into weekday, month, and year, and then one-hot encode each of them; in other words, year, month, and weekday are treated as categorical features in our investigation.

4.1.1 Ridge regressor
The Ridge regressor solves a regression model whose loss function is the linear least-squares function with L2-norm regularization, which is known as ridge regression. During our parameter search, the regularization strength parameter α was varied from 1e-5 to 1e5; the strength of the regularization increases as the value of the parameter increases.
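A hedged sketch of this feature construction and grid search; the α range and the one-hot time encoding follow the description above, while the column names, the 80/20 split ratio, and the cross-validation settings are assumptions.

```python
# Sketch: build the six non-text features and grid-search the ridge alpha.
# Column names are assumed to match the processed DataFrame; 'cluster' is the
# K-means label from the earlier sketch, used here as the category feature.
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

t = pd.to_datetime(df["Time"], unit="s")                    # Unix timestamp
time_feats = pd.get_dummies(
    pd.DataFrame({"year": t.dt.year, "month": t.dt.month, "weekday": t.dt.weekday}),
    columns=["year", "month", "weekday"])                   # one-hot encoding

X = pd.concat([df[["HelpfulnessNumerator", "HelpfulnessDenominator", "cluster"]],
               df["Summary"].fillna("").str.split().str.len().rename("summary_wc"),
               df["Text"].str.split().str.len().rename("text_wc"),
               time_feats], axis=1)
y = df["Score"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,  # split ratio assumed
                                                  random_state=0)
grid = GridSearchCV(Ridge(), {"alpha": np.logspace(-5, 5, 11)},         # 1e-5 ... 1e5
                    scoring="neg_mean_squared_error", cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, "validation MSE:", -grid.score(X_val, y_val))
```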

4.1.2 Lasso regressor
Although the form of lasso regression is very similar to ridge regression, Lasso performs variable selection and regularization simultaneously. The constant that multiplies the L1 term was varied from 1e-5 to 1e5 during our investigation.

4.1.3 ElasticNet
The ElasticNet regressor is a combination of the Lasso and Ridge regressors. Its alpha value plays the same role as the alpha parameter in Lasso and Ridge, and it has one additional l1_ratio parameter which allows us to adjust the ratio of L1 to L2 regularization; for example, if we set l1_ratio to 1.0, the ElasticNet regressor is equal to Lasso regression. In our investigation, alpha ranges from 1e-5 to 1e5 and l1_ratio varies from 0.1 to 0.9 with an interval of 0.1. As shown in Figure 7, the performance of the ElasticNet regressor is highly dependent on the value of the alpha parameter, while the l1_ratio parameter shows a smaller effect: the ElasticNet model performs worse as the Lasso ratio increases. Based on this observation, we used the ElasticNet MSE results with l1_ratio equal to 0.1 when comparing its performance with the Lasso and Ridge regressors in Figure 8. Among all three linear regressors, the Ridge regressor is the most robust one, showing low MSE values across different alpha values.

Figure 7: ElasticNet parameter screen.
Figure 8: Performance of the Lasso, Ridge, and ElasticNet regressors.

4.1.4 Random Forest Regressor
Besides the above three simple least-squares models, we also ran a random forest regressor on our dataset. The random forest model is a meta-estimator built upon decision trees: it is an ensemble of decision trees in which each tree is trained on a small subset of the dataset, and the forest uses averaging to improve predictive accuracy and control overfitting. We only changed the n_estimators parameter, which represents the number of trees in the forest, when training the model; the values of n_estimators we used are 10, 100, 200, 500, and 1,000. According to Figure 9, the MSE of the random forest regressor converges once the number of decision trees exceeds 100; the MSE obtained with 1,000 decision trees is nearly identical to that with 100. In Figure 9, we also observe that the overall MSE of the random forest regressor is much lower than that of the Ridge, Lasso, and ElasticNet regressors. It is worth mentioning that the training and fitting procedure was conducted on a 400,000-entry dataset, so a 0.3 drop in validation MSE can be considered a significant improvement in prediction. For the random forest regressor, however, the MSE on the training set is much lower than the MSE on the validation set; this large gap indicates that the random forest regressor is more prone to overfitting.

Figure 9: Performance of the random forest regressor.
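The n_estimators comparison could be run as follows, reusing the train/validation split from the previous sketch; the parameter values mirror the text, and everything else is an assumption.

```python
# Sketch: compare training and validation MSE of a random forest regressor
# for the n_estimators values reported above, reusing X_train/X_val.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

for n in [10, 100, 200, 500, 1000]:
    rf = RandomForestRegressor(n_estimators=n, random_state=0, n_jobs=-1)
    rf.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, rf.predict(X_train))
    val_mse = mean_squared_error(y_val, rf.predict(X_val))
    # A large train/validation gap is the overfitting signal discussed above.
    print(f"n_estimators={n}: train MSE={train_mse:.3f}, validation MSE={val_mse:.3f}")
```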

4.1.5 KNN regressor
The last regressor model we used in this section is the K-nearest neighbors (KNN) regressor. The concept behind it is that the dataset can be treated as an ensemble composed of many small data subsets, and regression is done by local interpolation of the targets associated with each point's nearest neighbors. In our investigation, we varied the number of neighbors from 5 to 50 with an interval of 5, and the leaf_size parameter passed to the regressor from 10 to 100 with an interval of 10. From Figure 10, we see that changing the leaf size does not change the MSE significantly, although we found during training that model fitting slows down as the leaf size increases. The performance of the KNN regressor is, however, highly dependent on the number of neighbors: the lowest MSE is achieved with 5 nearest neighbors. This result once again confirms that the score of a review is highly related to entries similar to that review, and that clustering is a prerequisite for high-accuracy prediction of a review's score.

Figure 10: KNN regressor parameter screen.

4.2 Linear regressor on text features
Users' review text can be utilized as a useful feature because of the emotion it conveys. In this section, we explore the performance of ridge regression that performs sentiment analysis on the text content. Since the dataset is very large (568,454 entries), we only used the first 100,000 entries and split them into a training set and a validation set. When counting words and bigrams, we removed punctuation and stop words that do not benefit the regression. There are 94,451 unique single words and 1,164,185 unique bigrams among all the review texts, and 17,624 unique single words and 100,170 unique bigrams among all the summaries. Despite these large totals, several experiments showed that the 1,000 most common unigrams or bigrams are enough to represent the text feature; involving more unigrams or bigrams increases the computational complexity without improving the performance much. Therefore, we implemented three kinds of text features in this section: the first represents the text content by the 1,000 most common single words, the second by the 1,000 most common bigrams, and the third by the 1,000 most common unigrams and bigrams combined. We used the mean squared error (MSE) to evaluate the performance of the regressions. Since the combination of two words in a specific order sometimes conveys specific information, the third representation is the most powerful, as it contains more information than the first two. We speculated that the regression with the mixture of unigrams and bigrams would have the lowest MSE, and this is confirmed in Figures 11 and 12.

There are two kinds of text fields in this dataset: the whole review text and the summary of the review. From Figures 3 and 4, we can see that the words extracted from the Summary are more representative of the users' attitude than those from the Text. Thus, we ran ridge regression on Summary and Text separately and speculated that the regression with the Summary text feature would perform better than the one with the Text feature. Comparing Figures 11 and 12, the features extracted from the Summary do lead to a lower MSE, which is consistent with the conclusion drawn from the basic properties of the dataset.

Figure 11: Performance on the review text.
Figure 12: Performance on the summary text.
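A sketch of the three bag-of-n-gram representations applied to the Summary field; the vectorizer settings mirror the description above, while the column names, the split ratio, and the fixed α = 1.0 are illustrative assumptions (swapping in df["Text"] gives the review-text variant).

```python
# Sketch: ridge regression on three bag-of-n-gram representations of the
# Summary field, limited to the 1,000 most common terms, stop words removed.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

text = df["Summary"].fillna("")
train_text, val_text, y_tr, y_va = train_test_split(text, df["Score"],
                                                    test_size=0.2, random_state=0)

for name, ngrams in [("unigrams", (1, 1)), ("bigrams", (2, 2)), ("mixture", (1, 2))]:
    vec = CountVectorizer(max_features=1000, ngram_range=ngrams, stop_words="english")
    X_tr = vec.fit_transform(train_text)
    X_va = vec.transform(val_text)
    model = Ridge(alpha=1.0).fit(X_tr, y_tr)
    print(name, "validation MSE:", mean_squared_error(y_va, model.predict(X_va)))
```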

In addition, the readability of each review, which gauges its comprehension difficulty, could contribute to helpfulness prediction: users generally do not find reviews helpful that are too complex or difficult to read, or, at the other extreme, too simple or immature in diction. We guessed that the complexity of the review text also affects the rating the user gives; our intuition is that more complex text probably carries more varied content. In order to further improve the performance of our regressions, we therefore added the readability of the text, i.e., the complexity of the review text, to the features. To assess the readability of each review, we used the Automated Readability Index [1], which approximates the U.S. grade level needed to comprehend the text:

ARI = 4.71 (characters / words) + 0.5 (words / sentences) - 21.43    (1)

However, as Figure 13 shows, the performance of the regression is not improved for small regularization parameters α, and it is even worse for large α compared to the regression without readability. This might be caused by overfitting.

Figure 13: Performance on the review text with the readability feature.

5. RELATED LITERATURE
Data scientists have been making efforts to develop better models for rating prediction, and almost every attribute of the data frame can be utilized as a predictor. An improvement over the latent factor model is introduced by Moghaddam and Ester [6]. They proposed a probabilistic graphical model based on LDA (Latent Dirichlet Allocation), Factorized LDA (FLDA), which not only utilizes each item's and user's tendency to receive/give ratings but also uses the review itself. They used the EM algorithm to train the parameters, and the resulting FLDA outperforms traditional LDA. This is a complex model that utilizes almost every informative aspect of the dataset, yet it is hard to implement, and we did not implement it here because of time constraints.

In terms of feature selection, the paper by Leon et al. [3] provides a fresh perspective. Besides the features that can be extracted from the review, they also include community-network features such as reviewer clustering and reviewer degree. Their method is applicable when a user appears more than once in the review data. Tuv et al. [7] provide a more general way to select features by using ensembles and introducing artificial noise variables; the feature selection technique they adopted is shown to be effective, yet non-trivial for those without a solid mathematical background. The underlying information in review text is explored by Krishnamoorthy [2] and by Martin and Pu [4]: both clustered words by emotional context and then gave a different weight to each cluster. The results they obtained are surprisingly good when compared with numeric-only features.

6. RESULT
6.1 Performance of the baseline model
A simple predictor with the model shown in Equation (2), which we implemented in Assignment 1, is used as the baseline:

rating = α + β_u + β_i    (2)

As before, we used the first 100,000 entries of the dataset and split them into a training and a validation set. The performance of this model is shown in Figure 14; the MSE on the validation set is about 1.6. In addition, we found this model is prone to overfitting given the huge dataset.
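For reference, a minimal sketch of this bias-only baseline; the regularized alternating updates shown here are one common way to fit α, β_u, and β_i rather than necessarily the procedure used in Assignment 1, and the column names, split, and damping value are assumptions.

```python
# Sketch: bias-only baseline rating = alpha + beta_u + beta_i, fit with a few
# regularized alternating passes over user and item offsets.
import pandas as pd

train = df.iloc[:80000].copy()                   # split ratio assumed
val = df.iloc[80000:100000].copy()

alpha = train["Score"].mean()
beta_u = pd.Series(0.0, index=train["UserId"].unique())
beta_i = pd.Series(0.0, index=train["ProductId"].unique())
lam = 5.0                                        # damping term, assumed

for _ in range(10):
    resid = train["Score"] - alpha - train["ProductId"].map(beta_i)
    beta_u = resid.groupby(train["UserId"]).sum() / (lam + train["UserId"].value_counts())
    resid = train["Score"] - alpha - train["UserId"].map(beta_u)
    beta_i = resid.groupby(train["ProductId"]).sum() / (lam + train["ProductId"].value_counts())

pred = alpha + val["UserId"].map(beta_u).fillna(0) + val["ProductId"].map(beta_i).fillna(0)
print("baseline validation MSE:", ((val["Score"] - pred) ** 2).mean())
```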
Figure 14: Performance of the baseline.

6.2 Performance of our models
We implemented several kinds of regressions, both with features involving the text content and with only non-text features. From Figures 8, 9, 11, and 12, we can see that all our models, with an appropriate regularization parameter, beat the baseline performance. The best regression that does not consider the text feature, the random forest regression, has the lowest MSE of about 1.0, but its computation is very expensive: it requires much more time once the number of decision trees exceeds 100, for example more than half an hour for 1,000 trees, so it is not efficient. In general, however, the regressions with the text feature beat the regressions that do not consider the text content, and they are more efficient, needing only several minutes of ridge regression computation even for this large dataset. The most powerful regression with the text feature uses the mixture of unigrams and bigrams.

It achieves the lowest MSE on the validation set, about 1.13, and it is efficient as well.

7. CONCLUSION
Through a set of experiments on different kinds of regressions, we found that the text feature contributes significantly to the performance of rating prediction. This is consistent with the intuition that the emotion of a review is reflected in its star rating: users tend to give a higher rating when the review consists almost entirely of positive words. In the future, we could try to combine the text features and non-text features to perform prediction, which might yield better performance.

8. REFERENCES
[1] W. H. DuBay. The principles of readability. Calif.: Impact Information.
[2] S. Krishnamoorthy. Linguistic features for review helpfulness prediction. Expert Systems with Applications, 42(7).
[3] S. Leon, Vasant. Using properties of the Amazon graph to better understand reviews.
[4] L. Martin and P. Pu. Prediction of helpful reviews using emotions extraction. AAAI Publications.
[5] J. McAuley and J. Leskovec. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. WWW.
[6] S. Moghaddam and M. Ester. The FLDA model for aspect-based opinion mining: addressing the cold start problem. Proceedings of the 22nd International Conference on World Wide Web, ACM.
[7] E. Tuv et al. Feature selection with ensembles, artificial variables, and redundancy elimination. Journal of Machine Learning Research, 2009.
