Consumer Review Analysis with Linear Regression


Cliff Engle, Antonio Lupher
February 27, 2012

1 Introduction

Sentiment analysis aims to classify people's sentiments towards a particular subject based on their opinions. A number of different classification methods can be used to perform this analysis. In this paper, we perform sentiment analysis on a labeled dataset of Amazon product reviews using linear regression. The dataset consists of one million textual reviews, of which approximately 500,000 are unique. We perform 10-fold cross-validation of our linear regression method and experiment with various configuration parameters to find the settings that give the best performance.

2 Linear Regression

Linear regression models the relationship between an output variable, y, and explanatory variables, X. The output variable is a linear combination of the explanatory variables weighted by their corresponding coefficients, represented by β. In our case, y is a review's rating from one to five, while X is the bag-of-words representation of the review text, so β represents the weight of a word's correlation with a higher rating. This gives us the simple regression formula:

    y = Xβ

In this paper, we perform ordinary least squares regression. By minimizing the sum of squared residuals, we can solve for the unknown parameter vector β in closed form:

    β = (XᵀX)⁻¹Xᵀy

To ensure that the matrix XᵀX is invertible, we perform ridge regression, adding the identity matrix I scaled by a factor λ to produce the following equation:

    β = (XᵀX + λI)⁻¹Xᵀy
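As a concrete illustration (a minimal sketch, not the authors' released code), the closed-form ridge solve above takes only a few lines of MATLAB, assuming X is the sparse n-by-d bag-of-words matrix and y is the n-by-1 vector of ratings:

    % Minimal sketch of the closed-form ridge solve described above.
    % Setting lambda = 0 recovers ordinary least squares.
    lambda = 1500;        % ridge weight (see Section 4.1 for tuning)
    d = size(X, 2);       % dictionary size
    % Solve (X'X + lambda*I) * beta = X'y; speye keeps the identity sparse.
    beta = (X' * X + lambda * speye(d)) \ (X' * y);

Using the backslash solve rather than an explicit matrix inverse is the standard MATLAB idiom; it is both faster and numerically better behaved.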

We calculate this β for a training set of review texts along with their corresponding ratings, which range from 1 to 5. We then evaluate the model β on the withheld reviews in order to perform 10-fold cross-validation.

3 Implementation

All of our code was implemented in MATLAB because of its ease of use and its optimized matrix multiplication. We parsed the reviews from the tokenized binary format into a bag-of-words representation stored as a sparse matrix; MATLAB uses Compressed Sparse Column format, and the bag-of-words matrix is very sparse, so this storage is efficient. We also filtered out duplicate reviews by hashing the first ten words of each review to determine distinctness, which reduced the one million reviews to 492,130.

The matrix multiplications were performed on the sparse matrix X along with our y vector and the ridge parameter. The product XᵀX takes up too much memory if we use the entire dictionary, so we chose to limit the dictionary size. We found that a dictionary of the 10,000 most frequent words provided sufficient features while remaining computable with MATLAB's built-in sparse matrix multiplication. We also tried a dictionary of 15,000 words, but saw only a marginal improvement along with much slower execution. Using our 10,000-word dictionary, we measured the performance of our matrix multiply at 0.123 Gflop/s, computed by dividing the number of non-zero elements in our matrices by the time taken for the multiplication.

4 Design Decisions

4.1 Tuning the λ Parameter

Since we used ridge regression, we were able to tune the λ parameter for the best performance. As shown in Figure 1, we observed an AUC improvement of about 0.5% between the default value of 1 and a tuned value of 1500. This is a fairly substantial difference, given that the AUC is already at 93% with the default value. We also noticed a major difference in the top words when using better λ values. With the default value of 1, our top words included obscure terms such as c, j, donovan, and matlock. With the tuned λ value of 1500, the top words were much more representative, as shown in Table 1.

4.2 Porter Stemming

Porter stemming is a process for removing some of the common morphological and inflexional endings from English words. For a classifier, it consolidates words with similar semantics, which can often help reduce dependencies between similar words. We ran performance tests with and without Porter stemming.
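To make the effect concrete, here is a hypothetical illustration; the paper does not say which Porter implementation was used, so normalizeWords from MATLAB's Text Analytics Toolbox, which applies a Porter stemmer to English words, stands in here:

    % Hypothetical illustration of Porter stemming using the Text Analytics
    % Toolbox; the paper does not specify which stemmer implementation it used.
    words = ["disappointing" "disappointment" "disappointed"];
    stems = normalizeWords(words, 'Style', 'stem')
    % stems = ["disappoint" "disappoint" "disappoint"]

All three variants collapse to the single feature "disappoint", which is exactly the consolidation discussed in Section 6.2.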

    λ       AUC
    5       0.9237
    10      0.9239
    50      0.9252
    500     0.9292
    1000    0.9298
    1500    0.9299
    2000    0.9298
    3000    0.9293
    5000    0.9278

Figure 1: AUC for varying λ values with ridge regression. Dictionary pruned to 15,000 words.

4.3 Stop Words

To minimize the influence of short, commonly used words on the sentiment analysis, we tested a model that ignored a set of approximately 125 stop words, including words like the, and, a, and are. However, a 10-fold cross-validation across the entire dataset showed that the AUC difference between the model that omitted stop words and our base model was only 0.5%. Likewise, the most heavily weighted terms remained almost exactly the same as in the original model. Given this negligible variation, and the observation that no stop words appeared in our most heavily weighted positive and negative word lists, we conclude that the effect of these common words is not substantial, most likely because they are more or less evenly distributed among positive and negative reviews.

5 Results

The baseline result of our 10-fold cross-validation, using a dictionary of 10,000 words and a λ of 1, is an AUC of 0.9257 with a 1% lift of 34.067. The best result we achieved used a dictionary of 15,000 words and a λ of 1,500, giving an AUC of 0.9313 and a lift of 38.27. We did not observe improvements from Porter stemming or stop word removal, as shown in Figure 2.
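For concreteness, a sweep like the one behind Figure 1 could be reproduced along the following lines. This is a sketch under stated assumptions, not the authors' code: it uses a single 90/10 train/test split rather than the full 10-fold procedure, relies on perfcurve from the Statistics and Machine Learning Toolbox, and binarizes ratings at 4 and above as "positive" for the AUC computation, a detail the paper does not specify.

    % Sketch of a lambda sweep; X is the sparse bag-of-words matrix, y the ratings.
    lambdas = [5 10 50 500 1000 1500 2000 3000 5000];
    [n, d] = size(X);
    idx = randperm(n);                 % random 90/10 train/test split
    nTrain = round(0.9 * n);
    train = idx(1:nTrain);
    test = idx(nTrain+1:end);
    Xtr = X(train, :);  ytr = y(train);
    aucs = zeros(size(lambdas));
    for i = 1:numel(lambdas)
        beta = (Xtr' * Xtr + lambdas(i) * speye(d)) \ (Xtr' * ytr);
        scores = X(test, :) * beta;
        % Assumed binarization: ratings >= 4 treated as the positive class.
        [~, ~, ~, aucs(i)] = perfcurve(y(test) >= 4, scores, true);
    end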

We also believe that a larger dictionary offers only modest improvements, as the AUC difference between 10,000 and 15,000 words is only 0.0013 with tuned λ parameters.

6 Analysis

6.1 Top Words

As mentioned above, the λ parameter played a major role in determining the top words: a well-tuned λ gave more reasonable top words and improved performance. Looking at Table 1, all of the top 10 positive and negative words seem quite reasonable and accurate. It is interesting to note that negative words carry much stronger weights than positive words: the top negative word, waste, has 3.5 times the weight of the top positive word, excellent. This is likely because the reviews in this dataset are heavily skewed towards positive ratings, with a mean rating of 4.23.

    Top Positive Words   Weight     Top Negative Words   Weight
    excellent            0.20       waste                -0.71
    awesome              0.18       disappointing        -0.57
    invaluable           0.15       poorly               -0.56
    highly               0.15       disappointment       -0.52
    amazing              0.15       worst                -0.47
    outstanding          0.15       boring               -0.46
    wonderful            0.14       disappointed         -0.46
    loved                0.14       useless              -0.42
    pleased              0.14       hoping               -0.32
    fantastic            0.14       lacks                -0.31

Table 1: Top words and their weights with λ = 1500

6.2 Porter Stemming

Adding Porter stemming to our model did not have a significant impact on the AUC. The 10-fold cross-validation AUC for our stemmed model was 92.5%, which is 0.5% lower than the baseline AUC. This negative impact might be due to stemming reducing certain positive and negative words to the same stem, lowering their influence. The lack of accuracy gain from stemming is consistent with our results in Homework 1, where stemming had a negative effect on the accuracy of our naive Bayes sentiment classifiers.

The positive and negative stems with the highest weights, shown in Table 2, differ slightly, and in a different order, from the results above. As expected, words sharing a stem were consolidated; for example, our top negative words for the base model included disappointing, disappointment, and disappointed.

    Top Positive Stems   Weight     Top Negative Stems   Weight
    awesom book          0.19       wast                 -0.64
    excel                0.19       poor                 -0.57
    highlif              0.16       disappoint           -0.56
    invaluab             0.15       worst/best           -0.47
    amaz                 0.14       useless/worthless    -0.41
    superb               0.14       bore                 -0.30
    outstand             0.13       sorri                -0.27
    terrifc              0.13       terribl              -0.27
    fantasti             0.13       unfortuna            -0.26
    thank you            0.12       mislead              -0.25

Table 2: Top stems and their weights with Porter stemming and λ = 1500

In the stemmed results, these separate words have been consolidated into the single stem disappoint, allowing new stems such as unfortuna, terribl, and mislead to enter the top ten negative list. Interestingly, the weight of the consolidated stem does not place it any higher than in the original list, presumably because the next stem, poor, has also gained weight from similar consolidations.

7 Future Improvements

Figure 2: ROC curves (true positive rate vs. false positive rate) for the baseline, Porter stemming, and stop word models.