Big Data and Sentiment Quantification: Analytical Tools and Outcomes

Similar documents
An assistive application identifying emotional state and executing a methodical healing process for depressive individuals.

ISC- GRADE XI HUMANITIES ( ) PSYCHOLOGY. Chapter 2- Methods of Psychology

IRIT at e-risk. 1 Introduction

Using Information From the Target Language to Improve Crosslingual Text Classification

MATH-134. Experimental Design

Asthma Surveillance Using Social Media Data

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES

The Emotion Analysis on the Chinese Comments from News portal and Forums Jiawei Shen1, 2, Wenjun Wang1, 2 and Yueheng Sun1, 2, a

Author s Traits Prediction on Twitter Data using Content Based Approach

Social Media Mining for Toxicovigilance

Lecture 20: CS 5306 / INFO 5306: Crowdsourcing and Human Computation

Christopher Cairns and Elizabeth Plantan. October 9, 2016

Measuring Attitudes. Measurement and Theory of Democratic Attitudes. Introduction Measurement Summary

Statistics is the science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making effective decisions

CHAPTER 2. MEASURING AND DESCRIBING VARIABLES

32.5. percent of U.S. manufacturers experiencing unfair currency manipulation in the trade practices of other countries.

A Comparison of Collaborative Filtering Methods for Medication Reconciliation

Ch. 11 Measurement. Paul I-Hai Lin, Professor A Core Course for M.S. Technology Purdue University Fort Wayne Campus

Data = collections of observations, measurements, gender, survey responses etc. Sample = collection of some members (a subset) of the population

Agents with Attitude: Exploring Coombs Unfolding Technique with Agent-Based Models

Statistics is the science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making effective decisions

Math 124: Module 3 and Module 4

Ch. 11 Measurement. Measurement

Testing the Persuasiveness of the Oklahoma Academy of Science Statement on Science, Religion, and Teaching Evolution

Homework #2 is due next Friday at 5pm.

From Sentiment to Emotion Analysis in Social Networks

Psychology. January 11, 2019.

How are polls conducted? by Frank Newport, Lydia Saad, David Moore from Where America Stands, 1997 John Wiley & Sons, Inc.

CHAPTER 3 RESEARCH METHODOLOGY

Sentiment Analysis of Reviews: Should we analyze writer intentions or reader perceptions?

This is a repository copy of Measuring the effect of public health campaigns on Twitter: the case of World Autism Awareness Day.

Data collection, summarizing data (organization and analysis of data) The drawing of inferences about a population from a sample taken from

Social Studies 4 8 (118)

IAASB Main Agenda (September 2005) Page Agenda Item. Analysis of ISA 330 and Mapping Document

Chapter Five. Consumer Markets and Consumer Buyer Behavior. I t s good and good for you. Chapter 5- slide 1

RECOMMENDED CITATION: Pew Research Center, December, 2014, Perceptions of Job News Trend Upward

Making the Best of Imperfect Data: Reflections on an Ideal World. Mike Ambinder, PhD CHI PLAY 2014 October 20 th, 2014

Introduction to Econometrics

Lecture (chapter 1): Introduction

May All Your Wishes Come True: A Study of Wishes and How to Recognize Them

Chapter 4: Defining and Measuring Variables

I. Introduction and Data Collection B. Sampling. 1. Bias. In this section Bias Random Sampling Sampling Error

Chapter 3. Producing Data

Briefing for employers on Asperger Syndrome

Use of Twitter to Assess Sentiment toward Waterpipe Tobacco Smoking

Math 124: Modules 3 and 4. Sampling. Designing. Studies. Studies. Experimental Studies Surveys. Math 124: Modules 3 and 4. Sampling.

DEVELOPING EMOTIONAL INTELLIGENCE IN SALES A S TRATEGIC L EARNING, I NC. W HITEPAPER THE SALES PERFORMANCE IMPROVEMENT CHALLENGE

Emotion-Aware Machines

Signals from Text: Sentiment, Intent, Emotion, Deception

Variable Data univariate data set bivariate data set multivariate data set categorical qualitative numerical quantitative

LifelogExplorer: A Tool for Visual Exploration of Ambulatory Skin Conductance Measurements in Context

1. The Role of Sample Survey Design

An Online ADR System Using a Tool for Animated Agents

Selection bias and models in nonprobability sampling

A Cooperative Multiagent Architecture for Turkish Sign Tutors

Attitude Measurement

Statistics Mathematics 243

Psychology 466: Judgment & Decision Making

Detecting and monitoring foodborne illness outbreaks: Twitter communications and the 2015 U.S. Salmonella outbreak linked to imported cucumbers

Affect in Virtual Agents (and Robots) Professor Beste Filiz Yuksel University of San Francisco CS 686/486

Creative Commons Attribution-NonCommercial-Share Alike License

PERSONALIZED PREDICTIVE PREVENTIVE

Emotion Recognition using a Cauchy Naive Bayes Classifier

Modes of Measurement. Outline. Modes of Measurement. PSY 395 Oswald

Funnelling Used to describe a process of narrowing down of focus within a literature review. So, the writer begins with a broad discussion providing b

Social Network Data Analysis for User Stress Discovery and Recovery

Survey Research. We can learn a lot simply by asking people what we want to know... THE PREVALENCE OF SURVEYS IN COMMUNICATION RESEARCH

Observation Studies, Sampling Designs and Bias

Methodological skills

The British Psychological Society. Promoting excellence in psychology. Belonging to DHP.

A MODIFIED FREQUENCY BASED TERM WEIGHTING APPROACH FOR INFORMATION RETRIEVAL

Taking advantage of Twitter data to investigate sentiments towards environmental issues during the 2016 U.S. presidential election.

Using Social Media to Understand Cyber Attack Behavior

Research Prospectus. Your major writing assignment for the quarter is to prepare a twelve-page research prospectus.

TTI Personal Talent Skills Inventory Coaching Report

Chapter 1 Introduction to I/O Psychology

Research on Social Psychology Based on Network Big Data

Sound is the. spice of life

ADMS Sampling Technique and Survey Studies

Unconscious Bias: From Awareness to Action!

Observation and Assessment. Narratives

All Possible Regressions Using IBM SPSS: A Practitioner s Guide to Automatic Linear Modeling

PDRF About Propensity Weighting emma in Australia Adam Hodgson & Andrey Ponomarev Ipsos Connect Australia

Progress in Risk Science and Causality

GfK Verein. Detecting Emotions from Voice

What is the impact of mode effect on non-response survey usability?

Sound Off DR. GOOGLE S ROLE IN PRE-DIAGNOSIS THROUGH TREATMENT. Ipsos SMX. June 2014

Chris A. Jones, D.Phil., M.Sc. Assistant Professor of Surgery and Economics Director, Global Health Economics Unit Vermont Center for Clinical and

Experimental Validity

Understanding Science Conceptual Framework

From Codes to Conclusions: Strategies for Analyzing Qualitative Data

Americans Views on Moonshot Initiative and Cancer Research

Accessible Social Media. Danny Housley IDEAS 2018

Predictive Analytics and machine learning in clinical decision systems: simplified medical management decision making for health practitioners

Correlation Neglect in Belief Formation

A Practical Introduction to Content Analysis. Catherine Corrigall-Brown Department of Sociology

The Heart Wants What It Wants: Effects of Desirability and Body Part Salience on Distance Perceptions

Detecting Cognitive States Using Machine Learning

A Pragmatic Approach to Implementation of Emotional Intelligence in Machines

Political Science 15, Winter 2014 Final Review

Transcription:

Big Data and Sentiment Quantification: Analytical Tools and Outcomes Fabrizio Sebastiani Istituto di Scienza e Tecnologie dell Informazione Consiglio Nazionale delle Ricerche 56124 Pisa, IT E-mail: fabrizio.sebastiani@isti.cnr.it October 11, 2017 @ European University Institute, Firenze, IT Download these slides at http://bit.ly/2z31srz

Classification: A Primer Classification (aka categorization ) is the task of assigning data items to groups ( classes ) whose existence is known in advance; e.g., Assigning newspaper articles to one or more of Home News, Politics, Economy, Lifestyles, Sports Assigning comments about products to exactly one of Excellent, Good, Average, Poor, Disastrous Classification requires subjective judgment : assigning natural numbers to either Prime or NonPrime is not classification (Automatic) Classification is usually tackled via supervised machine learning : a general-purpose learning algorithm trains (using a set of manually classified items) a classifier to recognize the characteristics an item should have in order to be attributed to a given class 2 / 32

What is quantification? 1 1 Dodds, Peter et al. Temporal Patterns of Happiness and Information in a Global Social Network: Hedonometrics and Twitter. PLoS ONE, 6(12), 2011. 3 / 32

What is quantification? (cont d) 4 / 32

What is quantification? (cont d) In many applications of classification, the real goal is determining the relative frequency (or: prevalence) of each class in the unlabelled data (quantification, a.k.a. supervised prevalence estimation) E.g. Among the tweets about the next presidential elections, what is the fraction of pro-democrat ones? Among the posts about the Apple Watch 3 posted on forums, what is the fraction of very negative ones? How have these percentages evolved over time? This task has been studied within IR, ML, DM, NLP, and has given rise to learning methods and evaluation measures specific to it 5 / 32

The paradox of quantification Is classify and count the optimal quantification strategy? No! A perfect classifier is also a perfect quantifier (i.e., estimator of class prevalence), but...... a good classifier is not necessarily a good quantifier (and vice versa) : FP FN Classifier A 18 20 Classifier B 20 20 Paradoxically, we should choose quantifier B rather than quantifier A, since A is biased This means that quantification should be studied as a task in its own right 6 / 32

Vapnik s Principle Key observation: classification is a more general problem than quantification Vapnik s principle: If you possess a restricted amount of information for solving some problem, try to solve the problem directly and never solve a more general problem as an intermediate step. It is possible that the available information is sufficient for a direct solution but is insufficient for solving a more general intermediate problem. This suggests solving quantification directly (without solving classification as an intermediate step) with the goal of achieving higher quantification accuracy than if we opted for the indirect solution 7 / 32

What is quantification? (cont d) Quantification may be also defined as the task of approximating a true distribution by a predicted distribution +,-.!6,:8324,! 6,:8324,! 6,73-89! 5;<=>?@<=! @;A<! 5012324,! +,-.!/012324,! "#"""$! "#%""$! "#&""$! "#'""$! "#(""$! "#)""$! "#*""$!! As a result, evaluation measures for quantification are divergences, which evaluate how much a predicted distribution diverges from the true distribution 8 / 32

Distribution drift The need to perform quantification arises because of distribution drift, i.e., the presence of a discrepancy between the class distribution of Tr and that of Te. Distribution drift may derive when the environment is not stationary across time and/or space and/or other variables, and the testing conditions are irreproducible at training time the process of labelling training data is class-dependent (e.g., stratified training sets) the labelling process introduces bias in the training set (e.g., if active learning is used) Distribution drift clashes with the IID assumption, on which standard ML algorithms are instead based. 9 / 32

Applications of quantification A number of fields where classification is used are not interested in individual data, but in data aggregated across spatio-temporal contexts and according to other variables (e.g., gender, age group, religion, job type,...); e.g., Social sciences : studying indicators concerning society and the relationships among individuals within it 2 [Others] may be interested in finding the needle in the haystack, but social scientists are more commonly interested in characterizing the haystack. (Hopkins and King, 2010) Political science : e.g., predicting election results by estimating the prevalence of blog posts (or tweets) supporting a given candidate or party 2 D. Hopkins and G. King, A Method of Automated Nonparametric Content Analysis for Social Science. American Journal of Political Science 54(1), 2010. 10 / 32

Applications of quantification (cont d) Epidemiology : concerned with tracking the incidence and the spread of diseases; e.g., estimate pathology prevalence from clinical reports where pathologies are diagnosed estimate the prevalence of different causes of death from verbal accounts of symptoms Market Research : concerned with estimating the distribution of consumers attitudes about products, product features, or marketing strategies; e.g., quantifying customers attitudes from verbal responses to open-ended questions Others : e.g., estimating the proportion of no-shows within a set of bookings estimating the proportions of different types of cells in blood samples 11 / 32

Quantification methods Quantification methods belong to two classes 1. Aggregative : they require the classification of individual items as a basic step 2. Non-aggregative : quantification is performed without performing classification Aggregative methods may be further subdivided into 1a. Methods using general-purpose learners (i.e., originally devised for classification); can use any supervised learning algorithm that returns posterior probabilities 1b. Methods using special-purpose learners (i.e., especially devised for quantification) 12 / 32

Evaluating quantification methods Quantification accuracy is often analysed by class prevalence... Table: Accuracy as measured in terms of KLD on the 5148 test sets of RCV1-v2 grouped by class prevalence in Tr RCV1-v2 VLP LP HP VHP All SVM(KLD) 2.09E-03 4.92E-04 7.19E-04 1.12E-03 1.32E-03 PACC 2.16E-03 1.70E-03 4.24E-04 2.75E-04 1.74E-03 ACC 2.17E-03 1.98E-03 5.08E-04 6.79E-04 1.87E-03 MAX 2.16E-03 2.48E-03 6.70E-04 9.03E-05 2.03E-03 CC 2.55E-03 3.39E-03 1.29E-03 1.61E-03 2.71E-03 X 3.48E-03 8.45E-03 1.32E-03 2.43E-04 4.96E-03 PCC 1.04E-02 6.49E-03 3.87E-03 1.51E-03 7.86E-03 MM(PP) 1.76E-02 9.74E-03 2.73E-03 1.33E-03 1.24E-02 MS 1.98E-02 7.33E-03 3.70E-03 2.38E-03 1.27E-02 T50 1.35E-02 1.74E-02 7.20E-03 3.17E-03 1.38E-02 MM(KS) 2.00E-02 1.14E-02 9.56E-04 3.62E-04 1.40E-02 13 / 32

Evaluating quantification methods (cont d)... or by amount of drift... Table: Accuracy as measured in terms of KLD on the 5148 test sets of RCV1-v2 grouped into quartiles homogeneous by distribution drift RCV1-v2 VLD LD HD VHD All SVM(KLD) 1.17E-03 1.10E-03 1.38E-03 1.67E-03 1.32E-03 PACC 1.92E-03 2.11E-03 1.74E-03 1.20E-03 1.74E-03 ACC 1.70E-03 1.74E-03 1.93E-03 2.14E-03 1.87E-03 MAX 2.20E-03 2.15E-03 2.25E-03 1.52E-03 2.03E-03 CC 2.43E-03 2.44E-03 2.79E-03 3.18E-03 2.71E-03 X 3.89E-03 4.18E-03 4.31E-03 7.46E-03 4.96E-03 PCC 8.92E-03 8.64E-03 7.75E-03 6.24E-03 7.86E-03 MM(PP) 1.26E-02 1.41E-02 1.32E-02 1.00E-02 1.24E-02 MS 1.37E-02 1.67E-02 1.20E-02 8.68E-03 1.27E-02 T50 1.17E-02 1.38E-02 1.49E-02 1.50E-02 1.38E-02 MM(KS) 1.41E-02 1.58E-02 1.53E-02 1.10E-02 1.40E-02 14 / 32

Evaluating quantification methods (cont d)... or along the temporal dimension... 15 / 32

Sentiment quantification 16 / 32

Sentiment analysis Sentiment Quantification is a part of Sentiment Analysis, a set of tasks concerned with the analysing of texts according to the sentiments / opinions / emotions / judgments expressed in them SA is the Holy Grail of market research, opinion research, and online reputation management. Mostly concerned with analysing user-generated content in online media, such as product reviews or (micro-)blog posts 17 / 32

How Difficult is Sentiment Analysis? Sentiment analysis is inherently difficult, because in order to express opinions / emotions / etc. we often use a wide variety of sophisticated expressive means (e.g., metaphor, irony, sarcasm, allegation, understatement, etc.) At that time, Clint Eastwood had only two facial expressions: with the hat and without it. (from an interview with Sergio Leone) She runs the gamut of emotions from A to B (on Katharine Hepburn in The Lake, 1934) If you are reading this because it is your darling fragrance, please wear it at home exclusively, and tape the windows shut. (from a 2008 review of parfum Amarige, Givenchy) Sentiment analysis characterised as an NLP-complete problem 18 / 32

Sentiment quantification An interesting instance of quantification is sentiment quantification, i.e., supervised prevalence estimation for sentiment-related classes 3 Sentiment quantification (and classification) may be binary (Positive, Negative) ternary (Positive, Negative, Neutral) ordinal (e.g., Excellent, Good, Average, Poor, Disastrous) Each such case has its own learning algorithms and evaluation measures 3 A. Esuli and F. Sebastiani. Sentiment Quantification. IEEE Intelligent Systems, 25(4):72-75, 2010. 19 / 32

Sentiment quantification (cont d) It is often the case that we perform sentiment classification with quantification in mind Example 1 (CRM): How satisfied are you with our mobile phone services? Asked by: telecom company Class of interest: MayDefectToCompetition Goal: classification (at the individual level) Example 2 (MR): How do you like the recent ad for product X? Asked by: MR agency Class of interest: LovedTheCampaign Goal: quantification (at the aggregate level) 20 / 32

Sentiment quantification (cont d) A typical medium for which sentiment classification is performed with quantification in mind is Twitter The 2016 and 2017 editions of the SemEval task Sentiment Analysis in Twitter include two subtasks each ( ternary and ordinal ) on quantification Experiments show that methods specifically aimed at quantification estimate class prevalences more accurately than standard CC in both the ternary 4 and ordinal 5 cases 4 W. Gao and F. Sebastiani. From Classification to Quantification in Tweet Sentiment Analysis. Social Network Analysis and Mining, 6(19), 2016. 5 G. Da San Martino, W. Gao, and F. Sebastiani. Ordinal Text Quantification. Proceedings of SIGIR 2016. 21 / 32

Conclusion Quantification: a relatively (yet) unexplored new task, with many research problems still open Growing awareness that (sentiment) quantification is going to be more and more important; given the advent of big data, application contexts will spring up in which we will simply be happy with analysing data at the aggregate (rather than at the individual) level 22 / 32

Questions? 23 / 32

Thank you! For any question, email me at fabrizio.sebastiani@isti.cnr.it 24 / 32