Incorporating Word Correlation Knowledge into Topic Modeling. Pengtao Xie. Joint work with Diyi Yang and Eric Xing

Outline: Motivation; Incorporating Word Correlation Knowledge into Topic Modeling; Experiments; Conclusions

Topic Modeling (slides courtesy of David Blei)

Topic Modeling: Conditional Independence Assumption

Word Correlation Knowledge: WordNet, Wikipedia, Named Entity Recognition

Existing Work: imposes constraints over the topic-word multinomials; cannot differentiate semantic subtleties.

Word Correlation Knowledge: ideal word correlation knowledge; actual knowledge

Use Knowledge to Control Topic Assignments

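Markov Random Field Regularized Latent Dirichlet Allocation

Word correlation knowledge: (w1, w3), (w2, w5), (w3, w4), (w3, w5)

Define an MRF over z:

$$p(z \mid \theta, \lambda) = \frac{1}{A(\theta, \lambda)} \prod_{i=1}^{N} p(z_i \mid \theta) \, \exp\Big\{ \lambda \sum_{(m,n) \in P} I(z_m = z_n) \Big\}$$

Here P is the set of correlated word pairs, I(·) is the indicator function, and A(θ, λ) is the partition function.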
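To make the MRF prior concrete, here is a minimal Python sketch that evaluates it by brute-force enumeration for the five-word example on the slide. The function name `mrf_prior`, the two-topic `theta`, and the value `lam=1.0` are illustrative assumptions, not taken from the talk, and enumeration is only feasible for toy sizes.

```python
import itertools
import math

def mrf_prior(theta, pairs, lam, n_words):
    """Return {z: p(z | theta, lam)} by enumerating every topic assignment.

    Hypothetical helper for illustration: it costs K**n_words evaluations,
    so it only works for tiny examples.
    """
    n_topics = len(theta)
    weights = {}
    for z in itertools.product(range(n_topics), repeat=n_words):
        base = math.prod(theta[k] for k in z)              # prod_i p(z_i | theta)
        agree = sum(1 for m, n in pairs if z[m] == z[n])   # sum of indicators over P
        weights[z] = base * math.exp(lam * agree)
    partition = sum(weights.values())                      # A(theta, lam)
    return {z: w / partition for z, w in weights.items()}

# Word pairs from the slide, converted to 0-based indices.
pairs = [(0, 2), (1, 4), (2, 3), (2, 4)]
theta = [0.5, 0.5]  # assumed topic proportions for two topics

prior = mrf_prior(theta, pairs, lam=1.0, n_words=5)
all_same = prior[(0, 0, 0, 0, 0)]
plain = mrf_prior(theta, pairs, lam=0.0, n_words=5)[(0, 0, 0, 0, 0)]
print(all_same, plain)
```

With `lam=0.0` the prior reduces to independent draws from `theta`, so every assignment has probability 1/32 here. A positive `lam` shifts mass toward assignments in which paired words share a topic.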

Markov Random Field Regularized Latent Dirichlet Allocation

Sample a topic proportion vector $\theta \sim \mathrm{Dir}(\alpha)$
Sample topic assignments $z \sim p(z \mid \theta, \lambda)$
Sample each word $w_i \sim \mathrm{Multi}(\beta_{z_i})$
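The three sampling steps can be sketched for a toy document as follows. This is only an illustration under stated assumptions: the joint draw of z is done by exhaustive enumeration (feasible only for a handful of words), and `sample_document`, `alpha`, `beta`, and `lam` are made-up names and values rather than anything specified in the talk.

```python
import itertools
import math
import random

def sample_document(alpha, beta, pairs, lam, n_words, rng):
    """Draw (theta, z, words) following the three steps on the slide."""
    n_topics = len(alpha)
    # Step 1: theta ~ Dir(alpha), using normalized Gamma draws.
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    theta = [g / total for g in gammas]
    # Step 2: z ~ p(z | theta, lam). Enumerate all assignments and sample
    # one in proportion to its unnormalized MRF weight.
    configs = list(itertools.product(range(n_topics), repeat=n_words))
    weights = [
        math.prod(theta[k] for k in z)
        * math.exp(lam * sum(z[m] == z[n] for m, n in pairs))
        for z in configs
    ]
    z = rng.choices(configs, weights=weights, k=1)[0]
    # Step 3: w_i ~ Multi(beta[z_i]) for each position i.
    vocab = range(len(beta[0]))
    words = [rng.choices(vocab, weights=beta[k], k=1)[0] for k in z]
    return theta, z, words

rng = random.Random(0)
alpha = [0.5, 0.5]                        # assumed Dirichlet prior
beta = [[0.4, 0.4, 0.1, 0.1],             # assumed topic-word distributions
        [0.1, 0.1, 0.4, 0.4]]
pairs = [(0, 2), (1, 4), (2, 3), (2, 4)]  # slide pairs, 0-based
theta, z, words = sample_document(alpha, beta, pairs, lam=1.0, n_words=5, rng=rng)
print(theta, z, words)
```

Unlike plain LDA, the topic assignments are drawn jointly rather than one at a time, because the MRF term couples paired positions.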

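Inference and Learning

Variational inference: use a variational distribution to approximate the posterior of the latent variables.
Optimize the variational lower bound using the EM algorithm: in the E step, estimate the variational parameters; in the M step, learn the model parameters.
Further upper-bound the log-partition function to achieve tractability:

$$E_q[\log A(\theta, \lambda)] \le c^{-1} \exp\Big\{ \sum_{(i,j) \in P} \lambda \Big\} - 1 + \log c$$

Update of topic assignment:

$$\phi_{ik} \propto \exp\Big\{ \Psi(\eta_k) + \lambda \sum_{j \in N(i)} \phi_{jk} + \sum_{v=1}^{V} w_{iv} \log \beta_{kv} \Big\}$$

Here $\phi$ and $\eta$ are the variational parameters for z and θ, Ψ is the digamma function, and N(i) is the set of words paired with word i.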
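As a rough sketch of the topic-assignment update, the code below performs one coordinate sweep of φ_ik ∝ exp{E_q[log θ_k] + λ Σ_{j∈N(i)} φ_jk + log β_{k,w_i}}, which is one reading of the update on this slide. The Dirichlet expectation is passed in as `elog_theta` rather than computed, and the function name, toy vocabulary, and parameter values are all assumptions made for illustration.

```python
import math

def update_phi(phi, words, elog_theta, log_beta, pairs, lam):
    """One in-place sweep of the mean-field update over all word positions."""
    n_words, n_topics = len(phi), len(elog_theta)
    # Build N(i): the positions paired with position i.
    neighbors = [[] for _ in range(n_words)]
    for m, n in pairs:
        neighbors[m].append(n)
        neighbors[n].append(m)
    for i in range(n_words):
        scores = [
            elog_theta[k]
            + lam * sum(phi[j][k] for j in neighbors[i])
            + log_beta[k][words[i]]
            for k in range(n_topics)
        ]
        top = max(scores)                        # subtract max for stability
        unnorm = [math.exp(s - top) for s in scores]
        total = sum(unnorm)
        phi[i] = [u / total for u in unnorm]
    return phi

# Toy setup: 2 topics, 6-word vocabulary, document uses word ids 0..4.
beta = [[0.5, 0.1, 0.1, 0.1, 0.1, 0.1],
        [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]]
log_beta = [[math.log(p) for p in row] for row in beta]
elog_theta = [math.log(0.5), math.log(0.5)]      # stand-in for E_q[log theta_k]
words = [0, 1, 2, 3, 4]
pairs = [(0, 2), (1, 4), (2, 3), (2, 4)]         # slide pairs, 0-based

phi_plain = update_phi([[0.5, 0.5] for _ in words], words,
                       elog_theta, log_beta, pairs, lam=0.0)
phi_reg = update_phi([[0.5, 0.5] for _ in words], words,
                     elog_theta, log_beta, pairs, lam=2.0)
print(phi_plain[2], phi_reg[2])
```

In this toy setup only word 0 carries topical evidence. With `lam=0.0` word 2 stays at an even split, while with `lam=2.0` its pairing with word 0 pulls it toward topic 0.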

Experiments

Datasets: 20-Newsgroups and NIPS
Word correlation knowledge: Web Eigenwords
Baselines: LDA, DF-LDA, Quad-LDA

Exemplar Topics: Sex

LDA            DF-LDA         Quad-LDA       MRF-LDA
sex            book           homosexuality  men
men            men            sex            sex
homosexuality  books          homosexual     women
homosexual     homosexual     sin            homosexual
gay            homosexuality  marriage       homosexuality
sexual         reference      context        child
com            gay            people         ass
homosexuals    read           sexual         sexual
people         male           gay            gay
cramer         homosexual     homosexuals    homosexual

Exemplar Topics: Health

LDA         DF-LDA     Quad-LDA   MRF-LDA
government  money      money      care
money       pay        insurance  insurance
private     insurance  columbia   private
people      policy     pay        cost
will        tax        health     health
health      companies  tax        costs
tax         today      year       company
care        plan       private    companies
insurance   health     care       tax
program     jobs       write      public

Exemplar Topics: Sports

LDA     DF-LDA    Quad-LDA  MRF-LDA
team    game      game      game
game    games     team      team
hockey  players   play      hockey
season  hockey    games     players
will    baseball  hockey    play
year    fan       season    player
play    league    rom       fans
nhl     played    period    teams
games   season    goal      fan
teams   ball      player    best

Quantitative Evaluation

Coherence Measure (CM): the ratio between the number of relevant words and the total number of candidate words.
Annotators: four graduate students.
10% of topics were randomly chosen for labeling.

CM (%) on the 20-Newsgroups dataset

Method    A1  A2  A3  A4  Mean  Std
LDA       30  33  22  29  28.5  4.7
DF-LDA    35  41  35  27  36.8  2.9
Quad-LDA  32  36  33  26  31.8  4.2
MRF-LDA   60  60  63  60  60.8  1.5
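As a small worked example of the Coherence Measure and the summary columns, the sketch below recomputes the mean and standard deviation for the LDA and MRF-LDA rows of the 20-Newsgroups table. It assumes the Std column is the sample standard deviation (n - 1 denominator) over the four annotators, which matches these two rows. The helper `coherence_measure` and the counts passed to it are hypothetical.

```python
import statistics

def coherence_measure(relevant, total):
    """CM in percent: relevant words over total candidate words."""
    return 100.0 * relevant / total

# Per-annotator CM (%) copied from the 20-Newsgroups table.
scores = {
    "LDA": [30, 33, 22, 29],
    "MRF-LDA": [60, 60, 63, 60],
}

# Assumption: Std is the sample standard deviation across annotators.
summary = {
    method: (statistics.mean(vals), statistics.stdev(vals))
    for method, vals in scores.items()
}

for method, (mean, std) in summary.items():
    print(f"{method}: mean={mean:.2f} std={std:.2f}")

# Made-up counts: 6 relevant words out of 10 candidates.
example_cm = coherence_measure(6, 10)
```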

Quantitative Evaluation

CM (%) on the NIPS dataset

Method    A1  A2  A3  A4  Mean  Std
LDA       75  74  74  69  73    2.7
DF-LDA    65  74  72  47  66    9.5
Quad-LDA  40  40  38  25  35.8  7.2
MRF-LDA   86  85  87  84  85.8  1.0

Conclusions

We propose an MRF-LDA model that incorporates word correlation knowledge to improve topic modeling.
The model defines an MRF over the latent topic layer of LDA to encourage correlated words to be assigned to the same topic.
We evaluate the model on two datasets and corroborate its effectiveness both qualitatively and quantitatively.

Thank you! Questions?