Incorporating Word Correlation Knowledge into Topic Modeling
Pengtao Xie
Joint work with Diyi Yang and Eric Xing
Outline
- Motivation
- Incorporating Word Correlation Knowledge into Topic Modeling
- Experiments
- Conclusions
Topic Modeling (slides courtesy of David Blei)
Topic Modeling
Topic Modeling: the Conditional Independence Assumption
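The conditional independence assumption can be made concrete with a minimal sketch of the LDA generative process that the slides build on. All sizes and hyperparameter values here are illustrative, not taken from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the slides)
K, V, N = 3, 8, 20                        # topics, vocabulary size, words per document
alpha = np.full(K, 0.1)                   # Dirichlet prior on topic proportions
beta = rng.dirichlet(np.ones(V), size=K)  # topic-word multinomials

def generate_document():
    theta = rng.dirichlet(alpha)          # topic proportions ~ Dir(alpha)
    z = rng.choice(K, size=N, p=theta)    # one topic assignment per word
    # Each word is drawn independently given its own topic assignment --
    # this is the conditional independence assumption the talk relaxes.
    w = np.array([rng.choice(V, p=beta[k]) for k in z])
    return theta, z, w

theta, z, w = generate_document()
```

Because each `w[i]` depends only on `z[i]`, correlated words carry no influence on each other's topic assignments in vanilla LDA.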
Word Correlation Knowledge
Sources: WordNet, Wikipedia, Named Entity Recognition
Existing Work
- Imposes constraints over the topic-word multinomials
- Cannot differentiate semantic subtleties
Word Correlation Knowledge: Ideal Knowledge vs. Actual Knowledge
Use Knowledge to Control Topic Assignments
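One way to operationalize this step is to turn an external resource into a set of correlated word-position pairs for a document. The tiny thesaurus and the function name `correlated_pairs` below are hypothetical, standing in for WordNet/Wikipedia-derived knowledge:

```python
from itertools import combinations

# Hypothetical correlation knowledge, standing in for WordNet/Wikipedia
synonyms = {
    "gay": {"homosexual", "homosexuality"},
    "homosexual": {"gay", "homosexuality"},
    "homosexuality": {"gay", "homosexual"},
}

def correlated_pairs(doc):
    """Return position pairs (m, n), m < n, whose words are marked as correlated."""
    return {(m, n)
            for (m, wm), (n, wn) in combinations(enumerate(doc), 2)
            if wn in synonyms.get(wm, ())}

doc = ["the", "gay", "rights", "homosexual", "community"]
pairs = correlated_pairs(doc)   # positions 1 and 3 are correlated
```

These position pairs are exactly the set P that the MRF on the next slide couples together.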
Markov Random Field Regularized Latent Dirichlet Allocation
Word correlation knowledge: (w1, w3), (w2, w5), (w3, w4), (w3, w5)
Define an MRF over z:
p(\mathbf{z} \mid \theta, \lambda) = \frac{1}{A(\theta, \lambda)} \Big( \prod_{i=1}^{N} p(z_i \mid \theta) \Big) \exp\Big( \lambda \sum_{(m,n) \in P} I(z_m = z_n) \Big)
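The MRF prior can be evaluated up to the log-partition term A(θ, λ); a short sketch, using the four correlated pairs from the slide (0-indexed) and an illustrative θ:

```python
import numpy as np

def unnormalized_log_prior(z, theta, pairs, lam):
    """log of the MRF-LDA prior over topic assignments,
    up to the log-partition term log A(theta, lambda)."""
    lda_term = np.sum(np.log(theta[z]))                   # sum_i log p(z_i | theta)
    mrf_term = lam * sum(z[m] == z[n] for m, n in pairs)  # lambda * sum I(z_m == z_n)
    return lda_term + mrf_term

# Correlated word-position pairs from the slide, 0-indexed:
# (w1, w3), (w2, w5), (w3, w4), (w3, w5)
pairs = [(0, 2), (1, 4), (2, 3), (2, 4)]
theta = np.array([0.5, 0.3, 0.2])       # illustrative topic proportions
z_same = np.array([0, 0, 0, 0, 0])      # correlated words share a topic
z_diff = np.array([0, 1, 2, 0, 1])      # correlated words mostly disagree
```

With λ > 0, assignments on which correlated words agree receive extra (unnormalized) probability mass, which is exactly the smoothing the MRF adds on top of LDA.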
Markov Random Field Regularized Latent Dirichlet Allocation: generative process
- Sample a topic proportion vector \theta \sim \mathrm{Dir}(\alpha)
- Sample topic assignments \mathbf{z} \sim p(\mathbf{z} \mid \theta, \lambda)
- Sample each word w_i \sim \mathrm{Multi}(\beta_{z_i})
Inference and Learning
- Variational inference: use a variational distribution to approximate the posterior of the latent variables
- Optimize the variational lower bound with an EM algorithm: in the E step, estimate the variational parameters; in the M step, learn the model parameters
- Further upper bound the log-partition function to achieve tractability: E_q[\log A(\theta, \lambda)] \le \log c
- Update of the topic-assignment distribution q:
\phi_{ik} \propto \exp\Big( \Psi(\gamma_k) + \sum_{v=1}^{V} w_{iv} \log \beta_{kv} + \lambda \sum_{j \in N(i)} \phi_{jk} \Big)
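A minimal sketch of that update, assuming the reconstructed form above (the helper name `update_phi` and all sizes are illustrative): each word's variational distribution over topics mixes its Dirichlet term, its word-likelihood term, and a bonus from the current beliefs of its correlated neighbors N(i).

```python
import numpy as np
from scipy.special import digamma

def update_phi(phi, gamma, log_beta, words, neighbors, lam):
    """One pass of the (reconstructed) MRF-LDA update for the
    variational topic-assignment distributions phi[i] over K topics."""
    N, K = phi.shape
    new_phi = np.empty_like(phi)
    for i in range(N):
        s = digamma(gamma) + log_beta[:, words[i]]       # Psi(gamma_k) + log beta_{k, w_i}
        if neighbors[i]:                                  # lambda * sum_{j in N(i)} phi_{jk}
            s = s + lam * phi[neighbors[i]].sum(axis=0)
        s = s - s.max()                                   # for numerical stability
        new_phi[i] = np.exp(s) / np.exp(s).sum()          # normalize over topics
    return new_phi

rng = np.random.default_rng(1)
K, V, N = 3, 5, 4                                         # illustrative sizes
gamma = rng.random(K) + 1.0                               # variational Dirichlet parameters
log_beta = np.log(rng.dirichlet(np.ones(V), size=K))      # log topic-word multinomials
words = [0, 2, 2, 4]                                      # word indices of the document
neighbors = [[1], [0], [3], [2]]                          # correlated positions: (0,1), (2,3)
phi = np.full((N, K), 1.0 / K)
phi = update_phi(phi, gamma, log_beta, words, neighbors, lam=1.0)
```

Setting `lam=0` recovers the standard LDA mean-field update, which makes the MRF term easy to ablate.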
Experiments
- Datasets: 20-Newsgroups and NIPS
- Word correlation knowledge: Web Eigenwords
- Baselines: LDA, DF-LDA, Quad-LDA
Exemplar Topics: Sex

| LDA | DF-LDA | Quad-LDA | MRF-LDA |
|---|---|---|---|
| sex | book | homosexuality | men |
| men | men | sex | sex |
| homosexuality | books | homosexual | women |
| homosexual | homosexual | sin | homosexual |
| gay | homosexuality | sexual | reference |
| context | child | com | gay |
| people | ass | homosexuals | read |
| sexual | sexual | people | male |
| gay | gay | cramer | homosexual |
| homosexuals | homosexual | | |
Exemplar Topics: Health

| LDA | DF-LDA | Quad-LDA | MRF-LDA |
|---|---|---|---|
| government | money | money | care |
| money | pay | insurance | insurance |
| private | insurance | columbia | private |
| people | policy | pay | cost |
| will | tax | health | health |
| health | companies | tax | costs |
| tax | today | year | company |
| care | plan | private | companies |
| insurance | health | care | tax |
| program | jobs | write | public |
Exemplar Topics: Sports

| LDA | DF-LDA | Quad-LDA | MRF-LDA |
|---|---|---|---|
| team | game | game | game |
| game | games | team | team |
| hockey | players | play | hockey |
| season | hockey | games | players |
| will | baseball | hockey | play |
| year | fan | season | player |
| play | league | rom | fans |
| nhl | played | period | teams |
| games | season | goal | fan |
| teams | ball | player | best |
Quantitative Evaluation
- Coherence Measure (CM): the ratio between the number of relevant words and the total number of candidate words
- Annotators: four graduate students; 10% of topics were randomly chosen for labeling

CM (%) on the 20-Newsgroups dataset:

| Method | A1 | A2 | A3 | A4 | Mean | Std |
|---|---|---|---|---|---|---|
| LDA | 30 | 33 | 22 | 29 | 28.5 | 4.7 |
| DF-LDA | 35 | 41 | 35 | 27 | 36.8 | 2.9 |
| Quad-LDA | 32 | 36 | 33 | 26 | 31.8 | 4.2 |
| MRF-LDA | 60 | 60 | 63 | 60 | 60.8 | 1.5 |
Quantitative Evaluation
CM (%) on the NIPS dataset:

| Method | A1 | A2 | A3 | A4 | Mean | Std |
|---|---|---|---|---|---|---|
| LDA | 75 | 74 | 74 | 69 | 73 | 2.7 |
| DF-LDA | 65 | 74 | 72 | 47 | 66 | 9.5 |
| Quad-LDA | 40 | 40 | 38 | 25 | 35.8 | 7.2 |
| MRF-LDA | 86 | 85 | 87 | 84 | 85.8 | 1.0 |
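As a sanity check on how the tables aggregate annotator scores, the Mean and Std for the LDA and MRF-LDA rows of the 20-Newsgroups table are reproducible as the sample mean and sample standard deviation over the four annotators (a sketch; the score lists are copied from the table above):

```python
import statistics

# Annotator CM scores (%) from the 20-Newsgroups table
scores = {
    "LDA": [30, 33, 22, 29],
    "MRF-LDA": [60, 60, 63, 60],
}

for method, s in scores.items():
    mean = statistics.mean(s)
    std = statistics.stdev(s)   # sample standard deviation (n - 1 denominator)
    print(f"{method}: mean={mean:.1f}, std={std:.1f}")
```

This reproduces 28.5 / 4.7 for LDA and 60.8 / 1.5 for MRF-LDA; the low std for MRF-LDA indicates strong agreement among the four annotators.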
Conclusions
- We propose MRF-LDA, a model that incorporates word correlation knowledge to improve topic modeling.
- The model defines an MRF over the latent topic layer of LDA to encourage correlated words to be assigned to the same topic.
- We evaluate the model on two datasets and corroborate its effectiveness both qualitatively and quantitatively.
Thank you! Questions?