Matching Methods for High-Dimensional Data with Applications to Text

Margaret E. Roberts, Brandon M. Stewart, and Richard Nielsen

This draft: October 6, 2015

We thank the following for helpful comments and suggestions on this work: David Blei, James Fowler, Erin Hartman, Seth Hill, Gary King, Adeline Lo, David Mimno, Jennifer Pan, Caroline Tolbert, and audiences at the Princeton Text Analysis Workshop, Political Methodology Society and the Visions in Methodology conference. We especially thank Dustin Tingley for numerous insightful conversations on the connections between STM and causal inference. Dan Maliniak, Ryan Powers, and Barbara Walter graciously supplied data and replication code for the gender and citations study.

Abstract

Matching is a popular technique for preprocessing observational data to facilitate causal inference and reduce model dependence by ensuring that treated and control units are balanced along pre-treatment covariates. While most applications of matching balance on a small number of covariates, we identify situations where matching with thousands of covariates may be desirable, such as causal inference where confounders are measured with text. With high-dimensional covariates, traditional matching methods are less effective and may be difficult or impossible to implement. We characterize the problem of matching in a high-dimensional context as a tradeoff between dimension reduction and imbalance bounding. We develop a new method called Topical Inverse Regression Matching (TIRM) that optimizes this tradeoff by including both a low-dimensional projection of covariates and information about the probability of treatment. We illustrate our approach by estimating the effect of censorship on the writing of Chinese bloggers, the effects of gender on citation counts in international relations, and the effects of targeted killings and capture by counterterrorists on the popularity of jihadist writings.

1 Introduction

Matching is a well-developed technique for finding appropriate counterfactuals for treated units within observational data (Rubin, 2006). Matching methods have been shown to improve balance along pre-treatment covariates and eliminate extreme counterfactuals, reducing model dependence (Ho et al., 2007; Morgan and Winship, 2014). Methods for matching have also been developed for cases with relatively few treated units, where synthetic matches, weighted combinations of control units, provide the counterfactuals (Abadie, Diamond and Hainmueller, 2010).

While the matching literature is large and well-developed, current methods for matching are predominantly developed for cases with a relatively small number of pre-treatment covariates. Popular approaches such as propensity score matching (Rosenbaum and Rubin, 1983) and coarsened exact matching (CEM) (Iacus, King and Porro, 2011) either explicitly or implicitly assume that the dimension of the pre-treatment confounders is far smaller than the number of observations in the data set. For example, Rubin and Thomas (1996, 249) note that "in typical examples of matching, $N_t$ [the number of treated observations] is between 25 and 250, the ratio $R = N_c / N_t$ [the ratio of control observations to treated observations] is between 2:1 and 20:1, and the number of matching variables, $p$, is between 5 and 50, although in some examples, $N_t$ may be 1000 or more, $R$ a hundred or more, and $p$ may be a hundred or more." In computational social science the dimensionality of rich data sources available to researchers is much larger than these figures, and is quickly outpacing the matching techniques available to condition on that information (King, 2009; Lazer et al., 2009).

In this paper we address a particular type of high-dimensional matching, where the pre-treatment confounder is captured in written text. [1] Matching on high-dimensional confounders poses three distinct challenges that make the use of existing methods infeasible.
The first challenge is that as the dimension of the pre-treatment covariates increases there will tend to be no observations that are nearly the same along all observed dimensions. Thus methods such as exact matching that require observations to match across all covariates will typically produce no matches, effectively pruning the entire data set (Rubin and Thomas, 1996, 250). Methods such as coarsened exact matching (Iacus, King and Porro, 2011) weaken the restrictions of exact matching by matching within coarsened strata along each dimension; however, such methods do not allow coarsening to be applied across variables, only within variables.

[1] While we focus on the text analysis case here, our general approach applies to any high-dimensional setting where there is a reasonable model for dimension reduction of the confounders.

The second challenge is that high-dimensional data makes it difficult to estimate predictive models of treatment. Propensity score matching and related methods rely on the ability to build a predictive model of treatment (Rosenbaum and Rubin, 1983). To reduce bias, we only need to condition on the subset of variables that are related to treatment status. However, in high-dimensional data sets the analyst usually has too many variables to use standard regression techniques; moreover, many of these variables are not relevant to predicting treatment status. A combination of inverse regression and automated variable selection can be used to estimate the probability of treatment for any individual observation. These dimensionality reduction techniques are imperative to increasing the efficiency of matches, as matching on variables that are orthogonal to treatment will randomly prune observations from the matched dataset.

The third challenge is the relative lack of appropriate balance metrics for high-dimensional data in general and text data in particular. When the analyst has a well-defined balance metric that accurately measures the properties of the data that make observations suitable counterfactuals, this balance metric can be directly optimized (Hainmueller, 2011; Diamond and Sekhon, 2013; Imai and Ratkovic, 2014). With a suitable balance metric the central decision for matching is identifying the appropriate trade-off between balance and sample size (King, Lucas and Nielsen, 2015).
However, with text data, for example, the best measure of imbalance is often a human reading of matched pairs to evaluate whether these pairs are sufficiently equivalent. As such, human balance checking is imperative.

In this paper, we set out to address each of these challenges with a particular focus on applications to text data. We develop two approximate analogs to existing matching methods. First, we show that a propensity score for text data can be estimated efficiently using multinomial inverse regression (Taddy, 2013b). Next, we show that matching on topics in text is similar to coarsened exact matching where coarsening occurs across sets of variables, in this case indicators for words. We demonstrate that while each of these methods has desirable properties, both have specific weaknesses that limit their applicability in all real-world settings. We develop an algorithm called Topical Inverse Regression Matching (TIRM), which combines the two previous methods in a way that retains their desirable properties while ameliorating their weaknesses.

The paper proceeds as follows. First, we explain possible use cases of text matching in a social science context and introduce the examples we use in the paper. Next, we set up the problem and introduce some notation. We then adapt CEM and propensity score matching to textual data and discuss their respective strengths and drawbacks. We introduce our approach for text matching that alleviates these drawbacks. Last, we apply our methods to three social science examples: understanding the effects of being censored on the reactions of bloggers, the effect of author gender on academic article citations, and the effects of killing jihadist clerics on their subsequent popularity.

2 Use Cases of High-Dimensional Matching

There are a number of contexts where social scientists may wish to use matching with high-dimensional covariate data. In the case of text, whenever similarity between treated and control observations could be measured in terms of writing, text matching could be used to find appropriate counterfactuals. Since politics produces vast amounts of writing, our method is widely applicable. For example, if a political scientist in American politics were interested in the effect of veto threats on repositioning in Congress, she may want to control for the content of a bill, which might confound vetoes and repositioning.
The effect of proximity of two countries on trade between them may be confounded by the type of trade agreement between the two countries; our method allows analysts to control for the agreement directly. Studies of bias in college admissions or employment, the observational equivalents of experimental audit studies (Neumark, Bank and Nort, 1996), could use our approach to control for the content of applicants' letters of recommendation or CVs.

While the methods described here are focused on matching on textual data, other types of high-dimensional data are becoming increasingly available that could use methods similar to those described here. An analyst estimating the effects of height on politicians' popularity may want to control for facial expressions of politicians captured in image data. High-dimensional biological data are now frequently collected on participants in lab experiments and scholars may want to use these data as controls. Other extremely fine-grained time-series data streaming from anything from cell phones to MRIs could also be used for matching purposes.

In this paper, we focus on three examples of high-dimensional matching in the case of text data to assist with social science inference. First, we examine the question of how censorship affects bloggers in China. Do bloggers avoid sensitive topics after they experience censorship? Or does censorship backfire, causing bloggers to write on more sensitive topics? Matching on the text of the blog posts, we measure self-censorship by comparing bloggers who have been censored to those who have not when the content of the blog post is identical or nearly so. We find that bloggers react negatively to censorship, writing on more sensitive topics than those who were not censored.

Second, we examine how the gender of journal article authors affects the rate at which articles are cited by others. Maliniak, Powers and Walter (2013) find that women get many fewer citations than men in international relations while controlling for article content using covariates coded from article text by humans. We replicate this study using automated matching on the text data in place of human coding. We find even stronger gendered citation results than Maliniak, Powers and Walter (2013) after using automated matching.
Last, we examine how targeted killings of jihadist clerics influence the popularity of their writings among jihadist readers on the internet. Focusing on the death of Usama Bin Laden, we test whether documents authored by Bin Laden became more popular after his death by matching them to similar texts by other authors. We find that Bin Laden's death increased the popularity of his writings for at least six months after his death and perhaps longer.

In the next sections, we introduce the matching methods and evaluate their properties. To help the reader gain intuition, we use the case of censorship in China as a running example. We go into more detail on the performance and results of each of the models at the end of the paper.

3 The Setup of the Problem

We begin by introducing the notation for matching in the context of text. We start with a data set of $n$ observations. We assume that for each observation $i$ there is a binary treatment $T_i$, which takes a value of 1 for treated and 0 for control. We adopt the potential outcome framework so that the outcome variable $Y_i$ takes on the value $Y_i(1)$ when unit $i$ is treated and $Y_i(0)$ when unit $i$ is not treated. In the censorship case, censorship is the treatment $T_i$ and the outcome $Y_i$ is the subsequent reaction of the blogger.

Because there is no random assignment, treated and control groups may not be similar before treatment. In the censorship case, censored bloggers write about very different topics than uncensored bloggers and this could explain both their treatment status and the outcome: their subsequent writing. To approximate random assignment, we control for pre-treatment covariates by matching on $k$ pre-treatment covariates $X = (X_1, X_2, \ldots, X_k)$. In order to estimate the population average treatment effect on the treated, conditional on $X$, treatment must be independent of the potential outcomes: $T_i \perp Y_i(0), Y_i(1) \mid X$.

In the cases we consider in this paper, some confounding covariates could be measured with text data, for example, the content of the blog post. In these cases of high-dimensional matching, $k$ is at best large and at worst undefined. For text, it isn't immediately clear how the features of each blog post should be represented. Text includes not only individual words, but hierarchies of words (titles and section headings) as well as word order.
It could be that overlapping sets of five words within each text should make up the feature space, in which case $k$ would include each unique consecutive five-word set within the corpus. Or, if only section headings are important confounders, then the words within the section headings should be included in $k$. For simplicity, we will represent each document as a vector of counts of each word it contains. This is the bag of words assumption (Grimmer and Stewart, 2013). We will describe all text within our examples in dimension $V$, which we define as the number of unique words within the corpus. Thus all the documents are collected within a sparse count matrix $X$ whose typical element $X_{ij}$ contains the number of times the $j$th word appears within the text associated with observation $i$. In this case the $X$ matrix has dimension $n$, the number of observations, by $k = V + r$, where $V$ is the number of unique words in the corpus and $r$ is the number of other covariates to be matched on in addition to the text. The methods described below also apply to relaxations of the bag of words assumption such as n-grams and, with some modifications, to methods which operate on sub-string sequences (Spirling, 2012).

The dimensionality problem arises because the number of unique words, $V$, can be very large relative to the number of documents, $n$. The example corpora we use within this paper have $V$ on the order of 10, , 000 and in two out of three of the cases $V > n$. These examples are not atypical for text data. Except for documents which are exact copies of each other, exact matching cannot be conducted. Nor is exact matching necessarily desirable: many of the examples described above do not require the texts to be identical in order for causal inference to be conducted. Texts only have to be sufficiently similar so as not to confound the relationship between treatment $T$ and the outcome $Y$. Thus variable selection or coarsening across variables (in this case words) must first be completed to simplify $X$ so that matching is tractable. Importantly, in order to still be able to make causal inferences, $X$ must be selected so that $T_i \perp Y_i(0), Y_i(1) \mid X$ still holds.
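
To make the bag of words representation concrete, the following is a minimal sketch of how a count matrix X with n rows and V columns could be assembled. The toy documents, the whitespace tokenizer, and the helper name `build_count_matrix` are illustrative assumptions and are not taken from the paper; a real application would use a proper tokenizer and a sparse matrix format.

```python
from collections import Counter


def build_count_matrix(docs):
    """Return (vocab, X) where X[i][j] counts word j in document i.

    ASSUMPTION: whitespace tokenization and lowercasing stand in for
    whatever preprocessing a real corpus would need.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    index = {word: j for j, word in enumerate(vocab)}
    X = []
    for tokens in tokenized:
        row = [0] * len(vocab)
        for word, count in Counter(tokens).items():
            row[index[word]] = count
        X.append(row)
    return vocab, X


# Hypothetical three-document corpus.
docs = [
    "the protest was censored",
    "the post about the protest",
    "a post about toys",
]
vocab, X = build_count_matrix(docs)
n, V = len(X), len(vocab)  # n observations by V unique words
```

Even in this tiny corpus V exceeds n, and no two rows of X are identical, so exact matching on the raw counts would return no matches.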
4 Analogs to Current Methods

Simplifying X so that matching is tractable is not new to the matching literature: most matching methods are different ways of simplifying the information in X so that matching can be completed when exact matching is impossible. In this section, we adapt two separate matching techniques from the current literature to the text case which address this problem in distinct ways: propensity score matching (PSM), which estimates the probability that a unit receives treatment and then matches on this one-dimensional projection, and coarsened exact matching (CEM), which coarsens along each dimension of the covariates until exact matching is tractable. These two methods take distinct approaches to dimensionality reduction of X and it is this difference which gives rise to their individual properties.

We describe this difference with more mathematical precision below, but for pedagogical purposes we start with a simple example. Imagine that we want to estimate the causal effect of a job training program on income in observational data. To keep matters simple, assume we have only two covariates about each individual: age and education. The need for more elaborate matching methods arises because it is presumably nearly impossible to find a treated and control unit who have identical ages and levels of education, especially if both are measured continuously.

Propensity scores model the choice of each individual to enroll in the job training program as a function of age and education, and then match along this estimated propensity. This means that two individuals with different ages and different levels of education can be matched together and treated as stochastically equivalent because both have a common propensity to enroll in the job training program. Put another way, the difference in age between a matched pair can compensate for the difference in education. Thus we address the difficulty of finding matches by relaxing the need for observations to match along all dimensions.

Coarsened exact matching adopts a different approach. First, each variable is separately coarsened; education, for instance, might be binned into categories such as no high-school degree, high-school degree, college degree and post-graduate degree.
Second, the observations are exactly matched along the coarsened variables. For well chosen bins, all observations within a given stratum are stochastically equivalent, and observations are only matched when they share a stratum along every dimension. This means that while two matched observations may not have identical years of schooling, they fall into the same category and are thus fundamentally comparable. No difference in age can compensate for a difference in education as in the propensity score model. Indeed coarsened exact matching uses no model of the probability of treatment, which means that observations must match along every included covariate.

PSM and CEM strategies yield a different set of costs and benefits (King and Nielsen, 2015). PSM requires treated and control units to match only along a single scalar and approximates a completely randomized experiment. CEM requires units to match across all covariate dimensions, but in return approximates a more powerful fully blocked design.

With these differences in mind, in the next section we develop analogs of these two methods for the text case which introduce information about the distribution of the high-dimensional confounders through the use of a generative model. We then show why each of these analogs is unsatisfying on its own. Indeed high-dimensional data seems to amplify the weaknesses of each of these methods, though we emphasize that this is not an indictment because these methods were not necessarily designed for such data. PSM and CEM do provide a useful starting point for the framework we present later in the paper. [2]

[2] We have chosen to develop analogs of two popular matching methods but there are numerous others we might have chosen. We briefly comment on two alternatives: Mahalanobis distance matching (MDM) and reweighting methods such as Entropy Balancing (Hainmueller, 2011). MDM uses a distance metric which normalizes the Euclidean distance by the sample covariance of the confounders. We return to this approach in Section 6.3 but simply note here that the high-dimensional setting often makes it difficult to accurately estimate the covariance matrix, leading to an inefficient estimator. Entropy Balancing reweights observations to match the moments of the treated and control distributions (Hainmueller, 2011). Reweighting approaches are incredibly powerful when the correct balance metric is known and easily quantified. However, the nature of the weights on the observations makes it difficult to identify individual pairs of units which serve as counterfactuals and thus it is more difficult to evaluate the quality of those matches after the fact. This is unnecessary in a case where we have full faith in a particular balance metric, but for high-dimensional data, particularly text, it is helpful to be able to qualitatively examine matches.

4.1 Matching on probabilities: the Propensity Score and MNIR

Propensity score matching is a common approach to use when exact matching fails (Rubin, 2006, 178, 264, 283). The basic idea is to simplify X by estimating the probability of treatment conditional on X, or:

$$\hat{\pi}_i = p(T_i = 1 \mid X_i) \qquad (1)$$

In typical practice this involves estimating a logistic regression where the treatment is the outcome. Then, instead of matching on the full X, the estimated probabilities $\hat{\pi}_i$, or the linear predictor, are used to match. The challenge for a direct application of this methodology to text data is that X contains a very high-dimensional representation of the texts, typically the word count matrix. The standard advice for variable selection with propensity score matching is that a variable should always be included unless there is consensus that it is unrelated to the outcome variables or not a proper covariate (Rubin, 2006, 269). However, with high-dimensional data, the estimation of the conditional distribution $p(T_i \mid X_i)$ will not be tractable or efficient unless the number of observations $n$ scales well with the dimension of the pre-treatment covariates $k$. We can obtain an estimate of the conditional distribution using regularization but it will necessarily be model-dependent and noisy.

4.1.1 Inverse Regression

To address this problem of estimating the conditional distribution efficiently, we adapt propensity score matching to the text case using inverse regression (Cook and Ni, 2005; Cook, 2007). The central idea is to posit a parametric model for the inverse problem, $p(X \mid T)$, which allows us to obtain a sufficient reduction of the information in X about the conditional distribution $p(T \mid X)$. When the feature space consists of word counts, a natural approach is to assume that the word counts arise from a multinomial distribution, which leads to the Multinomial Inverse Regression (MNIR) framework developed in Taddy (2013b). This leads to the following model for a given document:

$$x_i \sim \mathrm{Multinomial}(q_i, m_i) \qquad (2)$$

$$q_{i,v} = \frac{\exp(\alpha_v + \psi_v^\top T_i)}{\sum_{u=1}^{V} \exp(\alpha_u + \psi_u^\top T_i)} \qquad (3)$$

where $m_i$ is the total number of words in document $i$ and $T_i$ is an $l$-length vector containing a categorical encoding of the treatment variable. The coefficients $\psi$ are often given a sparsity-inducing regularizing prior (a point which we return to below).
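
As a small illustration of the word probabilities in Equation 3, the sketch below evaluates the multinomial logit for a three-word vocabulary under a one-hot treatment encoding. The coefficient values and the function name `word_probabilities` are made up for illustration; they are assumptions, not estimates from the paper.

```python
import math


def word_probabilities(alpha, psi, t):
    """Equation (3): q_v is proportional to exp(alpha_v + psi_v . t).

    alpha: list of V intercepts.
    psi:   list of V loading vectors, one l-length vector per word
           (ASSUMPTION: stored per word for convenience).
    t:     l-length categorical encoding of the treatment.
    """
    scores = [
        a + sum(p_l * t_l for p_l, t_l in zip(p_v, t))
        for a, p_v in zip(alpha, psi)
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical coefficients: word 0 is favored under treatment,
# word 1 is unaffected, word 2 is disfavored.
alpha = [0.0, 0.0, 0.0]
psi = [[0.0, 1.0], [0.0, 0.0], [0.0, -1.0]]
q_control = word_probabilities(alpha, psi, [1, 0])
q_treated = word_probabilities(alpha, psi, [0, 1])
```

With these made-up values the control distribution over words is uniform, while the treated distribution shifts mass toward the first word and away from the third.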
Mechanically, fitting this model amounts to estimating a multinomial logistic regression with the words as outcomes and the treatment as the predictor. After estimating the model we can calculate a sufficient reduction score:

$$z_i = \psi \, (x_i / m_i), \qquad T_i \perp x_i, m_i \mid z_i \qquad (4)$$

where the latter part of the equation comes from Propositions 3.1 and 3.2 of Taddy (2013b), which establish the classical sufficiency properties of the projection. This implies that given the generative model in Equation 2 we can condition on $z_i$ and discard the higher-dimensional data $x_i$.

Introducing this information about the generative process of the predictor X results in a gain in the efficiency of the estimator. [3] Under the standard propensity score model the variance of the MLE of the coefficients decreases in the number of documents. However, with MNIR the variance decreases with the number of total words (Taddy, 2013c, See Proposition 1.1). In high-dimensional data this is hugely advantageous. We defer to Taddy (2013b) and Taddy (2013c) for a more complete description of the technical properties of MNIR and Cook (2007) for inverse regression more generally.

[3] Efficiency is important in its own right but as pointed out in Robins and Morgenstern (1987) and King and Nielsen (2015) high variance estimators can lead to bias in practice. The logic is that even well-intentioned researchers will run multiple models and pick the best result. Thus when the estimator is inefficient it raises the possibility that these models will yield radically different answers which can produce an effective bias in the published work. See King and Nielsen (2015) for more on this point in the context of propensity score matching in particular.

To complete the analogy, we can estimate the propensity score using the forward regression $\hat{\pi}_i = \frac{1}{1 + \exp(-z_i^\top \beta)}$. This step would be necessary in cases where the propensity score itself was directly of interest, for example if it was used in a weighting scheme (Glynn and Quinn, 2010). For the purposes of matching the forward regression provides no new information and we can match directly on the sufficient reduction.

4.1.2 Inference for Multinomial Inverse Regression

The multinomial inverse regression model involves estimating a large number of coefficients. The coefficient matrix $\psi$ has one row per level of the treatment (so typically 2 in this setting) and one column per word in the vocabulary. We don't, however, expect that there will be treatment effects on every word in the vocabulary. To simultaneously provide variable selection and estimation we follow Taddy (2013b) in estimating the coefficients with a regularizing prior. Taddy (2013b) develops a particular penalization scheme called the Gamma-Lasso. This sparsity-inducing concave penalization method has the attractive property that it is asymptotically unbiased for large coefficients. It is motivated as maximum a posteriori (MAP) estimation under the Bayesian prior $\psi_v \sim \mathrm{Laplace}(0, \tau_v)$, $\tau_v \sim \mathrm{Gamma}(s, r)$ for some fixed hyperparameters $s$ and $r$. This prior essentially zeroes out coefficients for words where the ratio of use under treatment to use under control is neither too large nor too small.

A direct implementation of the above model would be prohibitively computationally expensive. However, because the sum of multinomial random variables is itself multinomial, we can collapse the word counts by treatment status. This radically simplifies computation. Leveraging later work in Taddy (2015a) we also distribute computation across words in the vocabulary using the connection between the multinomial and Poisson distributions. This estimation framework, including techniques for computation, is further developed in Taddy (2013b, 2015c,a).

4.1.3 Using MNIR to Estimate Propensity Scores

In standard regressions, we condition on our covariates and thus don't need to specify a parametric generative model. The idea of inverse regression is to use the assumed generative model to improve the efficiency of our estimates. We first estimate $z$, a sufficient reduction of X, and then the propensity score can be estimated using a forward regression using only the low-dimensional $z$ variable as a predictor. Due to the use of the sparsity-promoting prior, we are effectively only considering a subset of words which the model estimates have substantially different rates of use in the treated group in comparison to the control group.
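
The pipeline just described (collapse counts by treatment status, estimate sparse loadings, project each document onto them) can be sketched in a few lines. This is a deliberately simplified illustration and not the estimator of Taddy (2013b): the loadings below are smoothed log rate ratios with a soft threshold, used here only as a crude stand-in for the Gamma-Lasso MAP estimate, and the toy counts, the threshold, and all function names are assumptions.

```python
import math


def collapse_by_treatment(X, T):
    """Sum word counts within the control (0) and treated (1) groups."""
    V = len(X[0])
    totals = {0: [0] * V, 1: [0] * V}
    for row, t in zip(X, T):
        for v, count in enumerate(row):
            totals[t][v] += count
    return totals


def sparse_loadings(totals, threshold=0.5, smooth=1.0):
    """Per-word log ratio of smoothed usage rates, shrunk toward zero.

    ASSUMPTION: soft-thresholding replaces the Gamma-Lasso penalty;
    it only mimics the way small effects are zeroed out.
    """
    V = len(totals[0])
    n_treated = sum(totals[1]) + smooth * V
    n_control = sum(totals[0]) + smooth * V
    psi = []
    for v in range(V):
        raw = (math.log((totals[1][v] + smooth) / n_treated)
               - math.log((totals[0][v] + smooth) / n_control))
        psi.append(math.copysign(max(abs(raw) - threshold, 0.0), raw))
    return psi


def sufficient_reduction(X, psi):
    """Scalar version of Equation (4): z_i = psi . (x_i / m_i)."""
    return [sum(p * c for p, c in zip(psi, row)) / sum(row) for row in X]


# Hypothetical counts over the vocabulary (protest, weather, the).
X = [[3, 0, 2], [2, 0, 3], [0, 3, 2], [0, 2, 3]]
T = [1, 1, 0, 0]
psi = sparse_loadings(collapse_by_treatment(X, T))
z = sufficient_reduction(X, psi)
```

In this toy example the loading on the common word is shrunk exactly to zero, so only the two discriminating words contribute to z, and documents could then be matched on z alone.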
The advantage of this framework is that it provides a fairly straightforward connection to propensity scores. The literature on propensity scores is quite developed in areas beyond matching, and other forms of conditioning such as inverse propensity score weighting could also be used with these methods. Computation is straightforward even with extremely large datasets. Importantly, MNIR with regularization excels at selecting a small subset of variables that are related to treatment. When strong dimensionality reduction is required to create matches, identifying variables that are related to treatment and ensuring that these variables are included increases the efficiency of the framework.

The assumptions required to estimate the MNIR propensity scores are quite strong. In addition to the usual assumptions for propensity scores, we also introduce assumptions about the suitability of the generative model. Under the model described here, $\psi$ is estimating the population-average effect of the treatment on the word count vector and does not include, for example, the topic-specific types of generative models that we consider next.

Furthermore, matching on the propensity score may not result in matched texts that seem similar to human readers because propensity score matching only provides balance in distribution and does not necessarily recover (nearly) exactly matching pairs (King and Nielsen, 2015). As noted in the examples below, we find that matching on the MNIR projection sometimes matches very dissimilar documents. For example, in the censorship case a blog post about a protest may have a similar propensity to be censored as a blog post that contains pornography. While these posts and bloggers may be similar in the probability that they were censored, they are otherwise completely dissimilar. If the model is correct, this will still result in the dataset being balanced in expectation, but in practice it dramatically complicates the ability to make manual, reading-based assessments of balance.

4.2 Coarsening High-Dimensional Data: Topically Coarsened Exact Matching

In this section we provide a brief review of coarsened exact matching and then explain why a naive application of CEM to text data will inevitably fail. We then show how applying CEM to topics provides a more tractable form of coarsening.

4.2.1 Coarsened Exact Matching

Coarsened Exact Matching simplifies X by coarsening each variable into substantively indistinguishable bins and then performing exact matching. CEM creates strata for each variable in X so that exact matching can be performed at the level of these strata. Above we noted that if exact matches could not be found when matching on years of education, the education variable $X_j$ could be coarsened from years of education into bins: no high-school degree, high-school degree, college degree and post-graduate degree. Treated and control units in 9th and 10th grades, respectively, would be counted as close enough.

CEM has desirable properties for matching because it is a monotonic imbalance bounding (MIB) method, meaning that by choosing strata, the researcher bounds the differences between treated and control to the extrema of the strata. If we use exact matching on the coarsened education variable described above, we know that we will never match a treated unit in high school to a control unit in middle school. Unlike propensity score matching, which approximates a fully randomized experiment, CEM approximates the more efficient fully blocked randomized experimental design (Imai, King and Stuart, 2008; King and Nielsen, 2015).

An implicit assumption of CEM is that the set of conditioning variables is not too large relative to the total number of observations. To see why this is the case, consider a case with a single variable Education which is coarsened to have 4 categories. This results in 4 strata that must be populated with treated and control units. Now add a second variable Age which also has 4 categories. Now we have $4 \times 4 = 16$ strata. Thus the growth is exponential in the number of variables. Consider now a text corpus with a very small number of unique words: 100. Applying the maximal amount of coarsening, we coarsen each dimension to a binary variable indicating whether or not the document contains the word at all.
This still results in strata which is a number so incredibly large that we cannot possibly expect to see any matches unless all combinations of words are almost perfectly correlated with each other. When applying CEM to individual word occurrences, any unique word in a text will eliminate all possible matches. However, the logic of CEM still applies if variables (in this case, indicators for words) can be grouped into similar bins. This procedure is already familiar to students of statistical text analysis because word stemming is a type of coarsening across words. When analysts encounter a corpus containing, say, the words "censor," "censoring," and "censored," they often conclude that these words are similar enough to group into one variable "censor" by a stemming algorithm (reducing the dimension of X by two). Applying CEM to stemmed text data maintains the MIB property because any matched documents must have equal counts of the stem "censor" (though they no longer necessarily have equal counts of the unstemmed words "censor," "censoring," and "censored"), so semantic distance between texts remains bounded. Stemming is not enough, however. Even if we stemmed all words within our text, k would still not be small enough to be tractable. Besides, stemming algorithms are not well developed for languages like Chinese (where verbs are not conjugated) or Arabic (where words are modified by infixing), so stemming is not currently a viable general-purpose solution for dimension reduction in text matching applications.

Topically Coarsened Exact Matching

An alternative strategy, which we call topically coarsened exact matching, is to estimate a topic model and then apply traditional matching methods to the resulting topics. This maintains the monotonic imbalance bounding property in the topical space but it also can be seen as an analog to coarsened exact matching on words where coarsening is allowed to happen across bundles of related terms. Recall that in the generative process for a topic model each individual word in the document has a topic assignment. Under this model, two words with the same topic assignment are stochastically equivalent. Thus applying coarsened exact matching to the topic proportions on each document assures that each matched pair of documents has approximately the same proportion of words assigned to (for example) the censorship topic, but the model is indifferent to which censorship words are used.⁴

⁴ We note that this is related to but distinct from a type of data-driven stemming. In stemming all variants of a word ("censor," "censored," "censoring") map to the same stem. In a topic model a word is mapped to a topic based on the other words in the document. For example, the word "block" might be alternatively mapped to a topic about web censorship, children's toys, or karate moves depending on the other words in the document.
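The coarsening step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure used in the paper: the function name `topical_cem`, the equal-width bins, and the `n_bins` parameter are hypothetical choices, and the topic proportions are assumed to have been estimated already by some topic model.

```python
import numpy as np
from collections import defaultdict


def topical_cem(theta, treated, n_bins=3):
    """Coarsened exact matching on topic proportions (illustrative sketch).

    theta   : (n_docs, n_topics) array of topic proportions, rows sum to 1
    treated : (n_docs,) boolean treatment indicator
    n_bins  : hypothetical coarsening level (equal-width bins on [0, 1])
    """
    theta = np.asarray(theta, dtype=float)
    treated = np.asarray(treated, dtype=bool)

    # Coarsen each proportion to a bin index in {0, ..., n_bins - 1}.
    bins = np.minimum((theta * n_bins).astype(int), n_bins - 1)

    # Group documents by their full vector of coarsened topic proportions.
    strata = defaultdict(lambda: {"treated": [], "control": []})
    for doc, row in enumerate(bins):
        signature = tuple(int(b) for b in row)
        group = "treated" if treated[doc] else "control"
        strata[signature][group].append(doc)

    # Keep only strata containing both treated and control documents.
    return {sig: grp for sig, grp in strata.items()
            if grp["treated"] and grp["control"]}


# Toy example: documents 0 and 1 share a dominant first topic.
theta = [[0.80, 0.10, 0.10],
         [0.75, 0.15, 0.10],
         [0.10, 0.80, 0.10],
         [0.10, 0.10, 0.80]]
treated = [True, False, True, False]
matched = topical_cem(theta, treated, n_bins=3)
print(matched)  # {(2, 0, 0): {'treated': [0], 'control': [1]}}
```

In the toy example the treated document 2 is pruned because no control document falls in its stratum. Finer bins or more topics give closer matches at the cost of more pruning.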

We interpret this approach in two ways. If the topics themselves are semantically meaningful then the imbalance bounding properties are a meaningful and desirable aspect of the method. It is essentially a statement that the important thing that makes two texts good counterfactuals for each other is not the exact words they use but the subject matter they discuss. Alternatively we can see the topic model as an estimate of the joint density of the confounders. On this view, matching on the density estimate is a way of reducing variance at the risk of introducing a small amount of bias (compared to CEM on the full set of confounders).⁵ While we are not aware of any work which matches on a density estimate, this perspective does suggest connections to prior work in statistical genetics⁶ and optimal design for manual coding.⁷ The density estimation view is premised on the idea that two observations with a common density estimate are stochastically equivalent and deviations between them are essentially random noise. Intuitively both of these interpretations are connected to the idea that if two words commonly co-occur in the corpus as a whole they are essentially interchangeable for the purposes of identifying counterfactuals.⁸ The experimental analog of this might be thought of as a kind of partial or hierarchical blocking in which balance is enforced across the collections of words but not within each collection.

⁵ CEM bounds the sample imbalance which is directly related to bias in the causal effect estimate (Imai, King and Stuart, 2008; Iacus, King and Porro, 2011). The use of the density estimate allows for the possibility that balance is achieved in the density estimate but imbalance remains in the space of the original confounders.
⁶ In an influential and highly cited article Price et al. (2006) suggest a procedure which amounts to an eigen decomposition of genotype data followed by a regression based adjustment using that decomposition. Although they are using regression they explicitly invoke the idea that this creates a virtual set of matched cases and controls (Price et al., 2006, pg. 904). The connection to our proposed procedure is clear when considering the basic Latent Dirichlet Allocation topic model as a form of model-based discrete principal components analysis (Buntine and Jakulin, 2004).
⁷ Taddy (2013a) addresses the problem of how to select documents for manual coding in supervised learning. The crux of the problem is that you want to choose documents to code somewhat optimally, which suggests a space filling design on the text. Unfortunately this is impractical because in high dimensions every document is very far away from every other document. His solution is to estimate a topic model and then use a D-optimal space filling design in the lower-dimensional topic space. He dubs the strategy factor-optimal design. Although neither the applications nor the set up are framed in these terms, the findings here fairly clearly suggest a framework for performing approximate blocking in treatment assignment of an experimental design. In this sense the work suggests an experimental analog of the strategy pursued here for observational data.
⁸ This is a bit imprecise as the topic assignment allows for words to be context sensitive. Thus, the word "bat" can be in a topic with small mammals or sports depending on the other words in the document. These two uses of "bat" are not interchangeable from the point of view of the model.

As with all parametric topic models it is necessary to choose the number of topics. Thankfully in this setting the choice is less fraught than in cases where semantic interpretation is the primary concern (Grimmer and Stewart, 2013). In general, more topics will result in closer matches. Redundant topics will not cause bias but will drive down the efficiency of the estimator (as it will begin to approach the efficiency of simply applying CEM to the raw word counts). The primary goal is to set the number of topics high enough that matched documents are good counterfactuals, as determined by the demands of the research design. However, the risks of choosing too few topics are much greater than those of choosing too many.

Properties

Topically coarsened exact matching has a set of desirable properties that make it useful for the text case. Similar to CEM, topic matching bounds the differences between words in two matched documents by ensuring that groups of words are treated similarly. This ensures that two documents that have completely different topical content could not be matched. Topic matching in this respect does not have the same pitfalls as MNIR discussed in the previous section where two completely different documents could be matched if they had the same probability of treatment. However, topic matching does not include treatment in selecting variables to use for matching and therefore can fail to pick up on sets of words that are important for predicting treatment. Consider the censorship example. Suppose two blog posts, both talking about a particular city, were matched based on a topic with words about that city. However, one post may be about an ongoing protest in the city, something that would typically be censored, and one post may be about a new construction project within the city. If this were the case, the topics might be too coarse to distinguish important characteristics related to treatment and the assumption $T_i \perp Y_i(0), Y_i(1) \mid X$ would not hold.
To understand why this happens, it is important to realize that the topics need to explain all the words in the corpus. Typically this means that topics capture the subject matter of the document rather than the sentiment, even though sentiment may be a strong predictor of treatment assignment. Thus ideally we want a method that can combine the imbalance bounding properties on the topics with the directed assessment of words which affect the probability of treatment.⁹

5 Topical Inverse Regression Matching

In this section we combine elements of MNIR propensity score matching and topically coarsened exact matching to develop a method called topical inverse regression matching (TIRM). TIRM allows us to estimate both the topics to match on and also within-topic propensities for treatment. This type of matching forces matched documents to be topically similar, while also increasing weights on words that are related to treatment within a topic, thus incorporating the within-topic perspective of the treated group. This disallows two types of matches. First, it ensures that a document related to, say, pornography will not be matched with a document related to protest because they are topically dissimilar (even though they have a similar propensity to be censored). At the same time, this model also prevents a document about construction in a city from being matched to a document about the protest within the city (because they have very different propensities to be censored). This allows us to prioritize variables which are related to treatment assignment while approximating a blocked design on the full set of confounders (similar in spirit to the combination of PSM and nearest neighbor matching on prognostic covariates proposed in Rubin and Thomas (2000)).

The TIRM method also estimates topic-specific probabilities of treatment, which makes it more appropriate to many text applications than simply matching on the overall probability of treatment. In the censorship case, the word "delete" may only be related to censorship if the topic has to do with censorship itself¹⁰ and not if the topic has to do with coding software. TIRM allows words to be related to treatment in one topical context and not in another. This adds value over matching simply on MNIR-estimated probabilities of treatment, which apply across all topics.
TIRM-estimated probabilities of treatment also do as well as or better than MNIR-estimated values of treatment out-of-sample; see the Appendix for further discussion. We estimate the components of TIRM using the Structural Topic Model (STM). In this section, we first review the set up of the STM model. We focus in particular on the content covariate, which allows us to estimate the topic-specific probability of treatment. We then explain how to match with the estimators from the STM model before moving to the examples.

⁹ We note that this problem arises due to the necessity of combining multiple dimensions with topically coarsened exact matching. Under the standard CEM method for low-dimensional data, matches are enforced across all dimensions of the confounders.
¹⁰ Criticism of the censors is often censored, as described in King, Pan and Roberts (2013).

5.1 Review of STM model

The Structural Topic Model is an extension of the popular Latent Dirichlet Allocation model (Blei, Ng and Jordan, 2003; Blei, 2012) which is designed for use with covariates (Roberts et al., 2014; Roberts, Stewart and Airoldi, 2015). Covariates in the STM can be included to affect topic prevalence (the frequency with which a topic is discussed) and topical content (the words used to discuss a topic). In particular, we show that using STM with the treatment indicator as a topical content covariate effectively combines the MNIR and topic modeling framework. In the simplest version without covariates the data generating process of the Structural Topic Model can be given as:

\begin{align}
\gamma_k &\sim \mathrm{Normal}_P(0, \sigma_k^2 I_P), \quad \text{for } k = 1 \dots K-1, \tag{5} \\
\theta_d &\sim \mathrm{LogisticNormal}_{K-1}(\Gamma' x_d', \Sigma), \tag{6} \\
z_{d,n} &\sim \mathrm{Multinomial}_K(\theta_d), \quad \text{for } n = 1 \dots N_d, \tag{7} \\
w_{d,n} &\sim \mathrm{Multinomial}_V(B z_{d,n}), \quad \text{for } n = 1 \dots N_d, \tag{8} \\
\beta_{d,k,v} &= \frac{\exp\left(m_v + \kappa^{(t)}_{k,v} + \kappa^{(c)}_{y_d,v} + \kappa^{(i)}_{y_d,k,v}\right)}{\sum_{v} \exp\left(m_v + \kappa^{(t)}_{k,v} + \kappa^{(c)}_{y_d,v} + \kappa^{(i)}_{y_d,k,v}\right)}, \quad \text{for } v = 1 \dots V \text{ and } k = 1 \dots K. \tag{9}
\end{align}

Note in particular that the form of the model in Equation 9 mirrors the MNIR model in Equation 2 but with the addition of topic-specific effects and (optionally) topic-covariate interactions. Thus we can equivalently see this form of the STM as embedding a multinomial inverse regression into a topic model or embedding topic-specific random effects inside the MNIR model.¹¹

5.2 Matching Quantities

Using the same framework as Taddy (2013b) we can derive a sufficient reduction of the information contained in the word counts about treatment. In this case though the projection represents the information about the treatment not carried in the topics. When we omit topic-covariate interactions, the sufficient reduction takes the simple form $(\kappa^{(c)})'(x_i / m_i)$. This projection was explored in related work by Rabinovich and Blei (2014), who are focused primarily on using the sufficient reduction as a way to improve prediction. When we include the interaction of topics and the content covariate the projection becomes more complex due to the coupling of the estimated topics and the treatment. The complication arises because we need to reweight the interaction term by the topic use in the given document. Thus the form is:

\[
(\kappa^{(c)})'(x_i / m_i) + \frac{1}{m_i} \sum_{v} x_{i,v} \, (\kappa^{(int)}_v)' \theta_i .
\]

Once we have computed the analog of the MNIR sufficient reduction we can match on both that reduction and the estimated topics, ensuring matches are both topically similar and have similar within-topic probabilities of treatment. In order to ensure that topics are comparable irrespective of the treatment/control effects we make an additional pass through all control documents, re-estimating their topic proportions as though they were observed as treated.¹² We then match on these new topic proportion vectors and the projection using CEM to provide the imbalance bounding guarantees. In Section 6.3 we give an example of where it is appealing to match in a way which does not prune treatment cases in order to preserve the estimand of interest.

¹¹ We note a minor difference arises in the prior distribution used for κ. Taddy (2013b) uses the Gamma-Laplace scale mixture prior whereas we use the more basic Laplace prior. This doesn't appear to matter an enormous amount in practice.
¹² This is consistent with an estimand that is a local version of the average treatment effect on the treated. We can think of including the control documents in the original model as a method of borrowing strength from those documents to fit the topical parameters.
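To make the two quantities above concrete, the following sketch computes the topic-specific word probabilities of Equation 9 and the sufficient-reduction projection for a single document. It is a rough illustration under stated assumptions rather than the STM estimation code: the function names, argument names, and array layouts are hypothetical, all parameters are assumed to have been estimated already, and $m_i$ is assumed to be the total word count of document $i$.

```python
import numpy as np


def stm_word_probs(m, kappa_t, kappa_c, kappa_i, y):
    """Equation 9: word distribution for every topic at covariate level y.

    m       : (V,) baseline log word rates
    kappa_t : (K, V) topic effects
    kappa_c : (C, V) content-covariate effects
    kappa_i : (C, K, V) topic-covariate interactions
    y       : integer level of the content covariate (e.g. 1 = treated)
    Returns a (K, V) array whose rows sum to one.
    """
    logits = m[None, :] + kappa_t + kappa_c[y][None, :] + kappa_i[y]
    # Subtract the row maximum for numerical stability before exponentiating.
    logits = logits - logits.max(axis=1, keepdims=True)
    expd = np.exp(logits)
    return expd / expd.sum(axis=1, keepdims=True)


def tirm_projection(x, theta, kappa_c, kappa_int=None):
    """Sufficient-reduction projection for one document.

    x         : (V,) word counts for the document
    theta     : (K,) topic proportions for the document
    kappa_c   : (V,) treatment loadings on words
    kappa_int : optional (V, K) topic-by-treatment interaction loadings
    """
    m_i = x.sum()  # assumed: m_i is the document length
    proj = kappa_c @ x / m_i
    if kappa_int is not None:
        # Reweight each word's interaction loadings by the document's topic use.
        proj += x @ (kappa_int @ theta) / m_i
    return proj


# Toy example with V = 3 words and K = 2 topics.
x = np.array([2.0, 1.0, 1.0])
theta = np.array([0.5, 0.5])
kappa_c = np.array([1.0, 0.0, -1.0])
kappa_int = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
print(tirm_projection(x, theta, kappa_c))             # 0.25
print(tirm_projection(x, theta, kappa_c, kappa_int))  # 0.625
```

The resulting scalar projection would then be coarsened and matched alongside the topic proportions, as the text describes.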

5.3 Limitations of TIRM

Matching on the STM topics and projections inherits the attractive properties of the MNIR propensity score and topically coarsened exact matching procedures. Indeed a significant benefit is that by jointly estimating both the propensity score and the topics, we ensure that our matching procedure aligns cases which have similar probabilities of treatment but also are broadly similar in the collections of words they use. However, a key limitation of course is that the matches do rely heavily on parametric models of a complex data generating process.¹³ The topic model in particular can be interpreted as a density estimate of the data and the resulting matches will only be useful if the underlying topic model provides an accurate representation of the texts. Thankfully the topics themselves provide some indication of how useful they will be for matching. An analyst is able to substantively interpret the topics and determine if documents which shared a similar topic profile would serve as reasonable counterfactuals for the analysis in question.

An important area for future work and a limitation of the current state of the method is the lack of theoretical development. Our interest in matching on texts and other high-dimensional data is born out of a practical necessity. As we argued above and show through examples in subsequent sections, there are a variety of research problems where documents themselves are clearly the relevant pre-treatment confounders. We believe a particularly fruitful avenue for future work is the theoretical properties of using density estimates in matching, including the implications that such strategies have for how a unit's density estimate is connected to other units within the sample.¹⁴ Nevertheless, we do not believe the lack of such development is an impediment to practical work in the short term.

¹³ We do however note that nothing about our framework requires this particular data generating process. The mixed-membership multinomial model we propose here could easily be replaced by a different density estimator. For example, Taddy (2015b) recently proposed an inverse regression technique based on distributed representations from deep learning. We prefer the topic model representation as it allows for context sensitivity for words, but as other techniques evolve there will almost certainly be better options available. Our framework is sufficiently general to handle these changes.
¹⁴ We note that this problem isn't exactly new to our method. Well-studied approaches such as propensity score matching and Mahalanobis distance matching have a dependence on the sample through the estimation of the propensity score model and sample covariance respectively.

TIRM also inherits some limitations that are common to its predecessors. In particular, we need to carefully consider the risks of interference between units (Bowers, Fredrickson and Panagopoulos, 2013; Aronow and Samii, 2013). Essentially all matching methods require that the Stable Unit Treatment Value Assumption (SUTVA) holds (Rubin, 1980); however, when a text in a corpus was written to influence the writers of other texts in a corpus (as in academic articles, for example), this can be a difficult assumption to justify. While this concern is significantly more general than the method we propose here, we highlight it in order to emphasize that these issues should be considered on a case by case basis. We also caution that as with other matching methods, when units are dropped from the sample we must be careful to correctly characterize the group to which the estimated effect applies (King, Lucas and Nielsen, 2015; Rubin, 2006). Finally, TIRM still requires a selection on the observables assumption. However, by including the information within the text we have made this assumption substantially more plausible when we think that the important pre-treatment confounders are captured within the document.

5.4 Balance Checking

Balance checking is an important part of any matching procedure, but it can be difficult to clearly specify an appropriate balance metric for text matching applications.¹⁵ Even our numeric representation of the text as a count matrix is a substantial simplification of the true text. For this reason, we have emphasized a series of methods where it is easy to show which observations are matched to each other so those matches can be evaluated in both a quantitative and qualitative way. Without a single, unifying balance metric, we find it useful to check balance across a variety of quantitative metrics in addition to more thorough, but labor-intensive, qualitative reading.

In the first and simplest balance check, we identify words that in the full sample were strongly associated with treatment. We then compare average word appearance of these words in the matched sample, verifying that the treatment and control uses of the

¹⁵ One could argue that it is a natural extension of the No Free Lunch theorem that there cannot be a single balance metric appropriate for all situations (Wolpert and Macready, 1997).
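The first balance check described in Section 5.4 can be sketched as follows. This is a minimal illustration and not the code behind the paper's results: the function name `word_balance` is hypothetical, and "association with treatment" is assumed here to mean the treated-minus-control difference in the share of documents containing each word, which is only one of several reasonable measures.

```python
import numpy as np


def word_balance(counts, treated, matched_idx, top_n=10):
    """Compare word-appearance balance before and after matching.

    counts      : (n_docs, V) document-term count matrix
    treated     : (n_docs,) boolean treatment indicator
    matched_idx : indices of documents retained by the matching procedure
    top_n       : number of most treatment-associated words to inspect
    Returns (word indices, full-sample differences, matched-sample differences).
    """
    counts = np.asarray(counts, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    appears = counts > 0  # indicator that a word appears in a document

    # Full-sample difference in appearance rates, treated minus control.
    full_diff = appears[treated].mean(axis=0) - appears[~treated].mean(axis=0)
    top_words = np.argsort(-np.abs(full_diff))[:top_n]

    # The same difference restricted to the matched sample.
    keep = np.zeros(len(treated), dtype=bool)
    keep[matched_idx] = True
    matched_diff = (appears[treated & keep].mean(axis=0)
                    - appears[~treated & keep].mean(axis=0))

    return top_words, full_diff[top_words], matched_diff[top_words]


# Toy example: word 0 is imbalanced in the full sample but balanced
# once only documents 1 and 3 are retained.
counts = [[1, 0], [1, 0], [0, 0], [1, 0]]
treated = [True, True, False, False]
words, before, after = word_balance(counts, treated, matched_idx=[1, 3], top_n=1)
print(words, before, after)  # [0] [0.5] [0.]
```

Differences near zero in the matched sample suggest that the words most predictive of treatment are used at similar rates by matched treated and control documents. Such a check complements, but does not replace, reading the matched pairs.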
