This content downloaded from on Mon, 2 Sep :55:20 PM All use subject to JSTOR Terms and Conditions

Size: px

Start display at page:

Download "This content downloaded from on Mon, 2 Sep :55:20 PM All use subject to JSTOR Terms and Conditions"

Judith Ashlee Hubbard
5 years ago
Views:

Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs Author(s): Rajeev H. Dehejia and Sadek Wahba Source: Journal of the American Statistical Association, Vol.

1 Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs Author(s): Rajeev H. Dehejia and Sadek Wahba Source: Journal of the American Statistical Association, Vol. 94, No. 448 (Dec., 1999), pp Published by: American Statistical Association Stable URL: Accessed: 2/9/213 18:55 Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at. JSTOR is a not-for-profit service that helps scholars, researchers, and students disver, use, and build upon a wide range of ntent in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms of scholarship. For more information about JSTOR, please ntact support@jstor.org.. American Statistical Association is llaborating with JSTOR to digitize, preserve and extend access to Journal of the American Statistical Association.

2 Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs Rajeev H. DEHEJIA and Sadek WAHBA This article uses propensity sre methods to estimate the treatment impact of the National Supported Work (NSW) Demonstration, a labor training program, on postintervention earnings. We use data from Lalonde's evaluation of nonexperimental methods that mbine the treated units from a randomized evaluation of the NSW with nonexperimental mparison units drawn from survey datasets. We apply propensity sre methods to this mposite dataset and demonstrate that, relative to the estimators that Lalonde evaluates, propensity sre estimates of the treatment impact are much closer to the experimental benchmark estimate. Propensity sre methods assume that the variables associated with assignmento treatment are observed (referred to as ignorable treatment assignment, or selection on observables). Even under this assumption, it is difficult to ntrol for differences between the treatment and mparison groups when they are dissimilar and when there are many preintervention variables. The estimated propensity sre (the probability of assignmento treatment, nditional on preintervention variables) summarizes the preintervention variables. This offers a diagnostic on the mparability of the treatment and mparison groups, because one has only to mpare the estimated propensity sre across the two groups. We discuss several methods (such as stratification and matching) that use the propensity sre to estimate the treatment impact. When the range of estimated propensity sres of the treatment and mparison groups overlap, these methods can estimate the treatment impact for the treatment group. A sensitivity analysis shows that our estimates are not sensitive to the specification of the estimated propensity sre, but are sensitive to the assumption of selection on observables. We nclude that when the treatment and mparison groups overlap, and when the variables determining assignment to treatment are observed, these methods provide a means to estimate the treatment impact. Even though propensity sre methods are not always applicable, they offer a diagnostic on the quality of nonexperimental mparison groups in terms of observable preintervention variables. KEY WORDS: Matching; Program evaluation; Propensity sre. 1. INTRODUCTION and Hotz 1989; Manski, Sandefur, McLanahan, and Powers This article discusses the estimation of treatment effects 1992). in observational studies. This issue has been the focus of In this article we apply propensity sre methods (Rosenmuch attention because randomized experiments cannot albaum and Rubin 1983) to Lalonde's dataset. The propenways be implemented and has been addressed inter alia by sity sre is defined as the probability of assignment to Lalonde (1986), whose data we use herein. Lalonde estitreatment, nditional on variates. Propensity sre methmated the impact of the National Supported Work (NSW) ods focus on the mparability of the treatment and non- Demonstration, a labor training program, on postintervenexperimental mparison groups in terms of preintervention inme levels. He used data from a randomized evaltion variables. Controlling for differences in preintervenuation of the program and examined the extent to which tion variables is difficult when the treatment and mparison groups are dissimilar and when there are many preintervennonexperimental estimators can replicate the unbiased extion variables. The estimated propensity sre, a single variperimental estimate of the treatment impact when applied able on the unit interval that summarizes the preintervento a mposite dataset of experimental treatment units and tion variables, can ntrol for differences between the treatnonexperimental mparison units. He ncluded that stanment and nonexperimental mparison groups. When we dard nonexperimental estimatorsuch as regression, fixedapply these methods to Lalonde's nonexperimental data for effects, and latent variable selection models are either ina range of propensity sre specifications and estimators, accurate relative to the experimental benchmark or sensiwe obtain estimates of the treatment impact that are much tive to the specification used in the regression. Lalonde's closer to the experimental treatment effec than Lalonde's results have been influential renewing the debate on exnonexperimental estimates. perimental versus nonexperimental evaluations (see Manski The assumption underlying this method is that assignand Garfinkel 1992) and in spurring a search for alternament to treatment is associated only with observable preintive estimators and specification tests (see, e.g., Heckman tervention variables, called the ignorable treatment assignment assumption or selection on observables (see Heckman Rajeev Dehejia is Assistant Professor, Department of Enomics and and Robb 1985; Holland 1986; Rubin 1974, 1977, 1978). SIPA, Columbia University, New York, NY 127 ( dehejia@ Although this is a strong assumption, we demonstrate that lumbia.edu). Sadek Wahba is Vice President, Morgan Stanley & Co. Inpropensity sre methods are an informative starting point, rporated, New York, NY 136 ( wahbas@ms.m). This work was partially supported by a grant from the Social Sciences and Hu- because they quickly reveal the extent of overlap in the manities Research Council of Canada (first author) and a World Bank treatment and mparison groups in terms of preinterven- Fellowship (send author). The authors gratefully acknowledge an asso- tion variables. ciate editor, two anonymous referees, Gary Chamberlain, Guido Imbens, and Donald Rubin, whose detailed mments and suggestions greatly improved the article. They thank Robert Lalonde for providing the data from his 1986 study and substantial help in recreating the original dataset, and? 1999 American Statistical Association also thank Joshua Angrist, George Cave, David Cutler, Lawrence Katz, Journal of the American Statistical Association Caroline Minter-Hoxby, and Jeffrey Smith. December 1999, Vol. 94, No. 448, Applications and Case Studies 153

154 Journal of the American Statistical Association, December 1999 The article is organized as follows. Section 2 reviews employed prior to randomization.

3 154 Journal of the American Statistical Association, December 1999 The article is organized as follows. Section 2 reviews employed prior to randomization. Selection of this subset is Lalonde's data and reproduces his results. Section 3 identi- based only on preintervention variables (month of assignfies the treatment effect under the potential outmes causal ment and employment history). Assuming that the initial model and discusses estimation strategies for the treatment randomization was independent of preintervention varieffect. Section 4 applies our methods to Lalonde's dataset, ates, the subset retains a key property of the full experimenand Section 5 discusses the sensitivity of the results to the tal data: The treatment and ntrol groups have the same methodology. Section 6 ncludes the article. distribution of preintervention variables, although this distribution uld differ from the distribution of variates for 2. LALONDE'S RESULTS the larger sample. A difference in means remains an unbiased estimator of the average treatment impact for the 2.1 The Data reduced sample. The subset includes 185 treated and 26 The NSW Demonstration [Manpower Demonstration Re- ntrol observations. search Corporation (MDRC) 1983] was a federally and pri- We presen the preintervention characteristics of the origvately funded program implemented in the mid-197s to inal sample and of our subset in the first four rows of Table provide work experience for a period of 6-18 months to 1. Our subset differs from Lalonde's original sample, espeindividuals who had faced enomic and social problems cially in terms of 1975 earnings; this is a nsequence both prior to enrollment in the program. Those randomly se- of the hort phenomenon and of the fact that our sublected to join the program participated in various types of sample ntains more individuals who were unemployed work, such as restaurant and nstruction work. Informa- prior to program participation. The distribution of preintertion on preintervention variables (preintervention earnings vention variables is very similar across the treatment and as well as education, age, ethnicity, and marital status) was ntrol groups for each sample; none of the differences is obtained from initial surveys and Social Security Admin- significantly different from at a 5% level of significance, istration rerds. Both the treatment and ntrol groups with the exception of the indicator for "no degree". participated in follow-up interviews at specific intervals. Lalonde's nonexperimental estimates of the treatment ef- Lalonde (1986) offered a separate analysis of the male and fect are based on two distinct mparison groups: the Panel female participants. In this article we focus on the male par- Study of Inme Dynamics (PSID-1) and Westat's Matched ticipants, as estimates for this group were the most sensitive Current Population Survey-Social Security Administration to functional-form specification, as indicated by Lalonde. File (CPS-1). Table 1 presents the preintervention charac- Candidates eligible for the NSW were randomized into teristics of the mparison groups. It is evident that both the program between March 1975 and July One n- PSID-1 and CPS-1 differ dramatically from the treatment sequence of randomization over a 2-year period was that group in terms of age, marital status, ethnicity, and preinindividuals who joined early in the program had different tervention earnings; all of the mean differences are signifcharacteristics than those who entered later; this is referred icantly different from well beyond a 1% level of signifto as the "hort phenomenon" (MDRC 1983, p. 48). An- icance, except the indicator for "Hispanic". To bridge the other nsequence is that data from the NSW are delin- gap between the treatment and mparison groups in terms eated in terms of experimental time. Lalonde annualized of preintervention characteristics, Lalonde extracted subsets earnings data from the experiment because the nonexperi- from PSID-1 and CPS-1 (denoted PSID-2 and -3 and CPS-2 mental mparison groups that he used (discussed later) are and -3) that resemble the treatment group in terms of single delineated in calendar time. By limiting himself to those preintervention characteristics (such as age or employment assigned to treatment after December 1975, Lalonde en- status; see Table 1). Table 1 reveals that the subsets resured that retrospectivearnings information from the ex- main statistically substantially different from the treatment periment included calendar 1975 earnings, which he then group; the mean differences in age, ethnicity, marital status, used as preintervention earnings. By likewise limiting him- and earnings are smaller but remain statistically significant self to those who were no longer participating in the pro- at a 1% level. gram by January 1978, he ensured that the postintervention data included calendar 1978 earnings, which he took to be the outme of interest. Earnings data for both these 2.2 years Lalonde's Results are available for both nonexperimental mparison groups. Because our analysis in Section 4 uses a subset of This reduces the NSW sample to 297 treated observations Lalonde's original data and an additional variable (1974 and 425 ntrol observations for male participants. earnings), in Table 2 we reproduce Lalonde's results us- However, it is importanto look at several years of prein- ing his original data and variables (Table 2, panel A), and tervention earnings in determining the effect of job training then apply the same estimators to our subset of his data programs (Angrist 199, 1998; Ashenfelter 1978; Ashen- both without and with the additional variable (Table 2, panfelter and Card 1985; Card and Sullivan 1988). Thus we els B and C). We show that when his analysis is applied further limit ourselves to the subset of Lalonde's NSW data to the data and variables that we use, his basic nclusions for which 1974 earnings can be obtained: those individuals remain unchanged. In Section 5 we discuss the sensitivity who joined the program early enough for the retrospective of our propensity sre results to dropping the additional earnings information to include 1974, as well as those indi- earnings data. In his article, Lalonde nsidered linear reviduals who joined later but were known to have been un- gression, fixed-effects, and latent variable selection models

4 Dehejia and Wahba: Causal Effects in Nonexperimental Studies 155 Table 1. Sample Means of Characteristics for NSW and Comparison Samples No. of observations Age Education Black Hispanic No degree Married RE74 (U.S. $) RE75 (U.S. $) NSW/Lalonde:a Treated ,66 (.32) (.9) (.2) (.1) (.2) (.2) (236) Control ,26 (.32) (.8) (.2) (.2) (.2) (.2) (252) RE74 subset:b Treated ,96 1,532 (.35) (.1) (.2) (.1) (.2) (.2) (237) (156) Control ,17 1,267 (.34) (.8) (.2) (.2) (.2) (.2) (276) (151) Comparison groups:c PSID-1 2, ,429 19,63 [.78] [.23] [.3] [.1] [.4] [.3] [991] [1,2] PSID ,27 7,569 [1.] [.27] [.4] [.2] [.5] [.4] [853] [695] PSID ,566 2,611 [1.17] [.29] [.5] [.3] [.5] [.5] (686) [499] CPS-1 15, ,16 13,65 [.81] [.21] [.2] [.2] [.3] [.3] [75] [682] CPS-2 2, ,728 7,397 [.87] [.19] [.2] [.2] [.4] [.4] [667] [6] CPS ,619 2,467 [.87] [.23] [.3] [.3] [.4] [.4] [552] [288] NOTE: Standard errors are in parentheses. Standard error on difference in means with RE74 subset/treated is given in brackets. Age = age in years; Education = number of years of schooling; Black = 1 if black, otherwise; Hispanic = 1 if Hispanic, otherwise; No degree = 1 if no high school degree, otherwise; Married = 1 if married, otherwise; REx = earnings in calendar year 19x. a NSW sample as nstructed by Lalonde (1986). bthe subset of the Lalonde sample for which RE74 is available. c Definition of mparison groups (Lalonde 1986): PSID-1: All male household heads under age 55 who did not classify themselves as retired in PSID-2: Selects from PSID-1 all men who were not working when surveyed in the spring of PSID-3: Selects from PSID-2 all men who were not working in CPS-1: All CPS males under age 55. CPS-2: Selects from CPS-1 all males who were not working when surveyed in March CPS-3: Selects from CPS-2 all the unemployed males in 1976 whose inme in 1975 was below the poverty level. PSID1-3 and CPS-1 are identical to those used by Lalonde. CPS2-3 are similar to those used by Lalonde, but Lalonde's original subset uld not be recreated. of the treatment impact. Because our analysis focuses on panel B, is that the regression specifications and mparithe importance of preintervention variables, we focus on son groups fail to replicate the treatment impact. the first of these. Including 1974 earnings as an additional variable in the Table 2, panel A, reproduces the results of Lalonde (1986, regressions in Table 2, panel C does not alter Lalonde's basic message, although the estimates improve mpared Table 5). Comparing Panels A and B, we note that the treatto those in panel B. In lumns (1) and (3), many estimates ment effect, as estimated from the randomized experiment, remain negative, but less so than in panel B. In lumn (2) is higher in the latter ($1,794 mpared to $886). This rethe estimates for PSID-1 and CPS-1 are negative, but the flects differences in the mposition of the two samples, as estimates for the subsets improve. In lumns (4) and (5) discussed in the previous section: A higher treatment ef- the estimates are closer to the experimental benchmark than fect is obtained for those who joined the program earlier or in panel B, off by about $1, for PSID1-3 and CPS1-2 who were unemployed prior to program participation. The and by $4 for CPS-3. Overall, the results closest to the results in terms of the success of nonexperimental estimates experimental benchmark in Table 2 are for CPS-3, panel C. are qualitatively similar across the two samples. The sim- This raises a number of issues. The strategy of nsidering ple difference in means, reported in lumn (1), yields neg- subsets of the mparison group improves estimates of the ative treatment effects for the CPS and PSID mparison treatment effect relative to the benchmark. However, Table 1 reveals that significant differences remain between the groups in both samples (except PSID-3). The fixed-effectsmparison groups and the treatment group. These subsets type differencing estimator in the third lumn fares someare created based on one or two preintervention variables. what better, although many estimates are still negative or In Sections 3 and 4 we show that propensity sre methods deteriorate when we ntrol for variates in both panels. provide a systematic means of creating such subsets. The estimates in the fifth lumn are closest to the experimental estimate, nsistently closer than those in the sec- 3. IDENTIFYING AND ESTIMATING THE AVERAGE ond lumn, which do not ntrol for earnings in The TREATMENT EFFECT treatment effect is underestimated by about $1, for the 3.1 Identification CPS mparison groups and by $1,5 for the PSID groups. Let Yi, representhe value of the outme when unit i is Lalonde's nclusion from panel A, which also holds in exposed to regime 1 (called treatment), and let Yio represent

5 156 Journal of the American Statistical Association, December 1999 Q3 Lo -1 LO Lo 'I cm q q 'It CI) CC) CC) N N r, L O-) Lo C').) f: ui.) =3 CC) r' n C, C') N w m w 11- w (D C'i :3 (3) E2 CL C) CM E Lo C) (D I- C - =3 CZ CO ') CM C) C) 1 "- ; "t "t (O CI) ') ') C) Z CC) W N CO C"L b CL LL] 6 2LI i5 LO CM Lo CL :3 N LO O' C- C) =3 C) L6 L '2 (D C) C) - rl- (o Lo o C) (D =3 2 o Lo a w o C').) C) (D C) '('O F, N r' LO CO CQL 72 LL] 'r, (D ; W LO 11 - CI) C\j ') N rl- (D CM CM LO LO CI) (D CO, Q4) C? zi- z zti O- 7 7 C) C\j :;- _- a- LO ia C) i:::, CC) i:::, (Co') C 't to :3 r M Lo I'* Fi C, rl- ). rl- C) CM r ') "t ) (o JR). rl- w ui 4). CL E CM j:z: 'It LO - CM - - :;' C) Lo - w r'' r" LO CI) 't Ir - C) CI) C'q C'j -6 (D r- N N r".' CZ LU U =3 LU C3) m, - 6 ja a 2 CZ LO O- CO - r- CM gt r' LO ON "t a a m N :;: LO (D (D C) "t - R r- CI) (D U-) :3 Co C: ) ') (D I'- C') (D a) 6- cli E C) rl- CE2 - It (O ') CO 't 1, rlo I'-3 "t CI) L) cq (o CIZ LO to ) LO LO C\J r- C1 (o W "t 11- W ) LO C\j Q R -6 Oa" 4 E2 COO C: a) E =3 CM cl) CM ') LO i:::, (O C) (D C\i "t r- LO (D LO LO CM E r- ') CM ') E 23 :3 LO a) 16 CL Q)? o ztj E p ui w 14 a) C) CI) ') a) ro- Qp) E r o, izz, "t "t C) ro-) w M- m C'l N N Lo LQ N (q Lo C\L CI) oo (o r'- CC) -r- =3 C\j (o r-, ` Lo -o a) =3 Cr -o Lo? Sn a).3 - CO =3 E 'o E 2-6 ui C'z (O ) r'- q r- C\j (O ') "t Co Lo (o CO ) r- t w CC - =3 -o - (CD: 'D 5 'V5 C: c (n C\j S S -a 2 - Co. :L- :3 ) Lo (n m r'- C'l U') V 2 ) a) w -t 2 'Z.9 i5 6 -r- - -r-

6 Dehejia and Wahba: Causal Effects in Nonexperimental Studies 157 the value of the outme when unit i is exposed to regime (called ntrol). Only one of Yio or Yi1 can be observed for any unit, because one cannot observe the same unit under both treatment and ntrol. Let Ti be a treatment indicator (1 if exposed to treatment, otherwise). Then the observed outme for unit i is Yi = TiTYi + (1-Tj)Yio. The treatment effect for unit i is Ti = Yi-Yio. In an experimental setting where assignmento treatment is randomized, the treatment and ntrol groups are drawn from the same population. The average treatment effect for this population is T= E(Yi) - E(Yio). But randomization implies that {Yi, I Yio IL Ti} [using Dawid's (1979) notation, H represents independence], so that for j,1, and E(YijlTi = 1) = E(YijlTi = ) = E(YijTi = j) T E(Yi1 ITi 1) - E(Yio ITi = ) = E(YjjTi =1)-E(YijTi =?), which is readily estimated. In an observational study, the treatment and mparison -E(YilTi =,p(xi)) Ti = 1}, (2) groups are often drawn from different populations. In our assuming that the expectations are defined. The outer exapplication the treatment group is drawn from the popu- pectation is over the distribution of p(xi)iti = 1. lation of interest: welfare recipients eligible for the pro- One intuition for the propensity sre is that whereas in gram. The (nonexperimental) mparison group is drawn (1) we are trying to ndition on Xi (intuitively, to find obfrom a different population. (In our application both the servations with similar variates), in (2) we are trying to CPS and PSID are more representative of the general U.S. ndition just on the propensity sre, because the propopopulation.) Thus the treatment effect that we are trying sition implies that observations with the same propensity to identify is the average treatment effect for the treated sre have the same distribution of the full vector of population, variates, Xi. T T=l= E(Yi, ITi = 1) - E(Yio ITi = 1). 3.2 The Estimation Strategy Estimation is done in two steps. First, we estimate the This expression cannot be estimated directly, because Yio is propensity sre separately for each nonexperimental samnot observed for treated units. Assuming selection on ob- ple nsisting of the experimental treatment units and the servable variates, Xi-namely, {Yil, Yio HL Ti} Xi (Rubin specified set of mparison units (PSID1-3 or CPS 1-3). We 1974, 1977)-we obtain use a logistic probability model, but other standard models E (Yij I Xi, Ti = 1) = E (Yij I Xi, Ti = O) = E (Yi I Xi, Ti = j) for j =, 1. Conditional on the observables, Xi, there is yield similar results. One issue is what functional form of the preintervention variables to include in the logit. We rely on the following proposition. no systematic pretreatment difference between the groups Proposition 2 (Rosenbaum and Rubin 1983). If p(xi) assigned to treatment and ntrol. This allows us to identify is the propensity sre, then the treatment effect for the treated, Xi H Ti lp(xi) T1T=1 = EfE(YilXi,Ti = 1)-E(YilXi,Ti = O)jTi = 1}, (1) where the outer expectation is over the distribution of XilTi = 1, the distribution of preintervention variables in the treated population. In our application we have both an experimental ntrol group and a nonexperimental mparison group. Because the former is drawn from the population of interest along with the treated group, we enomize on notation and use Ti=1 to represen the entire group of interest and use i= to representhe nonexperimental group. Thus in (1) the expectation is over the distribution of Xi for the NSW population. One method for estimating the treatment effecthat stems from (1) is estimating E(Yi IXi, Ti = 1) and E(Yi IXi, Ti = ) as two nonparametric equations. This estimation strategy bemes difficult, however, if the variates, Xi, are high dimensional. The propensity sre theorem provides an intermediate step. Proposition ] (Rosenbaum and Rubin 1983). Let p(xi) be the probability of unit i having been assigned to treatment, defined as p(xi) _ Pr(Ti = 1JXi) = E(TilXi). Assume that < p(xi) < 1, for all Xi, and Pr(Tl, T2,... TNIX1,X2,... XN) = f.i=1 N P(Xi)Ti (1 p(xi))(1-ti) for the N units in the sample. Then {(Yi,Yio) II Ti} Xi =- {(Y 1,Yio) II Ti}lp(Xi). Corollary. If {(Yil, Yio) H Ti} IXi and the assumptions of Proposition 1 hold, then T T=l = EfE(YilTi = l,p(xi)) Proposition 2 asserts that, nditional on the propensity sre, the variates are independent of assignmento treatment, so that for observations with the same propensity sre, the distribution of variates should be the same across the treatment and mparison groups. Conditioning on the propensity sre, each individual has the same probability of assignmento treatment, as in a randomized experiment. We use this proposition to assess estimates of the propensity sre. For any given specification (we start by introducing the variates linearly), we group observations into strata defined on the estimated propensity sre and check

158 Journal of the American Statistical Association, December 1999 whether we succeed in balancing the variates within each ing nonparametric techniques. Hence we use a parametric stratum.

7 158 Journal of the American Statistical Association, December 1999 whether we succeed in balancing the variates within each ing nonparametric techniques. Hence we use a parametric stratum. We use tests for the statistical significance of dif- model for the propensity sre. This is preferable to apferences in the distribution of variates, focusing on first plying a parametric model directly to (1) because, as we and send moments (see Rosenbaum and Rubin 1984). If will see, the results are less sensitive to the logit specifithere are no significant differences between the two groups cation than regression models, such as those in Table 2. within each stratum, then we accept the specification. If Finally, depending on the estimator that one adopts (e.g., there are significant differences, then we add higher-order stratification), a precise estimate of the propensity sre is terms and interactions of the variates until this ndition not required. The process of validating the propensity sre is satisfied. In Section 5 we demonstrate that the results are estimate produces at least one partition structure that balnot sensitive to the selection of higher-order and interaction ances preintervention variates across the treatment and variables. mparison groups within each stratum, which, by (1), is In the send step, given the estimated propensity sre, all that is needed for an unbiased estimate of the treatment we need to estimate a univariate nonparametric regres- impact. sion, E(YjTi =T j,p(xi)), for j =,1. We focus on simple methods for obtaining a flexible functional form- 4. RESULTS USING THE PROPENSITY SCORE stratification and matching-but in principle one uld use Using the method outlined in the previous section, we any of the standard array of nonparametric techniques (see, separately estimate the propensity sre for each sample e.g., Hardle and Linton 1994; Heckman, Ichimura, and Todd of mparison units and treatment units. Figures 1 and ). present histograms of the estimated propensity sres for With stratification, observations are sorted from lowest to the treatment and PSID-1 and CPS-1 mparison groups. highest estimated propensity sre. We discard the mpar- Most of the mparison units (1,333 of a total of 2,49 ison units with an estimated propensity sre less than the PSID-1 units and 12,611 of 15,992 CPS-1 units) are disminimum (or greater than the maximum) estimated propen- carded because their estimated propensity sres are less sity sre for treated units. The strata, defined on the es- than the minimum for the treatment units. Even then, the timated propensity sre, are chosen so that the variates first bin (units with an estimated propensity sre of - within each stratum are balanced across the treatment and.5) ntains most of the remaining mparison units and mparison units. (We know that such strata exist from step few treatment units. An important difference between the 1.) Based on (2), within each stratum we take a difference figures is that Figure 1 has many bins in which the treatin means of the outme between the treatment and m- ment units greatly outnumber the mparison units. (Inparison groups, then weight these by the number of treated deed, for three bins there are no mparison units.) In nobservations in each stratum. We also nsider matching on trast, in Figure 2 for CPS- 1, each bin ntains at least a the propensity sre. Each treatment unit is matched with few mparison units. Overall, for PSID- 1 there are 98 replacemento the mparison unit with the closest propen- (more than half the total number) treated units with an estisity sre; the unmatched mparison units are discarded mated propensity sre in excess of.8, and only 7 mpar- (see Dehejia and Wahba 1998 for more details; also Heck- ison units, mpared to 35 treated and 7 mparison units man, Ichimura, Smith, and Todd 1998; Heckman, Ichimura, for CPS-1. and Todd 1998; Rubin 1979). Figures 1 and 2 illustrate the diagnostic value of the There are a number of reasons for preferring this two- propensity sre. They reveal that although the mparistep approach to direct estimation of (1). First, tackling (1) son groups are large relative to the treatment group, there directly with a nonparametric regression would enunter is limited overlap in terms of preintervention characteristhe curse of dimensionality as a problem in many datasets tics. Had there been no mparison units overlapping with such as ours that have a large number of variates. This a broad range of the treatment units, then it would not have would also occur when estimating the propensity sre us- been possible to estimate the average treatment effect on the 15 i 1 i l I l I l l o i j Ca 5Xm,lF,ln Estimated p(xi), 1333 mparison units discarded, first bin ntains 928 mparison units Figure 1. Histogram of the Estimated Propensity Sre for NSW Treated Units and PSID Comparison Units. The 1,333 PSID units whose estimated propensity sre is less than the minimum estimated propensity sre for the treatment group are discarded. The first bin ntains 928 PSID units. There is minimal overlap between the two groups. Three bins (.8-.85,.85-.9, and.9-.95) ntain no mparison units. There are 97 treated units with an estimated propensity sre greater than.8 and only 7 mparison units.

8 Dehejia and Wahba: Causal Effects in Nonexperimental Studies i i l I l l l I 15: 1{ o1 Ca 5 X Estimated p(xi), mparison units discarded, first bin ntains 2969 mparison units Figure 2. Histogram of the Estimated Propensity Sre for NSW Treated Units and CPS Comparison Units. The 12,611 CPS units whose estimated propensity sre is less than the minimum estimated propensity sre for the treatment group are discarded. The first bin ntains 2,969 CPS units. There is minimal overlap between the two groups, but the overlap is greater than in Figure 1; only one bin (.45-.5) ntains no mparison units, and there are 35 treated and 7 mparison units with an estimated propensity sre greater than.8. treatment group (although the treatment impact still uld number of treated observations within each stratum [Table be estimated in the range of overlap). With limited overlap, 3, lumn (4)]. An alternative is a within-block regression, we can proceed cautiously with estimation. Because in our again taking a weighted sum over the strata [Table 3, lapplication we have the benchmark experimental estimate, umn (5)]. When the variates are well balanced, such a we are able to evaluate the accuracy of the estimates. Even regression should have little effect, but it can help elimiin the absence of an experimental estimate, we show in Sec- nate the remaining within-block differences. Likewise for tion 5 that the use of multiple mparison groups provides matching, we can estimate a difference in means between the treatment and matched mparison groups for earnings another means of evaluating the estimates. in 1978 [lumn(7)], and also perform a regression of 1978 We use stratification and matching on the propensity earnings on variates [lumn(8)]. sre to group the treatment units with the small number Table 3 presents the results. For the PSID sample, the of mparison units whose estimated propensity sres are stratification estimate is $1,68 and the matching estigreater than the minimum-or less than the maximum- mate is $1,691, mpared to the benchmark randomizedpropensity sre for treatment units. We estimate the treat- experiment estimate of $1,794. The estimates from a difment effect by summing the within-stratum difference in ference in means and regression on the full sample are means between the treatment and mparison observations -$15,25 and $731. In lumns (5) and (8), ntrolling (of earnings in 1978), where the sum is weighted by the for variates has little impact on the stratification and Table 3. Estimated Training Effects for the NSW Male Participants Using Comparison Groups From PSID and CPS NSW treatment earnings less mparison group earnings, NSW earnings less nditional on the estimated propensity sre mparison group earnings Quadratic Stratifying on the sre Matching on the sre (1) (2) in sre' (4) (5) (6) (7) (8) Unadjusted Adjusteda (3) Unadjusted Adjusted Observationsc Unadjusted Adjustedd NSW 1,794 1,672 (633) (638) PSID-le -15, ,68 1,494 1,255 1,691 1,473 (1,154) (886) (1,389) (1,571) (1,581) (2,29) (89) PSID-21-3, ,22 2, ,455 1,48 (959) (1,28) (1,193) (1,768) (1,793) (2,33) (88) PSID-31 1, ,321 1, ,12 1,549 (899) (1,14) (1,383) (1,994) (2,2) (2,335) (826) CPS-1g -8, ,117 1,713 1,774 4,117 1,582 1,616 (712) (55) (747) (1,115) (1,152) (1,69) (751) CPS-29-3, ,543 1,622 1,493 1,788 1,563 (67) (658) (847) (1,461) (1,346) (1,25) (753) CPS , ,252 2, (657) (798) (951) (1,617) (2,82) (1,496) (776) a Least squares regression: RE78 on a nstant, a treatment indicator, age, age 2, education, no degree, black, Hispanic, RE74, RE75. bleast squares regression of RE78 on a quadratic on the estimated propensity sre and a treatment indicator, for observations used under stratification; see note (g). c Number of observations refers to the actual number of mparison and treatment units used for (3)-(5); namely, all treatment units and those mparison units whose estimated propensity sre is greater than the minimum, and less than the maximum, estimated propensity sre for the treatment group. dweighted least squares: treatment observations weighted as 1, and ntrol observations weighted by the number of times they are matched to a treatment observation [same variates as (a)]. Propensity sres are estimated using the logistic model, with specifications as follows: e PSID: Prb i= 1)= Fae Prob (T F(age, age, education, education, married, no degree, black, Hispanic, RE74, RE75, RE74, RE75, u74* black). f PI-adPD3Po =1=Fg,a,ndge,areblkHiai2,edai,eu PSID-2 and PSID-3: Prob (Ti 1) = = F(age, age,education, education,no degree, married, black, Hispanic, RE74, RE74, RE75, RE75, u74, u75). CPS-1, CPS-2, and CPS-3: Prob (Ti = 1) = F(age, age2, education, education2, no degree, married, black, Hispanic, RE74, RE75, u74, u75, education * RE74, age3)

9 16 Journal of the American Statistical Association, December 1999 matching estimates. Likewise for the CPS, the propensity- Column (3) in Table 3 illustrates the value of allowing sre-based estimates from the CPS-$1,713 and $1,582- both for a heterogeneous treatment effect and for a nonare much closer to the experimental benchmark than linear functional form in the propensity sre. The estimaestimates from the full mparison sample, -$8,498 tors in lumns (4)-(8) have both of these characteristics, and $972. whereas lumn (3) regresses 1978 earnings on a less non- We also nsider estimates from the subsets of the PSID linear function [quadratic, as opposed to the step function and CPS. In Table 2 the estimates tend to improve when in lumns (4) and (5)] of the estimated propensity sre applied to narrower subsets. However, the estimates still and a treatment indicator. The estimates are mparable range from -$3,822 to $1,326. In Table 3 the estimates do to those in lumn (2), where we regress the outme on not improve for the subsets, although the range of fluctua- all preintervention characteristics, and are farther from the tion is narrower, from $587 to $2,321. Tables 1 and 4 shed experimental benchmark than the estimates in lumns (4)- light on this. (8). This demonstrates the ability of the propensity sre to Table 1 presents the preintervention characteristics of summarize all preintervention variables, but underlines the the various mparison groups. We note that the subsets- importance of using the propensity sre in a sufficiently PSID-2 and -3 and CPS-2 and -3-although more closely nonlinear functional form. resembling the treatment group are still nsiderably differ- Finally, it must be noted that even though the estimates ent in a number of important dimensions, including ethnicpresented in Table 3 are closer to the experimental benchity, marital status, and especially earnings. Table 4 presents mark than those presented in Table 2, with the exception the characteristics of the matched subsamples from the of the adjusted matching estimator, their standard errors are mparison groups. The characteristics of the higher. In Table matched sub- 3, lumn (5), the standard errors are 1,152 and 1,581 for the CPS and PSID, mpared to 55 sets of CPS-1 and PSID-1 rrespond closely to the treatand 886 in Table 2, Panel C, lumn (5). This is because the ment group; none of the differences is statistically signifpropensity sre estimators use fewer observations. When icant. But as we create subsets of the mparison groups, stratifying on the propensity sre, we discard irrelevant the quality of the matches declines, most dramatically for ntrols, so that the strata may ntain as few as seven the PSID. PSID-2 and -3 earnings now increase from 1974 treated observations. However, the standard errors for the to 1975, whereas they decline for the treatment group. The adjusted matching estimator (751 and 89) are similar to training literature has identified the "dip" in earnings as an those in Table 2. important characteristic of participants in training programs By summarizing all of the variates in a single number, (see Ashenfelter 1974, 1978). The CPS subsamples retain the propensity sre method allows us to focus on the mthe dip, but 1974 earnings are substantially higher for the parability of the mparison group to the treatment group. matched subset of CPS-3 than for the treatment group. Hence it allows us to address the issues of functional form This illustrates one of the important features of propen- and treatment effect heterogeneity much more easily. sity sre methods, namely that creation of ad hoc subsamples from the nonexperimental mparison group is nei- 5. SENSITIVITY ANALYSIS ther necessary nor desirable; subsamples based on single preintervention characteristics may dispose of mparison 5.1 Sensitivity to the Specification of the units that still provide good overall mparisons with treat- Propensity Sre ment units. The propensity sre sorts out which mpari- The upper half of Table 5 demonstrates that the estimates son units are most relevant, nsidering all preintervention of the treatment impact are not particularly sensitive to the characteristics simultaneously, not just one characteristic at specification used for the propensity sre. Specifications 1 n time and 4 are the same as those in Table 3 (and hence they bal- Table 4. Sample Means of Characteristics for Matched Control Samples Matched No. of samples observations Age Education Black Hispanic No degree Married RE74 (U.S. $) RE75 (U.S.$) NSW ,96 1,532 MPSID ,794 1,126 [2.56] [.63] [.13] [.6] [.13] [.12] [1,46] [1,146] MPSID ,599 2,225 [2.63] [.83] [.14] [.8] [.16] [.16] [1,95] [1,228] MPSID ,386 1,863 [2.97] [.84] [.13] [.8] [.16] [.16] [1,68] [1,494] MCPS ,11 1,396 [1.25] [.32] [.6] [.4] [.7] [.6] [841] [563] MCPS ,758 1,24 [1.43] [.37] [.8] [.5] [.9].8 [896] [661] MCPS ,79 1,587 [1.68] [.48] [.9] [.6] [.1] [.9] [1,285] [76] NOTE: Standard error on the difference in means with NSW sample is given in brackets. MPSID1-3 and MCPS1-3 are the subsamples of PSID1-3 and CPS1-3 that are matched to the treatment group.

10 Dehejia and Wahba: Causal Effects in Nonexperimental Studies 161 Table 5. Sensitivity of Estimated Training Effects to Specification of the Propensity Sre NSW treatment earnings less mparison group earnings, NSW earnings less nditional on the estimated propensity sre mparison group earnings Quadratic Stratifying on the sre Matching on the sre (1) (2) in sre C (4) (5) (6) (7) (8) Comparison group Unadjusted Adjusteda (3) Unadjusted Adjusted Observationsd Unadjusted Adjustedb NSW 1,794 1,672 (633) (638) Dropping higher-order terms PSID-1: -15, ,68 1,254 1,255 1,691 1,54 Specification 1 (1,154) (866) (1,389) (1,571) (1,616) (2,29) (831) PSID-1: -15, ,524 1,775 1,533 2,281 2,291 Specification 2 (1,154) (863) (1,344) (1,527) (1,538) (1,732) (796) PSID-1: -15, ,185 1,237 1,155 1,373 1, Specification 3 (1,154) (863) (1,233) (1,144) (1,28) (1,72) (96) CPS-1: -8, ,117 1,713 1,774 4,117 1,582 1,616 Specification 4 (712) (547) (747) (1,115) (1,152) (1,69) (751) CPS-1: -8, ,248 1,452 1,454 6, Specification 5 (712) (546) (731) (632) (2,713) (1,7) (769) CPS-1: -8, ,241 1,299 1,95 6,17 1,13 1,471 Specification 6 (712) (546) (671) (547) (925) (877) (787) Dropping RE74 PSID-1: -15, ,23 1,284 1,727 1,34 Specification 7 (1,154) (88) (1,279) (1,41) (1,493) (1,447) (845) PSID-2: -3, Specification 8 (959) (1,4) (1,154) (1,472) (1,495) (1,848) (92) PSID-3: 1, , Specification 8 (899) (1,1) (1,261) (1,449) (1,493) (1,58) (938) CPS-1: -8, ,181 1,234 1,347 4,558 1, Specification 9 (712) (557) (698) (695) (683) (1,67) (786) CPS-2: -3, ,473 1,588 1,222 1,941 1,668 Specification 9 (67) (662) (731) (1,313) (1,39) (1,5) (755) CPS-3: ,348 1, ,97 1,12 Specification 9 (657) (87) (942) (1,61) (1,6) (1,366) (783) NOTE: Specification 1: Same as Table 3, note (c). Specification 2: Specification 1 without higher powers. Specification 3: Specification 2 without higher-order terms. Specification 4: Same as Table 3, note (e). Specification 5: Specification 4 without higher powers. Specification 6: Specification 5 without higher-order terms. Specification 7: Same as Table 3, note (c), with RE74 removed. Specification 8: Same as Table 3, note (d), with RE74 removed. Specification 9: Same as Table 3, note (e), with RE74 removed. a Least squares regression: RE78 on a nstant, a treatment indicator, age, education, no degree, black, Hispanic, RE74, RE75. bweighted least squares: treatment observations weighted as 1, and ntrol observations weighted by the number of times they are matched to a treatment observation [same variates as (a)]. c Least squares regression of RE78 on a quadratic on the estimated propensity sre and a treatment indicator, for observations used under stratification; see note (d). d Number of observations refers to the actual number of mparison and treatment units used for (3)-(5); namely, all treatment units and those mparison units whose estimated propensity sre is greater than the minimum, and less than the maximum, estimated propensity sre for the treatment group. ance the preintervention characteristics). In specifications outmes, Yi1 and Yio, are observed. This assumption led 2-3 and 5-6, we drop the squares and cubes of the - us to restrict Lalonde's data to the subset for which 2 years variates, and then the interactions and dummy variables. In (rather than 1 year) of preintervention earnings data is availspecifications 3 and 6, the logits simply use the variates able. In this section we nsider how our estimators would linearly. These estimates are farther from the experimen- fare in the absence of 2 years of preintervention earnings tal benchmark than those in Table 3, ranging from $835 to data. In the bottom part of Table 5, we reestimate the treat- $2,291, but they remain ncentrated mpared to the range ment impact without using 1974 earnings. For PSID1-3, of estimates from Table 2. Furthermore, for the alternative the stratification estimates (ranging from -$1,23 to $482) specifications, we are unable to find a partition structure are more variable than the regression estimates in lumn such that the preintervention characteristics are balanced (2) (ranging from -$265 to $297) and the estimates in Tawithin each stratum, which then nstitutes a well-defined ble 3, which use 1974 earnings (ranging from $1,494 to criterion for rejecting these alternative specifications. In- $2,321). The estimates from matching vary less than those deed, the specification search begins with a linear specifi- from stratification. Compared to the PSID estimates, the escation, then adds higher-order and interaction terms until timates from the CPS are closer to the experimental benchwithin-stratum balance is achieved. mark (ranging from $1,234 to $1,588 for stratification and from $861 to $1,941 for matching). They are also closer than the regression estimates in lumn (2). 5.2 Sensitivity to Selection on Observables The results clearly are sensitive to the set of preinter- One important assumption underlying propensity sre vention variables used, but the degree of sensitivity varies methods is that all of the variables that influence assign- with the mparison group. This illustrates the importance ment to treatment and that are rrelated with the potential of a sufficiently lengthy preintervention earnings history

162 Journal of the American Statistical Association, December 1999 for training programs. Table 5 also demonstrates the value rica, 66, 249-288. of using multiple mparison groups.

11 162 Journal of the American Statistical Association, December 1999 for training programs. Table 5 also demonstrates the value rica, 66, of using multiple mparison groups. Even if we did not Ashenfelter,. (1974), "The Effect of Manpower Training on Earnings: Preliminary Results," in Proceedings of the Twenty-Seventh Annual Winknow the experimental estimate, the variation in estimates ter Meetings of the Industrial Relations Research Association, eds. J. between the CPS and PSID would raise the ncern that the Stern and B. Dennis, Madison, WI: Industrial Relations Research Assovariables that we observe (assuming that earnings in 1974 ciation. are not observed) do not ntrol fully for the differences (1978),"Estimating the Effects of Training Programs on Earnings," Review of Enomics and Statistics, 6, between the treatment and mparison groups. If all rele- Ashenfelter, O., and Card, D. (1985), "Using the Longitudinal Structure vant variables are observed, then the estimates from both of Earnings to Estimate the Effect of Training Programs," Review of groups should be similar (as they are in Table 3). When Enomics and Statistics, 67, an experimental benchmark is not available, multiple m- Card, D., and Sullivan, D. (1988), "Measuring the Effect of Subsidized Training Programs on Movements In and Out of Employment," Enoparison groups are valuable, because they can suggest the metrica, 56, existence of important unobservables. Rosenbaum (1987) Dawid, A. P. (1979), "Conditional Independence in Statistical Theory," has developed this idea in more detail. Journal of the Royal Statistical Society, Ser. B, 41, Dehejia, R. H., and Wahba, S. (1998), "Matching Methods for Estimat- 6. CONCLUSIONS ing Causal Effects in Non-Experimental Studies," Working Paper 6829, National Bureau of Enomic Research. In this article we have demonstrated how to estimate the Hardle, W., and Linton,. (1994),"Applied Nonparametric Regression," in treatment impact in an observational study using propen- Handbook of Enometrics, Vol. 4, eds. R. Engle and D. L. McFadden, Amsterdam: Elsevier, pp sity sre methods. Our ntribution is to demonstrate the Heckman, J., and Hotz, J. (1989), "Choosing Among Alternative Nonuse of propensity sre methods and to apply them in a experimental Methods for Estimating the Impact of Social Programs: ntext that allows us to assess their efficacy. Our results The Case of Manpower Training," Journal of the American Statistical show that the estimates of the training effect for Lalonde's Association, 84, Heckman, J., Ichimura, H., Smith, J., and Todd, P. (1998), "Characterizhybrid of an experimental and nonexperimental dataset are ing Selection Bias Using Experimental Data," Enomnetrica, 66, 117- close to the benchmark experimental estimate and are ro bust to the specification of the mparison group and to the Heckman, J., Ichimura, H., and Todd, P. (1997), "Matching As An Enofunctional form used to estimate the propensity sre. A metric Evaluation Estimator: Evidence from Evaluating a Job Training Program," Review of Enomic Studies, 64, researcher using this method would arrive at estimates of (1998), "Matching as an Enometric Evaluation Estimator," Rethe treatment impact ranging from $1,473 to $1,774, close view of Enomic Studies, 65, to the benchmark unbiased estimate from the experiment Heckman, J., and Robb, R. (1985), "Alternative Methods for Evaluating of $1,794. Furthermore, our methods succeed for a trans- the Impact of Interventions," in Longituidinal Analysis of Labor Market Data, Enometric Society Monograph No. 1, eds. J. Heckman and B. parent reason: They use only the subset of the mparison Singer, Cambridge, U.K.: Cambridge University Press, pp group that is mparable to the treatment group, and dis- Holland, P. W. (1986), "Statistics and Causal Inference," Journal of the card the mplement. Although Lalonde attempted to fol- American Statistical Association, 81, low this strategy in his nstruction of other mparison Lalonde, R. (1986), "Evaluating the Enometric Evaluations of Training Programs," American Enomic Review, 76, groups, his method relies on an informal selection based Manpower Demonstration Research Corporation (1983), Summaty and on preintervention variables. Our application illustrates that Findings of the National Supported Work Demonstr-ation, Cambridge, even among a large set of potential mparison units, very U.K.: Ballinger. Manski, C. F., and Garfinkel, I. (1992), "Introduction," in Evaluating Welfew may be relevant, and that even a few mparison units fare and Training Programs, eds. C. Manski and I. Garfinkel, Cambridge, may be sufficient to estimate the treatment impact. U.K.: Harvard University Press, pp The methods we suggest are not relevant in all situa- Manski, C. F., Sandefur, G., McLanahan, S., and Powers, D. (1992), "Altertions. There may be important unobservable variates, for native Estimates of the Effect of Family Structure During Adolescence on High School Graduation," Journal of the American Statistical Assowhich the propensity sre method cannot acunt. Howciation, 87, ever, rather than giving up, or relying on assumptions about Rosenbaum, P. (1987), "The Role of a Send Control Group in an Obthe unobserved variables, there is substantial reward in ex- servational Study," Statistical Science, 2, ploring firsthe information ntained in the variables that Rosenbaum, P., and Rubin, D. (1983), "The Central Role of the Propensity Sre in Observational Studies for Causal Effects," Biometrika, 7, 41- are observed. In this regard, propensity sre methods can 55. offer both a diagnostic on the quality of the mparison (1984), "Reducing Bias in Observational Studies Using the Subgroup and a means to estimate the treatment impact. classification on the Propensity Sre," Jouirnal of the American Statistical Association, 79, Rubin, D. (1974),"Estimating Causal Effects of Treatments in Randomized [Received October Revised May 1999.] and Non-Randomized Studies," Journal of Educational Psychology, 66, REFERENCES (1977), "Assignmento a Treatment Group on the Basis of a Covariate," Journal of Educational Statistics, 2, Angrist, J. (199), "Lifetime Earnings and the Vietnam Draft Lottery: Ev- (1978), "Bayesian Inference for Causal Effects: The Role of Ranidence From Social Security Administrative Rerds," American E- domization," The Annals of Statistics, 6, nomic Review, 8, (1979),"Using Multivariate Matched Sampling and Regression Ad- ( 1998), "Estimating the Labor Market Impact of Voluntary Military justmen to Control Bias in Observation Studies," Journal of the Amer- Service Using Social Security Data on Military Applicants," Enomet- ican Statistical Association, 74,

1. INTRODUCTION. Lalonde estimates the impact of the National Supported Work (NSW) Demonstration, a labor

1. INTRODUCTION. Lalonde estimates the impact of the National Supported Work (NSW) Demonstration, a labor 1. INTRODUCTION This paper discusses the estimation of treatment effects in observational studies. This issue, which is of great practical importance because randomized experiments cannot always be implemented,