A Guide to Quasi-Experimental Designs


Western Kentucky University
From the SelectedWorks of Matt Bogard, Fall 2013

A Guide to Quasi-Experimental Designs
Matt Bogard, Western Kentucky University

Available at: https://works.bepress.com/matt_bogard/24/

Abstract

Linear regression is a very powerful empirical tool that allows for controlled comparisons of treatment effects across groups. However, omitted variable bias, selection bias, and issues related to unobserved heterogeneity and endogeneity can bias standard regression results. Quasi-experimental designs, including propensity score methods, instrumental variables, regression discontinuity, and difference-in-difference estimators, offer an inferentially rigorous alternative for program evaluation. In this guide, I begin with an introduction to the potential outcomes framework for rigorously characterizing selection bias and follow with discussions of quasi-experimental methods that may be useful to practitioners involved in program evaluation.

Introduction

Linear regression is a very powerful empirical tool that allows for controlled comparisons of treatment effects across groups. However, omitted variable bias, selection bias, and issues related to unobserved heterogeneity and endogeneity can bias standard regression results. Quasi-experimental (QE) designs provide an inferentially rigorous approach to causal inference. As discussed in Cellini (2008), approaches such as difference-in-differences are becoming quite common; indeed, these approaches have begun to replace basic multivariate regression as the standard in program evaluation. The discussion below introduces concepts related to selection bias and unobserved heterogeneity, and several quasi-experimental approaches that can be used to address these issues.

The Randomized Controlled Experiment

In the classic randomized controlled experiment (RCE), subjects are randomly assigned to a treatment and control group in a careful manner that ensures that subjects in each group are identical in all respects except for the treatment assignment. In an RCE, we assume that any difference in the observed outcome of interest is due to the treatment effect, because all other factors have been accounted for in the experimental design. In this case, correlation between treatment and outcome, or observed differences in outcomes between treatment and control groups, implies a causal treatment effect.
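To make the logic of randomization concrete, here is a minimal simulation sketch in Python (all data and parameter values are hypothetical, chosen only for illustration). Because assignment is independent of the potential outcomes, a simple difference in observed group means recovers the true treatment effect:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Potential outcomes: baseline y0 plus a constant true treatment effect of 2.0
y0 = rng.normal(10, 2, n)
y1 = y0 + 2.0

# Random assignment: treatment is independent of the potential outcomes
d = rng.integers(0, 2, n)
y_obs = np.where(d == 1, y1, y0)

# Under randomization the observed difference in means recovers the true effect
print(y_obs[d == 1].mean() - y_obs[d == 0].mean())  # approximately 2.0
```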

Selection Bias and the Rubin Causal Model and Potential Outcomes Framework

The problem of selection bias is best characterized within the Rubin causal model or potential outcomes framework (Angrist and Pischke, 2009; Rubin, 1974; Imbens and Wooldridge, 2009; Klaiber and Smith, 2009). Suppose Y_i is the measured outcome of interest. This can be written in terms of potential outcomes as:

Y_i = { y_{1i} if d_i = 1; y_{0i} if d_i = 0 }   (1)
    = y_{0i} + (y_{1i} - y_{0i}) d_i             (2)

where:
d_i = choice, selection, or treatment indicator
y_{0i} = baseline potential outcome
y_{1i} = potential treatment outcome

The causal effect of interest is y_{1i} - y_{0i}, a comparison of both potential outcomes for a single individual. Reality forces us to compare outcomes for different individuals (those treated vs. untreated). What we actually measure is E[Y_i | d_i = 1] - E[Y_i | d_i = 0], the observed effect or observed difference between means for treated vs. untreated groups. The problem of non-random treatment assignment, or selection bias, can be characterized as follows:

E[Y_i | d_i = 1] - E[Y_i | d_i = 0] = E[y_{1i} - y_{0i}] + {E[y_{0i} | d_i = 1] - E[y_{0i} | d_i = 0]}   (3)¹

The observed effect or difference is equal to the population average treatment effect (ATE), E[y_{1i} - y_{0i}], plus the bracketed term, which characterizes selection bias. If the baseline potential outcomes y_{0i} of those who select treatment (d_i = 1) differ from the baseline potential outcomes y_{0i} of those who do not select treatment (d_i = 0), then the term {E[y_{0i} | d_i = 1] - E[y_{0i} | d_i = 0]} could have a positive or negative value, creating selection bias. When we calculate the observed difference between treated and untreated groups, selection bias becomes confounded with the actual treatment effect E[y_{1i} - y_{0i}]. Note that if the baseline potential outcomes of the treated and control groups were the same, then the selection bias term would equal zero, and the observed difference would represent the population average treatment effect. This is the result we would get from an ideal randomized controlled experiment. Selection bias can overpower the actual treatment effect and lead the naïve researcher to conclude (based on the observed effect E[Y_i | d_i = 1] - E[Y_i | d_i = 0]) that the intervention or treatment was ineffectual, or lead them to under- or overestimate the true treatment effect, depending on the direction of the bias.
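The decomposition in equation (3) can be demonstrated with a small extension of the earlier simulation (again, all numbers are hypothetical): if units with higher baseline potential outcomes are more likely to opt into treatment, the observed difference equals the ATE plus the selection bias term exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Baseline potential outcome y0; constant true treatment effect of 2.0
y0 = rng.normal(10, 2, n)
y1 = y0 + 2.0

# Self-selection: units with high baseline outcomes are more likely to opt in,
# so E[y0 | d = 1] > E[y0 | d = 0]
d = (y0 + rng.normal(0, 2, n) > 10).astype(int)
y_obs = np.where(d == 1, y1, y0)

observed_diff = y_obs[d == 1].mean() - y_obs[d == 0].mean()
ate = (y1 - y0).mean()                                   # 2.0 by construction
selection_bias = y0[d == 1].mean() - y0[d == 0].mean()

# Equation (3): observed difference = ATE + selection bias
print(observed_diff, ate + selection_bias)  # the two quantities match exactly
```

Here the observed difference (roughly 4.3) overstates the true effect of 2.0 by exactly the selection bias term.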

In many cases, applied research in the social sciences and education involves non-experimental or observational data and is plagued by issues related to selection bias. Quasi-experimental methods such as propensity score matching offer a methodological approach to this problem.

Propensity Score Matching

As explained above, selection bias can overpower the actual treatment effect and lead the naïve researcher to conclude that the intervention or treatment was ineffectual, or to under- or overestimate the true treatment effect, depending on the direction of the bias. According to the conditional independence assumption (CIA) (Rubin, 1973; Angrist and Pischke, 2009; Rosenbaum and Rubin, 1983; Angrist and Hahn, 2004), conditioning on covariates may remove selection bias, giving us the estimate of the treatment effect we need:

E[Y_i | x_i, d_i = 1] - E[Y_i | x_i, d_i = 0] = E[y_{1i} - y_{0i} | x_i], or (y_{1i}, y_{0i}) ⊥ d_i | x_i   (4)

The last term implies that treatment assignment (d_i) and response (y_{1i}, y_{0i}) are conditionally independent given covariates x_i. This conclusion provides the justification and motivation for utilizing matched comparisons to estimate treatment effects. Matched estimates of treatment effects are achieved by comparing units with similar covariate values and computing a weighted average of the within-stratum differences based on the distribution of covariates:

Σ_x {E[Y_i | X_i = x, d_i = 1] - E[Y_i | X_i = x, d_i = 0]} P(X_i = x) = E[y_{1i} - y_{0i}] = ATE   (5)

Matched comparisons imply balanced comparisons, creating a situation similar to a randomized experiment where all subjects are essentially the same except for the treatment (Thoemmes and Kim, 2011). As Angrist and Pischke (2009) demonstrate, regression can also be utilized as a type of variance-based weighted matching estimator, where:

Y = β_0 + β_1 d + β_2 X + e, and β_1 = E[Y_i | x_i, d_i = 1] - E[Y_i | x_i, d_i = 0]   (6)
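A sketch of the covariate-matching logic in equation (5), using a single discrete confounder (the data-generating values are hypothetical): the naive comparison is confounded, while the stratum-weighted comparison recovers the ATE.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# One discrete confounder x that drives both selection and the baseline outcome
x = rng.integers(0, 3, n)                       # x in {0, 1, 2}
p_treat = np.array([0.2, 0.5, 0.8])[x]          # selection depends on x
d = (rng.random(n) < p_treat).astype(int)
y = 5 + 3 * x + 2 * d + rng.normal(0, 1, n)     # true treatment effect = 2

# Naive comparison is confounded by x
print("naive:", y[d == 1].mean() - y[d == 0].mean())        # well above 2

# Equation (5): within-stratum differences weighted by P(X = x)
ate = sum(
    (y[(x == v) & (d == 1)].mean() - y[(x == v) & (d == 0)].mean()) * np.mean(x == v)
    for v in range(3)
)
print("stratified ATE:", ate)                               # approximately 2.0
```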

Matching on covariates can be complicated and cumbersome. An alternative is to implement matching based on an estimate of the probability of receiving treatment or selection. This probability is referred to as a propensity score. Given estimates of the propensity or probability of receiving treatment, comparisons can then be made between observations matched on propensity scores. This is in effect a two-stage process, requiring first the specification and estimation of a model used to derive the propensity scores, and then some implementation of matched comparisons made on those scores. Rosenbaum and Rubin (1983) state that if the CIA holds, then matching or conditioning on propensity scores (denoted p(x_i)) will also eliminate selection bias; i.e., treatment assignment (d_i) and response (y_{1i}, y_{0i}) are conditionally independent given propensity scores p(x_i):

(y_{1i}, y_{0i}) ⊥ d_i | x_i  ⇒  (y_{1i}, y_{0i}) ⊥ d_i | p(x_i)   (7)

In fact, propensity score matching can provide a more asymptotically efficient estimator of treatment effects than covariate matching (Angrist and Hahn, 2004). So the idea is to first generate propensity scores by specifying a model that predicts the probability of receiving treatment given covariates x_i:

p(x_i) = P(d_i = 1 | x_i)   (8)

There are many possible functional forms for estimating propensity scores. Logit and probit models with the binary treatment indicator as the dependent variable are commonly used. Hirano et al. (2003) find that an efficient estimator can be achieved by weighting by a non-parametrically estimated propensity score. Millimet and Tchernis (2009) find evidence that more flexible and over-specified estimators perform better in propensity score applications. A comparative study of propensity score estimators using logistic regression, support vector machines, decision trees, and boosting algorithms can be found in Westreich et al. (2010).

Matching is accomplished by identifying individuals in the control group with propensity scores similar to those in the treated group. Types of matching algorithms include 1:1 and nearest-neighbor methods. Differences between matched cases are calculated and then combined to estimate an average treatment effect. Another method that implements matching based on propensity scores is stratified comparison. In this case, treatment and control groups are stratified or divided into groups, categories, or bins of propensity scores. Then comparisons are made across strata and combined to estimate an average treatment effect. Matched comparisons based on propensity score strata are discussed in Rosenbaum and Rubin (1984). This method can remove up to 90% of the bias due to factors related to selection using as few as five strata (Rosenbaum and Rubin, 1984).
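The two-stage process can be sketched as follows (simulated data; the model and matching rule are deliberately simplified): a logit model estimates the scores per equation (8), and each treated unit is then matched, with replacement, to the control unit with the nearest score. Because we average outcome differences over treated units, this particular estimator targets the treatment effect on the treated (ATT), which here equals the ATE by construction.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 2_000

# Continuous confounder x; selection and outcome both depend on it
x = rng.normal(0, 1, n)
d = (rng.random(n) < 1 / (1 + np.exp(-x))).astype(int)
y = 5 + 3 * x + 2 * d + rng.normal(0, 1, n)     # true treatment effect = 2

# Stage 1: estimate propensity scores with a logit model, equation (8)
ps = sm.Logit(d, sm.add_constant(x)).fit(disp=0).predict()

# Stage 2: 1:1 nearest-neighbor matching of treated units to controls on the score
treated, controls = np.where(d == 1)[0], np.where(d == 0)[0]
nearest = np.abs(ps[controls][None, :] - ps[treated][:, None]).argmin(axis=1)
matches = controls[nearest]

print("matched estimate:", (y[treated] - y[matches]).mean())  # approximately 2.0
```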

Inverse Probability of Treatment Weighted Regression

An alternative to direct matching or matching on propensity scores involves the use of the inverse of propensity scores in a weighted regression framework (Horvitz and Thompson, 1952), known as inverse probability of treatment weighted (IPTW) regression, where:

E[d_i Y_i / p(x_i)] = E[y_{1i}] and E[(1 - d_i) Y_i / (1 - p(x_i))] = E[y_{0i}] (Hirano and Imbens, 2001)   (9)

IPTW regression (with weights specified as above) specifically estimates the average treatment effect (ATE) (Austin, 2011):

ATE = E[y_{1i} - y_{0i}]   (10)

Inverse probability of treatment weighting uses weights derived from the propensity scores to create a pseudo-population such that the distribution of covariates in the population is independent of treatment assignment (Austin, 2011).
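A minimal IPTW sketch under the same kind of simulated setup (hypothetical values): the weighted means in equation (9) and an equivalent weighted regression both recover the ATE.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 20_000

x = rng.normal(0, 1, n)
d = (rng.random(n) < 1 / (1 + np.exp(-x))).astype(int)
y = 5 + 3 * x + 2 * d + rng.normal(0, 1, n)     # true ATE = 2

# Propensity scores from a logit model
ps = sm.Logit(d, sm.add_constant(x)).fit(disp=0).predict()

# Equation (9): weighted means of the observed outcomes
ey1 = np.mean(d * y / ps)
ey0 = np.mean((1 - d) * y / (1 - ps))
print("IPTW ATE:", ey1 - ey0)                                # approximately 2.0

# Equivalent weighted regression: weights 1/ps for treated, 1/(1-ps) for controls
w = np.where(d == 1, 1 / ps, 1 / (1 - ps))
fit = sm.WLS(y, sm.add_constant(d), weights=w).fit()
print("weighted regression:", fit.params[1])                 # approximately 2.0
```

In practice the weights are usually inspected, and sometimes trimmed or stabilized, since propensity scores near 0 or 1 produce very large weights and unstable estimates.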

Unobserved Heterogeneity and Endogeneity

Let's suppose we estimate the following:

Y = β_0 + β_1 D + e   (11)

When we estimate a regression such as (11) above and leave out an important variable such as A, our estimate of β_1 can become biased and inconsistent. In fact, to the extent that D and A are correlated, D becomes correlated with the error term, violating a basic assumption of regression. The omitted information in A is referred to in econometrics as heterogeneity. Heterogeneity is simply variation across individual units of observation, and since we cannot observe the heterogeneity related to A, we have unobserved heterogeneity. Correlation between an explanatory variable and the error term is referred to as endogeneity. So in econometrics, when we have an omitted variable (as is often the case with causal inference and selection bias), we say we have endogeneity caused by unobserved heterogeneity.

What happens to our estimate of β_1? We know from basic econometrics that our estimate of β_1 is:

b = COV(Y, D)/VAR(D)   (12)

Substituting Y = β_0 + β_1 D + e into (12) we get:

b = COV(β_0 + β_1 D + e, D)/VAR(D)
  = COV(β_0, D)/VAR(D) + COV(β_1 D, D)/VAR(D) + COV(e, D)/VAR(D)   (13)
  = β_1 VAR(D)/VAR(D) + COV(e, D)/VAR(D)                           (14)
  = β_1 + COV(e, D)/VAR(D)                                         (15)

We can see from (15) that if we leave out a variable in (11), i.e. we have unobserved heterogeneity, then the correlation that results between D and the error term will not be zero, and our estimate of β_1 will be biased by the term COV(e, D)/VAR(D). If (11) were correctly specified, the term COV(e, D)/VAR(D) would drop out and we would get an unbiased estimate of β_1.
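The result in (15) is easy to verify by simulation (all parameter values hypothetical): generate an omitted variable A that is correlated with D, and compare the long and short regressions.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 50_000

# Omitted variable A, correlated with treatment D
a = rng.normal(0, 1, n)
d = (a + rng.normal(0, 1, n) > 0).astype(int)
y = 1.0 + 0.5 * d + 0.8 * a + rng.normal(0, 1, n)   # true beta_1 = 0.5

# Correctly specified regression recovers beta_1
long_fit = sm.OLS(y, sm.add_constant(np.column_stack([d, a]))).fit()
print("with A:", long_fit.params[1])                 # approximately 0.5

# Omitting A: the estimate is beta_1 + COV(e, D)/VAR(D), per equation (15)
short_fit = sm.OLS(y, sm.add_constant(d)).fit()
print("without A:", short_fit.params[1])             # well above 0.5

# The bias term computed directly (e is the error of the short regression)
e = y - 1.0 - 0.5 * d
print("beta_1 + bias:", 0.5 + np.cov(e, d)[0, 1] / np.var(d, ddof=1))
```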

Instrumental Variables

Suppose an institution has a summer camp designed to prepare high school students for their first year of college, and we want to assess the impact of the camp on first-year retention. The model might be specified as follows:

Y = β_0 + β_1 CAMP + β_2 X + e   (16)

where:
Y = binary first-year retention indicator
CAMP = an indicator for camp attendance
X = a vector of controls

For simplicity, suppose we work with:

Y = β_0 + β_1 CAMP + e   (17)

The causal effect of interest, or the treatment effect of CAMP, is the regression estimate β_1 above. But what if CAMP attendance is voluntary? If attendance is voluntary, then it could be that students who choose to attend also have a high propensity to succeed or be retained, due to unmeasured factors (social capital, innate ability, ambition, etc.) not captured even with controls like test scores or other measures of ability. In that case β_1 could overstate the actual impact of CAMP on retention. If we knew about a variable that captures the omitted factors that may be related to both the choice of attending the CAMP and a greater tendency to be retained (call it INDEX), we would include it and estimate the following:

Y = β_0 + β_1 CAMP + β_2 INDEX + e   (18)

Omitted variable bias in equation (17) would cause us to mis-estimate the effect of CAMP. One way to characterize the selection bias problem is through the potential outcomes framework discussed before, but this time let's characterize the problem in terms of the regression specification above. By omitting INDEX, information about INDEX is getting sucked up into the error term. When this happens, to the extent that INDEX is correlated with CAMP, CAMP becomes correlated with the error term. This correlation with the error term is a violation of the classical regression assumptions and leads to biased estimates of β_1. In more technical terms than "getting sucked up into the error term," we would frame this in the context of our previous discussion of unobserved heterogeneity and endogeneity.

So the question becomes: how do we tease out the true effect of CAMP when we have omitted INDEX? Techniques using what are referred to as instrumental variables help us do this. Suppose we find a variable, call it Z, that tends to be correlated with our variable of interest, CAMP. But we also notice (or argue) that Z tends to be unrelated to all of those omitted factors, like innate ability and ambition, that comprise the variable INDEX that we wish we had. The technique of instrumental variables looks at changes in a variable like Z, relates them to changes in our variable of interest CAMP, and then relates those changes to the outcome of interest, retention. Since Z is unrelated to INDEX, the changes in CAMP that are related to Z are likely to be less correlated with INDEX (and hence less correlated with the error term). A less technical way to think about this is that we are taking Z and going through CAMP to get to Y, bringing with us only those aspects of CAMP that are unrelated to INDEX. Z is like a filter that picks up only the variation in CAMP that we are interested in (what we might call quasi-experimental variation) and filters out the noise picked up from not including or controlling for INDEX. Z is technically related to Y only through CAMP:

Z → CAMP → Y   (19)

If we can do this, then our estimate of the effect of CAMP on Y will be unbiased by the omitted effects of INDEX. So how do we do this in practice? We can do this through a series of regressions. To relate changes in Z to changes in CAMP we estimate:

CAMP = β_0 + β_1 Z + e   (20)

Notice that in (20), β_1 only picks up the common variation between Z and CAMP and leaves all of the variation in CAMP related to INDEX in the residual term (it is the residual, not β_1, that is related to INDEX, because we are arguing that Z and INDEX are uncorrelated). You can think of this as the filtering process.

Then, to relate changes in Z to changes in our target Y, we estimate:

Y = β_0 + β_2 Z + e   (21)

Our instrumental variable estimator then becomes:

β_IV = β_2 / β_1 = [COV(Y, Z)/VAR(Z)] / [COV(CAMP, Z)/VAR(Z)] = COV(Y, Z)/COV(CAMP, Z)   (22)

The last term in (22) shows that β_IV represents the proportion of the total variation in CAMP that is related to Z that is also related to Y; or, the total proportion of variation in CAMP unrelated to INDEX that is related to Y; or, the total proportion of quasi-experimental variation in CAMP related to Y. Regardless of how we characterize β_IV, we can see that it teases out only that variation in CAMP that is unrelated to INDEX and relates it to Y, giving us an estimate of the treatment effect of CAMP that is less biased than a standard regression like (17). We can also derive β_IV by substitution via two-stage least squares:

CAMP_est = β_0 + β_1 Z + e   (23)
Y = β_0 + β_IV CAMP_est + e   (24)

As discussed above, the first regression gets only the variation in CAMP related to Z and leaves all of the variation in CAMP related to INDEX in the residual term. As Angrist and Pischke (2009) explain, the second stage then retains only the quasi-experimental variation in CAMP generated by the instrument Z (because in the second regression we are using the estimate of CAMP derived from (23) rather than CAMP itself).² As discussed in their book Mostly Harmless Econometrics, most IV estimates are derived using packages like SAS, Stata, or R rather than by explicit implementation of the methods illustrated above. Caution should be used to derive the correct standard errors, which are not the ones you will get in the intermediate results from any of the regressions depicted above.
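A sketch of both the ratio form (22) and the two-stage procedure (23)-(24), using simulated data for the hypothetical CAMP example (Z is constructed to be correlated with CAMP but independent of INDEX; the retention outcome is treated as continuous for simplicity):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 50_000

index = rng.normal(0, 1, n)                    # unobserved INDEX
z = rng.normal(0, 1, n)                        # instrument: independent of INDEX
camp = (z + index + rng.normal(0, 1, n) > 0).astype(int)
y = 0.5 * camp + 0.8 * index + rng.normal(0, 1, n)   # true CAMP effect = 0.5

# Biased OLS, as in (17): CAMP is correlated with the omitted INDEX
print("OLS:", sm.OLS(y, sm.add_constant(camp)).fit().params[1])

# Ratio form, equation (22): COV(Y, Z) / COV(CAMP, Z)
print("IV ratio:", np.cov(y, z)[0, 1] / np.cov(camp, z)[0, 1])

# Two-stage least squares, equations (23)-(24)
camp_est = sm.OLS(camp, sm.add_constant(z)).fit().predict()
second = sm.OLS(y, sm.add_constant(camp_est)).fit()
print("2SLS:", second.params[1])               # matches the IV ratio, ~0.5
# Note: the second stage's reported standard errors are NOT the correct 2SLS
# standard errors, exactly as cautioned above; dedicated IV routines adjust them.
```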

Difference-In-Difference Estimators

Difference-in-difference (DD) estimators assume that, in the absence of treatment, the difference between the treatment group (A) and the control group (B) would remain constant or "fixed" over time. DD estimators are a special type of fixed effects estimator.

(A - B) = the difference in group averages pre-treatment, the "normal" difference between groups
(A' - B') = the total difference in group averages post-treatment = the normal difference + the treatment effect
(A' - B') - (A - B) = the treatment effect

We compare the normal difference in group averages pre-treatment to the difference in group averages post-treatment; the larger the difference post-treatment, the larger the treatment effect. This can also be represented in a regression context with interactions, where t is a time indicator for pre vs. post treatment and d is an indicator for treatment and control groups. At t = 0 there are no treatments, so those terms equal 0. The parameter β_3 on the interaction term is our difference-in-difference estimator, as shown below:

Y = β_0 + β_1 d + β_2 t + β_3 d*t + e   (25)

Group                 Pre (t = 0)   Post (t = 1)              Difference
Treatment A (d = 1)   β_0 + β_1     β_0 + β_1 + β_2 + β_3     β_2 + β_3
Control B (d = 0)     β_0           β_0 + β_2                 β_2
Difference-in-difference: β_3
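A quick simulated check (hypothetical parameter values) that the coefficient on the interaction in (25) reproduces the difference of the four group means:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 40_000

d = rng.integers(0, 2, n)          # 1 = treatment group
t = rng.integers(0, 2, n)          # 1 = post period
# Group gap of 1.0, common time trend of 1.5, treatment effect of 2.0
y = 3 + 1.0 * d + 1.5 * t + 2.0 * d * t + rng.normal(0, 1, n)

# Regression (25): the coefficient on d*t is the DD estimator (beta_3)
X = sm.add_constant(np.column_stack([d, t, d * t]))
print("beta_3:", sm.OLS(y, X).fit().params[3])                # approximately 2.0

# The same number from the four group means
dd = (y[(d == 1) & (t == 1)].mean() - y[(d == 1) & (t == 0)].mean()) \
   - (y[(d == 0) & (t == 1)].mean() - y[(d == 0) & (t == 0)].mean())
print("difference-in-difference:", dd)
```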

Regression Discontinuity Designs

Suppose a policy or intervention is implemented, or a treatment is applied, based on arbitrary values of some observed covariate X relative to a cutoff value X_0. If there is some positive relationship between X and the outcome, wouldn't subjects with X > X_0, to whom the treatment is applied, be more likely to exhibit higher levels of the outcome variable Y anyway? Is it valid to make comparisons of observed outcomes (Y) between groups with differing values of X? One solution would be to implement matched comparisons between groups with similar values of covariates. Regression discontinuity (RD) designs allow us to compare differences between groups in the neighborhood of the cutoff value X_0, giving us unbiased estimates of treatment effects. Treatment effects can be characterized by a change in intercept or main effect at the discontinuity. Within the neighborhood of the cutoff, treatment assignment is equivalent to random assignment (Lee and Lemieux, 2010). More complicated functional forms may be estimated:

Y = β_0 + β_1 D + f(X) + e   (26)

where D indicates treatment (X > X_0) and f(X) may be a pth-order polynomial. Comparisons of outcomes in the neighborhood of X_0 provide estimates of the treatment effect based on E[Y | X] (Angrist and Pischke, 2009). Even more complicated methods, including local linear regression, may be implemented, as well as combined main and interaction effects.
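A sketch of a sharp RD estimate on simulated data (the cutoff, slopes, and jump are all hypothetical): fitting (26) with a linear f(X), allowing different slopes on each side of the cutoff, and comparing it to a simple local comparison of means near X_0.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 20_000

x = rng.uniform(-1, 1, n)             # running variable, cutoff X0 = 0
d = (x > 0).astype(int)               # sharp assignment rule
y = 2 + 1.5 * x + 2.0 * d + rng.normal(0, 0.5, n)   # jump of 2.0 at the cutoff

# Equation (26) with f(X) linear, allowing a different slope on each side
X = sm.add_constant(np.column_stack([d, x, d * x]))
print("RD estimate:", sm.OLS(y, X).fit().params[1])           # approximately 2.0

# Naive local comparison in a narrow window around X0: close to 2.0, with a
# small bias from the slope that shrinks as the window narrows
w = np.abs(x) < 0.1
print("local means:", y[w & (d == 1)].mean() - y[w & (d == 0)].mean())
```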

Lee and Lemieux (2010) provide a very good introduction to RD designs and make two very important points about their usefulness as a QE method: the RD "design can be interpreted as a weighted average treatment effect across all individuals," and the design can be viewed as generating local random assignment in the neighborhood of the cutoff, which distinguishes it from simply applying instruments to address selection. Among quasi-experimental designs, RD designs may be regarded as coming the closest to the ideal of a randomized controlled experiment.

Sharp vs. Fuzzy RD

With a sharp RD design, subjects are assigned to treatment and control groups strictly according to the value of the observed covariate relative to the cutoff X_0. When assignment does not strictly follow the cutoff (referred to as non-compliance or incomplete compliance in some settings), it may be the case that we find subjects with values of X near the cutoff in both treatment and control groups. As van der Klaauw (2002) explains, this could be a case where assignment is based on the observable values of X in addition to other unobservable factors. RD in this context is a case of both selection on observables and unobservables. As explained in Angrist and Pischke (2009), the discontinuity serves as an instrument for treatment status, and fuzzy RD can be understood in the instrumental variables context.

Conclusion

Multivariable regression can be a powerful empirical tool for estimating treatment effects of interventions. However, issues related to omitted variable bias, selection bias, and unobserved heterogeneity and endogeneity can bias standard regression results. Quasi-experimental designs, including propensity score methods, instrumental variables, regression discontinuity, and difference-in-difference estimators, offer a more rigorous alternative for program evaluation.

Notes

1. A variation on equation (3) can be written as:

E[Y_i | d_i = 1] - E[Y_i | d_i = 0] = E[y_{1i} - y_{0i} | d_i = 1] + {E[y_{0i} | d_i = 1] - E[y_{0i} | d_i = 0]}

The expression E[y_{1i} - y_{0i} | d_i = 1] represents the average treatment effect on the treated (ATT) vs. the average treatment effect (ATE), E[y_{1i} - y_{0i}], depicted earlier. For more detailed discussion of ATE vs. ATT in the context of quasi-experimental designs, see Austin (2011), Angrist and Pischke (2009), and Lanehart et al. (2012).

2. We can also see that instrumental variables correct for omitted variable bias in the following way. Starting from

β_IV = COV(Y, Z)/COV(D, Z)

and substituting Y = β_0 + β_1 D + e (where D is our treatment indicator as in (11)), we get:

β_IV = COV(β_0 + β_1 D + e, Z)/COV(D, Z)
     = COV(β_0, Z)/COV(D, Z) + COV(β_1 D, Z)/COV(D, Z) + COV(e, Z)/COV(D, Z)
     = β_1 COV(D, Z)/COV(D, Z) + COV(e, Z)/COV(D, Z)
     = β_1 + COV(e, Z)/COV(D, Z)

By construction COV(e, Z) = 0, and we get an unbiased estimate of β_1.

References

Angrist, J. D., & Hahn, J. (2004). When to control for covariates? Panel-asymptotic results for estimates of treatment effects. Review of Economics and Statistics, 86, 58-72.

Angrist, J. D., Imbens, G. W., & Rubin, D. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434), 444-455.

Angrist, J. D., & Pischke, J. (2009). Mostly harmless econometrics: An empiricist's companion. Princeton University Press.

Austin, P. (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46(3), 399-424.

Baumer, P. Regression discontinuity. Southern Methodist University. http://faculty.smu.edu/kyler/courses/7312/presentations/baumer/baumer_rd.pdf

Caruana, R., & Niculescu-Mizil, A. (2006). An empirical comparison of supervised learning algorithms. Proceedings of the 23rd International Conference on Machine Learning (ICML 2006), 161-168.

Cleary, P. D., & Angel, R. (1984). The analysis of relationships involving dichotomous dependent variables. Journal of Health and Social Behavior, 25(3), 334-348.

Crump, R. K., Hotz, J., Imbens, G. W., & Mitnik, O. A. (2006). Moving the goalposts: Addressing limited overlap in the estimation of average treatment effects by changing the estimand. Working Paper 33. National Bureau of Economic Research.

D'Agostino, R. B. (1971). A second look at analysis of variance on dichotomous data. Journal of Educational Measurement, 8(4), 327-333.

Dey, E. L., & Astin, A. W. (1993). Statistical alternatives for studying college student retention: A comparative analysis of logit, probit, and linear regression. Research in Higher Education, 34(5).

Evans, W. N. (2008). Difference in difference models. Course notes, ECON 47950: Methods for Inferring Causal Relationships in Economics. University of Notre Dame, Spring.

Harder, V. S., Stuart, E. A., & Anthony, J. C. (2010). Propensity score techniques and the assessment of measured covariate balance to test causal associations in psychological research. Psychological Methods, 15, 234-249.

Hirano, K., & Imbens, G. W. (2001). Estimation of causal effects using propensity score weighting: An application to data on right heart catheterization. Health Services & Outcomes Research Methodology, 2, 259-278.

Hirano, K., Imbens, G. W., & Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4), 1161-1189.

Horvitz, D. G., & Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260).

Imbens, G. W., & Lemieux, T. (2008). Regression discontinuity designs: A guide to practice. Journal of Econometrics, 142(2), 615-635.

Imbens, G. W., & Wooldridge, J. M. (2009). Recent developments in the econometrics of program evaluation. Journal of Economic Literature, 47(1), 5-86.

Kang, J., & Schafer, J. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22(4), 523-539.

Klaiber, H. A., & Smith, V. K. (2009). Evaluating Rubin's causal model for measuring the capitalization of environmental amenities. NBER Working Paper No. 14957. National Bureau of Economic Research.

Lanehart, R. E., de Gil, P. R., Kim, E. S., Bellara, A. P., Kromrey, J. D., & Lee, R. S. (2012). Propensity score analysis and assessment of propensity score approaches using SAS procedures. Paper 314-2012, SAS Global Forum 2012 Proceedings. Cary, NC: SAS Institute.

Lee, D. S., & Lemieux, T. (2010). Regression discontinuity designs in economics. Journal of Economic Literature, 48, 281-355.

Lunney, G. H. (1970). Using analysis of variance with a dichotomous dependent variable: An empirical study. Journal of Educational Measurement, 7(4), 263-269.

Maciejewski, M. L., & Brookhart, M. A. (2011). Propensity score workshop. Retrieved January 19, 2013, from http://ahrqplexnet.sharepointspace.com/webinars/ps_webinar_followup.pdf

Millimet, D. L., & Tchernis, R. (2009). On the specification of propensity scores, with applications to the analysis of trade policies. Journal of Business & Economic Statistics, 27(3).

Moss, B. G., & Yeaton, W. H. (2006). Shaping policies related to developmental education: An evaluation using the regression-discontinuity design. Educational Evaluation and Policy Analysis, 28(3), 215-229.

Pike, G. R., Hansen, M. J., & Lin, C. Using instrumental variables to account for selection effects in research on first-year programs. Research in Higher Education, 52(2), 194-214.

Pischke, J. (2012). Probit better than LPM? Retrieved January 19, 2013, from http://www.mostlyharmlesseconometrics.com/2012/07/probit-better-than-lpm/

Robins, J. M., Hernan, M. A., & Brumback, B. (2000). Marginal structural models and causal inference in epidemiology. Epidemiology, 11, 550-560.

Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41-55.

Rosenbaum, P. R., & Rubin, D. B. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79(387), 516-524.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5), 688-701.

Rubin, D. B. (1973). Matching to remove bias in observational studies. Biometrics, 29, 159-183.

Stuart, E. (2011). Propensity score methods for estimating causal effects: The why, when, and how. Johns Hopkins Bloomberg School of Public Health, Department of Mental Health and Department of Biostatistics. Retrieved January 19, 2013, from www.biostat.jhsph.edu/estuart

Thoemmes, F. J., & Kim, E. S. (2011). A systematic review of propensity score methods in the social sciences. Multivariate Behavioral Research, 46(1), 90-118.

Program evaluation and the difference-in-difference estimator. Course notes, Education Policy and Program Evaluation, Vanderbilt University, October 4, 2008.

van der Klaauw, W. (2002). Estimating the effect of financial aid offers on college enrollment: A regression-discontinuity approach. International Economic Review, 43(4), 1249-1287.

Westreich, D., Lessler, J., & Funk, M. J. (2010). Propensity score estimation: Machine learning and classification methods as alternatives to logistic regression. Journal of Clinical Epidemiology, 63(8), 826-833.