
Title: Optimal Multilevel Matching Using Network Flows

Authors: Samuel D. Pimentel, University of Pennsylvania; Lindsay C. Page, University of Pittsburgh [presenting author]; Matthew Lenard, Wake County Public School System; Luke Keele, Georgetown University

Problem: When an educational intervention or treatment of interest is not randomly assigned, matching estimators are one method of adjustment designed to mimic a randomized controlled trial (RCT) by constructing a set of treated and control units that are highly comparable on observed, pretreatment characteristics. Existing tools are well developed to handle selection and matching at the unit or student level. Especially in education, however, because of the organizational structure of schools, interventions are often assigned at the group or school level rather than the unit level. As a result, many observational studies of causal effects occur in settings with clustered treatment assignment. For example, an educational intervention might be administered to all or a subset of the students in a school while being withheld from all students in another school. Just as we can use matching to mimic an individual-level RCT, so too can we conceive of using matching to mimic a group RCT by creating comparable treatment and comparison clusters to overcome overt bias. Yet, compared to the extensive literature on individual-level matching, strategies for optimally handling group-level matches are much less well developed. In this paper, we develop a conceptual framework and a method of statistical adjustment for multilevel matching problems. We then apply the tools developed to a study of a summer reading program implemented in selected schools within the Wake County Public School System (Wake County, NC).

Prior methodological research: To date, there has been limited work focused on matching in multilevel contexts. Steiner et al. (2013) consider matching in multilevel settings, but they assume that treatment assignment occurs at the student level rather than the school level. Other efforts to translate standard methods of adjustment into the context of data with a multilevel structure utilize hierarchical regression or methods based on propensity score stratification (Hong and Raudenbush, 2006; Arpino and Mealli, 2011; Li et al., 2013). Yet, when translated to the multilevel context, these methods can suffer from drawbacks, such as lack of model convergence or a tendency to extrapolate over areas without common support. Our research extends recent matching algorithm design that is optimal for multilevel settings (Zubizarreta and Keele, in press).

Method: In this paper, we first outline two possible assignment mechanisms at the cluster level: (1) the selection of entire groups, such as whole schools, into treatment, and (2) the selection of subsets of students from within a group into treatment. The second circumstance might arise if schools are selected for a treatment that is then applied only to students with certain characteristics, such as being above or below a specific academic threshold (e.g., gifted or remedial programs, respectively). We then derive different research designs based on these different possible assignment mechanisms. For each of these designs, we develop a matching algorithm for multilevel data based on a network flow algorithm. Earlier work on multilevel matching relied on integer programming, which allows for balance targeting on specific covariates but can be slow with larger data sets. While we cannot target balance on specific covariates, our algorithm is quite fast and scales easily to larger data sets. We also allow the algorithm to trim treated observations to increase balance and increase overlap in the covariate distributions.

Setting: We apply our algorithm to investigate the effectiveness of a summer school reading intervention implemented within the Wake County Public School System (WCPSS; Wake County, North Carolina). Under North Carolina state legislation, students who did not meet district standards at the end of 3rd grade were required to attend summer reading camps or risk retention. In summer 2013, WCPSS selected myON, a product of Capstone Digital, for implementation at Title I summer school sites in an effort to boost reading comprehension among the majority low-SES attendees. myON is internet-based software designed to serve primarily as an electronic reading device. The software provides students with access to a library of books and suggests titles to students based on topic interests and reading ability. Students at myON sites used the program for up to one-half hour during the daily literacy block and could continue using the program at home if they had a device and an internet connection. Not all summer school students in WCPSS were given access to the myON reading program. Rather, it was used by teachers at eight of the 19 summer school sites. These sites were selected based on a mix of factors including internet bandwidth, computer access, and regional distribution. Students were assigned to summer school sites primarily through geographic proximity. Thus all students in a school close to a myON summer school site used the myON program during summer school. Principals and schools themselves had no input into program participation.

Analysis & Findings: Given that the intervention was assigned to entire elementary schools, we use our multilevel matching strategy to develop a matched sample against which to compare treated schools and students. We then use multilevel modeling to examine the impact of myON on student outcomes. We find that the intervention does not appear to increase reading test scores; however, a sensitivity analysis shows that an unobserved confounder could easily mask a larger treatment effect.
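The abstract describes the algorithm only at a high level. As a minimal sketch of the underlying idea, pairing treated and control clusters at minimum total covariate distance via a min-cost network flow, consider the toy example below. The data, node names, and distance measure are illustrative assumptions, not the authors' implementation, which additionally matches students within matched schools and permits trimming of treated units:

```python
import networkx as nx
import numpy as np

# Toy school-level covariate summaries (e.g., mean prior score, pct low-SES).
rng = np.random.default_rng(0)
treated = rng.normal(0.3, 1.0, size=(8, 2))   # 8 treated schools
control = rng.normal(0.0, 1.0, size=(19, 2))  # 19 candidate control schools

G = nx.DiGraph()
G.add_node("source", demand=-len(treated))    # one unit of flow per treated school
G.add_node("sink", demand=len(treated))
for i in range(len(treated)):
    G.add_edge("source", f"T{i}", capacity=1, weight=0)
    for j in range(len(control)):
        # Integer edge cost: scaled Euclidean distance between school covariates.
        cost = int(1000 * np.linalg.norm(treated[i] - control[j]))
        G.add_edge(f"T{i}", f"C{j}", capacity=1, weight=cost)
for j in range(len(control)):
    G.add_edge(f"C{j}", "sink", capacity=1, weight=0)

# Min-cost flow routes each treated school to a distinct control school so
# that the total matched covariate distance is minimized.
flow = nx.min_cost_flow(G)
pairs = [(i, j)
         for i in range(len(treated)) for j in range(len(control))
         if flow[f"T{i}"].get(f"C{j}", 0) == 1]
print(pairs)
```

Because min-cost flow problems of this form solve in low-order polynomial time, this style of formulation is what gives the approach its speed advantage over integer programming, at the price of not being able to target balance on specific covariates.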

Works Cited

Arpino, B. and Mealli, F. (2011). The specification of the propensity score in multilevel observational studies. Computational Statistics & Data Analysis, 55(4).

Hong, G. and Raudenbush, S. W. (2006). Evaluating kindergarten retention policy: A case study of causal inference for multilevel data. Journal of the American Statistical Association, 101(475).

Li, F., Zaslavsky, A. M., and Landrum, M. B. (2013). Propensity score weighting with multilevel data. Statistics in Medicine, 32(19).

Steiner, P., Kim, J.-S., and Thoemmes, F. (2013). Matching strategies for observational multilevel data. In JSM Proceedings.

Zubizarreta, J. R. and Keele, L. (in press). Optimal multilevel matching in clustered observational studies: A case study of the effectiveness of private schools under a large-scale voucher system.

Title: Analyzing multilevel experiments in the presence of peer effects

Authors: Guillaume Basse, Harvard University; Avi Feller, UC Berkeley [presenting author]

Problem/background: Multilevel randomization is a powerful design for estimating causal effects in the presence of social interactions. A typical multilevel design has two stages. First, whole clusters (e.g., households or schools) are assigned to treatment or control. Second, units within each treated cluster are randomly assigned to treatment or control, as if each treated cluster were a separate, individually randomized experiment. This design allows researchers to assess peer effects by comparing untreated units in treated clusters with similarly untreated units in the control clusters (Hudgens & Halloran, 2008). There has been substantial interest in multilevel randomization in recent years, with prominent examples in economics (Crépon, Duflo, Gurgand, Rathelot, & Zamora, 2013), political science (Sinclair, McConnell, & Green, 2012), and public health (Hudgens & Halloran, 2008). This randomized design, however, is relatively rare in education; see, for example, Somers et al. (2010). By contrast, several important papers discuss the non-randomized analog, including Hong & Raudenbush (2006).

Prior methodological research: There is a small but growing methodological literature on analyzing multilevel experiments. Key citations include Hudgens & Halloran (2008), Liu & Hudgens (2014), and Rigdon & Hudgens (2015) in statistics; Sinclair et al. (2012) in political science; and Baird, Bohren, McIntosh, & Ozler (2014) in economics. Kang & Imbens (2016) explore a related experimental design known as a peer encouragement design. Finally, Weiss, Lockwood, & McCaffrey (2016) discuss methods for analyzing individually randomized group treatment designs, which appear similar to multilevel experiments on the surface but rely on fundamentally distinct assumptions.

Method: Our paper addresses three key practical issues that arise in analyzing multilevel experiments. First, researchers are often interested in an intervention's overall impact on students rather than on clusters; that is, researchers might give equal weight to each individual rather than equal weight to each cluster. While this distinction is well known in the literature on cluster randomization (Schochet, 2013), existing approaches for multilevel experiments either focus on equal weights for clusters (Hudgens & Halloran, 2008) or sidestep the issue by assuming clusters are of equal size (Baird et al., 2014). We propose unbiased estimators for a broad class of individual- and cluster-weighted estimands, with corresponding theoretical and estimated variances. We also derive the bias of a simple difference in means for estimating individual-weighted estimands. Second, we connect two common approaches for analyzing multilevel designs: linear regression, which is more common in the social sciences, and randomization inference, which is more common in epidemiology and public health. We show that, with suitably chosen standard errors and correct weights, weighted regression and randomization inference yield identical point and variance estimates. These results hold for both individual- and cluster-weighted estimands, greatly extending existing results. We believe this equivalence will be important in practice, since the vast majority of applied papers in this area take a regression-first approach to analysis that can obfuscate key inferential issues. Finally, we propose methods for incorporating covariates to improve precision, with a focus on post-stratification and model-assisted estimation. While largely ignored in the existing methodological literature, we find that covariate adjustment is quite important in practice.

Setting: Our motivating example is a large randomized evaluation of an intervention targeting student absenteeism among elementary and high school students in the School District of Philadelphia over the school year (Rogers & Feller, 2016). In this study, parents of at-risk students were randomly assigned to a direct mail intervention with tailored information about their students' attendance over the course of the year.

In treated households with multiple eligible students, one student was selected at random to be the subject of the mailings, following a multilevel randomization. We consider a subset of 3,804 households with between 2 and 7 eligible students each, comprising 8,496 total students.

Analysis & Findings: The left panel of Figure 1 shows the impact of random assignment on chronic absenteeism, defined as being absent 18 or more school days (results are similar for log-absences). The primary effect, defined as the impact of assignment on the target of the mailings, is roughly 4 percentage points for both household- and individual-weighted estimands. The spillover effect, defined as the impact of being an untreated individual in a treated household, is roughly 3 percentage points for both household- and individual-weighted estimands. The right panel of Figure 1 shows the same results after adjusting for pre-treatment covariates, especially prior-year attendance, both via a model-assisted estimator and via a model-assisted estimator combined with post-stratification. Both adjustment methods yield considerable gains in precision. These results are consistent with extensive simulation studies, which show similar patterns for simulated data that mimic the applied example. Overall, these results show that spillover effects can be quite large in magnitude, nearly as large as the primary effects.

Conclusion: Multilevel randomizations are important designs in settings with interactions between units. This paper addresses issues that arise when analyzing such designs. First, we address issues that arise when cluster sizes vary. Second, we demonstrate that appropriately weighted regression can yield point and variance estimates identical to those from fully randomization-based methods. Methodologically, we believe this is a useful addition to the literatures on both causal inference with interference and randomization-based inference. Substantively, we find important insights into the intra-household dynamics of student behavior.
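To fix ideas about the household- versus individual-weighted estimands, here is a rough sketch of the corresponding point estimates for the primary and spillover comparisons. The file and column names are hypothetical, and this is only the naive point-estimate logic; the paper's unbiased estimators and design-based variance formulas, which further correct for unequal household sizes, are not reproduced here:

```python
import pandas as pd

# Hypothetical student-level file: one row per eligible student, with columns
#   hh        household id
#   treat_hh  1 if the household was assigned to treatment
#   targeted  1 if this student was the randomly chosen subject of the mailings
#   absent    1 if chronically absent (18+ school days)
df = pd.read_csv("absence_study.csv")

def group_mean(sub, weight):
    # Individual weighting: every student counts equally.
    # Household weighting: average within each household, then across households.
    if weight == "household":
        return sub.groupby("hh")["absent"].mean().mean()
    return sub["absent"].mean()

def effect(exposed, weight):
    control = df[df["treat_hh"] == 0]
    return group_mean(exposed, weight) - group_mean(control, weight)

treated = df[df["treat_hh"] == 1]
for w in ("household", "individual"):
    primary = effect(treated[treated["targeted"] == 1], w)   # targets of mailings
    spill = effect(treated[treated["targeted"] == 0], w)     # untreated siblings
    print(w, round(primary, 3), round(spill, 3))
```

Household weighting averages within each household before averaging across households, so large households do not dominate; individual weighting instead answers the policy question of the average effect per student.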

References

Baird, S., Bohren, J. A., McIntosh, C., & Ozler, B. (2014). Designing Experiments to Measure Spillover Effects.

Crépon, B., Duflo, E., Gurgand, M., Rathelot, R., & Zamora, P. (2013). Do Labor Market Policies have Displacement Effects? Evidence from a Clustered Randomized Experiment. The Quarterly Journal of Economics, 128(2).

Hong, G., & Raudenbush, S. W. (2006). Evaluating Kindergarten Retention Policy. Journal of the American Statistical Association, 101(475).

Hudgens, M. G., & Halloran, M. E. (2008). Toward Causal Inference With Interference. Journal of the American Statistical Association, 103(482).

Kang, H., & Imbens, G. W. (2016). Peer Encouragement Designs in Causal Inference with Partial Interference and Identification of Local Average Network Effects. arXiv preprint.

Liu, L., & Hudgens, M. G. (2014). Large Sample Randomization Inference of Causal Effects in the Presence of Interference. Journal of the American Statistical Association, 109(505).

Rigdon, J., & Hudgens, M. G. (2015). Exact confidence intervals in the presence of interference. Statistics and Probability Letters, 105.

Rogers, T., & Feller, A. (2016). Reducing Student Absences at Scale.

Schochet, P. Z. (2013). Estimators for Clustered Education RCTs Using the Neyman Model for Causal Inference. Journal of Educational and Behavioral Statistics, 38(3).

Sinclair, B., McConnell, M., & Green, D. P. (2012). Detecting Spillover Effects: Design and Analysis of Multilevel Experiments. American Journal of Political Science, 56(4).

Somers, M.-A., Corrin, W., Sepanik, S., Salinger, T., Levin, J., & Zmach, C. (2010). The Enhanced Reading Opportunities Study Final Report: The Impact of Supplemental Literacy Courses for Struggling Ninth-Grade Readers.

Weiss, M. J., Lockwood, J. R., & McCaffrey, D. F. (2016). Estimating the Standard Error of the Impact Estimator in Individually Randomized Trials With Clustering. Journal of Research on Educational Effectiveness, 9(3).

Figures

[Figure 1 not reproduced. Legend: Unadjusted, Model Assisted, and Model Assisted + PS estimators, each for household- and individual-weighted estimands; horizontal axis: Treatment Effect (pct. pt.).]

Figure 1: The left panel shows primary and spillover effects of randomization on chronic absenteeism for household- and individual-weighted estimands. The right panel shows the impact on individual-weighted estimands adjusting for covariates. PS denotes post-stratification on the number of individuals within each household.

Title: Methods for generalizing treatment effects from cluster randomized trials to target populations

Authors: Elizabeth A. Stuart, Johns Hopkins Bloomberg School of Public Health [presenting author]; Robert B. Olsen, Rob Olsen LLC; Cyrus Ebnesajjad, The Fred Hutchinson Cancer Research Center; Stephen H. Bell, Abt Associates; Larry L. Orr, Johns Hopkins Bloomberg School of Public Health

Abstract: The ultimate goal of many educational evaluations is to inform policy, e.g., to help policy makers understand the effects of interventions or programs in target populations of policy interest. While randomized trials provide internal validity (they yield unbiased effect estimates for the subjects in the study sample), there is growing awareness that they may not provide external validity, the ability to estimate what effects would be in other, target populations or in some other context (see, e.g., Cook, 2014; Olsen et al., 2013; Tipton, 2014). Recently developed statistical methods that use trial data together with data on a population of interest have the potential to utilize the strong internal validity of trials while also enhancing their external validity (Kern et al., 2016; O'Muircheartaigh & Hedges, 2014; Stuart et al., 2011). These methods fall into two broad classes: (1) flexible regression models of the outcome as a function of treatment status and covariates, and (2) reweighting methods that weight the RCT sample to reflect the covariate distribution in the population. However, to this point the existing methods have focused on a relatively simple scenario with individual-level randomization, ignoring the common complication of clustering. In particular, many educational evaluations are conducted by selecting a sample of schools or districts and then randomizing students within those schools.

This talk will discuss the extension of existing generalizability methods to multilevel settings such as cluster randomized trials. Discussion will include the pros and cons of using data aggregated to the cluster level versus fully utilizing the multilevel structure, and the implications of those two approaches for analysis. The talk will also present results from simulation studies examining the performance of methods in each of the two broad classes of approaches (outcome regression models and reweighting approaches), extended to this multilevel setting. The simulations are designed to be as realistic as possible, based on data on a representative sample of public school students nationwide (the ECLS-K), empirical evidence on impact variation in two large-scale RCTs in education, and evidence on the types of schools that were selected for several RCTs in education (see Stuart et al., in press).

We find that, perhaps not surprisingly given the broader literature on the benefits of multilevel models, using only aggregate data does not perform as well as using the multilevel structure. This is true even though selection into the trial is at the school level, and thus one might expect that school-selection weights would be sufficient. Initial evidence implies that part of the reason for this poor performance is the relatively small number of schools in the hypothetical trials, especially relative to the very large size of the population, which makes it hard to estimate accurate weights. When a multilevel structure is used in the analysis, and when the assumptions underlying each approach are satisfied, each approach (outcome modeling and reweighting) works well. However, when key assumptions are violated (for example, if we do not observe all of the factors that moderate treatment effects and that differ between the RCT sample and the target population), none of the methods consistently estimates the population effects.

We conclude with recommendations for practice, including the need for thorough and consistent covariate measurement and a better understanding of treatment effect heterogeneity. This work helps to identify the conditions under which different statistical methods can reduce external validity bias in educational evaluations.
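As a minimal sketch of the reweighting class of methods in its simplest, school-aggregated form (precisely the version the talk finds underperforms fully multilevel approaches), one can model trial participation as a function of school-level covariates and weight trial schools by their inverse odds of participation. All file and column names below are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical school-level data: trial schools carry an estimated impact;
# the population frame lists covariates for all schools of policy interest.
trial = pd.read_csv("trial_schools.csv")      # columns: X1, X2, impact
population = pd.read_csv("all_schools.csv")   # columns: X1, X2

X_cols = ["X1", "X2"]                         # e.g., pct low-income, mean prior score
X = pd.concat([trial[X_cols], population[X_cols]], ignore_index=True)
in_trial = np.r_[np.ones(len(trial)), np.zeros(len(population))]

# Model P(school participates in the trial | covariates).
ps = LogisticRegression().fit(X, in_trial).predict_proba(trial[X_cols])[:, 1]

# Inverse-odds weights reweight trial schools toward the population profile;
# the weighted mean impact estimates the population average treatment effect.
weights = (1 - ps) / ps
print(np.average(trial["impact"], weights=weights))
```

With few trial schools relative to a large population, the fitted participation probabilities are noisy, which is one reason the aggregated weighting approach performs poorly in the simulations described above.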

References:

Cook, T. D. (2014). Generalizing causal knowledge in the policy sciences: External validity as a task of both multi-attribute representation and multi-attribute extrapolation. Journal of Policy Analysis and Management, 33.

Kern, H. L., Stuart, E. A., Hill, J., and Green, D. P. (2016). Assessing methods for generalizing experimental impact estimates to target populations. Forthcoming in Journal of Research on Educational Effectiveness.

Olsen, R., Bell, S., Orr, L., & Stuart, E. A. (2013). External validity in policy evaluations that choose sites purposively. Journal of Policy Analysis and Management, 32.

O'Muircheartaigh, C., & Hedges, L. V. (2014). Generalizing from unrepresentative experiments: A stratified propensity score approach. Journal of the Royal Statistical Society: Series C, 63.

Stuart, E. A., Bell, S. H., Ebnesajjad, C., Olsen, R. B., and Orr, L. L. (in press). Characteristics of school districts that participate in rigorous national educational evaluations. Forthcoming in Journal of Research on Educational Effectiveness.

Stuart, E. A., Cole, S. R., Bradshaw, C. P., and Leaf, P. J. (2011). The use of propensity scores to assess the generalizability of results from randomized trials. Journal of the Royal Statistical Society: Series A, 174(2).

Tipton, E. (2014). How generalizable is your experiment? An index for comparing experimental samples and populations. Journal of Educational and Behavioral Statistics, 39.

Title: Covariate restrictions for estimating principal causal effects in single- and multi-site trials

Authors: Avi Feller, UC Berkeley; Luke W. Miratrix, Harvard University [presenting author]; Lo-Hua Yuan, Harvard University

Problem: Researchers are often interested in causal effects for subgroups defined by post-treatment outcomes. Prominent examples in education include noncompliance, attrition, and implementation fidelity. Unfortunately, estimating principal causal effects (PCEs) in these principal strata or endogenous subgroups can be difficult, as subgroup membership is not fully observed. A large and growing literature addresses the associated methodological challenges (see Page, Feller, Grindal, Miratrix, & Somers (2015) for a recent review).

Prior methodological research: Many existing strategies for estimating PCEs crucially depend on covariates. In particular, covariates can be used to directly identify PCEs given appropriate additional assumptions. A broad range of methods fall under this umbrella, and despite making similar assumptions, they often seem quite different on the surface. In general, we divide these methods, which are gaining increased attention in the education research world, into two broad categories:

Single-site methods. These approaches generalize standard instrumental variable (IV) methods, positing the existence of an additional covariate that functions like an instrument in standard IV methods. Key citations are Jo (2002), Peck (2003), Ding, Geng, Yan, & Zhou (2011), and Mealli & Pacini (2013).

Multi-site methods. These approaches leverage the multi-site design common in large-scale randomized experiments to identify PCEs via site-level regressions. Key citations are Gennetian, Bos, & Morris (2002), Kling, Liebman, & Katz (2007), Reardon & Raudenbush (2013), and Kolesár, Chetty, Friedman, Glaeser, & Imbens (2014).

In separate work, we address other methods for estimating PCEs, including finite mixture models and partial identification/bounds, as well as methods that use covariates to increase precision rather than as a primary means of identification.

Method: The goals of this paper are (1) to unify and extend current methods that use covariate restrictions to estimate PCEs and (2) to bridge the divide between single-site and multi-site methods. First, we situate existing methods in a common framework, reframing these approaches within a unified notation. This allows us to directly compare and contrast these methods and their (sometimes quite stringent) assumptions. It also allows us to characterize the bias due to violations of these assumptions. Using these analytic comparisons, researchers can then better assess the large menu of available methodological options given their application. Second, we connect the analysis frameworks for single- and multi-site trials. As we show, the critical difference between these settings is the ability, in the multi-site context, to assume that observed sites are a sample from some (possibly hypothetical) population. For example, consider a categorical covariate X. In a single-site context, with X something such as grade level, we might identify PCEs by making the very strong assumption that grade-specific treatment effects are all equal. In a multi-site context, X would instead be an indicator for site membership. As we show, we can identify PCEs in this context by instead assuming that the site-specific treatment effects are equal in expectation. We can justify this assumption of equality in expectation by viewing the units at different levels of X as clusters sampled from a larger population, which is precisely the multi-site formulation. In short, the assumption of structure across sites relaxes the very strong assumptions within sites, an option not typically available using covariates in a single-site study. We further connect to similar assumptions and corresponding diagnostics in the literature on ecological inference (Gelman, Park, Ansolabehere, Price, & Minnite, 2001).

Setting: We complement our analytical results with simulation studies built from datasets constructed by imputing missing class membership and potential outcomes from real-world studies, which allows for the preservation of much of the structure among the covariates and outcomes. Using these studies, we compare the performance of the different techniques under a variety of plausible circumstances. We then apply these methods to two data sets that represent the type of data increasingly available to researchers, the JOBS II study (e.g., Jo, 2002) and the Head Start Impact Study (HSIS; Puma et al., 2010), and compare the resulting treatment effect estimates to each other as well as to plausible baseline values. For JOBS II, the two principal causal effects of interest are the effect of randomization on depression score for Compliers (those who would enroll if offered) and for Never Takers (those who would never enroll). As Jo (2002) notes, there is a concern that randomization might have a negative impact on Never Takers, even though randomization does not change their enrollment behavior. We therefore do not want to assume ex ante that the PCE for this group is zero, the assumption necessary for using a classic IV approach. For the HSIS, we also have two subgroups of interest defined by their counterfactual care setting; see Feller, Grindal, Miratrix, & Page (2016).

Conclusions: Our primary contribution is to lay out the assumptions behind several methods for estimation in stark relief. As some of these assumptions seem implausible in practice, we feel it is important that researchers who do use these methods have both eyes open. More encouragingly, however, researchers can use our framework to tailor different methods to their particular application, mixing and matching assumptions under two-sided noncompliance, for example. This points the way to tackling more difficult estimation problems by building a body of assumptions from the ground up, rooted in substance. Importantly, we find that these assumptions matter in practice. The estimators that rely on alternative sets of assumptions in the JOBS II dataset, for example, do in fact give different results. The simulation studies also show that violating the assumptions is a meaningful concern. And the theoretical results demonstrate how, under such misspecification, the final estimates can be misleading quantities, e.g., a rescaled difference in means of two incomparable populations.
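To make the multi-site identification idea concrete, here is a minimal sketch, under one-sided noncompliance and the equal-in-expectation assumption described above, of how a site-level regression can recover the two PCEs, in the spirit of the Kling et al. (2007) and Reardon & Raudenbush (2013) approaches discussed in the paper. All data and column names are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical multi-site trial, one row per participant:
#   site = site id, Z = randomized offer, D = program take-up, Y = outcome
df = pd.read_csv("multisite_trial.csv")

rows = []
for site, g in df.groupby("site"):
    z1, z0 = g[g["Z"] == 1], g[g["Z"] == 0]
    rows.append({
        "itt_y": z1["Y"].mean() - z0["Y"].mean(),  # site-level ITT effect on Y
        "itt_d": z1["D"].mean() - z0["D"].mean(),  # site-level compliance rate
    })
sites = pd.DataFrame(rows)

# With one-sided noncompliance and site-specific effects equal in expectation,
#   E[itt_y | itt_d = p] = NT_effect + (Complier_effect - NT_effect) * p,
# so the fitted line at p = 0 gives the Never Taker PCE and at p = 1 the
# Complier PCE. This sketch treats site estimates as error-free and assumes
# compliance rates are unrelated to site-level effects -- exactly the kind of
# assumptions the paper scrutinizes.
slope, intercept = np.polyfit(sites["itt_d"], sites["itt_y"], deg=1)
print("Never Taker effect:", intercept)
print("Complier effect:", intercept + slope)
```

The payoff of the multi-site formulation is visible here: rather than assuming effects are exactly equal across levels of a covariate within one site, the regression only requires that site-specific effects be equal in expectation across the sampled sites.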

References

Ding, P., Geng, Z., Yan, W., & Zhou, X.-H. (2011). Identifiability and Estimation of Causal Effects by Principal Stratification With Outcomes Truncated by Death. Journal of the American Statistical Association, 106(496).

Feller, A., Grindal, T., Miratrix, L. W., & Page, L. C. (2016). Compared to What? Variation in the Impacts of Early Childhood Education by Alternative Care Type. Annals of Applied Statistics.

Gelman, A., Park, D. K., Ansolabehere, S., Price, P. N., & Minnite, L. C. (2001). Models, assumptions and model checking in ecological regressions. Journal of the Royal Statistical Society: Series A, 164(1).

Gennetian, L. A., Bos, J. M., & Morris, P. A. (2002). Using instrumental variables analysis to learn more from social policy experiments. New York.

Jo, B. (2002). Estimation of Intervention Effects with Noncompliance: Alternative Model Specifications. Journal of Educational and Behavioral Statistics, 27(4).

Kling, J. R., Liebman, J. B., & Katz, L. F. (2007). Experimental Analysis of Neighborhood Effects. Econometrica, 75(1).

Kolesár, M., Chetty, R., Friedman, J. N., Glaeser, E. L., & Imbens, G. W. (2014). Identification and Inference with Many Invalid Instruments.

Mealli, F., & Pacini, B. (2013). Using Secondary Outcomes to Sharpen Inference in Randomized Experiments With Noncompliance. Journal of the American Statistical Association, 108(503).

Page, L. C., Feller, A., Grindal, T., Miratrix, L., & Somers, M. A. (2015). Principal Stratification: A Tool for Understanding Variation in Program Effects Across Endogenous Subgroups. American Journal of Evaluation, 36(4).

Peck, L. R. (2003). Subgroup Analysis in Social Experiments: Measuring Program Impacts Based on Post-Treatment Choice. American Journal of Evaluation, 24(2).

Reardon, S. F., & Raudenbush, S. W. (2013). Under What Assumptions Do Site-by-Treatment Instruments Identify Average Causal Effects? Sociological Methods & Research, 42(2).


Transmission to CHMP July Adoption by CHMP for release for consultation 20 July Start of consultation 31 August 2017

Transmission to CHMP July Adoption by CHMP for release for consultation 20 July Start of consultation 31 August 2017 1 2 3 30 August 2017 EMA/CHMP/ICH/436221/2017 Committee for Human Medicinal Products 4 ICH E9 (R1) addendum on estimands and sensitivity 5 analysis in clinical trials to the guideline on statistical 6

More information

RANDOMIZATION. Outline of Talk

RANDOMIZATION. Outline of Talk RANDOMIZATION Basic Ideas and Insights Marvin Zelen Harvard University Graybill Conference Ft.Collins, Colorado June 11, 2008 Outline of Talk Introduction Randomized vs. Observational Studies Conditioning

More information

Causal Inference in Observational Settings

Causal Inference in Observational Settings The University of Auckland New Zealand Causal Inference in Observational Settings 7 th Wellington Colloquium Statistics NZ 30 August 2013 Professor Peter Davis University of Auckland, New Zealand and COMPASS

More information

Working Paper: Designs of Empirical Evaluations of Non-Experimental Methods in Field Settings. Vivian C. Wong 1 & Peter M.

Working Paper: Designs of Empirical Evaluations of Non-Experimental Methods in Field Settings. Vivian C. Wong 1 & Peter M. EdPolicyWorks Working Paper: Designs of Empirical Evaluations of Non-Experimental Methods in Field Settings Vivian C. Wong 1 & Peter M. Steiner 2 Over the last three decades, a research design has emerged

More information

Pros. University of Chicago and NORC at the University of Chicago, USA, and IZA, Germany

Pros. University of Chicago and NORC at the University of Chicago, USA, and IZA, Germany Dan A. Black University of Chicago and NORC at the University of Chicago, USA, and IZA, Germany Matching as a regression estimator Matching avoids making assumptions about the functional form of the regression

More information

Data Analysis Using Regression and Multilevel/Hierarchical Models

Data Analysis Using Regression and Multilevel/Hierarchical Models Data Analysis Using Regression and Multilevel/Hierarchical Models ANDREW GELMAN Columbia University JENNIFER HILL Columbia University CAMBRIDGE UNIVERSITY PRESS Contents List of examples V a 9 e xv " Preface

More information

Section on Survey Research Methods JSM 2009

Section on Survey Research Methods JSM 2009 Missing Data and Complex Samples: The Impact of Listwise Deletion vs. Subpopulation Analysis on Statistical Bias and Hypothesis Test Results when Data are MCAR and MAR Bethany A. Bell, Jeffrey D. Kromrey

More information

Instrumental Variables Estimation: An Introduction

Instrumental Variables Estimation: An Introduction Instrumental Variables Estimation: An Introduction Susan L. Ettner, Ph.D. Professor Division of General Internal Medicine and Health Services Research, UCLA The Problem The Problem Suppose you wish to

More information

Estimands. EFPIA webinar Rob Hemmings, Frank Bretz October 2017

Estimands. EFPIA webinar Rob Hemmings, Frank Bretz October 2017 Estimands EFPIA webinar Rob Hemmings, Frank Bretz October 2017 Why estimands? To explain which treatment effect is described to prescribers and other stakeholders. To align objectives with (design and)

More information

Methods for Addressing Selection Bias in Observational Studies

Methods for Addressing Selection Bias in Observational Studies Methods for Addressing Selection Bias in Observational Studies Susan L. Ettner, Ph.D. Professor Division of General Internal Medicine and Health Services Research, UCLA What is Selection Bias? In the regression

More information

Causal Mediation Analysis with the CAUSALMED Procedure

Causal Mediation Analysis with the CAUSALMED Procedure Paper SAS1991-2018 Causal Mediation Analysis with the CAUSALMED Procedure Yiu-Fai Yung, Michael Lamm, and Wei Zhang, SAS Institute Inc. Abstract Important policy and health care decisions often depend

More information

Brief introduction to instrumental variables. IV Workshop, Bristol, Miguel A. Hernán Department of Epidemiology Harvard School of Public Health

Brief introduction to instrumental variables. IV Workshop, Bristol, Miguel A. Hernán Department of Epidemiology Harvard School of Public Health Brief introduction to instrumental variables IV Workshop, Bristol, 2008 Miguel A. Hernán Department of Epidemiology Harvard School of Public Health Goal: To consistently estimate the average causal effect

More information

What is Multilevel Modelling Vs Fixed Effects. Will Cook Social Statistics

What is Multilevel Modelling Vs Fixed Effects. Will Cook Social Statistics What is Multilevel Modelling Vs Fixed Effects Will Cook Social Statistics Intro Multilevel models are commonly employed in the social sciences with data that is hierarchically structured Estimated effects

More information

PubH 7405: REGRESSION ANALYSIS. Propensity Score

PubH 7405: REGRESSION ANALYSIS. Propensity Score PubH 7405: REGRESSION ANALYSIS Propensity Score INTRODUCTION: There is a growing interest in using observational (or nonrandomized) studies to estimate the effects of treatments on outcomes. In observational

More information

Our Theories Are Only As Good As Our Methods. John P. Barile and Anna R. Smith University of Hawai i at Mānoa

Our Theories Are Only As Good As Our Methods. John P. Barile and Anna R. Smith University of Hawai i at Mānoa Our Theories Are Only As Good As Our Methods John P. Barile and Anna R. Smith University of Hawai i at Mānoa Keywords: Theory, Science, Community Psychology, Framework Author Biographies: John P. Barile

More information

Threats and Analysis. Bruno Crépon J-PAL

Threats and Analysis. Bruno Crépon J-PAL Threats and Analysis Bruno Crépon J-PAL Course Overview 1. What is Evaluation? 2. Outcomes, Impact, and Indicators 3. Why Randomize and Common Critiques 4. How to Randomize 5. Sampling and Sample Size

More information

Researchers often use observational data to estimate treatment effects

Researchers often use observational data to estimate treatment effects This is a chapter excerpt from Guilford Publications. Propensity Score Analysis: Fundamentals and Developments. Edited by Wei Pan and Haiyan Bai. Copyright 2015. Purchase this book now: www.guilford.com/p/pan

More information

Social Change in the 21st Century

Social Change in the 21st Century Social Change in the 21st Century The Institute for Futures Studies (IF) conducts advanced research within the social sciences. IF promotes a future-oriented research perspective, and develops appropriate

More information

A note on evaluating Supplemental Instruction

A note on evaluating Supplemental Instruction Journal of Peer Learning Volume 8 Article 2 2015 A note on evaluating Supplemental Instruction Alfredo R. Paloyo University of Wollongong, apaloyo@uow.edu.au Follow this and additional works at: http://ro.uow.edu.au/ajpl

More information

John A. Nunnery, Ed.D. Executive Director, The Center for Educational Partnerships Old Dominion University

John A. Nunnery, Ed.D. Executive Director, The Center for Educational Partnerships Old Dominion University An Examination of the Effect of a Pilot of the National Institute for School Leadership s Executive Development Program on School Performance Trends in Massachusetts John A. Nunnery, Ed.D. Executive Director,

More information

Meta-analysis using HLM 1. Running head: META-ANALYSIS FOR SINGLE-CASE INTERVENTION DESIGNS

Meta-analysis using HLM 1. Running head: META-ANALYSIS FOR SINGLE-CASE INTERVENTION DESIGNS Meta-analysis using HLM 1 Running head: META-ANALYSIS FOR SINGLE-CASE INTERVENTION DESIGNS Comparing Two Meta-Analysis Approaches for Single Subject Design: Hierarchical Linear Model Perspective Rafa Kasim

More information