Title: Optimal Multilevel Matching Using Network Flows
Authors: Samuel D. Pimentel, University of Pennsylvania, spi@wharton.upenn.edu
Lindsay C. Page, University of Pittsburgh, lpage@pitt.edu [presenting author]
Matthew Lenard, Wake County Public School System
Luke Keele, Georgetown University, lk681@georgetown.edu

Problem: When an educational intervention or treatment of interest is not randomly assigned, matching estimators are one method of adjustment designed to mimic a randomized controlled trial (RCT) by constructing a set of treated and control units that are highly comparable on observed, pretreatment characteristics. Existing tools are well developed for selection and matching at the unit or student level. In education, however, because of the organizational structure of schools, interventions are often assigned at the group or school level rather than the unit level. As a result, many observational studies of causal effects occur in settings with clustered treatment assignment. For example, an educational intervention might be administered to all or a subset of the students in one school while being withheld from all students in another school. Just as we can use matching to mimic an individual-level RCT, we can use matching to mimic a group RCT by creating comparable treatment and comparison clusters to overcome overt bias. Yet, compared to the extensive literature on individual-level matching, strategies for optimally handling group-level matches are much less well developed. In this paper, we develop a conceptual framework and a method of statistical adjustment for multilevel matching problems. We then apply these tools to a study of a summer reading program implemented in selected schools within the Wake County Public School System (Wake County, NC).

Prior methodological research: To date, there has been limited work on matching in multilevel contexts. Steiner et al. (2013) consider matching in multilevel settings, but they assume that treatment assignment occurs at the student level rather than the school level. Other efforts to translate standard methods of adjustment to data with a multilevel structure use hierarchical regression or propensity score stratification (Hong and Raudenbush, 2006; Arpino and Mealli, 2011; Li et al., 2013). When translated to the multilevel context, however, these methods can suffer from drawbacks such as lack of model convergence or a tendency to extrapolate over areas without common support. Our research extends recent work on matching algorithms designed to be optimal for multilevel settings (Zubizarreta and Keele, in press).

Method: We first outline two possible assignment mechanisms at the cluster level: (1) selection of entire groups, such as whole schools, into treatment, and (2) selection of subsets of students from within a group into treatment. The second circumstance might arise if schools are selected for a treatment that is then applied only to students with certain characteristics, such as being above or below a specific academic threshold (e.g., gifted or remedial programs, respectively). We then derive different research designs based on these assignment mechanisms. For each design, we develop a matching algorithm for multilevel data based on a network flow algorithm. Earlier work on multilevel matching relied on integer programming, which allows balance targeting on specific covariates but can be slow with larger data sets. While we cannot target balance on specific covariates, our algorithm is quite fast and scales easily to larger data sets. We also allow the algorithm to trim treated observations to increase balance and improve overlap in the covariate distributions.

Setting: We apply our algorithm to investigate the effectiveness of a summer school reading intervention implemented within the Wake County Public School System (WCPSS; Wake County, North Carolina). North Carolina state legislation required students who did not meet district standards at the end of 3rd grade to attend summer reading camps or risk retention. In summer 2013, WCPSS selected myON, a product of Capstone Digital, for implementation at Title I summer school sites in an effort to boost reading comprehension among the majority low-SES attendees. myON is internet-based software that serves primarily as an electronic reading platform: it provides students with access to a library of books and suggests titles based on topic interests and reading ability. Students at myON sites used the program for up to one half hour during the daily literacy block and could continue using it at home if they had a device and an internet connection. Not all summer school students in WCPSS were given access to the myON reading program. Rather, it was used by teachers at eight of the 19 summer school sites, which were selected based on a mix of factors including internet bandwidth, computer access, and regional distribution. Students were assigned to summer school sites primarily through geographic proximity; thus all students in a school close to a myON summer school site used the program during summer school. Principals and schools themselves had no input into program participation.

Analysis & Findings: Given that the intervention was assigned to entire elementary schools, we use our multilevel matching strategy to construct a matched sample against which to compare treated schools and students. We then use multilevel modeling to examine the impact of myON on student outcomes. We find that the intervention does not appear to increase reading test scores; however, a sensitivity analysis shows that an unobserved confounder could easily mask a larger treatment effect.
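As a rough illustration of the cluster-level stage of such a design (not the paper's actual network-flow implementation), the sketch below pairs treated and control schools on school-level covariate means with an optimal assignment solver, which is itself a special case of min-cost network flow; the data frame layout, column names, and distance metric are assumptions made for the example.

```python
# Hypothetical sketch of a two-level match: pair each treated school with a
# control school that is close on school-level covariate means. This is a
# simplified stand-in for the paper's network-flow algorithm.
import numpy as np
import pandas as pd
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def match_schools(students: pd.DataFrame, covariates: list) -> list:
    """students: one row per student with columns 'school', 'treated', and covariates."""
    # Aggregate student covariates to school-level means (cluster-level profiles).
    schools = students.groupby("school").agg(
        treated=("treated", "max"), **{c: (c, "mean") for c in covariates}
    )
    treated = schools[schools.treated == 1]
    control = schools[schools.treated == 0]

    # Mahalanobis-type distance between treated and control school profiles.
    vi = np.linalg.pinv(np.cov(schools[covariates].T))
    dist = cdist(treated[covariates], control[covariates], metric="mahalanobis", VI=vi)

    # Optimal one-to-one pairing of treated to control schools; the rectangular
    # assignment problem is a special case of min-cost network flow.
    rows, cols = linear_sum_assignment(dist)
    return [(treated.index[r], control.index[c]) for r, c in zip(rows, cols)]
```

Within each matched school pair, students could then be matched on individual-level covariates, and treated schools with no acceptable control could be trimmed, in the spirit of the trimming the abstract describes.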

Works Cited

Arpino, B. and Mealli, F. (2011). The specification of the propensity score in multilevel observational studies. Computational Statistics & Data Analysis, 55(4):1770-1780.

Hong, G. and Raudenbush, S. W. (2006). Evaluating kindergarten retention policy: A case study of causal inference for multilevel data. Journal of the American Statistical Association, 101(475):901-910.

Li, F., Zaslavsky, A. M., and Landrum, M. B. (2013). Propensity score weighting with multilevel data. Statistics in Medicine, 32(19):3373-3387.

Steiner, P., Kim, J.-S., and Thoemmes, F. (2013). Matching strategies for observational multilevel data. In JSM Proceedings, pages 5020-5032.

Zubizarreta, J. R. and Keele, L. (in press). Optimal multilevel matching in clustered observational studies: A case study of the effectiveness of private schools under a large-scale voucher system.

Title: Analyzing multilevel experiments in the presence of peer effects
Authors: Guillaume Basse, Harvard University, gbasse@fas.harvard.edu
Avi Feller, UC Berkeley, afeller@berkeley.edu [presenting author]

Problem/background: Multilevel randomization is a powerful design for estimating causal effects in the presence of social interactions. A typical multilevel design has two stages. First, whole clusters (e.g., households or schools) are assigned to treatment or control. Second, units within each treated cluster are randomly assigned to treatment or control, as if each treated cluster were a separate, individually randomized experiment. This design allows researchers to assess peer effects by comparing untreated units in treated clusters with similarly untreated units in the control clusters (Hudgens & Halloran, 2008). There has been substantial interest in multilevel randomization in recent years, with prominent examples in economics (Crepon, Duflo, Gurgand, Rathelot, & Zamora, 2013), political science (Sinclair, McConnell, & Green, 2012), and public health (Hudgens & Halloran, 2008). This randomized design, however, is relatively rare in education; see, for example, Somers et al. (2010). By contrast, several important papers discuss the non-randomized analog, including Hong & Raudenbush (2006).

Prior methodological research: There is a small but growing methodological literature on analyzing multilevel experiments. Key citations include Hudgens & Halloran (2008), Liu & Hudgens (2014), and Rigdon & Hudgens (2015) in statistics; Sinclair et al. (2012) in political science; and Baird, Bohren, McIntosh, & Ozler (2014) in economics. Kang & Imbens (2016) explore a related experimental design known as a peer encouragement design. Finally, Weiss, Lockwood, & McCaffrey (2016) discuss methods for analyzing individually randomized group treatment designs, which appear similar to multilevel experiments on the surface but rely on fundamentally distinct assumptions.

Method: Our paper addresses three key practical issues that arise in analyzing multilevel experiments. First, researchers are often interested in an intervention's overall impact on students rather than on clusters; that is, researchers might give equal weight to each individual rather than equal weight to each cluster. While this distinction is well known in the literature on cluster randomization (Schochet, 2013), existing approaches for multilevel experiments either focus on equal weights for clusters (Hudgens & Halloran, 2008) or sidestep the issue by assuming clusters are of equal size (Baird et al., 2014). We propose unbiased estimators for a broad class of individual- and cluster-weighted estimands, with corresponding theoretical and estimated variances. We also derive the bias of a simple difference in means for estimating individual-weighted estimands. Second, we connect two common approaches for analyzing multilevel designs: linear regression, which is more common in the social sciences, and randomization inference, which is more common in epidemiology and public health. We show that, with suitably chosen standard errors and correct weights, weighted regression and randomization inference yield identical point and variance estimates. These results hold for both individual- and cluster-weighted estimands, greatly extending existing results. We believe this equivalence will be important in practice, since the vast majority of applied papers in this area take a regression-first approach to analysis that can obfuscate key inferential issues. Finally, we propose methods for incorporating covariates to improve precision, with a focus on post-stratification and model-assisted estimation. While largely ignored in the existing methodological literature, we find that covariate adjustment is quite important in practice.

Setting: Our motivating example is a large randomized evaluation of an intervention targeting student absenteeism among elementary and high school students in the School District of Philadelphia over the 2014-2015 school year (Rogers & Feller, 2016). In this study, parents of at-risk students were randomly assigned to a direct mail intervention with tailored information about their students' attendance over the course of the year. In treated households with multiple eligible students, one student was selected at random to be the subject of the mailings, following a multilevel randomization. We consider a subset of 3,804 households with between 2 and 7 eligible students each, with 8,496 total students.

Analysis & Findings: The left panel of Figure 1 shows the impact of random assignment on chronic absenteeism, defined as being absent 18 or more school days (results are similar for log-absences). The primary effect, defined as the impact of assignment on the target of the mailings, is roughly 4 percentage points for both household- and individual-weighted estimands. The spillover effect, defined as the impact of being an untreated individual in a treated household, is roughly 3 percentage points for both household- and individual-weighted estimands. The right panel of Figure 1 shows the same results after adjusting for pre-treatment covariates, especially prior-year attendance, both via a model-assisted estimator and via a model-assisted estimator combined with post-stratification. Both adjustment methods yield considerable gains in precision. These results are consistent with extensive simulation studies, which show similar patterns for simulated data that mimic the applied example. Overall, these results show that spillover effects can be quite large in magnitude, nearly as large as the primary effects.

Conclusion: Multilevel randomizations are important designs in settings with interactions between units. This paper addresses issues that arise when analyzing such designs. First, we address issues that arise when cluster sizes vary. Second, we demonstrate that appropriately weighted regression can yield point and variance estimates identical to fully randomization-based methods. Methodologically, we believe this is a useful addition to the literatures on both causal inference with interference and randomization-based inference. Substantively, we find important insights into the intra-household dynamics of student behavior.
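To make the household- versus individual-weighted distinction concrete, here is a minimal difference-in-means sketch under hypothetical column names (household, hh_treated, ind_treated, outcome); it deliberately omits the paper's unbiased estimators, variance formulas, and the careful choice of comparison students within control households.

```python
# Hypothetical sketch: primary and spillover contrasts under household vs.
# individual weighting in a two-stage (multilevel) randomization.
import pandas as pd

def weighted_effects(df: pd.DataFrame) -> dict:
    # Assignment categories in the two-stage design.
    targeted   = df[(df.hh_treated == 1) & (df.ind_treated == 1)]  # target of the mailings
    untargeted = df[(df.hh_treated == 1) & (df.ind_treated == 0)]  # untreated student, treated household
    control    = df[df.hh_treated == 0]                            # student in a control household

    def mean_outcome(group: pd.DataFrame, by_household: bool) -> float:
        if by_household:
            # Household weighting: average within each household first, then across households.
            return group.groupby("household").outcome.mean().mean()
        # Individual weighting: every student counts equally.
        return group.outcome.mean()

    return {
        "primary_household":    mean_outcome(targeted, True)  - mean_outcome(control, True),
        "primary_individual":   mean_outcome(targeted, False) - mean_outcome(control, False),
        "spillover_household":  mean_outcome(untargeted, True)  - mean_outcome(control, True),
        "spillover_individual": mean_outcome(untargeted, False) - mean_outcome(control, False),
    }
```

When household sizes vary, the two weighting schemes generally target different estimands, which is exactly the distinction the paper's estimators are built to handle.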

References

Baird, S., Bohren, J. A., McIntosh, C., & Ozler, B. (2014). Designing Experiments to Measure Spillover Effects.

Crepon, B., Duflo, E., Gurgand, M., Rathelot, R., & Zamora, P. (2013). Do Labor Market Policies have Displacement Effects? Evidence from a Clustered Randomized Experiment. The Quarterly Journal of Economics, 128(2), 531-580. http://doi.org/10.1093/qje/qjt001

Hong, G., & Raudenbush, S. W. (2006). Evaluating Kindergarten Retention Policy. Journal of the American Statistical Association, 101(475), 901-910. http://doi.org/10.1198/016214506000000447

Hudgens, M. G., & Halloran, M. E. (2008). Toward Causal Inference With Interference. Journal of the American Statistical Association, 103(482), 832-842. http://doi.org/10.1198/016214508000000292

Kang, H., & Imbens, G. W. (2016). Peer Encouragement Designs in Causal Inference with Partial Interference and Identification of Local Average Network Effects. arXiv.org.

Liu, L., & Hudgens, M. G. (2014). Large Sample Randomization Inference of Causal Effects in the Presence of Interference. Journal of the American Statistical Association, 109(505), 288-301. http://doi.org/10.1080/01621459.2013.844698

Rigdon, J., & Hudgens, M. G. (2015). Exact confidence intervals in the presence of interference. Statistics and Probability Letters, 105, 130-135. http://doi.org/10.1016/j.spl.2015.06.011

Rogers, T., & Feller, A. (2016). Reducing Student Absences at Scale, 1-13.

Schochet, P. Z. (2013). Estimators for Clustered Education RCTs Using the Neyman Model for Causal Inference. Journal of Educational and Behavioral Statistics, 38(3), 219-238. http://doi.org/10.3102/1076998611432176

Sinclair, B., McConnell, M., & Green, D. P. (2012). Detecting Spillover Effects: Design and Analysis of Multilevel Experiments. American Journal of Political Science, 56(4), 1055-1069. http://doi.org/10.1111/j.1540-5907.2012.00592.x

Somers, M.-A., et al. (2010). The Enhanced Reading Opportunities Study Final Report: The Impact of Supplemental Literacy Courses for Struggling Ninth-Grade Readers, 1-30.

Weiss, M. J., Lockwood, J. R., & McCaffrey, D. F. (2016). Estimating the Standard Error of the Impact Estimator in Individually Randomized Trials With Clustering. Journal of Research on Educational Effectiveness, 9(3), 421-444. http://doi.org/10.1080/19345747.2015.1086911

Figures

Figure 1: The left panel shows primary and spillover effects of randomization on chronic absenteeism (treatment effects in percentage points) for household- and individual-weighted estimands. The right panel shows the impact on individual-weighted estimands adjusting for covariates, via model-assisted estimation and model-assisted estimation combined with post-stratification. PS denotes post-stratification on the number of individuals within each household.

Title: Methods for generalizing treatment effects from cluster randomized trials to target populations
Authors: Elizabeth A. Stuart, Johns Hopkins Bloomberg School of Public Health, estuart@jhu.edu [Presenting author]
Robert B. Olsen, Rob Olsen LLC, rob.olsen.research@gmail.com
Cyrus Ebnesajjad, The Fred Hutchinson Cancer Research Center, cebnesaj@fredhutch.org
Stephen H. Bell, Abt Associates, Stephen_bell@abtassoc.com
Larry L. Orr, Johns Hopkins Bloomberg School of Public Health, larry.orr.consulting@gmail.com

Abstract: The ultimate goal of many educational evaluations is to inform policy, e.g., to help policy makers understand the effects of interventions or programs in target populations of policy interest. While randomized trials provide internal validity (they yield unbiased effect estimates for the subjects in the study sample), there is growing awareness that they may not provide external validity: the ability to estimate what the effects would be in other, target populations or in some other context (see, e.g., Cook, 2014; Olsen et al., 2013; Tipton, 2014). Recently developed statistical methods that combine trial data with data on a population of interest have the potential to retain the strong internal validity of trials while also enhancing external validity (Kern et al., 2016; O'Muircheartaigh & Hedges, 2014; Stuart et al., 2011). These methods fall into two broad classes: (1) flexible regression models of the outcome as a function of treatment status and covariates, and (2) reweighting methods that weight the RCT sample to reflect the covariate distribution in the population. To this point, however, existing methods have focused on a relatively simple scenario with individual-level randomization, ignoring the common complication of clustering. In particular, many educational evaluations are conducted by selecting a sample of schools or districts and then randomizing students within those schools.

This talk will discuss the extension of existing generalizability methods to multilevel settings such as cluster randomized trials. The discussion will include the pros and cons of using data aggregated to the cluster level versus fully utilizing the multilevel structure, and the implications of those two approaches for analysis. The talk will also present results from simulation studies examining the performance of methods in each of the two broad classes (outcome regression models and reweighting approaches), extended to this multilevel setting. The simulations are designed to be as realistic as possible, based on data from a nationally representative sample of public school students (the ECLS-K), empirical evidence on impact variation in two large-scale RCTs in education, and evidence on the types of schools that were selected for several RCTs in education (see Stuart et al., in press).

We find that, perhaps not surprisingly given the broader literature on the benefits of multilevel models, using only aggregate data does not perform as well as using the multilevel structure. This is true even though selection into the trial is at the school level, and thus one might expect school-selection weights to be sufficient. Initial evidence suggests that part of the reason for this poor performance is the relatively small number of schools in the hypothetical trials, especially relative to the very large size of the population, which makes it hard to estimate accurate weights. When a multilevel structure is used in the analysis, and when the assumptions underlying each approach are satisfied, each approach (outcome modeling and reweighting) works well. However, when key assumptions are violated (for example, if we do not observe all of the factors that moderate treatment effects and that differ between the RCT sample and the target population), none of the methods consistently estimates the population effects.
We conclude with recommendations for practice, including the need for thorough and consistent covariate measurement and a better understanding of treatment effect heterogeneity. This work helps to identify the conditions under which different statistical methods can reduce external validity bias in educational evaluations.
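For the reweighting class of methods, the following is a minimal school-level sketch, assuming a made-up in_trial indicator and covariate list; it is meant only to illustrate inverse-odds-of-participation weights, not the specific estimators evaluated in the simulations.

```python
# Hypothetical sketch of the reweighting approach at the school level: model
# each population school's probability of being in the trial, then weight
# trial schools by the inverse odds of participation so the weighted trial
# sample mirrors the target population's covariate distribution.
import pandas as pd
from sklearn.linear_model import LogisticRegression

def generalization_weights(schools: pd.DataFrame, covariates: list) -> pd.Series:
    """schools: one row per school in the target population, with an
    'in_trial' indicator (1 = participated in the RCT) and school-level covariates."""
    model = LogisticRegression(max_iter=1000)
    model.fit(schools[covariates], schools.in_trial)
    p = model.predict_proba(schools[covariates])[:, 1]  # P(in trial | covariates)
    w = (1 - p) / p                                      # inverse odds of participation
    # Nonzero weights only for trial schools.
    return pd.Series(w, index=schools.index).where(schools.in_trial == 1, 0.0)
```

Consistent with the simulation findings described above, weights of this kind can be unstable when the number of trial schools is small relative to the population, and they remove bias only if the covariates capture all effect moderators that differ between the trial sample and the target population.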

References:

Cook, T. D. (2014). Generalizing causal knowledge in the policy sciences: External validity as a task of both multi-attribute representation and multi-attribute extrapolation. Journal of Policy Analysis and Management, 33, 527-536.

Kern, H.L., Stuart, E.A., Hill, J., and Green, D.P. (2016). Assessing methods for generalizing experimental impact estimates to target populations. Forthcoming in Journal of Research on Educational Effectiveness. Published online 14 January 2016.

Tipton, E. (2014). How generalizable is your experiment? An index for comparing experimental samples and populations. Journal of Educational and Behavioral Statistics, 39, 478-501.

Olsen, R., Bell, S., Orr, L., & Stuart, E. A. (2013). External validity in policy evaluations that choose sites purposively. Journal of Policy Analysis and Management, 32, 107-121.

O'Muircheartaigh, C., & Hedges, L. V. (2014). Generalizing from unrepresentative experiments: A stratified propensity score approach. Journal of the Royal Statistical Society: Series C, 63, 195-210.

Stuart, E.A., Bell, S.H., Ebnesajjad, C., Olsen, R.B., and Orr, L.L. (in press). Characteristics of school districts that participate in rigorous national educational evaluations. Forthcoming in Journal of Research on Educational Effectiveness.

Stuart, E.A., Cole, S.R., Bradshaw, C.P., and Leaf, P.J. (2011). The use of propensity scores to assess the generalizability of results from randomized trials. Journal of the Royal Statistical Society, Series A, 174(2), 369-386. PMCID: 4051511.

Title: Covariate restrictions for estimating principal causal effects in single- and multi-site trials
Authors: Avi Feller, UC Berkeley, afeller@berkeley.edu
Luke W. Miratrix, Harvard University, lmiratrix@g.harvard.edu [presenting author]
Lo-Hua Yuan, Harvard University, lohuayuan@fas.harvard.edu

Problem: Researchers are often interested in causal effects for subgroups defined by post-treatment outcomes. Prominent examples in education include noncompliance, attrition, and implementation fidelity. Unfortunately, estimating principal causal effects (PCEs) in these principal strata or endogenous subgroups can be difficult, because subgroup membership is not fully observed. A large and growing literature addresses the associated methodological challenges (see Page, Feller, Grindal, Miratrix, & Somers (2015) for a recent review).

Prior methodological research: Many existing strategies for estimating PCEs depend crucially on covariates. In particular, covariates can be used to directly identify PCEs given appropriate additional assumptions. A broad range of methods fall under this umbrella, and despite making similar assumptions, they often look quite different on the surface. In general, we divide these methods, which are gaining increased attention in education research, into two broad categories:

Single-site methods. These approaches generalize standard instrumental variable (IV) methods, positing the existence of an additional covariate that functions like an instrument in standard IV methods. Key citations are Jo (2002), Peck (2003), Ding, Geng, Yan, & Zhou (2011), and Mealli & Pacini (2013).

Multi-site methods. These approaches leverage the multi-site design common in large-scale randomized experiments to identify PCEs via site-level regressions. Key citations are Gennetian, Bos, & Morris (2002), Kling, Liebman, & Katz (2007), Reardon & Raudenbush (2013), and Kolesár, Chetty, Friedman, Glaeser, & Imbens (2014).

In separate work, we address other methods for estimating PCEs, including finite mixture models and partial identification/bounds, as well as methods that use covariates to increase precision rather than as a primary means of identification.

Method: The goals of this paper are (1) to unify and extend current methods that use covariate restrictions to estimate PCEs, and (2) to bridge the divide between single-site and multi-site methods. First, we situate existing methods in a common framework, reframing these approaches within a unified notation. This allows us to directly compare and contrast these methods and their (sometimes quite stringent) assumptions. It also allows us to characterize the bias due to violations of these assumptions. Using these analytic comparisons, researchers can better assess the large menu of available methodological options for their application. Second, we connect the analysis frameworks for single- and multi-site trials. As we show, the critical difference between these settings is the ability, in the multi-site context, to assume that the observed sites are a sample from some (possibly hypothetical) population. For example, consider a categorical covariate X. In a single-site context, with X something such as grade level, we might identify PCEs by making the very strong assumption that grade-specific treatment effects are all equal. In a multi-site context, X would instead be an indicator for site membership. As we show, we can identify PCEs in this context by instead assuming that the site-specific treatment effects are equal in expectation. We can justify this assumption of equality in expectation by viewing the units at different levels of X as clusters sampled from a larger population, which is precisely the multi-site formulation. In short, the assumption of structure across sites relaxes the very strong assumptions within sites, an option not typically available using covariates in a single-site study. We further connect to similar assumptions and corresponding diagnostics in the literature on ecological inference (Gelman, Park, Ansolabehere, Price, & Minnite, 2001).

Setting: We complement our analytical results with simulation studies built from datasets constructed by imputing missing class membership and potential outcomes from real-world studies, which preserves much of the structure among the covariates and outcomes. Using these studies, we compare the performance of the different techniques under a variety of plausible circumstances. Finally, we apply these methods to two common data sets that represent the type of data increasingly available to researchers, the JOBS II study (e.g., Jo, 2002) and the Head Start Impact Study (Puma et al., 2010), and compare the resulting treatment effect estimates to each other and to plausible baseline values. For JOBS II, the two principal causal effects of interest are the effect of randomization on depression score for Compliers (those who would enroll if offered) and for Never Takers (those who would never enroll). As Jo (2002) notes, there is a concern that randomization might have a negative impact on Never Takers, even though randomization does not change their enrollment behavior. We therefore do not want to assume ex ante that the PCE for this group is zero, the assumption necessary for a classic IV approach. For the HSIS, we also have two subgroups of interest, defined by their counterfactual care setting; see Feller, Grindal, Miratrix, & Page (2016).

Conclusions: Our primary contribution is to lay out the assumptions behind several estimation methods in stark relief. As some of these assumptions seem implausible in practice, we feel it is important that researchers who use these methods do so with both eyes open. More encouragingly, researchers can use our framework to tailor different methods to their particular application, mixing and matching assumptions under two-sided noncompliance, for example. This points the way to tackling more difficult estimation problems by building a body of assumptions from the ground up, rooted in substance. Importantly, we find that these assumptions matter in practice. The estimators that rely on alternative sets of assumptions in the JOBS II dataset, for example, do in fact give different results. The simulation studies also show that violating the assumptions is a meaningful concern. And the theoretical results demonstrate how, under such misspecification, the final estimates can be misleading quantities, e.g., a re-scaled difference in means between two incomparable populations.
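To illustrate the multi-site logic under the equal-in-expectation assumption, here is a minimal sketch for one-sided noncompliance with Complier and Never Taker strata; the column names and the plain least-squares fit are illustrative assumptions, not the estimators developed in the paper.

```python
# Hypothetical sketch of the multi-site idea: if stratum-specific effects are
# equal in expectation across sites, the site-level ITT is (in expectation) a
# mixture of the two principal causal effects weighted by the site's stratum
# shares, so a no-intercept site-level regression recovers the PCEs.
import numpy as np
import pandas as pd

def multisite_pce(sites: pd.DataFrame) -> dict:
    """sites: one row per site with columns
       itt  -- estimated intent-to-treat effect at the site,
       pi_c -- share of Compliers at the site (treatment-group take-up rate);
    under one-sided noncompliance the Never Taker share is 1 - pi_c."""
    X = np.column_stack([sites.pi_c.values, 1.0 - sites.pi_c.values])  # [pi_C, pi_NT]
    coef, *_ = np.linalg.lstsq(X, sites.itt.values, rcond=None)        # no intercept
    return {"pce_compliers": float(coef[0]), "pce_never_takers": float(coef[1])}
```

In the single-site analogue discussed above, the corresponding step would instead require assuming that the stratum-specific effects are literally equal across levels of an observed covariate such as grade level, which is the much stronger assumption.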

References

Ding, P., Geng, Z., Yan, W., & Zhou, X.-H. (2011). Identifiability and Estimation of Causal Effects by Principal Stratification With Outcomes Truncated by Death. Journal of the American Statistical Association, 106(496), 1578-1591. http://doi.org/10.1198/jasa.2011.tm10265

Feller, A., Grindal, T., Miratrix, L. W., & Page, L. C. (2016). Compared to What? Variation in the Impacts of Early Childhood Education by Alternative Care Type. Annals of Applied Statistics.

Gelman, A., Park, D. K., Ansolabehere, S., Price, P. N., & Minnite, L. C. (2001). Models, assumptions and model checking in ecological regressions. Journal of the Royal Statistical Society: Series A (Statistics in Society), 164(1), 101-118. http://doi.org/10.1111/1467-985x.00190

Gennetian, L. A., Bos, J. M., & Morris, P. A. (2002). Using instrumental variables analysis to learn more from social policy experiments. New York.

Jo, B. (2002). Estimation of Intervention Effects with Noncompliance: Alternative Model Specifications. Journal of Educational and Behavioral Statistics, 27(4), 385-409. http://doi.org/10.2307/3648123

Kling, J. R., Liebman, J. B., & Katz, L. F. (2007). Experimental Analysis of Neighborhood Effects. Econometrica, 75(1), 83-119. http://doi.org/10.2307/4123109

Kolesár, M., Chetty, R., Friedman, J. N., Glaeser, E. L., & Imbens, G. W. (2014). Identification and Inference with Many Invalid Instruments.

Mealli, F., & Pacini, B. (2013). Using Secondary Outcomes to Sharpen Inference in Randomized Experiments With Noncompliance. Journal of the American Statistical Association, 108(503), 1120-1131. http://doi.org/10.1080/01621459.2013.802238

Page, L. C., Feller, A., Grindal, T., Miratrix, L., & Somers, M. A. (2015). Principal Stratification: A Tool for Understanding Variation in Program Effects Across Endogenous Subgroups. American Journal of Evaluation, 36(4), 514-531. http://doi.org/10.1177/1098214015594419

Peck, L. R. (2003). Subgroup Analysis in Social Experiments: Measuring Program Impacts Based on Post-Treatment Choice. American Journal of Evaluation, 24(2), 157-187. http://doi.org/10.1177/109821400302400203

Reardon, S. F., & Raudenbush, S. W. (2013). Under What Assumptions Do Site-by-Treatment Instruments Identify Average Causal Effects? Sociological Methods & Research, 42(2), 143-163. http://doi.org/10.1177/0049124113494575