Why randomize?
Rohini Pande, Harvard University and J-PAL
www.povertyactionlab.org
Agenda
I. Background on Program Evaluation
II. What is a randomized experiment?
III. Advantages and Limitations of Experiments
IV. How wrong can you go: the Vote 2002 campaign
V. Conclusions
What is Program or Impact Evaluation?
- Program evaluation is a set of techniques used to determine whether a treatment or intervention works. The treatment could be chosen by an individual or delivered by a social program.
- To determine the causal effect of the treatment we need knowledge of the counterfactual: what would have happened in the absence of the treatment?
- The true counterfactual is not observable: observational studies, or associations, will not estimate causal effects.
- The key goal of all program/impact evaluation methods is to construct or mimic the counterfactual.
Program Evaluation: The Problem of Causal Inference
- What would have happened absent the program? How do we compare a world in which a program happens to a hypothetical world in which it doesn't?
- The program either happens or it doesn't; we cannot create a world in which the program never happened.
- So how do we measure this counterfactual? We cannot do so exactly, but we can identify groups that are similar and measure their outcomes.
Counterfactuals
We compare a group that is given the program to a group that is not.
Terminology:
- Treated (T): group affected by the program
- Control (C): group unaffected by the program
Every program evaluation establishes a control group:
- Randomized: use random assignment of the program to create a control group that mimics the counterfactual.
- Non-randomized: argue that a certain excluded group mimics the counterfactual.
What is a good counterfactual?
- One where the only difference between the Treatment and Control groups is the program: in the absence of the program, Treatment and Control units would have behaved the same.
- Key question: are there other, unobservable factors that might affect the treated units but not the control units? The resulting error is called selection bias.
- Selection bias is rampant in observational studies.
Intuition
- Observational studies are contaminated with selection bias of unknown magnitude and direction. The selection bias arises because the control and treatment groups differ in ways that we do not observe.
- Answer from observational study = treatment effect + selection bias
- If selection bias is positive, an observational study overstates the treatment effect; if negative, it understates the effect.
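The decomposition above can be made concrete with a small simulation. This is a hypothetical sketch, not an example from the slides: all names and numbers are invented. Healthier people self-select into treatment, so the observed difference in means equals the true effect plus a large, positive selection-bias term.

```python
# Hypothetical simulation of selection bias (all numbers are illustrative).
import random

random.seed(0)
TRUE_EFFECT = 2.0  # assumed causal effect of the treatment on the outcome

population = []
for _ in range(100_000):
    baseline = random.gauss(50, 10)      # unobserved health propensity
    takes_treatment = baseline > 50      # healthier people self-select in
    outcome = baseline + (TRUE_EFFECT if takes_treatment else 0.0)
    population.append((takes_treatment, outcome))

treated = [y for t, y in population if t]
control = [y for t, y in population if not t]

observed = sum(treated) / len(treated) - sum(control) / len(control)
selection_bias = observed - TRUE_EFFECT

print(f"observed difference: {observed:.2f}")       # far above the true 2.0
print(f"selection bias:      {selection_bias:.2f}")
```

Because selection here is driven entirely by the unobserved baseline, the observed difference is roughly 18 rather than the true effect of 2: the selection bias dwarfs the treatment effect.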
Intuition
- Observational studies are often used to gauge the effect of exercise on health, of extra classes on SAT scores, or of video games on exam performance. All are contaminated with selection bias.
- The key objective of program evaluation is finding or creating situations where the selection-bias term is non-existent or very small. Good impact studies are those with small or zero selection bias.
- This happens when Treatment and Control groups differ only by exposure to the program; if they differ in the absence of the program, then we're not estimating a causal effect.
Types of Impact Evaluation Methods
(1) Randomized Evaluations: use a coin flip to assign subjects to Treatment and Control groups, give the treatment to the Treatment group, and let the Control group provide the counterfactual.
Also known as:
- Random assignment studies
- Randomized field trials
- Social experiments
- Randomized controlled experiments
Types of Impact Evaluation Methods (cont.)
(2) Non-Experimental or Quasi-Experimental Methods
- Simple difference: compare outcomes of the treatment and control groups, where the control group wasn't exposed to the program for 'exogenous' reasons.
- Differences-in-differences: compare changes over time between the treatment group and the control group.
- Statistical matching: identify a control group based on observables.
- Instrumental variables: predict treatment as a function of variables that don't directly affect the outcome of interest.
II WHAT IS A RANDOMIZED EVALUATION? www.povertyactionlab.org
The basics
Start with the simple case: take a sample of program applicants and randomly assign them to either:
- Treatment group: receives the treatment
- Control group: not allowed to receive the treatment (during the evaluation period)
Random Assignment
- Suppose 50% of a group of individuals are randomly 'treated' with a program, without regard to their characteristics.
- If randomization is successful, individuals assigned to the treatment and control groups differ only in their exposure to the treatment. This implies that the distributions of both observable and unobservable characteristics in the two groups are statistically identical.
- Had no individual been exposed to the program, the average outcomes of the treatment and control groups would have been the same, and there would be no selection bias!
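A minimal sketch of this balance property, on made-up data: the assignment below never looks at anyone's characteristics, so the two groups end up with nearly identical average age and income (the variable names and distributions are assumptions for illustration).

```python
# Hypothetical illustration: random assignment balances covariates it never saw.
import random
import statistics

random.seed(1)
people = [{"age": random.gauss(35, 8), "income": random.gauss(400, 120)}
          for _ in range(20_000)]

random.shuffle(people)  # assignment ignores all characteristics
treatment, control = people[:10_000], people[10_000:]

for covariate in ("age", "income"):
    t_mean = statistics.mean(p[covariate] for p in treatment)
    c_mean = statistics.mean(p[covariate] for p in control)
    print(f"{covariate}: treatment {t_mean:.1f} vs control {c_mean:.1f}")
```

The same logic extends to characteristics we never measured: randomization balances unobservables in expectation, which is exactly why it eliminates selection bias.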
Basic Setup of a Randomized Evaluation
Target Population → Potential Participants → Evaluation Sample → Random Assignment → Treatment Group (Participants and No-Shows) vs. Control Group
Based on Orr (1999)
Some variations on the basics
- Assigning to multiple treatment groups
- Assigning to units other than individuals or households: health centers, schools, local governments, villages
Important factors: What is the decision-making unit? At what level can data be collected?
Key Steps in conducting an experiment
1. Design the study carefully: What is the problem? What is the key question to answer? What are the potential policy levers to address the problem?
2. Collect baseline data and randomly assign people to treatment or control
3. Verify that assignment looks random
4. Monitor the process so that the integrity of the experiment is not compromised
Key Steps in conducting an experiment (cont.)
5. Collect follow-up data for both the treatment and control groups
6. Estimate program impacts by comparing mean outcomes of the treatment group vs. mean outcomes of the control group
7. Assess whether program impacts are statistically significant and practically significant
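Steps 6 and 7 above can be sketched in a few lines. The follow-up data here are simulated (the true impact of 3.0 and all distributional choices are assumptions); the impact estimate is a simple difference in means, and significance is gauged with a hand-computed two-sample t statistic.

```python
# Hypothetical follow-up data: estimate impact and test its significance.
import math
import random
import statistics

random.seed(2)
TRUE_IMPACT = 3.0  # assumed true program effect, for this sketch only
treatment = [random.gauss(60 + TRUE_IMPACT, 12) for _ in range(2_000)]
control = [random.gauss(60, 12) for _ in range(2_000)]

# Step 6: impact = mean outcome (treatment) - mean outcome (control)
impact = statistics.mean(treatment) - statistics.mean(control)

# Step 7: two-sample t statistic; |t| > 1.96 is significant at the 5% level
se = math.sqrt(statistics.variance(treatment) / len(treatment)
               + statistics.variance(control) / len(control))
t_stat = impact / se

print(f"estimated impact: {impact:.2f}")
print(f"t statistic:      {t_stat:.1f}")
```

Practical significance is a separate judgment: even a statistically significant impact may be too small to justify the program's cost.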
III - ADVANTAGES AND LIMITATIONS OF RANDOMIZED EVALUATIONS www.povertyactionlab.org
Key advantage of experiments
Because members of the treatment and control groups do not differ systematically at the outset of the experiment, any difference that subsequently arises between them can be attributed to the treatment rather than to other factors.
Validity
A tool for assessing the credibility of a study:
- Internal validity: the ability to draw causal inferences, i.e. can we attribute our impact estimates to the program and not to something else?
- External validity: the ability to generalize to other settings of interest, i.e. can we generalize our impact estimates from this program to other populations, time periods, countries, etc.?
- Random assignment versus random sampling
Other advantages of experiments
Relative to results from non-experimental studies, results from experiments are:
- Less subject to methodological debates
- Easier to convey
- More likely to be convincing to program funders and/or policymakers
Limitations of experiments
Despite their great methodological advantage, experiments are also potentially subject to threats to their validity. For example:
- Internal validity (e.g. Hawthorne effects, survey non-response, no-shows, crossovers, duration bias, etc.)
- External validity (e.g. are the results generalizable to the population of interest?)
It is important to realize that some of these threats also affect the validity of non-experimental studies.
Other limitations of experiments
- Experiments measure the impact of the offer to participate in the program.
- Depending on the design, it may or may not be possible to understand the mechanism underlying the intervention.
- Costs (though these need to be weighed against the costs of getting the wrong answer).
- Results are partial equilibrium.
Other limitations of experiments: ethical issues
- Some public policies are implemented in randomized fashion due to lack of resources, for ethical reasons, or to remove discretion from the system. In such cases researchers can evaluate the policy immediately, e.g. the random choice of Indian villages in which leader positions are reserved for women.
- Researchers can also exploit pilots or phase-in schedules due to budgetary constraints.
A nuanced view of randomization
Randomization offers the best, gold-standard answer for any treatment that we want to offer on a universal, compulsory, or mandatory basis:
- Vaccinations
- Access to education up to a certain level
- Medical treatment for a certain condition
- Universal health insurance
Randomization is less useful in situations where the effect of the treatment varies across individuals and where individuals will be allowed to self-select into treatment:
- Attending business school
- Elective surgery
IV HOW WRONG CAN YOU GO: THE VOTE 2002 CAMPAIGN www.povertyactionlab.org
Case 1: the Vote 2002 campaign
- An intervention designed to increase voter turnout in 2002: phone calls to ~60,000 individuals, of whom only ~35,000 were reached.
- Key question: did the campaign have a positive effect (i.e. impact) on voter turnout?
- Five methods were used to estimate the impact.
Methods 1-3: based on comparing reached vs. not reached
- Method 1: simple difference in voter turnout, (voter turnout among the reached) - (voter turnout among the not reached)
- Method 2: multiple regression, controlling for some differences between the two groups
- Method 3: Method 2, but also controlling for differences in past voting behavior between the two groups
Impact Estimates using Methods 1-3
Method 1: 10.8 pp*
Method 2: 6.1 pp*
Method 3: 4.5 pp*
pp = percentage points; *: statistically significant at the 5% level
Methods 1-3 Is any of these impact estimates likely to be the true impact of the Vote 2002 campaign?
Reached vs. Not Reached
Characteristic | Reached | Not Reached | Difference
Female | 56.2% | 53.8% | 2.4 pp*
Newly registered | 7.3% | 9.6% | -2.3 pp*
From Iowa | 54.7% | 46.7% | 8.0 pp*
Voted in 2000 | 71.7% | 63.3% | 8.3 pp*
Voted in 1998 | 46.6% | 37.6% | 9.0 pp*
pp = percentage points; *: statistically significant at the 5% level
Method 4: Matching
- Similar data were available on 2,000,000 other individuals.
- Select as a comparison group the subset of those 2,000,000 individuals that resembles the reached group as closely as possible; the statistical procedure for doing so is called matching.
- To estimate the impact, compare voter turnout between the reached group and the comparison group.
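A toy sketch of nearest-neighbor matching on a single observable (all data below are invented; the real study matched on several covariates). In this sketch the calls have no effect at all, and turnout is driven entirely by an observable past-turnout score that also drives who gets reached, so the naive comparison is badly biased while matching recovers the true zero.

```python
# Hypothetical nearest-neighbor matching on one covariate (with replacement).
import random

random.seed(3)

def make_person(reached):
    # The past-turnout score drives both being reached and voting.
    past_votes = random.gauss(0.7 if reached else 0.5, 0.15)
    prob = min(max(past_votes, 0.0), 1.0)
    turnout = 1 if random.random() < prob else 0
    return {"past_votes": past_votes, "turnout": turnout}

reached = [make_person(True) for _ in range(500)]
pool = [make_person(False) for _ in range(5_000)]

def mean_turnout(group):
    return sum(p["turnout"] for p in group) / len(group)

# Naive comparison is biased: the reached simply vote more to begin with.
naive_estimate = mean_turnout(reached) - mean_turnout(pool)

# Match each reached person to the pool member with the closest score.
matched = [min(pool, key=lambda p: abs(p["past_votes"] - r["past_votes"]))
           for r in reached]
matched_estimate = mean_turnout(reached) - mean_turnout(matched)

print(f"naive estimate:   {naive_estimate:.3f}")  # far from the true 0
print(f"matched estimate: {matched_estimate:.3f}")
```

The catch, as the next slide asks: matching only balances the observables used to match. Here turnout depends on nothing else, so it works; if an unobservable also drove turnout, the matched estimate would still be biased.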
An illustration of matching Source: Arceneaux, Gerber, and Green (2004)
Impact Estimates using Matching
Matching on 4 covariates: 3.7 pp*
Matching on 6 covariates: 3.0 pp*
Matching on all covariates: 2.8 pp*
pp = percentage points; *: statistically significant at the 5% level
Method 4: Matching Is this impact estimate likely to be the true impact of the Vote 2002 campaign? Key: These two groups should be equivalent in terms of the observable characteristics that were used to do the matching. But what about unobservable characteristics?
Method 5: Randomized Experiment
- It turns out the 60,000 were randomly chosen from a population of 2,060,000 individuals. Hence treatment was randomly assigned across two groups:
- Treatment group: the 60,000 who were called
- Control group: the 2,000,000 who were not called
- To estimate the impact, compare voter turnout between the treatment and control groups, with a statistical adjustment for the fact that not everyone in the treatment group was reached.
Method 5: Randomized Experiment
- Impact estimate: 0.4 pp, not statistically significant.
- Is this impact estimate likely to be the true impact of the Vote 2002 campaign?
- Key: the treatment and control groups should be equivalent in terms of both their observable and unobservable characteristics. Hence any difference in outcomes can be attributed to the Vote 2002 campaign.
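The non-compliance adjustment mentioned above can be sketched as follows. One standard approach (an assumption on my part; the slides do not name the exact estimator) is to scale the intention-to-treat (ITT) difference by the contact rate. Only the 60,000-assigned and ~35,000-reached figures come from the deck; the turnout rates below are made up for illustration.

```python
# Sketch of a compliance adjustment: scale the ITT by the contact rate.
assigned = 60_000
reached = 35_000
contact_rate = reached / assigned          # share of assigned who got a call

turnout_treatment = 0.560                  # hypothetical mean turnout, assigned
turnout_control = 0.558                    # hypothetical mean turnout, control

itt = turnout_treatment - turnout_control  # effect of the *offer* of a call
effect_on_reached = itt / contact_rate     # effect of actually being reached

print(f"contact rate:          {contact_rate:.2f}")
print(f"ITT estimate:          {itt * 100:.2f} pp")
print(f"effect on the reached: {effect_on_reached * 100:.2f} pp")
```

Because assignment (not contact) was randomized, the honest comparison is assigned vs. not assigned; dividing by the contact rate then rescales that tiny ITT difference into an estimate for those actually reached, rather than comparing reached to not-reached directly.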
Summary Table
Method | Estimated Impact
1. Simple difference | 10.8 pp*
2. Multiple regression | 6.1 pp*
3. Multiple regression with panel data | 4.5 pp*
4. Matching | 2.8 pp*
5. Randomized experiment | 0.4 pp
pp = percentage points; *: statistically significant at the 5% level
V CONCLUSIONS
Conclusions
- Good public policies require knowledge of causal effects.
- Causal effects are estimated only if we have good counterfactuals; in the absence of good counterfactuals, analysis is contaminated with selection bias.
- Be suspicious of causal claims that come from observational studies.
- Randomization offers a solution for generating good counterfactuals.
Conclusions
- If properly designed and conducted, social experiments provide the most credible assessment of the impact of a program.
- Results from social experiments are easy to understand and much less subject to methodological quibbles.
- Credibility + ease of interpretation make experiments more likely to convince policymakers and funders of the effectiveness (or lack thereof) of a program.
Conclusions (cont.)
- However, these advantages are present only if social experiments are well designed and properly conducted.
- We must assess the validity of experiments in the same way we assess the validity of any other study.
- We must be aware of the limitations of experiments.
THE END