Planning Sample Size for Randomized Evaluations.


Planning Sample Size for Randomized Evaluations www.povertyactionlab.org

Planning Sample Size for Randomized Evaluations General question: How large does the sample need to be to credibly detect a given effect size? Relevant factors: Expected effect size Variability in the outcomes of interest Design of the experiment: stratification, control variables, baseline data, group v. individual level randomization

Basic set up At the end of an experiment, we will compare the outcome of interest in the treatment and the comparison groups. We are interested in the difference: Mean in treatment - Mean in control = Effect size. This can also be obtained through a regression of Y on T: Y_i = α + β·T_i + ε_i, where β = effect size.

Estimation We estimate the mean in our sample, for example by computing the sample average: Ȳ = (1/n) Σ_{i=1}^{n} Y_i. If there is a lot of variation in the sample, the averages will be imprecise, so even if the real treatment effect is the same, it will be harder to detect.

A tight conclusion: Low Standard Deviation [Figure: frequency distributions of the outcome with means 50 and 60; with a low standard deviation the two distributions barely overlap.]

Less Precision: Medium Standard Deviation [Figure: the same two distributions (means 50 and 60); with a medium standard deviation they overlap substantially.]

Can we conclude anything? High Standard Deviation [Figure: the same two distributions (means 50 and 60); with a high standard deviation they overlap almost entirely.]

Confidence Intervals The goal is to figure out the true effect of the program. From our sample, we get an estimate of the program effect. What can we learn about the true program effect from this estimate? A 95% confidence interval for an estimate tells us that, with 95% probability, the true effect lies within the confidence interval. For large samples, an approximate 95% confidence interval is: β̂ ± 1.96·se(β̂).

Confidence Intervals Example 1: Sampled women Pradhans have 7.13 years of education. Sampled male Pradhans have 9.92 years of education. The difference is -2.79, with a standard error of 0.54. The 95% confidence interval is [-3.84; -1.73]. Example 2: Sampled women Pradhans have 2.45 children. Sampled male Pradhans have 2.50 children. The difference is -0.05, with a standard error of 0.26. The 95% confidence interval is [-0.55; 0.46].

Hypothesis testing Often we are interested in testing the hypothesis that the effect size is equal to zero. We want to test the null hypothesis H₀: effect size = 0 against the alternative hypothesis Hₐ: effect size ≠ 0 (other possible alternatives: Hₐ > 0, Hₐ < 0). Hypothesis testing asks: when can I reject the null in favor of the alternative?

Two types of mistakes Type I error: rejecting the null hypothesis H₀ when it is in fact true. The significance level (or α level) of a test is the probability of a type I error: α = P(reject H₀ | H₀ is true). Example 1: Female Pradhans' years of education is 7.13, and males' is 9.92 in our sample. Do female Pradhans have a different level of education, or the same? If I say they are different, how confident am I in the answer? Example 2: Female Pradhans' number of children is 2.45, and males' is 2.50 in our sample. Do female Pradhans have a different number of children, or the same? If I say they are different, how confident am I in the answer? Common levels of α: 0.05, 0.01, 0.1.

Two types of mistakes Type II error: failing to reject H₀ when it is in fact false. The power of a test is one minus the probability of a type II error: power = P(reject H₀ | effect size ≠ 0). How likely is my experiment to actually detect an effect if there actually is one? Example: If I run 100 experiments, in how many of them will I be able to reject the hypothesis that women and men have the same education at the 5% level, if in fact they are different?
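The "run 100 experiments" thought experiment can be made concrete with a small Monte Carlo simulation. This is a sketch (not from the slides): it repeatedly draws treatment and control samples with a known true effect, applies a simple two-sample z-test with the variance treated as known, and reports the share of experiments that reject the null; that share is the power.

```python
import random
import statistics

def simulated_power(effect, sd, n, n_experiments=2000, z_crit=1.96, seed=0):
    """Share of simulated experiments that reject H0: no effect.

    Each experiment draws n treatment and n control observations and
    applies a two-sample z-test (known variance, for simplicity).
    """
    rng = random.Random(seed)
    se = sd * (2.0 / n) ** 0.5  # standard error of the difference in means
    rejections = 0
    for _ in range(n_experiments):
        treat = [rng.gauss(effect, sd) for _ in range(n)]
        control = [rng.gauss(0.0, sd) for _ in range(n)]
        diff = statistics.mean(treat) - statistics.mean(control)
        if abs(diff / se) > z_crit:
            rejections += 1
    return rejections / n_experiments

# A 0.2-sd effect with ~393 units per arm is designed for ~80% power
power = simulated_power(effect=0.2, sd=1.0, n=393)
```

Running this gives a rejection rate close to 0.80, matching what the analytic power calculation promises for this design.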

Testing equality of means We use the statistic t = β̂ / se(β̂). If |t| > 1.96, we reject the hypothesis of equality at the 5% level. If |t| < 1.96, we fail to reject the hypothesis of equality at the 5% level. Example of Pradhans' education: Difference: -2.79. Standard error: 0.54. So t ≈ -5.2, and we definitely reject equality at the 5% level.
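The test and confidence interval on these slides take only a few lines. A minimal sketch, using the large-sample critical value 1.96 and the Pradhan education numbers (7.13 − 9.92 = −2.79 years, standard error 0.54):

```python
def z_test(diff, se, z_crit=1.96):
    """Two-sided test of H0: true difference = 0 (normal approximation).

    Returns the t-statistic, the 95% confidence interval, and whether
    we reject equality at the 5% level.
    """
    t = diff / se
    ci = (diff - z_crit * se, diff + z_crit * se)
    return t, ci, abs(t) > z_crit

# Education gap between sampled female and male Pradhans
t, ci, reject = z_test(-2.79, 0.54)   # |t| far above 1.96 -> reject
```

The same function applied to the children example, `z_test(-0.05, 0.26)`, gives |t| well below 1.96, so there we fail to reject equality.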

Calculating Power When planning an evaluation, with some preliminary research we can calculate the minimum sample we need in order to: Test a pre-specified null hypothesis (e.g. treatment effect = 0), For a pre-specified significance level (e.g. 0.05), Given a pre-specified effect size (e.g. 0.2 standard deviations of the outcome of interest), To achieve a given power. A power of 80% tells us that, in 80% of experiments of this sample size conducted in this population, if H₀ is in fact false (e.g. the treatment effect is not zero), we will be able to reject it. The larger the sample, the larger the power. Common powers used: 80%, 90%.

Ingredients for a power calculation in a simple study (what we need, and where we get it):
Significance level — conventionally set at 5%.
The mean and the variance of the outcome in the comparison group — from a small survey in the same or a similar population.
The effect size that we want to detect — the smallest effect that should prompt a policy response. Rationale: if the effect is any smaller than this, then it is not interesting to distinguish it from zero.

Picking an effect size What is the smallest effect that would justify adopting the program? Compare the cost of the program with the benefits it brings, and with the alternative uses of the money. If the effect is smaller than that, it might as well be zero: we are not interested in proving that a very small effect is different from zero. In contrast, any effect larger than that would justify adopting the program: we want to be able to distinguish it from zero.

Standardized Effect Sizes How large an effect you can detect with a given sample depends on how variable the outcome is. Example: if all children have very similar learning levels without the program, a very small impact will be easy to detect. The standardized effect size δ is the effect size divided by the standard deviation of the outcome: δ = effect size / st.dev. Common standardized effect sizes: 0.2 (small), 0.5 (medium), 0.8 (large).
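Once the effect size is standardized, the simple-study sample size has a closed form. A sketch under the usual normal approximation for a two-sided test with equal-sized arms: n per arm = 2(z₁₋α/₂ + z_power)² / δ².

```python
from statistics import NormalDist

def n_per_arm(delta, alpha=0.05, power=0.80):
    """Sample size per arm needed to detect a standardized effect delta
    with a two-sided test (normal approximation, equal-sized arms)."""
    z = NormalDist().inv_cdf
    return 2 * (z(1 - alpha / 2) + z(power)) ** 2 / delta ** 2

# Small, medium, and large standardized effects
small, medium, large = n_per_arm(0.2), n_per_arm(0.5), n_per_arm(0.8)
# small effects need far larger samples: ~392 vs ~63 vs ~25 per arm
```

This makes the key trade-off visible: halving the detectable effect size quadruples the required sample.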

Power Calculations using the OD software Choose "Power vs. number of clusters" in the cluster randomized trials menu.

Cluster Size Choose a cluster size of 1 unit (i.e. individual-level randomization).

Choose Significance Level and Treatment Effect Pick α: normally you pick 0.05. Pick δ: you can experiment with 0.20. You obtain the resulting graph showing power as a function of sample size.

Power and Sample Size

The Design factors that influence power Clustered design Availability of a Baseline Availability of Control Variables, and Stratification. The type of hypothesis that is being tested.

Clustered Design Cluster randomized trials are experiments in which social units or clusters, rather than individuals, are randomly allocated to intervention groups. Examples (program — unit of randomization): PROGRESA — village; Gender Reservations — Panchayat; Flipcharts, Deworming — school; Iron supplementation — family.

Reasons for adopting cluster randomization: Need to minimize or remove contamination. Example: in the deworming program, the school was chosen as the unit because worms are contagious. Basic feasibility considerations. Example: the PROGRESA program would not have been politically feasible if some families in a village were included and others were not. Only natural choice. Example: any education intervention that affects an entire classroom (e.g. flipcharts, teacher training).

Impact of Clustering The outcomes for all the individuals within a unit may be correlated: All villagers are exposed to the same weather. All Panchayats share a common history. All students share a schoolmaster. The program affects all students at the same time. The members of a village interact with each other. We call ρ (rho) the correlation between units within the same cluster.

Implications For Design and Analysis Analysis: the standard errors will need to be adjusted to take into account the fact that observations within a cluster are correlated. Adjustment factor (design effect): for a given total sample size, with clusters of size m and intracluster correlation ρ, the size of the smallest effect we can detect increases by √(1 + (m − 1)ρ) compared to a non-clustered design. Design: we need to take clustering into account when planning sample size.
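The design effect is simple enough to compute directly. A minimal sketch of the adjustment described above (the hypothetical numbers in the comment are for illustration only):

```python
def design_effect(m, rho):
    """Variance inflation factor from randomizing clusters of size m
    with intracluster correlation rho, vs. individual randomization."""
    return 1 + (m - 1) * rho

def mde_inflation(m, rho):
    """Factor by which the minimum detectable effect size grows:
    the square root of the design effect."""
    return design_effect(m, rho) ** 0.5

# Illustration: 40 pupils per school with rho = 0.1 inflates the
# variance ~4.9x, so the smallest detectable effect grows ~2.2x
deff = design_effect(40, 0.1)
```

Note how quickly a modest ρ bites once clusters are large: with m = 40, even ρ = 0.1 more than doubles the minimum detectable effect.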

Example: detectable treatment size vs. rho [Table: minimum detectable effect size by intraclass correlation and randomized group size.]

Implications We now need to consider ρ when choosing the sample size (as well as the other design effects). It is extremely important to randomize an adequate number of groups. Often the number of individuals within groups matters less than the number of groups.

Choosing the number of clusters with a known number of units Example: randomization of a treatment at the classroom level with 20 students per class: Choose the other options as before. Set the number of students per cluster (e.g. 20).

Power Against Number of Clusters with 20 Students per Cluster

Choosing the number of clusters when we can choose the number of units To choose how many Panchayats to survey and how many villages per Panchayat, in order to detect whether water improvements are significantly different for women and men: Mean in unreserved GPs: 14.7. Standard deviation: 19. ρ = 0.07. We want to detect at least a 30% increase, so we set δ = 0.24. We look for a power of 80%.

Minimum Number of GPs, Fixed Villages per GP We search for the minimum number of GPs we need if we survey 1 village per GP. Answer: 553

Number of Clusters for 80% power

Minimum Number of GPs, Fixed Villages per GP We search for the minimum number of GPs we need if we survey 1 village per GP. Answer: 553. We then search for the minimum number of GPs if we survey 2, 3, 4, etc. villages per GP.

Power Against Number of Clusters with 5 Villages Per Panchayat

Minimum Number of GPs, Fixed Villages per GP We search for the minimum number of GPs we need if we survey 1 village per GP. Answer: 553. We then search for the minimum number of GPs if we survey 2, 3, 4, etc. villages per GP. For each combination, we calculate the number of villages we will need to survey, and the budget.

What sample do we need? Exercise A. Power: 80%

# of villages per GP | # of GPs | total # of villages | Total Cost (man-days)
                   1 |      553 |                 553 |                3041.5
                   2 |      297 |                 594 |                2673.0
                   3 |      209 |                 627 |                2612.5
                   4 |      162 |                 648 |                2592.0
                   5 |      141 |                 705 |                2749.5
                   6 |      121 |                 726 |                2783.0
                   7 |      107 |                 749 |                2835.5
                   8 |      101 |                 808 |                3030.0
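The cluster counts in this table can be approximated by combining the design effect with the simple-study sample-size formula. A sketch under the normal approximation (OD's exact, t-based calculations give slightly larger numbers, e.g. 553 rather than ~545 GPs for one village per GP):

```python
from statistics import NormalDist

def clusters_needed(delta, m, rho, alpha=0.05, power=0.80):
    """Total clusters (split equally across arms) needed to detect a
    standardized effect delta with m units per cluster and intracluster
    correlation rho (normal approximation)."""
    z = NormalDist().inv_cdf
    z_sum = z(1 - alpha / 2) + z(power)
    return 4 * z_sum ** 2 * (1 + (m - 1) * rho) / (delta ** 2 * m)

# Panchayat example from the slides: delta = 0.24, rho = 0.07
table = {m: clusters_needed(0.24, m, 0.07) for m in (1, 2, 3, 4, 5)}
```

The diminishing returns in the table show up here too: each extra village per GP reduces the required number of GPs by less and less, which is why the total cost eventually starts rising again.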

The Design factors that influence power Clustered design Availability of a Baseline Availability of Control Variables, and Stratification. The type of hypothesis that is being tested.

Availability of a Baseline A baseline has two main uses: It lets you check whether the control and treatment groups were the same or different before the treatment. It reduces the sample size needed, but requires that you do a survey before starting the intervention: typically the evaluation costs go up and the intervention costs go down. To compute power with a baseline, you need to know the correlation between two subsequent measurements of the outcome (for example, pre- and post-test scores in school). The stronger the correlation, the bigger the gain. The gains are very big for very persistent outcomes such as test scores. In OD, the pre-test score is entered as a covariate, and r² is its correlation over time.
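The size of the gain from a baseline can be sketched with a standard ANCOVA-style argument (an approximation, not from the slides): controlling for a baseline with autocorrelation r shrinks the residual variance of the outcome to (1 − r²) of the original, so the required sample size shrinks by roughly the same factor.

```python
def sample_size_fraction_with_baseline(r):
    """Approximate fraction of the no-baseline sample size needed when
    the baseline outcome (autocorrelation r) is used as a covariate:
    the residual variance shrinks to (1 - r^2) of the original."""
    return 1 - r ** 2

# Persistent outcomes give large gains: test scores with r = 0.8
# need only ~36% of the sample a no-baseline design would need
frac = sample_size_fraction_with_baseline(0.8)
```

For weakly persistent outcomes (say r = 0.2) the gain is only a few percent, which is why a baseline is most valuable for outcomes like test scores.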

The Design factors that influence power Clustered design Availability of a Baseline Availability of Control Variables, and Stratification. The type of hypothesis that is being tested.

Stratified Samples Stratification will reduce the sample size needed to achieve a given power (you saw this in real time in the Balsakhi exercise). The reason is that it reduces the variance of the outcome of interest within each stratum (and hence increases the standardized effect size for any given effect size) and the correlation of units within clusters. Example: if you randomize within school and grade which class is treated and which is control: The variance of test scores goes down because age is controlled for. The within-cluster correlation goes down because the common principal effect disappears. Common stratification variables: Baseline values of the outcomes, when possible. Subgroups across which we expect the treatment effect to vary.

Power in Stratified Samples with OD We use a lower value for the standardized effect size. We use a lower value for ρ (this is now the residual correlation within each site). We need to specify the variability in effect size across sites. Common values: 0.01; 0.05.

The Design factors that influence power Clustered design Availability of a Baseline Availability of Control Variables, and Stratification. The type of hypothesis that is being tested.

The Hypothesis that is being tested Are you interested in the difference between two treatments as well as the difference between treatment and control? Are you interested in the interaction between the treatments? Are you interested in testing whether the effect is different in different subpopulations? Does your design involve only partial compliance (e.g. an encouragement design)?

Conclusions Power calculations involve some guesswork. They also involve some pilot testing before the proper experiment begins. They can tell you: How many treatments to have. How to trade off more clusters vs. more observations per cluster. Whether the evaluation is feasible or not.