Planning sample size for impact evaluations


David Evans, World Bank
Based on slides by Esther Duflo (J-PAL) and Jed Friedman (World Bank)

Size of the sample for impact evaluations
- General question: how large does the sample need to be to credibly detect an impact of a given size?
- What does "credibly" mean here? That I have a given level of confidence that the difference between the group that received the program and the group that did not is due to the program.
- Randomization removes bias, but it does not remove noise: it works through the law of large numbers.
- How large does the number have to be?

How large?
- Two people selected at random?
- Ten people?
- Many people! How many is "many"?
[Figure: groups of increasing size split into treatment (T) and control (C)]

Basic setup
- At the end of the experiment, we compare the outcome of interest in the treatment and control groups.
- We are interested in the difference:
  Mean of the treatment group - Mean of the control group = Effect size
- For example:
  Mean income of households that receive the grant (bolsa) - Mean income of households that do not receive the grant = Effect size

Estimation
- We do not have enough money to observe every household, only a sample (nor do we need to observe them all).
- Each household in the sample has a certain level of income. It may be closer to or farther from the mean of the whole population, depending on the other factors that affect income.
- We infer mean income in the population from the mean in the sample.
- If we have very few households, the means will be imprecise.
- If we see no difference between the means of the treatment and control groups, we do not know whether there is no effect or whether we lacked the power to detect it.

The variability that we measure in the outcome
- If the outcome varies a lot within the treatment and the control group, it will be difficult to say whether it was the treatment that caused the difference in means.
[Figure: "High Standard Deviation". Histogram of frequency by value for two groups, one with mean 50 and one with mean 60. Slide dated 5/6/2010.]

The variability that we measure in the outcome
- If the outcome varies very little within the groups, it is easier to say that the treatment caused the difference.
[Figure: "Low Standard Deviation". Histogram of frequency by value for two groups, one with mean 50 and one with mean 60.]

The standard error
- The standard error of the sample estimate captures both the size of the sample and the variability of the outcome: it is larger with a small sample and with a highly variable outcome.
- A 95% confidence interval for an effect tells us that, for 95% of the samples that we could draw from the same population, the interval constructed this way would contain the true effect.
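
As a rough illustration of the standard error and the 95% confidence interval described above, the sketch below computes both for a difference in means. The function name `diff_in_means_ci` and the outcome values are hypothetical, and the interval uses the normal approximation (z = 1.96), which is only a reasonable shortcut for samples that are not too small.

```python
import math
from statistics import mean, stdev

def diff_in_means_ci(treatment, control, z=1.96):
    """Return (effect, standard_error, (ci_low, ci_high)) for two samples."""
    effect = mean(treatment) - mean(control)
    # The standard error shrinks as the samples grow and
    # grows with the variability of the outcome in each group.
    se = math.sqrt(stdev(treatment) ** 2 / len(treatment)
                   + stdev(control) ** 2 / len(control))
    return effect, se, (effect - z * se, effect + z * se)

# Hypothetical outcome data, for illustration only.
treated = [62, 58, 65, 60, 55, 61, 59, 63]
controls = [50, 48, 53, 51, 47, 52, 49, 50]
effect, se, (low, high) = diff_in_means_ci(treated, controls)
print(f"effect={effect:.2f}, se={se:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

Doubling the spread of the outcome widens the interval, while quadrupling the sample size halves the standard error.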

Hypothesis testing
- Often we are interested in testing the hypothesis that the effect size is equal to zero (in other words: "My program had no effect? I hope not!").
- We want to test H0: effect size = 0 against Ha: effect size ≠ 0.

Two types of mistakes (1)
- Type I error: concluding that there is an effect when in fact there is no effect.
- The level of your test is the probability that you will falsely conclude that the program has an effect when in fact it does not.
- So with a level of 5%, a program with no effect would be declared effective in only 5% of experiments. This is the sense in which you can be 95% confident in a conclusion that the program had an effect.
- For policy purposes, we want to be very confident in the estimated impact, so the level is set fairly low.
- Common levels: 5%, 10%, 1%.

Relation with confidence intervals
- If zero does not belong to the 95% confidence interval of the effect size we measured, we can reject the hypothesis of a zero effect at the 5% level (thus there is an effect).
- So the rule of thumb is: if the estimated effect size is more than twice the standard error, you can conclude at the 5% level that the program had an effect.

Two types of mistakes (2)
- Type II error: you conclude that the program had no effect when in fact it does have an effect.
- The power of a test is the probability that I will find a significant effect in my experiment if there truly is an effect (higher power is better, since I am more likely to find an effect).
- Power is a planning tool. It tells me how likely it is that I find a significant effect for a given sample size.

Calculating power
- When planning an evaluation, with some preliminary research we can calculate the minimum sample we need in order to:
  - test a hypothesis: the program effect was zero or not zero
  - for a pre-specified level (e.g. 5%)
  - given a pre-specified effect size (what you think the program will do)
  - to achieve a given power
- A power of 80% tells us that, in 80% of the experiments of this sample size conducted in this population, if there is indeed an effect in the population, we will be able to say in our sample that there is an effect with the desired level of confidence.
- The larger the sample, the larger the power.
- Common power used: 80%, 90%.

Ingredients for a power calculation in a simple study

What we need: significance level
Where we get it: this is often conventionally set at 5%. The lower it is, the larger the sample size needed for a given power.

What we need: the mean and the variability of the outcome in the comparison group
Where we get it: from previous surveys conducted in similar settings. The larger the variability, the larger the sample needed for a given power.

What we need: the effect size that we want to detect
Where we get it: what is the smallest effect that should prompt a policy response? The smaller the effect size we want to detect, the larger the sample size we need for a given power.
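
To make the ingredients concrete, here is a minimal sketch of a sample size calculation for a simple study. It assumes individual-level randomization, two arms of equal size, a two-sided test, and the normal approximation; the function name `sample_size_per_arm` is hypothetical and is not taken from any particular software.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(effect, sd, alpha=0.05, power=0.80):
    """Units needed in EACH arm to detect `effect` given outcome spread `sd`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance level
    z_power = NormalDist().inv_cdf(power)          # desired power
    d = effect / sd                                # effect size in sd units
    return math.ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)

# Smaller effects, lower levels, and higher power all raise the requirement.
for effect in (0.5, 0.4, 0.2):
    print(effect, sample_size_per_arm(effect, sd=1.0))
```

Because the effect enters squared in the denominator, halving the effect you want to detect roughly quadruples the required sample.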

Picking an effect size
- What is the smallest effect that would justify adopting the program?
  - Cost of this program vs. the benefits it brings
  - Cost of this program vs. the alternative use of the money
- If the effect is smaller than that, it might as well be zero: we are not interested in proving that a very small effect is different from zero.
- In contrast, any effect larger than that would justify adopting the program: we want to be able to distinguish it from zero.
- Common danger: picking an effect size that is too optimistic, so that the sample size is set too low!

Standardized effect sizes
- How large an effect you can detect with a given sample depends on how variable the outcome is.
  - Example: if all children have very similar learning levels without a program, a very small impact will be easy to detect.
- The standard deviation captures the variability in the outcome. The more variability, the higher the standard deviation.
- The standardized effect size is the effect size divided by the standard deviation of the outcome: d = effect size / standard deviation.
- Common effect sizes: d = 0.20 (small), d = 0.40 (medium), d = 0.50 (large).

The design factors that influence power
- The level of randomization
- Availability of a baseline
- Availability of control variables, and stratification
- The type of hypothesis that is being tested

Level of randomization: clustered design
- Cluster randomized trials are experiments in which social units or clusters, rather than individuals, are randomly allocated to intervention groups.
- Examples (intervention: unit of randomization):
  - Conditional cash transfers: villages
  - ITN distribution: health clinics
  - IPT: schools
  - Iron supplementation: family

Reasons for adopting cluster randomization
- Need to minimize or remove contamination. Example: in the deworming program, the school was chosen as the unit because worms are contagious.
- Basic feasibility considerations. Example: the PROGRESA program would not have been politically feasible if some families were included and not others.
- Only natural choice. Example: any education intervention that affects an entire classroom (e.g. flipcharts, teacher training).

Impact of clustering
- The outcomes for all the individuals within a unit may be correlated:
  - All villagers are exposed to the same weather
  - All patients share a common health practitioner
  - All students share a schoolmaster
  - The program affects all students at the same time
  - The members of a village interact with each other
- The sample size needs to be adjusted for this correlation.
- The more correlation between the outcomes, the greater the need to expand the sample.

Example of the effect of clustering
Number of classes needed for power of 0.80:

Intra-class      Students in each class
correlation        10      50     100     200
0.00               23       7       5       4
0.02               25      10       8       8
0.05               30      16      15      13
0.10               40      25      23      22

Implications
- It is extremely important to randomize an adequate number of groups.
- The number of individuals within groups matters less than the number of groups.
- The law of large numbers applies only when the number of groups that are randomized increases.
- You CANNOT randomize at the level of the district, with one treated district and one control district! [Even with 2,000 students per district]
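
One common way to quantify this adjustment is the design effect, 1 + (m - 1) * rho, where m is the cluster size and rho the intra-class correlation. The sketch below is an illustration under the normal approximation with equal-sized clusters split evenly between arms; the function names are hypothetical. It is not meant to reproduce the table above, whose effect size is not stated, and the normal approximation tends to understate the requirement when the number of clusters is small.

```python
import math
from statistics import NormalDist

def design_effect(cluster_size, icc):
    """Factor by which clustering inflates the required number of individuals."""
    return 1 + (cluster_size - 1) * icc

def clusters_needed(d, cluster_size, icc, alpha=0.05, power=0.80):
    """Total clusters (both arms) to detect standardized effect d."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    n_individual = 4 * z ** 2 / d ** 2  # total sample if there were no clustering
    n_clustered = n_individual * design_effect(cluster_size, icc)
    return math.ceil(n_clustered / cluster_size)

# With rho = 0.10, adding students per class barely reduces the classes needed.
for m in (10, 50, 100, 200):
    print(m, clusters_needed(d=0.2, cluster_size=m, icc=0.10))
```

The loop shows the same pattern as the table: once the correlation is above zero, larger clusters are a poor substitute for more clusters.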

Availability of a baseline
- A baseline has three main uses:
  - It lets you check whether the control and treatment groups were the same or different before the treatment.
  - It reduces the sample size needed, but requires a survey before starting the intervention: typically the evaluation cost goes up and the intervention cost goes down.
  - It can be used to stratify and form subgroups.
- To compute power with a baseline:
  - You need to know the correlation between two subsequent measurements of the outcome (for example, consumption measured in two years).
  - The stronger the correlation, the bigger the gain.
  - Gains are very big for very persistent outcomes such as labor force participation.
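
A rough way to see the size of the gain: if the analysis adjusts for the baseline value of the outcome as a covariate (an assumption here, the slides do not say which estimator is used), the residual variance falls by a factor of (1 - r^2), where r is the correlation between the baseline and follow-up measurements. The hypothetical helper below applies that factor to a sample size computed without a baseline.

```python
import math

def sample_size_with_baseline(n_without_baseline, correlation):
    """Scale a sample size by (1 - r^2), the variance left after adjusting
    for a baseline measurement correlated r with the follow-up outcome."""
    if not -1 <= correlation <= 1:
        raise ValueError("correlation must lie between -1 and 1")
    return math.ceil(n_without_baseline * (1 - correlation ** 2))

# Hypothetical starting point: 393 units per arm without a baseline.
for r in (0.0, 0.3, 0.5, 0.9):
    print(r, sample_size_with_baseline(393, r))
```

Weakly persistent outcomes (r around 0.3) save little, while very persistent ones (r around 0.9) cut the requirement to a fraction.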

Control variables
- If we have additional relevant variables (e.g. school size, student characteristics), we can also control for them.
- What matters now for power is the variation that remains after controlling for those variables.
- If the control variables explain a large part of the variance, precision increases and the sample size requirement decreases.
- Warning: control variables must only include variables that are not INFLUENCED by the treatment, that is, variables that were collected BEFORE the intervention.

Stratified samples
- Stratification: create BLOCKS by value of the control variables and randomize within each block.
- Stratification ensures that the treatment and control groups are balanced in terms of these control variables.
- This reduces variance for two reasons:
  - it reduces the variance of the outcome of interest within each stratum
  - it reduces the correlation of units within clusters
- Example: if you stratify by district for an agricultural program:
  - agroclimatic and associated epidemiologic factors are controlled for
  - the common district government effect disappears
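
The blocking step can be sketched as follows. This is an illustrative implementation, not a prescribed procedure: the function name `stratified_assignment`, the village identifiers, and the district labels are all hypothetical, and in a block with an odd number of units the extra unit goes to control.

```python
import random
from collections import defaultdict

def stratified_assignment(units, stratum_of, seed=0):
    """Randomly assign half of each stratum (block) to treatment.

    units: list of unit ids; stratum_of: dict mapping unit id -> stratum label.
    """
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    blocks = defaultdict(list)
    for unit in units:
        blocks[stratum_of[unit]].append(unit)
    assignment = {}
    for label in sorted(blocks):
        members = blocks[label][:]
        rng.shuffle(members)  # randomize WITHIN the block
        half = len(members) // 2
        for i, unit in enumerate(members):
            assignment[unit] = "T" if i < half else "C"
    return assignment

# Hypothetical example: 12 villages in two districts.
villages = [f"v{i}" for i in range(12)]
district = {v: ("north" if i < 6 else "south") for i, v in enumerate(villages)}
print(stratified_assignment(villages, district))
```

Every district ends up with the same number of treated and control villages, so district-level differences cannot drive the estimated effect.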

The hypothesis that is being tested
- Are you interested in the difference between two treatments as well as the difference between treatment and control?
- Are you interested in the interaction between the treatments?
- Are you interested in testing whether the effect is different in different subpopulations?
- Does your design involve only partial compliance?

To remember!
- The number of groups is much more important than the number of individuals: schools vs. students, villages vs. households.
- Two types of errors:
  - Type I: you think there is an effect when there is not (level)
  - Type II: you think there is no effect when there is one (power)
- Avoiding errors requires a sufficient sample: the power calculation.

Power calculations using the OD software
- Choose "Power v. number of clusters" in the menu "clustered randomized trials".
- http://sitemaker.umich.edu/group-based/optimal_design_software

Cluster size
- Choose the cluster size.

Choose significance level, treatment effect, and correlation
- Pick α (the level): normally you pick 0.05.
- Pick d: you can experiment with 0.20.
- Pick the intra-class correlation (rho).
- You obtain the resulting graph showing power as a function of sample size.
[Figure: "Power and Sample Size"]
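
For readers without the software at hand, the shape of such a power curve can be approximated by hand. The sketch below is not the OD software's own computation: it uses the normal approximation, assumes equal-sized clusters split evenly between two arms, and ignores the negligible chance of rejecting in the wrong direction. The function name `power_cluster_design` is hypothetical.

```python
from statistics import NormalDist

def power_cluster_design(n_clusters, cluster_size, d, icc, alpha=0.05):
    """Approximate power to detect standardized effect d with n_clusters in total."""
    # Variance of the estimated standardized effect with J clusters of size m:
    # 4 * (icc + (1 - icc) / m) / J
    se = (4 * (icc + (1 - icc) / cluster_size) / n_clusters) ** 0.5
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(d / se - z_alpha)

# Power as a function of the number of clusters, for d = 0.20 and rho = 0.10.
for j in (20, 40, 80, 160):
    print(j, round(power_cluster_design(j, cluster_size=10, d=0.20, icc=0.10), 2))
```

Trying different values of the cluster size in the loop shows how little the curve moves compared with changing the number of clusters.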

Conclusions: power calculation in practice
- Power calculations involve some guesswork.
- At times we do not have the right information to conduct them properly.
- However, it is important to spend effort on them:
  - Avoid launching studies that will have no power at all: a waste of time and money.
  - Devote the appropriate resources to the studies that you decide to conduct (and not too much).