SAMPLING AND SAMPLE SIZE

Andrew Zeitlin, Georgetown University and IGC Rwanda
With slides from Ben Olken and the World Bank's Development Impact Evaluation Initiative

Review
- We want to learn how a program affects a group.
- Setup: assume we have run our lottery and identified the treatment and control groups.
- What we would like to measure is the difference: Effect of the Program = Mean in treatment - Mean in control
- Example: the average income of farmers who adopted fertilizer because of a new incentive program vs. that of farmers in the control group, who didn't receive any incentives.

Bias vs. Noise
- What if we let farmers choose whether or not to use fertilizer? Our study would be biased!
- If only the wealthiest, most educated farmers choose to use fertilizer, then comparing treatment and control will not isolate the effect of fertilizer. We would see the combined effect of being wealthy, being educated, and using fertilizer.
- This is why we randomize: to remove other factors that might create bias.

Bias vs. Noise
- What if we pick only ten farmers for treatment and control, and by chance we get the four richest farmers in our control group and only poor farmers in the treatment group?
- Because the randomization did not balance farmer wealth, our groups are not really similar. The control group may make greater gains (due to more resources) despite the fertilizer.
- Bottom line: randomization removes bias, but it does not remove random noise in the data. This is why we worry about sampling!

What do we mean by noise?
[Two figure slides; images not recovered]

Why random is necessary but not enough
Random does not mean balanced! It just means that any imbalance arises by chance, not for a systematic reason.
Which of the following coin flips was random?
1. T, H, T, T, H, H, T, H, H, T (this one we made up)
2. T, T, H, H, H, H, H, T, H, H (this was a random coin flip: 5 heads in a row!)
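To make the coin-flip point concrete, here is a minimal sketch that counts how often ten fair flips contain a streak of five or more heads. The helper name `has_run` and the exhaustive enumeration are illustrative choices and are not part of the original slides.

```python
from itertools import product


def has_run(flips, face="H", length=5):
    """Return True if `flips` contains `length` or more consecutive `face` values."""
    streak = 0
    for f in flips:
        streak = streak + 1 if f == face else 0
        if streak >= length:
            return True
    return False


# Enumerate all 2**10 equally likely sequences of ten fair coin flips.
sequences = list(product("HT", repeat=10))
with_run = sum(has_run(seq) for seq in sequences)
share = with_run / len(sequences)

print(f"{with_run} of {len(sequences)} sequences contain 5+ heads in a row ({share:.1%})")
```

Roughly one sequence in nine contains such a streak, so a run of five heads is not strong evidence that a sequence was made up.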

A random sample is accurate, but may not be precise
What is the average age of the people in this room?
- If I pick the youngest-looking person in the room and ask their age, I am biasing the type of response I am likely to get.
- If I pick someone at random and ask their age, that is not biased, but it still doesn't tell me much, since it is just one person.
- If I ask everyone except for one person chosen at random, I am likely to get close to the right average age: this is a good random sample.
- If I ask everyone, it is no longer a sample of the room: it is the universe.

Which of these is more accurate? [Figures I and II not recovered]
A. I
B. II
C. Don't know
[Audience poll chart; response shares as extracted: 88%, 12%, 0%]

Accuracy versus Precision
[Figure: estimates relative to the truth; labels: Precision (Sample Size), Accuracy (Randomization)]

Real World Constraints
- Random sampling can be noisy!
- In a world with no budget constraint, we could collect data on ALL the individuals (the universe) in the treatment and control groups.
- In practice, we do not observe the entire population, just a sample.
- Example: we do not have data for all farmers in the country or region, just for a random sample of them in the treatment and control groups.
- Bottom line: Estimated Effect = True Effect + Noise
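The bottom line "Estimated Effect = True Effect + Noise" can be illustrated with a small simulation. This is only a sketch: the true effect (2.0), the outcome standard deviation (10.0), the normal outcomes, and the function names are all assumptions chosen for illustration, not values from the slides.

```python
import random
import statistics


def estimated_effect(n_per_arm, true_effect, sd, rng):
    """Difference in sample means between a treatment arm and a control arm."""
    control = [rng.gauss(0.0, sd) for _ in range(n_per_arm)]
    treatment = [rng.gauss(true_effect, sd) for _ in range(n_per_arm)]
    return statistics.fmean(treatment) - statistics.fmean(control)


def simulate(n_per_arm, reps=200, true_effect=2.0, sd=10.0, seed=0):
    """Repeat the experiment `reps` times; return the mean and spread of the estimates."""
    rng = random.Random(seed)
    estimates = [estimated_effect(n_per_arm, true_effect, sd, rng) for _ in range(reps)]
    return statistics.fmean(estimates), statistics.stdev(estimates)


for n in (10, 1000):
    mean_estimate, spread = simulate(n)
    print(f"n per arm = {n:4d}: average estimate = {mean_estimate:.2f}, spread = {spread:.2f}")
```

In both cases the estimates are centered on the true effect (no bias), but the spread of individual estimates is far wider with ten units per arm than with a thousand.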

THE basic questions in statistics
How confident can you be in your results? → How big does your sample need to be?

Hypothesis Testing
- In criminal law, most institutions follow the rule: innocent until proven guilty.
- The presumption is that the accused is innocent, and the burden is on the prosecutor to show guilt.
- The jury or judge starts with the null hypothesis that the accused person is innocent.
- The prosecutor has a hypothesis that the accused person is guilty.

Hypothesis Testing
- In program evaluation, instead of a presumption of innocence, the rule is a presumption of insignificance.
- The null hypothesis (H0) is that there was no (zero) impact of the program.
- The burden of proof is on the evaluator to show a significant effect of the program.

Hypothesis Testing: Conclusions
If it is very unlikely (less than a 5% probability) that the difference is solely due to chance:
- We reject our null hypothesis.
- We may now say: our program has a statistically significant impact.
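As a rough illustration of this decision rule, the sketch below computes a two-sided p-value from a difference in means and its standard error. The normal approximation, the function names, and the example numbers (a difference of 3.0 with standard error 1.2) are assumptions for illustration only.

```python
from statistics import NormalDist


def two_sided_p_value(difference, standard_error):
    """P-value for H0: no effect, using a normal approximation (an assumption)."""
    z = abs(difference) / standard_error
    return 2 * (1 - NormalDist().cdf(z))


def reject_null(difference, standard_error, significance=0.05):
    """Reject H0 when the p-value falls below the chosen significance level."""
    return two_sided_p_value(difference, standard_error) < significance


# Hypothetical numbers: the treatment mean exceeds the control mean by 3.0,
# and the standard error of that difference is 1.2.
p = two_sided_p_value(3.0, 1.2)
print(f"p-value = {p:.3f}, reject H0 at 5%: {reject_null(3.0, 1.2)}")
```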

Two Types of Mistakes (1)
- First type of error: concluding that the program has an effect when in fact it has, at best, no effect.
- Significance level of a test: the probability that you will falsely conclude that the program has an effect when in fact it does not.
- If you find an effect at the 5% level, you can be 95% confident in the validity of your conclusion that the program had an effect.
- Common levels are 5%, 10%, and 1%.

Two Types of Mistakes (2)
- Second type of error: you conclude that the program has no effect when in fact it had one, but the effect was not measured with enough precision (noise got in the way).
- Power of a test: the probability of finding a significant effect if there truly is an effect.
- Higher power is better, since I am more likely to have an effect to report.

Practical steps
There are two related ways one might apply this logic:
1. Start from the sample size that you can afford. Figure out the smallest true effect that you could detect with reasonable confidence and power.
   - This is known as the minimum detectable effect for a given design.
2. Start from a plausible effect size, and figure out how big a sample you need in order to detect it with reasonable confidence and power.
   - We will focus on this second approach.

Practical Steps
- Set a pre-specified significance level (5%), i.e. just set the initial point of the line in the graph.
- Decide on a level of power. Common values are 80% or 90%. Intuitively, the larger the sample, the larger the power. Power is a planning tool: one minus the power is the probability of being disappointed.
- Set a range of pre-specified effect sizes (what you think the program will do).
- Ask: what is the smallest effect that should prompt a policy response?

Picking an Effect Size to choose sample
- We can guess an effect size using economics, past data on the outcome of interest, or even past evaluations.
- What is the smallest effect that would justify adopting the program?
  - Cost of this program vs. the benefits it brings
  - Cost of this program vs. the alternative use of the money

Underpowered
- Common danger: if we pick effect sizes that are too optimistic, the sample size may be set too low to detect an actual effect!
- Example: evaluators believe a program will increase high school graduation by 15 percentage points. They survey enough schools to see increases of 12 percentage points or more. The program increased graduation rates by 10 percentage points, but they missed that entirely due to lack of power!
- They report that the program had no statistically significant effect, even though it actually had one!
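The graduation example can be put into rough numbers. The sketch below assumes that "enough schools to see increases of 12 percentage points or more" means a minimum detectable effect of 12 points at 80% power with a two-sided 5% test; the slides do not say this, so the reading, the normal approximation, and the function names are all assumptions.

```python
from statistics import NormalDist

Z = NormalDist()


def implied_standard_error(mde, power=0.80, alpha=0.05):
    """Standard error implied by a design whose MDE is `mde` at the given power (assumed reading)."""
    return mde / (Z.inv_cdf(power) + Z.inv_cdf(1 - alpha / 2))


def power_at(true_effect, se, alpha=0.05):
    """Probability of a significant two-sided result for a given true effect."""
    crit = Z.inv_cdf(1 - alpha / 2)
    shift = true_effect / se
    return (1 - Z.cdf(crit - shift)) + Z.cdf(-crit - shift)


se = implied_standard_error(12.0)
for effect in (15.0, 12.0, 10.0):
    print(f"true effect {effect:4.0f} pp -> power {power_at(effect, se):.0%}")
```

Under these assumptions, a design that looks comfortable for the hoped-for 15 points has about a one-in-three chance of missing a true 10-point effect.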

How difficult is this to do?
- Proposition I: There exists at least one statistician in the world who has already put into a magic formula the optimal sample size required to address this problem.
- Proposition II: The rule has also been implemented in almost all computer software.
- Not difficult to do, and it requires only simple calculations to understand the logic (really simple!).

Power: main ingredients 1. Effect Size 2. Sample Size 3. Variance 4. Proportion of sample in T vs. C 5. Clustering

Larger effect = More Power to Detect
A device detects all animals over six feet (1.8 meters) tall.
- Power to detect adult men: under 10%
- Power to detect adult women: under 1%
- Power to detect adult mice: 0%
- Power to detect adult giraffes: 100%
The taller the animal (effect size) we care about, the more power we have (and the smaller the sample we need).

Effect Size: 1*SE
Hypothesized effect size determines distance between means
[Figure: control (H0) and treatment (Hβ) distributions, 1 standard deviation apart]

Effect Size = 1*SE
[Figure: control (H0) and treatment (Hβ) distributions, with the significance region marked]

Power: 26% if the true impact was 1*SE
[Figure: control (H0) and treatment (Hβ) distributions, with power shaded]
The null hypothesis would be rejected only 26% of the time.

Effect Size: 3*SE
[Figure: control and treatment distributions, 3*SE apart]
Bigger hypothesized effect size → distributions farther apart

Effect size 3*SE: Power = 91%
[Figure: control (H0) and treatment (Hβ) distributions, with power shaded]
Bigger effect size means more power.
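The 26% and 91% figures can be reproduced with a short calculation. They appear consistent with a one-sided test at the 5% level and normally distributed estimates; that is an inference from the numbers, not something the slides state, and the function name is illustrative.

```python
from statistics import NormalDist


def power_one_sided(effect_in_se, alpha=0.05):
    """Power of a one-sided z-test when the true effect equals `effect_in_se` standard errors."""
    z = NormalDist()
    critical_value = z.inv_cdf(1 - alpha)  # about 1.645 for alpha = 0.05
    return 1 - z.cdf(critical_value - effect_in_se)


for k in (1, 3):
    print(f"true effect = {k}*SE -> power = {power_one_sided(k):.0%}")
```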

What effect size should you use when designing your experiment?
A. Smallest effect size that is still cost effective
B. Largest effect size you expect your program to produce
C. Both
D. Neither
[Audience poll chart; response shares as extracted: 50%, 50%, 0%, 0%]

Power: main ingredients 1. Effect Size 2. Sample Size 3. Variance 4. Proportion of sample in T vs. C 5. Clustering

By increasing sample size you increase
A. Accuracy
B. Precision
C. Both
D. Neither
E. Don't know
[Audience poll chart; response shares as extracted: 50%, 33%, 17%, 0%, 0%]

Larger sample size = More power to detect
- We want to know the average age in the city.
- If we randomly pick one person in the city, we might pick a 100-year-old.
- If we randomly pick 2000 people, even if we pick the 100-year-old as one of them, he will be balanced out by the other random selections.
- This intuition extends to effect sizes.

Power: Effect size = 1SD, Sample size = N
[Figure: control and treatment distributions, with the significance region marked]

Power: Sample size = 4N
[Figure: control and treatment distributions, with the significance region marked]

Power: 64%
[Figure: control and treatment distributions, with power shaded]

Power: Sample size = 9N
[Figure: control and treatment distributions, with the significance region marked]

Power: 91%
[Figure: control and treatment distributions, with power shaded]
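The 64% and 91% figures follow from the standard error shrinking with the square root of the sample size: multiplying the sample by 4 or 9 makes a 1*SE effect equal to 2 or 3 of the new standard errors. The sketch below assumes a one-sided 5% test and normal estimates, which is what the slide's numbers appear to imply; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def power_with_sample_multiple(multiple, base_effect_in_se=1.0, alpha=0.05):
    """Power after scaling the sample by `multiple`, for an effect of `base_effect_in_se` SEs at baseline."""
    z = NormalDist()
    # The standard error shrinks by sqrt(multiple), so the effect grows in SE units.
    effect_in_se = base_effect_in_se * sqrt(multiple)
    return 1 - z.cdf(z.inv_cdf(1 - alpha) - effect_in_se)


for multiple in (1, 4, 9):
    print(f"sample size = {multiple}N -> power = {power_with_sample_multiple(multiple):.0%}")
```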

Power: main ingredients 1. Effect Size 2. Sample Size 3. Variance 4. Proportion of sample in T vs. C 5. Clustering

More variance = Less power to detect
Imagine the following intervention: giving away ten bags of rice. In this example, the program has a large effect on ALL poor people and no effect on ALL rich people.
- Low variance: if our population is all poor, we only need to sample one person to see the true effect of giving away rice.
- High variance: if our population is half poor and half rich, and we randomly sample twenty people, what happens if only 5 are poor?

What are typical ways to reduce the underlying (population) variance?
A. Include covariates
B. Increase the sample
C. Do a baseline survey
D. All of the above
E. A and B
F. A and C
[Audience poll chart; response shares as extracted: 80%, 20%, 0%, 0%, 0%, 0%]

Variance
- There is sometimes very little we can do to reduce the noise.
- The underlying variance is what it is.
- We can try to absorb variance:
  - using a baseline
  - controlling for other variables
- In practice, controlling for other variables (besides the baseline outcome) buys you very little.

Power: main ingredients 1. Effect Size 2. Sample Size 3. Variance 4. Proportion of sample in T vs. C 5. Clustering

More balanced treatment assignment => More power
What's better?
- 99 people who get the treatment and one control
- 50 treatment and 50 control
This logic continues. What's better?
- 60 people who get the treatment and 40 control
- 50 treatment and 50 control

Sample split: 50% C, 50% T
[Figure: control (H0) and treatment (Hβ) distributions, with the significance region marked]

Power: 91%
[Figure: control and treatment distributions, with power shaded]

If it's not a 50-50 split?
What happens to the relative fatness of the distributions if the split is not 50-50, say 25-75?

Sample split: 25% C, 75% T
[Figure: control (H0) and treatment (Hβ) distributions, with the significance region marked]

Power: 83%
[Figure: control and treatment distributions, with power shaded]
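The drop from 91% to 83% can be reproduced by noting that, for a fixed total sample, the standard error is proportional to 1/sqrt(P(1-P)), where P is the share assigned to treatment. The sketch assumes the 50-50 design corresponds to an effect of 3*SE and a one-sided 5% test, which matches the 91% figure; those assumptions and the function name are not stated in the slides.

```python
from math import sqrt
from statistics import NormalDist


def power_for_split(p_treated, effect_in_se_at_even_split=3.0, alpha=0.05):
    """Power for a given treatment share, holding total sample size and true effect fixed."""
    z = NormalDist()
    # Precision relative to a 50-50 split: sqrt(P(1-P)) / sqrt(0.25).
    relative_precision = sqrt(p_treated * (1 - p_treated)) / 0.5
    effect_in_se = effect_in_se_at_even_split * relative_precision
    return 1 - z.cdf(z.inv_cdf(1 - alpha) - effect_in_se)


for share in (0.50, 0.75):
    print(f"{share:.0%} treated -> power = {power_for_split(share):.0%}")
```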

How unbalanced is too unbalanced?
Bloom (2006): Because precision erodes slowly until the degree of imbalance becomes extreme (roughly P ≤ 0.2 or P ≥ 0.8), there is considerable latitude for using an unbalanced allocation.
This helps if:
- Politics dictate a small control group
- Costs dictate a small treatment group
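One way to see how slowly precision erodes is to compute the minimum detectable effect relative to a 50-50 split, holding total sample size fixed. The sketch relies on the MDE being proportional to 1/sqrt(P(1-P)), as in the appendix formula; the function name and the chosen values of P are illustrative.

```python
from math import sqrt


def mde_inflation(p_treated):
    """MDE relative to a 50-50 split, holding total sample size fixed."""
    return 0.5 / sqrt(p_treated * (1 - p_treated))


for p in (0.5, 0.4, 0.3, 0.2, 0.1, 0.05):
    print(f"P = {p:.2f}: MDE is {mde_inflation(p):.2f} times the 50-50 MDE")
```

At P = 0.2 the MDE is 25% larger than under an even split, and the penalty grows quickly beyond that point.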

Power: main ingredients 1. Effect Size 2. Sample Size 3. Variance 4. Proportion of sample in T vs. C 5. Clustering

Clustered design: definition
- In sampling: when clusters of individuals (e.g. schools, communities, etc.) are randomly selected from the population before individuals are selected for observation.
- In randomized evaluation: when clusters of individuals are randomly assigned to different treatment groups.

Clustered design: intuition
You want to know how close the upcoming national elections will be.
- Method 1: Randomly select 50 people from the entire Indian population.
- Method 2: Randomly select 5 families, and ask ten members of each family their opinion.

Low intra-cluster correlation (ICC) aka ρ (rho)

HIGH intra-cluster correlation (ρ)

All uneducated people live in one village. People with only primary education live in another. College grads live in a third, etc. ICC (ρ) on education will be...
A. High
B. Low
C. No effect on rho
D. Don't know

If ICC (ρ) is high, what is a more efficient way of increasing power?
A. Include more clusters in the sample
B. Include more people in clusters
C. Both
D. Don't know
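A quick way to explore this question is to compare the variance of the estimate under the two options. The sketch uses the factor (ρ + (1 - ρ)/n)/J from the clustered-design formula in the appendix, with J clusters of n members each; the baseline design (20 clusters of 10, ρ = 0.5) and the function name are hypothetical.

```python
def relative_variance(n_clusters, cluster_size, icc):
    """Variance of the estimate up to a constant: (rho + (1 - rho)/n) / J."""
    return (icc + (1 - icc) / cluster_size) / n_clusters


icc = 0.5  # hypothetical "high" intra-cluster correlation
baseline = relative_variance(20, 10, icc)
more_clusters = relative_variance(40, 10, icc)  # double the number of clusters
more_members = relative_variance(20, 20, icc)   # double the people per cluster

print(f"doubling clusters: variance falls to {more_clusters / baseline:.0%} of baseline")
print(f"doubling members:  variance falls to {more_members / baseline:.0%} of baseline")
```

Both options double the number of people surveyed, but with a high ρ only the extra clusters reduce the variance substantially.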

Further topics: Imperfect compliance
- In some cases, policymakers or researchers can assign individuals to a given treatment arm, but this doesn't mean they will take it up. What does this mean for power?
- Consider an extreme case in which nobody in the treatment group takes up the treatment. In that case, no matter how big the sample size, you can't detect the treatment's impact, because you never see it.
- Alternatively, what happens if everybody ends up getting the treatment in both treatment and control groups?
- The required sample size is inversely proportional to $(c - d)^2$, where $c$ is the fraction of the treatment group who comply (take up the treatment) and $d$ is the fraction of the control group who defy (receive the treatment anyway).
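The inverse-square relationship can be tabulated directly. This sketch expresses the required sample size as a multiple of the perfect-compliance case (c = 1, d = 0); the function name and the example take-up rates are hypothetical.

```python
def sample_size_multiplier(c, d):
    """Required sample size relative to perfect compliance: 1 / (c - d)**2."""
    gap = c - d
    if gap == 0:
        # Equal take-up in both arms: no sample size can detect the effect.
        return float("inf")
    return 1 / gap**2


for c, d in [(1.0, 0.0), (0.8, 0.0), (0.8, 0.1), (0.5, 0.0), (0.3, 0.3)]:
    print(f"c = {c:.1f}, d = {d:.1f}: need {sample_size_multiplier(c, d):.2f}x the sample")
```

If only half of the treatment group takes up the program and nobody in the control group does, the required sample is four times as large.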

Wrap-up on Power
- Power calculations look scary, but they are just a formalization of common sense.
- At times we do not have the right information to conduct them very properly.
- However, it is important to spend effort on them:
  - Avoid launching studies that will have no power at all: a waste of time and money, and potentially harmful.
  - Devote the appropriate resources to the studies that you decide to conduct (and not too much).

Appendix: The nuts and bolts (1)
For an experimental design with perfect compliance and individual-level assignment (no clustering):
Minimum detectable effect, for sample size N:
$$\mathrm{MDE} = \left(t_{1-\kappa} + t_{\alpha/2}\right) \sqrt{\frac{\sigma^2}{P(1-P)N}}$$
Minimum sample size, for hypothesized effect size β:
$$N = \frac{\left(t_{1-\kappa} + t_{\alpha/2}\right)^2 \sigma^2}{\beta^2 P(1-P)}$$
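A minimal sketch of these two formulas follows. It reads κ as power, α as the significance level, P as the share assigned to treatment, and σ as the outcome standard deviation, and it substitutes normal quantiles for the t critical values. Those readings, the function names, and the example inputs (an effect of 0.2 standard deviations) are assumptions, not part of the slides.

```python
from math import ceil, sqrt
from statistics import NormalDist


def critical_sum(power=0.80, alpha=0.05):
    """Normal approximation to t_{1-kappa} + t_{alpha/2} (an assumption for large samples)."""
    z = NormalDist()
    return z.inv_cdf(power) + z.inv_cdf(1 - alpha / 2)


def mde(n_total, sigma, p_treated=0.5, power=0.80, alpha=0.05):
    """Minimum detectable effect for a total sample of `n_total`."""
    return critical_sum(power, alpha) * sqrt(sigma**2 / (p_treated * (1 - p_treated) * n_total))


def min_sample_size(beta, sigma, p_treated=0.5, power=0.80, alpha=0.05):
    """Minimum total sample size needed to detect an effect of size `beta`."""
    return critical_sum(power, alpha) ** 2 * sigma**2 / (beta**2 * p_treated * (1 - p_treated))


n_needed = ceil(min_sample_size(beta=0.2, sigma=1.0))
print(f"total sample needed: {n_needed}")
print(f"MDE at that sample size: {mde(n_needed, sigma=1.0):.3f}")
```

The two functions are inverses of each other, which mirrors the two planning approaches described earlier: start from an affordable sample and find the MDE, or start from an effect size and find the sample.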

Appendix: The nuts and bolts (2)
When compliance is imperfect, let c be the fraction of those assigned to treatment who take up, and d the fraction of control who do likewise.
Minimum detectable effect, for sample size N:
$$\mathrm{MDE} = \frac{t_{1-\kappa} + t_{\alpha/2}}{c - d} \sqrt{\frac{\sigma^2}{P(1-P)N}}$$
Minimum sample size, for hypothesized effect size β:
$$N = \frac{\left(t_{1-\kappa} + t_{\alpha/2}\right)^2 \sigma^2}{(c-d)^2 \beta^2 P(1-P)}$$

Appendix: The nuts and bolts (3)
For an experimental design with perfect compliance and group-based assignment:
Minimum detectable effect, for J groups with n members:
$$\mathrm{MDE} = \left(t_{1-\kappa} + t_{\alpha/2}\right) \sqrt{\frac{\sigma^2}{J P(1-P)} \left(\rho + \frac{1-\rho}{n}\right)}$$
Minimum number of groups, for hypothesized effect size β:
$$J = \frac{\left(t_{1-\kappa} + t_{\alpha/2}\right)^2 \sigma^2}{\beta^2 P(1-P)} \left(\rho + \frac{1-\rho}{n}\right)$$
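The group-based formula can be sketched in the same way. As before, normal quantiles stand in for the t critical values, and the function name and example inputs (effect of 0.2 standard deviations, 20 members per group) are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def min_clusters(beta, sigma, cluster_size, icc, p_treated=0.5, power=0.80, alpha=0.05):
    """Minimum number of groups J for effect `beta`, with `cluster_size` members per group."""
    z = NormalDist()
    # Normal approximation to t_{1-kappa} + t_{alpha/2} (an assumption).
    crit = z.inv_cdf(power) + z.inv_cdf(1 - alpha / 2)
    clustering_factor = icc + (1 - icc) / cluster_size
    return crit**2 * sigma**2 / (beta**2 * p_treated * (1 - p_treated)) * clustering_factor


for icc in (0.0, 0.05, 0.20):
    groups = ceil(min_clusters(beta=0.2, sigma=1.0, cluster_size=20, icc=icc))
    print(f"rho = {icc:.2f}: about {groups} groups of 20")
```

Even a modest intra-cluster correlation of 0.05 nearly doubles the number of groups needed relative to ρ = 0 in this example.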