SAMPLING AND SAMPLE SIZE


1 SAMPLING AND SAMPLE SIZE Andrew Zeitlin, Georgetown University and IGC Rwanda. With slides from Ben Olken and the World Bank's Development Impact Evaluation Initiative.

2 Review We want to learn how a program can affect a group. Set-up: assume we have performed our lottery and identified our treatment and control groups. What we would like to measure is a difference: Effect of the Program = Mean in treatment - Mean in control. Example: the average income of farmers who adopted fertilizer because of a new incentive program vs. that of farmers in the control group, who didn't receive any incentives.

3 Bias vs. Noise What if we let farmers choose whether or not to use fertilizer? Our study would be biased! If only the wealthiest, most educated farmers choose to use fertilizer, then comparing treatment and control will not show the effect of fertilizer alone. We would see the combined effect of being wealthy, being educated and using fertilizer. This is why we randomize: to remove other factors that might create bias.

4 Bias vs. Noise What if we only pick ten farmers for treatment and control, and we randomly get the four richest farmers in our control group, and only poor farmers in the treatment group? Because the randomization did not balance farmer wealth, our groups are not really similar. The control group may make greater gains (due to more resources) despite the fertilizer. Bottom line: randomization removes bias, but it does not remove random noise in the data. This is why we worry about sampling!

5 What do we mean by noise?

6 What do we mean by noise?

7 Why random is necessary but not enough Random does not mean balanced! It just means that any imbalance does not arise for a systematic reason. Which of the following coin flip sequences was random? Sequence 1: T, H, T, T, H, H, T, H, H, T (this one we made up). Sequence 2: T, T, H, H, H, H, H, T, H, H (this was a random coin flip: 5 heads in a row!).
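As a rough check on the coin-flip intuition, the sketch below enumerates every possible sequence of 10 fair flips and counts how many contain a run of at least 5 heads. The function names `longest_run` and `prob_run_at_least` are illustrative only and not part of the original slides.

```python
from itertools import product

def longest_run(flips, face="H"):
    """Length of the longest unbroken run of `face` in a sequence of flips."""
    best = current = 0
    for flip in flips:
        current = current + 1 if flip == face else 0
        best = max(best, current)
    return best

def prob_run_at_least(n_flips, run_length):
    """Exact probability that n fair flips contain a run of >= run_length heads."""
    hits = sum(
        1
        for seq in product("HT", repeat=n_flips)
        if longest_run(seq) >= run_length
    )
    return hits / 2 ** n_flips

made_up = "THTTHHTHHT"       # sequence 1 from the slide
truly_random = "TTHHHHHTHH"  # sequence 2 from the slide

print(longest_run(made_up), longest_run(truly_random))  # 2 5
print(prob_run_at_least(10, 5))                         # 0.109375
```

A run of five heads shows up in roughly one of every nine sets of ten flips, so long streaks are not evidence against randomness.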

8 A random sample is accurate, but may not be precise What is the average age of the people in this room? If I pick the youngest-looking person in the room and ask their age, I am biasing the type of response I am likely to get. If I pick someone at random and ask their age, that is not biased, but it still doesn't tell me much since it is just one person. If I ask everyone except for one person, chosen at random, I am likely to get close to the right average age: this is a good random sample. If I ask everyone, it is no longer a sample of the room: it is the universe.

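9 Which of these is more accurate? [Images I and II not reproduced.] A. I. B. II. C. Don't know. (Poll results as recorded: 88%, 12%, 0%.)

10 Accuracy versus Precision [Figure: estimates scattered around the truth, with one axis labelled Precision (Sample Size) and the other Accuracy (Randomization).]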

11 Real World Constraints Random sampling can be noisy! In a world with no budget constraint we could collect data on ALL the individuals (the universe) in the treatment and control groups. In practice, we do not observe the entire population, just a sample. Example: we do not have data for all farmers in the country or region, just for a random sample of them in the treatment and control groups. Bottom line: Estimated Effect = True Effect + Noise.
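The relation Estimated Effect = True Effect + Noise can be illustrated with a small simulation. This is a minimal sketch under assumed values: normally distributed outcomes, a true effect of 1.0 and a standard deviation of 5.0 are all hypothetical, as are the function names.

```python
import random
import statistics

def estimated_effect(n_per_arm, true_effect, sd, rng):
    """Difference in sample means between a treatment and a control sample."""
    control = [rng.gauss(0.0, sd) for _ in range(n_per_arm)]
    treatment = [rng.gauss(true_effect, sd) for _ in range(n_per_arm)]
    return statistics.mean(treatment) - statistics.mean(control)

def spread_of_estimates(n_per_arm, true_effect=1.0, sd=5.0, reps=500, seed=0):
    """Standard deviation of the estimate across repeated hypothetical studies."""
    rng = random.Random(seed)
    estimates = [
        estimated_effect(n_per_arm, true_effect, sd, rng) for _ in range(reps)
    ]
    return statistics.stdev(estimates)

# The true effect never changes; only the noise around it shrinks as n grows.
for n in (10, 100, 1000):
    print(n, round(spread_of_estimates(n), 3))
```

Every simulated study is unbiased, yet with ten farmers per arm the noise is larger than the true effect itself.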

12 THE basic questions in statistics How confident can you be in your results? → How big does your sample need to be?

13 Hypothesis Testing In criminal law, most institutions follow the rule "innocent until proven guilty". The presumption is that the accused is innocent, and the burden is on the prosecutor to show guilt. The jury or judge starts with the null hypothesis that the accused person is innocent. The prosecutor has a hypothesis that the accused person is guilty.

14 Hypothesis Testing In program evaluation, instead of a presumption of innocence, the rule is a presumption of insignificance. The null hypothesis (H0) is that there was no (zero) impact of the program. The burden of proof is on the evaluator to show a significant effect of the program.

15 Hypothesis Testing: Conclusions If it is very unlikely (less than a 5% probability) that the difference is solely due to chance, we reject our null hypothesis. We may now say: "our program has a statistically significant impact".
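The decision rule can be sketched as a difference-in-means test. The data below are invented for illustration, the function name is hypothetical, and the normal approximation is an assumption made here for simplicity; with samples this small a t-test would normally be preferred.

```python
from statistics import NormalDist, mean, variance

def difference_in_means_test(treatment, control):
    """Return (estimated effect, two-sided p-value) under H0: no effect.

    Assumption: a normal approximation is used for the test statistic.
    """
    effect = mean(treatment) - mean(control)
    se = (
        variance(treatment) / len(treatment)
        + variance(control) / len(control)
    ) ** 0.5
    z = effect / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return effect, p_value

# Hypothetical outcome data (e.g. farm income in arbitrary units).
control = [48, 52, 50, 47, 53, 49, 51, 50, 46, 54]
treatment = [55, 58, 54, 57, 60, 53, 56, 59, 52, 61]

effect, p = difference_in_means_test(treatment, control)
print(f"effect = {effect:.1f}, p = {p:.6f}, reject H0 at 5%: {p < 0.05}")
```

Here the estimated effect is 6.5 and the p-value is far below 0.05, so the null hypothesis of zero impact would be rejected.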

16 Two Types of Mistakes (1) First type of error: concluding that the program has an effect when in fact it has none. Significance level of a test: the probability that you will falsely conclude that the program has an effect when in fact it does not. If you test at a level of 5%, a program with no effect would show a spurious effect only 5% of the time; loosely, you can be 95% confident in a conclusion that the program had an effect. Common levels are 5%, 10% and 1%.

17 Two Types of Mistakes (2) Second type of error: concluding that the program has no effect when it did have one, but the effect was not measured with enough precision (noise got in the way). Power of a test: the probability of finding a significant effect if there truly is an effect. Higher power is better, since I am more likely to have an effect to report.
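Both error rates can be estimated by simulating many hypothetical studies. This is a sketch only: the sample size of 50 per arm, the 0.5 standard deviation effect, the normal-approximation test and all names are assumptions chosen for illustration.

```python
import random
from statistics import NormalDist, mean, variance

def rejects_null(treatment, control, alpha=0.05):
    """Two-sided normal-approximation test of H0: no difference in means."""
    se = (
        variance(treatment) / len(treatment)
        + variance(control) / len(control)
    ) ** 0.5
    z = (mean(treatment) - mean(control)) / se
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha

def rejection_rate(true_effect, n_per_arm=50, sd=1.0, reps=2000, seed=1):
    """Share of simulated studies in which H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        control = [rng.gauss(0.0, sd) for _ in range(n_per_arm)]
        treatment = [rng.gauss(true_effect, sd) for _ in range(n_per_arm)]
        rejections += rejects_null(treatment, control)
    return rejections / reps

# With no true effect, every rejection is a first-type error.
print("Type I error rate:", rejection_rate(0.0))
# With a true effect, the rejection rate is the power; one minus it is
# the second-type error rate.
print("Power:", rejection_rate(0.5))
```

The first rate should sit near the chosen 5% level, while the second depends on the effect size, the sample size and the variance.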

18 Practical steps There are two related ways one might apply this logic: 1. Start from the sample size that you can afford. Figure out the smallest true effect that you could detect with reasonable confidence and power. ➢ This is known as the minimum detectable effect for a given design. 2. Start from a plausible effect size, and figure out how big a sample you need in order to detect it with reasonable confidence and power. ➢ We will focus on this second approach.

19 Practical Steps ➢ Set a pre-specified significance level (5%), i.e. just set the initial point of the line in the graph. ➢ Decide on a level of power. Common values are 80% or 90%. Intuitively, the larger the sample, the larger the power. Power is a planning tool: one minus the power is the probability of being disappointed. ➢ Set a range of pre-specified effect sizes (what you think the program will do). ➢ What is the smallest effect that should prompt a policy response?

20 Picking an Effect Size to choose sample We can guess an effect size using economics, past data on the outcome of interest, or even past evaluations. What is the smallest effect that would justify adopting the program? Cost of this program vs. the benefits it brings. Cost of this program vs. the alternative use of the money.

21 Underpowered Common danger: picking effect sizes that are too optimistic, so that the sample size is set too low to detect the actual effect! Example: Evaluators believe a program will increase high school graduation by 15 percentage points. They survey enough schools to see increases of 12 percentage points or more. The program increased graduation rates by 10 percentage points, but they missed that entirely due to lack of power! They report that the program had no statistically significant effect, even though it actually had one!
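To put a rough number on the graduation example, the sketch below asks how much power a study sized for a 12-point effect has against a true 10-point effect. The slide does not state the design parameters, so the two-sided 5% test, the 80% design power and the normal approximation are all assumptions, and `power_given_mde` is a hypothetical helper.

```python
from statistics import NormalDist

def power_given_mde(true_effect, mde, alpha=0.05, design_power=0.80):
    """Power against `true_effect` for a study sized to detect `mde`.

    Assumptions (not from the slide): two-sided test at `alpha`, the study
    was sized for `design_power` at the MDE, normal approximation. The
    negligible chance of rejecting in the wrong direction is ignored.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(design_power)
    standard_error = mde / (z_alpha + z_power)  # SE implied by the design
    return z.cdf(true_effect / standard_error - z_alpha)

print(round(power_given_mde(true_effect=12, mde=12), 2))  # 0.8
print(round(power_given_mde(true_effect=10, mde=12), 2))  # 0.65
```

Under these assumptions the evaluators would miss a real 10-point gain roughly one time in three.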

22 How difficult is this to do? Proposition I: There exists at least one statistician in the world who has already put into a magic formula the optimal sample size required to address this problem. Proposition II: The rule has also been implemented in almost all computer software. Not difficult to do, and it only requires simple calculations to understand the logic (really simple!).

23 Power: main ingredients 1. Effect Size 2. Sample Size 3. Variance 4. Proportion of sample in T vs. C 5. Clustering

24 Power: main ingredients 1. Effect Size 2. Sample Size 3. Variance 4. Proportion of sample in T vs. C 5. Clustering

25 Larger effect = More Power to Detect A device detects all animals over six feet (1.8 meters) tall. Power to detect adult men: under 10%. Power to detect adult women: under 1%. Power to detect adult mice: 0%. Power to detect adult giraffes: 100%. The taller the animal (effect size) we care about, the more power we have (and the smaller the sample we need).


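26 Effect Size: 1*SE Hypothesized effect size determines the distance between means. [Figure: sampling distributions under H0 (control) and Hβ (treatment), with the standard deviation marked.]

27 Effect Size = 1*SE [Figure: distributions under H0 (control) and Hβ (treatment), with the significance region shaded.]

28 Power: 26% If the true impact was 1*SE, the null hypothesis would be rejected only 26% of the time. [Figure: distributions under H0 (control) and Hβ (treatment), with the power region shaded.]

29 Effect Size: 3*SE Bigger hypothesized effect size → distributions farther apart. [Figure: control and treatment distributions 3*SE apart.]

30 Effect size 3*SE: Power = 91% Bigger effect size means more power. [Figure: distributions under H0 (control) and Hβ (treatment), with the power region shaded.]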
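The 26% and 91% figures can be reproduced with a short calculation. They appear consistent with a one-sided test at the 5% level under a normal approximation; that reading is an inference, not something the slides state, and `power_one_sided` is a hypothetical helper.

```python
from statistics import NormalDist

def power_one_sided(effect_in_se, alpha=0.05):
    """Power of a one-sided test when the true effect equals
    `effect_in_se` standard errors (normal approximation assumed)."""
    z = NormalDist()
    critical_value = z.inv_cdf(1 - alpha)  # about 1.645 for alpha = 0.05
    return 1 - z.cdf(critical_value - effect_in_se)

for k in (1, 3):
    print(f"effect = {k}*SE -> power = {power_one_sided(k):.0%}")
# effect = 1*SE -> power = 26%
# effect = 3*SE -> power = 91%
```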

31 What effect size should you use when designing your experiment? A. Smallest effect size that is still cost effective B. Largest effect size you expect your program to produce C. Both D. Neither (Poll results as recorded: 50%, 50%, 0%, 0%.)

32 Power: main ingredients 1. Effect Size 2. Sample Size 3. Variance 4. Proportion of sample in T vs. C 5. Clustering


33 By increasing sample size you increase... A. Accuracy B. Precision C. Both D. Neither E. Don't know (Poll results as recorded: 50%, 33%, 17%, 0%, 0%.)

34 Larger sample size = More power to detect We want to know the average age in the city. If we randomly pick one person in the city, we might pick a 100-year-old. If we randomly pick 2000 people, even if we pick the 100-year-old as one of them, he will be balanced out by the other random selections. This intuition extends to effect sizes.

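35 Power: Effect size = 1SD, Sample size = N [Figure: control and treatment distributions, with the significance region shaded.]

36 Power: Sample size = 4N [Figure: control and treatment distributions, with the significance region shaded.]

37 Power: 64% [Figure: control and treatment distributions, with the power region shaded.]

38 Power: Sample size = 9N [Figure: control and treatment distributions, with the significance region shaded.]

39 Power: 91% [Figure: control and treatment distributions, with the power region shaded.]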
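The 64% and 91% figures for 4N and 9N follow from the standard error shrinking with the square root of the sample size. The sketch assumes a one-sided 5% test, a normal approximation and a baseline in which the effect equals one standard error at sample size N; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_after_scaling(sample_multiple, base_effect_in_se=1.0, alpha=0.05):
    """Power after multiplying the sample size by `sample_multiple`.

    Assumptions: one-sided test, normal approximation, and a true effect
    equal to `base_effect_in_se` standard errors at the original size.
    """
    z = NormalDist()
    # SE falls by sqrt(multiple), so the effect measured in SEs rises by it.
    effect_in_se = base_effect_in_se * sqrt(sample_multiple)
    return 1 - z.cdf(z.inv_cdf(1 - alpha) - effect_in_se)

for multiple in (1, 4, 9):
    print(f"{multiple}N -> power = {power_after_scaling(multiple):.0%}")
# 1N -> power = 26%
# 4N -> power = 64%
# 9N -> power = 91%
```

Note the diminishing returns: quadrupling the sample only doubles the effect measured in standard errors.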

40 Power: main ingredients 1. Effect Size 2. Sample Size 3. Variance 4. Proportion of sample in T vs. C 5. Clustering

41 More variance = Less power to detect Imagine the following intervention: giving away ten bags of rice. In this example, the program has a large effect on ALL poor people, and no effect on ALL rich people. Low variance: if our population is all poor, we only need to sample one person to see the true effect of giving away rice. High variance: if our population is half poor and half rich and we randomly sample twenty people, what happens if only 5 are poor?

42 What are typical ways to reduce the underlying (population) variance? A. Include covariates B. Increase the sample C. Do a baseline survey D. All of the above E. A and B F. A and C (Poll results as recorded: 80%, 20%, 0%, 0%, 0%, 0%.)

43 Variance There is sometimes very little we can do to reduce the noise. The underlying variance is what it is. We can try to absorb variance by using a baseline or by controlling for other variables. In practice, controlling for other variables (besides the baseline outcome) buys you very little.

44 Power: main ingredients 1. Effect Size 2. Sample Size 3. Variance 4. Proportion of sample in T vs. C 5. Clustering

45 More balanced treatment assignment => More power What's better? 99 people who get the treatment and one control, or 50 treatment and 50 control? This logic continues. What's better? 60 people who get the treatment and 40 control, or 50 treatment and 50 control?

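46 Sample split: 50% C, 50% T [Figure: distributions under H0 (control) and Hβ (treatment), with the significance region shaded.]

47 Power: 91% [Figure: control and treatment distributions, with the power region shaded.]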

48 If it's not split 50-50? What happens to the relative fatness of the distributions if the split is not 50-50? Say 25-75?

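49 Sample split: 25% C, 75% T [Figure: distributions under H0 (control) and Hβ (treatment), with the significance region shaded.]

50 Power: 83% [Figure: control and treatment distributions, with the power region shaded.]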
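The drop from 91% to 83% can be reproduced by noting that the standard error is proportional to 1/sqrt(P(1-P)), where P is the share of the sample assigned to treatment. The sketch assumes a one-sided 5% test, a normal approximation and an effect worth three standard errors at an even split; `power_for_split` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist

def power_for_split(share_treated, effect_in_se_at_even_split=3.0, alpha=0.05):
    """Power for a given treatment share, holding total sample size fixed.

    Assumptions: one-sided test, normal approximation, and a true effect
    equal to `effect_in_se_at_even_split` standard errors when P = 0.5.
    """
    z = NormalDist()
    p = share_treated
    # Relative to a 50-50 split, precision scales with sqrt(P(1-P) / 0.25).
    effect_in_se = effect_in_se_at_even_split * sqrt(p * (1 - p) / 0.25)
    return 1 - z.cdf(z.inv_cdf(1 - alpha) - effect_in_se)

for share in (0.50, 0.75, 0.99):
    print(f"{share:.0%} treated -> power = {power_for_split(share):.0%}")
```

The 50-50 and 25-75 splits give about 91% and 83%, while the 99-to-1 split leaves very little power.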

51 How unbalanced is too unbalanced? Bloom (2006): "Because precision erodes slowly until the degree of imbalance becomes extreme (roughly P ≤ 0.2 or P ≥ 0.8), there is considerable latitude for using an unbalanced allocation." This helps if politics dictate a small control group, or if costs dictate a small treatment group.

52 Power: main ingredients 1. Effect Size 2. Sample Size 3. Variance 4. Proportion of sample in T vs. C 5. Clustering

53 Clustered design: definition In sampling: when clusters of individuals (e.g. schools, communities, etc.) are randomly selected from the population before individuals are selected for observation. In randomized evaluation: when clusters of individuals are randomly assigned to different treatment groups.

54 Clustered design: intuition You want to know how close the upcoming national elections will be. Method 1: Randomly select 50 people from the entire Indian population. Method 2: Randomly select 5 families, and ask ten members of each family their opinion.

55 Low intra-cluster correlation (ICC) aka ρ (rho)

56 HIGH intra-cluster correlation (ρ)

57 All uneducated people live in one village. People with only primary education live in another. College grads live in a third, etc. ICC (ρ) on education will be... A. High B. Low C. No effect on rho D. Don't know

58 If ICC (ρ) is high, what is a more efficient way of increasing power? A. Include more clusters in the sample B. Include more people in clusters C. Both D. Don't know
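One way to explore the clustering question is the standard design effect, 1 + (n - 1)ρ, which converts a clustered sample into the number of independent observations it is worth. The ICC of 0.5 and the family sizes below are hypothetical values chosen for illustration, as is the function name.

```python
def effective_sample_size(n_clusters, cluster_size, icc):
    """Number of independent observations a clustered sample is worth,
    using the design effect 1 + (cluster_size - 1) * icc."""
    design_effect = 1 + (cluster_size - 1) * icc
    return n_clusters * cluster_size / design_effect

icc = 0.5  # hypothetical: opinions within a family are strongly correlated

print(effective_sample_size(50, 1, icc))   # 50 unrelated people
print(effective_sample_size(5, 10, icc))   # 5 families of 10
print(effective_sample_size(10, 10, icc))  # double the families
print(effective_sample_size(5, 20, icc))   # double the people per family
```

Under this assumed ICC, doubling the number of families doubles the effective sample, while doubling the members asked per family barely moves it.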

59 Further topics: Imperfect compliance In some cases, policymakers or researchers can assign individuals to a given treatment arm, but this doesn't mean they will take it up. What does this mean for power? Consider an extreme case in which nobody in the treatment group takes up. In that case, no matter how big the sample size, you can't detect the treatment's impact because you never see it. Alternatively, what happens if everybody ends up getting the treatment in both treatment and control groups? The required sample size is inversely proportional to (c - d)^2, where c is the fraction of the treatment group who comply and d is the fraction of the control group who defy their assignment.
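Because the required sample size is inversely proportional to (c - d)^2, imperfect compliance inflates it quickly. The helper below is a hypothetical illustration of that factor relative to perfect compliance (c = 1, d = 0).

```python
def sample_size_inflation(c, d):
    """Factor by which the required sample grows relative to perfect
    compliance, where c is take-up in treatment and d take-up in control."""
    if c == d:
        raise ValueError("take-up is identical in both groups: no sample is large enough")
    return 1 / (c - d) ** 2

print(sample_size_inflation(1.0, 0.0))  # 1.0: perfect compliance
print(sample_size_inflation(0.5, 0.0))  # 4.0: half of the treatment group takes up
print(round(sample_size_inflation(0.8, 0.2), 2))
```

Halving the difference in take-up between the two groups quadruples the sample needed for the same power.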

60 Wrap-up on Power Power calculations look scary, but they are just a formalization of common sense. At times we do not have the right information to conduct them properly. However, it is important to spend effort on them: Avoid launching studies that will have no power at all (a waste of time and money, and potentially harmful). Devote the appropriate resources to the studies that you decide to conduct (and not too much).

61 Appendix: The nuts and bolts (1) For an experimental design with perfect compliance and individual-level assignment (no clustering):
Minimum detectable effect, for sample size N:
$MDE = (t_{1-\kappa} + t_{\alpha/2}) \sqrt{\dfrac{\sigma^2}{P(1-P)N}}$
Minimum sample size, for hypothesized effect size β:
$N = \dfrac{(t_{1-\kappa} + t_{\alpha/2})^2 \, \sigma^2}{\beta^2 \, P(1-P)}$
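The two formulas for the individual-level design can be sketched in a few lines. One simplification is assumed here: the t critical values are replaced by standard normal quantiles, which is close for large samples. The defaults of 5% significance and 80% power, and all function names, are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

def _multiplier(alpha, power):
    """Normal-quantile stand-in (an assumption) for t_{1-kappa} + t_{alpha/2},
    where kappa is the desired power."""
    z = NormalDist()
    return z.inv_cdf(power) + z.inv_cdf(1 - alpha / 2)

def minimum_detectable_effect(n, sigma, share_treated=0.5, alpha=0.05, power=0.80):
    """MDE for total sample size n and outcome standard deviation sigma."""
    p = share_treated
    return _multiplier(alpha, power) * sqrt(sigma ** 2 / (p * (1 - p) * n))

def minimum_sample_size(beta, sigma, share_treated=0.5, alpha=0.05, power=0.80):
    """Smallest total sample size able to detect an effect of size beta."""
    p = share_treated
    n = _multiplier(alpha, power) ** 2 * sigma ** 2 / (beta ** 2 * p * (1 - p))
    return ceil(n)

n_needed = minimum_sample_size(beta=0.2, sigma=1.0)
print(n_needed)
print(round(minimum_detectable_effect(n_needed, sigma=1.0), 3))
```

The two functions are inverses of each other up to rounding: plugging the computed sample size back in returns an MDE at or just below the hypothesized effect.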

62 Appendix: The nuts and bolts (2) When compliance becomes imperfect, with c the fraction of those assigned to treatment who take up and d the fraction of control who do likewise:
Minimum detectable effect, for sample size N:
$MDE = \dfrac{t_{1-\kappa} + t_{\alpha/2}}{c - d} \sqrt{\dfrac{\sigma^2}{P(1-P)N}}$
Minimum sample size, for hypothesized effect size β:
$N = \dfrac{(t_{1-\kappa} + t_{\alpha/2})^2 \, \sigma^2}{(c-d)^2 \, \beta^2 \, P(1-P)}$

63 Appendix: The nuts and bolts (3) For an experimental design with perfect compliance and group-based assignment:
Minimum detectable effect, for J groups with n members each:
$MDE = (t_{1-\kappa} + t_{\alpha/2}) \sqrt{\dfrac{\sigma^2}{J \, P(1-P)} \left( \rho + \dfrac{1-\rho}{n} \right)}$
Minimum number of groups, for hypothesized effect size β:
$J = \dfrac{(t_{1-\kappa} + t_{\alpha/2})^2 \, \sigma^2}{\beta^2 \, P(1-P)} \left( \rho + \dfrac{1-\rho}{n} \right)$
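The minimum number of groups can be computed in the same way as for the individual-level design. As a simplification assumed here, normal quantiles stand in for the t critical values; the effect size, cluster size and ICC values in the example are hypothetical, as is the function name.

```python
from math import ceil
from statistics import NormalDist

def minimum_clusters(beta, sigma, cluster_size, icc,
                     share_treated=0.5, alpha=0.05, power=0.80):
    """Smallest number of groups J able to detect an effect of size beta.

    Assumption: normal quantiles replace t_{1-kappa} and t_{alpha/2}.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(power) + z.inv_cdf(1 - alpha / 2)
    p = share_treated
    clustering_term = icc + (1 - icc) / cluster_size
    j = multiplier ** 2 * sigma ** 2 / (beta ** 2 * p * (1 - p)) * clustering_term
    return ceil(j)

# Hypothetical design: effect of 0.2 SD, 20 people per group.
for icc in (0.0, 0.05, 0.1, 0.3):
    print(icc, minimum_clusters(beta=0.2, sigma=1.0, cluster_size=20, icc=icc))
```

With ρ = 0 the formula reduces to the individual-level sample size divided by the cluster size, and the required number of groups rises steadily as ρ grows.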

Abdul Latif Jameel Poverty Action Lab Executive Training: Evaluating Social Programs Spring 2009
