Probability Models for Sampling

Chapter 18, May 24, 2013
Sampling Variability in One Act
Probability Histogram for p̂

Act 1
A health study is based on a representative cross section of 6,672 Americans age 18 to 79. A sociologist wants to interview these people, but she only has money to interview 100 of them. To avoid bias, she is going to draw the sample at random. The following is her conversation with a statistician.
Soc: It seems like a lot of work to write all 6,672 names on separate tickets, put them in a box, and draw out 100 at random.
Stat: The computer has a random number generator. It picks a number at random from 1 to 6,672. The person with that code number goes into the sample. Then it picks a second number at random, different from the first. That's the second person to go into the sample. It keeps going like this until it gets 100 people.

Stat: I drew a sample to show you. Look, this sample has 51 men and 49 women. That's pretty close.
Soc: Something isn't right. There are 3,091 men and 3,581 women in the survey: 46% men. I should only have 46 men in my sample.

Stat: Not true. Remember, the people in the sample are drawn at random. Just by the luck of the draw, you could get too many men or too few. I had the computer take a lot of samples for you, 250 in all. The number of men ranged from a low of 34 to a high of 58. Only 17 samples out of the lot had exactly 46 men. Here's a histogram.

Soc: So what stops the number from being exactly 46?
Stat: Chance variability. Each time the computer chooses a person for the sample, it gets either a man or a woman. So the number of men either stays the same or goes up by one. The chances are 46 to 54 each time.
Soc: What happens if we increase the size of the sample? Won't it come out more like the population?
Stat: Right. Suppose we increase the sample size by a factor of four, to 400. I got the computer to draw another 250 samples, this time with 400 people in each sample. With some of these samples, the percentage of men is below 46%; with the others it is above. The low is 39% and the high is 54%. Here's a histogram.

Stat: Multiplying the sample size by four cuts the likely size of the chance error in the percentage by a factor of two.

Soc: Can you explain what chance error is?
Stat: Sure, here is an equation:
p̂ = p + chance error
Of course, the chance error in p̂ will be different from sample to sample.
Soc: So if I let you draw a sample with this random number business, can you tell me exactly how big the chance error will be in p̂ for my sample?
Stat: Not exactly, but I can tell you its likely or typical size.
Soc: O.K. good, but wait, there is a point I missed earlier.

Soc: How can you have 250 different samples with 100 people each? I mean, 250 × 100 = 25,000, and we only started with 6,672 people.
Stat: Ah. The samples are all different, but they have some people in common. Look at the sketch. The inside of the circle is like the 6,672 people and each shaded strip is a sample. The strips are different, but they overlap.
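The statistician's computer experiment can be imitated in a few lines. The following is a minimal sketch, not the program used in the dialogue: it assumes Python's standard random module plays the role of the random number generator, and the names population, sample_percent_men, and spread_of_percentages are made up for illustration. It draws 250 samples without replacement at each sample size and compares the spread of the sample percentages.

```python
import random
import statistics

MEN, WOMEN = 3091, 3581                  # counts quoted in the dialogue
population = [1] * MEN + [0] * WOMEN     # 1 = man, 0 = woman


def sample_percent_men(n, rng):
    """Draw n people at random without replacement; return the % of men."""
    return 100 * sum(rng.sample(population, n)) / n


def spread_of_percentages(n, repetitions=250, seed=0):
    """S.D. of the sample percentage over many repeated samples of size n."""
    rng = random.Random(seed)
    percents = [sample_percent_men(n, rng) for _ in range(repetitions)]
    return statistics.pstdev(percents)


sd_100 = spread_of_percentages(100)
sd_400 = spread_of_percentages(400)
print(f"n = 100: S.D. of sample % is about {sd_100:.1f} points")
print(f"n = 400: S.D. of sample % is about {sd_400:.1f} points")
print(f"ratio: {sd_100 / sd_400:.2f}")
```

If the statistician's claim holds, the printed ratio should come out near 2. It will not be exactly 2, because the simulation is itself subject to chance variability.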

2.0 What do we expect p̂ to be?
The sociologist took a sample of size 100 from a population of 3,091 men and 3,581 women. The box can be represented as:
3,091 tickets marked 1 and 3,581 tickets marked 0
or
46% 1's and 54% 0's
100 draws are made without replacement. And p̂, the proportion of men in the sample, is calculated as:
p̂ = (sum of 1's drawn) / 100
On each draw, how likely is she to get a male? How many men should we expect in a sample of size 100?

2.1 What is the S.D. of p̂?
Defn: The S.D. of p̂ is
$\text{S.D. of } \hat{p} = \dfrac{\text{S.D. of numbers in box}}{\sqrt{100}}$
What is the standard deviation of the numbers in the box? What is the likely size of the chance error in p̂?
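As a rough check on the questions in 2.0 and 2.1, the box can be built directly and its S.D. computed. This is a sketch under stated assumptions: the box is coded as a Python list of 1's and 0's, the variable names are illustrative, and the S.D. of the box is divided by the square root of the number of draws, treating the draws as if made with replacement.

```python
import math
import statistics

box = [1] * 3091 + [0] * 3581       # 1 = man, 0 = woman
n = 100                             # number of draws

p = sum(box) / len(box)             # fraction of 1's in the box
sd_box = statistics.pstdev(box)     # S.D. of the numbers in the box
expected_men = n * p                # expected number of men in the sample
sd_p_hat = sd_box / math.sqrt(n)    # likely size of the chance error in p-hat

print(f"chance of a man on each draw: {p:.3f}")
print(f"expected number of men in {n} draws: {expected_men:.1f}")
print(f"S.D. of the box: {sd_box:.3f}")
print(f"S.D. of p-hat: {sd_p_hat:.3f}")
```

Under these assumptions the sample should contain about 46 men, and p̂ should be off from 46% by about 5 percentage points or so.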

2.2 Expected Value and S.D. of p̂
In summary:
The expected value of p̂ is p.
The S.D. of p̂ is $\sqrt{\dfrac{p(1-p)}{n}}$.
These formulas are exact when drawing with replacement. They are good approximations when drawing without replacement, provided the number of draws is small relative to the number of tickets in the box.

2.3 Example
A university has 25,000 students, of whom 10,000 are older than 25. The registrar draws an S.R.S. of 400 students. Find the expected value and the S.D. of the proportion of students in the sample who are older than 25.
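One way to work this example, using the formulas from 2.2 and treating the 400 draws as if made with replacement (400 is small relative to 25,000):

```latex
\begin{aligned}
p &= \frac{10{,}000}{25{,}000} = 0.40 \\
\text{expected value of } \hat{p} &= p = 0.40 \\
\text{S.D. of } \hat{p} &= \sqrt{\frac{p(1-p)}{n}}
  = \sqrt{\frac{0.40 \times 0.60}{400}}
  = \sqrt{0.0006} \approx 0.0245
\end{aligned}
```

So the sample proportion is expected to be 40%, give or take about 2.4 percentage points.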

3.0 Using the Normal Curve
Recall the Central Limit Theorem: Suppose an experiment consists of drawing n tickets (at random, with replacement) from a box of numbered tickets and the outcome is the sum (or average, count, proportion) of the numbers drawn. Then the probability histogram of the sum (or average, count, proportion) will converge to the normal curve as n gets bigger.
Therefore the probability histogram of p̂ will converge to (get closer and closer to) a normal curve as n gets bigger. The mean of this curve is p and the S.D. of this curve is $\sqrt{\dfrac{p(1-p)}{n}}$.

3.1 Example
In a certain town, the telephone company has 100,000 subscribers. It plans to take an S.R.S. of 400 for a market research study. According to the Census data, 20% of the company's subscribers earn more than $50,000 a year. What is the chance that between 18% and 22% of the people in the sample will earn more than $50,000 a year?
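The normal approximation for this example can be sketched as follows. The sketch assumes Python 3.8 or later, where statistics.NormalDist is available, and treats the 400 draws as if made with replacement (400 is small relative to 100,000). The variable names are illustrative.

```python
import math
from statistics import NormalDist

p = 0.20     # fraction of subscribers earning more than $50,000
n = 400      # sample size

sd_p_hat = math.sqrt(p * (1 - p) / n)      # S.D. of p-hat
curve = NormalDist(mu=p, sigma=sd_p_hat)   # normal curve for p-hat

# Area under the normal curve between 18% and 22%
chance = curve.cdf(0.22) - curve.cdf(0.18)

print(f"S.D. of p-hat: {sd_p_hat:.3f}")
print(f"chance that p-hat is between 18% and 22%: {chance:.2f}")
```

Here the S.D. of p̂ works out to 2 percentage points, so the interval from 18% to 22% is one S.D. on either side of the expected value, and the chance is about 68%.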

3.2 Example
In 1965, the U.S. Supreme Court decided the case of Swain v. Alabama. Swain, a black man, was convicted in Talladega County of raping a white woman and sentenced to death. The case was appealed on the grounds that there were no blacks on the jury. At that time in Talladega County there were 16,000 men, of whom 26% were black. If 100 people were chosen at random from this population, what is the chance that 8 or fewer would be black? What do you conclude?
Aside: The appeal was denied on the grounds that there were 8 blacks on a panel of 100 from which the jury was selected.
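A rough calculation for this example, again by the normal curve. This is a sketch under simplifying assumptions: the 100 draws are treated as if made with replacement (100 is small relative to 16,000), no continuity correction is applied, and statistics.NormalDist (Python 3.8 or later) stands in for a normal table. The variable names are illustrative.

```python
import math
from statistics import NormalDist

p = 0.26     # fraction of black men in the population
n = 100      # number of people chosen at random

sd_p_hat = math.sqrt(p * (1 - p) / n)   # S.D. of p-hat
observed = 8 / n                        # 8 out of 100, as a proportion

# Standard units: how many S.D.s below the expected value is 8%?
z = (observed - p) / sd_p_hat

# Area under the normal curve to the left of z
chance = NormalDist().cdf(z)

print(f"S.D. of p-hat: {sd_p_hat:.3f}")
print(f"8% is {abs(z):.1f} S.D.s below the expected 26%")
print(f"approximate chance of 8 or fewer: {chance:.6f}")
```

With these numbers, 8% sits about 4 S.D.s below the expected value, so the chance under random selection is well below 1 in 10,000.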