Statistics Lecture 13: Sampling Distributions (Chapter 18)

1. Definitions again

Review the definitions of POPULATION, SAMPLE, PARAMETER and STATISTIC.

STATISTICAL INFERENCE: a situation where the population parameters are unknown, and we draw conclusions from sample outcomes (those are statistics) to make statements about the value of the population parameters.

When random samples are drawn from a population of interest to represent the whole population, they are generally unbiased and representative. The key to understanding why samples behave this way is a difficult concept: THE SAMPLING DISTRIBUTION.

The sampling distribution is a theoretical/conceptual/ideal probability distribution of a statistic. A theoretical probability distribution is what the outcomes (i.e. statistics) of some random process (e.g. drawing a sample from a population) would look like if you could repeat the random process over and over again and had information (that is, statistics) from every possible sample.

The sampling distribution shows how a statistic varies from sample to sample and the pattern of possible values a statistic takes. We do not actually see sampling distributions in real life; they are simulated.

2. Sampling Distributions for Means

Let's suppose that the 1,428 or so people in this example are a population. And here is the mean $\mu_y$ (mu) and standard deviation $\sigma_y$ (sigma) of our population:

                                 age
    -------------------------------------------------------------
          Percentiles      Smallest
     1%           19             18
     5%           22             18
    10%           25             18       Obs                1425
    25%           32             18       Sum of Wgt.        1425

    50%           42                      Mean           45.42035
                            Largest       Std. Dev.      17.11534
    75%           56             89
    90%           72             89       Variance       292.9348
    95%           79             89       Skewness       .5865022
    99%           87             89       Kurtosis       2.504332

Suppose we draw a simple random sample of size $n$ from a large population. Call the observed values $y_1, y_2, \ldots, y_n$. An example: draw a simple random sample (SRS) of 25 from the 1,425 persons with measured age. Measure the average age from the sample of size 25 and compare it to the population average.

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
             age |        25       42.68    14.25868         21         71

A statistic: the mean of the sample of 25, 42.68, is just the old mean (from Chapter 5). Here are the ages of the 25 people who were sampled:

    71 60 55 55 43 41 25 30 24 43 24 50 36 66 57 32 29 21 41 43 26 58 43 55 39

We define the mean of a single sample as

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

(this is from Chapter 5), and we define the standard deviation of a single sample as

$$S_y = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$$

(also from Chapter 5). $\bar{y}$ can be thought of as the mean of a single sample of size 25 selected at random from all possible samples of size 25 that could have been generated from the population.

RULE 1: The mean of all possible sample means (all possible $\bar{y}$) is denoted $\mu_{\bar{y}}$, which in theory should be equal to $\mu_y$ (the true population mean). In other words, the mean of sample means ($\mu_{\bar{y}}$) calculated from all possible samples of the same size from the same population should be equal to the true population mean $\mu_y$.

We can check this using a simulation. If I were to draw 10,000 samples of size 25 (with replacement) from our population of 1,428 (with mean age of 45.42035 years), the mean of all 10,000 sample means should, in theory, be very close to our true population mean.

                               r(mean)
    -------------------------------------------------------------
          Percentiles      Smallest
     1%        37.72           34.8
     5%        39.88          35.12
    10%        41.04          35.16       Obs               10000
    25%           43          35.24       Sum of Wgt.       10000

    50%        45.36                      Mean           45.39324
                            Largest       Std. Dev.       3.43353
    75%        47.68           56.6
    90%        49.84          56.64       Variance       11.78913
    95%        51.16          58.16       Skewness       .1262797
    99%        53.76          58.48       Kurtosis       2.951363

This is the summary of 10,000 sample means from samples of size 25 drawn with replacement from our original population of 1,428. A graph of these 10,000 sample means is shown below. We got 45.39324 as the mean of the 10,000 sample means of all of our samples of size 25; this is very close to the true $\mu_{\bar{y}}$ (or $\mu_y$) of 45.42035.
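The Chapter 5 formulas for $\bar{y}$ and $S_y$ can be checked against the 25 ages listed for the first sample. The following is a minimal sketch using only the Python standard library; the variable names are illustrative and are not part of the lecture.

```python
import math

# The 25 sampled ages listed in the lecture.
ages = [71, 60, 55, 55, 43, 41, 25, 30, 24, 43, 24, 50, 36,
        66, 57, 32, 29, 21, 41, 43, 26, 58, 43, 55, 39]

n = len(ages)

# Sample mean: sum of the observations divided by n.
y_bar = sum(ages) / n

# Sample standard deviation: squared deviations from the mean,
# summed, divided by (n - 1), then square-rooted.
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ages) / (n - 1))

# Population mean quoted in the lecture's summary table.
mu_y = 45.42035

print(f"n = {n}, sample mean = {y_bar:.2f}, sample SD = {s_y:.5f}")
print(f"sample mean minus population mean = {y_bar - mu_y:.2f}")
```

The printed mean and standard deviation should agree with the one-line summary of the sample given earlier (42.68 and 14.25868).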

[Figure: histogram of the 10,000 sample means from the 10,000 samples of size 25; horizontal axis r(mean), from 35 to 60; vertical axis Density.]

Look familiar? Because the mean of all sample means $\mu_{\bar{y}}$ equals $\mu_y$, the sample mean $\bar{y}$ is considered an unbiased estimator of $\mu_y$ (the true population mean) when it comes from a random sample. If your samples are not random, this relationship will not hold. For our first sample of 25 people, the mean of the sample is 42.68, but the mean of all 10,000 of the sample means is 45.39, and it's not too different from the true population mean of 45.42.

RULE 2. The theoretical standard deviation of all possible $\bar{y}$'s from all possible samples of size $n$ is

$$\sigma_{\bar{y}} = \frac{\sigma_y}{\sqrt{n}}$$

where $\sigma_y$ is the standard deviation of the population. In our population data, $\sigma_y$ is 17.11534, so the theoretical standard deviation for a distribution of all possible sample means from samples of size 25 should be

$$\sigma_{\bar{y}} = \frac{17.11534}{\sqrt{25}} = \frac{17.11534}{5} = 3.423068$$

We can check whether this holds true or not by examining the results of the simulation in the output above: the standard deviation for our 10,000 sample means (from our samples of size 25) is 3.43353, again very close to what we get from the theory (3.423068). This rule is approximately correct as long as your sample is no larger than 5% of your population. So please make a note of this:

o A sample has a mean $\bar{y}$ and it has a standard deviation $s$.
o A population has a mean $\mu_y$ and a standard deviation $\sigma_y$.
o A sampling distribution, or a distribution of all possible sample statistics (in this case the sample mean), also has a mean, denoted $\mu_{\bar{y}}$, and in theory it's equal to $\mu_y$, but with a standard deviation of $\sigma_{\bar{y}} = \sigma_y / \sqrt{n}$.

Our sample (or any real-life sample) is just one single realization of all possible samples from a population of samples.

The standard deviation $\sigma_{\bar{y}} = \sigma_y / \sqrt{n}$ of all the SAMPLE MEANS will be smaller than the standard deviation for a single sample. In other words, it is easier to predict the mean of many observations than it is to predict the value of a single observation (or to predict the average of small samples). What is causing this? Examine the formula for the standard deviation of the sampling distribution and note the effect of sample size on the standard deviation of all sample means. The bigger the sample size $n$ gets, the smaller $\sigma_{\bar{y}} = \sigma_y / \sqrt{n}$ becomes.

Some things to consider

How close is $\bar{y}$ to $\mu_y$, or in other words, how accurate will our samples be? In order to answer this, you will need to know the standard deviation of the population $\sigma_y$ and the sample size $n$. Note how the standard deviation of the sampling distribution changes with sample size. For big samples, the standard deviation for the sample mean will be small, and for small samples, the standard deviation for the sample mean is large.

3. RULES 3 & RULE 4: Normal Distributions and The Central Limit Theorem

Given a simple random sample of size $n$ from a population having mean $\mu_y$ and standard deviation $\sigma_y$, the sample mean $\bar{y}$ will come from a sampling distribution of all possible sample means with mean $\mu_{\bar{y}} = \mu_y$ and standard deviation $\sigma_{\bar{y}} = \sigma_y / \sqrt{n}$.

A. Basic Distributional Result

If the original population had a normal distribution, then the distribution of the sample mean will also be normally distributed. This is good, because it means we can use the normal table to make inferences about a particular sample with a statement of probability or chance.

Example. IQ scores are normally distributed with a mean of 100 and a standard deviation of 15.
A sample of 25 persons is drawn. How likely is it to get a sample average of 108 or more? (0.38%, since the standard deviation of the sample mean is $15/\sqrt{25} = 3$, giving $Z = (108 - 100)/3 \approx 2.67$.) How likely is it for the first score to be 108 or more? (29.8%, using $Z = (108 - 100)/15 \approx 0.53$.)

B. The Central Limit Theorem (p. 343)

No matter what the distribution of the original population (recall that our original one is right skewed, with a skewness of about .59), if the sample size is "sufficient", the distribution of the possible sample means will be close to the normal distribution. It is a very powerful theorem, and it is the reason why the normal distribution is so well studied.
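The two IQ probabilities in the example can be computed directly rather than read from a normal table. The sketch below uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative. Because it does not round Z to two decimals, the single-score answer comes out near 29.7% rather than the table-based 29.8%.

```python
import math
from statistics import NormalDist

mu = 100        # population mean of IQ scores
sigma = 15      # population standard deviation
n = 25          # sample size
threshold = 108

# Distribution of a single score: Normal(mu, sigma).
single_score = NormalDist(mu, sigma)
p_single = 1 - single_score.cdf(threshold)

# Distribution of the mean of n scores: Normal(mu, sigma / sqrt(n)).
sample_mean = NormalDist(mu, sigma / math.sqrt(n))
p_mean = 1 - sample_mean.cdf(threshold)

print(f"P(single score >= {threshold}) = {p_single:.4f}")
print(f"P(sample mean  >= {threshold}) = {p_mean:.4f}")
```

The only difference between the two calculations is the standard deviation: 15 for one score, 15/5 = 3 for the mean of 25 scores.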

C. Summary

Take a simple random sample from a population with mean $\mu_y$ and standard deviation $\sigma_y$. Let $\bar{y}$ be the average of the sample taken from the population. If either the original population is normally distributed OR the sample size is sufficiently large, then the $\bar{y}$ will be normally distributed with mean $\mu_{\bar{y}} = \mu_y$ and standard deviation $\sigma_{\bar{y}} = \sigma_y / \sqrt{n}$.

Put in terms of histograms: if the histogram for the population follows a normal curve, or if the sample size is large enough each time, then the histogram for the possible values of $\bar{y}$ will follow a normal curve that has a mean of $\mu_y$ and a standard deviation of $\sigma_{\bar{y}} = \sigma_y / \sqrt{n}$.

Thus, about 68% of the $\bar{y}$ will be within one standard deviation of the true population mean, about 95% of the $\bar{y}$ will be within two standard deviations, and about 99.7% of the $\bar{y}$ will be within 3 SEs (the standard deviation here is that of the sampling distribution, $\sigma_{\bar{y}}$).

Let's go back to our first sample of 25 with its mean of 42.68. The chance of getting a mean that low or lower is found in two steps. (1) Calculate

$$Z = \frac{42.68 - 45.42}{17.11534 / \sqrt{25}} \approx -0.80$$

(2) Do a look-up from the standard normal table, and you get .2119 in the area beyond Z. So the chance (probability) of drawing a sample of size 25 with an average of 42.68 or lower when you were expecting the average to be 45.42 was about 21.19%. Our interpretation is that about 21% of the time you would get a sample average as low as the one we got. This suggests that it's not too unusual to be this far from the true average even though you have done everything correctly (e.g. random sample).

NOTE: The Central Limit Theorem only applies to the distribution of possible sample averages (i.e. the sampling distribution); it says nothing about the distribution of individual scores in either the sample or the population. The two graphs below illustrate this: the first shows our age variable for the population, and the second shows our sample of size 25 (from the beginning of this lecture).
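The two-step Z calculation for the first sample of 25 (mean 42.68) can be reproduced as a short sketch. It uses `statistics.NormalDist` in place of the printed normal table, and the variable names are illustrative. Since Z is not rounded to -0.80 here, the probability comes out slightly below the table value of .2119.

```python
import math
from statistics import NormalDist

mu_y = 45.42035      # population mean age (from the lecture's summary table)
sigma_y = 17.11534   # population standard deviation
n = 25               # sample size
y_bar = 42.68        # mean of the first sample

# Step 1: standard deviation of the sampling distribution, then Z.
se = sigma_y / math.sqrt(n)
z = (y_bar - mu_y) / se

# Step 2: area under the standard normal curve at or below Z.
p_lower = NormalDist().cdf(z)

print(f"standard deviation of sample means = {se:.6f}")
print(f"Z = {z:.2f}")
print(f"P(sample mean <= {y_bar}) = {p_lower:.4f}")
```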

[Figure: two histograms of age with Density on the vertical axis. First panel: the population, age axis from 20 to 100. Second panel: the sample of size 25, age axis from 20 to 70.]

Notice: neither one is normal, but we can use the normal curve to help us make statements of chance and accuracy because of the sampling distribution (it's normal as long as the sample size is sufficiently large).
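A small simulation can illustrate why the normal curve is usable even when the population is not normal. The sketch below does not use the lecture's data: it builds a hypothetical right-skewed population of 1,425 "ages" from a gamma distribution (an assumption made purely for illustration), draws 10,000 samples of size 25 with replacement, and compares the sample means with Rules 1 and 2. The `skewness` helper is also illustrative.

```python
import math
import random
import statistics

def skewness(values):
    """Moment-based skewness: third central moment divided by SD cubed."""
    m = statistics.fmean(values)
    m2 = sum((v - m) ** 2 for v in values) / len(values)
    m3 = sum((v - m) ** 3 for v in values) / len(values)
    return m3 / m2 ** 1.5

rng = random.Random(13)  # fixed seed so the run is repeatable

# Hypothetical right-skewed population (NOT the lecture's age data).
population = [18 + rng.gammavariate(2.0, 13.5) for _ in range(1425)]
mu_y = statistics.fmean(population)
sigma_y = statistics.pstdev(population)  # population SD (divisor N)

n = 25
n_samples = 10_000

# Draw each sample with replacement and keep only its mean.
sample_means = [statistics.fmean(rng.choices(population, k=n))
                for _ in range(n_samples)]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theory_sd = sigma_y / math.sqrt(n)

print(f"population mean      = {mu_y:.3f}")
print(f"mean of sample means = {mean_of_means:.3f}")
print(f"theoretical SD       = {theory_sd:.3f}")
print(f"SD of sample means   = {sd_of_means:.3f}")
print(f"skewness, population   = {skewness(population):.3f}")
print(f"skewness, sample means = {skewness(sample_means):.3f}")
```

The mean and standard deviation of the simulated sample means should land close to $\mu_y$ and $\sigma_y / \sqrt{n}$, and their skewness should be much nearer zero than the population's, which is the pattern the Central Limit Theorem describes.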

4. A special case of means: The proportion

A proportion could be thought of as the mean of a special kind of population. The population only has values of 1 or 0. If a population has that feature, the population mean is $p$, which is the proportion of 1's in your population. And the population standard deviation is

$$\sigma_p = \sqrt{p \cdot q}$$

where $q$ is the value of $(1.0 - p)$. For example:

         clinton |      Freq.     Percent        Cum.
     ------------+-----------------------------------
               0 |        379       43.36       43.36
               1 |        495       56.64      100.00
     ------------+-----------------------------------
           Total |        874      100.00

                               clinton
    -------------------------------------------------------------
          Percentiles      Smallest
     1%            0              0
     5%            0              0
    10%            0              0       Obs                 874
    25%            0              0       Sum of Wgt.         874

    50%            1                      Mean           .5663616
                            Largest       Std. Dev.      .4958603
    75%            1              1
    90%            1              1       Variance       .2458775
    95%            1              1       Skewness      -.2678155
    99%            1              1       Kurtosis       1.071725

Proportions also have a sampling distribution: it's a distribution of sample proportions $\hat{p}$, and this distribution has a mean of $p$ and a standard deviation of

$$\sigma_{\hat{p}} = \sqrt{\frac{pq}{n}}$$

And if I were to run a simulation of samples of size 25 for 10,000 samples:

                               clinton
    -------------------------------------------------------------
          Percentiles      Smallest
     1%     .2666667       .0833333
     5%     .3529412       .0833333
    10%           .4       .0909091       Obs               10000
    25%     .4705882             .1       Sum of Wgt.       10000

    50%     .5714286                      Mean           .5657526
                            Largest       Std. Dev.      .1289778
    75%     .6470588       .9411765
    90%     .7333333       .9444444       Variance       .0166353
    95%     .7692308              1       Skewness      -.0970473
    99%     .8571429              1       Kurtosis       2.948515
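The population values of $p$, $q$ and $\sqrt{p \cdot q}$ can be recomputed from the frequency table for the 0/1 variable. The sketch below is illustrative only. It also assumes that the "Std. Dev." in the summary output uses the $n - 1$ divisor of the Chapter 5 sample formula, which would explain why that figure is slightly larger than $\sqrt{p \cdot q}$.

```python
import math

# Counts from the lecture's frequency table for the 0/1 variable.
n_zeros = 379
n_ones = 495
N = n_zeros + n_ones

p = n_ones / N   # proportion of 1's, which is also the mean
q = 1.0 - p      # proportion of 0's

# Population standard deviation of a 0/1 variable (divisor N).
sigma_p = math.sqrt(p * q)

# Same quantity with the (N - 1) divisor; assumed here to be what the
# summary output reports as "Std. Dev.".
sd_n_minus_1 = math.sqrt(p * q * N / (N - 1))

print(f"p = {p:.7f}, q = {q:.7f}")
print(f"sqrt(p*q)            = {sigma_p:.7f}")
print(f"SD with N-1 divisor  = {sd_n_minus_1:.7f}")
```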

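[Figure: histogram of the 10,000 simulated sample proportions; horizontal axis r(mean), from 0 to 1; vertical axis Density, from 0 to 4.]

We can see that proportions behave like the mean: in theory, the distribution of sample proportions wants to center on the value of $p$ (the true population proportion) and have a standard deviation of $\sigma_{\hat{p}} = \sqrt{pq/n}$.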
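The behavior of sample proportions can be checked with a simulation along the same lines as the one for means. The sketch below rebuilds a 0/1 population from the frequency table (379 zeros and 495 ones) and draws 10,000 samples with replacement; the variable names and the seed are illustrative. It assumes every sample contains exactly 25 usable values, so if the lecture's samples had fewer non-missing responses, the standard deviation reported there would be larger than the one computed here.

```python
import math
import random
import statistics

rng = random.Random(18)  # fixed seed so the run is repeatable

# 0/1 population rebuilt from the frequency table: 379 zeros, 495 ones.
population = [0] * 379 + [1] * 495
p = sum(population) / len(population)
q = 1.0 - p

n = 25
n_samples = 10_000

# Each sample proportion is the mean of n draws made with replacement.
sample_props = [sum(rng.choices(population, k=n)) / n
                for _ in range(n_samples)]

mean_of_props = statistics.fmean(sample_props)
sd_of_props = statistics.stdev(sample_props)
theory_sd = math.sqrt(p * q / n)

print(f"p                          = {p:.4f}")
print(f"mean of sample proportions = {mean_of_props:.4f}")
print(f"theoretical SD sqrt(pq/n)  = {theory_sd:.4f}")
print(f"SD of sample proportions   = {sd_of_props:.4f}")
```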