Statistics Lecture 8 Sampling Distributions (Chapter 6-, 6-3)

1. Definitions again

Review the definitions of POPULATION, SAMPLE, PARAMETER and STATISTIC.

STATISTICAL INFERENCE: a situation where the population parameters are unknown, and we draw conclusions from sample outcomes (those are statistics) to make statements about the value of the population parameters. (p. 94 of the text refers to measuring sample reliability/trustworthiness.)

When random samples are drawn from a population of interest to represent the whole population, they are generally unbiased and representative. The key to understanding why samples behave this way is a difficult concept: THE SAMPLING DISTRIBUTION.

The sampling distribution is a theoretical/conceptual/ideal probability distribution of a statistic. A theoretical probability distribution is what the outcomes (i.e. statistics) of some random process (e.g. drawing a sample from a population) would look like if you could repeat the random process over and over again and had information (that is, the statistics) from every possible sample. The sampling distribution shows how a statistic varies from sample to sample and the pattern of possible values a statistic takes. We do not actually see sampling distributions in real life; we can only simulate them.

2. Sampling Distributions for Means

Generally, the objective in sampling is to estimate a population mean µ from sample information. Let's suppose that the 78,455 or so people in this example are a population. Here are the mean µ (mu) and standard deviation σ (sigma) of our population:

                            HINC
-------------------------------------------------------------
      Percentiles      Smallest
 1%          500              0
 5%         0000             90
10%         6970            400       Obs               78455
25%        36090            450       Sum of Wgt.       78455

50%        63000                      Mean           78163.53
                        Largest       Std. Dev.      63610.55
75%        03000         409000
90%         5300          48000       Variance       4.05e+09
95%         9000         437740       Skewness         .05389
99%       330000         634000       Kurtosis          0.744

[Figure: density plot of HINC for the population; horizontal axis HINC, vertical axis Density]

Suppose we draw a simple random sample of size n from a large population. A simple random sample is a sample where (1) each member of the population had the same chance of being selected (unbiased) and (2) the selection of one member has no effect on the probability of another member being selected (independent). Since the sample observations come from the same population, we say that the observations are independent, identically distributed (i.i.d.). For the samples in this class, you should assume this condition. Let us call the observed values from the sample \(X_1, X_2, \dots, X_n\).

An example: draw a simple random sample (SRS) of 25 from the 78,455 households with measured household income. Measure the average from the sample of size 25 and compare it to the population average.

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        hinc |        25       853.6      775.65          0     385000
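The draw described above can be sketched in Python. This is only an illustration: the household income data are not reproduced in these notes, so the sketch assumes a synthetic right-skewed stand-in population (lognormal, with made-up parameters). The names `population` and `draw_srs`, the population size, and the seed are all hypothetical.

```python
import random
import statistics

rng = random.Random(8)  # fixed seed so the sketch is repeatable

# ASSUMPTION: synthetic stand-in for the household income population.
# The lognormal shape and its parameters are illustrative, not the HINC data.
population = [rng.lognormvariate(11.0, 0.75) for _ in range(50_000)]

def draw_srs(pop, n, rng):
    """Simple random sample without replacement: every member is equally likely."""
    return rng.sample(pop, n)

sample = draw_srs(population, 25, rng)

mu = statistics.fmean(population)   # population mean (a parameter)
x_bar = statistics.fmean(sample)    # sample mean (a statistic)
s = statistics.stdev(sample)        # sample standard deviation (n - 1 divisor)

print(f"population mean: {mu:,.2f}")
print(f"sample mean:     {x_bar:,.2f}")
print(f"sample std dev:  {s:,.2f}")
```

Changing the seed gives a different sample and therefore a different sample mean, which is exactly the sample-to-sample variation that the rest of the lecture describes.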
A statistic: The mean of the sample of 25, $8,53.60, is just the plain old mean (from Chapter page 34). Here are the household incomes of the 25 households that were sampled:

hinc
 1. 4900      2. 33000     3. 30000     4. 385000    5. 6040
 6. 47300     7. 5000      8. 5000      9. 56030    10. 0000
11. 5000     12. 60000    13. 400      14. 37640    15. 56500
16. 33700    17. 6500     18. 4000     19. 04500    20. 04390
21. 5700     22. 80       23. 0        24. 450      25. 86000

We define the mean of a single sample as

\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} \]

(this is from chapter ), and we define the standard deviation of a single sample as

\[ s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]

(also from chapter ). The sample mean above can be thought of as the mean of a single sample of size 25 selected at random from all possible samples of size 25 that could have been generated from the population.

A. The Expected Value of the Sample Mean

We certainly would have liked to have done better; that is, a sample mean of $8,53.60 is not the same as the population mean of $78,163.53. Is the sample mean a good estimator of the true population mean µ? Theory says: let's think of a sample from a population as being a set of random variables. In other words, while we might know what might be possible with respect to household incomes, we don't know what the sample will look like until it's actually drawn.

The sample mean (from page 96 of your text) is defined as a combination of random variables:

\[ \bar{X} = \frac{1}{n}\left[ X_1 + X_2 + \dots + X_n \right] \]

The sample mean, being a linear combination of random variables, is itself a random variable. So now we ask the question: what are the expected value and the variance (or standard deviation) of the sample mean, a random variable?

\[ E(\bar{X}) = \frac{1}{n}\left[ E(X_1) + E(X_2) + \dots + E(X_n) \right] \]

But each random variable \(X_i\) will have a distribution p(x) with mean µ. So \(E(X_1) = E(X_2) = \dots = E(X_n) = \mu\), and then

\[ E(\bar{X}) = \frac{1}{n}\left[ \mu + \mu + \dots + \mu \right] = \frac{1}{n}\left[ n\mu \right] = \mu \]

The interpretation: on average, the sample mean will be expected to be, or should be, equal to µ.

RULE 1: The mean of all possible sample means (all possible \(\bar{x}\) of the same size sampled from the same population) is denoted \(\mu_{\bar{X}}\), which in theory should be equal to µ (the true population mean).
In other words, the mean of sample means calculated from all possible samples of the same size from the same population should be equal to the true population mean. We can check this using a simulation. If I were to draw 10,000 samples of size 25 (with replacement) from our population of 78,455 (with mean income of $78,163.53), the mean of all 10,000 sample means should be, in theory, equal to our true population mean.

                          r(mean)
-------------------------------------------------------------
      Percentiles      Smallest
 1%         5905           40.4
 5%       5954.4           4048
10%        6887.         4397.6       Obs               10000
25%      69445.4           47.8       Sum of Wgt.       10000

50%       7750.6                      Mean           78209.64
                        Largest       Std. Dev.      12437.97
75%        8695.          354.8
90%      94498.4          33668       Variance       1.55e+08
95%        99679          34684       Skewness        .376673
99%        0537.        37485.6       Kurtosis          3.889

This is the overall average of 10,000 sample means from samples of size 25 drawn with replacement from our original population of 78,455. We got $78,209.64 as the mean of the 10,000 sample means of all of our samples of size 25. This is very close to the true population mean of $78,163.53 (we are off by .059%).

Here's a graph of the 10,000 sample means from our 10,000 samples of size 25:

[Figure: density plot of the 10,000 sample means; horizontal axis r(mean), vertical axis Density]

Does it look familiar?

The sample mean is considered an unbiased estimator of µ (the true population mean) when it comes from a random sample, because the mean of all possible sample means equals µ. If your samples are not random, this relationship will not hold. For our first sample of 25 households, the mean of the sample is $8,53.60, but the mean of all 10,000 of the sample means is $78,209.64, and it's not too different from the true population mean of $78,163.53.
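The simulation check of RULE 1 can be sketched in Python along the following lines. The household income data are not available here, so this sketch assumes a synthetic right-skewed population (lognormal with made-up parameters); `population`, `sample_means`, and the seed are hypothetical names and values, and the printed numbers will not match the output above.

```python
import random
import statistics

rng = random.Random(8)  # fixed seed so the sketch is repeatable

# ASSUMPTION: synthetic right-skewed stand-in population (not the HINC data).
population = [rng.lognormvariate(11.0, 0.75) for _ in range(50_000)]
mu = statistics.fmean(population)

def sample_means(pop, n, reps, rng):
    """Draw `reps` samples of size n with replacement; return their means."""
    return [statistics.fmean(rng.choices(pop, k=n)) for _ in range(reps)]

means = sample_means(population, n=25, reps=10_000, rng=rng)
mean_of_means = statistics.fmean(means)

print(f"population mean mu:   {mu:,.2f}")
print(f"mean of sample means: {mean_of_means:,.2f}")
print(f"relative difference:  {abs(mean_of_means - mu) / mu:.3%}")
```

With 10,000 repetitions the mean of the sample means should land within a fraction of a percent of the population mean, in the same spirit as the .059% difference reported above.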
B. The (Variance and) Standard Deviation of the Sample Mean

Recall that when we talk about means, we need to talk about standard deviations because they give us a sense of the typical distances between values. For the sampling distribution, we need to recognize that a different sample would give us a different result; the question becomes how different? The answer is found in calculating the variance of the sampling distribution. Recall that variances add nicely if the random variables are independent:

\[ Var(\bar{X}) = \frac{1}{n^2}\left[ Var(X_1) + Var(X_2) + \dots + Var(X_n) \right] = \frac{1}{n^2}\left[ \sigma^2 + \sigma^2 + \dots + \sigma^2 \right] = \frac{1}{n^2}\left[ n\sigma^2 \right] = \frac{\sigma^2}{n} \]

(page 97 of your text). This reduces down to

RULE 2: The theoretical standard deviation of all possible \(\bar{x}\)'s from all possible samples of size n is called the STANDARD ERROR or SE (to distinguish it from the standard deviation) and it is:

\[ SE = \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \]

This is paired with the mean of all sample means above, and σ is the standard deviation of the population. In our population data, σ is 63,610.55, so the theoretical standard deviation for a distribution of all possible sample means from samples of size 25 should be

\[ \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{63610.55}{\sqrt{25}} = 12722.11 \]

We can check whether this holds true or not by examining the results of a simulation. From the output above, the standard deviation for our 10,000 sample means (from our samples of size 25) is 12,437.97, again very close to what we get from the theory (12,722.11; we are off by about 2%). This rule is approximately correct as long as your sample is no larger than 5% of your population.

So please make a note of this:

o A sample has a mean \(\bar{x}\), a standard deviation \(s\) and a variance \(s^2\).
o A population has a mean µ, a standard deviation \(\sigma\) and a variance \(\sigma^2\).
o A sampling distribution, or a distribution of all possible sample statistics (in this case the sample mean), also has a mean, denoted \(\mu_{\bar{X}}\), and in theory it's equal to µ, but with a standard deviation (called the STANDARD ERROR) of \(\sigma_{\bar{X}} = \sigma/\sqrt{n}\).

Your sample (or any real-life sample) is just one single realization of all possible samples from a population of samples.
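As a rough check of RULE 2, the sketch below first applies the standard error formula to this lecture's population standard deviation (63,610.55) and sample size (25), and then compares the formula with a simulation. The simulation part assumes a synthetic lognormal population with made-up parameters, since the household income data are not available here; `standard_error`, `population`, and the seed are hypothetical.

```python
import math
import random
import statistics

def standard_error(sigma, n):
    """Theoretical standard deviation of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

# Arithmetic from the lecture: sigma = 63,610.55 and n = 25.
se_lecture = standard_error(63610.55, 25)
print(f"SE from the lecture's numbers: {se_lecture:,.2f}")

# ASSUMPTION: synthetic right-skewed stand-in population (not the HINC data).
rng = random.Random(8)
population = [rng.lognormvariate(11.0, 0.75) for _ in range(50_000)]
sigma = statistics.pstdev(population)    # population sd (divisor N)
theory = standard_error(sigma, 25)

# 10,000 samples of size 25 drawn with replacement, as in the lecture.
means = [statistics.fmean(rng.choices(population, k=25)) for _ in range(10_000)]
simulated = statistics.stdev(means)      # sd of the simulated sample means

print(f"theoretical SE (synthetic population): {theory:,.2f}")
print(f"simulated SE   (synthetic population): {simulated:,.2f}")
```

The simulated value should fall within a few percent of the theoretical one, comparable to the roughly 2% gap seen in the lecture's own simulation.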
The standard error \(\sigma_{\bar{X}} = \sigma/\sqrt{n}\) of all the SAMPLE MEANS will be smaller than the standard deviation for a single sample and also smaller than the standard deviation for the population. In other words, it is easier to predict the mean of many observations than it is to predict the value of a single observation (or to predict the average of small samples). What is causing this? Examine the formula for the standard error of the sampling distribution and note the effect of sample size on the standard error of all sample means. The bigger the sample size n gets, the smaller \(\sigma_{\bar{X}} = \sigma/\sqrt{n}\) becomes.

3. Normal Distributions and The Central Limit Theorem

Given a simple random sample of size n from a population having mean µ and standard deviation σ, the sample mean \(\bar{x}\) will come from a sampling distribution of all possible sample means with mean µ and standard deviation (called the standard error to make a distinction) \(\sigma_{\bar{X}} = \sigma/\sqrt{n}\).

A. Basic Distributional Result

If the original population had a normal distribution, then the distribution of the sample mean will also be normally distributed. This is good, because it means we can use the normal table to make inferences about a particular sample with a statement of probability or chance.

Example. IQ scores are normally distributed with a mean of 100 and a standard deviation of 15. A sample of 25 persons is drawn. How likely is it to get a sample average of 108 or more? (Using Z scores: \(Z = (108 - 100)/(15/\sqrt{25}) = 2.67\), which gives 0.4% or .004 from Table IV.) How likely is it for the very first score to be 108 or more? (\(Z = (108 - 100)/15 = .53\), which gives 29.8% or .298 from Table IV.)

B. The Central Limit Theorem (p. 0)

No matter what the distribution of the original population (recall our original one is highly right skewed), if the sample size is "large", the distribution of the possible sample means will be close to the normal distribution (often 10 to 20 is large enough). It is a very powerful theorem and it is the reason why the normal distribution is so well studied: we are interested in estimating means, and the CLT helps us to understand what to expect.

C. Normal Approximation Rule (p. 0)

In random samples of size n, the sample mean \(\bar{x}\) will fluctuate around the population mean µ with a standard error of \(\sigma/\sqrt{n}\). Therefore, as n increases in size, the sampling distribution of the sample means concentrates more and more around the population mean (this is why bigger samples are better: they increase accuracy).
The sampling distribution will also become more and more normal.

Let's go back to our first sample of 25 with its mean of $8,53.60. The chance of getting a mean that high or higher is found in two steps. (1) Calculate

\[ Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{\bar{x} - 78163.53}{63610.55/\sqrt{25}} \approx +.34 \]

so Z is about .34. (2) Do a look-up from the standard normal table, and you get .367 in the area beyond Z. So the chance (probability) of drawing a sample of size 25 with an average of $8,53.60 or higher when you were expecting the average to be $78,163.53 was about 36.7%.

Your interpretation is that about 37% of the time you would get a sample average as high as the one you got. This suggests that it's not too unusual to be this far from the true average even though you have done everything correctly (e.g. random sample).

NOTE: The Central Limit Theorem only applies to the distribution of possible sample averages (i.e. the sampling distribution); it says nothing about the distribution of individual scores in either the sample or the population. For example, here is a graph of our household income variable for the population, followed by a graph of our initial sample of size 25 (from the beginning of this lecture). Note that neither is normal, but the sampling distribution of the means of all possible samples of size 25 is approximately normal.
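The Z-score and table look-up steps used in this section can be sketched in Python. The sketch uses the IQ example from part A (mean 100, standard deviation 15, n = 25, threshold 108) and replaces the printed normal table with the complementary error function from the standard library; `upper_tail` and `z_for_mean` are hypothetical helper names.

```python
import math

def upper_tail(z):
    """P(Z >= z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def z_for_mean(x_bar, mu, sigma, n):
    """Z score of a sample mean: (x_bar - mu) / (sigma / sqrt(n))."""
    return (x_bar - mu) / (sigma / math.sqrt(n))

# IQ example: chance that the average of 25 scores is 108 or more.
z_mean = z_for_mean(108, 100, 15, 25)
p_mean = upper_tail(z_mean)

# Same threshold for a single score (n = 1).
z_single = z_for_mean(108, 100, 15, 1)
p_single = upper_tail(z_single)

print(f"sample mean of 25: Z = {z_mean:.2f}, P = {p_mean:.4f}")
print(f"single score:      Z = {z_single:.2f}, P = {p_single:.4f}")

# The look-up used for the household income sample: area beyond Z = .34.
print(f"area beyond Z = .34: {upper_tail(0.34):.4f}")
```

The results should agree with the table values quoted above to about two decimal places; small differences arise because a table look-up rounds Z to two decimals first.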
The Original Population of all households (78,455) with mean $78,163.53

[Figure: density plot of HINC for the population; horizontal axis HINC, vertical axis Density]

Our one sample of size 25 with these statistics

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        hinc |        25       853.6      775.65          0     385000

[Figure: density plot of hinc for the sample of size 25; horizontal axis hinc, vertical axis Density]
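The contrast between a skewed population and a much more symmetric sampling distribution can be illustrated with a small Python sketch. It assumes a synthetic right-skewed population (lognormal with made-up parameters) in place of the household income data, and it uses skewness as a rough, informal measure of how far a distribution is from the symmetric normal shape; `skewness`, `population`, and the seed are hypothetical.

```python
import random
import statistics

def skewness(values):
    """Moment-based skewness: the mean of cubed standardized values."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - m) / sd) ** 3 for v in values])

rng = random.Random(8)  # fixed seed so the sketch is repeatable

# ASSUMPTION: synthetic right-skewed stand-in population (not the HINC data).
population = [rng.lognormvariate(11.0, 0.75) for _ in range(50_000)]

# Means of 5,000 samples of size 25, drawn with replacement.
means = [statistics.fmean(rng.choices(population, k=25)) for _ in range(5_000)]

skew_pop = skewness(population)
skew_means = skewness(means)

print(f"skewness of the population:       {skew_pop:.2f}")
print(f"skewness of sample means (n=25):  {skew_means:.2f}")
```

A skewness near zero indicates a roughly symmetric distribution. The sample means should be far less skewed than the population, though at n = 25 some right skew typically remains when the population is this skewed.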