OCW Epidemiology and Biostatistics, 2010 David Tybor, MS, MPH and Kenneth Chui, PhD Tufts University School of Medicine October 27, 2010

SAMPLING AND CONFIDENCE INTERVALS

Learning objectives for this session:
1) Understand how a histogram can be read as a probability distribution
2) Understand the importance of random sampling in Statistics
3) Understand how sample means can have distributions
4) Explain the behavior (distribution) of sample means and the Central Limit Theorem
5) Know how to interpret confidence intervals as seen in the medical literature
6) Know how to calculate a confidence interval for a mean

Outside preparation: Pagano & Gauvreau: Chapter 7, pp. 176-185; Chapters 8 and 9

In the previous section we introduced the normal distribution and how to reference individual values in a distribution when given the mean and standard deviation. To see ways in which this information is applied, do a quick internet search on the definitions of osteoporosis and osteopenia, the difference between stunting and wasting in terms of malnutrition, or the development of the CDC's definitions of childhood overweight and obesity. In this section we will move beyond data from individuals and discuss the behavior of data summarizing groups, such as the behavior of means.

HISTOGRAMS AS PROBABILITY DISTRIBUTIONS

What do we mean by the behavior of numbers? It is important to realize that a histogram shows not only the distribution of values in a collection of people (the proportion of people between specified limits) but also the probability that an individual selected at random will lie between those limits. Recall the normal distribution. For example, we can ask: what is the probability that a randomly selected MCAT test-taker will have a score greater than 12? If we know that the mean is 8 and the standard deviation is 2, then a score of 12 sits two standard deviations above the mean, so only about 2% of people have a score >12. This is equivalent to saying that if we randomly choose someone from this distribution, there is only a 1 in 50 chance (2%) that the randomly selected person will have a score that high.

It is crucial to understand this concept: a histogram is also a type of probability plot. If you couple this with the special features of a normal distribution, you have a powerful tool for explaining the probability of observing certain events.

SAMPLING

Sampling is basically the process of taking a random selection of individuals from a population. The population is everybody, and we use parameters to describe population characteristics. Greek letters are used when talking about population parameters (the mean is indicated with μ, the standard deviation with σ, etc.). A sample is a subset of people chosen from the population, and the term statistic (with Roman symbols) is used when summarizing their characteristics.

Suppose we wanted to get a good estimate of some value in the population, such as the average Calories of fat consumed daily by women in the US. It would be very difficult (and somewhat pointless) to query every single member of the population (a process called a census). Instead, we can take a good sample, calculate the mean KCal from fat, and then couch that mean with some sort of numerical acknowledgement of the fact that a different sample would have given rise to a slightly different mean. This process is part of inferential Statistics (notice the capital S) and will be covered later when we discuss Confidence Intervals.

What do we mean by a good sample? You could take an entire course on sampling design, learning about simple random samples, stratified sampling, convenience samples, etc. For now, let's assume that we are working with a simple random sample, in which every member of the population has an equal probability of being selected in the sample.

Mmmmm. Pepperoni

Imagine that your local pizza place claims that their delivery times are normally distributed, with a mean delivery time of 30 minutes and a standard deviation of 10 minutes.
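Hold that claim in mind; we will come back to it. First, the MCAT calculation above can be verified numerically. This is a minimal sketch using Python's built-in `statistics` module (the mean of 8 and standard deviation of 2 come from the example above):

```python
from statistics import NormalDist

# Distribution of MCAT scores from the example: mean 8, SD 2
mcat = NormalDist(mu=8, sigma=2)

# P(score > 12): one minus the cumulative probability at 12
p_above_12 = 1 - mcat.cdf(12)
print(round(p_above_12, 4))  # about 0.0228, i.e. roughly 2%, a 1-in-50 chance
```

The same `cdf` lookup replaces a z-table for any normal distribution.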
You order up a pizza and it takes 42 minutes for the pie to arrive at your house. Do you have reason to disbelieve the chain's claim?

In a sense, if the company is telling the truth, when you call to order a pizza the subsequent delivery time is simply a random value chosen from a normal distribution with a mean of 30 and a standard deviation of 10 (this is known as a random variable[1], in that you don't know how long a given pizza delivery is going to take until you actually call up and order that pizza!). Sometimes it might take 25 minutes to get there, sometimes 35, and sometimes 40, but most of the time it will be around the mean of 30 minutes. This time, your particular zesty pie took 42 minutes, which is equivalent to a z-score of 1.2. Deliveries this slow happen about 11% of the time (you can look at the histogram to estimate the percentile, or look at a z-table to get the exact value). Now, that's not very rare. If you tried to go on Judge Judy and make the claim that the company committed false advertising, you wouldn't have much of a case. After all, if the mean is 30 and the standard deviation is 10, you would expect that some pizzas take that long to get there. It's the definition of a distribution.

So far this is all a review of last class. But let's add a twist. Imagine that you are skeptical of the chain's claim, and you convince 20 of your random friends in 20 random locations to randomly call and order pizzas. You then calculate the average delivery time of these 20 pizzas, and it is 42 minutes. Do you now have reason to doubt the chain's claim?

Now just stop and think about this. It feels different, doesn't it? When it was just you, ordering just one pizza, 42 minutes didn't seem crazy; you could probably just chalk it up to bad luck. But now you have TWENTY FLIPPING PIZZAS, and the average delivery of those 20 pizzas is 42 minutes. If the company really delivered pizzas in about a half hour, wouldn't you expect the average of your 20 pizzas to be closer to that half hour? Like maybe 35 minutes? Or maybe 32 minutes? Or maybe even 30.5 minutes? What would a judge say with this evidence? Where do we draw the cut-off? How do we determine what is weird or rare enough to bring forth as evidence that the chain's claim is false?

It doesn't make sense to use the same histogram above, with its mean of 30, standard deviation of 10, and z-score of 1.2. We need a distribution on which to place our observed mean of 42, and your gut probably says (correctly) that it should be a distribution with a relatively tighter spread. That appropriate distribution is known as the distribution of sample means.

[1] A random variable can be defined as a potential quantity whose values are determined by a chance-governed mechanism. In the real world, obviously, pizza delivery time wouldn't exactly meet this definition, but let's pretend.

THE DISTRIBUTION OF SAMPLE MEANS

Many students struggle with the concept of a distribution of means. How can means have a distribution? "Dr. Tybor, isn't a mean simply a statistic, a number that summarizes my data?" You have to resist the temptation of thinking about a sample mean as a specific number. Yes, it is true that in your sample, your friends observed a mean pizza delivery time of 42 minutes. But if you had randomly chosen different friends, or if your friends had been in different random locations, or if your friends had called at different times, you would have, by definition, taken a different sample. In each different sample that you could possibly have taken, the observed mean could be higher or lower than 42.
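The contrast between one pizza and twenty can be made concrete with a quick simulation. This is a sketch, assuming the chain's claim is true (deliveries drawn from a normal distribution with mean 30 and SD 10) and repeating the 20-friend experiment many times:

```python
import random
from statistics import NormalDist, mean, pstdev

random.seed(1)  # make the simulation reproducible

# The chain's claim: each delivery time is drawn from Normal(mu=30, sigma=10)
claimed = NormalDist(mu=30, sigma=10)

# One pizza taking 42+ minutes is not strange: it happens about 11.5% of the time
print(1 - claimed.cdf(42))

# Now simulate 10,000 repetitions of the 20-friend experiment
sample_means = [mean(claimed.samples(20)) for _ in range(10_000)]

# The sample means pile up near 30 with a much tighter spread than 10 minutes
print(mean(sample_means))    # close to 30
print(pstdev(sample_means))  # close to 10 / sqrt(20), about 2.24

# A MEAN of 42 minutes essentially never happens if the claim is true
print(sum(m >= 42 for m in sample_means))  # 0 on this seed
```

The spread of the simulated means, about 2.24 minutes, is exactly the quantity the next section names and derives.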
So the value that you observe (the sample mean) depends on the random friends that you chose and the random locations where they are; this is a chance-governed mechanism that results in the observed value. In other words, it is a random variable. In a sense, when you collect data, you are randomly choosing one random sample mean from the universe of all possible sample means. We can represent this on a histogram. But what does that histogram look like?

The distribution of sample means is driven by something called the Central Limit Theorem. It is central because almost all statistical techniques are built upon it. (If the CLT were not true, then you would be at Clery's right now instead of in the library reading these notes.) The Central Limit Theorem says that if you theoretically take many random samples from a population, and then calculate the means of each sample and plot them, the distribution has the following characteristics:

1. The distribution of sample means will be normally distributed. This is true even if the original population characteristic does not follow a normal distribution.
2. The mean of the sample means (x̄) is equal to the population mean.
3. The standard deviation of the distribution of sample means depends on both the size of the sample and the standard deviation of the population distribution. This spread of sample means is called the Standard Error of the Mean (SEM):

SEM = (standard deviation) / (square root of sample size)

That should give you a mental image of a distribution of sample means centered at the population mean but much narrower than the population distribution itself. Let's look at each of these points in turn.

CLT says the distribution of sample means is Gaussian

This is nice because we know that normal distributions have special features that make it easy to reference values somewhere in the distribution. We can easily calculate z-scores or percentiles to determine how rare an observed outcome is. The importance of this will be evident in later topics.

CLT says the mean of the sample means is equal to the population mean

This should intuitively make sense. If the average delivery time of a pizza is 30 minutes, and you take a sample of a large number of pizzas, your best guess of the mean of that sample will be 30 minutes. It would be surprising if the mean of the sample means were not equal to the population mean.

CLT says the spread of sample means can be described with the SEM

The Standard Error of the Mean, which is basically the standard deviation of the distribution of sample means, depends on two factors. First is the underlying spread of the population itself. It makes sense that if there were relatively little variability in the delivery time of pizzas (say, for example, that all pizza deliveries took between 29 and 31 minutes), then the spread of the mean delivery times would also be small. It also makes sense that the spread of the distribution is contingent on the size of the

sample: the larger the sample you take, the better its mean will estimate the true population mean. These two common-sense observations can be seen in the calculation of the SEM.

SEM = (SD of population) / (square root of sample size)

Should I sue the pizza chain or not?

We are now equipped with the tools to evaluate the veracity of the chain's claim. They claim that pizza delivery times are normally distributed with μ = 30 and σ = 10. We have collected a sample of size 20 (n = 20) and observed a sample mean of 42 (x̄ = 42). How weird is this outcome if we assume that the chain is telling the truth? The SEM describes the distribution of all possible sample means (of n = 20) that would be observed if the chain's claim were correct. We can calculate the SEM and visualize the distribution of sample means.

SEM = 10 / sqrt(20) ≈ 2.24

Notice where 42 falls on this distribution: it is more than 5 SEMs above the mean. Think about this histogram from the perspective of probability. If the chain's claim were correct, it would be very, very unlikely that a random sample of 20 pizzas would result in a mean delivery time of 42 minutes. With this evidence, it would be reasonable to conclude that the pizza chain is not telling the truth.

In just a few short pages, we have shown how the special features of the normal distribution and the Central Limit Theorem can be utilized to make powerful statements about the probability of observing certain outcomes. Now let's build upon this knowledge.

Confidence Intervals

One of the major goals of statistics is to make inferences about a population based on a sample. Suppose, for example, that a study found that in a sample of diabetic adolescents, the mean resting blood glucose level was 135 mg/dl. The number 135 mg/dl is a point estimate, a statistic summarizing our sample. (In this example, the point estimate is a mean, but the point estimate could be any statistic: an odds ratio, risk ratio, proportion, etc.) The point estimate is our best estimate of what the population parameter might be. However, it would be silly to conclude that the true population average glucose level in all diabetic adolescents is equal to exactly 135 mg/dl. It seems reasonable that the true population parameter is probably a little bit higher or lower, and that the point estimate is just an approximation.

In order to approximate the population parameter of interest using the point estimate, we employ a confidence interval. A confidence interval is a range of values that we expect includes or contains the true population parameter we wish to estimate. The sample mean, for example, is the best estimate of the population parameter, whereas the confidence interval is the interval that is most likely to include the population parameter.

In the blood glucose example above, a potential confidence interval for the mean might consist of all the numbers between 130 and 140 mg/dl. This confidence interval would typically be written as (130, 140). In this case, the endpoints, or lower and upper bounds, of the confidence interval are 130 and 140, respectively. If the true, unknown population mean μ is 133 mg/dl, then we would say that the true mean is included in the confidence interval. If the unknown population mean is 142 mg/dl, however, then μ is not included or contained in the confidence interval.

In statistics, we can never be completely certain that the true population parameter falls within the confidence interval. Only God knows, and She's not telling. Therefore, we establish confidence levels, which provide the statistical level of certainty that the true population parameter is contained in the confidence interval. Typically, the confidence level chosen in biomedical research is 95%; occasionally 90% or 99% confidence intervals are used.
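Both where the multipliers for these levels come from and what the level actually promises can be checked with Python's standard library. This is a sketch: the true mean of 135 mg/dl echoes the glucose example above, while the population SD of 20 and sample size of 50 are made-up values for illustration only:

```python
import random
from statistics import NormalDist

random.seed(2)    # reproducible simulation
z = NormalDist()  # standard normal distribution

# A two-sided 95% level leaves 2.5% in each tail, so the multiplier
# is the 97.5th percentile of the standard normal
z95 = z.inv_cdf(0.975)
print(round(z95, 2))  # 1.96

# What "95% confidence" means: across many repeated studies, about 95%
# of the intervals we construct will contain the true population mean.
mu, sigma, n = 135, 20, 50  # assumed glucose population and sample size
pop = NormalDist(mu, sigma)

covered = 0
trials = 2000
for _ in range(trials):
    sample = pop.samples(n)
    xbar = sum(sample) / n
    half_width = z95 * sigma / n ** 0.5  # known-sigma interval, as in these notes
    covered += (xbar - half_width <= mu <= xbar + half_width)

print(covered / trials)  # close to 0.95
```

Swapping 0.975 for 0.95 or 0.995 in `inv_cdf` reproduces the 90% and 99% multipliers (about 1.65 and 2.58) used later in these notes.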
A 95% confidence interval, for example, is constructed in such a way that 95% of all possible samples from the population produce an interval that includes the true population parameter of interest. Thinking in terms of probability, this is equivalent to saying that there is a 95% chance that the confidence interval from my study (computed from my random sample) includes the population parameter. Imagine the theoretical situation in which we take 20 different samples from a population (as if we had enough money to repeat our study 20 different times!) and compute a confidence interval from each. If each horizontal bar in a plot represents the sample mean (x̄) and confidence bounds for one of the twenty samples, we would expect that in about 19 of the 20 samples the population mean is contained in the confidence interval, and in one sample it is not.

[Figure: a population (sampling) distribution with 20 confidence intervals extracted from different samples; 19 of the 20 intervals contain μ.]

Interpretation of Confidence Intervals

Assume that from our sample, we calculated a 95% confidence interval of (120, 160). This means that we are 95% certain that the true population mean glucose falls between 120 and 160 mg/dl.

Remember that this process can also be done for statistics other than means. For example, another study might find that the odds ratio for the risk factor of cigar smoking on the outcome of emphysema is 1.20, with a 95% confidence interval of 1.07 to 1.34. This means that we are 95% confident that the true population odds ratio for cigar smoking and emphysema falls between 1.07 and 1.34. The best estimate is the point estimate of 1.20. In other words, we are 95% confident that cigar smokers have between 7% and 34% greater odds of emphysema than non-cigar smokers.

The general formula for a confidence interval is:

Point Estimate ± Multiplier * Standard Error

To calculate the confidence interval for a mean, use this formula:

x̄ ± Z·σ/√n

where Z is the two-tailed z-score for the chosen confidence level, σ is the standard deviation, and n is the sample size. Please note that the confidence interval is calculated for μ, not for x̄. For 95% confidence intervals, the formula is x̄ ± 1.96·σ/√n. The z-scores corresponding to 90% and 99% confidence limits are 1.65 and 2.58, respectively.

Thus, there are three scenarios that can make a confidence interval wider:
1. Increasing the confidence level; say, from 95% to 99%.
2. Increasing the standard deviation.
3. Decreasing the sample size.

In this way, confidence intervals are related to the power of the study to detect an effect or difference. In general, the larger the sample size in a study, the greater the power the study has to detect an effect. The larger the sample size, the narrower the confidence interval will be, given the same standard deviation and confidence level. We also say that studies with large sample sizes have high power to detect effects or differences. A more formal discussion of how power is used in hypothesis testing is provided in the next lecture.

There are certain assumptions required for different types of confidence intervals. Generally speaking, confidence intervals for a mean require the assumption that the distribution is approximately normal, or Gaussian. This assumption is not valid for all measures, however, such as odds ratios, proportions, and others.

In the above example of calculating a confidence interval for a mean, the confidence interval is symmetric. That is, the difference between the mean and the lower boundary of the confidence interval is exactly equal to the difference between the upper boundary of the confidence interval and the mean. Confidence intervals are symmetric in many, but not all, instances. You'll find that confidence intervals are symmetric for certain measures, such as means, differences in means, and related measures where the data are assumed to be derived from a symmetric distribution, such as the normal distribution. However, in the case of odds ratios and risk ratios (relative risks), where the confidence intervals are calculated on the log scale, the arithmetic difference between the upper boundary of the interval and the point estimate will tend to be larger than the difference between the point estimate and the lower boundary of the confidence interval.

Thanks to Dr. Steve Cohen for his contributions to these notes.
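The log-scale asymmetry described above can be seen numerically. This is a minimal sketch; the standard error of the log odds ratio used here (0.057) is a hypothetical value back-calculated to roughly reproduce the cigar-smoking example:

```python
import math

# Cigar-smoking example from the notes: point estimate OR = 1.20
or_hat = 1.20
se_log_or = 0.057  # hypothetical SE on the log scale, chosen to roughly
                   # match the (1.07, 1.34) interval quoted in the text

# CIs for odds ratios are built on the log scale, where they ARE symmetric...
log_lo = math.log(or_hat) - 1.96 * se_log_or
log_hi = math.log(or_hat) + 1.96 * se_log_or

# ...and then exponentiated back, which makes them asymmetric around the OR
lo, hi = math.exp(log_lo), math.exp(log_hi)
print(round(lo, 2), round(hi, 2))  # about 1.07 and 1.34

# The upper arm is longer than the lower arm on the original scale
print(round(hi - or_hat, 3), round(or_hat - lo, 3))
```

Exponentiation stretches distances above the point estimate more than below it, which is exactly why odds ratio and risk ratio intervals look lopsided in published tables.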
A more formal discussion of how power is used in hypothesis testing is provided in the net lecture. There are certain assumptions required for different types of confidence intervals. Generally speaking, confidence intervals for a mean require the assumption that the distribution is approimately normal or Gaussian. This assumption is not valid for all measures, however, such as odds ratios, proportions, and others. In the above eample of calculating a confidence interval for a mean, the confidence intervals would be symmetric. That is, the difference between the mean and the lower boundary of the confidence interval is eactly equal to the difference between the upper boundary of the confidence interval and the mean. Confidence intervals are symmetric in many, but not all instances. You ll find that confidence intervals are symmetric for certain measures, such as for means, differences in means, and related measures where the data are assumed to be derived from a symmetric distribution, such as the normal distribution. However, in the case of odds ratios and risk ratios (relative risks), where the confidence intervals are calculated on the log scale, the arithmetic difference between the upper boundary of the interval and the point estimate will tend to be larger than the difference between the point estimate and the lower boundary of the confidence interval. Thanks to Dr. Steve Cohen for his contributions to these notes 7