V. Gathering and Exploring Data

With the language of probability in our vocabulary, we're now ready to talk about sampling and analyzing data.

Data Analysis We can divide statistical methods into roughly two categories: exploratory data analysis and inference. Exploratory analysis answers the question: What do we observe? Inference answers the question: Based upon what we observe, what conclusions can we draw about the underlying population?

Example V.A According to a recently available General Social Survey (see http://www.norc.org/projects/general+social+survey.htm), 26.2% of 4492 randomly polled Americans attend church at least weekly. Does this result provide any evidence that a minority of Americans attend weekly religious services? What do we observe? We observe that 1177 of 4492 randomly sampled adults (26.2%) go to church at least weekly. What can we infer about the underlying population? What does this sample say about the proportion of all Americans who attend weekly services? Does this survey provide evidence that a majority do not attend weekly? Can we make a reliable judgment about general opinion using a sample of this size?

The Purpose of Statistics Our object is to characterize the underlying population (described by parameters) from which the sample (summarized by estimates, or statistics) was taken; i.e., we want to infer something about the population using information from the sample.

Key Terminology A parameter is a quantity that characterizes the population (such as the population mean or population variance). A statistic is a quantity computed from the sample. Statistics are often used to estimate population parameters. For example, the sample mean $\bar{X}$ is used to estimate the population mean $\mu$. A random sample is one for which every subject in the population has some chance of selection. A simple random sample (or SRS) is a random sample for which each subject in the population has an equal chance of selection.

Representative Samples No statistical methods, no matter how sophisticated, will help in understanding the underlying population if the sample is not representative of the population of interest! A sample is representative if it is random. A random sample with n = 5 is better than a nonrandom (i.e., biased) sample with n = 10,000! Example V.B Read the handout Samples in History. How did nonrandom sampling lead to biased observations?

Controlled Experiments One way of gathering data is through a controlled experiment, such as a randomized trial. For example, in order to test the efficacy of a new drug or treatment, we may randomly divide the study subjects into two groups: treatment and control. Controlled experiments are often considered the gold standard, in the sense that they eliminate spurious associations due to confounding.
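The random division into treatment and control groups can be sketched in a few lines of Python. This is only an illustration: the subject IDs, the `randomize` helper, and the fixed seed are hypothetical and not part of the course material.

```python
import random

def randomize(subjects, seed=None):
    """Randomly split subjects into two groups of (nearly) equal size."""
    rng = random.Random(seed)      # a seed makes the assignment reproducible
    shuffled = list(subjects)      # copy so the caller's list is left untouched
    rng.shuffle(shuffled)          # every ordering is equally likely
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

subject_ids = list(range(1, 21))   # hypothetical study with 20 subjects
treatment, control = randomize(subject_ids, seed=0)
print("treatment:", sorted(treatment))
print("control:  ", sorted(control))
```

Because chance alone decides who lands in which group, any third variable tends to be balanced between the two groups on average.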

Confounding Factors Confounding is best understood through examples. It is easy to demonstrate, for instance, that incidents of drowning increase when more ice cream is consumed. Also, drunk driving deaths increase with increased chocolate sales. Does eating ice cream cause drowning, or does eating chocolate cause accidents due to impaired driving? Can you think of explanations for these associations? The confounder in each case is a third variable that is associated with both the supposed cause and the effect.

Observational Studies Sometimes, it is unethical or infeasible to randomize subjects to treatment groups. For example, how might we carry out a randomized study to determine the effects of smoking? In the absence of a controlled study, we rely on observational data (or a survey). However, we must always use caution when interpreting the results of an observational study. Remember: ASSOCIATION does not necessarily imply CAUSATION.

Example V.C For years, investigators reported, based on observational data, that hormone replacement therapy (HRT) among postmenopausal women leads to other health benefits, including lowering the risk of heart disease. Note that this assumption has widespread repercussions: HRT was prescribed almost universally to help ease uncomfortable symptoms of menopause. Recently, a randomized HRT trial reached a seemingly opposite conclusion: women on HRT were at higher risk of heart disease (on average) than women on placebo. For example, see: http://www.nhlbi.nih.gov/health/women/pht_facts.pdf Can you think of an explanation for these seemingly contradictory findings?

Example V.D Read the handout titled Readings on Controlled Experiments. Consider these questions: Why might a controlled experiment be more ethical than an observational study? When is a controlled experiment not ethical? Considering the last question, how should a researcher design a study to ensure she is following ethical practices?

Exploratory Data Analysis: Visualizing Data Recall that there are two types of variables we can measure: Categorical Variables (e.g., gender, race, political party affiliation). Continuous Variables (e.g., height, weight, age, income). For categorical variables, we can visualize data using a bar chart or pie chart. For continuous variables, we often visualize the distribution of the data using a histogram, stem-and-leaf plot, or boxplot.

Bar Charts A bar chart is a plot where each possible category of the variable is represented by a bar. The height of a given bar is proportional to the number of observations from the sample that fall in the associated category. Example V.E Sketch a bar chart for gender, using our class as a sample.

Pie Charts A pie chart is a plot where the whole sample is represented by a circle, or pie. The slice of pie associated with a given category is sized proportionally to the number of subjects that fall in that category. For example, if 1/3 of the sample subjects fall in a given category, then the angle of the slice for that category is (1/3)(360) = 120 degrees. Example V.F Sketch a pie chart for gender, using our class as a sample.
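Both charts start from the same category counts: a bar's height is the count, and a slice's angle is the category's share of 360 degrees. The sketch below illustrates that arithmetic; the `category_summary` helper and the categories "A" and "B" are made up for the example.

```python
from collections import Counter

def category_summary(observations):
    """Map each category to (count, pie-slice angle in degrees)."""
    counts = Counter(observations)
    n = len(observations)
    # Bar height is the count; slice angle is the fraction of 360 degrees.
    return {cat: (count, 360 * count / n) for cat, count in counts.items()}

sample = ["A"] * 4 + ["B"] * 8     # hypothetical sample of 12 subjects
summary = category_summary(sample)
for cat, (count, angle) in summary.items():
    print(f"{cat}: bar height {count}, slice angle {angle:.0f} degrees")
```

Here category "A" holds 1/3 of the sample, so its slice spans 120 degrees, matching the rule stated above.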

Histograms A histogram helps us to visualize the distribution of a sample of continuous data. We construct a histogram by following these steps: Divide the range of possible values of the variable into intervals, called bins. The width of the intervals is sometimes referred to as the binwidth. Count the number of observations that fall within each interval. Represent each interval with a bar, whose height is proportional to the number of observations falling within that interval, or bin.

Additional Comments on Histograms Don't confuse a histogram and a bar chart. The horizontal axis of a histogram contains intervals over a continuum, whereas the bar chart has categories that are possibly unordered (like race, for example). The binwidth is somewhat arbitrary. The width should be chosen so that the plot is aesthetic: not overly smooth (width too wide) and not too ragged (width too narrow). Statistical software packages will select a default binwidth, but will generally also give you the option of explicitly choosing a different binwidth.

Example V.G The following table shows the number of O-ring incidents (failures) versus temperature in space shuttle take-offs up to and including the Challenger disaster of 1986.

O-ring Incidents   Temperature (degrees Fahrenheit)
None               66 67 67 67 68 68 70 70 72 73 75 76 76 78 79 80 81
One                57 58 63 70 70
Two                75
Three              53

Example V.G, cont'd Construct three histograms for all of the launch temperatures combined: one with a binwidth of 2.5 degrees, a second with a binwidth of 5 degrees, and a third with a binwidth of 10 degrees. Take the minimum value for all three histograms to be 50 degrees. Which of the three do you prefer, and why?
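The binning steps can be checked with a short script. This is a minimal sketch, not the output of any particular statistics package. It assumes bins that include their left edge and exclude their right edge (a value of 70 falls in [70, 75)); the slides do not state a convention, and `bin_counts` is a hypothetical helper.

```python
# Launch temperatures from Example V.G, all incident groups combined.
temps = [66, 67, 67, 67, 68, 68, 70, 70, 72, 73, 75, 76, 76, 78, 79, 80, 81,  # none
         57, 58, 63, 70, 70,                                                  # one
         75,                                                                  # two
         53]                                                                  # three

def bin_counts(values, start, width):
    """Count observations in bins [start, start+width), [start+width, ...), etc."""
    n_bins = int((max(values) - start) // width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // width)] += 1   # index of the bin containing v
    return counts

for width in (2.5, 5, 10):
    print(f"binwidth = {width}")
    for k, count in enumerate(bin_counts(temps, 50, width)):
        left = 50 + k * width
        print(f"  [{left:g}, {left + width:g}) {'*' * count}")
```

Each row of asterisks is a bar turned on its side, so the three printouts can be compared for raggedness and smoothness directly.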

Interpreting a Histogram The purpose of a histogram is to visually assess the distribution of your sample. We are generally interested in four characteristics: 1. Symmetry versus skewness: is the distribution roughly symmetric? If not, is it right-skewed (or positively skewed)? Is it left-skewed (negatively skewed)? 2. Center: around what approximate value do the observations cluster? 3. Mode: roughly how many peaks do we observe in the distribution? One (is it unimodal)? Two (is it bimodal)? 4. Outliers: are there any observations that lie relatively far away from the main body of data?

Skewness A symmetric distribution appears to have roughly equal numbers of observations on either side of the midpoint of the distribution. A sample from a normally distributed population (such as height or blood pressure measurements, for example) would appear to have a symmetric distribution. A right (or positively) skewed distribution appears to have the bulk of the data clustered toward the lower end of the distribution, with the proportion of larger values tailing off to the right. Measurements such as annual income tend to have positively skewed distributions. A left (or negatively) skewed distribution appears to have the bulk of the data clustered toward the upper end of the distribution, with the proportion of smaller values tailing off to the left. Homework scores in this class have a left-skewed distribution: most students tend to score high (between 20 and 25), with the proportion of lower scores tailing off to the left.

Example V.H The data used for the following three histograms were randomly generated using computer software. Can you describe the distribution in each case?

Example V.H, cont'd

Example V.I Given the histogram for the space shuttle temperatures from Example V.G (seen below with a binwidth of 5), describe the distribution.

Stem-and-Leaf Plots A type of plot closely related to the histogram is the stem-and-leaf plot (or stemplot, for short). Both plots have the same purpose: to help us to quickly characterize the distribution of our sample. The stemplot, however, allows us to view the actual measurements in the sample while still providing a graphical description of their distribution. In a stemplot, the stems are analogous to the bins of a histogram. The leaves represent individual observations belonging to each stem. The more leaves in a stem, the more observations fall within that range of the data. THE IDEA: Choose stems that correspond to a base digit of the measurements (such as the 10's or 100's digit, for example).

Example V.J Construct a stem-and-leaf plot for the shuttle launch temperatures in Example V.G. Use a stem corresponding to the tens digit in each temperature, with ten leaves per stem. Construct another stemplot that splits the stems of your first plot in two (i.e., five leaves per stem). Construct a third stemplot that splits the stems of the first plot into five (i.e., two leaves per stem). Which do you prefer, and why?
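A text stemplot is easy to generate, which makes it convenient for comparing the splitting choices. The sketch below assumes non-negative whole-number measurements with the tens digit as the stem; `stemplot` and its `leaves_per_stem` parameter are hypothetical names chosen for this illustration.

```python
# Launch temperatures from Example V.G, all incident groups combined.
temps = [66, 67, 67, 67, 68, 68, 70, 70, 72, 73, 75, 76, 76, 78, 79, 80, 81,
         57, 58, 63, 70, 70, 75, 53]

def stemplot(values, leaves_per_stem=10):
    """Return stemplot rows; use leaves_per_stem = 10, 5, or 2 to split stems."""
    rows = {}
    for v in sorted(values):
        # Row key: which block of `leaves_per_stem` consecutive values v is in.
        rows.setdefault(v // leaves_per_stem, []).append(v % 10)
    lines = []
    for key in range(min(rows), max(rows) + 1):   # keep empty rows to show gaps
        stem = key * leaves_per_stem // 10        # tens digit for this row
        leaves = "".join(str(d) for d in rows.get(key, []))
        lines.append(f"{stem} | {leaves}")
    return lines

for per_stem in (10, 5, 2):
    print(f"{per_stem} leaves per stem:")
    print("\n".join(stemplot(temps, per_stem)))
```

Empty rows are kept on purpose, since dropping them would hide gaps in the data and distort the shape of the distribution.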

Choosing Stems You can see that the same issues regarding binwidth for histograms apply to stemplots: you don't want to split stems too much, or the plot will look too ragged. Not splitting enough results in an oversmooth plot. Either extreme results in a graph that is difficult to interpret.

Exploratory Data Analysis: Summarizing Data The purpose of exploratory data analysis is to describe the distribution of the sample. Summary statistics help to reduce the data to just a handful of numbers that help an investigator to quickly characterize the distribution. We will discuss two types of summary statistics: Measures of Center: mean, median, mode. Measures of Spread (Variability): variance, interquartile range, range.

Summary Statistics: Measures of Center The sample mean is defined as $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. The sample median is defined as the middle observed value. In other words, let $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ represent the sample, sorted from the smallest to the largest value. If n is odd, then the sample median is the middle observation, or $X_{([n+1]/2)}$. If n is even, then the sample median is the average of the middle two ordered observations, or $(X_{(n/2)} + X_{([n/2]+1)})/2$. The sample mode is the most frequently observed value in the data set.

Example V.K We observe the data 4, 3, 4, 2, 8. What are the sample mean, median, and mode?
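The definitions of the three measures of center translate almost word for word into code. The function names below are illustrative. The mode is returned as a list because a sample can have ties, a case the definition above does not address.

```python
from collections import Counter

def sample_mean(xs):
    """Sum of the observations divided by n."""
    return sum(xs) / len(xs)

def sample_median(xs):
    """Middle ordered value (n odd) or average of the middle two (n even)."""
    ordered = sorted(xs)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]              # X_((n+1)/2), 0-based index
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2  # X_(n/2) and X_(n/2 + 1)

def sample_modes(xs):
    """All values that attain the highest observed frequency."""
    counts = Counter(xs)
    top = max(counts.values())
    return [x for x, c in counts.items() if c == top]

data = [4, 3, 4, 2, 8]   # the observations from Example V.K
print("mean:  ", sample_mean(data))
print("median:", sample_median(data))
print("mode:  ", sample_modes(data))
```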

Interpreting and Comparing the Sample Mean and Median The sample median is often described as robust. This simply means that it is not greatly affected by extreme or outlying observations. The sample mean, on the other hand, is not robust: it can be highly sensitive to extreme observations. Suppose, for example, that we also observe the value 1000 in addition to the data sampled in Example V.K. How does this affect the values of the sample mean and median?
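The effect of the extra observation is quick to check with Python's standard `statistics` module. The script is only a sketch, and the variable names are illustrative.

```python
import statistics

data = [4, 3, 4, 2, 8]            # observations from Example V.K
with_outlier = data + [1000]      # the same sample plus the extreme value

for label, xs in (("original", data), ("with 1000", with_outlier)):
    mean = statistics.mean(xs)
    median = statistics.median(xs)
    print(f"{label:>10}: mean = {mean:.2f}, median = {median}")
```

A single extreme value moves the mean far from the bulk of the data, while the median stays at the same middle value.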

Interpreting and Comparing the Sample Mean and Median Because the sample mean is more sensitive to extreme observations than the sample median, how do you think the mean and median compare (i.e., which will be larger) under the following circumstances? Distribution is right-skewed. Distribution is left-skewed. Distribution is symmetric. When reporting about individual or household income, why do researchers most often use median income as a measure of center?

The Trimmed Mean The sample trimmed mean is a sort of compromise between the sample mean and median. The α100% sample trimmed mean is the sample mean of the middle (1 - 2α)100% of the ordered observations, where α is some proportion of the data that will be trimmed from each tail. The quantity α is typically chosen to be something like 0.10 or 0.20. What is the 20% trimmed mean for Example V.K? While the sample trimmed mean is an interesting way of producing a sample mean that is more robust, it is actually not often used in practice.
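A trimmed mean can be computed by sorting, dropping observations from each tail, and averaging what remains. The sketch below assumes that when α·n is not a whole number, the number trimmed per tail is rounded down; the slides do not specify this, and other conventions exist. The `trimmed_mean` name is illustrative.

```python
import math

def trimmed_mean(xs, alpha):
    """Mean of the ordered data after dropping a fraction alpha from each tail."""
    if not 0 <= alpha < 0.5:
        raise ValueError("alpha must satisfy 0 <= alpha < 0.5")
    ordered = sorted(xs)
    n = len(ordered)
    # Assumption: round alpha * n DOWN to a whole number of observations.
    # The tiny offset guards against floating-point results like 0.9999999.
    k = math.floor(alpha * n + 1e-9)
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)

data = [4, 3, 4, 2, 8]                      # observations from Example V.K
print(trimmed_mean(data, 0.20))             # drops 2 and 8, averages 3, 4, 4
print(trimmed_mean(data + [1000], 0.20))    # drops 2 and 1000
```

With the outlier 1000 appended, the trimmed mean barely moves, because the extreme value is among the observations that get trimmed.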

Summary Statistics: Measures of Spread The sample variance $s^2$ is defined as $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right)$. The interquartile range (or IQR) is defined as IQR = Q3 - Q1, where Q1 is the first quartile (sample 25th percentile) and Q3 is the third quartile (sample 75th percentile). Note that the second quartile Q2 is the sample median. The range is defined as the largest observed value less the smallest observed value, or $X_{(n)} - X_{(1)}$.
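The two algebraically equivalent forms of the sample variance can be compared numerically, as in the sketch below. The function names are illustrative. The IQR is left out here because its value depends on how the quartiles are computed.

```python
def sample_variance(xs):
    """Definitional form: squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Shortcut form: (sum of squares - n * mean^2) / (n - 1)."""
    n = len(xs)
    xbar = sum(xs) / n
    return (sum(x * x for x in xs) - n * xbar ** 2) / (n - 1)

def sample_range(xs):
    """Largest observed value minus the smallest."""
    return max(xs) - min(xs)

data = [4, 3, 4, 2, 8]
print("variance (definition):", sample_variance(data))
print("variance (shortcut):  ", sample_variance_shortcut(data))
print("standard deviation:   ", sample_variance(data) ** 0.5)
print("range:                ", sample_range(data))
```

The shortcut form is handy for hand calculation, but with floating-point arithmetic it can lose precision when the mean is large relative to the spread, so the definitional form is usually the safer choice in software.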

Computing the Quartiles Note that the pth sample percentile is the value (based on the sample) such that p% of the sample lies below that value. The quartiles Q1 and Q3 are hence the sample 25th and 75th percentiles. How do we find these quantiles? Divide the ordered data set in half (e.g., if n = 20, then we use $X_{(1)}, \ldots, X_{(10)}$ to find Q1 and $X_{(11)}, \ldots, X_{(20)}$ to find Q3; if n = 21, then we use $X_{(1)}, \ldots, X_{(10)}$ to find Q1 and $X_{(12)}, \ldots, X_{(21)}$ to find Q3). To determine Q1, if the number of observations comprising the first half of the ordered data is odd, then Q1 is the middle of these. If the number of observations is even, then Q1 is computed as the value lying 25% of the distance between the middle two of these observations. Q3 is found in a similar manner, applying the same logic to the second half of the ordered observations.
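The half-splitting rule for Q1 can be sketched as follows. Two assumptions are made: "25% of the distance between the middle two" is measured upward from the smaller of the two values, and the sample has at least two observations. Only Q1 is coded, because the text leaves open exactly how the 25% rule carries over to the upper half. The names `split_halves` and `first_quartile` are hypothetical, and statistical software often uses different quartile conventions.

```python
def split_halves(xs):
    """Return the lower and upper halves of the ordered data.

    When n is odd, the middle observation belongs to neither half.
    """
    ordered = sorted(xs)
    half = len(ordered) // 2
    return ordered[:half], ordered[len(ordered) - half:]

def first_quartile(xs):
    """Q1 from the lower half, following the rule described above."""
    lower, _ = split_halves(xs)
    m = len(lower)
    if m % 2 == 1:
        return lower[m // 2]                    # odd count: the middle value
    a, b = lower[m // 2 - 1], lower[m // 2]     # even count: the middle two
    return a + 0.25 * (b - a)                   # assumed: 25% of the way from a to b

print(first_quartile([4, 3, 4, 2, 8]))     # lower half is 2, 3
print(first_quartile(range(1, 21)))        # n = 20: lower half is 1..10
```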

Example V.L What are the sample variance $s^2$, the standard deviation s (the square root of $s^2$), the IQR, and the range for the data in Example V.K? How do these values change if an additional observation equal to 1000 is observed?

The Boxplot A boxplot (or box-and-whisker plot) allows us to visualize the distribution of our sample via the so-called five-number summary, which consists of the three quartiles along with the minimum and the maximum. The diagram below shows how we construct such a chart. [Diagram: boxplot labeled, from left to right, $X_{(1)}$, Q1, Q2, Q3, $X_{(n)}$.] Computer software packages sometimes add other features, like an asterisk that indicates the location of the sample mean.

Example V.M Construct a boxplot for the data in Example V.K.

Example V.N Side-by-side boxplots are often useful in comparing the distributions of a continuous variable between two groups. The boxplots below show the distribution of launch temperatures for shuttle missions (through 1986) with zero O-ring failures, versus the distribution for missions with at least one O-ring failure. What do you observe? [Figure: side-by-side boxplots of launch temperature, axis ticks from 55 to 80, for missions with no failures and missions with at least one failure.]

Example V.O The following data are pre-AZT serum antigen levels measured in a study of 20 AIDS patients.

Patient   Serum Antigen Level (pg/ml)   Patient   Serum Antigen Level (pg/ml)
1         149                           11        180
2         0                             12        0
3         0                             13        84
4         259                           14        89
5         106                           15        212
6         255                           16        554
7         0                             17        500
8         52                            18        424
9         340                           19        112
10        65                            20        2600

Stat 3000: Statistics for Scientists and Engineers

Example V.O, cont'd What are the mean, median, and mode? What are the variance, standard deviation, and IQR? Draw a boxplot for these data.