Outline

- Finish sampling slides from Tuesday.
- Study design: what do you do with the subjects/units once you select them? (OI Sections 1.4-1.5)
  - Observational studies vs. experiments
- Descriptive statistics and data visualization (OI Sections 1.6-1.7)
- Lab: Introduction to R and RStudio

Practice

Discuss with a partner:
1. A researcher is interested in the opinions of MSU students about updating gym equipment. A surveyor stands at the gym entrance and asks each of the next 50 people who enter for their opinion about updating gym equipment. What type of sample is this, and what bias might it have?
2. There are 25 sections of Stat 216 offered at MSU this semester. How would you use cluster sampling to choose a sample from all Stat 216 students? Multistage sampling?
3. What are the two main differences between a stratified random sample and a cluster sample?

Observational Studies vs Experiments

An observational study is a study that observes individuals and measures variables, but does not attempt to manipulate or influence the responses.
- In prospective observational studies, investigators choose a sample and collect new data generated from that sample; they look forward in time.
- In retrospective observational studies, investigators look backwards in time and use data that have already been collected.

In a case-control study, the researchers select a sample of cases (e.g., lung-cancer patients) and a sample of controls (e.g., patients similar to the cases but without lung cancer) and ask them about past behavior (e.g., smoking).

Retrospective or prospective?
- A study that follows marijuana users in Colorado for 5 years.
- A study of illegal immigrant activity last year in Arizona.

Observational Studies vs Experiments

An experiment is a study in which treatment(s) are deliberately imposed on individuals in order to observe their response.
A randomized experiment is an experiment where treatments are randomly assigned to subjects.
The treatments are levels of an explanatory variable.
Don't confuse random assignment with random sampling!

Confounding Variables

In an observational study, we cannot show cause-and-effect relationships because there is the possibility that the response is affected by some variable(s) other than the ones being measured.

A confounding variable is a variable that both:
1. is related to the explanatory variable, and
2. may have an effect on the response variable.

Discuss: What are the disadvantages and advantages of an observational study compared to an experiment?

In a randomized experiment, the random assignment of levels of the explanatory variable should balance out (on average) any possible confounding variables, allowing us to examine cause-and-effect relationships.
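To see why random assignment tends to balance a potential confounder, the following minimal Python sketch splits a simulated group in two at random and compares the group means of a made-up covariate. The `ages` values, the group size of 1000, the seeds, and the helper name `randomly_assign` are illustrative assumptions, not part of the slides.

```python
import random
import statistics


def randomly_assign(units, seed):
    """Randomly split units into two equal-sized groups (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Simulated covariate (assumption): ages of 1000 subjects, a possible confounder.
age_rng = random.Random(1)
ages = [age_rng.randint(18, 80) for _ in range(1000)]

treatment, control = randomly_assign(ages, seed=2)

# With random assignment, the two group means should be close on average.
mean_gap = abs(statistics.mean(treatment) - statistics.mean(control))
print(f"treatment mean age: {statistics.mean(treatment):.1f}")
print(f"control mean age:   {statistics.mean(control):.1f}")
print(f"gap in mean age:    {mean_gap:.2f}")
```

Rerunning with different seeds gives slightly different gaps, but they stay small relative to the spread of ages. Random assignment aims for that balance on average; it does not guarantee it in any single experiment.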
Houndstongue (a noxious weed) is found in abundance on private and public lands that have been grazed by cattle. Houndstongue is rarely found on lands that have been grazed by mountain goats. One investigator concluded that houndstongue infestations could be reduced by importing mountain goats to the infested areas.
1. Variables and types? Explanatory and response?
2. Sampling bias? To what population can we generalize?
3. Observational study or experiment? Prospective or retrospective?
4. Confounding variables?

A 1993 study by UCI researchers Rauscher, Shaw, and Ky, published in Nature, tested 36 college students' performance on a set of three standard IQ spatial reasoning tasks, each preceded by one of three conditions: (1) listening to 10 minutes of a Mozart sonata, (2) listening to 10 minutes of relaxation instruction, or (3) listening to 10 minutes of silence. All three conditions were tested on each student, in a random order (to compensate for a possible practice effect). Results showed that IQ scores were significantly higher after the Mozart condition than after the other two.
1. Variables and types? Explanatory and response?
2. Sampling bias? To what population can we generalize?
3. Observational study or experiment? Prospective or retrospective?
4. Confounding variables?

Does Prayer Lower Blood Pressure?

A study followed a random sample of 2391 people for 6 years and concluded that "Attending religious services lowers blood pressure more than tuning into religious TV or radio."
USA Today headline: "Prayer can lower blood pressure"
1. Variables and types? Explanatory and response?
2. Sampling bias? To what population can we generalize?
3. Observational study or experiment? Prospective or retrospective?
4. Confounding variables?

Each of 22,071 male physicians between the ages of 40 and 84 was randomly assigned to one of two treatment groups: (1) one aspirin per day, (2) one placebo tablet per day.
Results showed that the aspirin group had a lower incidence of heart attacks than the placebo group.
1. Variables and types? Explanatory and response?
2. Sampling bias? To what population can we generalize?
3. Observational study or experiment? Prospective or retrospective?
4. Confounding variables?

What can we conclude?

How the study is conducted determines cause-and-effect conclusions; how the sample is collected determines generalizability.

Random sample from the population:
- Randomized experiment: Causal relationship, and can extend results to population.
- Observational study: Cannot conclude causal relationship, but can extend results to population.

Non-random sample from the population:
- Randomized experiment: Causal relationship, but cannot extend results to a population.
- Observational study: Cannot conclude causal relationship, and cannot extend results to a population.

Principles of Experimental Design
1. Controlling
2. Randomization (random assignment)
3. Replication
4. Blocking
Experimental Design: Control

An extraneous factor is a variable that is not of primary interest and yet affects the response variable. An extraneous factor is called a confounding variable only if it is also related to the explanatory variable.

In an experiment, researchers try to hold extraneous factors constant for all units so that the effects of the extraneous factors are not confounded with the factors of interest.

Experimental Design: Control

- If the treatment is a new drug administered through a pill, researchers will give the control group a placebo pill: a pill that has no active treatment, yet looks and tastes the same as the new drug.
- The subjects and the researchers recording the response should be blind (they do not know which treatment was received) to avoid unconscious expectations. If both subjects and researchers are blind, we say the study is double-blind. If only one is blind, we say the study is single-blind.

Experimental Design: Randomization

Levels of the explanatory variable (treatments) are randomly assigned to experimental units in order to create similar experimental groups. This balances out values of the extraneous factors.

Experimental Design: Replication

- Within one experiment, we use replication by assigning many experimental units (large sample sizes) to each treatment group to reduce the role of random variation due to uncontrolled extraneous variables.
- Groups of scientists should replicate entire studies to verify earlier findings.

Experimental Design: Blocking

Prior to random assignment, experimental units are classified into homogeneous subgroups, or blocks, so that the extraneous factors are held constant within each block. Treatments are randomly assigned to units within each block.

"Block what you can, randomize what you cannot." (George Box)

Discuss: What is the difference between blocking in an experiment and stratifying in choosing a sample?

Two Basic Experimental Designs

1. Completely Randomized Design: Experimental units are randomly assigned to each treatment (using experimental design principles 1-3).
2. Randomized Block Design: Experimental units are classified into blocks that are similar with respect to extraneous variable(s), then units are randomly assigned to treatments independently within each block (using experimental design principles 1-4).

A matched pairs design is a randomized block design where each block consists of a pair of experimental units.
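The two designs above differ only in where the shuffling happens, which a short Python sketch can make concrete. Everything here is an illustrative assumption: the 12 unit names, the two blocks, the treatment labels, and the helper names `completely_randomized` and `randomized_block` do not come from the slides.

```python
import random
from collections import Counter, defaultdict


def completely_randomized(units, treatments, seed=0):
    """Shuffle all units together, then deal treatments out in turn."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    return {unit: treatments[i % len(treatments)] for i, unit in enumerate(shuffled)}


def randomized_block(units, block_of, treatments, seed=0):
    """Group units by block, then randomize independently within each block."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for unit in units:
        blocks[block_of[unit]].append(unit)
    assignment = {}
    for label in sorted(blocks):
        members = blocks[label]
        rng.shuffle(members)
        for i, unit in enumerate(members):
            assignment[unit] = treatments[i % len(treatments)]
    return assignment


# Hypothetical experimental units and blocks.
units = [f"unit{i}" for i in range(12)]
block_of = {unit: ("block A" if i < 6 else "block B") for i, unit in enumerate(units)}
treatments = ["treatment", "control"]

crd = completely_randomized(units, treatments, seed=1)
rbd = randomized_block(units, block_of, treatments, seed=1)

crd_counts = Counter(crd.values())
rbd_counts = Counter((block_of[unit], t) for unit, t in rbd.items())
print(crd_counts)
print(rbd_counts)
```

In the blocked version every block contains the same number of units per treatment, so the blocking variable cannot differ between treatment groups. With blocks of size two and two treatments, the same function produces a matched pairs assignment.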
Practice

Suppose a researcher wants to know whether taking caffeine an hour before swimming affects the time it takes swimmers to complete a 1-mile swim, and that 50 volunteers are available for the study.
1. What are the experimental units?
2. What is the response variable? The explanatory variable? What are the types of these variables?
3. What should be the treatments?
4. What are some potential extraneous factors?
5. Describe how you could design a:
   a. Completely randomized design
   b. Randomized block design
   c. Matched-pairs design

Statistical Investigation Process (Tintle et al., 2016)

[Figure: the statistical investigation process, annotated with "Sampling methods and experimental design" and "Descriptive statistics (summary statistics) and plots"]

SUMMARIZING DATA
Descriptive Statistics and Data Visualization

Summary Statistics

Type of variable: Summary statistics
- One categorical: frequency (count) or relative frequency (proportion) in each category
- Two categorical: contingency (two-way) table
- One quantitative:
  - Center: mean, median, mode
  - Variability: range, quartiles, inter-quartile range (IQR), variance, standard deviation
  - Shape: skewness, kurtosis (we won't cover these)
- Two quantitative: correlation coefficient ($r$), coefficient of determination ($R^2$)

Plots

Type of variable: Plot
- One categorical: Bar plot (do not use pie charts! Why?), mosaic plot
- Two categorical: Segmented bar plot (or side-by-side bar plot)
- One quantitative: Histogram, density plot (smoothed histogram), dotplot, boxplot, stem-and-leaf plot
- Two quantitative: Scatterplot
- One quantitative and one categorical: Side-by-side boxplots
- Two quantitative and one categorical: Scatterplot with different colors/shapes for categories

Example: Nightlights and Nearsightedness

Survey of n = 479 children. Those who slept with a nightlight or in a fully lit room before age 2 had a higher incidence of nearsightedness (myopia) later in childhood.
- What are the observational units?
- Note that now we are measuring two categorical variables. What are they?
- Is there an association between the two variables? Why?
- How could we construct a bar graph for these data?
- How could we detect association between the two variables in the bar graph?
Example: Nightlights and Nearsightedness

Response: Degree of myopia
Explanatory: Amount of sleeptime lighting

You could create a segmented bar plot of these data by stacking the three colored bars within each lighting condition.

Bar Graphs: Important Notes

When creating a bar graph displaying two categorical variables:
- Categories of the explanatory variable → x-axis.
- Categories of the response variable → differing colors/shadings; include in legend.
- The y-axis reports row percentages (relative frequencies): the percent (not frequency) in each response category within each explanatory category.
- Heights of bars in each explanatory variable category should add to 100%.
- Always include x-axis and y-axis labels!

What to look for in the distribution of one quantitative variable?

1. Center: where is the distribution centered?
2. Variability (spread): how spread out are the values of the variable?
3. Shape: what shape is the distribution? (e.g., symmetric, right/positive skewed, left/negative skewed, unimodal, bimodal, multimodal)
4. Outliers: are there any observations that do not fit the overall pattern of the distribution?

[Figure: Histogram of 272 Eruption Times for Old Faithful Geyser; x-axis: Length of Eruption (min); y-axis: Frequency]

Center

Notation for raw data: You have $n$ observations of a quantitative variable, denoted by $x_1, x_2, \ldots, x_n$; $n$ is called the sample size.

Mean = arithmetic average = x-bar:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$$

The mean is the balancing point: the value such that the sum of all the deviations from the mean is zero:

$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0.$$

Center (cont)

Median = middle value = M
- Arrange the values in increasing order.
- If n is odd, the median M is located in position (n+1)/2.
- If n is even, the median M is the average of the middle two values, at positions (n/2) and (n/2)+1.

Mode = most frequent value
- The mode need not be unique.
- A distribution is called unimodal if there is a single prominent peak; bimodal if there are two prominent peaks.
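The definitions of mean, median, and mode above can be checked on a small data set. The Python sketch below uses made-up values (an assumption, not data from the slides) and a hypothetical helper, `median_by_position`, that follows the odd/even position rule stated above.

```python
import statistics


def median_by_position(values):
    """Median via the position rule: (n+1)/2 if n is odd, else average of n/2 and n/2 + 1."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]  # positions are 1-indexed on the slide
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


data = [2, 3, 3, 5, 7, 10]  # made-up values for illustration

mean = sum(data) / len(data)
deviation_sum = sum(x - mean for x in data)  # balancing-point property
median = median_by_position(data)
modes = statistics.multimode(data)  # returns every most-frequent value

print("mean:", mean)
print("sum of deviations from the mean:", deviation_sum)
print("median:", median)
print("mode(s):", modes)
```

`statistics.multimode` returns a list because, as noted above, the mode need not be unique.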
Shape

(a) Skewed left (negative)
(b) Normal distribution (example of a symmetric distribution)
(c) Skewed right (positive)

Note: The direction of skewness is the direction in which the tail is pulling the distribution.

Variability

Range = highest value (max) - lowest value (min)

Interquartile range (IQR) = $Q_3 - Q_1$, where
- $Q_3$ = upper quartile (75th percentile) = median of upper half
- $Q_1$ = lower quartile (25th percentile) = median of lower half

Variability (cont)

Standard deviation measures variability by summarizing how far individual data values generally are from the mean.
- It's most useful for bell-shaped data.
- Interpret the standard deviation as roughly the average distance values fall from the mean.

Calculating the Sample Standard Deviation

Formula for the (sample) standard deviation:

$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$

An equivalent formula, easier to compute, is:

$$s = \sqrt{\frac{\sum x_i^2 - n\bar{x}^2}{n - 1}}$$

The value of $s^2$ is called the (sample) variance.

Data: 90, 90, 100, 110, 110 (n = 5 observations)

$\bar{x}$ = (90 + 90 + 100 + 110 + 110)/5 = 100

Observation $x_i$    Deviation $x_i - \bar{x}$    Squared deviation $(x_i - \bar{x})^2$
90                   -10                          100
90                   -10                          100
100                  0                            0
110                  10                           100
110                  10                           100
Sum = 500            Sum = 0                      Sum = 400

Sample variance: $s^2 = \frac{400}{5 - 1} = 100$
Sample std dev: $s = \sqrt{100} = 10$

The following dotplots represent ratings of statistics from three different classes. Which class has the largest variability in their ratings? Why? (No calculations are necessary.)

[Figure: dotplots for classes A, B, and C]
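The worked standard deviation calculation for the data 90, 90, 100, 110, 110 can be reproduced in a few lines of Python. This sketch computes $s$ both from the definition and from the equivalent shortcut formula; the variable names are illustrative choices.

```python
import math

data = [90, 90, 100, 110, 110]
n = len(data)
mean = sum(data) / n

# Definition: squared deviations from the mean, divided by n - 1.
squared_deviations = [(x - mean) ** 2 for x in data]
variance = sum(squared_deviations) / (n - 1)
s_definition = math.sqrt(variance)

# Shortcut: (sum of squares - n * mean^2) / (n - 1).
s_shortcut = math.sqrt((sum(x ** 2 for x in data) - n * mean ** 2) / (n - 1))

print("mean:", mean)
print("sum of squared deviations:", sum(squared_deviations))
print("sample variance:", variance)
print("s (definition):", s_definition)
print("s (shortcut):", s_shortcut)
```

Both routes divide by n - 1 rather than n, which is what makes this the sample standard deviation.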
The following dotplots represent ratings of statistics from two different classes. Which class would you say has more variability in their ratings? (No calculations are necessary.)

[Figure: dotplots for classes A and B]

Outliers

An outlier is a data value that doesn't fit the pattern of the majority of the data.

Rule of thumb: An observation is considered an outlier if it is either greater than $Q_3 + 1.5 \times \text{IQR}$ or less than $Q_1 - 1.5 \times \text{IQR}$.

Influence of Outliers and Shape on the Mean and Median

Outliers have a larger influence on the mean than on the median. Why?

Example:
- 1, 2, 3, 4, 5: Mean = 3, Median = 3
- 1, 2, 3, 4, 5, 100: Mean = 19.2, Median = 3.5

Symmetric → Mean = Median
Skewed → Mean pulled in the direction of skew:
- Skewed left: Mean < Median
- Skewed right: Mean > Median

The median is a robust estimate (resistant to outliers); the mean is not a robust estimate.

Boxplots

[Figure: picture by John Landers, http://www.causeweb.org]

Five-Number Summary

1. Minimum (smallest value)
2. $Q_1$ = Lower Quartile = 25th Percentile = median of lower half of the ordered data values (not including median)
3. Median = middle value = 50th Percentile
4. $Q_3$ = Upper Quartile = 75th Percentile = median of upper half of the ordered data values (not including median)
5. Maximum (largest value)

Example: Fastest Speeds Ever Driven

Asked Penn State statistics students, "What is the fastest speed you have ever driven?" (mph)

Ordered data (in rows of 10 values) for n = 87 males:

55 60 80 80 80 80 85 85 85 85
90 90 90 90 90 92 94 95 95 95
95 95 95 100 100 100 100 100 100 100
100 100 101 102 105 105 105 105 105 105
105 105 109 110 110 110 110 110 110 110
110 110 110 110 110 112 115 115 115 115
115 115 120 120 120 120 120 120 120 120
120 120 124 125 125 125 125 125 125 130
130 140 140 140 140 145 150

Find the five-number summary for these data.
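One way to check a hand calculation for the fastest-speed data is the Python sketch below. It applies the quartile definition from the five-number summary above (median of each half, not including the median) and the 1.5 × IQR rule of thumb. The helper names `median` and `five_number_summary` are illustrative. Statistical software often uses slightly different quartile conventions, so its output may not match this rule exactly.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Min, Q1, median, Q3, max using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]        # excludes the median when n is odd
    upper = ordered[(n + 1) // 2:]   # excludes the median when n is odd
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


speeds = [
    55, 60, 80, 80, 80, 80, 85, 85, 85, 85,
    90, 90, 90, 90, 90, 92, 94, 95, 95, 95,
    95, 95, 95, 100, 100, 100, 100, 100, 100, 100,
    100, 100, 101, 102, 105, 105, 105, 105, 105, 105,
    105, 105, 109, 110, 110, 110, 110, 110, 110, 110,
    110, 110, 110, 110, 110, 112, 115, 115, 115, 115,
    115, 115, 120, 120, 120, 120, 120, 120, 120, 120,
    120, 120, 124, 125, 125, 125, 125, 125, 125, 130,
    130, 140, 140, 140, 140, 145, 150,
]

summary = five_number_summary(speeds)
minimum, q1, med, q3, maximum = summary
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in speeds if x < lower_fence or x > upper_fence]

print("five-number summary:", summary)
print("IQR:", iqr)
print("outlier fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```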
How to Draw a Boxplot and Identify Outliers

1. Draw a horizontal (or vertical) axis with equally spaced values from lowest to highest in the data. Make sure you label your axis with the variable name and units of measurement (e.g., "Fastest speed driven (mph)").
2. Draw a rectangle (box) with ends at the quartiles.
3. Draw a line in the box at the value of the median.
4. Compute 1.5 × IQR. Any value more than this distance from the closest quartile is considered an outlier.
5. Draw a line (whisker) from each end of the box extending to the farthest data value that is not an outlier. If there are no outliers, the whiskers extend to the min and max.
6. Draw asterisks/dots to indicate the outliers.

Draw a boxplot for the fastest speeds.

Boxplot for Fastest Speeds

1. Draw a horizontal line from 55 to 150 and label it.
2. Draw a rectangle with ends at 95 and 120.
3. Draw a line in the box at the median of 110.
4. Compute IQR = 120 - 95 = 25.
5. Compute 1.5(IQR) = 1.5(25) = 37.5; an outlier is any value below 95 - 37.5 = 57.5 or above 120 + 37.5 = 157.5.
6. Draw a line from each end of the box extending down to 60 (smallest data value not an outlier) and up to 150.
7. Draw an asterisk/dot at the outlier of 55 mph.

Always label your axes!

[Figure: boxplot of Fastest Speed Driven (mph)]

Comparing two groups

[Figure: side-by-side boxplots of Fastest Speed Driven (mph); group labels: Female, Male]

What to look for in the relationship between two quantitative variables?

1. Form: what is the form of the relationship between the two variables? (e.g., linear, quadratic, piece-wise linear)
2. Strength: how strong is the relationship? (e.g., how closely do the points follow the form?)
3. Direction: what is the direction of the relationship?
   - Positive association = as one variable increases, the other tends to increase
   - Negative association = as one variable increases, the other tends to decrease
4. Outliers: are there any observations that do not fit the overall pattern of the relationship?

[Figure source: http://www.npr.org/2017/08/18/544265493/chart-the-relationship-between-seeing-discrimination-and-voting-fortrump]
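The direction and strength of a linear relationship are commonly summarized with the correlation coefficient $r$, which these slides list as a summary statistic for two quantitative variables without giving a formula. The Python sketch below assumes the standard Pearson formula and uses made-up data; the helper name `pearson_r` and both data sets are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient (standard formula, assumed here)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: one mostly increasing pattern and one mostly decreasing pattern.
x = [1, 2, 3, 4, 5]
increasing = [2, 4, 5, 4, 5]
decreasing = [5, 4, 5, 4, 2]

r_pos = pearson_r(x, increasing)
r_neg = pearson_r(x, decreasing)
print(f"r for the increasing pattern: {r_pos:.3f}")
print(f"r for the decreasing pattern: {r_neg:.3f}")
```

The sign of $r$ matches the direction of the association, and values closer to -1 or 1 indicate points that follow a line more closely. Because $r$ only measures linear association, it should be read alongside the scatterplot, not in place of it.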