Math 075 Activities and Worksheets
Book 2: Linear Regression
Name:
Scatterplots: Intro to Correlation
Represent two numerical variables on a scatterplot and informally describe how the data points are distributed and any apparent relationship that exists between the two variables (e.g., between time spent on homework and grade level).
Write positive correlation, negative correlation, or no correlation to describe each relationship.
1. 2. 3. 4. 5. 6.
7. Use the given data to make a (year, units of CDs) scatter plot.
8. What kind of correlation is there between the year and the number of CDs sold?
9. Use the given data to make a (year, units of cassettes) scatter plot.
10. What kind of correlation is there between the year and the number of cassettes sold?
11. The scatter plot to the right shows the average traffic volume and average vehicle speed on I-80 for 50 days in 2009. Which statement best describes the relationship between average traffic volume and average vehicle speed shown on the scatter plot?
A. As traffic volume increases, vehicle speed increases.
B. As traffic volume increases, vehicle speed decreases.
C. As traffic volume increases, vehicle speed increases at first, then decreases.
D. As traffic volume increases, vehicle speed decreases at first, then increases.
Understanding Scatterplots
Match each description for a set of measurements (A and B) to a scatterplot, and briefly explain your reasoning. Each graph in this packet can only be used once.
Scatterplot 1:
Scatterplot 2:
1. If x = city miles per gallon and y = highway miles per gallon for 10 cars, describe which scatterplot is likely the correct graph. Explain your reasoning.
a. What does a dot represent?
2. If x = sodium (milligrams/serving) and y = Consumer Reports quality rating for 10 salted peanut butters, describe which scatterplot is likely the correct graph. Explain your reasoning.
a. What does a dot represent?
These scatterplots show body measurements for 34 adults who are physically active. Some measurements are girths; a girth is the distance around a body part. Match each description (A, B, and C) to a scatterplot. Briefly explain your reasoning.
A. B. C.
3. x = forearm girth (cm), y = bicep girth (cm). The bicep is above the elbow.
a. What does a dot represent?
4. x = calf girth (cm), y = bicep girth (cm). The calf is below the knee.
a. What does a dot represent?
5. x = age (years), y = bicep girth (cm)
a. What does a dot represent?
Match each description of a set of measurements to a scatterplot. Then describe what a dot represents in each graph.
6. x = average outdoor temperature and y = heating costs of a residence during the winter
7. x = height (inches) and y = shoe size for a random sample of adults
8. x = height (inches) and y = score on an intelligence test for a random sample of teenagers (15-17)
Lines have been added to some of the scatterplots used in Lesson 3.1.1 to summarize the relationship between the ingredient and the Consumer Reports rating for breakfast cereals. You will learn more about summary lines in future lessons.
9. Which ingredients (sugar, protein, and/or fat) are negatively associated with ratings?
10. Which of these negative associations is the strongest?
House Prices: Correlation
Your lab report should include a well-written response to each of the following questions and all relevant supporting graphs and analyses performed using StatCrunch. Submit your assignment through CANVAS by uploading it as a document (either in Word format or as a PDF). Remember to put your name on the document itself.
Regression: Find the data set house prices.txt in the class StatCrunch group and load it into StatCrunch. The data set contains information collected on characteristics of houses that were sold in a suburban community. It records each house's price (the price at which the house sold, in thousands of dollars), its size (in square feet), and other characteristics that are usually recorded when a house is on the market. In task #1 we want to investigate the relationship between the price of a house and its size.
1. Which variable is the explanatory variable and which variable is the response variable when investigating the relationship between the price of a house and its size?
2. Construct a scatterplot for the data. Graph -> Scatter Plot -> For x variable, select the explanatory variable. For y variable, select the response variable. Click Compute. Copy the graph below:
3. How can we describe the nature of the relationship between house price and size from the scatterplot? (Think form, direction, strength.) Do you notice any outliers or deviations from the general pattern?
4. Compute the correlation for house price and size. Stat -> Summary Stats -> Correlation. Choose the two variable names for which you want to calculate the correlation. Click Compute.
5. What does the correlation you found say about the nature of their relationship? (Think about what the correlation measures.)
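For readers who want to see what the Correlation menu in step 4 is computing, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The function name `pearson_r` and the five data pairs are made up for illustration; they are not taken from house prices.txt, and StatCrunch remains the tool the assignment expects.

```python
import math

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products of deviations
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

# Made-up illustration values, NOT the class data set
size_sqft = [1200, 1500, 1800, 2100, 2400]
price_thousands = [150, 185, 200, 240, 265]

r = pearson_r(size_sqft, price_thousands)
print(round(r, 3))
```

A value of r near +1 or -1 indicates a strong linear relationship, and the sign gives the direction. r describes linear association only, so it should always be read alongside the scatterplot.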
Diamond Linear Regression Worksheet
An article in the Journal of Statistics Education reported the price of diamonds of different sizes in Singapore dollars (SGD). The data set below is consistent with this data, adjusted to 2004 US dollars. Open the Diamond data set in our StatCrunch Group, and answer the questions.
1. What is the response variable and what is the explanatory variable in this model?
- Explanatory variable:
- Response variable:
2. Explain why you chose the way you did for Number 1.
3. Construct a scatterplot. How do you describe the relationship that you observe in the scatterplot?
form:
direction:
strength:
4. Find the least-squares regression line that describes the price of a diamond in relation to its carat size.
5. What is the slope, with units? Interpret the slope using a complete sentence.
6. What is the y-intercept, with units? Interpret the y-intercept using a complete sentence.
7. Using the regression equation, estimate the cost of a diamond that weighs 0.32 carats.
8. Nick bought a diamond that weighs 0.32 carats and was included in the data set given. What is his residual? What does that mean? Did Nick overpay or underpay?
9. Calculate the correlation. What does the correlation say about the nature of the relationship between diamond size and price?
10. How much variability in price does the carat size explain? What number are you using for your answer?
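The fit, prediction, and residual steps in questions 4 through 8 can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name `fit_line`, the five (carat, price) pairs, and the "actual" price are all made up and are not the StatCrunch Diamond data.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x   # the line passes through (mean_x, mean_y)
    return slope, intercept

# Made-up illustration values (exactly linear), NOT the class data set
carats = [0.20, 0.25, 0.30, 0.35, 0.40]
prices = [400, 600, 800, 1000, 1200]

slope, intercept = fit_line(carats, prices)

# Prediction for a 0.32-carat diamond under this made-up line
predicted = intercept + slope * 0.32

# Residual = actual value minus predicted value (the actual price here is hypothetical)
actual = 900
residual = actual - predicted
print(slope, intercept, predicted, residual)
```

A positive residual means the observed price sits above the regression line, so the buyer paid more than the model predicts; a negative residual means the buyer paid less.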
Residual Plots:
11. Construct a residual plot.
a. Go to Stat > Regression > Simple Linear. Select the appropriate explanatory and response variables.
b. On the editing page, scroll down to Graphs. Under Graphs, scroll down and highlight Residuals vs. X-values. Click Compute.
c. Which of the graphs (a)-(f) above does your residual plot look like? Notice that the red line must be at y = 0.
d. If a linear model is a good fit, the residual plot will look like (a), with points scattered everywhere! Do you think that a linear model is a good fit for our data? Why or why not?
Now construct a histogram of the residuals under the same menu.
12. If a linear model is a good fit, the histogram of the residuals should be approximately normal and centered around zero, like the one below. Do you think a linear model is a good fit for our data? Why or why not?
13. What is the standard error Se? To find it, go to Stat > Regression > Simple Linear and select the appropriate explanatory and response variables. The output should include an "Estimate of error standard deviation." This is the Se.
14. Se represents the average distance that the observed values fall from the regression line. Write a statement interpreting Se in the context of our data.
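To make the standard error concrete, the sketch below computes residuals and then Se using the usual simple-linear-regression formula, the square root of (sum of squared residuals) / (n - 2). It is assumed here that this is the quantity StatCrunch reports as the estimate of error standard deviation. The function name `regression_se` and the four data points are made up for illustration.

```python
import math

def regression_se(xs, ys):
    """Fit a least-squares line, then return (residuals, Se)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual = observed y minus the y predicted by the line
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sse = sum(e ** 2 for e in residuals)
    # Divide by n - 2 because two quantities (slope and intercept) were estimated
    se = math.sqrt(sse / (n - 2))
    return residuals, se

# Made-up illustration values, NOT the class data set
xs = [1, 2, 3, 4]
ys = [3, 3, 7, 7]

residuals, se = regression_se(xs, ys)
print([round(e, 2) for e in residuals], round(se, 3))
```

The residuals from a least-squares fit always sum to zero (up to rounding), which is why a residual plot is centered on the line y = 0 and why the histogram of residuals should be centered around zero.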
USPS Postal Linear Regression Classwork
To decide whether there is an association between the year of a postal increase and the new postal rate for first-class mail, data were gathered from the United States Postal Service. In 1981, the United States Postal Service changed its rates on March 22 and November 1. This information is shown in the data set below. Find it on StatCrunch, and load it.
1. Choose an appropriate year representation for t = 0. We do not want to use such big numbers for our model. Make a new column titled Years since 1970. Put in the appropriate numbers.
2. What is the response variable and what is the explanatory variable in this model?
- Response variable:
- Explanatory variable:
3. Explain why you chose the way you did for Number 2.
4. Construct a scatterplot. How do you describe the relationship that you observe in the scatterplot?
form:
direction:
strength:
5. Find the least-squares regression line that describes the relationship between the year and the postal rate.
6. What is the slope, with units? Interpret the slope using a complete sentence.
7. What is the y-intercept, with units? Interpret the y-intercept using a complete sentence.
8. Using the regression equation, estimate the cost of a postage stamp in 1977. (Hint: you're not plugging in 1977!)
9. The actual postage stamp cost $0.13. What is the residual? (Remember, the residual is the actual value minus the predicted value.)
10. Calculate the correlation. What does the correlation say about the nature of the relationship between years since 1970 and the postage rate?
11. How much variability in postage rate does the year explain? What number are you using for your answer?
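The hint in question 8 and the residual rule in question 9 can be sketched as follows. The slope and intercept below are hypothetical placeholders chosen only to show the arithmetic; they are not the regression line for the USPS data, which you should take from your own StatCrunch output.

```python
BASE_YEAR = 1970

def years_since_base(year):
    """Recode a calendar year as years since 1970, matching the new column."""
    return year - BASE_YEAR

# HYPOTHETICAL coefficients for illustration only (dollars, and dollars per year)
intercept = 0.05
slope = 0.01

t = years_since_base(1977)          # plug in 7, not 1977
predicted = intercept + slope * t   # predicted rate under the hypothetical line

actual = 0.13                       # actual 1977 rate given in question 9
residual = actual - predicted       # residual = actual minus predicted
print(t, round(predicted, 2), round(residual, 2))
```

Plugging 1977 directly into a model built on years since 1970 would give a wildly wrong prediction, because the model's x-values run from 0 upward rather than from 1970 upward.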
Residual Plots:
12. Construct a residual plot.
a. Go to Stat > Regression > Simple Linear. Select the appropriate explanatory and response variables.
b. On the editing page, scroll down to Graphs. Under Graphs, scroll down and highlight Residuals vs. X-values. Click Compute.
c. Which of the graphs (a)-(f) above does your residual plot look like? Notice that the red line must be at y = 0.
d. If a linear model is a good fit, the residual plot will look like (a), with points scattered everywhere! Do you think that a linear model is a good fit for our data? Why or why not?
Now construct a histogram of the residuals under the same menu.
13. If a linear model is a good fit, the histogram of the residuals should be approximately normal and centered around zero, like the one below. Do you think a linear model is a good fit for our data? Why or why not?
14. What is the standard error Se? To find it, go to Stat > Regression > Simple Linear and select the appropriate explanatory and response variables. The output should include an "Estimate of error standard deviation." This is the Se.
15. Se represents the average distance that the observed values fall from the regression line. Write a statement interpreting Se in the context of our data.
Olympics Long Jump
The following data set contains the winning jump lengths (in meters) for the Olympic Men's Long Jump winners. Open the data set in our StatCrunch Group, and answer the questions.
1. What is the response variable and what is the explanatory variable in this model?
- Explanatory variable:
- Response variable:
2. Explain why you chose the way you did for Number 1.
3. Construct a scatterplot. How do you describe the relationship that you observe in the scatterplot?
form:
direction:
strength:
4. Find the least-squares regression line that describes the length of a winning jump in relation to the year.
5. What is the slope, with units? Interpret the slope using a complete sentence.
6. There is no data for 1940. Do a Google search to find out whether there were Olympics in 1940, and explain why there's no data.
7. If there had been Olympics in 1940, predict what the winning long jump would have been using your regression model. Does this number seem reasonable?
8. Is it okay to predict the future? Predict what the winning long jump will be in 2180. Does this number seem reasonable? Why or why not?
9. Calculate the correlation. What does the correlation say about the nature of the relationship between the winning long jump distance and the year?
10. What is the standard error Se? To find it, go to Stat > Regression > Simple Linear and select the appropriate explanatory and response variables. The output should include an "Estimate of error standard deviation." This is the Se. What does this number tell us?
Fatalities Worksheet: Linear Regression
We are going to analyze the association between the number of drunk driving fatalities and the years after 1980.
1. What would be your explanatory and response variables in this analysis?
Explanatory Variable:
Response Variable:
2. Create a fitted line plot that describes the linear association between the years after 1980 and the number of drunk driving fatalities. Be sure to choose the appropriate explanatory and response variables. Be sure to label the axes correctly, including the appropriate units being measured.
[Minitab Fitted Line Plot: Drunk Driving Fatal Accidents (vertical axis) vs. Yr Since 80 (horizontal axis). Drunk Driving Fatal Accidents = 19141 - 357.3 Yr Since 80; S = 1227.20, R-Sq = 81.6%, R-Sq(adj) = 80.8%]
3. What is the equation of the least squares regression line?
4. What is the slope of your model, including units? Write a sentence that interprets this slope.
5. What is the intercept of your model? Write a sentence to interpret your intercept in context.
6. What is the correlation coefficient of your linear model? What does this value tell you about the strength of the linear relationship?
7. Find the predicted number of drunk driving fatalities in 1992. Show work.
8. Find the estimated residual (error) for the number of drunk driving fatalities in 1992. (Remember, the residual is the actual value minus the predicted value.)
9. What is the Coefficient of Determination (r^2)? Write a sentence to interpret it.
10. What is the Standard Error of Regression (Se)? Write a sentence to interpret it.
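As a check on the arithmetic for questions 6 and 7, the following sketch uses the values printed on the Minitab fitted line plot (intercept 19141, slope -357.3, R-Sq 81.6%). The function and variable names are illustrative choices. Your own fitted line plot may show slightly different numbers if your data or software settings differ.

```python
import math

# Values read from the Minitab fitted line plot in this worksheet
INTERCEPT = 19141.0      # predicted fatalities when Yr Since 80 = 0 (i.e., 1980)
SLOPE = -357.3           # change in predicted fatalities per additional year
R_SQUARED = 0.816        # R-Sq = 81.6%

def predict_fatalities(year):
    """Predicted number of drunk driving fatalities for a calendar year."""
    years_since_1980 = year - 1980
    return INTERCEPT + SLOPE * years_since_1980

# 1992 is 12 years after 1980, so this is 19141 - 357.3 * 12
predicted_1992 = predict_fatalities(1992)

# r is the square root of R-Sq, carrying the sign of the slope (negative here)
r = -math.sqrt(R_SQUARED)

print(round(predicted_1992, 1), round(r, 3))
```

The residual for 1992 in question 8 would then be the actual 1992 count from the data set minus `predicted_1992`; the actual count is not reproduced here.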
Play Ball!! And Do Statistics!!!
Objective: You will play soccer today, represent the data on a scatterplot, and analyze the data.
Materials:
Soccer board and spinner
1 large paper clip per group
A "ball" such as a penny
Directions:
1. Flip your penny to decide who goes first.
2. Put the penny on the dark line in the center of the game board.
3. Players take turns (one goes, then the other person goes) by spinning the paper clip on the spinner and moving that many yards toward his/her opponent's goal (each line on the game board represents ten yards).
a. Keep track of the total number of yards for both players in the table below.
b. This means, for instance, that if the first person spins 10, then the first data point would be (1, 10). If the second person spins 20, the next data point would be (2, 30), since it's the TOTAL yards.
4. Each time a player gets to his/her opponent's goal, s/he scores one point.
Collecting Data: Collect the following data and stop after 20 turns (each person will have 10 kicks of the ball/spins).
Turn:                 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20
Total yards traveled:
Keep track of your scores!
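The running-total rule in direction 3b can be sketched in Python as a cumulative sum. The two spins below are the ones from the example in the directions; the variable names are illustrative, and a full game would have 20 spins.

```python
from itertools import accumulate

# Spins from the example in the directions: first player spins 10, second spins 20
spins = [10, 20]

# Running total of yards after each turn
totals = list(accumulate(spins))

# Pair each turn number (starting at 1) with the total yards so far
points = list(enumerate(totals, start=1))
print(points)   # [(1, 10), (2, 30)]
```

Because each y-value is a total that includes all earlier spins, the points can only stay level or rise from turn to turn, which is worth keeping in mind when you describe the direction of the scatterplot.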
Now let's do some statistics:
1. Enter the Turn and Total yards traveled into StatCrunch.
2. Which variable is the explanatory variable and which variable is the response variable when investigating the relationship between turns and total yards? Why did you choose this way?
3. Construct a scatterplot for the data. Graph -> Scatter Plot -> For x variable, select the explanatory variable. For y variable, select the response variable. Click Compute. Copy the graph below:
4. How can we describe the nature of the relationship between Turn and Total yards traveled from the scatterplot? (Think form, direction, strength.) Do you notice any outliers or deviations from the general pattern?
5. Compute the correlation for Turn and Total yards traveled. Stat -> Summary Stats -> Correlation. Choose the two variable names for which you want to calculate the correlation. Click Compute.
6. What does the correlation you found say about the nature of their relationship? (Think about what the correlation measures.) Is it a strong or weak correlation?
7. Find the Least Squares Regression Line (LSRL) by going to Stat -> Regression -> Simple Linear. For x variable, select the explanatory variable. For y variable, select the response variable. Click Compute. Report the least squares regression line. (You don't need to give me the rest of the output, just the estimated regression equation.)
8. What is the slope, including the units? Then write a statement interpreting the slope.
9. What is the y-intercept, including the units? Then write a statement interpreting the y-intercept.
StatCrunch CW/HW: Linear Regression Refresher
Your lab report should include a well-written response to each of the following questions and all relevant supporting graphs and analyses performed using StatCrunch. Submit your assignment through CANVAS by uploading it as a document (either in Word format or as a PDF). Remember to put your name on the document itself.
Linear Regression: Open the Amount in Savings ($) data set in the StatCrunch group. The savings account was opened in 1990. The ordered pairs give the number of years since 1990 and the amount of money in the savings account.
1. What is the response variable and what is the explanatory variable in this model?
- Response variable:
- Explanatory variable:
2. Explain why you chose the way you did for Number 1.
3. Construct a scatterplot. Post the scatterplot below.
4. How do you describe the relationship that you observe in the scatterplot?
form:
direction:
strength:
5. Find the least-squares regression line that describes the linear relationship.
6. What is the slope, with units? Interpret the slope using a complete sentence.
7. What is the y-intercept, with units? Interpret the y-intercept using a complete sentence.
8. Calculate the correlation. What does the correlation say about the nature of the relationship between the two variables we are looking at?
9. How much variability in the amount in savings does the number of years since 1990 explain? What number are you using for your answer?
Residual Plots:
10. Construct a residual plot.
a. Go to Stat > Regression > Simple Linear. Select the appropriate explanatory and response variables.
b. On the editing page, scroll down to Graphs. Under Graphs, scroll down and highlight Residuals vs. X-values. Click Compute.
c. Which of the graphs (a)-(f) above does your residual plot look like? Notice that the red line must be at y = 0. Post the graph below:
d. If a linear model is a good fit, the residual plot will look like (a), with points scattered everywhere! Do you think that a linear model is a good fit for our data? Why or why not?
11. Now construct a histogram of the residuals under the same menu. Post the graph below.
12. If a linear model is a good fit, the histogram of the residuals should be approximately normal and centered around zero. Do you think a linear model is a good fit for our data? Why or why not?
13. What is the standard error Se? To find it, go to Stat > Regression > Simple Linear and select the appropriate explanatory and response variables. The output should include an "Estimate of error standard deviation." This is the Se.
14. Se represents the average distance that the observed values fall from the regression line. Write a statement interpreting Se in the context of our data.