Lab 5a Exploring Correlation

The correlation coefficient measures how tightly the points on a scatterplot cluster around a line. In this lab we will examine scatterplots and correlation coefficients for many pairs of variables. We will look at data from the EPA evaluation of the fuel economy of the 2013 model year cars (see www.fueleconomy.gov), and from a Statistics class survey.

AT THE COMPUTER

In this lab, we begin by gaining some practice at judging the values of correlations by looking at the scatterplots. We will begin by looking at a data set that deals with cars and gas mileage. The data set Cars2013 lists characteristics of car models from the 2013 model year. With the recent increases in fuel cost many people are concerned with fuel mileage. Let's study the relationship of mileage with other variables. Correlations and scatterplots can help us understand relationships between variables for these cars.

1. We begin by examining engine displacement (liters) and the city fuel mileage. Engine displacement indicates the size of a vehicle's engine. In general, large or high performance vehicles have larger engines. Remember that in a positive correlation, as one variable increases the other also increases. In a negative correlation, as one variable increases the other decreases. Do you think that the engine displacement of the car and the fuel mileage (miles per gallon) would be positively related, negatively related, or near zero?

2. Next, think about the amount of luggage space (cubic feet) a car has. What type of relationship do you feel this variable would have with fuel mileage?

3. Next let's consider the relationship between city gas mileage and highway gas mileage. What type of relationship do you feel these variables would have?

4. Finally consider the relationship between the amount of passenger space of the car and the amount of luggage space. What type of relationship do you feel these variables would have?
Software Tip: Creating Scatterplots
In Data Desk select the response variable of interest (place the Y on this variable). To select the independent variable, hold the shift key while selecting the variable (an X will be placed on the variable). Then choose Scatterplot under the Plot menu.
In CrunchIt click Graphics>Scatterplot. Choose the variables of interest and put them in the Y and X boxes. If you'd like more help, watch the CrunchIt help video on Correlation.

Now let's see what the actual data indicate for these variables by making scatterplots of each pair of variables. Open the Cars2013 data file and make scatterplots of the pairs of variables we previously discussed.

5. How did your predictions compare with the actual scatterplots? Did you predict any positive correlations to be negative or vice versa? Mention any differences here.

6. Examine the scatterplots you have created.

a. Which of the correlations appears to be the strongest? Remember that a strong correlation is one in which the points are tightly packed near a straight line.

b. By looking at the scatterplots, what correlation would you expect for each pair of variables? Make a guess rounded to one decimal place along with a direction (positive or negative). Write your guess in the appropriate space below.

Variables                              My correlation guess    Actual correlation
displacement and mpg:city              ____________________    __________________
space:luggage and mpg:city             ____________________    __________________
mpg:city and mpg:highway               ____________________    __________________
space:luggage and space:passenger      ____________________    __________________
c. Now calculate the actual correlation using software, and record those correlations in the table provided. Which of your guesses was off by the most?

Software Tip: Calculating Correlation
To calculate correlations in Data Desk use the hyperview triangle in the upper-left corner of the scatterplot you created for that pair of variables. Choose the Correlation option.
In CrunchIt click Statistics>Correlation. Click to choose the variables of interest.

It is good practice to look at the scatterplot before calculating the correlation coefficient, in order to see whether the correlation is an appropriate measure of the strength of the association. For example, you should look for evidence about whether the pattern of association between the two variables is linear, and for possible explanations of any outliers.

7a. Does there seem to be a nonlinear relationship between any of the pairs of variables you examined? Which ones?

7b. Look at the scatterplot of city mileage versus highway mileage for the cars in the data set. Try color-coding the points using some of the other variables like Drive Type and whether the car is a gas-electric hybrid. Explain what you learn from each picture.

To add a color code: click on the variable that codes the colors and then use Modify>Colors>Add>by Group (in Data Desk) or use the Group by option in the CrunchIt Scatterplot dialog box.
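If you want to see what the software is doing when it reports a correlation, the calculation can be sketched directly from the usual formula for the correlation coefficient. The Python below is only an illustration and is not part of Data Desk or CrunchIt. The function name pearson_r and the five data values are made up for this sketch and are not taken from the Cars2013 file.

```python
import math


def pearson_r(x, y):
    """Return the correlation coefficient of two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products and squares of deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values for illustration only (NOT from the Cars2013 file)
engine_liters = [1.6, 2.0, 2.5, 3.5, 5.0]
city_mpg = [30, 27, 24, 19, 15]

r = pearson_r(engine_liters, city_mpg)
print(round(r, 2))
```

As with the lab software, it is worth plotting the data first: this function returns a number for any two non-constant lists, whether or not the pattern is linear.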
How does changing the unit of measurement change the correlation between variables? We can explore this by converting the engine displacement to a different unit and re-examining its relationship with the mileage of the car. For the last decade engine displacement has been given in liters, but previously most American cars listed their engine displacement in cubic inches. How do the correlations change when we convert liters to cubic inches? We can find out by calculating a new variable that multiplies engine size by 61 (there are approximately 61 cubic inches in a liter).

Software Tip: Creating a New Variable
In Data Desk, to create a new variable click Manip>Transform>New Derived Variable. Give the variable a name of your choice and click OK. A window will appear in which you should type the formula for the new variable. In Data Desk be sure to put the variable name in single quotes. For example: 'displacement'*61.
In CrunchIt click Insert>Evaluate Formula. In the formula box, type the name of the variable and the calculation you want. In CrunchIt put the name of the variable in square brackets, for example: [Displacement]*61. The new variable will be inserted in the next column of the worksheet.

8. Create a scatterplot of your new variable and the mileage variable. Examine this scatterplot and the scatterplot of displacement and city mileage you made earlier. How does the pattern of this scatterplot compare with the previous scatterplot of these variables?

9. Calculate the correlation between these variables. How does this compare with the correlation you found between these variables previously? Explain.
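The derived-variable step can also be tried outside the lab software. The Python below is a rough sketch under stated assumptions: the data values are invented rather than taken from Cars2013, the helper name correlation_from_z_scores is hypothetical, and the correlation is computed as the sum of products of standardized values divided by n - 1. Run it and compare the two printed values when you answer question 9.

```python
from statistics import mean, stdev


def correlation_from_z_scores(x, y):
    """Correlation as the sum of products of standardized values, divided by n - 1."""
    mean_x, mean_y = mean(x), mean(y)
    sd_x, sd_y = stdev(x), stdev(y)  # sample standard deviations
    products = [
        ((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
        for a, b in zip(x, y)
    ]
    return sum(products) / (len(x) - 1)


# Hypothetical values for illustration only (NOT from the Cars2013 file)
displacement_liters = [1.6, 2.0, 2.5, 3.5, 5.0]
city_mpg = [30, 27, 24, 19, 15]

# Derived variable: liters times 61 gives (approximately) cubic inches
displacement_cubic_inches = [liters * 61 for liters in displacement_liters]

r_liters = correlation_from_z_scores(displacement_liters, city_mpg)
r_cubic_inches = correlation_from_z_scores(displacement_cubic_inches, city_mpg)

print("liters vs. city mpg:      ", round(r_liters, 4))
print("cubic inches vs. city mpg:", round(r_cubic_inches, 4))
```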
Now let's switch to another data set that deals with data from a survey completed by students in your class. Open the data file called Class_Survey. We'll look at the variables height (the student's height in inches), mate height (the height of the student's ideal mate), year (of birth), age (in years), HS GPA (high school grade point average on a four-point scale), and OSU GPA (grade point average at Ohio State).

10. Before looking at the data, make a guess at the size of the correlation for the pairs of variables listed below. Record your guess in the table below. Next make a plot of each pair of variables from the Class_survey1 data file. Look at each plot and try to guess the value of the correlation. Record your guess in the table. Finally, use the software to find the actual value of the correlation between each pair of variables and record that value.

Variables                 My guess at the correlation    My guess after looking at the plot    Actual correlation
height and mate height    ___________________________    __________________________________    __________________
year and age              ___________________________    __________________________________    __________________
HS GPA and OSU GPA        ___________________________    __________________________________    __________________

11. Which of the three correlations in the previous question was the most difficult for you to guess? How did the three correlations differ from your expectations with respect to direction and/or strength?

12. Many people are surprised at the direction of the correlation between the student's height and the height of their ideal mate in this survey. Think of an explanation for this paradox and use the software to investigate your explanation. Show (sketch or cut-and-paste) the results below that you used to test your explanation.