Chapter 3: Investigating associations between two variables
Further Mathematics 2018
CORE: Data analysis

Extract from Study Design

Key knowledge
- correlation coefficient, r, its interpretation, the issue of correlation and cause and effect

Key skills
- construct scatterplots and use them to identify and describe associations between two numerical variables
- calculate the correlation coefficient, r, and interpret it in the context of the data
- answer statistical questions that require a knowledge of the associations between pairs of variables
- determine the equation of the least squares line giving the coefficients correct to a required number of decimal places or significant figures as specified
- distinguish between correlation and causation

Chapter Sections: Questions to be completed
3A Response and explanatory variables: 1, 2
3B Investigating associations between categorical variables: 1, 2, 3, 4
3C Investigating the associations between a numerical and a categorical variable: 1, 2, 3
3D Investigating associations between two numerical variables: 1, 2, 3, 4, 5
3E How to interpret a scatterplot: 1, 2, 3
3F Calculating the correlation coefficient: 1, 3
3G The coefficient of determination: 1, 2, 3
3H Correlation and causality: 1, 2, 3, 4, 5, 6, 7
3I Which graph?: 1, 2
Chapter 3 Review: All questions
Table of Contents

Chapter 3 Investigating associations between two variables
  Extract from Study Design
    Key knowledge
    Key skills
3A. Response and explanatory variables
  Identifying response and explanatory variables
  Example 1: Identifying the response and explanatory variables
  Example 2: Identifying the response and explanatory variables
3B. Investigating associations between categorical variables
  Using a two-way frequency table to investigate an association
  Example 3: Identifying & describing associations between 2 categorical variables
  Example 4: Identifying & describing associations between 2 categorical variables (no association)
  Example 5
3C. Investigating the association between a numerical and a categorical variable
  Using parallel box plots to identify and describe associations
  Example 6: Using parallel dot plots to identify and describe associations
  Example 7: Using a back-to-back stem plot to identify and describe associations
3D. Investigating associations between two numerical variables
  How to construct a scatterplot (CAS calculator)
3E. How to interpret a scatterplot
  Direction and outliers
  Example 8: Direction of association
  Form
  Example 9: Form of an association
  Strength of a linear relationship: the correlation coefficient
  Guidelines for classifying strength of a linear association
3F. Calculating the correlation coefficient
  Calculating r using (CAS calculator)
3G. The coefficient of determination
  Calculating the coefficient of determination
  Interpreting the coefficient of determination
  Example 10
3H. Correlation and causality
  Non-causal explanations
  Example: Ice cream sales
  Correlation does not imply causality
  Video
  Example: Causation
3I. Which graph?
3A. Response and explanatory variables

Bivariate data is data in which two variables are recorded for each individual so that the association between the two variables can be studied.

Identifying response and explanatory variables

- The response variable is dependent on the explanatory variable. It is located on the y-axis.
- The explanatory variable is used to explain the changes that might be observed in the response variable. It is located on the x-axis.

Example 1: Identifying the response and explanatory variables
We wish to investigate the question, "Does the time it takes a student to get to school depend on their mode of transport?" The variables here are time and mode of transport. Which is the response variable (RV) and which is the explanatory variable (EV)?

Example 2: Identifying the response and explanatory variables
Can we predict people's height (in cm) from their wrist measurement? The variables in this investigation are height and wrist measurement. Which is the response variable (RV) and which is the explanatory variable (EV)?

The roles of the variables depend on the question being asked. If the question were instead "Can we predict people's wrist measurement from their height?", height would be the explanatory variable and wrist measurement would be the response variable.

Note: The explanatory variable is sometimes called the independent variable (IV) and the response variable the dependent variable (DV).
3B. Investigating associations between categorical variables

If two variables are related or linked in some way, we say they are associated.

Using a two-way frequency table to investigate an association

A two-way frequency table is used to investigate the association between two categorical variables.
- The response variable defines the rows.
- The explanatory variable defines the columns.

Example 3: Identifying & describing associations between 2 categorical variables
A survey was conducted with 100 people. As part of this survey, people were asked whether or not they supported banning mobile phones in cinemas. The results are summarised in the table. Is there an association between support for banning mobile phones in cinemas and the sex of the respondent? Write a brief response quoting appropriate percentages.

Example 4: Identifying & describing associations between 2 categorical variables (no association)
In the same survey people were asked whether or not they supported Sunday racing. The results are summarised in the table. Is there an association between support for Sunday racing and the sex of the respondent? Write a brief response quoting appropriate percentages.

Note: As a rule of thumb, a difference of at least 5% would be required to classify a difference as significant.
Example 5 [1]
A survey was conducted with 1000 males under 50 years old. As part of this survey, they were asked to rate their interest in sport as high, medium or low. Their age group was also recorded as under 18, 19-25, 26-35 or 36-50. The results are displayed in the table.
a) Which is the explanatory variable, interest in sport or age group?
b) Is there an association between interest in sport and age group? Write a brief response quoting appropriate percentages.

[1] https://seniormaths.cambridge.edu.au/lessonsection/lesson.action#/resources/video/112756/
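The percentaging step used in Examples 3 to 5 can be sketched in Python. This is only an illustration: the counts below are hypothetical (they are not the survey results, which are not reproduced here), and the function name `column_percentages` is an assumption made for this sketch. Each column (explanatory variable) is converted to percentages of its own total, so that groups of different sizes can be compared fairly.

```python
# Hypothetical counts, invented for illustration only (not the survey data).
# Outer keys are categories of the explanatory variable (the columns);
# inner keys are categories of the response variable (the rows).
counts = {
    "Male": {"Yes": 30, "No": 20},
    "Female": {"Yes": 40, "No": 10},
}


def column_percentages(table):
    """Convert each column of counts to percentages of that column's total."""
    result = {}
    for group, responses in table.items():
        total = sum(responses.values())
        result[group] = {
            answer: round(100 * n / total, 1) for answer, n in responses.items()
        }
    return result


percentages = column_percentages(counts)
for group, responses in percentages.items():
    print(group, responses)

# Compare the same response category across the groups.
difference = abs(percentages["Male"]["Yes"] - percentages["Female"]["Yes"])
print("Difference in 'Yes' percentages:", difference)
```

With these made-up counts the percentages answering "Yes" differ substantially between the two groups, which is the kind of evidence a written response would quote when claiming an association.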
3C. Investigating the association between a numerical and a categorical variable

Using parallel box plots to identify and describe associations

Parallel box plots are used to display the relationship between a numerical variable and a categorical variable, e.g. salary (numerical) vs age group (categorical).

Box plots are compared by describing how the distribution changes between categories in terms of shape, centre and spread. If there is no association between the variables, the distributions will be similar for all groups.

- Comparing medians
- Comparing IQRs and/or ranges
- Comparing shapes

Note: Any one of these comparisons on its own can be used to claim an association between salary and age. However, using all three gives a more complete description of the relationship.
Example 6: Using parallel dot plots to identify and describe associations
The parallel dot plot below displays the distribution of the number of sit-ups performed by 15 people before and after they had completed a gym program. Do the parallel dot plots support the contention that the number of sit-ups performed is associated with completing the gym program? Write a brief explanation that compares medians.

Note: Because it is often difficult to clearly identify the shape of a distribution with a small amount of data, we usually confine ourselves to comparing medians when using dot plots and back-to-back stem plots.

Example 7: Using a back-to-back stem plot to identify and describe associations
The back-to-back stem plot shows the distribution of life expectancy (in years) for 13 countries in 2010 and 1970. Do the back-to-back stem plots support the contention that life expectancy is increasing over time? Write a brief explanation based on your comparisons of the two medians.
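The median comparison asked for in Examples 6 and 7 can be sketched as follows. The sit-up counts are hypothetical, invented for this sketch because the dot plot itself is not reproduced here; only the method (find the median of each group, then compare) is taken from the text.

```python
from statistics import median

# Hypothetical sit-up counts for 15 people, invented for illustration only.
before = [21, 23, 24, 25, 26, 26, 27, 27, 28, 29, 30, 31, 32, 33, 35]
after = [25, 27, 28, 29, 30, 31, 32, 32, 33, 34, 35, 36, 37, 38, 40]

# With 15 values, the median is the 8th value in the ordered list.
median_before = median(before)
median_after = median(after)

print("Median before:", median_before)
print("Median after:", median_after)
print("Change in median:", median_after - median_before)
```

A clear difference between the two medians would support the contention that the number of sit-ups is associated with completing the program; similar medians would not.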
3D. Investigating associations between two numerical variables

A scatterplot is used to compare two numerical variables.
- The response variable is on the y-axis.
- The explanatory variable is on the x-axis.

How to construct a scatterplot (CAS calculator)
3E. How to interpret a scatterplot

When looking at a scatterplot, the first thing to do is to decide whether there is a clear pattern.

- In the example opposite, there is no clear pattern in the points. The points are just scattered randomly across the plot. Conclude that there is no association.
- For the three examples opposite, there is a clear (but different) pattern in each set of points. Conclude that there is an association.

Having found a clear pattern, there are several things we look for in the pattern of points:
- Direction and outliers (if any)
- Form
- Strength

Direction and outliers

[Scatterplot: height v. age for footballers] There is a random scatter of points, so there is no association between the variables height and age for this group of footballers. There is an outlier: the player who is 201 cm tall.

[Scatterplot: weight v. height for footballers] There is a clear pattern in the scatterplot of weight v. height for footballers. The two variables are associated. The points go upwards as you move to the right. This is a positive association between the variables. Tall players tend to be heavy and vice versa. In this scatterplot there are no outliers.

[Scatterplot: working hours v. university participation rates] The scatterplot of working hours against university participation rates for 15 countries shows a clear pattern. The two variables are associated. In this case the points go downwards as you move to the right. There is a negative association between the variables. Countries with high working hours tend to have low university participation rates and vice versa. There are no outliers.

Example 8: Direction of association
Classify each of the following scatterplots as exhibiting positive, negative or no association. Where there is an association, describe the direction of the association in terms of the variables in the scatterplot and what it means in terms of the variables involved.
Form

Here we look for a pattern in the points that has a linear form. If the points in the scatterplot appear to be random fluctuations around a straight line, the scatterplot is said to have a linear form.

Example 9: Form of an association
Classify the form of the association in each of the scatterplots as linear or non-linear.
Strength of a linear relationship: the correlation coefficient

The strength of a linear association is an indication of how closely the points in the scatterplot fit a straight line. The correlation coefficient, r, measures the strength of a linear relationship. It takes a value between -1 and +1, and its sign gives the direction of the association.

Guidelines for classifying strength of a linear association
3F. Calculating the correlation coefficient

Calculating r using (CAS calculator)
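As a cross-check on the calculator, the correlation coefficient can also be computed directly from its definition. The sketch below uses the standard Pearson formula, r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²). The function name `correlation_coefficient` and the height and weight values are assumptions made for illustration; they do not come from the course data.

```python
from math import sqrt


def correlation_coefficient(xs, ys):
    """Pearson's r for paired numerical data of equal length."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data, invented for illustration only.
heights = [160, 165, 170, 175, 180]  # explanatory variable (cm)
weights = [55, 62, 64, 72, 77]       # response variable (kg)

r = correlation_coefficient(heights, weights)
print("r =", round(r, 4))
```

Because r measures the strength of a linear relationship only, it should be calculated only after a scatterplot has confirmed that the form of the association is linear.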
3G. The coefficient of determination

Calculating the coefficient of determination

Numerically, the coefficient of determination = $r^2$. Thus, if the correlation between weight and height is $r = 0.8$, then the coefficient of determination = $r^2 = 0.8^2 = 0.64$, or $0.64 \times 100\% = 64\%$.

Interpreting the coefficient of determination

The coefficient of determination (as a percentage) tells us the percentage of the variation in the response variable that is explained by the variation in the explanatory variable.

Let's look at the relationship between weight and height. The coefficient of determination is 0.64 or 64%. The coefficient of determination tells us that 64% of the variation in people's weight is explained by the variation in their height.

Example 10
For the relationship described by this scatterplot, the coefficient of determination = 0.5210. Determine the value of the correlation coefficient, r.

Example 11
Carbon monoxide (CO) levels in the air and traffic volume are linearly related, with:
$r_{\text{CO level, traffic volume}} = +0.985$
Determine the value of the coefficient of determination, write it in percentage terms and interpret. In this relationship, traffic volume is the explanatory variable.

Example 12
Scores on tests of verbal and mathematical ability are linearly related, with:
$r_{\text{mathematical, verbal}} = +0.275$
Determine the value of the coefficient of determination, write it in percentage terms, and interpret. In this relationship, verbal ability is the explanatory variable.
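The arithmetic in this section can be checked with a short Python sketch. It squares the values of r quoted in the text and takes the square root of the coefficient of determination from Example 10. The function name `coefficient_of_determination` is an assumption made for this sketch. Note that the square root gives only the size of r; its sign has to be read from the direction of the scatterplot, which is not reproduced here.

```python
from math import sqrt


def coefficient_of_determination(r):
    """Return r squared for a given correlation coefficient r."""
    return r ** 2


# Values of r quoted in the text.
examples = [
    ("weight and height", 0.8),
    ("CO level and traffic volume", 0.985),
    ("mathematical and verbal ability", 0.275),
]

for label, r in examples:
    r_squared = coefficient_of_determination(r)
    print(f"{label}: r = {r}, r^2 = {r_squared:.4f} ({100 * r_squared:.1f}%)")

# Example 10 works in the other direction: r^2 is known and r is required.
# Only the magnitude can be recovered from r^2; the sign comes from the plot.
r_magnitude = sqrt(0.5210)
print("Magnitude of r in Example 10:", round(r_magnitude, 4))
```

The interpretation still has to be written in context, naming the response and explanatory variables, in the same pattern as the weight and height sentence above.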
3H. Correlation and causality

There is a strong relationship between the number of Nobel prizes a country has won and the number of IKEA stores in that country (r = 0.82). The scatterplot below shows the association between the two variables. Does this mean that one way to increase the number of Australian Nobel prize winners is to build more IKEA stores?

Identifying that there is a high degree of correlation between two variables may be interesting and can often flag the need for further, more detailed investigation. However, it in no way gives us any basis to comment on whether or not one variable causes particular values in another variable.

Correlation is a statistical measure that describes the size and direction of the relationship between two variables.

Causation states that one event is the result of the occurrence of the other event (or variable). This is also referred to as cause and effect, where one event is the cause and this makes another event happen, this being the effect. An example of a cause and effect relationship could be an alarm going off (cause, happens first) and a person waking up (effect, happens later).

It is also important to realise that a high correlation does not imply causation. For example, smoking could have a high correlation with alcoholism, but it is not necessarily the cause of it. Correlation and causation are therefore different things.

Testing causality

One way to test for causality is experimentally, where a controlled study is the most effective. This involves splitting the sample or population and making one part a control group (e.g. one group gets a placebo and the other gets some form of medication).

Another way is via an observational study, which also compares against a control variable, but in which the researcher has no control over the experiment (e.g. comparing smokers and non-smokers who develop lung cancer). The researcher has no control over whether the people in the study develop lung cancer or not.
Non-causal explanations

Although we may observe a strong correlation between two variables, this does not necessarily mean that a causal relationship exists between them. In some cases the correlation between two variables can be explained by a common response: both variables respond to a third variable, and it is this third variable that produces the association.

Example: Ice cream sales
These graphs show a correlation between ice cream sales and (from left to right) drowning deaths, forest fires and shark attacks. It may appear that to save lives we should ban ice cream, but clearly there is another reason. In this case, the reason is that all these things occur in summer, so it is temperature which is the cause.

Correlation does not imply causality

Video
To help you with this concept, you should watch the video The Question of Causation, which can be accessed through the link below. It is well worth 15 minutes of your time.
http://cambridge.edu.au/redirect/?id=6103

Example: Causation
Return to the scatterplot of the number of Nobel prizes a country has won and the number of IKEA stores in that country. Comment on the relationship.

The ______ between the number of ______ a country has and the number of ______ in that country does ______ imply that the ______ of the ______ is the ______ of ______.
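The common response idea from the ice cream example can be illustrated with a small simulation. Everything in this sketch is an assumption made for illustration: the numbers are randomly generated, not real data, and the names `ice_cream_sales` and `drowning_index` are hypothetical. Temperature is built in as the only thing that drives both variables, yet the two variables still end up strongly correlated with each other.

```python
import random
from math import sqrt


def correlation_coefficient(xs, ys):
    """Pearson's r for paired numerical data of equal length."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


random.seed(1)  # fixed seed so the simulated data are reproducible
days = 200

# Simulated daily temperature: the common cause in this made-up model.
temperature = [random.uniform(10, 35) for _ in range(days)]

# Random day-to-day variation that has nothing to do with temperature.
sales_noise = [random.gauss(0, 30) for _ in range(days)]
drowning_noise = [random.gauss(0, 1.5) for _ in range(days)]

# Both variables depend on temperature, but neither depends on the other.
ice_cream_sales = [20 * t + e for t, e in zip(temperature, sales_noise)]
drowning_index = [0.5 * t + e for t, e in zip(temperature, drowning_noise)]

r_observed = correlation_coefficient(ice_cream_sales, drowning_index)

# Remove the temperature effect: only the unrelated noise terms remain.
r_without_temperature = correlation_coefficient(sales_noise, drowning_noise)

print("r, ice cream sales v. drowning index:", round(r_observed, 2))
print("r, after removing temperature effect:", round(r_without_temperature, 2))
```

In this model the strong correlation between the two variables comes entirely from temperature. Once the temperature effect is removed, the remaining variation in the two variables is unrelated, so banning ice cream would do nothing to the drowning figures.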
3I. Which graph?

The following guidelines may assist in deciding which is the most appropriate graph to display data.