Pearson Education Limited
Edinburgh Gate
Harlow
Essex CM20 2JE
England
and Associated Companies throughout the world

Visit us on the World Wide Web at: www.pearsoned.co.uk

© Pearson Education Limited 2014

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without either the prior written permission of the publisher or a licence permitting restricted copying in the United Kingdom issued by the Copyright Licensing Agency Ltd, Saffron House, 6–10 Kirby Street, London EC1N 8TS.

All trademarks used herein are the property of their respective owners. The use of any trademark in this text does not vest in the author or publisher any trademark ownership rights in such trademarks, nor does the use of such trademarks imply any affiliation with or endorsement of this book by such owners.

ISBN 10: 1-292-02395-3
ISBN 13: 978-1-292-02395-3

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

Printed in the United States of America
4 ASSESS YOUR UNDERSTANDING

VOCABULARY AND SKILL BUILDING

1. What is meant by a marginal distribution? What is meant by a conditional distribution?
2. Refer to Table 9. Is constructing a conditional distribution by level of education different from constructing a conditional distribution by employment status? If they are different, explain the difference.
3. Explain why we use the term association rather than correlation when describing the relation between two variables in this section.
4. Explain the idea behind Simpson's Paradox.

In Problems 5 and 6,
(a) Construct a frequency marginal distribution.
(c) Construct a conditional distribution by x.
(d) Draw a bar graph of the conditional distribution found in part (c).

5.
          x1    x2    x3
    y1    20    25    30
    y2    30    25    50

6.
          x1    x2    x3
    y1    35    25    20
    y2    65    75    80

APPLYING THE CONCEPTS

7. Made in America In a recent Harris Poll, a random sample of adult Americans (18 years and older) was asked, "When you see an ad emphasizing that a product is 'Made in America,' are you more likely to buy it, less likely to buy it, or neither more nor less likely to buy it?" The results of the survey, by age group, are presented in the contingency table below.

                                   18–34   35–44   45–54    55+    Total
    More likely                      238     329     360    402     1329
    Less likely                       22       6      22     16       66
    Neither more nor less likely     282     201     164    118      765
    Total                            542     536     546    536     2160
    Source: The Harris Poll

(a) How many adult Americans were surveyed? How many were 55 and older?
(c) What proportion of Americans are more likely to buy a product when the ad says "Made in America"?
(d) Construct a conditional distribution of likelihood to buy "Made in America" by age. That is, construct a conditional distribution treating age as the explanatory variable.
(e) Draw a bar graph of the conditional distribution found in part (d).
(f) Write a couple of sentences explaining any relation between likelihood to buy and age.

8. Desirability Traits In a recent Harris Poll, a random sample of adult Americans (18 years and older) was asked, "Given a choice of the following, which one would you most want to be?" Results of the survey, by gender, are given in the contingency table.

              Richer   Thinner   Smarter   Younger   None of these   Total
    Male         520       158       159       181             102    1120
    Female       425       300       144        81              92    1042
    Total        945       458       303       262             194    2162
    Source: The Harris Poll

(a) How many adult Americans were surveyed? How many males were surveyed?
(c) What proportion of adult Americans want to be richer?
(d) Construct a conditional distribution of desired trait by gender. That is, construct a conditional distribution treating gender as the explanatory variable.
(e) Draw a bar graph of the conditional distribution found in part (d).
(f) Write a couple of sentences explaining any relation between desired trait and gender.

9. Party Affiliation Is there an association between party affiliation and gender? The following data represent the gender and party affiliation of registered voters based on a random sample of 802 adults.

                   Female   Male
    Republican        105    115
    Democrat          150    103
    Independent       150    179
    Source: Star Tribune Minnesota Poll

(a) Construct a frequency marginal distribution.
(c) What proportion of registered voters consider themselves to be Independent?
(d) Construct a conditional distribution of party affiliation by gender.
(e) Draw a bar graph of the conditional distribution found in part (d).
(f) Is gender associated with party affiliation? If so, how?

10. Feelings on Abortion The Pew Research Center for the People and the Press conducted a poll in which it asked about the availability of abortion. The table is based on the results of the survey.

                                    High School or Less   Some College   College Graduate
    Generally available                              90             72                113
    Allowed, but more limited                        51             60                 77
    Illegal, with few exceptions                    125             94                 69
    Never permitted                                  51             14                 17
    Source: Pew Research Center for the People and the Press

(a) Construct a frequency marginal distribution.
(c) What proportion of college graduates feel that abortion should never be permitted?
(d) Construct a conditional distribution of people's feelings about the availability of abortion by level of education.
(e) Draw a bar graph of the conditional distribution found in part (d).
(f) Is level of education associated with opinion on the availability of abortion? If so, how?

11. Health and Happiness The General Social Survey asks questions about one's happiness and health. One would think that health plays a role in one's happiness. Use the data in the table to determine whether healthier people tend to also be happier. Treat level of health as the explanatory variable.

                      Poor     Fair     Good   Excellent    Total
    Not too happy      696    1,386    1,629         732    4,443
    Pretty happy       950    3,817    9,642       5,195   19,604
    Very happy         350    1,382    4,520       5,095   11,347
    Total            1,996    6,585   15,791      11,022   35,394
    Source: General Social Survey

12. Happy in Your Marriage? The General Social Survey asks questions about one's happiness in marriage. Is there an association between gender and happiness in marriage? Use the data in the table to determine if gender is associated with happiness in marriage. Treat gender as the explanatory variable.

                       Male   Female    Total
    Very happy        7,609    7,942   15,551
    Pretty happy      3,738    4,447    8,185
    Not too happy       259      460      719
    Total            11,606   12,849   24,455
    Source: General Social Survey

13. Smoking Is Healthy? Could it be that smoking actually increases survival rates among women? The following data represent the 20-year survival status and smoking status of 1314 English women who participated in a cohort study from 1972 to 1992.

              Smoker (S)   Nonsmoker (NS)   Total
    Dead             139              230     369
    Alive            443              502     945
    Total            582              732    1314
    Source: David R. Appleton et al. "Ignoring a Covariate: An Example of Simpson's Paradox." American Statistician 50(4), 1996

(a) What proportion of the smokers was dead after 20 years? What proportion of the nonsmokers was dead after 20 years? What does this imply about the health consequences of smoking?

The data in the table above do not take into account a variable that is strongly related to survival status: age.
The data shown next give the survival status of women and their age at the beginning of the study. For example, 14 women who were 35 to 44 at the beginning of the study were smokers and dead after 20 years.

                     Smoker (S)       Nonsmoker (NS)
    Age Group       Dead   Alive       Dead   Alive
    18–24              2      53          1      61
    25–34              3     121          5     152
    35–44             14      95          7     114
    45–54             27     103         12      66
    55–64             51      64         40      81
    65–74             29       7        101      28
    75 or older       13       0         64       0

(b) Determine the proportion of 18- to 24-year-old smokers who were dead after 20 years. Determine the proportion of 18- to 24-year-old nonsmokers who were dead after 20 years.
(c) Repeat part (b) for the remaining age groups to create a conditional distribution of survival status by smoking status for each age group.
(d) Draw a bar graph of the conditional distribution from part (c).
(e) Write a short report detailing your findings.
14. Treating Kidney Stones Researchers conducted a study to determine which of two treatments, A or B, is more effective in the treatment of kidney stones. The results of their experiment are given in the table.

                     Treatment A   Treatment B   Total
    Effective                273           289     562
    Not effective             77            61     138
    Total                    350           350     700
    Source: C. R. Charig, D. R. Webb, S. R. Payne, and O. E. Wickham. "Comparison of Treatment of Renal Calculi by Operative Surgery, Percutaneous Nephrolithotomy, and Extracorporeal Shock Wave Lithotripsy." British Medical Journal 292(6524): 879–882.

(a) Which treatment appears to be more effective? Why?

The data in the table above do not take into account the size of the kidney stone. The data shown next indicate the effectiveness of each treatment for both large and small kidney stones. The counts for treatment A on large stones are the ones implied by the totals above (273 - 81 = 192 effective and 77 - 6 = 71 not effective).

                     Small Stones      Large Stones
                       A       B         A       B
    Effective         81     234       192      55
    Not effective      6      36        71      25

(b) Determine the proportion of small kidney stones that were effectively dealt with using treatment A. Determine the proportion of small kidney stones that were effectively dealt with using treatment B.
(c) Repeat part (b) for the large stones to create a conditional distribution of effectiveness by treatment for each stone size.
(d) Draw a bar graph of the conditional distribution from part (c).
(e) Write a short report detailing your findings.

Technology Step-By-Step
Contingency Tables and Association

MINITAB
1. Enter the values of the row variable in column C1 and the corresponding values of the column variable in C2. The frequency for the cell is entered in C3. For example, the data in Table 9 would be entered with employment status in C1, level of education in C2, and the cell counts in C3.
2. Select the Stat menu and highlight Tables. Then select Descriptive Statistics...
3. In the cell "For Rows:" enter C1. In the cell "For Columns:" enter C2. In the cell "Frequencies" enter C3. Click the Options button and make sure the radio button for Display marginal statistics for "Rows and columns" is checked. Click OK. Click the Categorical Variables button and then select the summaries you desire. Click OK twice.

StatCrunch
1. Enter the contingency table into the spreadsheet. The first column should be the row variable. For example, for the data in Table 9, the first column would be employment status. Each subsequent column would be the counts of each category of the column variable. For the data in Table 9, enter the counts for each level of education. Title each column (including the first column indicating the row variable).
2. Select Stat, highlight Tables, select Contingency, then highlight with summary.
3. Select the column variables. Then select the label of the row variable. For example, the data in Table 9 has four column variables (Did Not Finish High School, and so on) and the row label is employment status. Click Next>.
4. Decide what values you want displayed. Typically, we choose row percent and column percent for this section. Click Calculate.
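
For readers without MINITAB or StatCrunch, the same marginal totals, row percents, and column percents can be sketched in plain Python. This is only an illustration: the counts and the labels r1, r2, c1, c2, c3 are made up (they are not Table 9), and the function names are hypothetical helpers rather than part of either package.

```python
# Hypothetical 2 x 3 contingency table (made-up counts, not Table 9).
# Each inner list is one category of the row variable.
row_labels = ["r1", "r2"]
col_labels = ["c1", "c2", "c3"]
counts = [
    [20, 50, 30],
    [30, 40, 30],
]


def marginal_totals(table):
    """Return (row totals, column totals, grand total) of a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return row_totals, col_totals, sum(row_totals)


def column_percents(table):
    """Conditional distribution of the row variable within each column."""
    _, col_totals, _ = marginal_totals(table)
    return [[cell / col_totals[j] for j, cell in enumerate(row)] for row in table]


def row_percents(table):
    """Conditional distribution of the column variable within each row."""
    return [[cell / sum(row) for cell in row] for row in table]


row_totals, col_totals, grand_total = marginal_totals(counts)
col_pct = column_percents(counts)
row_pct = row_percents(counts)

# Relative frequency marginal distribution of the column variable.
col_marginal = [total / grand_total for total in col_totals]

for label, row in zip(row_labels, col_pct):
    print(label, [round(p, 3) for p in row])
```

Column percents treat the column variable as the explanatory variable, which matches the "row percent and column percent" choice in StatCrunch step 4; each column of `col_pct` sums to 1.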
REVIEW

Summary

In this chapter we looked at describing the relation between two quantitative variables (Sections 1 to 3) and between two qualitative variables (Section 4).

The first step in describing the relation between two quantitative variables is to draw a scatter diagram. The explanatory variable is plotted on the horizontal axis and the corresponding response variable on the vertical axis. The scatter diagram can be used to discover whether the relation between the explanatory and the response variables is linear. In addition, for linear relations, we can judge whether the linear relation shows positive or negative association.

A numerical measure for the strength of linear relation between two quantitative variables is the linear correlation coefficient. It is a number between -1 and 1, inclusive. Values of the correlation coefficient near -1 are indicative of a negative linear relation between the two variables. Values of the correlation coefficient near +1 indicate a positive linear relation between the two variables. If the correlation coefficient is near 0, then little linear relation exists between the two variables.

Be careful! Just because the correlation coefficient between two quantitative variables indicates that the variables are linearly related, it does not mean that a change in one variable causes a change in a second variable. It could be that the correlation is the result of a lurking variable.

Once a linear relation between the two variables has been discovered, we describe the relation by finding the least-squares regression line. This line best describes the linear relation between the explanatory and response variables. We can use the least-squares regression line to predict a value of the response variable for a given value of the explanatory variable.

The coefficient of determination, $R^2$, measures the proportion of the variation in the response variable that is explained by the least-squares regression line. It is a measure between 0 and 1, inclusive. The closer $R^2$ is to 1, the more explanatory value the line has.

Whenever a least-squares regression line is obtained, certain diagnostics must be performed. These include verifying that the linear model is appropriate, verifying that the residuals have constant variance, and checking for outliers and influential observations.

Section 4 introduced methods that allow us to describe any association that might exist between two qualitative variables. This is done through contingency tables. Both marginal and conditional distributions allow us to describe the effect one variable might have on the other variable in the study. We also construct bar graphs to see the association between the two variables in the study. Again, just because two qualitative variables are associated does not mean that a change in one variable causes a change in a second variable. We also looked at Simpson's Paradox, which represents situations in which an association between two variables inverts or goes away when a third (lurking) variable is introduced into the analysis.
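
The reversal described by Simpson's Paradox can be checked numerically. The sketch below uses made-up counts (two hypothetical treatments, A and B, in two hypothetical groups) chosen only so that the reversal appears; none of the numbers or names come from the chapter's data.

```python
# Made-up counts: (successes, trials) for two treatments within two groups.
# The group plays the role of the lurking variable.
data = {
    "group 1": {"A": (9, 10), "B": (80, 100)},
    "group 2": {"A": (30, 100), "B": (2, 10)},
}


def success_rate(successes, trials):
    """Proportion of trials that were successes."""
    return successes / trials


# Conditional distributions: success rate of each treatment within each group.
within = {
    group: {t: success_rate(*cell) for t, cell in cells.items()}
    for group, cells in data.items()
}


def pooled_rate(treatment):
    """Success rate of a treatment after combining (ignoring) the groups."""
    successes = sum(cells[treatment][0] for cells in data.values())
    trials = sum(cells[treatment][1] for cells in data.values())
    return successes / trials


pooled = {t: pooled_rate(t) for t in ("A", "B")}

a_better_in_every_group = all(r["A"] > r["B"] for r in within.values())
b_better_when_pooled = pooled["B"] > pooled["A"]

print("within groups:", within)
print("pooled:", pooled)
print("reversal:", a_better_in_every_group and b_better_when_pooled)
```

In these invented counts A has the higher success rate inside each group, yet B looks better once the groups are combined, because most of A's trials fall in the group where success is rare for both treatments.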
Vocabulary

Bivariate data, Response variable, Explanatory variable, Predictor variable, Scatter diagram, Positively associated, Negatively associated, Linear correlation coefficient, Correlation matrix, Lurking variable, Residual, Least-squares regression line, Slope, y-intercept, Outside the scope of the model, Coefficient of determination, Deviation, Total deviation, Explained deviation, Unexplained deviation, Residual plot, Constant error variance, Outlier, Influential observation, Contingency (or two-way) table, Row variable, Column variable, Cell, Marginal distribution, Conditional distribution, Simpson's Paradox

Formulas

Correlation Coefficient

\[ r = \frac{\sum \left( \dfrac{x_i - \bar{x}}{s_x} \right) \left( \dfrac{y_i - \bar{y}}{s_y} \right)}{n - 1} \]

Equation of the Least-Squares Regression Line

\[ \hat{y} = b_1 x + b_0 \]

where
$\hat{y}$ is the predicted value of the response variable,
$b_1 = r \cdot \dfrac{s_y}{s_x}$ is the slope of the least-squares regression line, and
$b_0 = \bar{y} - b_1 \bar{x}$ is the y-intercept of the least-squares regression line.

Coefficient of Determination, $R^2$

\[ R^2 = \frac{\text{explained variation}}{\text{total variation}} = 1 - \frac{\text{unexplained variation}}{\text{total variation}} = r^2 \quad \text{for the least-squares regression model } \hat{y} = b_1 x + b_0 \]
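
As a check on how these formulas fit together, the following minimal sketch computes $r$, $b_1$, $b_0$, and $R^2$ directly from their definitions. The five data points and the function names are invented for illustration; only the standard-library `statistics` module is assumed.

```python
from statistics import mean, stdev

# Invented bivariate data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]


def correlation(xs, ys):
    """Linear correlation coefficient: sum of products of z-scores over n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    z_products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(z_products) / (n - 1)


def least_squares_line(xs, ys):
    """Return (b1, b0) using b1 = r * s_y / s_x and b0 = y_bar - b1 * x_bar."""
    b1 = correlation(xs, ys) * stdev(ys) / stdev(xs)
    b0 = mean(ys) - b1 * mean(xs)
    return b1, b0


def coefficient_of_determination(xs, ys):
    """R^2 = 1 - (unexplained variation) / (total variation)."""
    b1, b0 = least_squares_line(xs, ys)
    y_bar = mean(ys)
    total = sum((y - y_bar) ** 2 for y in ys)
    unexplained = sum((y - (b1 * x + b0)) ** 2 for x, y in zip(xs, ys))
    return 1 - unexplained / total


r = correlation(xs, ys)
b1, b0 = least_squares_line(xs, ys)
r_squared = coefficient_of_determination(xs, ys)

print(f"r = {r:.4f}, b1 = {b1:.4f}, b0 = {b0:.4f}, R^2 = {r_squared:.4f}")
```

For this invented data set the slope works out to 0.6 and the intercept to 2.2, and the value of $R^2$ computed from the variation formula agrees with $r^2$, as the last equality above states for the least-squares model.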
Objectives

Section 1. You should be able to...
1. Draw and interpret scatter diagrams. Examples: 1, 3. Review Exercises: 2(b), 3(a), 6(a), 13(a)
2. Describe the properties of the linear correlation coefficient. Review Exercises: 18
3. Compute and interpret the linear correlation coefficient. Examples: 2, 3. Review Exercises: 2(c), 3(b), 13(b)
4. Determine whether a linear relation exists between two variables. Examples: 4. Review Exercises: 2(d), 3(c)
5. Explain the difference between correlation and causation. Examples: 5. Review Exercises: 14, 17

Section 2. You should be able to...
1. Find the least-squares regression line and use the line to make predictions. Examples: 2, 3. Review Exercises: 1(a), 1(b), 4(a), 4(d), 5(a), 5(c), 6(d), 12(a), 13(c), 19(c)
2. Interpret the slope and y-intercept of the least-squares regression line. Review Exercises: 1(c), 1(d), 4(c), 5(b), 19(b)
3. Compute the sum of squared residuals. Examples: 4. Review Exercises: 6(f), 6(g)

Section 3. You should be able to...
1. Compute and interpret the coefficient of determination. Examples: 1. Review Exercises: 1(e), 10(a), 11(a)
2. Perform residual analysis on a regression model. Examples: 2–5. Review Exercises: 7–9, 10(b) and (c), 11(b) and (c), 13(d) and (e), 19(d)
3. Identify influential observations. Examples: 6. Review Exercises: 10(d), 10(e), 11(d), 12(b), 19(e)

Section 4. You should be able to...
1. Compute the marginal distribution of a variable. Examples: 1 and 2. Review Exercises: 15(b)
2. Use the conditional distribution to identify association among categorical data. Examples: 3–5. Review Exercises: 15(d), 15(e), 15(f)
3. Explain Simpson's Paradox. Examples: 6. Review Exercises: 16

Review Exercises

1. Basketball Spreads In sports betting, Las Vegas sports books establish winning margins for a team that is favored to win a game. An individual can place a wager on the game and will win if the team bet upon wins after accounting for the spread. For example, if Team A is favored by 5 points and wins the game by 7 points, then a bet on Team A is a winning bet. However, if Team A wins the game by only 3 points, then a bet on Team A is a losing bet. For NCAA Division I basketball games, a least-squares regression with explanatory variable home team Las Vegas spread, x, and response variable home team winning margin, y, is $\hat{y} = 1.007x - 0.012$.
Source: Justin Wolfers. "Point Shaving: Corruption in NCAA Basketball"
(a) Predict the winning margin if the home team is favored by 3 points.
(b) Predict the winning margin (of the visiting team) if the visiting team is favored by 7 points (this is equivalent to the home team being favored by -7 points).
(c) Interpret the slope.
(d) Interpret the y-intercept.
(e) The coefficient of determination is 0.39. Interpret this value.

2. Fat and Calories in Cheeseburgers A nutritionist was interested in developing a model that describes the relation between the amount of fat (in grams) in cheeseburgers at fast-food restaurants and the number of calories. She obtains the following data from the Web sites of the companies.

    Sandwich (Restaurant)                               Fat Content (g)   Calories
    Quarter-pound Single with Cheese (Wendy's)                       20        430
    Whataburger (Whataburger)                                        39        750
    Cheeseburger (In-N-Out)                                          27        480
    Big Mac (McDonald's)                                             29        540
    Quarter-pounder with cheese (McDonald's)                         26        510
    Whopper with cheese (Burger King)                                47        760
    Jumbo Jack (Jack in the Box)                                     35        690
    Double Steakburger with cheese (Steak 'n Shake)                  38        632
    Source: Each company's Web site

(a) The researcher wants to use fat content to predict calories. Which is the explanatory variable?
(b) Draw a scatter diagram of the data.
(c) Compute the linear correlation coefficient between fat content and calories.
(d) Does a linear relation exist between fat content and calories in fast-food restaurant sandwiches?