The Whopper has been Burger King s signature sandwich since 1957.

Similar documents
COVER THE CATERPILLAR

What happened on the Titanic at 11:40 on the night of April 14, 1912,

3.2 Least- Squares Regression

Sample Size and Screening Size Trade Off in the Presence of Subgroups with Different Expected Treatment Effects

1 Thinking Critically With Psychological Science

Section 3.2 Least-Squares Regression

Chapter 3: Examining Relationships

An investigation of ambiguous-cue learning in pigeons

Chapter 7: Descriptive Statistics

Prevent, Promote, Provoke: Voices from the Substance Abuse Field

Getting started on Otezla

Statistical Analysis of Method Comparison Data

Statistics: Interpreting Data and Making Predictions. Interpreting Data 1/50

Preview and Preparation Pack. AS & A2 Resources for the new specification

Register studies from the perspective of a clinical scientist

CHAPTER ONE CORRELATION

Relationships. Between Measurements Variables. Chapter 10. Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Chapter 23. Inference About Means. Copyright 2010 Pearson Education, Inc.

Standardization of the One-stage Prothrombin Time for the Control of Anticoagulant Therapy

Chapter 3 CORRELATION AND REGRESSION

Lab 4 (M13) Objective: This lab will give you more practice exploring the shape of data, and in particular in breaking the data into two groups.

Recommendations. for the Governance & Administration of Destination Marketing Fees

Eating and Sleeping Habits of Different Countries

3. Which word is an antonym

3.2A Least-Squares Regression

Gage R&R. Variation. Allow us to explain with a simple diagram.

Core Practices for Successful Manufacturing Leaders John Hunt and Tami Trout

How can skin conductance responses increase over trials while skin resistance responses decrease?

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data

Math 075 Activities and Worksheets Book 2:

The Pretest! Pretest! Pretest! Assignment (Example 2)

OCW Epidemiology and Biostatistics, 2010 David Tybor, MS, MPH and Kenneth Chui, PhD Tufts University School of Medicine October 27, 2010

Instantaneous Measurement and Diagnosis

Incentives, information, rehearsal, and the negative recency effect*

15.301/310, Managerial Psychology Prof. Dan Ariely Recitation 8: T test and ANOVA

Statisticians deal with groups of numbers. They often find it helpful to use

I don t want to be here anymore. I m really worried about Clare. She s been acting different and something s not right

Controlling Worries and Habits

Quantifying the benefit of SHM: what if the manager is not the owner?

Determinants of Cancer Screening Frequency: The Example of Screening for Cervical Cancer

Talking About. And Dying. A Discussion Tool For Residential Aged Care Facility Staff

ESL Health Unit Unit Four Healthy Aging Lesson Two Exercise

Lesson 1: Distributions and Their Shapes

Never P alone: The value of estimates and confidence intervals

Sections are bigger at the bottom and narrower at the top to show that even within a food group, some foods should be eaten more often than others.

Interviewer: Tell us about the workshops you taught on Self-Determination.

Autism, my sibling, and me

Classification of ADHD and Non-ADHD Using AR Models and Machine Learning Algorithms

The Two Decade Long Puzzle One parent s CMV journey and why awareness is KEY!

Follow-up Call Script and Log

Homework #3. SHORT ANSWER. Write the word or phrase that best completes each statement or answers the question.

Why we get hungry: Module 1, Part 1: Full report

(WG Whitfield Growden, MD; DR Diane Redington, CRNP)

Psychosocial Factors and Success in Clinical SpeechLanguage Graduate Programs

M 140 Test 1 A Name (1 point) SHOW YOUR WORK FOR FULL CREDIT! Problem Max. Points Your Points Total 75

Conduct an Experiment to Investigate a Situation

The ABCs of CBT. Understanding the basic foundation and implementation of Cognitive Behavioral Therapy

Bouncing Ball Lab. Name

This week s issue: UNIT Word Generation. conceive unethical benefit detect rationalize

You probably don t spend a lot of time here, but if you do, you are reacting to the most basic needs a human has survival and protection.

Managing Your Emotions

ORIENTATION SAN FRANCISCO STOP SMOKING PROGRAM

Take new look emotions we see as negative may be our best friends (opposite to the script!)

end-stage renal disease

Further Mathematics 2018 CORE: Data analysis Chapter 3 Investigating associations between two variables

COMBUSTION GENERATED PARTICULATE EMISSIONS

Unit 1 Exploring and Understanding Data

GENETIC AND SOMATIC EFFECTS OF IONIZING RADIATION

Balkan Journal of Mechanical Transmissions (BJMT)

Exemplar for Internal Assessment Resource Mathematics Level 3. Resource title: Sport Science. Investigate bivariate measurement data

Appendix B Statistical Methods

Lesson 8 Setting Healthy Eating & Physical Activity Goals

Regression Equation. November 29, S10.3_3 Regression. Key Concept. Chapter 10 Correlation and Regression. Definitions

Chapter 3: Describing Relationships

Undertaking statistical analysis of

SPSS Correlation/Regression

Stepwise method Modern Model Selection Methods Quantile-Quantile plot and tests for normality

DAY 2 RESULTS WORKSHOP 7 KEYS TO C HANGING A NYTHING IN Y OUR LIFE TODAY!

Helping Your Asperger s Adult-Child to Eliminate Thinking Errors

Making Your Treatment Work Long-Term

Political Science 15, Winter 2014 Final Review

Part 1. For each of the following questions fill-in the blanks. Each question is worth 2 points.

Experimental Methods 2/9/18. What is an Experimental Method?

JEJUNAL AND ILEAL ABSORPTION OF DIBASIC AMINO ACIDS AND AN ARGININE-CONTAINING DIPEPTIDE IN CYSTINURIA

Chapter 19. Confidence Intervals for Proportions. Copyright 2010, 2007, 2004 Pearson Education, Inc.

I ll Do it Tomorrow. READTHEORY Name Date

COPING WITH ANXIETY - HELPING YOUR CHILDREN BE MORE RESILIENT

Problem Situation Form for Parents

Regression. Lelys Bravo de Guenni. April 24th, 2015

Lose Weight. without dieting.

Section 4 - Dealing with Anxious Thinking

CCM6+7+ Unit 12 Data Collection and Analysis

Statistics Coursework Free Sample. Statistics Coursework

Living Well with Diabetes. Meeting 12. Welcome!

Good Communication Starts at Home

(2) In each graph above, calculate the velocity in feet per second that is represented.

Section 4 Decision-making

3 CONCEPTUAL FOUNDATIONS OF STATISTICS

A Fellowship in Pediatric Palliative Care:

Transcription:

CHAPER 8 Linear Regression WHO WHA UNIS HOW Items on the Brger King men Protein content and total fat content Grams of protein Grams of fat Spplied by BK on reqest or at their Web site he Whopper has been Brger King s signatre sandwich since 1957. One Doble Whopper with cheese provides 53 grams of protein all the protein yo need in a day. It also spplies 1020 calories and 65 grams of fat. he Daily Vale (based on a 2000-calorie diet) for fat is 65 grams. So after a Doble Whopper yo ll want the rest of yor calories that day to be fat-free. 1 Of corse, the Whopper isn t the only item Brger King sells. How are fat and protein related on the entire BK men? he scatterplot of the Fat (in grams) verss the Protein (in grams) for foods sold at Brger King shows a positive, moderately strong, linear relationship. Video: Manatees and Motorboats. Are motorboats killing more manatees in Florida? Here s the story on video. Fat (g) 60 45 30 FIGURE 8.1 otal Fat verss Protein for 30 items on the BK men. he Doble Whopper is in the pper right corner. It s extreme, bt is it ot of line? 15 0 0.0 12.5 25.0 37.5 50.0 Protein (g) Activity: Linear Eqations. For a qick review of linear eqations, view this activity and play with the interactive tool. If yo want 25 grams of protein in yor lnch, how mch fat shold yo expect to consme at Brger King? he correlation between Fat and Protein is 0.83, a sign that the linear association seen in the scatterplot is fairly strong. Bt strength of the relationship is only part of the pictre. he correlation says, he linear association between these two variables is fairly strong, bt it doesn t tell s what the line is. 1 Sorry abot the fries. 171

172 CHAPER 8 Linear Regression Statisticians, like artists, have the bad habit of falling in love with their models. George Box, famos statistician Activity: Residals. Residals are the basis for fitting lines to scatterplots. See how they work. Now we can say more. We can model the relationship with a line and give its eqation. he eqation will let s predict the fat content for any Brger King food, given its amont of protein. We met or first model in Chapter 6. We saw there that we can specify a Normal model with two parameters: its mean 1m2 and standard deviation 1s2. For the Brger King foods, we d choose a linear model to describe the relationship between Protein and Fat. he linear model is jst an eqation of a straight line throgh the data. Of corse, no line can go throgh all the points, bt a linear model can smmarize the general pattern with only a cople of parameters. Like all models of the real world, the line will be wrong wrong in the sense that it can t match reality exactly. Bt it can help s nderstand how the variables are associated. Fat (g) 60 45 30 15 0 Residals ˆ y Not only can t we draw a line throgh all the points, the best line might not even hit any of the points. hen how can it be the best line? We want to find the line that somehow comes closer to all the points than any other line. Some of the points will be above the line and some below. For example, the line might sggest that a BK Broiler chicken sandwich with 30 grams of protein shold have 36 grams of fat when, in fact, it actally has only 25 grams of fat. We call the estimate made from a model the predicted vale, and write it as y N (called y-hat) to distingish it from the tre vale y (called, h, y). he difference between the observed vale and its associated predicted vale is called the residal. he residal vale tells s how far off the model s prediction is at that point. he BK Broiler chicken residal wold be y - y N = 25-36 = -11 g of fat. Best Fit Means Least Sqares Activity: he Least Sqares Criterion. Does yor sense of best fit look like the least sqares line? y residal 0.0 12.5 25.0 37.5 50.0 Protein (g) residal = observed vale - predicted vale A negative residal means the predicted vale is too big an overestimate. And a positive residal shows that the model makes an nderestimate. hese may seem backwards ntil yo think abot them. Who s on First In 1805, Legendre was the first to pblish the least sqares soltion to the problem of fitting a line to data when the points don t all fall exactly on the line.he main challenge was how to distribte the errors fairly. After considerable thoght, he decided to minimize the sm of the sqares of what we now call the residals. When Legendre pblished his paper, thogh, Gass claimed he had been sing the method since 1795. Gass later referred to the least sqares soltion as or method (principim nostrm), which certainly didn t help his relationship with Legendre. o find the residals, we always sbtract the predicted vale from the observed one. he negative residal tells s that the actal fat content of the BK Broiler chicken is abot 11 grams less than the model predicts for a typical Brger King men item with 30 grams of protein. Or challenge now is how to find the right line. When we draw a line throgh a scatterplot, some residals are positive and some negative. We can t assess how well the line fits by adding p all the residals the positive and negative ones wold jst cancel each other ot. We faced the same isse when we calclated a standard deviation to measre spread. And we deal with it the same way here: by sqaring the residals. Sqaring makes them all positive. Now we can add them p. Sqaring also emphasizes the large residals. After all, points near the line are consistent with the model, bt we re more concerned abot points far from the line. When we add all the sqared residals together, that sm indicates how well the line we drew fits the data the smaller the sm, the better the fit. A different line will prodce a different sm, maybe bigger, maybe smaller. he line of best fit is the line for which the sm of the sqared residals is smallest, the least sqares line.

Correlation and the Line 173 Least sqares. ry to minimize the sm of areas of residal sqares as yo drag a line across a scatterplot. Yo might think that finding this line wold be pretty hard. Srprisingly, it s not, althogh it was an exciting mathematical discovery when Legendre pblished it in 1805 (see margin note on previos page). 1 2 1 z Fat FIGURE 8.2 1 Correlation and the Line 1 2 he Brger King scatterplot in z-scores. NOAION ALER: z Protein Ptting a hat on it is standard Statistics notation to indicate that something has been predicted by a model. Whenever yo see a hat over a variable name or symbol, yo can assme it is the predicted version of that variable or symbol (and look arond for the model). 1 2 1 z Fat FIGURE 8.3 1 1s x 1 2 rs y z Protein Standardized fat vs. standardized protein with the regression line. Each one standard deviation change in protein reslts in a predicted change of r standard deviations in fat. If yo sspect that what we know abot correlation can lead s to the eqation of the linear model, yo re headed in the right direction. It trns ot that it s not a very big step. In Chapter 7 we learned a lot abot how correlation worked by looking at a scatterplot of the standardized variables. Here s a scatterplot of z y (standardized Fat) vs. z x (standardized Protein). What line wold yo choose to model the relationship of the standardized vales? Let s start at the center of the scatterplot. How mch protein and fat does a typical Brger King food item provide? If it has average protein content, x, what abot its fat content? If yo gessed that its fat content shold be abot average, y, as well, then yo ve discovered the first property of the line we re looking for. he line mst go throgh the point ( x, y). In the plot of z-scores, then, the line passes throgh the origin (0, 0). Yo might recall that the eqation for a line that passes throgh the origin can be written with jst a slope and no intercept: y = mx. he coordinates of or standardized points aren t written (x, y); their coordinates are z-scores: ( z x, z y ). We ll need to change or eqation to show that. And we ll need to indicate that the point on the line corresponding to a particlar z is z N x y, the model s estimate of the actal vale of z y. So or eqation becomes z N y = mz x. Many lines with different slopes pass throgh the origin. Which one fits or data the best? hat is, which slope determines the line that minimizes the sm of the sqared residals? It trns ot that the best choice for m is the correlation coefficient itself, r! (Yo mst really wonder where that stnning assertion comes from. Check the Math Box.) Wow! his line has an eqation that s abot as simple as we cold possibly hope for: z N y = rz x. Great. It s simple, bt what does it tell s? It says that in moving one standard deviation from the mean in x, we can expect to move abot r standard deviations away from the mean in y. Now that we re thinking abot least sqares lines, the correlation is more than jst a vage measre of strength of association. It s a great way to think abot what the model tells s. Let s be more specific. For the sandwiches, the correlation is 0.83. If we standardize both protein and fat, we can write z N Fat = 0.83z Protein. his model tells s that for every standard deviation above (or below) the mean a sandwich is in protein, we ll predict that its fat content is 0.83 standard deviations above (or below) the mean fat content. A doble hambrger has 31 grams of protein, abot 1 SD above the mean. Ptting 1.0 in for z N Protein in the model gives a z Fat vale of 0.83. If yo trst the model, yo d expect the fat content to be abot 0.83 fat SDs above the mean fat level. Moving one standard deviation away from the mean in x moves or estimate r standard deviations away from the mean in y. If r = 0, there s no linear relationship. he line is horizontal, and no matter how many standard deviations yo move in x, the predicted vale for y doesn t

174 CHAPER 8 Linear Regression change. On the other hand, if r = 1.0 or -1.0, there s a perfect linear association. In that case, moving any nmber of standard deviations in x moves exactly the same nmber of standard deviations in y. In general, moving any nmber of standard deviations in x moves r times that nmber of standard deviations in y. How Big Can Predicted Vales Get? Sir Francis Galton was the first to speak of regression, althogh others had fit lines to data by the same method. he First Regression Sir Francis Galton related the heights of sons to the heights of their fathers with a regression line. he slope of his line was less than 1.hat is, sons of tall fathers were tall, bt not as mch above the average height as their fathers had been above their mean. Sons of short fathers were short, bt generally not as far from their mean as their fathers. Galton interpreted the slope correctly as indicating a regression toward the mean height and regression stck as a description of the method he had sed to find the line. Sppose yo were told that a new male stdent was abot to join the class, and yo were asked to gess his height in inches. What wold be yor gess? A safe gess wold be the mean height of male stdents. Now sppose yo are also told that this stdent has a grade point average (GPA) of 3.9 abot 2 SDs above the mean GPA. Wold that change yor gess? Probably not. he correlation between GPA and height is near 0, so knowing the GPA vale doesn t tell yo anything and doesn t move yor gess. (And the eqation tells s that as well, since it says that we shold move 0 * 2 SDs from the mean.) On the other hand, sppose yo were told that, measred in centimeters, the stdent s height was 2 SDs above the mean. here s a perfect correlation between height in inches and height in centimeters, so yo d know he s 2 SDs above mean height in inches as well. (he eqation wold tell s to move 1.0 * 2 SDs from the mean.) What if yo re told that the stdent is 2 SDs above the mean in shoe size? Wold yo still gess that he s of average height? Yo might gess that he s taller than average, since there s a positive correlation between height and shoe size. Bt wold yo gess that he s 2 SDs above the mean? When there was no correlation, we didn t move away from the mean at all. With a perfect correlation, we moved or gess the fll 2 SDs. Any correlation between these extremes shold lead s to move somewhere between 0 and 2 SDs above the mean. (o be exact, the eqation tells s to move r * 2 standard deviations away from the mean.) Notice that if x is 2 SDs above its mean, we won t ever gess more than 2 SDs away for y, since r can t be bigger than 1.0. 2 So, each predicted y tends to be closer to its mean (in standard deviations) than its corresponding x was. his property of the linear model is called regression to the mean, and the line is called the regression line. JUS CHECKING A scatterplot of hose Price (in thosands of dollars) vs. hose Size (in thosands of sqare feet) for hoses sold recently in Saratoga, NY shows a relationship that is straight, with only moderate scatter and no otliers. he correlation between hose Price and hose Size is 0.77. 1. Yo go to an open hose and find that the hose is 1 standard deviation above the mean in size. What wold yo gess abot its price? 2. Yo read an ad for a hose priced 2 standard deviations below the mean. What wold yo gess abot its size? 3. A friend tells yo abot a hose whose size in sqare meters (he s Eropean) is 1.5 standard deviations above the mean. What wold yo gess abot its size in sqare feet? 2 In the last chapter we asserted that correlations max ot at 1, bt we never actally proved that. Here s yet another reason to check ot the Math Box on the next page.

How Big Can Predicted Vales Get? 175 MAH BOX Where does the eqation of the line of best fit come from? o write the eqation of any line, we need to know a point on the line and the slope. he point is easy. Consider the BK men example. Since it is logical to predict that a sandwich with average protein will contain average fat, the line passes throgh the point 1x, y2. 3 o think abot the slope, we look once again at the z-scores. We need to remember a few things: 1. he mean of any set of z-scores is 0. his tells s that the line that best fits the z-scores passes throgh the origin (0,0). 2. he standard deviation of a set of z-scores is 1, so the variance is also 1. his means that a 1z y - z y 2 2 n - 1 3. he correlation is r = a z xz y also important soon. n - 1, = a z y a fact that will be important soon. n - 1 = 1, Ready? Remember that or objective is to find the slope of the best fit line. Becase it passes throgh the origin, its eqation will be of the form z N y = mz x. We want to find the vale for m that will minimize the sm of the sqared residals. Actally we ll divide that sm by n - 1 and minimize this mean sqared residal, or MSR. Here goes: Minimize: Since zn y = mz x : Sqare the binomial: = a 1z y - 02 2 n - 1 Rewrite the smmation: 4. Sbstitte from (2) and (3): Wow! hat simplified nicely! And as a bons, the last expression is qadratic. Remember parabolas from algebra class? A parabola in the form y = ax 2 + bx + c reaches its minimm at its trning point, which occrs when 2 MSR = a 1z y - ẑ y 2 2 n - 1 MSR = a 1z y - mz x 2 2 n - 1 = a 1z y 2-2mz x z y + m 2 z 2 x 2 n - 1 x = -b 2a. = a z 2 y n - 1-2m a z xz y n - 1 + a z 2 x m2 n - 1 = 1-2mr + m 2 We can minimize the mean of sqared residals by choosing m = -1-2r2 = r. 2112 Wow, again! he slope of the best fit line for z-scores is the correlation, r. his stnning fact immediately leads s to two important additional reslts, listed below. As yo read on in the text, we explain them in the context of or contining discssion of Brger King foods. Aslope of r for z-scores means that for every increase of 1 standard deviation in z x there is an increase of r standard deviations in z N y. Over one, p r, as yo probably said in algebra class. ranslate that back to the original x and y vales: Over one standard deviation in x, p r standard deviations in y N. hat s it! In x- and y-vales, the slope of the regression line is b = rs y. s x 3 It s actally not hard to prove this too.

176 CHAPER 8 Linear Regression We know choosing m = r minimizes the sm of the sqared residals, bt how small does that sm get? Eqation (4) told s that the mean of the sqared residals is 1-2mr + m 2. When m = r, 1-2mr + m 2 = 1-2r 2 + r 2 = 1 - r 2. his is the variability not explained by the regression line. Since the variance in z y was 1 (Eqation 2), the percentage of variability in y that is explained by x is r 2. his important fact will help s assess the strength of or models. And there s still another bons. Becase r 2 is the percent of variability explained by or model, r 2 is at most 100%. If r 2 1, then -1 r 1, proving that correlations are always between -1 and +1. (old yo so!) Protein x = 17.2 g s x = 14.0 g he Regression Line in Real Units Why Is Correlation r? In his original paper on correlation, Galton sed r for the index of correlation that we now call the correlation coefficient. He calclated it from the regression of y on x or of x on y after standardizing the variables, jst as we have done. It s fairly clear from the text that he sed r to stand for (standardized) regression. Simlation: Interpreting Eqations. his demonstrates how to se and interpret linear eqations. r = 0.83 Slope b 1 = rs y s x Fat y = 23.5 g s y = 16.4 g When yo read the Brger King men, yo probably don t think in z-scores. Bt yo might want to know the fat content in grams for a specific amont of protein in grams. How mch fat shold we predict for a doble hambrger with 31 grams of protein? he mean protein content is near 17 grams and the standard deviation is 14, so that item is 1 SD above the mean. Since r = 0.83, we predict the fat content will be 0.83 SDs above the mean fat content. Great. How mch fat is that? Well, the mean fat content is 23.5 grams and the standard deviation of fat content is 16.4, so we predict that the doble hambrger will have 23.5 + 0.83 * 16.4 = 37.11 grams of fat. We can always convert both x and y to z-scores, find the correlation, se zn y = rz x, and then convert zn y back to its original nits so that we can nderstand the prediction. Bt can t we do this more simply? Yes. Let s write the eqation of the line for protein and fat that is, the actal x and y vales rather than their z-scores. In Algebra class yo may have once seen lines written in the form y = mx + b. Statisticians do exactly the same thing, bt with different notation: In this eqation, b 0 is the y-intercept, the vale of y where the line crosses the y-axis, and b is the slope. 4 1 First we find the slope, sing the formla we developed in the Math Box. 5 Remember? We know that or model predicts that for each increase of one standard deviation in protein we ll see an increase of abot 0.83 standard deviations in fat. In other words, the slope of the line in original nits is b 1 = rs y s x = Next, how do we find the y-intercept, b 0? Remember that the line has to go throgh the mean-mean point ( x, y). In other words, the model predicts y to be the vale that corresponds to x. We can pt the means into the eqation and write y = b 0 + b 1 x. Solving for, we see that the intercept is jst b 0 = y - b 1 x. b 0 0.83 * 16.4 g fat 14 g protein yn = b 0 + b 1 x. = 0.97 grams of fat per gram of protein. Intercept b 0 = y - b 1 x 4 We changed from mx + b to b 0 + b 1 x for a reason not jst to be difficlt. Eventally we ll want to add more x s to the model to make it more realistic and we don t want to se p the entire alphabet. What wold we se after m? he next letter is n, and that one s already taken. o? See or point? Sometimes sbscripts are the best approach. 5 Several important reslts popped p in that Math Box. Check it ot!

he Regression Line in Real Units 177 otal Fat (g) Units of y per nit of x Get into the habit of identifying the nits by writing down y-nits per x-nit, with the nit names pt in place.yo ll find it ll really help yo to ell abot the line in context. 60 45 30 15 0 FIGURE 8.4 0.0 12.5 25.0 37.5 50.0 Protein (g) Brger King men items in their natral nits with the regression line. For the Brger King foods, that comes ot to g fat b 0 = 23.5 g fat - 0.97 * 17.2 g protein = 6.8 g fat. g protein Ptting this back into the regression eqation gives fat = 6.8 + 0.97 protein. What does this mean? he slope, 0.97, says that an additional gram of protein is associated with an additional 0.97 grams of fat, on average. Less formally, we might say that Brger King sandwiches pack abot 0.97 grams of fat per gram of protein. Slopes are always expressed in y-nits per x-nit. hey tell how the y-variable changes (in its nits) for a one-nit change in the x-variable. When yo see a phrase like stdents per teacher or kilobytes per second think slope. Changing the nits of the variables doesn t change the correlation, bt for the slope, nits do matter. We may know that age and height in children are positively correlated, bt the vale of the slope depends on the nits. If children grow an average of 3 inches per year, that s the same as 0.21 millimeters per day. For the slope, it matters whether yo express age in days or years and whether yo measre height in inches or millimeters. How yo choose to express x and y what nits yo se affects the slope directly. Why? We know changing nits doesn t change the correlation, bt does change the standard deviations. he slope introdces the nits into the eqation by mltiplying the correlation by the ratio of s y to s x. he nits of the slope are always the nits of y per nit of x. How abot the intercept of the BK regression line, 6.8? Algebraically, that s the vale the line takes when x is zero. Here, or model predicts that even a BK item with no protein wold have, on average, abot 6.8 grams of fat. Is that reasonable? Well, the apple pie, with 2 grams of protein, has 14 grams of fat, so it s not impossible. Bt often 0 is not a plasible vale for x (the year 0, a baby born weighing 0 grams,...). hen the intercept serves only as a starting vale for or predictions and we don t interpret it as a meaningfl predicted vale. FOR EXAMPLE A regression model for hrricanes In Chapter 7 we looked at the relationship between the central pressre and maximm wind speed of Atlantic hrricanes. We saw that the scatterplot was straight enogh, and then fond a correlation of -0.879, bt we had no model to describe how these two important variables are related or to allow s to predict wind speed from pressre. Since the conditions we need to check for regression are the same ones we checked before, we can se technology to find the regression model. It looks like this: MaxWindSpeed = 955.27 0.897CentralPressre Qestion: Interpret this model. What does the slope mean in this context? Does the intercept have a meaningfl interpretation? he negative slope says that as CentralPressre falls, MaxWindSpeed increases. hat makes sense from or general nderstanding of how hrricanes work: Low central pressre plls in moist air, driving the rotation and the reslting destrctive winds. he slope s vale says that, on average, the maximm wind speed increases by abot 0.897 knots for every 1-millibar drop in central pressre. It s not meaningfl, however, to interpret the intercept as the wind speed predicted for a central pressre of 0 that wold be a vacm. Instead, it is merely a starting vale for the model. Max Wind Speed (knots) 150 125 100 75 920 940 960 980 1000 Central Pressre (mb)

178 CHAPER 8 Linear Regression With the estimated linear model, fat = 6.8 + 0.97 protein, it s easy to predict fat content for any men item we want. For example, for the BK Broiler chicken sandwich with 30 grams of protein, we can plg in 30 grams for the amont of protein and see that the predicted fat content is 6.8 + 0.971302 = 35.9 grams of fat. Becase the BK Broiler chicken sandwich actally has 25 grams of fat, its residal is fat - fat = 25-35.9 = -10.9 g. o se a regression model, we shold check the same conditions for regressions as we did for correlation: the Qantitative Variables Condition, the Straight Enogh Condition, and the Otlier Condition. JUS CHECKING Let s look again at the relationship between hose Price (in thosands of dollars) and hose Size (in thosands of sqare feet) in Saratoga. he regression model is 4. What does the slope of 94.454 mean? 5. What are the nits of the slope? 6. Yor hose is 2000 sq ft bigger than yor neighbor s hose. How mch more do yo expect it to be worth? 7. Is the y-intercept of -3.117 meaningfl? Explain. Price = -3.117 + 94.454 Size. SEP-BY-SEP EXAMPLE Calclating a Regression Eqation Wildfires are an ongoing sorce of concern shared by several government agencies. In 2004, the Brea of Land Management, Brea of Indian Affairs, Fish and Wildlife Service, National Park Service, and USDA Forest Service spent a combined total of $890,233,000 on fire sppression, down from nearly twice that mch in 2002. hese government agencies join together in the National Interagency Fire Center, whose Web site (www.nifc.gov) reports statistics abot wildfires. Qestion: Has the annal nmber of wildfires been changing, on average? If so, how fast and in what way? Plan State the problem. Variables Identify the variables and report the W s. Check the appropriate assmptions and conditions. I want to know how the nmber of wildfires in the continental United States has changed in the past two decades. I have data giving the nmber of wildfires for each year (in thosands of fires) from 1982 to 2005. Ç Qantitative Variables Condition: Both the nmber of fires and the year are qantitative.

he Regression Line in Real Units 179 Jst as we did for correlation, check the conditions for a regression by making a pictre. Never fit a regression withot looking at the scatterplot first. Fires (thosands) 166 141 116 91 Note: It s common (and sally simpler) not to se for-digit nmbers to identify years. Here we have chosen to nmber the years beginning in 1982, so 1982 is represented as year 0 and 2005 as year 23. 0 5 10 15 20 25 Years since 1982 Ç Straight Enogh Condition: he scatterplot shows a strong linear relationship with a negative association. Ç Otlier Condition: No otliers are evident in the scatterplot. Becase these conditions are satisfied, it is OK to model the relationship with a regression line. Mechanics Find the eqation of the regression line. Smmary statistics give the bilding blocks of the calclation. (We generally report smmary statistics to one more digit of accracy than the data. We do the same for intercept and predicted vales, bt for slopes we sally report an additional digit. Remember, thogh, not to rond off ntil yo finish compting an answer.) 6 Find the slope, b 1. Find the intercept, b 0. Write the eqation of the model, sing meaningfl variable names. Year: Fires: Correlation: x = 11.5 (representing 1993.5) s x = 7.07 years y = 114.098 fires s y = 28.342 fires r = -0.862 b 1 = rs y s x = -0.862(28.342) 7.07 = -3.4556 fires per year b 0 = y - b 1 x = 114.098 - (-3.4556)11.5 = 153.837 So the least sqares line is yn = 153.837-3.4556x, or Fires = 153.837-3.4556 year 6 We warned yo in Chapter 6 that we ll rond in the intermediate steps of a calclation to show the steps more clearly. If yo repeat these calclations yorself on a calclator or statistics program, yo may get somewhat different reslts. When calclated with more precision, the intercept is 153,809 and the slope is -3.453.

180 CHAPER 8 Linear Regression Conclsion Interpret what yo have fond in the context of the qestion. Discss in terms of the variables and their nits. Dring the period from 1982 to 2005, the annal nmber of fires declined at an average rate of abot 3,456 (3.456 thosand) fires per year. For prediction, the model ses a base estimation of 153,837 fires in 1982. Activity: Find a Regression Eqation. Now that we ve done it by hand, try it with technology sing the statistics package paired with yor version of ActivStats. Residals Revisited Why e for Residal? he flip answer is that r is already taken, bt the trth is that e stands for error. No, that doesn t mean it s a mistake. Statisticians often refer to variability not explained by a model as error. he linear model we are sing assmes that the relationship between the two variables is a perfect straight line. he residals are the part of the data that hasn t been modeled. We can write or, eqivalently, Or, in symbols, Data = Model + Residal Residal = Data - Model. e = y - yn. When we want to know how well the model fits, we can ask instead what the model missed. o see that, we look at the residals. FOR EXAMPLE Katrina s residal Recap: he linear model relating hrricanes wind speeds to their central pressres was MaxWindSpeed = 955.27 0.897CentralPressre Let s se this model to make predictions and see how those predictions do. Qestion: Hrricane Katrina had a central pressre measred at 920 millibars. What does or regression model predict for her maximm wind speed? How good is that prediction, given that Katrina s actal wind speed was measred at 110 knots? Sbstitting 920 for the central pressre in the regression model eqation gives MaxWindSpeed = 955.27-0.89719202 = 130.03 he regression model predicts a maximm wind speed of 130 knots for Hrricane Katrina. he residal for this prediction is the observed vale mins the predicted vale: 110-130 = -20kts. In the case of Hrricane Katrina, the model predicts a wind speed 20 knots higher than was actally observed.

he Residal Standard Deviation 181 Residals (g fat) 15.0 7.5 0.0 7.5 FIGURE 8.5 0.0 12.5 25.0 37.5 50.0 Protein (g) he residals for the BK men regression look appropriately boring. Residals help s to see whether the model makes sense. When a regression model is appropriate, it shold model the nderlying relationship. Nothing interesting shold be left behind. So after we fit a regression model, we sally plot the residals in the hope of finding... nothing. Ascatterplot of the residals verss the x-vales shold be the most boring scatterplot yo ve ever seen. It sholdn t have any interesting featres, like a direction or shape. It shold stretch horizontally, with abot the same amont of scatter throghot. It shold show no bends, and it shold have no otliers. If yo see any of these featres, find ot what the regression model missed. Most compter statistics packages plot the residals against the predicted vales yn, rather than against x. When the slope is negative, the two versions are mirror images. When the slope is positive, they re virtally identical except for the axis labels. Since all we care abot is the patterns (or, better, lack of patterns) in the plot, it really doesn t matter which way we plot the residals. JUS CHECKING Or linear model for Saratoga homes ses the Size (in thosands of sqare feet) to estimate the Price (in thosands of dollars): Price = -3.117 + 94.454Size. Sppose yo re thinking of bying a home there. 8. Wold yo prefer to find a home with a negative or a positive residal? Explain. 9. Yo plan to look for a home of abot 3000 sqare feet. How mch shold yo expect to have to pay? 10. Yo find a nice home that size selling for $300,000. What s the residal? he Residal Standard Deviation Why n - 2 rather than n - 1? We sed n - 1 for s when we estimated the mean. Now we re estimating both a slope and an intercept. Looks like a pattern and it is. We sbtract one more for each parameter we estimate. If the residals show no interesting pattern when we plot them against x, we can look at how big they are. After all, we re trying to make them as small as possible. Since their mean is always zero, thogh, it s only sensible to look at how mch they vary. he standard deviation of the residals, s e, gives s a measre of how mch the points spread arond the regression line. Of corse, for this smmary to make sense, the residals shold all share the same nderlying spread, so we check to make sre that the residal plot has abot the same amont of scatter throghot. his gives s a new assmption: the Eqal Variance Assmption. he associated condition to check is the Does the Plot hicken? Condition. We check to make sre that the spread is abot the same all along the line. We can check that either in the original scatterplot of y against x or in the scatterplot of residals. We estimate the standard deviation of the residals in almost the way yo d expect: s e = A e 2 n - 2 We don t need to sbtract the mean becase the mean of the residals e = 0. For the Brger King foods, the standard deviation of the residals is 9.2 grams of fat. hat looks abot right in the scatterplot of residals. he residal for the BK Broiler chicken was -11 grams, jst over one standard deviation.

182 CHAPER 8 Linear Regression It s a good idea to make a histogram of the residals. If we see a nimodal, symmetric histogram, then we can apply the 68 95 99.7 Rle to see how well the regression model describes the data. In particlar, we know that 95% of the residals shold be no larger in size than 2s e. he Brger King residals look like this: 15 # of Residals 10 5 27.6 18.4 9.2 0.0 9.2 18.4 27.6 Residals Sre enogh, almost all are less than 2(9.2), or 18.4, g of fat in size. 45 30 15 0 15 30 FIGURE 8.6 Fat R 2 he Variation Acconted For Residals Compare the variability of total Fat with the residals from the regression. he means have been sbtracted to make it easier to compare spreads. he variation left in the residals is nacconted for by the model, bt it s less than the variation in the original data. Understanding R 2. Watch the nexplained variability decrease as yo drag points closer to the regression line. he variation in the residals is the key to assessing how well the model fits. Let s compare the variation of the response variable with the variation of the residals. he total Fat has a standard deviation of 16.4 grams. he standard deviation of the residals is 9.2 grams. If the correlation were 1.0 and the model predicted the Fat vales perfectly, the residals wold all be zero and have no variation. We coldn t possibly do any better than that. On the other hand, if the correlation were zero, the model wold simply predict 23.5 grams of Fat (the mean) for all men items. he residals from that prediction wold jst be the observed Fat vales mins their mean. hese residals wold have the same variability as the original data becase, as we know, jst sbtracting the mean doesn t change the spread. How well does the BK regression model do? Look at the boxplots. he variation in the residals is smaller than in the data, bt certainly bigger than zero. hat s nice to know, bt how mch of the variation is still left in the residals? If yo had to pt a nmber between 0% and 100% on the fraction of the variation left in the residals, what wold yo say? All regression models fall somewhere between the two extremes of zero correlation and perfect correlation. We d like to gage where or model falls. As we showed in the Math Box, 7 the sqared correlation, r 2, gives the fraction of the data s variation acconted for by the model, and 1 - r 2 is the fraction of the original variation left in the residals. For the Brger King model, r 2 = 0.83 2 = 0.69, and 1 - r 2 is 0.31, so 31% of the variability in total Fat has been left in the residals. How close was that to yor gess? All regression analyses inclde this statistic, althogh by tradition, it is written with a capital letter, R 2, and prononced R-sqared. An R 2 of 0 means that none of the variance in the data is in the model; all of it is still in the residals. It wold be hard to imagine sing that model for anything. 7 Have yo looked yet? Please do.

How Big Shold R 2 Be? 183 Is a correlation of 0.80 twice as strong as a correlation of 0.40? Not if yo think in terms of R 2. A correlation of 0.80 means an R 2 of 0.80 2 = 64%. A correlation of 0.40 means an R 2 of 0.40 2 = 16% only a qarter as mch of the variability acconted for. A correlation of 0.80 gives an R 2 for times as strong as a correlation of 0.40 and acconts for for times as mch of the variability. Becase is a fraction of a whole, it is often given as a percentage. 8 For the Brger King data, R 2 is 69%. When interpreting a regression model, yo need to ell what R 2 means. According to or linear model, 69% of the variability in the fat content of Brger King sandwiches is acconted for by variation in the protein content. R 2 How can we see that is really the fraction of variance acconted for by the model? It s a simple calclation. he variance of the fat content of the Brger King foods is 16.4 2 = 268.42. If we treat the residals as data, the variance of the residals is 83.195. 9 As a fraction, that s 83.195>268.42 = 0.31, or 31%. hat s the fraction of the variance that is not acconted for by the model. he fraction that is acconted for is 100% - 31% = 69%, jst the vale we got for R 2. R 2 FOR EXAMPLE Interpreting R 2 Recap: Or regression model that predicts maximm wind speed in hrricanes based on the storm s central pressre has R 2 = 77.3%. Qestion: What does that say abot or regression model? An of 77.3% indicates that 77.3% of the variation in maximm wind speed can be acconted for by the hrricane s central pressre. Other factors, sch as temperatre and whether the storm is over water or land, may explain some of the remaining variation. R 2 JUS CHECKING Back to or regression of hose Price (in thosands of $) on hose Size (in thosands of sqare feet). he vale is reported as 59.5%, and the standard deviation of the residalsis 53.79. R 2 R 2 11. What does the vale mean abot the relationship of Price and Size? 12. Is the correlation of Price and Size positive or negative? How do yo know? 13. If we measre hose Size in sqare meters instead, wold change? Wold the slope of the line change? Explain. 14. Yo find that yor hose in Saratoga is worth $100,000 more than the regression model predicts. Shold yo be very srprised (as well as pleased)? R 2 How Big Shold Be? is always between 0% and 100%. Bt what s a good vale? he answer depends on the kind of data yo are analyzing and on what yo want to do with it. Jst as with correlation, there is no vale for R 2 that atomatically determines R 2 R 2 R 2 8 By contrast, we sally give correlation coefficients as decimal vales between -1.0 and 1.0. 9 his isn t qite the same as sqaring the s e that we discssed on the previos page, bt it s very close. We ll deal with the distinction in Chapter 27.

184 CHAPER 8 Linear Regression Some Extreme ales One major company developed a method to differentiate between proteins.o do so, they had to distingish between regressions with R 2 of 99.99% and 99.98%. For this application, 99.98% was not high enogh. he president of a financial services company reports that althogh his regressions give R 2 below 2%, they are highly sccessfl becase those sed by his competition are even lower. that the regression is good. Data from scientific experiments often have in the 80% to 90% range and even higher. Data from observational stdies and srveys, thogh, often show relatively weak associations becase it s so difficlt to measre responses reliably. An R 2 of 50% to 30% or even lower might be taken as evidence of a sefl regression. he standard deviation of the residals can give s more information abot the seflness of the regression by telling s how mch scatter there is arond the line. As we ve seen, an R 2 of 100% is a perfect fit, with no scatter arond the line. he s e wold be zero. All of the variance is acconted for by the model and none is left in the residals at all. his sonds great, bt it s too good to be tre for real data. 10 Along with the slope and intercept for a regression, yo shold always report R 2 so that readers can jdge for themselves how sccessfl the regression is at fitting the data. Statistics is abot variation, and R 2 measres the sccess of the regression model in terms of the fraction of the variation of y acconted for by the regression. R 2 is the first part of a regression that many people look at becase, along with the scatterplot, it tells whether the regression model is even worth thinking abot. R 2 Regression Assmptions and Conditions Make a Pictre o se regression, first check that the scatterplot is straight enogh. After yo ve fit the regression, make a residal plot and check that there are no obvios patterns. In particlar, check that there are no obvios bends, the spread of the residals is abot the same throghot, and there are no obvios otliers. he linear regression model is perhaps the most widely sed model in all of Statistics. It has everything we cold want in a model: two easily estimated parameters, a meaningfl measre of how well the model fits the data, and the ability to predict new vales. It even provides a self-check in plots of the residals to help s avoid silly mistakes. Like all models, thogh, linear models don t apply all the time, so we d better think abot whether they re reasonable. It makes no sense to make a scatterplot of categorical variables, and even less to perform a regression on them. Always check the Qantitative Variables Condition to be sre a regression is appropriate. he linear model makes several assmptions. First, and foremost, is the Linearity Assmption that the relationship between the variables is, in fact, linear. Yo can t verify an assmption, bt yo can check the associated condition. A qick look at the scatterplot will help yo check the Straight Enogh Condition. Yo don t need a perfectly straight plot, bt it mst be straight enogh for the linear model to make sense. If yo try to model a crved relationship with a straight line, yo ll sally get exactly what yo deserve. If the scatterplot is not straight enogh, stop here. Yo can t se a linear model for any two variables, even if they are related. hey mst have a linear association, or the model won t mean a thing. For the standard deviation of the residals to smmarize the scatter, all the residals shold share the same spread, so we need the Eqal Variance Assmption. he Does the Plot hicken? Condition checks for changing spread in the scatterplot. Check the Otlier Condition. Otlying points can dramatically change a regression model. Otliers can even change the sign of the slope, misleading s abot the nderlying relationship between the variables. We ll see examples in the next chapter. Even thogh we ve checked the conditions in the scatterplot of the data, a scatterplot of the residals can sometimes help s see any violations even more R 2 10 If yo see an of 100%, it s a good idea to figre ot what happened. Yo may have discovered a new law of Physics, bt it s mch more likely that yo accidentally regressed two variables that measre the same thing.

A ale of wo Regressions 185 clearly. And examining the residals is the best way to look for additional patterns and interesting qirks in the data. Protein x = 17.2 g s x = 14.0 g r = 0.83 A ale of wo Regressions Fat y = 23.5 g s y = 16.4 g Regression slopes may not behave exactly the way yo d expect at first. Or regression model for the Brger King sandwiches was fat = 6.8 + 0.97 protein. hat eqation allowed s to estimate that a sandwich with 30 grams of protein wold have 35.9 grams of fat. Sppose, thogh, that we knew the fat content and wanted to predict the amont of protein. It might seem natral to think that by solving or eqation for protein we d get a model for predicting protein from fat. Bt that doesn t work. Or original model is yn = b 0 + b 1 x, bt the new one needs to evalate an xn based on a vale of y. here s no y in or original model, only yn, and that makes all the difference. Or model doesn t fit the BK data vales perfectly, and the least sqares criterion focses on the vertical errors the model makes in sing to model y not on horizontal errors related to x. Aqick look at the eqations reveals why. Simply solving or eqation for x wold give a new line whose slope is the reciprocal of ors. o model y in terms rs y of x, or slope is b 1 =. o model x in terms of y, we d need to se the slope b 1 = rs s x x s y. Notice that it s not the reciprocal of ors. If we want to predict protein from fat, we need to create that model. he slope is b 1 = 10.832114.02 16.4 = 0.709 grams of protein per gram of fat. he eqation trns ot to be protein = 0.55 + 0.709 fat, so we d predict that a sandwich with 35.9 grams of fat shold have 26.0 grams of protein not the 30 grams that we sed in the first eqation. Moral of the story: hink. (Where have yo heard that before?) Decide which variable yo want to se (x) to predict vales for the other (y). hen find the model that does that. If, later, yo want to make predictions in the other direction, yo ll need to start over and create the other model from scratch. SEP-BY-SEP EXAMPLE Regression Even if yo hit the fast food joints for lnch, yo shold have a good breakfast. Ntritionists, concerned abot empty calories in breakfast cereals, recorded facts abot 77 cereals, inclding their Calories per serving and Sgar content (in grams). Qestion: How are calories and sgar content related in breakfast cereals? Plan State the problem and determine the role of the variables. Variables Name the variables and report the W s. I am interested in the relationship between sgar content and calories in cereals. I ll se Sgar to estimate Calories. Ç Qantitative Variables Condition: I have two qantitative variables, Calories and Sgar content per serving, measred on 77 breakfast cereals. he nits of measrement are calories and grams of sgar, respectively.

186 CHAPER 8 Linear Regression Check the conditions for a regression by making a pictre. Never fit a regression withot looking at the scatterplot first. Calories 150 120 90 60 4 8 12 Sgar (g) Ç Otlier Condition: here are no obvios otliers or grops. Ç he Straight Enogh Condition is satisfied; I will fit a regression model to these data. Ç he Does the Plot hicken? Condition is satisfied. he spread arond the line looks abot the same throghot. Mechanics If there are no clear violations of the conditions, fit a straight line model of the form yn = b 0 + b 1 x to the data. Smmary statistics give the bilding blocks of the calclation. Find the slope. Find the intercept. Write the eqation, sing meaningfl variable names. State the vale of R 2. Calories y = 107.0 calories Sgar x = 7.0 grams Correlation r = 0.564 s y = 19.5 calories s x = 4.4 grams b 1 = rs y s x = 0.564(19.5) 4.4 = 2.50 calories per gram of sgar. b 0 = y - b 1 x = 107-2.50(7) = 89.5 calories. So the least sqares line is yn = 89.5 + 2.50 x or Calories = 89.5 + 2.50 Sgar. Sqaring the correlation gives R 2 = 0.564 2 = 0.318 or 31.8%. Conclsion Describe what the model says in words and nmbers. Be sre to se the names of the variables and their nits. he key to interpreting a regression model is to start with the phrase b 1 y-nits per x-nit, sbstitting the estimated vale of the slope for and the names of the b 1 he scatterplot shows a positive, linear relationship and no otliers. he slope of the least sqares regression line sggests that cereals have abot 2.50 Calories more per additional gram of Sgar.

A ale of wo Regressions he intercept predicts that sgar-free cereals wold average abot 89.5 calories. R2 gives the fraction of the variability of y acconted for by the linear regression model. he R2 says that 31.8% of the variability in Calories is acconted for by variation in Sgar content. Find the standard deviation of the residals, se, and compare it to the original sy. se = 16.2 calories. hat s smaller than the original SD of 19.5, bt still fairly large. Check Again Even thogh we looked at the scatterplot before fitting a regression model, a plot of the residals is essential to any regression analysis becase it is the best check for additional patterns and interesting qirks in the data. Residals (calories) AGAIN respective nits. he intercept is then a starting or base vale. 187 20 0 20 40 90 Residals plots. See how the residals plot changes as yo drag points arond in a scatterplot. I ips 100 110 120 Predicted (calories) he residals show a horizontal direction, a shapeless form, and roghly eqal scatter for all predicted vales. he linear model appears to be appropriate. Regression lines and residals plots By now yo will not be srprised to learn that yor calclator can do it all: scatterplot, regression line, and residals plot. Let s try it sing the Arizona State tition data from the last chapter. (I ips, p. 149) Yo shold still have that saved in lists named and. First, recreate the scatterplot. 1. Find the eqation of the regression line. Actally, yo already fond the line when yo sed the calclator to get the correlation. Bt this time we ll be a little fancier so that we can display the line on or scatterplot. We want to tell the calclator to do the regression and save the eqation of the model as a graphing variable. Under choose. Specify that x and y are and, as before, bt... Now add a comma and one more specification. Press men, choose, and finally(!) choose Hit., go to the. here s the eqation. he calclator tells yo that the regression line is tit = 6440 + 326 year. Can yo explain what the slope and y-intercept mean? 2. Add the line to the plot. When yo entered this command, the calclator atomatically saved the eqation as. Jst hit to see the line drawn across yor scatterplot.

188 CHAPER 8 Linear Regression 3. Check the residals. Remember, yo are not finished ntil yo check to see if a linear model is appropriate. hat means yo need to see if the residals appear to be randomly distribted. o do that, yo need to look at the residals plot. his is made easy by the fact that the calclator has already placed the residals in a list named. Want to see them? Go to and look throgh the lists. (If is not already there, go to the first blank list and import the name from yor men. he residals shold appear.) Every time yo have the calclator compte a regression analysis, it will atomatically save this list of residals for yo. 4. Now create the residals plot. Set p as a scatterplot with and. Before yo try to see the plot, go to the screen. By moving the crsor arond and hitting in the appropriate places yo can trn off the regression line and, and trn on. will now graph the residals plot. Uh-oh! See the crve? he residals are high at both ends, low in the middle. Looks like a linear model may not be appropriate after all. Notice that the residals plot makes the crvatre mch clearer than the original scatterplot did. Moral: Always check the residals plot! So a linear model might not be appropriate here. What now? he next two chapters provide techniqes for dealing with data like these. Reality Check: Is the Regression Reasonable? Adjective, Non, or Verb Yo may see the term regression sed in different ways. here are many ways to fit a line to data, bt the term regression line or regression withot any other qalifiers always means least sqares. People also se regression as a verb when they speak of regressing a y-variable on an x-variable to mean fitting a linear model. Statistics don t come ot of nowhere. hey are based on data. he reslts of a statistical analysis shold reinforce yor common sense, not fly in its face. If the reslts are srprising, then either yo ve learned something new abot the world or yor analysis is wrong. Whenever yo perform a regression, think abot the coefficients and ask whether they make sense. Is a slope of 2.5 calories per gram of sgar reasonable? hat s hard to say right off. We know from the smmary statistics that a typical cereal has abot 100 calories and 7 grams of sgar. A gram of sgar contribtes some calories (actally, 4, bt yo don t need to know that), so calories shold go p with increasing sgar. he direction of the slope seems right. o see if the size of the slope is reasonable, a sefl trick is to consider its order of magnitde. We ll start by asking if deflating the slope by a factor of 10 seems reasonable. Is 0.25 calories per gram of sgar enogh? he 7 grams of sgar fond in the average cereal wold contribte less than 2 calories. hat seems too small. Now let s try inflating the slope by a factor of 10. Is 25 calories per gram reasonable? hen the average cereal wold have 175 calories from sgar alone. he average cereal has only 100 calories per serving, thogh, so that slope seems too big. We have tried inflating the slope by a factor of 10 and deflating it by 10 and fond both to be nreasonable. So, like Goldilocks, we re left with the vale in the middle that s jst right. And an increase of 2.5 calories per gram of sgar is certainly plasible. he small effort of asking yorself whether the regression eqation is plasible is repaid whenever yo catch errors or avoid saying something silly or absrd abot the data. It s too easy to take something that comes ot of a compter at face vale and assme that it makes sense. Always be skeptical and ask yorself if the answer is reasonable.

Connections 189 R 2 WHA CAN GO WRONG? does not mean that protein acconts for 69% of the fat in a BK food item. It is the variation in fat content that is acconted for by the linear model. here are many ways in which data that appear at first to be good candidates for regression analysis may be nsitable. And there are ways that people se regression that can lead them astray. Here s an overview of the most common problems. We ll discss them at length in the next chapter. Don t fit a straight line to a nonlinear relationship. Linear regression is sited only to relationships that are, well, linear. Fortnately, we can often improve the linearity easily by sing re-expression. We ll come back to that topic in Chapter 10. Beware of extraordinary points. Data points can be extraordinary in a regression in two ways: hey can have y-vales that stand off from the linear pattern sggested by the blk of the data, or extreme x-vales. Both kinds of extraordinary points reqire attention. Don t extrapolate beyond the data. A linear model will often do a reasonable job of smmarizing a relationship in the narrow range of observed x-vales. Once we have a working model for the relationship, it s tempting to se it. Bt beware of predicting y-vales for x-vales that lie otside the range of the original data. he model may no longer hold there, so sch extrapolations too far from the data are dangeros. Don t infer that x cases y jst becase there is a good linear model for their relationship. When two variables are strongly correlated, it is often tempting to assme a casal relationship between them. Ptting a regression line on a scatterplot tempts s even frther, bt it doesn t make the assmption of casation any more valid. For example, or regression model predicting hrricane wind speeds from the central pressre was reasonably sccessfl, bt the relationship is very complex. It is reasonable to say that low central pressre at the eye is responsible for the high winds becase it draws moist, warm air into the center of the storm, where it swirls arond, generating the winds. Bt as is often the case, things aren t qite that simple. he winds themselves also contribte to lowering the pressre at the center of the storm as it becomes a hrricane. Understanding casation reqires far more work than jst finding a correlation or modeling a relationship. Don t choose a model based on alone. Althogh R 2 measres the strength of the linear association, a high R 2 does not demonstrate the appropriateness of the regression. A single otlier, or data that separate into two grops rather than a single clod of points, can make R 2 seem qite large when, in fact, the linear regression model is simply inappropriate. Conversely, a low R 2 vale may be de to a single otlier as well. It may be that most of the data fall roghly along a straight line, with the exception of a single point. Always look at the scatterplot. R 2 CONNECIONS We ve talked abot the importance of models before, bt have seen only the Normal model as an example. he linear model is one of the most important models in Statistics. Chapter 7 talked abot the assignment of variables to the y- and x-axes. hat didn t matter to correlation, bt it does matter to regression becase y is predicted by x in the regression model. he connection of R 2 to correlation is obvios, althogh it may not be immediately clear that jst by sqaring the correlation we can learn the fraction of the variability of y acconted for by a regression on x. We ll retrn to this in sbseqent chapters. We made a big fss abot knowing the nits of yor qantitative variables. We didn t need nits for correlation, bt withot the nits we can t define the slope of a regression. A regression makes no sense if yo don t know the Who, the What, and the Units of both yor variables. We ve smmed sqared deviations before when we compted the standard deviation and variance. hat s not coincidental. hey are closely connected to regression. When we first talked abot models, we noted that deviations away from a model were often interesting. Now we have a formal definition of these deviations as residals.

190 CHAPER 8 Linear Regression WHA HAVE WE LEARNED? We ve learned that when the relationship between qantitative variables is fairly straight, a linear model can help smmarize that relationship and give s insights abot it: he regression (best fit) line doesn t pass throgh all the points, bt it is the best compromise in the sense that the sm of sqares of the residals is the smallest possible. We ve learned several things the correlation, r, tells s abot the regression: he slope of the line is based on the correlation, adjsted for the nits of x and y: b 1 = rs y s x erms Model Linear model We ve learned to interpret that slope in context: For each SD of x that we are away from the x mean, we expect to be r SDs of y away from the y mean. Becase r is always between -1 and +1, each predicted y is fewer SDs away from its mean than the corresponding x was, a phenomenon called regression to the mean. he sqare of the correlation coefficient, R 2, gives s the fraction of the variation of the response acconted for by the regression model. he remaining 1 - R 2 of the variation is left in the residals. he residals also reveal how well the model works: If a plot of residals against predicted vales shows a pattern, we shold re-examine the data to see why. he standard deviation of the residals, s e, qantifies the amont of scatter arond the line. Of corse, the linear model makes no sense nless the Linearity Assmption is satisfied. We check the Straight Enogh Condition and Otlier Condition with a scatterplot, as we did for correlation, and also with a plot of residals against either the x or the predicted vales. For the standard deviation of the residals to make sense as a smmary, we have to make the Eqal Variance Assmption. We check it by looking at both the original scatterplot and the residal plot for the Does the Plot hicken? Condition. 172. An eqation or formla that simplifies and represents reality. 172. A linear model is an eqation of a line. o interpret a linear model, we need to know the variables (along with their W s) and their nits. Predicted vale 172. he vale of yn fond for a given x-vale in the data. A predicted vale is fond by sbstitting the x-vale in the regression eqation. he predicted vales are the vales on the fitted line; the points 1x, yn2 all lie exactly on the fitted line. Residals Least sqares 172. Residals are the differences between data vales and the corresponding vales predicted by the regression model or, more generally, vales predicted by any model. Residal = observed vale - predicted vale = e = y - yn 172. he least sqares criterion specifies the niqe line that minimizes the variance of the residals or, eqivalently, the sm of the sqared residals. Regression to the mean 174. Becase the correlation is always less than 1.0 in magnitde, each predicted yn tends to be fewer standard deviations from its mean than its corresponding x was from its mean. his is called regression to the mean. Regression line Line of best fit 174. he particlar linear eqation yn = b 0 + b 1 x that satisfies the least sqares criterion is called the least sqares regression line. Casally, we often jst call it the regression line, or the line of best fit.