STAT 201 Chapter 3. Association and Regression

Similar documents
Unit 8 Day 1 Correlation Coefficients.notebook January 02, 2018

Chapter 3 CORRELATION AND REGRESSION

STATISTICS INFORMED DECISIONS USING DATA

M 140 Test 1 A Name SHOW YOUR WORK FOR FULL CREDIT! Problem Max. Points Your Points Total 60

Section 3.2 Least-Squares Regression

Chapter 3: Describing Relationships

3.2 Least- Squares Regression

3.4 What are some cautions in analyzing association?

Examining Relationships Least-squares regression. Sections 2.3

Homework #3. SHORT ANSWER. Write the word or phrase that best completes each statement or answers the question.

Lecture 12: more Chapter 5, Section 3 Relationships between Two Quantitative Variables; Regression

3.2A Least-Squares Regression

Further Mathematics 2018 CORE: Data analysis Chapter 3 Investigating associations between two variables

Chapter 3: Examining Relationships

Lecture 6B: more Chapter 5, Section 3 Relationships between Two Quantitative Variables; Regression

1.4 - Linear Regression and MS Excel

STATS Relationships between variables: Correlation

AP Statistics Practice Test Ch. 3 and Previous

CHAPTER 3 Describing Relationships

BIVARIATE DATA ANALYSIS

Pearson Education Limited Edinburgh Gate Harlow Essex CM20 2JE England and Associated Companies throughout the world

STATISTICS 201. Survey: Provide this Info. How familiar are you with these? Survey, continued IMPORTANT NOTE. Regression and ANOVA 9/29/2013

Chapter 4: More about Relationships between Two-Variables

Chapter 1: Exploring Data

Math 075 Activities and Worksheets Book 2:

Chapter 4: More about Relationships between Two-Variables Review Sheet

INTERPRET SCATTERPLOTS

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES

Relationships. Between Measurements Variables. Chapter 10. Copyright 2005 Brooks/Cole, a division of Thomson Learning, Inc.

Lab 4 (M13) Objective: This lab will give you more practice exploring the shape of data, and in particular in breaking the data into two groups.

How Faithful is the Old Faithful? The Practice of Statistics, 5 th Edition 1

Beware of Confounding Variables

Caffeine & Calories in Soda. Statistics. Anthony W Dick

EXECUTIVE SUMMARY DATA AND PROBLEM

5 To Invest or not to Invest? That is the Question.

Unit 1 Exploring and Understanding Data

Reminders/Comments. Thanks for the quick feedback I ll try to put HW up on Saturday and I ll you

2.75: 84% 2.5: 80% 2.25: 78% 2: 74% 1.75: 70% 1.5: 66% 1.25: 64% 1.0: 60% 0.5: 50% 0.25: 25% 0: 0%

STATISTICS & PROBABILITY

Math 124: Module 2, Part II

STAT 135 Introduction to Statistics via Modeling: Midterm II Thursday November 16th, Name:

Lecture 12 Cautions in Analyzing Associations

Results & Statistics: Description and Correlation. I. Scales of Measurement A Review

12.1 Inference for Linear Regression. Introduction

Chapter 4: Scatterplots and Correlation

UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences Midterm Test February 2016

Pitfalls in Linear Regression Analysis

CHILD HEALTH AND DEVELOPMENT STUDY

Business Statistics Probability

M 140 Test 1 A Name (1 point) SHOW YOUR WORK FOR FULL CREDIT! Problem Max. Points Your Points Total 75

MEASURES OF ASSOCIATION AND REGRESSION

What is Data? Part 2: Patterns & Associations. INFO-1301, Quantitative Reasoning 1 University of Colorado Boulder

CRITERIA FOR USE. A GRAPHICAL EXPLANATION OF BI-VARIATE (2 VARIABLE) REGRESSION ANALYSISSys

Stat 13, Lab 11-12, Correlation and Regression Analysis

AP Statistics. Semester One Review Part 1 Chapters 1-5

Part 1. For each of the following questions fill-in the blanks. Each question is worth 2 points.

A response variable is a variable that. An explanatory variable is a variable that.

Regression Equation. November 29, S10.3_3 Regression. Key Concept. Chapter 10 Correlation and Regression. Definitions

Unit 8 Bivariate Data/ Scatterplots

Homework 2 Math 11, UCSD, Winter 2018 Due on Tuesday, 23rd January

Simple Linear Regression One Categorical Independent Variable with Several Categories

Section 6: Analysing Relationships Between Variables

Descriptive Methods: Correlation

c. Construct a boxplot for the data. Write a one sentence interpretation of your graph.

Chapter 14: More Powerful Statistical Methods

about Eat Stop Eat is that there is the equivalent of two days a week where you don t have to worry about what you eat.

Regression. Regression lines CHAPTER 5

14.1: Inference about the Model

Unit 3 Lesson 2 Investigation 4

Homework Linear Regression Problems should be worked out in your notebook

Chapter 4. More On Bivariate Data. More on Bivariate Data: 4.1: Transforming Relationships 4.2: Cautions about Correlation

1. What is the difference between positive and negative correlations?

CHAPTER ONE CORRELATION

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Introduction to regression

Still important ideas

Readings: Textbook readings: OpenStax - Chapters 1 13 (emphasis on Chapter 12) Online readings: Appendix D, E & F

Today: Binomial response variable with an explanatory variable on an ordinal (rank) scale.

bivariate analysis: The statistical analysis of the relationship between two variables.

STATISTICS 8 CHAPTERS 1 TO 6, SAMPLE MULTIPLE CHOICE QUESTIONS

Chapter 3 Review. Name: Class: Date: Multiple Choice Identify the choice that best completes the statement or answers the question.

HW 3.2: page 193 #35-51 odd, 55, odd, 69, 71-78

Chapter 11 Nonexperimental Quantitative Research Steps in Nonexperimental Research

Chapter 3, Section 1 - Describing Relationships (Scatterplots and Correlation)

Identify two variables. Classify them as explanatory or response and quantitative or explanatory.

Section 3 Correlation and Regression - Teachers Notes

Correlational Method. Does ice cream cause murder, or murder cause people to eat ice cream? As more ice cream is eaten, more people are murdered.

4.2 Cautions about Correlation and Regression

CHAPTER TWO REGRESSION

Exemplar for Internal Assessment Resource Mathematics Level 3. Resource title: Sport Science. Investigate bivariate measurement data

Stat Wk 9: Hypothesis Tests and Analysis

Problem 1) Match the terms to their definitions. Every term is used exactly once. (In the real midterm, there are fewer terms).

LAB ASSIGNMENT 4 INFERENCES FOR NUMERICAL DATA. Comparison of Cancer Survival*

STAT445 Midterm Project1

F1: Introduction to Econometrics

STP 231 Example FINAL

Eating and Sleeping Habits of Different Countries

Chapter 1 Where Do Data Come From?

IAPT: Regression. Regression analyses

7. Bivariate Graphing

Transcription:

STAT 201 Chapter 3 Association and Regression 1

Association of Variables Two Categorical Variables Response Variable (dependent variable): the outcome variable whose variation is being studied Explanatory Variable (independent variable): the groups to be compared with respect to values on the response variable Example 1: Response: Survival Status; Explanatory: Smoking Status Example 2: Response: Happiness Level; Explanatory: Income Level 2

Definition Association: An association exists between two variables if a particular value for one variable is more likely to occur with certain values of the other variable Contingency table: A display for two categorical variables. Its rows list the categories of one variable and its columns list categories of the other variable. 3

Contingency Table: Example Two Variables Would you keep or turn in a $100 if you found it on the library floor? Do you recycle (cans / bottles)? Keep It Turn It In Total No Recycle 17 8 25 Recycle 30 34 64 Total 47 42 89 4

Contingency Table: Example Counts Percent Keep It Turn It In Total No Recycle 17 8 25 Recycle 30 34 64 Total 47 42 89 Keep It Turn It In Total No Recycle 17/89 8/89 8/89 Recycle 30/89 34/89 64/89 Total 47/89 42/89 89/89 Keep It Turn It In Total Conditional Percent No Recycle 17/25 8/25 25/25 Recycle 30/64 34/64 64/64 5

Contingency Table: Example Counts Percent Keep It Turn It In Total No Recycle 17 8 25 Recycle 30 34 64 Total 47 42 89 Keep It Turn It In Total No Recycle 19.1% 8.989% 8.99% Recycle 33.71% 38.2% 71.91% Total 52.81% 47.19% 100% Keep It Turn It In Total Conditional Percent No Recycle 68% 32% 100% Recycle 46.88% 53.13% 100% 6

Contingency Table: Example Keep It Turn It In Total No Recycle 68% 32% 100% Recycle 46.88% 53.13% 100% Total 52.81% 47.19% 100% With conditional percentage contingency table, does there appear to be an association between recycling and turning in money found on the floor? Yes it appears that a larger percent of people who do the recycle are willing to turn the $100 in compared to those that keep it 7

What About Two Quantitative Variables? We use a scatterplot to examine the association between the two quantitative variables To form a scatterplot we let the response variable be the Y variable and the explanatory variable be the X variable and just plot the points in coordinate system 8

9

Coefficient of Correlation Coefficient of correlation, denoted by letter r, measures the LINEAR relationship between X and Y [linear, linear, linear!!!] r > 0 indicates a positive linear correlation r < 0 indicates a negative linear correlation r=1 indicates a perfect positive linear correlation r=-1 indicates a perfect negative linear correlation r=0 indicates no linear correlation 10

Properties 1 r 1 The closer r is to 1, the stronger the evidence for positive linear correlation The closer r is to -1, the stronger the evidence for a negative linear correlation The closer r is to 0, the weaker the evidence for linear correlation r is affected by outliers, so we have to be careful 11

Coefficient of Correlation: Example 12

Lines A line is the shortest distance between two points. It has no curve, no thickness and it extends to both negative and positive infinitely. The equation has the form y = a + bx a is called intercept. It is the value of y when x is zero. b is called slope. When x increases/decreases one unit, y would increase/decrease b units. It measures how the line changes. 13

Example y = a + bx y = 1 + 2x a =1 is the intercept, the value of y when x=0 b = 2 is the slope, when x increases/decreases one unit, the value of y increases/decreases two units. 14

Example I add a green line y = 1 + 0.5x Compare to the red line y = 1 + 2x, the slope changes from 2 to 0.5. Can you see the difference? 15

Example I add a orange line y = 1.5 + 2x Compare to the red line y = 1 + 2x, the intercept changes from 1 to 1.5. Can you see the difference? 16

Example Red line y = 1 + 2x Green line y = 1 + 0.5x Orange line y = 1.5 + 2x 17

Regression Regression analysis is a statistical method for estimating the relationships among two quantitative variables. Regression Line predicts the value of y (response variable), as a straight line function of the value of x (explanatory variable) 18

Regression Why do we need regression? Let s consider the situation that we want to know % of fat in Shiwen s body. It s hard to measure % of fat, instead we can collect his blood pressure easily. If we know the regression line between % of fat and blood pressure, we can use Shiwen s blood pressure to estimate his % of fat. E.g. (% of fat) = -120+ (blood pressure) 19

Regression line y = b 0 + b 1 x b 0 is the intercept (the value of y when x=0) b 1 is the slope of the line (the amount that y changes when x increases by one unit) y is the predicted value Residual = (the real y) y 20

Regression R 2 R 2, given in the regression output, gives the percent of variation in y explained by x Note: R 2 = r 2 Note: r = R 2 or r = R 2 21

Regression The scatterplot must show a fairly linear relationship A rule of thumb is to look for a coefficient of correlation, r >.7 or r < -.7 22

Regression: Calories and Fat in Hamburger 23

Regression: Example 1) How many grams of fat do you expect a hamburger with 1000 calories to have? Plug in 1000 for calories and see what the fat is 24

Regression: Example 1) How many grams of fat do you expect a hamburger with 1000 calories to have? Plug in 1000 for calories and see what the fat is Fat = 12.9 + 0.0814 calories = 12.9 + 0.0814 1000 = 68.5 25

Regression: Example 2) Write a thorough interpretation of the slope. Fat = 12.9 + 0.0814 calories Here, the slope is.0814 So, with every unit increase in calories we expect a.0814 unit increase in grams of fat on average 26

Regression: Example 3) Write a thorough interpretation of regression R 2 R 2 =.9789 97.89% of the variation in fat is explained by calories 27

Regression: Example 4) Write a thorough interpretation of coefficient of correlation r r = R 2 =.9789 =.9894 since r is very close to one, Fat and Calories have a very strong positive linear correlation 28

Regression: Example 5) The hamburger with 1000 calories actually has 68 grams of fat. What is the residual? Residual = (the real y) y Residual = 68 68.5 = - 0.5 29

Regression: Beer Example In creating beer, yeast and sugar react to create alcohol. The more sugar and yeast you add the more alcohol level. It would make sense that the more alcohol in the beer, the more carbohydrates, so that more calories. Do you agree my statements? Let s show it statistically. 30

Regression: Beer Example Here, we see a positive correlation. (moderate or strong?) 31

Regression: Beer Example 32

Regression: Beer Example The regression line is (Calories) = 16.37+ 27 * (Alcohol %) The intercept is 16.37 and the slope is 27 Regression R 2 is 0.87 Correlation Coefficient r is 0.93 Can you interpret them? Write it down! 33

Regression: Beer Example Intercept: when the alcohol percentage is 0, we expect the bear has calories 16.37. Slope: for each percentage increase in alcohol level there is an increase of 27 calories on average Regression R 2 : 86.86% of the variation in calories is explained by alcohol percentage Correlation Coefficient r: there is a strong linear relationship between alcohol percentage and calories. (it confirms our visual observation!) 34

Regression: Beer Example If we wanted to estimate the calories of Rogue Dead Guy Ale, we can plug in its alcohol percentage into the equation to find an estimate of the calories. Rogue Dead Guy Ale has alcohol % to be 6.6%. we can plug it in to find the estimated calories of a bottle of Rogue Dead Guy Ale. Calories = 16.37 + 27 Alcohol% = 16.37 + 27 6.6 = 194.57 35

Regression: Beer Example So, the estimated amount of calories for Rogue Dead Guy is 194.57. In fact, the actual amount of calories is 198 so our estimate isn t perfect, but very close! The residual is the difference Real value y(estimated value) = 198-194.57=3.43. 36

Something you need to be careful Extrapolation: we don t want to predict using x values different than the known data 37

Extrapolation 38

Something you need to be careful Lurking Variables: a variable that we don t look at that causes the correlation. Confounding: a study occurs when the effects of two or more variables are mixed together. It is often caused by a lurking variable. 39

Confounding and Lurking Variable It s hot outside Where do you go? Beach! Swimming in the sea! What do you eat? Ice cream! We had a hot summer, so there would be more swimming and eating ice cream, and thus, more drowning deaths. If someone wasn t carful and claims The increase sales of ice cream causes drowning deaths. What do you say? 40

Something you need to be careful Influential Outliers a single point can really change the fit of the regression line always check for stray points in the scatterplot Correlation does not imply causation 41