Introduction to regression

Introduction to regression

Regression describes how one variable (the response) depends on another variable (the explanatory variable).

Response variable: the variable of interest; it measures the outcome of a study.
Explanatory variable: explains (or even causes) changes in the response variable.

Examples:
Hearing difficulties: response - sound level (decibels), explanatory - age (years)
Real estate market: response - listing price ($), explanatory - house size (sq. ft.)
Salaries: response - salary ($), explanatory - experience (years), education, sex

Least squares regression, Jan 4, 4

Introduction to regression

Example: Food expenditures and income

Data: Sample of households

[Figure: scatterplot of food expenditure against income]

Questions:
How does food expenditure (Y) depend on income (X)?
Suppose we know that X = x. What can we tell about Y?

Linear regression: If the response Y depends linearly on the explanatory variable X, we can use a straight line (the regression line) to predict Y from X.

Least squares regression

How to find the regression line

[Figure: scatterplot of food expenditure against income with the fitted line; an enlarged detail marks the observed $y$, the predicted $\hat{y}$, and the difference $y - \hat{y}$]

Since we intend to predict Y from X, the errors of interest are mispredictions of Y for fixed X.

The least squares regression line of Y on X is the line that minimizes the sum of squared errors.

For observations $(x_1, y_1), \ldots, (x_n, y_n)$, the regression line is given by

$\hat{Y} = a + b X$

where

$b = r \, \frac{s_y}{s_x}$ and $a = \bar{y} - b \, \bar{x}$

($r$: correlation coefficient; $s_x$, $s_y$: standard deviations; $\bar{x}$, $\bar{y}$: means).
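The two formulas for the slope and intercept can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the function name `least_squares_line` and the six data points are hypothetical and are not the household sample from the lecture. It computes a and b from the correlation, standard deviations, and means, then compares them with a direct least squares fit.

```python
import numpy as np


def least_squares_line(x, y):
    """Return (a, b) for y_hat = a + b*x using b = r*s_y/s_x and a = ybar - b*xbar."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]   # correlation coefficient
    s_x = x.std(ddof=1)           # sample standard deviation of x
    s_y = y.std(ddof=1)           # sample standard deviation of y
    b = r * s_y / s_x
    a = y.mean() - b * x.mean()
    return a, b


# Hypothetical data for illustration only.
income = np.array([20.0, 35.0, 40.0, 55.0, 60.0, 80.0])
food = np.array([4.1, 6.0, 7.2, 9.5, 10.1, 14.0])

a, b = least_squares_line(income, food)

# Cross-check: np.polyfit minimizes the sum of squared errors directly
# and returns the coefficients with the slope first.
b_check, a_check = np.polyfit(income, food, deg=1)

print(f"formula: a = {a:.4f}, b = {b:.4f}")
print(f"polyfit: a = {a_check:.4f}, b = {b_check:.4f}")
```

Both routes should agree up to floating-point error, which illustrates that the closed-form coefficients are the minimizers of the sum of squared errors.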

Least squares regression

Example: Food expenditure and income

X   8 6 3 4 54 59 44 3 4 8
Y   5. 5. 5.6 4.6.3 8. 7.8 5.8 5. 8.

X   4 58 8 4 47 85 3 6
Y   4.9.8 5. 4.8 7.9 6.4. 3.7 5..9

The summary statistics are:

$\bar{x} = 45.5$, $s_x = 23.96$, $\bar{y} = 7.97$, $s_y = 4.66$, $r = 0.946$

The regression coefficients are:

$b = r \, \frac{s_y}{s_x} = 0.946 \cdot \frac{4.66}{23.96} = 0.184$

$a = \bar{y} - b \, \bar{x} = 7.97 - 0.184 \cdot 45.5 = -0.40$

[Figure: scatterplot of food expenditure against income with the fitted regression line]

Interpreting the regression model

The response in the model is denoted $\hat{Y}$ to indicate that these are predicted Y values, not the true Y values. The hat denotes prediction.

The slope of the line indicates how much $\hat{Y}$ changes for a unit change in X.

The intercept is the value of $\hat{Y}$ for X = 0. It may or may not have a physical interpretation, depending on whether or not X can take values near 0.

To make a prediction for an unobserved X, just plug it in and calculate $\hat{Y}$.

Note that the line need not pass through the observed data points. In fact, it often will not pass through any of them.

Regression and correlation

Correlation analysis: We are interested in the joint distribution of two (or more) quantitative variables.

Example: Heights of 1,078 fathers and sons

[Figure: scatterplot of son's height (inches) against father's height (inches) with the SD line]

Points are scattered around the SD line:

$(y - \bar{y}) = \frac{s_y}{s_x} (x - \bar{x})$

The SD line goes through the center $(\bar{x}, \bar{y})$ and has slope $s_y / s_x$.

The correlation r measures how much the points spread around the SD line.
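The SD line has slope s_y / s_x, whereas the least squares line of Y on X has slope r s_y / s_x, so the regression line is flatter whenever |r| < 1. The sketch below illustrates this in Python with simulated heights. The simulation parameters (68, 2.7, 35, 0.5, 2.4) and the variable names are assumptions chosen for illustration; this is not the father-son data set from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated (hypothetical) heights in inches, not the original data.
father = rng.normal(68.0, 2.7, size=1000)
son = 35.0 + 0.5 * father + rng.normal(0.0, 2.4, size=1000)

r = np.corrcoef(father, son)[0, 1]
sd_slope = son.std(ddof=1) / father.std(ddof=1)   # slope of the SD line
reg_slope = r * sd_slope                          # slope of the regression line

# Cross-check the regression slope against a direct least squares fit.
fit_slope = np.polyfit(father, son, deg=1)[0]

print(f"r = {r:.3f}")
print(f"SD line slope         = {sd_slope:.3f}")
print(f"regression line slope = {reg_slope:.3f}")
```

Both lines pass through the point of means; they differ only in slope, by the factor r.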

Regression and correlation

Regression analysis: We are interested in how the distribution of one response variable depends on one (or more) explanatory variables.

Example: Heights of 1,078 fathers and sons

[Figure: scatterplot of son's height (inches) against father's height (inches) with vertical strips marked, and density plots of son's height for selected father's heights (including 64 and 68 inches); a second scatterplot shows the regression line]

In each vertical strip, the points are distributed around the regression line.

Properties of least squares regression

The distinction between explanatory and response variables is essential. Looking at vertical deviations means that changing the axes would change the regression line.

[Figure: scatterplot of son's height (inches) against father's height (inches) showing two different regression lines, $\hat{y} = a + b x$ and $\hat{x} = a' + b' y$]

A change of 1 sd in X corresponds to a change of r sds in the predicted Y.

The least squares regression line always passes through the point $(\bar{x}, \bar{y})$.

$r^2$ (the square of the correlation) is the fraction of the variation in the values of y that is explained by the least squares regression on x. When reporting the results of a linear regression, you should report $r^2$.

These properties depend on the least-squares fitting criterion and are one reason why that criterion is used.
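These properties can be verified on any data set. The Python sketch below uses a small hypothetical sample (the eight height pairs are invented for illustration) and checks three things: the line passes through the point of means, r squared equals the explained fraction of the variation in y, and regressing x on y gives a different line from regressing y on x.

```python
import numpy as np

# Hypothetical heights in inches, for illustration only.
x = np.array([64.0, 66.0, 67.0, 68.0, 69.0, 70.0, 72.0, 74.0])
y = np.array([66.0, 66.5, 68.0, 67.5, 69.0, 70.0, 70.5, 71.5])

b_yx, a_yx = np.polyfit(x, y, deg=1)   # regression of y on x
b_xy, a_xy = np.polyfit(y, x, deg=1)   # regression of x on y
r = np.corrcoef(x, y)[0, 1]

# Property 1: the fitted line passes through (x_bar, y_bar).
passes_through_means = bool(np.isclose(a_yx + b_yx * x.mean(), y.mean()))

# Property 2: r^2 is the fraction of the variation in y explained by the fit.
y_hat = a_yx + b_yx * x
explained_fraction = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# Property 3: drawn in the same (x, y) axes, the x-on-y line has slope 1/b_xy,
# which equals b_yx only when r^2 = 1, because b_yx * b_xy = r^2.
slope_x_on_y_in_xy_axes = 1.0 / b_xy

print(passes_through_means, explained_fraction, r ** 2)
print(b_yx, slope_x_on_y_in_xy_axes)
```

The identity b_yx * b_xy = r^2 follows from b_yx = r s_y / s_x and b_xy = r s_x / s_y.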

Regression effect

The regression effect: In virtually all test-retest situations, the bottom group on the first test will on average show some improvement on the second test, and the top group will on average fall back. This is the regression effect.

The statistician and geneticist Sir Francis Galton (1822-1911) called this effect "regression to mediocrity".

[Figure: scatterplot of son's height (inches) against father's height (inches)]

Regression fallacy: Thinking that the regression effect must be due to something important, not just the spread around the line, is the regression fallacy.
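A short simulation makes the regression effect concrete. The Python sketch below assumes a simple test-retest model in which each score is a stable ability plus independent noise; the model, the sample size, and all parameter values are assumptions made for illustration. Nothing changes between the two tests, yet the bottom group improves on average and the top group falls back.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Assumed model: score = stable ability + independent measurement noise.
ability = rng.normal(100.0, 10.0, size=n)
test1 = ability + rng.normal(0.0, 10.0, size=n)
test2 = ability + rng.normal(0.0, 10.0, size=n)

# Select the bottom and top 10% on the first test.
bottom = test1 <= np.quantile(test1, 0.10)
top = test1 >= np.quantile(test1, 0.90)

# Average change from the first test to the second in each group.
gain_bottom = (test2[bottom] - test1[bottom]).mean()
gain_top = (test2[top] - test1[top]).mean()

print(f"bottom group average change: {gain_bottom:+.2f}")
print(f"top group average change:    {gain_top:+.2f}")
```

Attributing the bottom group's gain to an intervention, when the simulation contains none, would be an instance of the regression fallacy.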

Regression in STATA

. infile food income size using food.txt
. graph twoway scatter food income || lfit food income, legend(off)
>     ytitle(food)
. regress food income

      Source |       SS       df       MS         Number of obs =
-------------+------------------------------     F(, 8)        =    5.97
       Model |  369.57965          369.57965     Prob > F      =       .
    Residual |   43.77536    8       .438756     R-squared     =    .894
-------------+------------------------------     Adj R-squared =    .888
       Total |    43.3455    9        .75564     Root MSE      =   .5594

------------------------------------------------------------------------------
        food |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |     .8499     .49345     .33      .        .57336      .5486
       _cons |   -.49994   .7637666    -.54   .596         -.663       .965
------------------------------------------------------------------------------

[Figure: scatterplot of food expenditure against income with the fitted line]

This graph has been generated using the graphical user interface of STATA. The complete command is:

. twoway (scatter food income, msymbol(circle) msize(medium) mcolor(black))
>     (lfit food income, range( ) clcolor(black) clpat(solid) clwidth(medium)),
>     ytitle(food expenditure, size(large)) ylabel(, valuelabel angle(horizontal)
>     labsize(medlarge)) xtitle(income, size(large)) xscale(range( ))
>     xlabel((), labsize(medlarge)) legend(off) ysize() xsize(3)
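The entries of a regression table such as the one printed by regress are tied together by a few identities. The sketch below is a hedged illustration in Python rather than Stata; the function name `ols_summary` and the data are hypothetical, and it does not reproduce the output above. It computes the sums of squares, R-squared, the F statistic, the t statistic of the slope, and the root MSE for a simple regression, so that relations such as F = t^2 and R-squared = r^2 can be checked.

```python
import numpy as np


def ols_summary(x, y):
    """Key quantities of a simple linear regression of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    b, a = np.polyfit(x, y, deg=1)             # slope and intercept
    resid = y - (a + b * x)
    ss_res = np.sum(resid ** 2)                # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)       # total sum of squares
    ss_model = ss_tot - ss_res                 # model sum of squares (1 df)
    ms_res = ss_res / (n - 2)                  # residual mean square (n - 2 df)
    se_b = np.sqrt(ms_res / np.sum((x - x.mean()) ** 2))
    return {
        "coef": b,
        "cons": a,
        "r_squared": ss_model / ss_tot,
        "F": ss_model / ms_res,
        "t": b / se_b,
        "root_mse": np.sqrt(ms_res),
    }


# Hypothetical data for illustration only.
income = np.array([20.0, 35.0, 40.0, 55.0, 60.0, 80.0, 95.0, 30.0])
food = np.array([4.1, 6.0, 7.2, 9.5, 10.1, 14.0, 18.2, 7.5])

summary = ols_summary(income, food)
for name, value in summary.items():
    print(f"{name:>10s} = {value:.4f}")
```

With a single explanatory variable, the F statistic equals the square of the slope's t statistic, and R-squared equals the squared correlation between the two variables.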

Residual plots

Residuals: difference of observed and predicted values

$e_i = \text{observed } y - \text{predicted } y = y_i - \hat{y}_i = y_i - (a + b \, x_i)$

For a least squares regression, the residuals always have mean zero.

Residual plot: A residual plot is a scatterplot of the residuals against the explanatory variable. It is a diagnostic tool to assess the fit of the regression line.

Patterns to look for:
Curvature indicates that the relationship is not linear.
Increasing or decreasing spread indicates that the prediction will be less accurate in the range of explanatory variables where the spread is larger.
Points with large residuals are outliers in the vertical direction.
Points that are extreme in the x direction are potential high influence points.

Influential observations are individuals with extreme x values that exert a strong influence on the position of the regression line. Removing them would significantly change the regression line.
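The definition of the residuals and the notion of an influential observation can be illustrated with a few lines of Python. The data below are hypothetical: six points lie close to a line and a seventh point is extreme in the x direction. The sketch computes the residuals, confirms that their mean is zero, and refits the line without the extreme point to show how much the slope changes.

```python
import numpy as np

# Hypothetical data: the last point is extreme in the x direction.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 20.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 8.0])

# Fit with all points and compute residuals e_i = y_i - (a + b*x_i).
b_all, a_all = np.polyfit(x, y, deg=1)
residuals = y - (a_all + b_all * x)
residual_mean = residuals.mean()

# Refit without the extreme-x point to see its influence on the line.
b_without, a_without = np.polyfit(x[:-1], y[:-1], deg=1)

print(f"mean of residuals      = {residual_mean:.2e}")
print(f"slope with all points  = {b_all:.3f}")
print(f"slope without the last = {b_without:.3f}")
```

Plotting `residuals` against `x` would give the residual plot described above; a large shift in slope after removing a single point marks that point as influential.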

Regression Diagnostics

Example: First data set

[Figure: scatterplot of Y against X with the fitted line, and residual plot against the fitted values]

Residuals are regularly distributed.

Regression Diagnostics

Example: Second data set

[Figure: scatterplot of Y against X with the fitted line, and residual plot against the fitted values]

Functional relationship other than linear.

Regression Diagnostics

Example: Third data set

[Figure: scatterplot of Y against X with the fitted line, and residual plot against the fitted values]

Outlier; the regression line misfits the majority of the data.

Regression Diagnostics

Example: Fourth data set

[Figure: scatterplot of Y against X with the fitted line, and residual plot against the fitted values]

Heteroscedasticity.

Regression Diagnostics

Example: Fifth data set

[Figure: scatterplot of Y against X with the fitted line, and residual plot against the fitted values]

One separate point in the x direction, highly influential.