Multiple Regression. James H. Steiger. Department of Psychology and Human Development, Vanderbilt University


Multiple Regression: Outline

1 The Multiple Regression Model
2 An Example: The U.N. Fertility Data
3 Effect of an Added Variable
4 Predictors and Regressors
5 Multiple Regression in Matrix Notation

The Multiple Regression Model

The simple linear regression model states that

E(Y | X = x) = β₀ + β₁x (1)

Var(Y | X = x) = σ² (2)

In the multiple regression model, we simply add one or more predictors to the system. For example, if we add a single predictor X₂, we get

E(Y | X₁ = x₁, X₂ = x₂) = β₀ + β₁x₁ + β₂x₂ (3)
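
Although these slides develop the model with the U.N. data below, a minimal sketch of fitting Equation (3) in R may help fix ideas. The names dat, y, x1, and x2 are hypothetical placeholders, not objects from these slides.

> m <- lm(y ~ x1 + x2, data = dat)  # fit Equation (3) by ordinary least squares
> coef(m)                           # estimates of β₀, β₁, β₂
> summary(m)$sigma^2                # estimate of the error variance σ²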

An Example: The U.N. Fertility Data

The UN11 dataset contains U.N. data on factors related to fertility for 199 countries. We are curious about the relationship between female life expectancy (lifeexpf) and two other variables. One potential predictor of life expectancy is fertility, measured by the average number of children per woman in the country. Another potential predictor is the general economic condition of the country. We measure that by log(ppgdp), the natural logarithm of ppgdp, the per capita gross domestic product in U.S. dollars.

An Example: The U.N. Fertility Data

We start with a simple linear regression predicting female life expectancy from fertility, another simple linear regression predicting female life expectancy from log(ppgdp), and a third regression relating the two potential predictors to each other. The scatterplots of these three pairwise relationships are called marginal plots. Data for these 3 variables, plus the names of the localities for each data point, are in the UN11 dataset.

An Example: The U.N. Fertility Data
Preliminary Analyses with Simple Regression

We begin (after loading alr4) by examining the correlations and scatterplots among the 3 variables.

> library(alr4)
> attach(un11)
> log.ppgdp <- log(ppgdp)
> cor(cbind(lifeexpf, fertility, log.ppgdp))
            lifeexpf  fertility  log.ppgdp
lifeexpf   1.0000000 -0.8235881  0.7722587
fertility -0.8235881  1.0000000 -0.7210800
log.ppgdp  0.7722587 -0.7210800  1.0000000
> pairs(~ lifeexpf + fertility + log.ppgdp)

[Scatterplot matrix of lifeexpf, fertility, and log.ppgdp]

An Example: The U.N. Fertility Data
Preliminary Analyses with Simple Regression

Both the scatterplots and the correlation matrix reveal that both fertility and log(ppgdp) are excellent predictors of lifeexpf. However, they are also excellent predictors of each other, because, up to a point, fertility decreases sharply as log(ppgdp) increases. Since the two predictors are substantially redundant, the regression coefficients in a multiple regression equation using both predictors simultaneously may differ substantially from the slopes of the two simple regression lines obtained by using only one predictor at a time.

An Example: The U.N. Fertility Data
Preliminary Analyses with Simple Regression

Given what we have seen so far, what could we say about the proportion of variability of lifeexpf explained by fertility and log.ppgdp? Each predictor alone explains the square of its correlation with lifeexpf: (−0.824)² ≈ 67.8% for fertility and (0.772)² ≈ 59.6% for log.ppgdp. A general rule in multiple regression is that adding a predictor can never decrease the proportion of variance accounted for. Moreover, if the predictors were uncorrelated, the proportions of variance would be additive, i.e., the pair of predictors would explain 67.8% + 59.6% = 127.4% of the variance, which is impossible. In this case, the substantial redundancy of the two variables means that, together, they will predict substantially less than this sum. However, since adding a predictor can never reduce the proportion of variance accounted for in a multiple regression, we know that adding log.ppgdp to fertility will result in a system that accounts for somewhere between 67.8% and 100% of the variance in lifeexpf.
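
This bound is easy to check numerically; a sketch, continuing with the variables attached on the earlier slide (the 67.8% and 59.6% figures are the squared correlations):

> r2.fert <- summary(lm(lifeexpf ~ fertility))$r.squared  # about 0.678
> r2.gdp <- summary(lm(lifeexpf ~ log.ppgdp))$r.squared   # about 0.596
> r2.both <- summary(lm(lifeexpf ~ fertility + log.ppgdp))$r.squared
> r2.both >= max(r2.fert, r2.gdp)  # TRUE: adding a predictor never lowers R²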

Effect of an Added Variable
The Added Variable Plot

Suppose we have a simple regression in which fertility is used to predict lifeexpf, and we wish to evaluate the impact of adding log.ppgdp to the regression equation. It can be shown algebraically that the unique effect of adding log.ppgdp to a mean function that already includes fertility is determined by the relationship between the part of lifeexpf that is not explained by fertility and the part of log.ppgdp that is not explained by fertility. In a linear prediction system, the unexplained parts are just the residuals from these two simple regressions.

Effect of an Added Variable
The Added Variable Plot

To examine these, we can look at the scatterplot (and the linear fit) of the residuals from the regression of lifeexpf on fertility versus the residuals from the regression of log.ppgdp on fertility. This scatterplot is called an added variable plot. The slope of the regression line in the added variable plot is identical to the OLS estimate of the regression coefficient β₂ for the added variable when the multiple regression model is fit with both predictors included. Recalling our earlier discussion of partial correlation, we realize that the added variable plot actually depicts the partial regression of lifeexpf on log.ppgdp with fertility partialled out. So a regression coefficient in multiple regression is a partial regression coefficient between a variable and the criterion, with all the other regressors partialled out of both variables.
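
This identity can be verified directly from the residuals; a sketch, continuing with the attached un11 variables:

> e.y <- resid(lm(lifeexpf ~ fertility))   # part of lifeexpf not explained by fertility
> e.x <- resid(lm(log.ppgdp ~ fertility))  # part of log.ppgdp not explained by fertility
> plot(e.x, e.y)                           # the added variable plot, drawn by hand
> coef(lm(e.y ~ e.x))[2]                   # equals the coefficient of log.ppgdp in the full model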

Effect of an Added Variable
Automatic Generation of Added-Variable Plots

The avPlots function (from the car package, which is loaded along with alr4) generates added variable plots automatically. Here is an example. You first specify the full model.

> m2 <- lm(lifeexpf ~ fertility + log.ppgdp)

Then you specify the terms you wish to have added variable plots for in the terms specifier.

> avPlots(m2, terms = ~ log.ppgdp)

[Added-variable plot: lifeexpf | others versus log.ppgdp | others]

Predictors and Regressors
Introduction

In Section 3.3 of ALR, Weisberg introduces a number of key ideas and nomenclature in connection with a regression model of the form

E(Y | X) = β₀ + β₁X₁ + ⋯ + βₚXₚ (4)

with

Var(Y | X) = σ² (5)

Predictors and Regressors
Introduction

Regression problems start with a collection of potential predictors. Some of these may be continuous measurements, like the height or weight of an object. Some may be discrete but ordered, like a doctor's rating of the overall health of a patient on a nine-point scale. Other potential predictors can be categorical, like eye color or an indicator of whether a particular unit received a treatment. All these types of potential predictors can be useful in multiple linear regression. A key notion is the distinction between predictors and regressors in the regression equation. In early discussions, these are often synonymous. However, we quickly learn that they need not be.

Predictors and Regressors
Types of Regressors

Many types of regressors can be created from a group of predictors. Here are some examples.

The intercept. We can rewrite the mean function on the previous slide as

E(Y | X) = β₀X₀ + β₁X₁ + ⋯ + βₚXₚ (6)

where X₀ is a term that is always equal to one. Mean functions without an intercept would not have this term included.

Predictors. The simplest type of regressor is simply one of the predictors.
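
In R, the intercept regressor is added automatically; a brief sketch using the model m2 fit earlier (the name m0 is hypothetical):

> head(model.matrix(m2))  # first column, "(Intercept)", is identically 1
> m0 <- lm(lifeexpf ~ fertility + log.ppgdp - 1)  # "- 1" fits a mean function without the intercept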

Predictors and Regressors
Types of Regressors

Transformations of predictors. Often we will transform one of the predictors to create a regressor. For example, X₁ in a previous example was the logarithm of one of the predictors.

Polynomials. Sometimes we fit curved functions by including polynomial functions of the predictor variables as regressors. So, for example, X₁ might be a predictor, and X₂ might be its square.

Interactions and other combinations of predictors. Combining several predictors is often useful. An example of this is using body mass index, given by weight in kg divided by the square of height in meters, in place of both height and weight, or using a total test score in place of the separate scores from each of several parts. Products of predictors, called interactions, are often included in a mean function along with the original predictors to allow for joint effects of two or more variables. Each of these regressor types is a one-liner in R's formula language, as sketched below.
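
A sketch of these regressor types, using the un11 variables (the models here are illustrative, not analyses from these slides):

> m.log <- lm(lifeexpf ~ log(ppgdp))                   # transformation of a predictor
> m.poly <- lm(lifeexpf ~ fertility + I(fertility^2))  # polynomial: X and its square
> m.int <- lm(lifeexpf ~ fertility * log.ppgdp)        # main effects plus their interaction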

Predictors and Regressors
Types of Regressors

Dummy variables and factors. A categorical predictor with two or more levels is called a factor. Factors are included in multiple linear regression using dummy variables, which are typically regressors that take only two values, often 0 and 1, indicating which category is present for a particular observation. A categorical predictor with two categories can be represented by one dummy variable, while a categorical predictor with many categories can require several dummy variables.

Regression splines. Polynomials represent the effect of a predictor by using a sum of regressors, like β₁x + β₂x² + β₃x³. We can view this as a linear combination of basis functions, given in the polynomial case by the functions (x, x², x³). Using splines is similar to fitting a polynomial, except we use different basis functions that can have useful properties under some circumstances.
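
A sketch of both regressor types in R (the factor group is assumed here to be a categorical variable in un11, purely for illustration; ns comes from the splines package):

> library(splines)
> m.fac <- lm(lifeexpf ~ group)                  # a factor with k levels expands to k - 1 dummy variables
> m.spl <- lm(lifeexpf ~ ns(fertility, df = 4))  # natural cubic spline basis in place of polynomials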

Predictors and Regressors
Types of Regressors

Principal components. In some cases, one has many predictors, and there are substantial redundancies between them. For reasons we shall explore a bit later in the course, this situation can lead to computational problems and lack of clarity in the analysis. Data reduction methods can help us in this kind of situation, by reducing the large number of predictors to a smaller set that captures the common variability shared by those predictors. Principal components are linear combinations of the predictors that are themselves uncorrelated, yet capture the maximum possible variance in the predictor set.
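
A minimal sketch of principal components as regressors, using the two (admittedly redundant) un11 predictors for illustration:

> pc <- prcomp(cbind(fertility, log.ppgdp), scale. = TRUE)
> round(cor(pc$x), 10)              # the component scores are uncorrelated
> m.pc <- lm(lifeexpf ~ pc$x[, 1])  # regress on the first principal component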

Predictors and Regressors
Types of Regressors

It is easy to see that a multiple regression with k predictors may have fewer than k regressors (as when principal components replace a redundant set) or more than k regressors (as when polynomials, dummy variables, or interactions are added).

Multiple Regression in Matrix Notation

As we present additional results in multiple regression, it will prove useful to have at our disposal some basic matrix algebra, which we will present in the next two modules.