Biology 345: Biometry, Fall 2005
SONOMA STATE UNIVERSITY
Lab Exercise 5: Residuals and Multiple Regression


Introduction

In this exercise, we will gain experience assessing scatterplots in regression and work with examples of multiple regression to illustrate its principles.

Objectives
- Learn how to assess scatterplots and residuals in a single linear regression.
- Use an example of multiple regression with no intercorrelation among X variables to illustrate principles of multiple regression.
- Use an example of multiple regression with intercorrelation among X variables to illustrate additional principles of multiple regression.
- Learn to interpret JMP output for multiple regression analyses.

Exercise 1 - Assessing scatterplots and residuals to identify problems in linear regression

Quinn and Keough (pp. 97-98) and the JMP manual authors (pp. 245-247) propose that researchers use scatterplots of X values, Y values, and residuals to examine the results of regression analyses. We performed similar assessments as part of last week's lab when we identified outliers in our measurement data. Today, we will examine four pairs of variables and assess the utility of scatterplots for identifying potential problems with certain data and linear regression.

Opening the data file
Open the file scatterplotsf03.jmp. You will see four pairs of ExampleX and ExampleY variables.

Conducting regressions
Select the Analyze and Fit Y by X commands, and drag ExampleX1 into the X, Factor box and ExampleY1 into the Y, Response box. Click the red triangle under Bivariate Fit of ExampleY1 By ExampleX1 and select Fit Line. You will see the predicted linear relationship between the two variables. Make a table recording the example number, slope, intercept, and r square from your output window. Repeat this procedure for all four pairs of variables. What do you conclude about the regression statistics?
Now examine each scatterplot closely and write a brief description of whether one of the following issues applies: a typical bivariate normal distribution of values for both variables, a nonlinear relationship between X and Y, outliers on the Y axis, or extreme values on the X axis. Once you have completed this, click the red triangle under Linear Fit and select Plot Residuals for each regression. For each example, write an additional brief description of how the pattern of residuals relates to your first assessment. Finally, determine whether your examples conform to the four examples in the Anscombe data set (page 97 of your text), and if so, identify which one.
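The Anscombe data set mentioned above is the classic demonstration that identical regression statistics can hide very different problems. The values in scatterplotsf03.jmp are not reproduced here, so the Python sketch below uses Anscombe's published quartet instead; it shows that all four pairs yield essentially the same slope, intercept, and r square, even though only the first resembles a well-behaved bivariate sample:

```python
import numpy as np

# Anscombe's quartet (Anscombe 1973): four x/y pairs with nearly
# identical regression statistics but very different scatterplots.
x123 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], float)
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

def fit_stats(x, y):
    """Least-squares slope, intercept, and r square of y on x."""
    slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    intercept = y.mean() - slope * x.mean()
    r2 = np.corrcoef(x, y)[0, 1] ** 2
    return slope, intercept, r2

stats = [fit_stats(x, y) for x, y in
         [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]]
for slope, intercept, r2 in stats:
    # all four lines print slope near 0.500, intercept near 3.00, r2 near 0.67
    print(f"slope={slope:.3f}  intercept={intercept:.2f}  r2={r2:.2f}")
```

This is exactly why the residual plots in this exercise matter: the fitted line and its summary statistics cannot distinguish the four cases, but the residual patterns can.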

Exercise 2 - Using an example to illustrate principles of multiple regression

We will use experimental data from my research on beetle/host plant relationships to illustrate principles of multiple regression. This data set is useful because there are relatively few predictor variables (only 3), and they are all potentially important. Thus, we can concentrate on the differences between multiple and single linear regression and postpone discussion of selecting X variables for regression models until a later exercise. In addition to the actual predictor variables, which are intercorrelated, we will also work with three uncorrelated predictor variables to study the effects of intercorrelation on regression coefficients and correlations. [1]

Opening the data file
Open the file calif_beetle_survival.jmp. In August 1989 I performed a series of experiments comparing growth and survival of beetle larvae on four willow species in the laboratory and field. I wanted to know whether larval survival was related to the concentration of chemicals in the leaves that beetles use to make a defensive secretion. But I also wanted to know whether the nutritional quality of the plants was important. I measured:
1. water content of the leaves [mass of H2O in leaves/total leaf mass]
2. mean growth rate of larvae in the laboratory on each plant [Ln(final mass - initial mass)/# days]
3. the total amount of host plant salicylates in the leaves [log(mg of salicylates/leaf mass in g)]
4. dependent variable: the average survival of beetle larvae per plant in nature (% survival)
I originally created the three variables Prin1, Prin2, and Prin3 during an exploratory analysis of these data, which I included in my Ph.D. dissertation.
Regressing the uncorrelated predictor variables onto survival

Conducting single linear regressions
Construct a table with six columns: variable name, slope, intercept, correlation coefficient, r square, and MS ERR. Use the Analyze and Multivariate commands with the four variables Prin1, Prin2, Prin3, and survival to obtain correlations between the three X variables and Y. Note whether the X variables appear to be intercorrelated. Use the Fit Y by X platform to conduct three regressions, one of each X variable onto survival. Fill in your table with the regression output.

[1] The uncorrelated predictor variables were constructed through a multivariate technique that we will cover later in the semester, Principal Components Analysis. We will discuss how this analysis constructs uncorrelated variables at a later time.
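The table entries JMP produces for each single regression can all be computed from a few sums of squares. The Python sketch below, on made-up data (the beetle file is not reproduced here), computes the same six quantities for one predictor; the final line checks two identities that always hold for least squares, r square = slope^2 * var(x)/var(y) and mean residual = 0:

```python
import numpy as np

# Made-up single predictor and response, standing in for one Prin
# variable regressed onto survival.
rng = np.random.default_rng(9)
n = 25
x = rng.normal(size=n)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=n)

slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()
r = np.corrcoef(x, y)[0, 1]          # correlation coefficient
r2 = r ** 2                          # r square
resid = y - (intercept + slope * x)
ms_err = np.sum(resid ** 2) / (n - 2)  # residual mean square, df = n - 2

print(slope, intercept, r, r2, ms_err)
```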

Conducting multiple linear regression
Select the Analyze and Fit Model commands, drag Prin1, Prin2, and Prin3 into the Construct Model Effects box, and drag survival into the Y box. Select Run Model and examine the output.

Interpreting the regression output
At the top of the output, you will see a plot of the predicted versus actual values of Y. To the right of this are leverage plots, which we will discuss later. There are many similarities to the output from single linear regression. The Summary of Fit window shows the value for multiple R square, as well as the adjusted value (R square adj). Finally, the Summary of Fit window shows the mean of the dependent (response) variable and the number of observations. The ANOVA table shows the Sum of Squares explained by the regression variable on the Model line. The total Sum of Squares is shown on the C Total line. Finally, the SS shown on the Error line of the ANOVA table represents the remaining variation not explained by the regression variable. The Parameter Estimates window shows the intercept and partial regression coefficient estimates for each variable, their standard errors, and t tests for the significance of the slopes. In the Effects Tests window below this, the Sum of Squares attributed to each regression variable and an F test are given. Using a hand calculator, confirm that the Sum of Squares values for each variable, added to those of the others, equal the total Sum of Squares for the multiple regression model (the Model line) in the Analysis of Variance window.

Calculating predicted and residual values
Click the red triangle at the top of the output window (next to Response survival), and select Save Columns and Predicted Values. Click the red triangle again and save the residuals. You should see two new columns added to the data set, corresponding to the regression and error variables.
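The ANOVA bookkeeping described above can be checked outside JMP. The Python sketch below fits a three-predictor least-squares model to made-up data (the beetle data are not reproduced here; X and y are placeholders) and confirms that the Model and Error sums of squares add up to C Total, with multiple r square equal to their ratio:

```python
import numpy as np

# Made-up predictors standing in for Prin1-Prin3 and a made-up response.
rng = np.random.default_rng(5)
n = 30
X = rng.normal(size=(n, 3))
y = 1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

design = np.column_stack([np.ones(n), X])        # intercept + predictors
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

predicted = design @ coef                        # JMP's saved Predicted Values
residual = y - predicted                         # JMP's saved Residuals

ss_total = np.sum((y - y.mean()) ** 2)           # C Total line
ss_model = np.sum((predicted - y.mean()) ** 2)   # Model line
ss_error = np.sum(residual ** 2)                 # Error line
r2 = ss_model / ss_total                         # multiple r square

print(ss_model + ss_error, ss_total)             # the two values agree
print(r2)
```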
Calculating regression coefficients
To obtain values for the intercorrelation among X variables in the multiple regression, click the red triangle at the top of the output window (next to Response survival), then select Estimates and Correlation of Estimates. The values are shown at the bottom of the output. To obtain partial correlation coefficients, return to the data table, select Analyze and Multivariate, and drag your X variables and Y into the Y, Columns box. In the red triangle next to Multivariate in the output window, select Partial Correlations. You will see two correlation matrices at the top of the output window: one for the raw correlations between the X variables and Y, and another for the partial correlations in the multiple regression.

Calculating Beta coefficients
For some reason, it is much harder to obtain beta coefficients in JMP 4 than it was in JMP 3. We will not conduct this analysis here, but you can do it by creating new columns holding the standardized values of each independent variable and the dependent variable (in the formula editor, use the Col Standardize function). Then you can rerun the regression

using the standardized variables. The slope values in that multiple regression are the beta coefficients.

Illustrating general regression principles
Return to your data table and locate the column window on the left side of the table. Scroll down until you see the Predicted survival and Residual survival variables. Now use the Analyze and Distribution commands to obtain distributions for three variables: survival, Predicted survival, and Residual survival. Examine the moments carefully. Which principles of regression, covered in lecture and shown on the handout on multiple regression principles, are illustrated here? Now click the red triangle beneath each variable name and select Display and More Moments. What additional principle of regression is now evident? Next select the Analyze and Fit Y by X commands, place Prin1, Prin2, Prin3, and Predicted survival into the X box and Residual survival into the Y box, then click the red triangle under each bivariate fit and select Fit Line. Which additional multiple regression principle is now evident?

Multiple regression when X's are uncorrelated
Return to the output of your multiple regression analysis and compare the partial regression slopes shown there to the slope values from the individual regressions that you wrote into your table above. Calculate the sum of the r square values from the individual regressions and compare this value to the multiple r square. What additional principle of multiple regression is illustrated here?

Multiple regression when X's are correlated
Now we will analyze the relationship between the original independent variables (water content, growth, host plant chemistry) and survival. First, we will run the single regressions. Construct another table with six columns: variable name, slope, intercept, correlation coefficient, r square, and MS ERR.
Use the Analyze and Multivariate commands with the four variables watcont, growth, logsaln, and survival to obtain correlations between the three X variables and Y. Note that the X variables are intercorrelated. Use the Fit Y by X platform to conduct three regressions, one of each X variable onto survival. Fill in your table with the regression output. To run the multiple regression, select the Analyze and Fit Model commands, drag watcont, growth, and logsaln into the Construct Model Effects box, and drag survival into the Y box. Select Run Model and compare the partial regression slopes shown here to the slope values from the individual regressions. Calculate the sum of the r square values from the individual regressions and compare this value to the multiple r square. What principle of multiple regression is illustrated here?

Exercise 3 - Comparison of multiple regression to simple regression models

Advantages of multiple regression
When one suspects that several factors influence the dependent variable, it can be better to include all of them. Multiple regression allows a more sensitive test for multiple effects

than a series of single regressions. Specifically, the effects of each independent variable are adjusted for the effects of the other independent variables.

Effects of intercorrelation
Multiple regression works best when the independent variables are at most mildly intercorrelated. If there is high intercorrelation, it is difficult to separate the contribution of each independent variable to explaining the variation in the dependent variable. With high intercorrelation, two independent variables are largely measuring the same thing, and it is not really fair to consider them separately. Another problem is that the partial regression coefficients become very unstable and susceptible to outliers. See the JMP manual (pages 316-319) and Quinn and Keough (pages 127-129) for more on the collinearity problem. If there is no intercorrelation, single regressions would yield the same information as the multiple regression.

Application to the data set
In the data we are analyzing, we can test the effect of host plant salicylates on larval survival while adjusting for larval growth and water content. The multiple regression reveals new information because the single regression does not account for intercorrelation among plant characteristics. Specifically, high-salicylate plants tended to have low water contents (and low beetle growth). We may fail to detect a relationship between salicylate content and survival because the effect of salicylates is cancelled out by the low water content of high-salicylate plants. In the multiple regression, both variables are included, and the effect of salicylate content is adjusted for the relationship between water content and survival.
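The contrast between uncorrelated and correlated predictors can be checked numerically. The Python sketch below, on made-up data (no lab data are reproduced here), compares the sum of the single-regression r square values with the multiple r square: the two match when the predictors are exactly uncorrelated, and the sum exceeds the multiple r square when the predictors share a common component:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

def multiple_r2(X, y):
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    pred = design @ coef
    return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

def single_r2_sum(X, y):
    return sum(np.corrcoef(X[:, j], y)[0, 1] ** 2 for j in range(X.shape[1]))

# Two exactly uncorrelated predictors (centered, then orthogonalized).
Xu = rng.normal(size=(n, 2))
Xu -= Xu.mean(axis=0)
Xu[:, 1] -= (Xu[:, 0] @ Xu[:, 1]) / (Xu[:, 0] @ Xu[:, 0]) * Xu[:, 0]
yu = Xu @ np.array([1.0, -1.0]) + rng.normal(scale=0.5, size=n)
sum_u, mult_u = single_r2_sum(Xu, yu), multiple_r2(Xu, yu)

# Two strongly correlated predictors sharing a common component z.
z = rng.normal(size=n)
Xc = np.column_stack([z + 0.3 * rng.normal(size=n),
                      z + 0.3 * rng.normal(size=n)])
yc = Xc @ np.array([1.0, 1.0]) + rng.normal(scale=0.5, size=n)
sum_c, mult_c = single_r2_sum(Xc, yc), multiple_r2(Xc, yc)

print(sum_u, mult_u)   # equal when the X's are uncorrelated
print(sum_c, mult_c)   # the sum is larger when the X's are correlated
```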
Comparing the multiple regression to the single regressions
Return to the output from your multiple regression model (which appears as a Fit Least Squares window), click the red triangle next to Response survival, and select Save Columns and Effect Leverage Pairs. This operation saves the data from the leverage plots shown on the right-hand side of the regression output. Save your file as calif_beetle_survivalmod.jmp. The Effect Leverage plots show the relationship between each predictor variable and Y after adjustment for the other predictor variables in the model. Data from multiple regressions are often presented this way (see Fig. 6.4, page 126 of Quinn and Keough for an example). Use the Fit Y by X platform to conduct three regressions of each X variable onto survival. Compare the degree of scatter around the regression lines in the single regressions to that observed in the leverage plots (which you can also examine using the Fit Y by X platform). In which plots do you see less scatter: the leverage plots from the multiple regression or the simple linear regressions? Write a brief paragraph describing how the effect of each variable depended on whether it was regressed against survival alone or included in a multiple regression model.

Tasks
1. Evaluate the four scatterplots in the file scatterplotsf03.jmp. Write a brief description of whether any of the scatterplots indicate problems with the data or with the relationship between X and Y. Use residual plots to help you evaluate the plots.
2. Learn to interpret the output from a multiple regression.

3. Describe how two examples conform to the principles of multiple regression discussed in class: one without intercorrelation among independent variables and the other with intercorrelation. You will include some tables here.
4. Write a brief paragraph summarizing your observations about the scatter around the regression lines in the leverage plots from a multiple regression versus single linear regressions, and describing how the effect of each independent variable depended on whether it was regressed against the dependent variable alone or included in a multiple regression model.
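The leverage plots behind Task 4 have a precise numerical meaning: plot the residuals of Y regressed on the other predictors against the residuals of one predictor regressed on those same predictors, and the slope of that simple regression equals the predictor's partial slope in the full model (a standard least-squares result, often called the Frisch-Waugh-Lovell theorem). A Python sketch on made-up data:

```python
import numpy as np

# Made-up, deliberately intercorrelated predictors and response.
rng = np.random.default_rng(1)
n = 50
X = rng.normal(size=(n, 3))
X[:, 1] += 0.6 * X[:, 0]                       # induce intercorrelation
y = 2.0 + 1.5 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] \
    + rng.normal(scale=0.3, size=n)

def ols(design, y):
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

full = ols(np.column_stack([np.ones(n), X]), y)   # intercept, b1, b2, b3

# Leverage-plot construction for predictor 0: adjust y and x0 for the
# other predictors, then regress one set of residuals on the other.
others = np.column_stack([np.ones(n), X[:, 1:]])
ry = y - others @ ols(others, y)                  # y adjusted for the others
rx = X[:, 0] - others @ ols(others, X[:, 0])      # x0 adjusted for the others
leverage_slope = (rx @ ry) / (rx @ rx)

print(full[1], leverage_slope)                    # the two slopes match
```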

Principles of multiple regression
- The mean of the RV (the regression variable, i.e., the predicted values) is equal to the mean of Y.
- The mean of the error variable equals zero.
- SS RV + SS Err = SS Y
- Multiple r2 = SS RV / SS Y
- The error variable is uncorrelated with the RV and with the X variables.

If the X variables are not intercorrelated:
- Partial regression slopes in the multiple regression are equal to the slopes of the X variables individually regressed on Y.
- The sum of the squared correlation coefficients = multiple r2.

If the X variables are intercorrelated:
- Partial regression slopes in the multiple regression differ from the slopes of the X variables individually regressed on Y.
- The sum of the squared correlation coefficients > multiple r2.
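These principles can be verified numerically for any ordinary least-squares fit. The Python sketch below, on made-up data, checks each one: the predicted values (the RV) share Y's mean, the residuals average zero, the sums of squares add up, and the residuals are uncorrelated with the RV and with each X variable:

```python
import numpy as np

# Made-up three-predictor data set.
rng = np.random.default_rng(3)
n = 40
X = rng.normal(size=(n, 3))
y = 1.0 + X @ np.array([0.7, -0.4, 0.2]) + rng.normal(scale=0.6, size=n)

design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
rv = design @ coef              # regression variable (predicted values)
err = y - rv                    # error variable (residuals)

ss_y = np.sum((y - y.mean()) ** 2)
ss_rv = np.sum((rv - y.mean()) ** 2)
ss_err = np.sum(err ** 2)

print(rv.mean(), y.mean())      # equal: mean of RV = mean of Y
print(err.mean())               # zero: mean of error variable
print(ss_rv + ss_err, ss_y)     # equal: SS RV + SS Err = SS Y
print([np.corrcoef(err, X[:, j])[0, 1] for j in range(3)])  # all near zero
```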