Biology 345: Biometry, Fall 2005
SONOMA STATE UNIVERSITY
Lab Exercise 5: Residuals and multiple regression

Introduction

In this exercise, we will gain experience assessing scatterplots in regression and will work through examples that illustrate the principles of multiple regression.

Objectives

1. Learn how to assess scatterplots and residuals in a single linear regression.
2. Use an example of multiple regression with no intercorrelation among X variables to illustrate principles of multiple regression.
3. Use an example of multiple regression with intercorrelation among X variables to illustrate additional principles of multiple regression.
4. Learn to interpret JMP output for multiple regression analyses.

Exercise 1 - Assessing scatterplots and residuals to identify problems in linear regression

Quinn and Keough (pp. 97-98) and the JMP manual authors (pp. 245-247) propose that researchers use scatterplots of X values, Y values, and residuals to examine the results of regression analyses. We performed similar assessments as part of last week's lab when we identified outliers in our measurement data. Today, we will examine four pairs of variables and assess the utility of scatterplots for identifying potential problems with certain data and linear regression.

Opening the data file

Open the file scatterplotsf03.jmp. You will see four pairs of ExampleX and ExampleY variables.

Conducting regressions

Select the Analyze and Fit Y by X commands, and drag ExampleX1 into the X, Factor box and ExampleY1 into the Y, Response box. Click the red triangle under Bivariate Fit of ExampleY1 By ExampleX1 and select Fit Line. You will see the predicted linear relationship between the two variables. Make a table indicating the example number, slope, intercept, and r square from your output window. Repeat this procedure for all four pairs of variables. What do you conclude about the regression statistics?
Now examine each scatterplot closely and write a brief description of whether any of the following issues applies: a typical bivariate normal distribution of values for both variables, a nonlinear relationship between X and Y, outliers on the Y axis, or extreme values on the X axis. Once you have completed this, click the red triangle under Linear Fit and select Plot Residuals for each regression. For each example, write an additional brief description of how the pattern of residuals relates to your first assessment. Finally, determine whether your examples conform to the four examples in the Anscombe data set (page 97 of your text), and if so, identify which one.
Exercise 2 - Using an example to illustrate principles of multiple regression

We will use experimental data from my research on beetle/host plant relationships to illustrate principles of multiple regression. This data set is useful because there are relatively few predictor variables (only 3), and they are all potentially important. Thus, we can concentrate on the differences between multiple and single linear regression and postpone discussion of the selection of X variables for regression models until a later exercise. In addition to the actual predictor variables, which are intercorrelated, we will also work with three uncorrelated predictor variables to study the effects of intercorrelation on regression coefficients and correlations.[1]

Opening the data file

Open the file calif_beetle_survival.jmp. In August 1989 I performed a series of experiments comparing growth and survival of beetle larvae on four willow species in the laboratory and field. I wanted to know whether larval survival was related to the concentration of chemicals in the leaves that beetles used to make a defensive secretion. But I also wanted to know whether the nutritional quality of the plants was important. I measured:

1. water content of the leaves [mass of H2O in leaves/total leaf mass]
2. mean growth rate of larvae in the laboratory on each plant [ln(final mass - initial mass)/# days]
3. the total amount of host plant salicylates in the leaves [log(mg of salicylates/leaf mass in g)]
4. dependent variable: the average survival of beetle larvae per plant in nature (% survival)

I originally created the three variables Prin1, Prin2, and Prin3 during an exploratory analysis of these data, which I included in my Ph.D. dissertation.
Regressing the uncorrelated predictor variables onto survival

Conducting single linear regressions

Construct a table with six columns: variable name, slope, intercept, correlation coefficient, r square, and MS ERR. Use the Analyze and Multivariate commands with the four variables Prin1, Prin2, Prin3, and survival to obtain correlations between the three X variables and Y. Note whether the X variables appear to be intercorrelated. Use the Fit Y by X platform to conduct three regressions, one of each X variable onto survival. Fill in your table with the regression output.

[1] The uncorrelated predictor variables were constructed through a multivariate technique that we will cover later in the semester, Principal Components Analysis. We will discuss how this analysis constructs uncorrelated variables at a later time.
Conducting multiple linear regression

Select the Analyze and Fit Model commands, drag Prin1, Prin2, and Prin3 into the Construct Model Effects box, and drag survival into the Y box. Select Run Model and examine the output.

Interpreting the regression output

At the top of the output you will see a plot of the predicted versus actual values of Y. To the right of these are leverage plots, which we will discuss later. There are many similarities to the output from single linear regression. The Summary of Fit window shows the value for multiple r square, as well as the adjusted r square (RSquare Adj). Finally, the Summary of Fit window shows the mean of the dependent (response) variable and the number of observations. The ANOVA table shows the Sum of Squares explained by the regression on the Model line. The total Sum of Squares is shown on the C Total line. Finally, the SS shown on the Error line of the ANOVA table represents the remaining variation not explained by the regression. The Parameter Estimates window shows the intercept and the partial regression coefficient estimate for each variable, their standard errors, and t tests for significance of the slopes. In the Effect Tests window below this, the Sum of Squares attributed to each regression variable and an F test are indicated. Using a hand calculator, confirm that the Sum of Squares for each variable, added to that of the others, equals the Sum of Squares for the whole multiple regression model in the Analysis of Variance window.

Calculating predicted and residual values

Click the red triangle at the top of the output window (next to Response survival), and select Save Columns and Predicted Values. Click the red triangle again and save the residuals. You should see two new columns added to the data set, corresponding to the regression and error variables.
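The ANOVA bookkeeping described above can be verified with a short calculation. This sketch fits a multiple regression by ordinary least squares on simulated data (stand-ins for Prin1-Prin3 and survival, not the actual beetle data) and confirms the identity SS(Model) + SS(Error) = SS(Total):

```python
# Sketch: multiple regression by ordinary least squares, verifying that
# the Model and Error sums of squares add up to the C Total sum of
# squares. Data are simulated, not the calif_beetle_survival.jmp values.
import numpy as np

rng = np.random.default_rng(0)
n = 40
X = rng.normal(size=(n, 3))                       # three predictors
y = 2.0 + 1.5*X[:, 0] - 0.8*X[:, 1] + 0.3*X[:, 2] + rng.normal(scale=0.5, size=n)

A = np.column_stack([np.ones(n), X])              # design matrix with intercept
beta, *_ = np.linalg.lstsq(A, y, rcond=None)      # least-squares coefficients
yhat = A @ beta                                   # "Predicted" column
resid = y - yhat                                  # "Residual" column

ss_total = np.sum((y - y.mean())**2)              # C Total line in JMP's ANOVA
ss_model = np.sum((yhat - y.mean())**2)           # Model line
ss_error = np.sum(resid**2)                       # Error line
print(ss_model + ss_error, ss_total)              # the two should agree
```

The predicted and residual columns computed here correspond to the two columns JMP saves with Save Columns.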
Calculating regression coefficients

To obtain values for the intercorrelation among X variables in the multiple regression, click the red triangle at the top of the output window (next to Response survival) and select Estimates and Correlation of Estimates. The values are shown at the bottom of the output. To obtain partial correlation coefficients, return to the data table, select Analyze and Multivariate, and drag your X variables and Y into the Y, Columns box. In the red triangle next to Multivariate in the output window, select Partial Correlations. You will see two correlation matrices at the top of the output window: one for the raw correlations between the X variables and Y, and another for the partial correlations in the multiple regression.

Calculating beta coefficients

For some reason, it is much harder to obtain beta coefficients in JMP 4 than it was in JMP 3. We will not conduct this analysis here, but you can do it by creating new columns representing the standardized values for each independent variable and the dependent variable (in the formula editor this is performed by the Col Standardize function). Then you can rerun the regression
using the standardized variables. The slope values in this multiple regression represent the beta coefficients.

Illustrating general regression principles

Return to your data table and locate the column window on the left side of the table. Scroll down until you see the Predicted survival and Residual survival variables. Now use the Analyze and Distribution commands to obtain distributions for three variables: survival, Predicted survival, and Residual survival. Examine the moments carefully. Which principles of regression, covered in lecture and shown on the handout on multiple regression principles, are illustrated here? Now click the red triangle beneath each variable name and select Display and More Moments. What additional principle of regression is now evident? Now select the Analyze and Fit Y by X commands, and place Prin1, Prin2, Prin3, and Predicted survival into the X box and Residual survival into the Y box. Then click the red triangle under each bivariate fit and select Fit Line. Which additional multiple regression principle is now evident?

Multiple regression when X's are uncorrelated

Return to the output of your multiple regression analysis and compare the partial regression slopes shown there to the slope values from the individual regressions that you wrote into your table above. Calculate the sum of the r square values from the individual regressions and compare this value to the multiple r square. What additional principle of multiple regression is illustrated here?

Multiple regression when X's are correlated

Now we will analyze the relationship between the original independent variables (water content, growth, host plant chemistry) and survival. First, we will run the single regressions. Construct another table with six columns: variable name, slope, intercept, correlation coefficient, r square, and MS ERR.
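The standardize-then-refit recipe for beta coefficients described above (the Col Standardize step) can be sketched outside JMP. On simulated data, this computes z-scores for each variable, refits the regression, and checks the standard identity beta_j = b_j * (s_xj / s_y) relating betas to the raw partial slopes:

```python
# Beta (standardized) coefficients: z-score every variable, refit the
# multiple regression, and the slopes are the betas. Simulated data;
# variable roles are illustrative, not the actual lab data.
import numpy as np

rng = np.random.default_rng(1)
n = 30
X = rng.normal(size=(n, 3))
y = 1.0 + 0.9*X[:, 0] - 0.4*X[:, 1] + rng.normal(scale=0.3, size=n)

def zscore(a):
    """Standardize columns to mean 0, standard deviation 1 (Col Standardize)."""
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)

Xz, yz = zscore(X), zscore(y)
Az = np.column_stack([np.ones(n), Xz])
betas = np.linalg.lstsq(Az, yz, rcond=None)[0][1:]   # drop intercept (it is ~0)

# Cross-check: betas equal the raw partial slopes rescaled by the
# standard-deviation ratio s_x / s_y.
A = np.column_stack([np.ones(n), X])
b_raw = np.linalg.lstsq(A, y, rcond=None)[0][1:]
expected = b_raw * X.std(axis=0, ddof=1) / y.std(ddof=1)
print(betas, expected)
```

Because standardization is a linear rescaling, the two computations agree exactly up to floating-point error.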
Use the Analyze and Multivariate commands with the four variables watcont, growth, logsaln, and survival to obtain correlations between the three X variables and Y. Note that the X variables are intercorrelated. Use the Fit Y by X platform to conduct three regressions, one of each X variable onto survival. Fill in your table with the regression output. To run the multiple regression, select the Analyze and Fit Model commands, drag watcont, growth, and logsaln into the Construct Model Effects box, and drag survival into the Y box. Select Run Model and compare the partial regression slopes shown there to the slope values from the individual regressions. Calculate the sum of the r square values from the individual regressions and compare this value to the multiple r square. What principle of multiple regression is illustrated here?

Exercise 3 - Comparison of multiple regression to simple regression models

Advantages of multiple regression

When one suspects that several factors influence the dependent variable, it can be better to include all of them. A multiple regression allows a more sensitive test for multiple effects
than a series of single regressions. Specifically, the effects of each independent variable are adjusted for the effects of the other independent variables.

Effects of intercorrelation

Multiple regression is best conducted when the independent variables are at most mildly intercorrelated. If there is high intercorrelation, it is difficult to separate the contribution of each independent variable in explaining the variation in the dependent variable. With high intercorrelation, the independent variables are largely measuring the same thing, and it is not really fair to consider them separately. Another problem is that the partial regression coefficients become very unstable and susceptible to outliers. See the JMP manual (pages 316-319) and Quinn and Keough (pages 127-129) for more on the collinearity problem. If there is no intercorrelation, single regressions would yield the same information as the multiple regression.

Application to the data set

In the data we are analyzing, we can test the effect of host plant salicylates on larval survival while adjusting for larval growth and water content. The multiple regression reveals new information because the single regression does not account for intercorrelation among plant characteristics. Specifically, high-salicylate plants tended to have low water content (and low beetle growth). We may not detect a relationship between salicylate content and survival in a single regression because the effect of salicylates is cancelled out by the low water content of high-salicylate plants. In the multiple regression, both variables are included, and the effect of salicylate content is adjusted for the relationship between water content and survival.
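The cancellation effect described above can be reproduced with simulated data. In this sketch (loosely analogous to the salicylate/water-content story, but with made-up numbers), two predictors are negatively correlated and both truly affect y; the single regression of x1 on y badly understates its effect, while the multiple regression recovers it:

```python
# Sketch of the intercorrelation effect: when predictors are correlated,
# partial slopes from the multiple regression differ from the slopes of
# the single regressions. Simulated data, not the beetle data set.
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)                            # e.g. salicylate content
x2 = -0.8*x1 + rng.normal(scale=0.4, size=n)       # e.g. water content, negatively correlated
y = 1.0*x1 + 1.0*x2 + rng.normal(scale=0.2, size=n)  # both truly matter

def slope_single(x, y):
    """Slope from a single linear regression of y on x."""
    return np.polyfit(x, y, 1)[0]

A = np.column_stack([np.ones(n), x1, x2])
b_partial = np.linalg.lstsq(A, y, rcond=None)[0][1:]

print("single-regression slope of x1:", slope_single(x1, y))   # near 0.2
print("partial slope of x1:", b_partial[0])                    # near 1.0
```

The single slope is small because the negative x1-x2 relationship cancels most of x1's direct effect, just as low water content masks the salicylate effect in the single regression; the partial slope, adjusted for x2, recovers the true coefficient.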
Comparing the multiple regression to the single regressions

Return to the output from your multiple regression model (which appears as a Fit Least Squares window), click the red triangle next to Response survival, and select Save Columns and Effect Leverage Pairs. This operation saves the data from the leverage plots shown on the right-hand side of the regression output. Save your file as calif_beetle_survivalmod.jmp. The effect leverage plots show the relationship between each predictor variable and Y after adjustment for the other predictor variables in the model. Data from multiple regressions are often presented this way (see Fig. 6.4, page 126 of Quinn and Keough for an example). Use the Fit Y by X platform to conduct three regressions of each X variable onto survival. Compare the degree of scatter around the regression lines in the single regressions to that observed in the leverage plots (which you can also examine using the Fit Y by X platform). In which plots do you see less scatter: the leverage plots from the multiple regression or the single linear regressions? Write a brief paragraph describing how the effect of each variable depended on whether it was regressed against survival alone or included in a multiple regression model.

Tasks

1. Evaluate the four scatterplots found in the file scatterplotsf03.jmp. Write a brief description of whether any of the scatterplots indicate problems with the data or with the relationship between X and Y. Use residual plots to help you evaluate the plots.
2. Learn to interpret the output from a multiple regression.
3. Describe how two examples conform to the principles of multiple regression discussed in class: one without intercorrelation among independent variables and the other with intercorrelation. You will include some tables here.
4. Write a brief paragraph summarizing your observations about the scatter around the regression line for leverage plots resulting from a multiple regression versus single linear regressions, and describing how the effect of each independent variable depended on whether it was regressed against the dependent variable alone or included in a multiple regression model.
Principles of multiple regression

- The mean of the regression variable (RV, the predicted values) equals the mean of Y.
- The mean of the error variable equals zero.
- SS(RV) + SS(Error) = SS(Y)
- Multiple r² = SS(RV)/SS(Y)
- The error variable is uncorrelated with the RV and with the X variables.

If the X variables are not intercorrelated:
- Partial regression slopes in the multiple regression are equal to the slopes of the X variables individually regressed on Y.
- The sum of the squared correlation coefficients equals multiple r².

If the X variables are intercorrelated:
- Partial regression slopes in the multiple regression differ from the slopes of the X variables individually regressed on Y.
- The sum of the squared correlation coefficients exceeds multiple r².
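The uncorrelated-X principles above can be checked numerically. This sketch builds three exactly uncorrelated, centered predictors (via QR orthogonalization, a stand-in for principal-component scores like Prin1-Prin3) and verifies that each partial slope equals the corresponding single-regression slope and that the squared correlations sum to multiple r²:

```python
# Numerical check of the "X variables not intercorrelated" principles.
# Simulated data; QR orthogonalization of centered columns yields
# predictors with exactly zero pairwise correlation.
import numpy as np

rng = np.random.default_rng(3)
n = 50
Xc = rng.normal(size=(n, 3))
Xc -= Xc.mean(axis=0)                 # center the columns
Q, _ = np.linalg.qr(Xc)               # columns: centered and mutually uncorrelated
y = Q @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=n)

# Multiple regression.
A = np.column_stack([np.ones(n), Q])
b = np.linalg.lstsq(A, y, rcond=None)[0]
yhat = A @ b
R2 = 1 - np.sum((y - yhat)**2) / np.sum((y - y.mean())**2)

# Single regressions and squared correlations, one X at a time.
single_slopes = [np.polyfit(Q[:, j], y, 1)[0] for j in range(3)]
sum_r2 = sum(np.corrcoef(Q[:, j], y)[0, 1]**2 for j in range(3))

print("partial slopes:", b[1:])
print("single slopes: ", single_slopes)
print("sum of r2 vs multiple r2:", sum_r2, R2)
```

With orthogonal centered predictors both identities hold exactly (up to rounding); repeating the experiment with correlated columns breaks both, which is the contrast the last part of the handout asks you to observe in JMP.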