WELCOME! Lecture 11 Thommy Perlinger

1 Quantitative Methods II WELCOME! Lecture 11 Thommy Perlinger

2 Regression based on violated assumptions If any of the assumptions are violated, the estimated regression model may be inaccurate, and confidence in its interpretations and predictions decreases. The result may be, for example: inappropriate tests of the significance of coefficients (showing significance when it is not present, or vice versa); biased and inaccurate predictions of the dependent variable. So make sure to analyze your residuals and partial regression plots!

3 Assessing statistical assumptions Rules of thumb Testing assumptions must be done not only for each dependent and explanatory variable, but for the variate as well. Graphical analyses (residual plots, partial regression plots, Normal probability plots) are the most widely used methods of assessing assumptions for the variate. Remedies for problems found in the variate must be accomplished by modifying the dependent variable and/or one or more explanatory variables.

4 A decision process for multiple regression analysis (flowchart) From Stage 2. Stage 4: Select an estimation technique. Does the researcher specify the regression model, or is a procedure used that selects variables to optimize prediction? Either specification by the researcher, or selection by procedure (forward/backward/stepwise estimation, all-possible-subsets). Does the regression variate meet the assumptions of regression analysis? If no: go to Stage 2.

5 A decision process for multiple regression analysis (flowchart) Stage 1: Select objective(s) (prediction, explanation) and select variables. Stage 2 (also entered from Stage 4): Research design: sample size, power, generalizability. Additional variables: Transformations? Dummy variables? Curvilinear relationships? Interaction terms? Then Stage 4: Select an estimation technique.

6 A decision process for multiple regression analysis (flowchart) From Stage 2. Stage 4: Select an estimation technique. Does the researcher specify the regression model, or is a procedure used that selects variables to optimize prediction? Either specification by the researcher, or selection by procedure (forward/backward/stepwise estimation, all-possible-subsets). Does the regression variate meet the assumptions of regression analysis? If no: go to Stage 2. If yes: examine statistical and practical significance (coefficient of determination R², adjusted coefficient of determination, standard error of the estimate SE_E, significance of regression coefficients).

7 Estimating the statistical significance of our model All samples are affected by random variation. Since we take only one sample and base our predictive model on that, we need to test the hypothesis that our regression model can represent the population and not just our sample. This can be done in two ways: 1) Testing the coefficient of determination R² (the variance explained) 2) Testing each regression coefficient

9 Example: Happiness The World Database of Happiness is an online registry of scientific research on the subjective appreciation of life. The average happiness score is presented for various nations. This average is based on individual responses from numerous general population surveys to a general life satisfaction (well-being) question.

10 Example: Happiness Variables. Dependent variable: Happiness (0 = dissatisfied to 10 = satisfied). Independent (explanatory) variables: GINI index (degree of inequality in the distribution of income; higher score = greater inequality); Degree of corruption in government (higher score = less corruption); Average life expectancy; Degree of democracy (higher score = more political liberties).

11 Example: Happiness
ANOVA (dependent variable: Happiness; predictors: (Constant), Life expectancy, GINI, Democracy, Corruption)
Model 1       Sum of Squares   df   Mean Square   F        Sig.
Regression    89.…             …    ….411         57.335   0.000
Residual      26.189           67   0.391
Total         115.…            …
Sig. is the P-value for the test of overall significance of the model. This model has overall significance, i.e. R² is significantly larger than zero (at least one of the explanatory variables significantly affects happiness).

12 F value The F statistic compares the variation explained by the model with the variation that remains unexplained, each divided by its degrees of freedom (regression mean square divided by residual mean square). The baseline for comparison is a model that uses only the simple mean of the dependent variable Y. Happiness example: F = 57.3 tells us that, considering the sample used for estimation, the explanatory variables GINI, degree of corruption, degree of democracy, and life expectancy explain 57.3 times more variation per degree of freedom than is left unexplained when we move beyond the simple average of Y = happiness.
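As a rough illustration of how an F value arises from an ANOVA table, the sketch below divides the regression mean square by the residual mean square. The function name and the example numbers are hypothetical and are not taken from the happiness output.

```python
def f_statistic(ss_regression, df_regression, ss_residual, df_residual):
    """Return (F, R squared) from the sums of squares in an ANOVA table."""
    ms_regression = ss_regression / df_regression  # explained variation per df
    ms_residual = ss_residual / df_residual        # unexplained variation per df
    f_value = ms_regression / ms_residual
    r_squared = ss_regression / (ss_regression + ss_residual)
    return f_value, r_squared


# Hypothetical ANOVA table: 4 explanatory variables, 45 observations.
f_value, r_squared = f_statistic(80.0, 4, 20.0, 40)
print(f_value, r_squared)
```

The same two sums of squares give both F and R², which is why a test of the overall model is also a test of whether R² is larger than zero.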

13 Estimating the statistical significance of our model If we are comparing different models, we use the adjusted R² as a measure of how the additional explanatory variable(s) influence the predictive accuracy of the model. We also examine the standard error of the estimate, SE_E. The lower the SE_E, the better the predictive accuracy of the model.
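A minimal sketch of the two measures, assuming the usual textbook formulas (adjusted R² penalizes R² for the number of explanatory variables k; the standard error of the estimate is the square root of the residual mean square). The function names are hypothetical. The residual sum of squares 26.189 with 67 degrees of freedom comes from the happiness ANOVA table, which with four explanatory variables implies n = 72; the R² of 0.80 is an invented value for illustration only.

```python
import math


def adjusted_r_squared(r_squared, n, k):
    """Adjust R squared for n observations and k explanatory variables."""
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - k - 1)


def standard_error_of_estimate(ss_residual, n, k):
    """Square root of the residual mean square."""
    return math.sqrt(ss_residual / (n - k - 1))


# Happiness example: residual SS 26.189, n = 72, k = 4 (67 residual df).
see = standard_error_of_estimate(26.189, 72, 4)

# Hypothetical R squared, for illustration only.
adj = adjusted_r_squared(0.80, 72, 4)
print(round(see, 3), round(adj, 3))
```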

15 Significance tests of regression coefficients The other way of testing the hypothesis that our regression model can represent the population and not just our sample is to test the significance of each regression coefficient. We already know how to test if the estimated regression coefficients are significantly different from zero. $H_0: \beta_i = 0$ (The variable $X_i$ has no linear effect on Y) $H_a: \beta_i \neq 0$ (The variable $X_i$ has a linear effect on Y)

16 Example: Happiness
Coefficients (dependent variable: Happiness)
Model 1            B        Std. Error   Beta    t        Sig.
(Constant)         -2.720   0.866                -3.141   0.003
GINI               0.037    0.009        0.255   3.916    0.000
Corruption         0.186    0.050        0.363   3.680    0.000
Democracy          0.039    0.066        0.052   0.598    0.552
Life expectancy    0.090    0.011        0.639   8.080    0.000
B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient. The coefficients are significant for all explanatory variables but degree of democracy.
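The test statistic behind each Sig. value is the coefficient divided by its standard error. The sketch below recomputes these ratios from the rounded values printed in the happiness output, so they differ slightly from the t column. The critical value 1.996 (two-sided 5% level, 67 degrees of freedom) is an assumption typed in by hand, and the names are hypothetical.

```python
# Coefficients and standard errors as printed in the happiness output.
coefficients = {
    "GINI": (0.037, 0.009),
    "Corruption": (0.186, 0.050),
    "Democracy": (0.039, 0.066),
    "Life expectancy": (0.090, 0.011),
}

# Assumed two-sided 5% critical value of t with 67 degrees of freedom.
T_CRITICAL = 1.996


def t_ratio(b, std_error):
    """Test statistic for H0: beta_i = 0."""
    return b / std_error


significant = {
    name: abs(t_ratio(b, se)) > T_CRITICAL
    for name, (b, se) in coefficients.items()
}
print(significant)
```

Only degree of democracy fails to exceed the critical value, in line with the conclusion on the slide.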

17 Confidence interval for the regression coefficient To express the uncertainty in an estimated regression coefficient, and to obtain an estimate that generalizes to the population, a confidence interval can be created for each regression coefficient. A confidence interval makes the impact of the sample size visible: smaller samples lead to wider confidence intervals, and larger samples to narrower ones.

18 Example: Happiness
Coefficients (dependent variable: Happiness), with the 95.0% confidence interval for each regression coefficient β
Model 1            B        Std. Error   Beta    t        Sig.    Lower Bound   Upper Bound
(Constant)         -2.720   0.866                -3.141   0.003   -4.449        -0.991
GINI               0.037    0.009        0.255   3.916    0.000   0.018         0.056
Corruption         0.186    0.050        0.363   3.680    0.000   0.085         0.286
Democracy          0.039    0.066        0.052   0.598    0.552   -0.092        0.170
Life expectancy    0.090    0.011        0.639   8.080    0.000   0.068         0.113
SPSS: Analyze >> Regression >> Linear. Click Statistics, mark Confidence intervals.
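The bounds in the SPSS output can be approximated by hand as the coefficient plus or minus a critical t value times its standard error. This is a sketch only: the critical value 1.996 (95% confidence, 67 degrees of freedom) is an assumption entered by hand, the function name is hypothetical, and the inputs are the rounded values from the output, so the results agree with SPSS only to about two decimals.

```python
T_CRITICAL = 1.996  # assumed t value for 95% confidence, 67 degrees of freedom


def coefficient_interval(b, std_error, t_critical=T_CRITICAL):
    """Confidence interval b +/- t * SE for one regression coefficient."""
    half_width = t_critical * std_error
    return b - half_width, b + half_width


corruption_low, corruption_high = coefficient_interval(0.186, 0.050)
democracy_low, democracy_high = coefficient_interval(0.039, 0.066)
print(corruption_low, corruption_high)
print(democracy_low, democracy_high)
```

The interval for degree of democracy contains zero, which is the interval counterpart of its non-significant t test.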

19 A decision process for multiple regression analysis (flowchart) From Stage 2. Stage 4: Select an estimation technique. Does the researcher specify the regression model, or is a procedure used that selects variables to optimize prediction? Either specification by the researcher, or selection by procedure (forward/backward/stepwise estimation, all-possible-subsets). Does the regression variate meet the assumptions of regression analysis? If no: go to Stage 2. If yes: examine statistical and practical significance (coefficient of determination R², adjusted coefficient of determination, standard error of the estimate SE_E, significance of regression coefficients). Then identify influential observations. Deletion required?

20 Identifying influential observations Influential observations are observations that have a disproportionate effect on the regression results. The effect on the regression results can either be good, which means that the results are strengthened, or bad, which means that the results are being substantially changed. Any influential observations must be identified to assess their impact.

21 Identifying influential observations There are three basic types of influential observations: 1) Outliers = observations with large residual values. They can be identified only with respect to a specific regression model. 2) Leverage points = observations distinct from the remaining observations based on their explanatory variable values. 3) Other influential observations, that have a disproportionate effect on the regression results.

22 Identifying influential observations Procedures for identifying influential observations are becoming quite widespread, yet they are still not very well known and not frequently used in regression analysis. A good way to identify residual outliers is to look for standardized residuals exceeding 2.0 in absolute value (more than 2 standard deviations from the mean of the residuals). SPSS can calculate Leverage, a measure of how far an observation deviates from the mean of that variable. During this course, we'll focus on identifying outliers among the residuals.
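The rule of thumb for residual outliers can be sketched for a simple regression as follows. Here a standardized residual is taken to be the residual divided by the standard error of the estimate, which is an assumption about the exact definition; the data and function names are hypothetical.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for one explanatory variable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def standardized_residuals(x, y):
    """Residuals divided by the standard error of the estimate."""
    intercept, slope = fit_line(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    see = math.sqrt(sum(e ** 2 for e in residuals) / (len(x) - 2))
    return [e / see for e in residuals]


# Hypothetical data: y = 2x, except that the fifth case is far too high.
x = list(range(1, 11))
y = [2.0 * xi for xi in x]
y[4] = 18.0

z = standardized_residuals(x, y)
outliers = [i for i, zi in enumerate(z) if abs(zi) > 2.0]
print(outliers)
```

Only the fifth case (index 4) is flagged; all other standardized residuals stay well below 2 in absolute value.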

23 Identifying influential observations

24 Keeping or deleting influential observations Whether an influential observation should be deleted or kept depends on the type of observation: An error in observations or data entry should be deleted if the data cannot be corrected. A valid but exceptional observation that is explainable by an extraordinary situation should be deleted, unless variables reflecting the extraordinary situation are included in the model.

25 Keeping or deleting influential observations For an exceptional observation with no likely explanation, there is no reason for deleting the case, but also no justification for keeping it. Perform analyses with and without the observation to make a complete assessment. An observation that is ordinary in its individual characteristics but exceptional in its combination of characteristics should be kept.

26 Statistical significance and influential observations Rules of thumb Always ensure practical significance when using large samples, because the model results and regression coefficients could be deemed irrelevant even when statistically significant due just to the statistical power arising from large sample sizes. Use the adjusted R² as your measure of overall model predictive accuracy when comparing models.

27 Statistical significance and influential observations Rules of thumb, cont'd Statistical significance is required for a relationship to have validity, but statistical significance without theoretical support does not support validity. Although outliers may be easily identifiable, the other forms of influential observations, which require more specialized diagnostic methods, can have equal or even greater impact on the results.

28 A decision process for multiple regression analysis (flowchart) From Stage 2. Stage 4: Select an estimation technique. Does the researcher specify the regression model, or is a procedure used that selects variables to optimize prediction? Either specification by the researcher, or selection by procedure (forward/backward/stepwise estimation, all-possible-subsets). Does the regression variate meet the assumptions of regression analysis? If no: go to Stage 2. If yes: examine statistical and practical significance (coefficient of determination R², adjusted coefficient of determination, standard error of the estimate SE_E, significance of regression coefficients). Then identify influential observations. Deletion required? If yes: delete influential observations. If no: to Stage 5.

29 A decision process for multiple regression analysis From stage 4 Stage 5 Interpret the regression variate Evaluate the prediction equation Evaluate the relative importance of the explanatory variables Assess multicollinearity

30 A decision process for multiple regression analysis Stage 5 Stage 5: interpreting the regression variate During this stage it is time to evaluate the estimated regression coefficients for their explanation of the dependent variable.

31 Using the regression coefficients The estimated regression coefficients (the b coefficients) represent both the type of relationship (positive or negative) and the strength of the relationship between explanatory and dependent variables (the value of b). The regression coefficients have two important functions in meeting the objectives of prediction and explanation for any regression analysis.

32 Prediction The estimated regression equation can be used to calculate estimated/predicted values for the dependent variable, based on certain values for the explanatory variable(s). When a regression equation is used for prediction with a set of observations that were not used in the estimation process, it is called forecasting.

33 Confidence intervals for predicted values To express the uncertainty in a predicted/estimated value of Y based on the regression equation, you can create a confidence interval for the prediction.

34 Confidence intervals around estimated/predicted mean values Form intervals around the estimated/predicted value $\hat{y}$ to express uncertainty about the value of y for a given $x_j$. (Figure: the fitted line $\hat{y} = b_0 + b_1 x$ plotted with Y against X, showing the confidence interval for the estimated value of y, given $x_j$.)

35 Example: Happiness A simple regression model is estimated aiming to explain happiness, with life expectancy as the single explanatory variable. Y = Happiness (0-10) X = Life expectancy (years)

36 Example: Happiness
Coefficients (dependent variable: Happiness)
Model 1            B        Std. Error   Beta    t        Sig.
(Constant)         -1.960   0.724                -2.709   0.008
Life expectancy    0.114    0.010        0.807   11.446   0.000
The regression equation can be used to estimate the average happiness for nations with a life expectancy of e.g. 50 years: Happiness = -1.960 + 0.114 × 50 = 3.74

37 Example: Happiness (SPSS data view with saved values) Estimated value of Happiness for X = 49.94 years: the predicted/estimated mean value, together with the lower and upper limits of the mean confidence interval. SPSS: Analyze >> Regression >> Linear. Click Save, mark Mean under Prediction intervals.

38 Example: Happiness The estimated/predicted average value of happiness for a life expectancy of approx. 50 years (49.94) is 3.7. With 95% confidence, the interval 3.3 to 4.2 covers the true average value of happiness score in the population of nations with a life expectancy of approx. 50 years.

39 Confidence intervals around forecasts When using the regression equation to make forecasts, i.e. predictions of Y for a new set of data, you can create a confidence interval for the forecast. Forecasts are affected not only by the sampling variation of the original sample, but also by that of the newly drawn sample. Confidence intervals around forecasts therefore also include the error associated with future observations, which makes them wider than the confidence intervals for estimated/predicted mean values.

40 Confidence intervals around individual forecasted (new observed) values Form intervals around the forecasted value $\hat{y}$ to express uncertainty about the value of y for a given $x_j$. (Figure: the fitted line $\hat{y} = b_0 + b_1 x$ plotted with Y against X, showing the confidence interval for the estimated value of y, given $x_j$, and the wider forecasting interval for a new observed y, given $x_j$.)
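The difference between the two intervals can be sketched for a simple regression, assuming the standard textbook formulas: the interval for the mean uses the factor sqrt(1/n + (x0 - mean)²/Sxx), and the forecast interval adds 1 under the square root for the error of the new observation. The data and function name are hypothetical, and the critical value 3.182 (95% confidence, 3 degrees of freedom) is entered by hand.

```python
import math


def interval_half_widths(x, y, x0, t_critical):
    """Return the prediction at x0 and the half-widths of both intervals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    see = math.sqrt(sse / (n - 2))            # standard error of the estimate
    distance = 1.0 / n + (x0 - mean_x) ** 2 / sxx
    y_hat = intercept + slope * x0
    mean_half = t_critical * see * math.sqrt(distance)
    forecast_half = t_critical * see * math.sqrt(1.0 + distance)
    return y_hat, mean_half, forecast_half


# Hypothetical data; 3.182 is the assumed 97.5th percentile of t with 3 df.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
y_hat, mean_half, forecast_half = interval_half_widths(x, y, 3.0, 3.182)
print(y_hat, mean_half, forecast_half)
```

Both intervals are centered on the same predicted value; only the width differs, and the forecast interval is always the wider one.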

41 Example: Happiness We can use the regression equation estimated on the data from The World Database of Happiness to predict (forecast) the happiness score for a nation not included in the database. This nation happens to have a life expectancy of 50 years.

42 Example: Happiness
Coefficients (dependent variable: Happiness)
Model 1            B        Std. Error   Beta    t        Sig.
(Constant)         -1.960   0.724                -2.709   0.008
Life expectancy    0.114    0.010        0.807   11.446   0.000
The same regression equation can be used to forecast the happiness of a nation with a life expectancy of e.g. 50 years: Happiness = -1.960 + 0.114 × 50 = 3.74

43 Example: Happiness (SPSS data view with saved values) Estimated value of Happiness for X = 49.94 years: the predicted/estimated value, together with the lower and upper limits of the individual (forecast) confidence interval. SPSS: Analyze >> Regression >> Linear. Click Save, mark Individual under Prediction intervals.

44 Example: Happiness The forecasted value of happiness for a nation with a life expectancy of approx. 50 years (49.94) is 3.7, the same as the estimated/predicted mean value. With 95% confidence, the interval 2.0 to 5.2 covers the happiness score of an individual nation with a life expectancy of approx. 50 years.

45 Recap: Using the regression coefficients The estimated regression coefficients (the b coefficients) represent both the type of relationship (positive or negative) and the strength of the relationship between explanatory and dependent variables (the value of b). The regression coefficients have two important functions in meeting the objectives of prediction and explanation for any regression analysis.

46 Explanation The nature and impact of each explanatory variable in making prediction of the dependent variable is often of great interest. An explanation of the relationship between explanatory and dependent variables is gained by examining the relative contributions of each variable. The regression coefficients are indicators of the relative impact and importance of the explanatory variables in the relationship with the dependent variable.

47 Explanation In order to use the regression coefficients for explanation purposes, first ensure that all of the explanatory variables are on comparable scales. Example: If you want to investigate the effect of household income on the number of cars in the household, you might include different individuals' incomes as explanatory variables. Then make sure that all individuals' incomes are measured the same way, e.g. in SEK (and not one person's income in some other unit).

48 Explanation Even when all of the explanatory variables are on comparable scales, differences in variability from variable to variable can affect the size of the regression coefficient. To make all explanatory variables comparable in both scale and variability, you can use a modified regression coefficient called the beta coefficient.

49 The beta coefficient The regression coefficients can be standardized, meaning that they are converted to a common scale and variability. When using the standardized beta coefficients you don't have to deal with different units of measurement: they directly reflect the relative impact on the dependent variable of a change of one standard deviation in each variable. Multiple regression provides both the regression coefficients and the standardized beta coefficients.
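A sketch of the standardization, assuming the usual formula in which the unstandardized coefficient b is multiplied by the ratio of the standard deviation of the explanatory variable to that of the dependent variable. The data and the function name are hypothetical.

```python
import statistics


def beta_coefficient(b, x_values, y_values):
    """Standardized coefficient: b rescaled by the ratio of standard deviations."""
    return b * statistics.stdev(x_values) / statistics.stdev(y_values)


# Hypothetical data for a simple regression of y on x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]

mean_x, mean_y = statistics.mean(x), statistics.mean(y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
b = sxy / sxx                      # unstandardized slope
beta = beta_coefficient(b, x, y)   # change in y (in SDs) per one SD change in x
print(b, beta)
```

With a single explanatory variable the beta coefficient equals the Pearson correlation between x and y; with several explanatory variables each beta depends on the other variables in the model.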

50 Example: Cheddar cheese As cheddar cheese matures, a variety of chemical processes take place. The taste of matured cheese is related to the concentration of several chemicals in the final product. In a study of cheddar cheese from LaTrobe Valley of Victoria, Australia, samples of cheese were analyzed for their chemical composition and were subjected to taste tests (n = 30).

51 Example: Cheddar cheese Variables. Dependent variable: Taste score (obtained by combining the scores from several tasters). Independent (explanatory) variables: concentrations of the following chemicals: Acetic acid, Hydrogen sulfide, Lactic acid.

52 Example: Cheddar cheese The coefficient for Lactic acid is largest, but the standard error is also largest for that variable. Standardized beta coefficients can be used to assess the relative impact of the variables.

53 Cautions when using beta coefficients Beta coefficients should be used as a guide to the relative importance of individual explanatory variables only when collinearity is minimal. Collinearity can distort the contributions of any explanatory variable. The beta values can be interpreted only in the context of the other variables in the equation. A beta value for e.g. Hydrogen sulfide reflects its importance only in relation to Lactic acid and Acetic acid, not in any absolute sense.

54 A decision process for multiple regression analysis From stage 4 Stage 5 Interpret the regression variate Evaluate the prediction equation Evaluate the relative importance of the explanatory variables Assess multicollinearity

57 Assessing multicollinearity Correlation among the explanatory variables may cause problems when interpreting the regression results. Some degree of multicollinearity is however often unavoidable. You need to: Assess the degree of multicollinearity Determine its impact on the results Apply the necessary remedies if needed

58 Multicollinearity affects standard errors Standard errors of the coefficients for the correlated independent/explanatory variables will increase compared to the case of no or low degree of multicollinearity

59 Identifying multicollinearity The simplest and most obvious means of identifying collinearity is an examination of the correlation matrix for the explanatory variables. The presence of strong correlations (generally r ≥ 0.90) is the first indication of substantial collinearity. The absence of strong correlations, however, does not ensure an absence of collinearity. Collinearity may be due to the combined effect of two or more explanatory variables (multicollinearity).

60 Tolerance A direct measure of multicollinearity is tolerance, the amount of variability of an explanatory variable that is not explained by the other explanatory variables. E.g., if the other explanatory variables explain 25% of the variation of the explanatory variable X 1, then the tolerance value of X 1 is 75%. A high tolerance value means a small degree of multicollinearity.

61 Example: Cheddar cheese Tolerance values of 50-54%. Is this good enough? Tolerance measure of multicollinearity SPSS: Analyze >> Regression >> Linear. Click Statistics, mark Collinearity diagnostics

62 Variance inflation factor (VIF) A second measure of multicollinearity is the variance inflation factor (VIF), which is simply the inverse of the tolerance value. Higher degrees of multicollinearity are reflected in lower tolerance values and higher VIF values. The VIF shows how much the variance of a coefficient has been inflated due to multicollinearity; its square root is the factor by which the standard error has been increased.
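The relations between the two measures can be written out in a few lines. This is a sketch with hypothetical function names; it uses the square root of the VIF as the inflation factor for the standard error, which is consistent with a VIF of 10 corresponding to roughly tripled standard errors.

```python
import math


def tolerance(r_squared_j):
    """Share of X_j's variation NOT explained by the other explanatory variables."""
    return 1.0 - r_squared_j


def vif(tolerance_value):
    """Variance inflation factor: the inverse of the tolerance."""
    return 1.0 / tolerance_value


def standard_error_inflation(vif_value):
    """Factor by which the standard error of the coefficient is increased."""
    return math.sqrt(vif_value)


# The other variables explain 25% of X_1, so the tolerance is 75%.
print(tolerance(0.25), vif(tolerance(0.25)))

# The common cutoff: tolerance 0.10, VIF 10, standard errors about tripled.
print(vif(0.10), standard_error_inflation(vif(0.10)))
```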

63 Example: Cheddar cheese Variance Inflation Factor, a measure of multicollinearity. VIF = … This means that the standard error for Lactic acid is about 1.4 times as large as it would be without multicollinearity.

64 How much multicollinearity is too much? Small tolerance values (and thus large VIF values) denote high collinearity. A common cutoff threshold is a tolerance value of 0.10, which corresponds to a VIF value of 10. A VIF value of 10 corresponds to standard errors being roughly tripled (√10 ≈ 3.16). Most recommended thresholds still allow for substantial collinearity. You may wish to be more restrictive, especially with small sample sizes.

65 How much multicollinearity is too much? Some suggested guidelines: Bivariate correlations of even 0.70 can impact both the explanation and estimation of the regression results. Even weaker correlations can have an impact if the correlation between explanatory variables is greater than either explanatory variable's correlation with the dependent variable. The suggested cutoff for the tolerance value is 0.10. When values at this level are encountered, multicollinearity problems are almost certain.

66 Example: Cheddar cheese
Correlations (Pearson correlation, with two-tailed significance in parentheses)
                   Taste score       Acetic acid       Hydrogen sulfide   Lactic acid
Taste score        1                 0.550** (0.002)   0.756** (0.000)    0.704** (0.000)
Acetic acid        0.550** (0.002)   1                 0.618** (0.000)    0.604** (0.000)
Hydrogen sulfide   0.756** (0.000)   0.618** (0.000)   1                  0.645** (0.000)
Lactic acid        0.704** (0.000)   0.604** (0.000)   0.645** (0.000)    1
**. Correlation is significant at the 0.01 level (2-tailed).
Correlations around 0.6 between the explanatory variables, and with the dependent variable.

67 Example: Cheddar cheese All tolerance values are far above 0.10. Following the guidelines, we don't have a multicollinearity problem with regard to tolerance values.
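For the cheddar example, the tolerance and VIF values can be approximated directly from the correlations among the three explanatory variables (0.618, 0.604 and 0.645), using the fact that the VIFs are the diagonal elements of the inverse correlation matrix. The sketch below hard-codes the closed form for exactly three explanatory variables; the function name is hypothetical, and because the correlations are rounded the results match SPSS only approximately.

```python
def vifs_from_correlations(r12, r13, r23):
    """VIFs for three explanatory variables from their pairwise correlations.

    Each VIF is a diagonal element of the inverse of the 3x3 correlation matrix.
    """
    det = 1.0 + 2.0 * r12 * r13 * r23 - r12 ** 2 - r13 ** 2 - r23 ** 2
    return (
        (1.0 - r23 ** 2) / det,  # variable 1
        (1.0 - r13 ** 2) / det,  # variable 2
        (1.0 - r12 ** 2) / det,  # variable 3
    )


# 1 = Acetic acid, 2 = Hydrogen sulfide, 3 = Lactic acid.
vif_values = vifs_from_correlations(0.618, 0.604, 0.645)
tolerances = [1.0 / v for v in vif_values]
se_inflation_lactic = vif_values[2] ** 0.5

print([round(t, 2) for t in tolerances])
print(round(se_inflation_lactic, 2))
```

The tolerances come out at roughly 0.50 to 0.55, far above the 0.10 cutoff, and the standard error for Lactic acid is inflated by a factor of about 1.4.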

68 Remedies for multicollinearity Once the degree of multicollinearity has been determined, you have a number of options: 1) Omit one or more highly correlated explanatory variables and identify other variables to help the prediction (if possible). 2) Use the model with the highly correlated explanatory variables for prediction only (don't interpret the regression coefficients), but be aware of the lowered level of overall predictive ability.

69 Remedies for multicollinearity, cont'd 3) Use the simple correlations between each explanatory variable and the dependent variable to understand the different relationships. 4) Use a more sophisticated method of analysis to obtain a model that more clearly reflects the simple effects of the explanatory variables (not included in this course).

70 Interpreting the regression variate Rules of thumb Interpret the impact of each explanatory variable relative to the other variables in the model, because model respecification can have a profound effect on the remaining variables: o Use standardized beta coefficients when comparing relative importance among explanatory variables

71 Interpreting the regression variate Rules of thumb, cont'd Multicollinearity is generally viewed as harmful because increases in multicollinearity: o reduce the overall R² that can be achieved o confound estimation of the regression coefficients o negatively affect the statistical significance tests of regression coefficients

72 Interpreting the regression variate Rules of thumb, cont'd Generally accepted levels of multicollinearity (tolerance values up to 0.10, corresponding to a VIF of 10) almost always indicate problems with multicollinearity, but these problems may also be seen at much lower levels of collinearity and multicollinearity: o Bivariate correlations of 0.70 or higher may result in problems, and even lower correlations may be problematic if they are higher than the correlations between the explanatory and dependent variables

73 Interpreting the regression variate Rules of thumb, cont'd o Values much lower than the suggested thresholds (VIF values of even 3 to 5) may result in interpretation or estimation problems, particularly when the relationships with the dependent measure are weaker

74 A decision process for multiple regression analysis From stage 4 Stage 5 Interpret the regression variate Evaluate the prediction equation Evaluate the relative importance of the explanatory variables Assess multicollinearity Stage 6 Validate the results Split-sample analysis PRESS statistic

75 A decision process for multiple regression analysis Stage 6 Stage 6: validation of the results The final stage is to ensure that the regression model represents the general population (generalizability) and is appropriate for the situations in which it will be used (transferability). The best guideline is the extent to which the model matches an existing theoretical model, or a set of previously validated results on the same topic. If prior results or theory are not available, empirical validation approaches can be used.

76 Additional or split samples The most appropriate empirical validation is to test the regression model on an additional sample drawn from the population. This can be done in several ways: The original model can be used to predict values in the new sample and the predicted values are compared to the actual values of the dependent variable in that sample. A separate model can be estimated with the new sample and then compared with the original equation.

77 Additional or split samples When you don't have the possibility to draw a new sample, you can split the existing sample into two parts: one part for creating the regression model and a second part for validating the equation. This is appropriate only when you have large samples! No matter if you use additional or split samples, you will often find differences between the original model and the validation efforts. Your role is then to look for the best model across all samples. No regression model, unless estimated from the entire population, is the final and absolute model.
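A split-sample check can be sketched for a simple regression as follows: part of the sample is used to estimate the line, and the prediction error is measured on the remaining cases. The split fraction, the seed, the use of the root mean squared error as the comparison measure, and all names are assumptions made for this illustration.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares for one explanatory variable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def split_sample_rmse(x, y, fraction=0.5, seed=0):
    """Fit on a random part of the sample; return the RMSE on the held-out part."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * fraction)
    estimation, validation = indices[:cut], indices[cut:]
    intercept, slope = fit_line(
        [x[i] for i in estimation], [y[i] for i in estimation]
    )
    errors = [y[i] - (intercept + slope * x[i]) for i in validation]
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))


# Hypothetical sample in which y depends exactly linearly on x.
x = [float(i) for i in range(20)]
y = [1.0 + 2.0 * xi for xi in x]
print(split_sample_rmse(x, y))
```

With real data the validation error will not be zero; the point of the comparison is whether it is of the same order as the standard error of the estimate in the estimation part.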

78 The PRESS statistic An alternative to obtaining additional samples for validation purposes is to calculate the PRESS statistic, a measure of the predictive accuracy of the estimated regression model (similar to R²). The PRESS statistic (Predicted Residual Sum of Squares) is based on fitting the model repeatedly, leaving out one observation each time. In each repetition the model is used to predict the observation that was left out. The PRESS statistic won't be used during this course.
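The leave-one-out procedure described above can be sketched for a simple regression as follows. The data and function names are hypothetical; the example also computes the ordinary residual sum of squares for comparison.

```python
def fit_line(x, y):
    """Ordinary least squares for one explanatory variable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def press_statistic(x, y):
    """Sum of squared errors when each case is predicted from all other cases."""
    total = 0.0
    for i in range(len(x)):
        x_rest = x[:i] + x[i + 1:]   # leave out observation i
        y_rest = y[:i] + y[i + 1:]
        intercept, slope = fit_line(x_rest, y_rest)
        total += (y[i] - (intercept + slope * x[i])) ** 2
    return total


def residual_sum_of_squares(x, y):
    """Ordinary sum of squared residuals from the fit on all cases."""
    intercept, slope = fit_line(x, y)
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 5.9]
press = press_statistic(x, y)
sse = residual_sum_of_squares(x, y)
print(press, sse)
```

PRESS is never smaller than the ordinary residual sum of squares, because each case is predicted by a model that has not seen it.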

79 Forecasting with the model Once you have your final validated model, you might want to use it to make predictions or forecasts. When forecasting, i.e. applying the estimated model to a new data set to calculate predicted values for the dependent variable, there are several factors that can have a serious impact on the quality of the new predictions.

80 Forecasting with the model 1) The predictions not only have the sampling variations from the original sample, but also those of the newly drawn sample. Always calculate the confidence intervals of your predictions in addition to the point estimate. 2) Make sure that the conditions and relationships measured at the time the original sample was taken have not changed substantially. 3) Don't use the model to estimate beyond the range of explanatory variables found in the sample.
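The third point can be turned into a simple guard before forecasting. This is a minimal sketch with a hypothetical function name and invented life expectancy values; it checks one explanatory variable at a time.

```python
def within_estimation_range(x_new, x_sample):
    """True if x_new lies inside the range of values used to estimate the model."""
    return min(x_sample) <= x_new <= max(x_sample)


# Hypothetical life expectancies (years) in the estimation sample.
life_expectancy_sample = [49.9, 61.3, 70.2, 78.5, 82.1]

print(within_estimation_range(50.0, life_expectancy_sample))
print(within_estimation_range(95.0, life_expectancy_sample))
```

A forecast for a value outside the range is an extrapolation, and the estimated relationship gives no evidence about how Y behaves there.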
