This tutorial presentation is prepared by Mohammad Ehsanul Karim


1 STATA: The Red tutorial

2 STATA: The Red tutorial This tutorial presentation is prepared by Mohammad Ehsanul Karim

4 Contents
Linear Regression Analysis
1. Introduction to Linear Regression
2. Tests for Normality of Residuals
3. Tests for Heteroscedasticity
4. Tests for Multicollinearity
5. Tests for Autocorrelation
6. Detecting Unusual and Influential Data
7. Tests for Model Specification

5 1. Introduction to Linear Regression

6 Linear Regression The command regress is used to perform linear regressions. The first variable after the regress command is always the dependent variable (the left-hand-side variable). The independent variables that we choose to include in the model (the right-hand-side variables) follow it.

7 Linear Regression
. clear
. use hs1, clear
. regress write read female
      Source |       SS       df       MS              Number of obs =
-------------+------------------------------           F(  2,   197) =
       Model |                                         Prob > F      =
    Residual |                                         R-squared     =
-------------+------------------------------           Adj R-squared =
       Total |                                         Root MSE      =
------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |
      female |
       _cons |
------------------------------------------------------------------------------

10 2. Tests for Normality of Residuals

11 Tests for Normality of Residuals We use the predict command with the resid option to generate residuals, and we name the residuals r.
. predict r, resid

12 Tests for Normality of Residuals Shapiro-Wilk W test for Normality
Normality of the residuals underlies the t and F tests reported after a regression. To verify it, we use the Shapiro-Wilk W test for normal data. The null hypothesis is that the residuals are normally distributed, so a small Prob>z is evidence against normality.
. swilk r
                   Shapiro-Wilk W test for normal data
    Variable |    Obs       W           V         z       Prob>z
-------------+--------------------------------------------------
           r |

15 Tests for Normality of Residuals The kdensity command with the normal option displays a kernel density graph of the residuals with a normal density superimposed, which gives a visual check of whether the residuals are normally distributed.
. kdensity r, normal

18 Tests for Normality of Residuals The pnorm command produces a standardized normal probability (P-P) plot. It is another graphical check of whether the residuals from the regression are normally distributed.
. pnorm r

21 Tests for Normality of Residuals The qnorm command produces a normal quantile plot. It is yet another graphical check of whether the residuals are normally distributed. It is sensitive to non-normality near the tails, whereas pnorm is more sensitive in the middle range of the data.
. qnorm r

24 Tests for Normality of Residuals Summary of Tests for Normality of Residuals
swilk performs the Shapiro-Wilk W test for normality.
kdensity produces a kernel density plot with a normal distribution overlaid.
pnorm graphs a standardized normal probability (P-P) plot.
qnorm plots the quantiles of varname against the quantiles of a normal distribution.
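
The four checks summarized above can be run together in one short do-file. The following is a minimal sketch, not the author's own session. It assumes Stata's bundled auto dataset as a stand-in for hs1, and the model (price on mpg and weight) and the scalar names W_stat and p_val are chosen only for illustration.

```stata
* Sketch: residual-normality checks after a regression.
* Assumption: the bundled auto dataset stands in for hs1.
sysuse auto, clear
regress price mpg weight

* Raw residuals, stored in a new variable named r
predict r, resid

* Shapiro-Wilk W test; H0: r is normally distributed
swilk r
scalar W_stat = r(W)
scalar p_val  = r(p)
display "W = " W_stat "   Prob>z = " p_val

* Graphical checks: density with normal overlay, P-P plot, Q-Q plot
kdensity r, normal
pnorm r
qnorm r
```

A small p_val would suggest that the residuals depart from normality. The three graphs then help to show where the departure lies.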

25 3. Tests for Heteroscedasticity

26 Tests for Heteroscedasticity One of the basic assumptions for the ordinary least squares regression is the homogeneity of variance of the residuals. There are graphical and non-graphical methods for detecting heteroscedasticity.

27 Tests for Heteroscedasticity Cook-Weisberg test for heteroskedasticity
. hettest
Cook-Weisberg test for heteroskedasticity using fitted values of write
     Ho: Constant variance
         chi2(1)      =      5.79
         Prob > chi2  =
With one degree of freedom, a chi2 of 5.79 exceeds the 5% critical value of 3.84, so the hypothesis of constant variance is rejected at the 5% level.

29 Tests for Heteroscedasticity For a graphical check, we use the rvfplot command, which plots the residuals against the fitted values, with the yline(0) option to put a reference line at y=0. If the variance is constant, the points form a band of roughly even width around the line.
. rvfplot, yline(0)

32 Tests for Heteroscedasticity Summary of Tests for Heteroscedasticity
hettest performs the Cook and Weisberg test.
rvfplot graphs a residual-versus-fitted plot.
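
The test and the plot summarized above can be combined in a few lines. This is a sketch under stated assumptions: the bundled auto dataset stands in for hs1, the scalar names are hypothetical, and the test is called as estat hettest, which is how recent Stata releases document the command that the slides spell hettest.

```stata
* Sketch: heteroscedasticity checks after a regression.
* Assumption: the bundled auto dataset stands in for hs1.
sysuse auto, clear
regress price mpg weight

* Breusch-Pagan / Cook-Weisberg test; H0: constant variance
estat hettest
scalar bp_chi2 = r(chi2)
scalar bp_df   = r(df)
scalar bp_p    = r(p)
display "chi2(" bp_df ") = " %6.2f bp_chi2 "   Prob > chi2 = " %6.4f bp_p

* Residual-versus-fitted plot with a reference line at zero
rvfplot, yline(0)
```

A small bp_p points toward non-constant variance. The plot then shows whether the spread of the residuals widens or narrows along the fitted values.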

33 4. Tests for Multicollinearity

34 Tests for Multicollinearity Multicollinearity is a concern in multiple regression because of its degree, not its mere existence. When multicollinearity is severe, the estimates of the regression coefficients become unstable and their standard errors can be wildly inflated.

35 Tests for Multicollinearity We can use the vif command after the regression to check for multicollinearity. vif stands for variance inflation factor.
. vif
    Variable |       VIF       1/VIF
-------------+----------------------
      female |
        read |
-------------+----------------------
    Mean VIF |      1.00
A variable whose VIF is greater than 10 may merit further investigation. Tolerance, defined as 1/VIF, is also used to check the degree of collinearity. A tolerance below 0.1 corresponds to a VIF above 10.

38 Tests for Multicollinearity Summary of Tests for Multicollinearity vif calculates the variance inflation factor for the independent variables in the linear model.
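
To see where the number comes from, recall that the VIF of a predictor equals 1/(1 - R2), where R2 comes from regressing that predictor on the other predictors. The sketch below checks this by hand. It assumes the bundled auto dataset as a stand-in for hs1, uses hypothetical scalar names, and calls the command as estat vif, the form documented in recent Stata releases.

```stata
* Sketch: reproduce a VIF from an auxiliary regression.
* Assumption: the bundled auto dataset stands in for hs1.
sysuse auto, clear
regress price mpg weight

* VIFs as reported by Stata; with two predictors both VIFs are equal
estat vif
scalar vif_reported = r(vif_1)

* Auxiliary regression of one predictor on the other
quietly regress mpg weight
scalar vif_manual = 1/(1 - e(r2))
scalar tolerance  = 1/vif_manual

display "reported VIF = " vif_reported
display "by-hand VIF  = " vif_manual
display "tolerance    = " tolerance
```

With more than two predictors, the auxiliary regression would include all the remaining predictors on the right-hand side, and each predictor would have its own VIF.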

39 5. Tests for Autocorrelation

40 Tests for Autocorrelation
. tsset id
        time variable:  id, 1 to 200
. dwstat
Durbin-Watson d-statistic(  3,   200) =
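
The Durbin-Watson d-statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. Values near 2 indicate no first-order autocorrelation. The sketch below computes d by hand and compares it with Stata's value. It rests on assumptions: the bundled auto dataset stands in for hs1, the row order is used as a hypothetical time index t, the variable and scalar names are illustrative, and the command is called as estat dwatson, the form documented in recent Stata releases.

```stata
* Sketch: Durbin-Watson d-statistic, reported and by hand.
* Assumption: the bundled auto dataset stands in for hs1,
* with the row order used as a hypothetical time index.
sysuse auto, clear
generate t = _n
tsset t
regress price mpg weight

* Value reported by Stata
estat dwatson
scalar dw_reported = r(dw)

* By hand: sum of (e_t - e_{t-1})^2 divided by sum of e_t^2
predict double res, resid
generate double num = (res - L.res)^2
generate double den = res^2
quietly summarize num
scalar sum_num = r(sum)
quietly summarize den
scalar dw_manual = sum_num / r(sum)

display "reported d = " dw_reported "   by-hand d = " dw_manual
```

The first observation has no lagged residual, so num is missing there and summarize leaves it out of the sum.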

41 6. Detecting Unusual and Influential Data

42 Detecting Unusual and Influential Data
Outliers: In linear regression, an outlier is an observation with a large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity, a data entry error, or some other problem.
Leverage: An observation with an extreme value on a predictor variable is called a point with high leverage. Leverage is a measure of how far an independent variable deviates from its mean. These leverage points can have an effect on the estimates of the regression coefficients.
Influence: An observation is said to be influential if removing it substantially changes the estimates of the coefficients. Influence can be thought of as the product of leverage and outlierness.

43 Detecting Unusual and Influential Data Here we summarize the general rules of thumb we use for these measures to identify observations worthy of further investigation (where k is the number of predictors and n is the number of observations).
Measure        Value
leverage       > (2k+2)/n
abs(rstu)      > 2
Cook's D       > 4/n
abs(dfits)     > 2*sqrt(k/n)
abs(dfbeta)    > 2/sqrt(n)
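
These cutoffs can be computed from the results that regress leaves behind, which avoids typing k and n by hand. The following is a sketch under assumptions: the bundled auto dataset stands in for hs1, and the scalar names (n_obs, k_pred, cut_lev and so on) are hypothetical. e(N) holds the number of observations and e(df_m) the model degrees of freedom, which equals the number of predictors here.

```stata
* Sketch: rule-of-thumb cutoffs from stored regression results.
* Assumption: the bundled auto dataset stands in for hs1.
sysuse auto, clear
regress price mpg weight

scalar n_obs  = e(N)
scalar k_pred = e(df_m)

scalar cut_lev    = (2*k_pred + 2)/n_obs
scalar cut_rstu   = 2
scalar cut_cooksd = 4/n_obs
scalar cut_dfits  = 2*sqrt(k_pred/n_obs)
scalar cut_dfbeta = 2/sqrt(n_obs)

display "leverage    > " cut_lev
display "abs(rstu)   > " cut_rstu
display "Cook's D    > " cut_cooksd
display "abs(dfits)  > " cut_dfits
display "abs(dfbeta) > " cut_dfbeta
```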

44 Detecting Unusual and Influential Data We use the predict command with the rstudent option to generate studentized residuals. The name r already holds the raw residuals generated earlier, so we drop that variable first and reuse the name. Studentized residuals are a type of standardized residual that can be used to identify outliers.
. drop r
. predict r, rstudent

46 Detecting Unusual and Influential Data
. stem r
Stem-and-leaf plot for r (Studentized residuals)
r rounded to nearest multiple of .01
plot in units of .01
  -2** | 50,42
  -2** | 26,21
  -2** | 18
  -1** | 92,85,84,83
  -1** | 75,72,69,61,61,60
  -1** | 50,48,46,46,42
  -1** | 33,32,22,20,20,20
  -1** | 17,16,13,12,10,01
  -0** | 97,97,96,96,93,93,92,92,90,89,89,89,86,86,84,82,82,80,80
  -0** | 74,74,71,70,67
  -0** | 59,59,58,53,49,49,47,42,42,40
  -0** | 35,35,33,31,31,31,30,28,28,28,28,27,25,23,23,22
  -0** | 19,17,16,16,16,16,14,13,13,09,09,07,04,03,03,02
   0** | 00,02,02,04,04,04,04,07,09,11,14,16,16,19
   0** | 21,23,23,24,24,26,28,29,30,33,33,35,35
   0** | 40,44,44,51,51,54,54,54,54,56,56,57,57,57
   0** | 61,63,64,64,64,64,64,66,70,70,71,73,73,73,74,78
   0** | 88,88,89,93,94,94,97,98,99
   1** | 01,06,06,08,08,13,13,13,13,15,19
   1** | 23,29,32,36,36,37,37,39
   1** | 42,43,44,48,51,52,53,55
   1** | 60,68,73,73,75,77
   1** | 80,84
   2** | 16

47 Detecting Unusual and Influential Data
. sort r
. list r in 1/10
. list r in -10/l
We should pay attention to studentized residuals that exceed +2 or -2, be more concerned about residuals that exceed +2.5 or -2.5, and be more concerned still about residuals that exceed +3 or -3.
. list r if r<-2 | r>2
. list r if r<-2.5 | r>2.5
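
The same screening can be written with abs() and an explicit guard against missing values, because Stata treats a missing value as larger than any number and would otherwise list it. This is a sketch under assumptions: the bundled auto dataset stands in for hs1, and the names rstu, n_flag2 and n_flag25 are hypothetical.

```stata
* Sketch: flag large studentized residuals.
* Assumption: the bundled auto dataset stands in for hs1.
sysuse auto, clear
regress price mpg weight
predict rstu, rstudent

* Observations beyond +2 or -2
list make price rstu if abs(rstu) > 2 & !missing(rstu)
count if abs(rstu) > 2 & !missing(rstu)
scalar n_flag2 = r(N)

* Observations beyond +2.5 or -2.5
list make price rstu if abs(rstu) > 2.5 & !missing(rstu)
count if abs(rstu) > 2.5 & !missing(rstu)
scalar n_flag25 = r(N)

display n_flag2 " beyond 2, " n_flag25 " beyond 2.5"
```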

53 Detecting Unusual and Influential Data To get the leverage values, we use the predict command with the leverage option, and we name the new variable lev.
. predict lev, leverage

55 Detecting Unusual and Influential Data Cook's D and DFITS both combine information on the residual and the leverage. The two measures are scaled differently but give similar answers. The cutoffs below follow the rules of thumb, 4/n for Cook's D and 2*sqrt(k/n) for DFITS, with k = 2 predictors and n = 200 observations. _N is Stata's count of observations in the dataset.
. predict d, cooksd
. list female read d if d>4/_N
(The listing shows female, read, and d for the flagged observations.)
. predict dfit, dfits
. list dfit if abs(dfit)>2*sqrt(2/200)
The above measures are general measures of influence.
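
Leverage, Cook's D and DFITS can be screened in one pass, with the cutoffs taken from the stored results rather than typed in. The following is a sketch under assumptions: the bundled auto dataset stands in for hs1, and the scalar names n_obs and k_pred are hypothetical.

```stata
* Sketch: screen leverage, Cook's D and DFITS against the rules of thumb.
* Assumption: the bundled auto dataset stands in for hs1.
sysuse auto, clear
regress price mpg weight
scalar n_obs  = e(N)
scalar k_pred = e(df_m)

predict lev, leverage
predict d, cooksd
predict dfit, dfits

* Missing values compare as larger than any number, so exclude them
list make lev  if lev > (2*k_pred + 2)/n_obs & !missing(lev)
list make d    if d > 4/n_obs & !missing(d)
list make dfit if abs(dfit) > 2*sqrt(k_pred/n_obs) & !missing(dfit)

* Check: the leverages sum to the number of estimated coefficients
quietly summarize lev
scalar lev_sum = r(sum)
display "sum of leverages = " lev_sum
```

The sum of the leverages equals k + 1 (the predictors plus the constant), which is a quick way to confirm that lev was computed for the intended model.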

60 Detecting Unusual and Influential Data We can also consider more specific measures of influence that assess how each coefficient is changed by deleting the observation. This measure is called DFBETA and is created for each of the predictors. It is more computationally intensive than summary statistics such as Cook's D. In Stata, the dfbeta command produces the DFBETAs for each of the predictors.
. dfbeta
    DFread:  DFbeta(read)
  DFfemale:  DFbeta(female)
. list DFread DFfemale in 1/5
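
The DFBETAs can then be compared with the 2/sqrt(n) rule of thumb. The sketch below is one way to do it, under assumptions: the bundled auto dataset stands in for hs1, the DFBETAs are created with predict and its dfbeta() option so that the variable names (dfb_mpg, dfb_weight) are chosen explicitly, and the names flag and cut_dfbeta are hypothetical.

```stata
* Sketch: flag observations with a large DFBETA on any predictor.
* Assumption: the bundled auto dataset stands in for hs1.
sysuse auto, clear
regress price mpg weight

* One DFBETA variable per predictor, with explicit names
predict dfb_mpg, dfbeta(mpg)
predict dfb_weight, dfbeta(weight)

scalar cut_dfbeta = 2/sqrt(e(N))

* flag is 1 when either DFBETA exceeds the cutoff in absolute value
generate byte flag = (abs(dfb_mpg) > cut_dfbeta | abs(dfb_weight) > cut_dfbeta) & !missing(dfb_mpg, dfb_weight)
list make dfb_mpg dfb_weight if flag
```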

65 Detecting Unusual and Influential Data There are also several graphs that can be used to search for unusual and influential observations. The avplot command graphs an added-variable plot. It works not only for variables in the model but also for variables that are not in the model, which is why it is called an added-variable plot. We can do an avplot on the variable grade.
. avplot grade
[Figure: Added-Variable plot]

69 Detecting Unusual and Influential Data rvpplot is another convenience command, used after regress or anova, that plots the residuals against a specified predictor.
. rvpplot read

72 Detecting Unusual and Influential Data lvr2plot stands for leverage-versus-residual-squared plot.
. lvr2plot

75 Detecting Unusual and Influential Data Summary of Detecting Unusual and Influential Data
predict creates predicted values, residuals, and measures of influence.
dfbeta calculates DFBETAs for all the independent variables.
avplot graphs an added-variable plot.
lvr2plot graphs a leverage-versus-squared-residual plot.
rvpplot graphs a residual-versus-predictor plot.
rvfplot graphs a residual-versus-fitted plot.

76 7. Tests for Model Specification

77 Tests for Model Specification A model specification error can occur when one or more relevant variables are omitted from the model or one or more irrelevant variables are included in the model.

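78 Tests for Model Specification There are several methods to detect specification errors. The linktest command performs a model specification link test for single-equation models. It refits the model using the prediction (_hat) and its square (_hatsq) as the predictors. If the model is specified correctly, _hatsq should not be significant.
. linktest
      Source |       SS       df       MS              Number of obs =
-------------+------------------------------           F(  2,   197) =
       Model |                                         Prob > F      =
    Residual |                                         R-squared     =
-------------+------------------------------           Adj R-squared =
       Total |                                         Root MSE      =
------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |
      _hatsq |
       _cons |
------------------------------------------------------------------------------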

80 Tests for Model Specification The ovtest command performs a regression specification error test (RESET) for omitted variables.
. ovtest
Ramsey RESET test using powers of the fitted values of write
       Ho:  model has no omitted variables
                 F(3, 194) =      1.95
                  Prob > F =
An F of 1.95 is below the 5% critical value of about 2.65 for F(3, 194), so the hypothesis of no omitted variables is not rejected at the 5% level.

83 Tests for Model Specification Summary of Tests for Model Specification
linktest performs a link test for model specification.
ovtest performs a regression specification error test (RESET) for omitted variables.
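
Both specification tests can be run directly after the regression. This is a sketch under assumptions: the bundled auto dataset stands in for hs1, the scalar names ov_F and ov_p are hypothetical, and the RESET test is called as estat ovtest, the form documented in recent Stata releases.

```stata
* Sketch: model specification tests after a regression.
* Assumption: the bundled auto dataset stands in for hs1.
sysuse auto, clear
regress price mpg weight

* Ramsey RESET; H0: model has no omitted variables
estat ovtest
scalar ov_F = r(F)
scalar ov_p = r(p)
display "RESET F = " %6.2f ov_F "   Prob > F = " %6.4f ov_p

* Link test; a significant _hatsq suggests a specification error
linktest
```

The RESET test is run first so that its results are read from the original regression before linktest fits its own auxiliary model.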

84 STATA: The Red tutorial
