STATA: The Red tutorial

This tutorial presentation is prepared by Mohammad Ehsanul Karim (ehsan.karim@gmail.com).

Contents

Linear Regression Analysis
1. Introduction to Linear Regression
2. Tests for Normality of Residuals
3. Tests for Heteroscedasticity
4. Tests for Multicollinearity
5. Tests for Autocorrelation
6. Detecting Unusual and Influential Data
7. Tests for Model Specification

1. Introduction to Linear Regression

The command regress is used to perform linear regressions. The first variable after the regress command is always the dependent variable (left-hand-side variable); the independent variables included in the model (right-hand-side variables) follow it.

. clear
. use hs1, clear
. regress write read female

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   77.21
       Model |  7856.32118     2  3928.16059           Prob > F      =  0.0000
    Residual |  10022.5538   197  50.8759077           R-squared     =  0.4394
-------------+------------------------------           Adj R-squared =  0.4337
       Total |   17878.875   199   89.843593           Root MSE      =  7.1327

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .5658869   .0493849    11.46   0.000      .468496    .6632778
      female |   5.486894   1.014261     5.41   0.000      3.48669    7.487098
       _cons |   20.22837   2.713756     7.45   0.000     14.87663    25.58011
------------------------------------------------------------------------------
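The summary statistics on the right of the output can be recovered from the ANOVA block on the left. The lines below are a small sketch of that arithmetic. They use only the numbers printed in the table above, with Stata's display command as a calculator, so nothing here depends on the data being loaded.

```stata
* R-squared = Model SS / Total SS  (table reports 0.4394)
display 7856.32118 / 17878.875

* F(2, 197) = Model MS / Residual MS  (table reports 77.21)
display 3928.16059 / 50.8759077

* Root MSE = sqrt(Residual MS)  (table reports 7.1327)
display sqrt(50.8759077)

* Adj R-squared = 1 - (1 - R2)*(n - 1)/(n - k - 1), with n = 200 and k = 2
* (table reports 0.4337)
display 1 - (1 - 7856.32118/17878.875) * 199/197
```

Each displayed value should agree with the corresponding entry in the table up to rounding.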

2. Tests for Normality of Residuals

We use the predict command with the resid option to generate residuals, and we name the residuals r.

. predict r, resid

Shapiro-Wilk W test for Normality

To verify that the residuals are normally distributed, an assumption behind the t and F tests in regression, we use the Shapiro-Wilk W test for normal data.

. swilk r

                   Shapiro-Wilk W test for normal data

    Variable |    Obs       W          V         z     Prob>z
-------------+-------------------------------------------------
           r |    200    0.98714     1.919     1.499  0.06692

Here Prob>z = 0.067 is above 0.05, so normality of the residuals is not rejected at the 5% level.
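The decision can also be read from the results that swilk leaves behind in r(). The following is a minimal sketch, assuming hs1.dta is in the working directory as in the slides. The variable name res and the 5% cutoff are choices made for this illustration, not part of the original tutorial.

```stata
* Assumes hs1.dta (with write, read, female) is in the working directory
use hs1, clear
regress write read female

* Residuals stored under a new name so they do not clash with an existing r
predict double res, residuals

* swilk leaves W, V, z and the p-value in r()
swilk res
local p = r(p)
display "W = " r(W) ",  p-value = " `p'

* Illustrative decision rule at the 5% level
if `p' < 0.05 {
    display "normality of the residuals is rejected at the 5% level"
}
else {
    display "normality of the residuals is not rejected at the 5% level"
}
```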

The kdensity command with the normal option displays a kernel density graph of the residuals with a normal distribution superimposed, giving a visual check of the same assumption.

. kdensity r, normal

The pnorm command produces a standardized normal probability (P-P) plot, another way of checking whether the residuals from the regression are normally distributed.

. pnorm r

The qnorm command produces a normal quantile plot. It is yet another way of checking whether the residuals are normally distributed.

. qnorm r

Summary of Tests for Normality of Residuals

swilk performs the Shapiro-Wilk W test for normality.
kdensity produces a kernel density plot with a normal distribution overlaid.
pnorm graphs a standardized normal probability (P-P) plot.
qnorm plots the quantiles of varname against the quantiles of a normal distribution.

3. Tests for Heteroscedasticity

One of the basic assumptions of ordinary least squares regression is homogeneity of variance of the residuals (homoscedasticity). There are graphical and non-graphical methods for detecting heteroscedasticity.

Cook-Weisberg test for heteroskedasticity

. hettest

Cook-Weisberg test for heteroskedasticity using fitted values of write
     Ho: Constant variance
         chi2(1)      =     5.79
         Prob > chi2  =   0.0161

The null hypothesis is constant variance. Here Prob > chi2 = 0.0161 is below 0.05, so the test points to heteroscedasticity at the 5% level.
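From Stata 9 onward, post-estimation tests of this kind are documented under the estat command, and the test above is available as estat hettest. The sketch below assumes hs1.dta is in the working directory and simply shows how the statistic and p-value can be picked up from r() after the test.

```stata
* Assumes hs1.dta (with write, read, female) is in the working directory
use hs1, clear
regress write read female

* Same test as hettest, using the estat syntax; by default it uses the
* fitted values of the dependent variable
estat hettest

* The chi-squared statistic, its degrees of freedom, and the p-value
* are left in r()
display "chi2(" r(df) ") = " r(chi2) ",  Prob > chi2 = " r(p)
```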

For a graphical check, we use the rvfplot command, which plots residuals against fitted values, with the yline(0) option to put a reference line at y=0.

. rvfplot, yline(0)

Summary of Tests for Heteroscedasticity

hettest performs the Cook and Weisberg test.
rvfplot graphs a residual-versus-fitted plot.

4. Tests for Multicollinearity

Multicollinearity is a concern in multiple regression not because it exists, but because of its degree. When multicollinearity is severe, the estimated regression coefficients become unstable and their standard errors can be wildly inflated.

We can use the vif command after the regression to check for multicollinearity. vif stands for variance inflation factor.

. vif

    Variable |       VIF       1/VIF
-------------+----------------------
      female |      1.00    0.997182
        read |      1.00    0.997182
-------------+----------------------
    Mean VIF |      1.00

A variable whose VIF is greater than 10 may merit further investigation. Tolerance, defined as 1/VIF, is used to check the degree of collinearity. A tolerance lower than 0.1 corresponds to a VIF above 10.
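The VIF of a predictor is 1/(1 - R2), where R2 comes from regressing that predictor on the other predictors, and the tolerance is 1 - R2. The sketch below reproduces the entry for read by hand. It assumes hs1.dta is in the working directory, and with only two predictors the auxiliary regression has a single right-hand-side variable.

```stata
* Assumes hs1.dta (with write, read, female) is in the working directory
use hs1, clear

* Auxiliary regression: the predictor of interest on the other predictor(s)
regress read female

* Tolerance = 1 - R2 of the auxiliary regression (vif reports it as 1/VIF)
display "tolerance = " 1 - e(r2)

* VIF = 1 / tolerance
display "VIF       = " 1/(1 - e(r2))

* The same arithmetic from the printed table: 1/0.997182 is about 1.0028
display 1/0.997182
```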

Summary of Tests for Multicollinearity

vif calculates the variance inflation factor for the independent variables in the linear model.

5. Tests for Autocorrelation

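The Durbin-Watson test requires the data to be declared as time series, so we first tsset the data (here using id as the time variable) and then run dwstat after the regression.

. tsset id
        time variable:  id, 1 to 200

. dwstat

Durbin-Watson d-statistic(  3,   200) =  1.93992

A d-statistic close to 2 indicates little evidence of first-order autocorrelation in the residuals.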
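The d-statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below computes it by hand next to estat dwatson, the current documented name for this test. It assumes hs1.dta is in the working directory and that id runs from 1 to 200 without gaps, as the tsset output above suggests. The variable names e, num and den are made up for the illustration.

```stata
* Assumes hs1.dta (with id, write, read, female) is in the working directory
use hs1, clear
tsset id
regress write read female

* Built-in version (estat syntax)
estat dwatson

* By hand: d = sum((e_t - e_{t-1})^2) / sum(e_t^2)
predict double e, residuals
generate double num = (e - L.e)^2    // missing for the first observation
generate double den = e^2

quietly summarize num
scalar dw_num = r(sum)
quietly summarize den
display "d = " dw_num / r(sum)
```

Since id is an identifier rather than a genuine time index, the ordering it imposes is an assumption of the example, and the statistic is only meaningful if that ordering is.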

6. Detecting Unusual and Influential Data

Outliers: In linear regression, an outlier is an observation with a large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity, a data entry error, or some other problem.

Leverage: An observation with an extreme value on a predictor variable is called a point with high leverage. Leverage is a measure of how far an independent variable deviates from its mean. These leverage points can have an effect on the estimates of the regression coefficients.

Influence: An observation is said to be influential if removing it substantially changes the estimates of the coefficients. Influence can be thought of as the product of leverage and outlierness.

Here we summarize the general rules of thumb we use for these measures to identify observations worthy of further investigation (where k is the number of predictors and n is the number of observations).

Measure        Value
leverage       > (2k+2)/n
abs(rstu)      > 2
Cook's D       > 4/n
abs(dfits)     > 2*sqrt(k/n)
abs(dfbeta)    > 2/sqrt(n)
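For the model used in this tutorial (k = 2 predictors, n = 200 observations) the cutoffs can be worked out directly. The sketch below uses display as a calculator. The scalar names k_pred and n_obs are hypothetical and chosen only to avoid clashing with variable names in the data.

```stata
* Number of predictors and observations in "regress write read female"
scalar k_pred = 2
scalar n_obs  = 200

display "leverage    > " (2*k_pred + 2)/n_obs      // 6/200
display "Cook's D    > " 4/n_obs                   // 4/200
display "abs(dfits)  > " 2*sqrt(k_pred/n_obs)      // 2*sqrt(0.01)
display "abs(dfbeta) > " 2/sqrt(n_obs)             // 2/sqrt(200)
```

The `//` comments assume the lines are run from a do-file. The cutoffs come to 0.03, 0.02, 0.2 and roughly 0.141 respectively.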

We use the predict command with the rstudent option to generate studentized residuals, and we name the residuals r. Because r already holds the raw residuals from Section 2, that variable has to be dropped first. Studentized residuals are a type of standardized residual that can be used to identify outliers.

. drop r
. predict r, rstudent

. stem r

Stem-and-leaf plot for r (Studentized residuals)

r rounded to nearest multiple of .01
plot in units of .01

  -2** | 50,42
  -2** | 26,21
  -2** | 18
  -1** | 92,85,84,83
  -1** | 75,72,69,61,61,60
  -1** | 50,48,46,46,42
  -1** | 33,32,22,20,20,20
  -1** | 17,16,13,12,10,01
  -0** | 97,97,96,96,93,93,92,92,90,89,89,89,86,86,84,82,82,80,80
  -0** | 74,74,71,70,67
  -0** | 59,59,58,53,49,49,47,42,42,40
  -0** | 35,35,33,31,31,31,30,28,28,28,28,27,25,23,23,22
  -0** | 19,17,16,16,16,16,14,13,13,09,09,07,04,03,03,02
   0** | 00,02,02,04,04,04,04,07,09,11,14,16,16,19
   0** | 21,23,23,24,24,26,28,29,30,33,33,35,35
   0** | 40,44,44,51,51,54,54,54,54,56,56,57,57,57
   0** | 61,63,64,64,64,64,64,66,70,70,71,73,73,73,74,78
   0** | 88,88,89,93,94,94,97,98,99
   1** | 01,06,06,08,08,13,13,13,13,15,19
   1** | 23,29,32,36,36,37,37,39
   1** | 42,43,44,48,51,52,53,55
   1** | 60,68,73,73,75,77
   1** | 80,84
   2** | 16

. sort r
. list r in 1/10

              r
  1.  -2.503566
  2.  -2.421219
  3.  -2.255832
  4.  -2.210221
  5.  -2.178212
  6.  -1.916192
  7.  -1.848524
  8.  -1.843611
  9.  -1.831068
 10.  -1.750652

. list r in -10/l

              r
191.   1.551833
192.   1.602682
193.   1.677923
194.   1.726393
195.   1.730591
196.   1.749522
197.   1.774811
198.   1.798141
199.   1.840841
200.   2.160904


We should pay attention to studentized residuals that exceed +2 or -2, get more concerned about residuals that exceed +2.5 or -2.5, and get still more concerned about residuals that exceed +3 or -3.

. list r if r<-2 | r>2

              r
  1.  -2.503566
  2.  -2.421219
  3.  -2.255832
  4.  -2.210221
  5.  -2.178212
200.   2.160904

. list r if r<-2.5 | r>2.5

              r
  1.  -2.503566
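The same screening can be written with abs() and a guard against missing values, since Stata treats a missing value as larger than any number in a comparison such as r>2. This is a sketch under the assumption that hs1.dta is in the working directory. The variable name rstu is chosen here only to keep it distinct from r.

```stata
* Assumes hs1.dta (with id, write, read, female) is in the working directory
use hs1, clear
regress write read female

* Studentized residuals under a fresh name
predict double rstu, rstudent

* Observations beyond +/-2, excluding any missing residuals
list id rstu if abs(rstu) > 2 & !missing(rstu)

* How many observations pass each rule of thumb
count if abs(rstu) > 2   & !missing(rstu)
count if abs(rstu) > 2.5 & !missing(rstu)
count if abs(rstu) > 3   & !missing(rstu)
```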

To get the leverage of each observation, we use the predict command with the leverage option, and we name the new variable lev.

. predict lev, leverage
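Once the leverage values exist, the (2k+2)/n rule of thumb can be applied without hard-coding k and n, because regress leaves the model degrees of freedom in e(df_m) and the sample size in e(N). The lines below are a sketch, assuming hs1.dta is in the working directory.

```stata
* Assumes hs1.dta (with id, write, read, female) is in the working directory
use hs1, clear
regress write read female
predict double lev, leverage

* e(df_m) is the number of predictors (2) and e(N) the sample size (200),
* so the cutoff is (2*2 + 2)/200 = 0.03
display "leverage cutoff = " (2*e(df_m) + 2)/e(N)
list id read female lev if lev > (2*e(df_m) + 2)/e(N) & !missing(lev)
```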

Cook's D and DFITS both combine information on the residual and the leverage. The two measures are scaled differently but give similar answers. In the list command below, _N is the number of observations in the data, so 4/_N is the 4/n cutoff.

. predict d, cooksd
. list female read d if d>4/_N

       female   read          d
 13.     male     50   .0234054
 39.     male     47   .0212312
123.   female     57   .0202435
142.     male     76   .0327483

For DFITS, the cutoff 2*sqrt(k/n) with k = 2 predictors and n = 200 observations is 2*sqrt(2/200).

. predict dfit, dfits
. list dfit if abs(dfit)>2*sqrt(2/200)

The above measures are general measures of influence.
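A way to keep the cutoffs in step with the fitted model is to build them from the stored results e(N) and e(df_m) rather than typing the numbers. The sketch below assumes hs1.dta is in the working directory and uses the rules of thumb listed earlier in this section.

```stata
* Assumes hs1.dta (with id, write, read, female) is in the working directory
use hs1, clear
regress write read female

predict double d, cooksd
predict double dfit, dfits

* Cook's D rule of thumb: D > 4/n
list id female read d if d > 4/e(N) & !missing(d)

* DFITS rule of thumb: abs(dfits) > 2*sqrt(k/n)
list id female read dfit if abs(dfit) > 2*sqrt(e(df_m)/e(N)) & !missing(dfit)
```

Because the two measures are closely related, the two listings would be expected to overlap heavily.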

We can also consider more specific measures of influence that assess how each coefficient is changed by deleting the observation. This measure is called DFBETA and is created for each of the predictors. It is more computationally intensive than summary statistics such as Cook's D.

In Stata, the dfbeta command produces the DFBETAs for each of the predictors.

. dfbeta
                        DFread:  DFbeta(read)
                      DFfemale:  DFbeta(female)

. list DFread DFfemale in 1/5

         DFread    DFfemale
  1.   .0492348    .1971976
  2.  -.0887463   -.1617497
  3.   .0915453    .1802994
  4.   .0434659    .1740918
  5.   .0717626   -.1374498
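The names that dfbeta gives to the new variables may differ between Stata releases, so a script that needs fixed names can request each DFBETA through predict with the dfbeta() option instead. The sketch below does that and applies the 2/sqrt(n) rule of thumb. It assumes hs1.dta is in the working directory, and the names dfb_read and dfb_female are chosen for this illustration.

```stata
* Assumes hs1.dta (with id, write, read, female) is in the working directory
use hs1, clear
regress write read female

* One DFBETA per predictor, with explicit variable names
predict double dfb_read,   dfbeta(read)
predict double dfb_female, dfbeta(female)

* Flag observations where either DFBETA exceeds 2/sqrt(n) in absolute value
list id dfb_read dfb_female ///
    if (abs(dfb_read) > 2/sqrt(e(N)) | abs(dfb_female) > 2/sqrt(e(N))) ///
    & !missing(dfb_read, dfb_female)
```

The `///` line continuations assume the lines are run from a do-file.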

There are also several graphs that can be used to search for unusual and influential observations. The avplot command graphs an added-variable plot.

The avplot command works not only for variables in the model but also for variables that are not in the model, which is why it is called an added-variable plot. We can do an avplot on the variable grade.

. avplot grade

rvpplot is another convenience command. It produces a plot of the residuals versus a specified predictor and is also used after regress or anova.

. rvpplot read

lvr2plot stands for leverage-versus-residual-squared plot.

. lvr2plot

Summary of Detecting Unusual and Influential Data

predict creates predicted values, residuals, and measures of influence.
dfbeta calculates DFBETAs for all the independent variables.
avplot graphs an added-variable plot.
lvr2plot graphs a leverage-versus-squared-residual plot.
rvpplot graphs a residual-versus-predictor plot.
rvfplot graphs a residual-versus-fitted plot.

7. Tests for Model Specification

A model specification error can occur when one or more relevant variables are omitted from the model or one or more irrelevant variables are included in the model.

There are several methods to detect specification errors. The linktest command performs a model specification link test for single-equation models.

. linktest

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   79.86
       Model |  8005.11739     2  4002.55869           Prob > F      =  0.0000
    Residual |  9873.75761   197   50.120597           R-squared     =  0.4477
-------------+------------------------------           Adj R-squared =  0.4421
       Total |   17878.875   199   89.843593           Root MSE      =  7.0796

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   2.807497   1.052071     2.67   0.008     .7327302    4.882264
      _hatsq |  -.0170281   .0098827    -1.72   0.086    -.0365176    .0024615
       _cons |  -47.29516   27.77544    -1.70   0.090    -102.0705    7.480201
------------------------------------------------------------------------------

The variable of interest is _hatsq, the squared prediction. Its p-value of 0.086 is above 0.05, so the link test does not indicate a specification error at the 5% level.
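The link test amounts to refitting the dependent variable on the linear prediction and its square, which is where the rows _hat and _hatsq in the output come from. The sketch below does this by hand, assuming hs1.dta is in the working directory. The names hat and hatsq are chosen for the illustration.

```stata
* Assumes hs1.dta (with write, read, female) is in the working directory
use hs1, clear
regress write read female

* Linear prediction from the original model, and its square
predict double hat, xb
generate double hatsq = hat^2

* Refit on the prediction and its square; a significant coefficient on
* hatsq is what the link test treats as a sign of misspecification
regress write hat hatsq
```

The coefficient table from the last regression should correspond to the linktest output, with hat and hatsq in place of _hat and _hatsq.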

The ovtest command performs a regression specification error test (RESET) for omitted variables.

. ovtest

Ramsey RESET test using powers of the fitted values of write
       Ho:  model has no omitted variables
                 F(3, 194) =      1.95
                  Prob > F =      0.1233

Here Prob > F = 0.1233 is above 0.05, so the hypothesis of no omitted variables is not rejected at the 5% level.
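The RESET idea can be sketched by hand: add the second, third and fourth powers of the fitted values to the original regression and test them jointly, which gives an F test with 3 and 200 - 3 - 3 = 194 degrees of freedom. The lines below assume hs1.dta is in the working directory, and the names yhat, yhat2, yhat3 and yhat4 are made up for the example. Stata's own implementation may rescale the fitted values for numerical stability, so small numerical differences from the ovtest output are possible.

```stata
* Assumes hs1.dta (with write, read, female) is in the working directory
use hs1, clear
regress write read female

* Powers of the fitted values
predict double yhat, xb
generate double yhat2 = yhat^2
generate double yhat3 = yhat^3
generate double yhat4 = yhat^4

* Augmented regression and joint F test of the added powers
regress write read female yhat2 yhat3 yhat4
test yhat2 yhat3 yhat4
```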

Summary of Tests for Model Specification

linktest performs a link test for model specification.
ovtest performs a regression specification error test (RESET) for omitted variables.
