1 Simple and Multiple Linear Regression Assumptions

Belinda Andrews
6 years ago

The assumptions for simple linear regression are in fact special cases of the assumptions for multiple linear regression:

Check:

1. What is external validity? Which assumption is critical for external validity?
2. What is internal validity? Which assumption is critical for internal validity?
3. What null hypothesis are we typically testing? Which assumption is critical for hypothesis testing?
4. What happens when a dummy variable for male is included in a regression alongside a variable for non-male? Which assumption is violated?

2 Omitted Variable Bias

i. Practice: Signing the bias

A classmate of mine is working on a project in Mexico looking at how homicide rates (per capita) are affected by changes in police financing (per capita). Presumably, giving police more resources with which to fight crime would lower the rate of homicides in a given area. Let us imagine that the population model of homicides looks like this, with an index of gang presence in a given district as an additional explanatory variable:

$\text{homicide} = \beta_0 + \beta_1\,\text{police finance} + \beta_2\,\text{gangs} + u$   (1)

However, let's pretend that she didn't think to collect data on the prevalence of gangs in each district, so that the model she estimates is:

$\text{homicide} = \beta_0 + \beta_1\,\text{police finance} + \tilde{u}$   (2)

If we think there's an important variable missing, as gangs is above, we can sign the bias we expect from leaving gangs out of the regression simply by determining the signs of two correlations:

1. Cov(homicide, gangs), or Cov(y, omitted variable)
2. Cov(police finance, gangs), or Cov(x, omitted variable)

Often we leave out the unimportant variables. An unimportant variable is one that we're not interested in, and one that will not induce bias (i.e. bias is zero) in our coefficients of interest if we leave it out.

The bias due to the omitted variable will be zero when: ______

On a problem set or exam, when you're trying to decide how omitted variable bias might be affecting your estimation of a parameter, it helps to think of the following table:

                        Cov(x, x_ov) > 0    Cov(x, x_ov) < 0
  Cov(y, x_ov) > 0      Upward bias         Downward bias
  Cov(y, x_ov) < 0      Downward bias       Upward bias

ii. OVB: Derivation/Calculation

SLR4 fails because of an omitted variable: $E[u \mid X] \neq 0$.

Reviewing lecture from last Thursday:

Population model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$

Model with omitted $x_2$ variable: $y = \beta_0 + \beta_1 x_1 + \tilde{u}$

Suppose $x_1$ is correlated with $x_2$ in the following way:

$x_2 = \alpha + \rho x_1 + v$

Substituting this equation into the true population model we get:

$y = \beta_0 + \beta_1 x_1 + \beta_2(\alpha + \rho x_1 + v) + u$

There are extra terms! If we take the expectation of $\hat\beta_1$:

$E[\hat\beta_1] =$ ______

If $E[\hat\beta_1] \neq \beta_1$, then we say $\hat\beta_1$ is biased. What this means is that, on average, our regression estimate is going to miss the true population parameter by ______.
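Collecting terms in the substituted equation makes the "extra terms" explicit. The lines below are a sketch of the standard argument, not a transcript of the lecture; they assume that $v$ is uncorrelated with $x_1$ and that $E[u \mid x_1, x_2] = 0$ in the population model, and they reuse the symbols $\alpha$, $\rho$, and $v$ from the equations above.

```latex
\begin{align*}
y &= \beta_0 + \beta_1 x_1 + \beta_2(\alpha + \rho x_1 + v) + u \\
  &= (\beta_0 + \beta_2\alpha) + (\beta_1 + \beta_2\rho)\,x_1 + (\beta_2 v + u), \\
E[\hat\beta_1] &= \beta_1 + \beta_2\rho, \\
\operatorname{Bias}(\hat\beta_1) &= E[\hat\beta_1] - \beta_1 = \beta_2\rho .
\end{align*}
```

Under these assumptions the sign of the bias is the sign of $\beta_2$ times the sign of $\rho$, which is the pattern the signing table summarizes.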

3 Clean and Dirty Variation

In class we heard the concept of clean and dirty variation mentioned. What does it mean and why do we care? This is a very intuitive lens through which to think about the variation of certain variables. Suppose an estimation is suffering from omitted variable bias. This is a problem because, as a result, we are not getting the true relationship between our explanatory variable of interest and our outcome, β.

In the example above we wanted to explain homicides with police funding. However, we had biased results because areas with higher police funding also had higher levels of gang presence. This correlation between police funding and gang presence is considered dirty variation in police funding. It is variation in police funding that cannot be disentangled from gang presence, and it biases our estimates. However, police funding and gang presence are not perfectly collinear. There is other variation in police funding that is not directly explained by gang presence, and vice versa. If we could isolate only the clean variation in police funding that was independent of gang presence, then we could end up getting an unbiased estimate of the relationship between homicides and police funding!

But this is impossible, right? Sort of. In the example from class, we actually had data for our omitted variable. Analogously, if we had data on gang presence we could be clever and use Stata to isolate only the clean variation in police funding, which would give us an unbiased estimate of β. To do this, we generate a new model that uses gang presence to predict police funding by regressing police funding on gang presence. This predicted police funding is the dirty portion of the variation that we can't use. Instead we extract the clean variation in police funding that has nothing to do with gang presence. This clean variation is captured in the residuals of our model.

What does this do to our results? We saw that the new estimate from regressing homicides on these residuals is essentially the same as when we estimate the true population model and include the omitted variable directly in the equation! We now have $E[\hat\beta] = \beta$!

So why don't we just include this variable in the first place? Well... Good question. I'm glad you asked. Of course we would love to always be blessed with the complete set of data and variables so that we could directly estimate the true population model, but that is very rarely the case. This example was constructed to illustrate two things: 1) how we can focus on clean variation to get accurate estimates, and 2) how isolating only clean variation leads to an increase in our standard errors relative to our biased estimate. The latter happens because we are now using less of the total variation in police funding (our X variable).

The notion of clean and dirty variation will come back later in the course. Many impact evaluation methods are tools designed to help us isolate the clean variation of our variables of interest so that we can get unbiased estimates of β.
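The residual trick can be sketched in a few lines. Section used Stata; the Python below is only an illustration on simulated data, and every number in it (sample size, coefficients, noise) is an assumption made up for the sketch, not something from the class example. It regresses police funding on gang presence, keeps the residuals as the clean variation, and compares three slopes.

```python
import random

def slope(x, y):
    """OLS slope from regressing y on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    """Residuals from regressing y on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = slope(x, y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]

def slope_controlling(x1, x2, y):
    """Coefficient on x1 from regressing y on x1 and x2 (with an intercept)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    return (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)

# Hypothetical population: the true effect of police funding is -0.5,
# and gang presence raises both police funding and homicides.
rng = random.Random(0)
n = 2000
gangs = [rng.gauss(0, 1) for _ in range(n)]
police = [1 + 0.8 * g + rng.gauss(0, 1) for g in gangs]
homicide = [5 - 0.5 * p + 2.0 * g + rng.gauss(0, 1)
            for p, g in zip(police, gangs)]

naive = slope(police, homicide)                    # gangs omitted: biased
full = slope_controlling(police, gangs, homicide)  # gangs included
clean_police = residuals(gangs, police)            # clean variation only
clean = slope(clean_police, homicide)              # homicides on residuals

print("naive:", naive)
print("full: ", full)
print("clean:", clean)
```

In this sketch the slope on the residuals matches the coefficient from the regression that includes gangs, while the naive slope is pulled upward, as the signing table predicts when both covariances are positive.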

Variance of $\hat\beta$

Bringing this home (hopefully), let's recall our variance for $\hat\beta$:

$\mathrm{Var}(\hat\beta) = \dfrac{\hat\sigma^2}{SST_x\,(1 - R_x^2)}$

Check:

1. What happened to n in this formula? Don't we still care about our sample size?
2. What would happen to $R_x^2$ if we add an additional variable into our regression that is highly correlated with X?
3. What happens to $\hat\sigma^2$ if the newly added variable explains a lot of variation in Y?
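To experiment with the pieces of this formula, here is a small helper. It is only a sketch: the function name and the input values are hypothetical, chosen for illustration, and it simply evaluates the expression above.

```python
def slope_variance(sigma2_hat, sst_x, r2_x):
    """Evaluate Var(beta_hat) = sigma2_hat / (SST_x * (1 - R2_x))."""
    if not 0.0 <= r2_x < 1.0:
        raise ValueError("r2_x must lie in [0, 1)")
    if sst_x <= 0.0:
        raise ValueError("sst_x must be positive")
    return sigma2_hat / (sst_x * (1.0 - r2_x))

# Hypothetical inputs: same error variance and SST_x, different R2_x.
var_low_r2 = slope_variance(sigma2_hat=1.0, sst_x=100.0, r2_x=0.0)
var_high_r2 = slope_variance(sigma2_hat=1.0, sst_x=100.0, r2_x=0.9)

print("variance with R2_x = 0.0:", var_low_r2)
print("variance with R2_x = 0.9:", var_high_r2)
```

Changing one argument at a time is a quick way to check your answers to the questions above.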

4 Example: OVB in Action

In this section, I use the wage data (WAGE1.dta) from your textbook to demonstrate the evils of omitted variable bias and show you that the OVB formula works. Let's pretend that this sample of 500 people is our whole population of interest, so that when we run our regressions, we are actually revealing the true parameters instead of just estimates. We're interested in the relationship between wages and gender, and our omitted variable will be tenure (how long the person has been at his/her job). Suppose our population model is:

(1) $\log(\text{wage})_i = \beta_0 + \beta_1\,\text{female}_i + \beta_2\,\text{tenure}_i + u_i$

First let's look at the correlations between our variables and see if we can predict how omitting tenure will bias $\hat\beta_1$:

. corr lwage female tenure

[Correlation matrix of lwage, female, and tenure.]

If we ran the regression:

(2) $\log(\text{wage})_i = \tilde\beta_0 + \tilde\beta_1\,\text{female}_i + e_i$

...then the information above tells us that $\tilde\beta_1$ ____ $\beta_1$. Let's see if we were right. Imagine we ran the regressions in Stata (we did) and we get the results below for our two models:

(1) $\log(\text{wage})_i =$ ____ $+$ ____ $\text{female}_i +$ ____ $\text{tenure}_i + u_i$

(2) $\log(\text{wage})_i =$ ____ $+$ ____ $\text{female}_i + e_i$

From these results we now know that $\beta_1 =$ ____ and $\tilde\beta_1 =$ ____. This means that our BIAS is equal to: ____

There's one more parameter missing from our OVB formula. What regression do we have to run to find its value? The Stata results give us:

$\text{tenure} = \rho_0 + \rho_1\,\text{female} + v$

$\text{tenure} =$ ____ $+$ ____ $\text{female} + v$

Now we can plug all of our parameters into the bias formula to check that it in fact gives us the bias from leaving tenure out of our wage regression:

$\tilde\beta_1 = E[\hat{\tilde\beta}_1] = \beta_1 + \beta_2\rho_1$
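The bias formula can also be checked numerically. The sketch below does not use WAGE1.dta; it simulates a stand-in data set, and the coefficients used to generate it are assumptions invented for illustration, not the Stata results from section. It runs the three regressions described above and confirms that the coefficient on female from the short regression equals the long-regression coefficient plus the tenure coefficient times ρ1.

```python
import random

def simple_slope(x, y):
    """OLS slope from regressing y on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def two_slopes(x1, x2, y):
    """Coefficients (b1, b2) from regressing y on x1 and x2 (with an intercept)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Simulated stand-in for the wage data (all parameter values are hypothetical).
rng = random.Random(1)
n = 500
female = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(n)]
tenure = [max(0.0, 6.0 - 2.5 * f + rng.gauss(0, 3)) for f in female]
lwage = [1.5 - 0.3 * f + 0.03 * t + rng.gauss(0, 0.4)
         for f, t in zip(female, tenure)]

beta1, beta2 = two_slopes(female, tenure, lwage)  # model (1): with tenure
beta1_tilde = simple_slope(female, lwage)         # model (2): tenure omitted
rho1 = simple_slope(female, tenure)               # tenure on female

print("short-regression coefficient:", beta1_tilde)
print("beta1 + beta2 * rho1:        ", beta1 + beta2 * rho1)
print("bias (beta2 * rho1):         ", beta2 * rho1)
```

Within a single data set the two printed coefficients agree up to rounding error, because the relationship is an algebraic identity of OLS and not only a statement about expectations.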

5 OVB Intuition (For your own reference)

For further intuition on omitted variable bias, I like to think of an archer. When our MLR1-4 hold, the archer is aiming the arrow directly at the center of the target. If he/she misses, it's due to random fluctuations in the air that push the arrow around, or maybe imperfections in the arrow that send it a little off course. When MLR1-4 do not all hold, like when we have an omitted variable, the archer is no longer aiming at the center of the target. There are still puffs of air and feather imperfections that send the arrow off course, but the course wasn't even the right one to begin with! The arrow (which you should think of as our $\hat\beta$) misses the center of the target (which you should think of as our true β) systematically.

To demonstrate this, I did the following:

1. Take a random sample of 150 people out of the 500 that are in WAGE1.dta.
2. Estimate $\hat\beta_1$ using OLS, controlling for tenure, with these 150 people.
3. Estimate $\hat\alpha_1$ using OLS (NOT controlling for tenure) with these 150 people.
4. Repeat 6000 times.

At the end of all of the above, I end up with 6000 biased estimates ($\hat\alpha_1$) and 6000 unbiased estimates ($\hat\beta_1$) of the effect of female. I plotted the kernel density of the biased estimates alongside that of the unbiased estimates. You can see how the biased distribution is shifted to the left, indicating a downward bias!

Figure 1. Kernel densities for biased and unbiased estimates. [Vertical axis: density; horizontal axis: effect of female on ln(wage); series: alphahat_1, betahat_1.]
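The resampling exercise can be sketched as follows. This is not the code used for Figure 1: it replaces WAGE1.dta with a simulated population of 500 whose parameters are invented for illustration, and it uses 1000 repetitions instead of 6000 to keep the run short. The list names alpha_hats and beta_hats mirror the labels in the figure.

```python
import random
from statistics import mean

def coef_on_female(female, lwage, tenure=None):
    """OLS coefficient on female; controls for tenure when it is supplied."""
    n = len(lwage)
    mf, my = sum(female) / n, sum(lwage) / n
    sff = sum((f - mf) ** 2 for f in female)
    sfy = sum((f - mf) * (y - my) for f, y in zip(female, lwage))
    if tenure is None:
        return sfy / sff
    mt = sum(tenure) / n
    stt = sum((t - mt) ** 2 for t in tenure)
    sft = sum((f - mf) * (t - mt) for f, t in zip(female, tenure))
    sty = sum((t - mt) * (y - my) for t, y in zip(tenure, lwage))
    return (stt * sfy - sft * sty) / (sff * stt - sft ** 2)

# Simulated stand-in population (all parameter values are hypothetical).
rng = random.Random(2)
POP, SAMPLE, REPS = 500, 150, 1000  # the handout repeats 6000 times
female = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(POP)]
tenure = [max(0.0, 6.0 - 2.5 * f + rng.gauss(0, 3)) for f in female]
lwage = [1.5 - 0.3 * f + 0.03 * t + rng.gauss(0, 0.4)
         for f, t in zip(female, tenure)]

alpha_hats, beta_hats = [], []
for _ in range(REPS):
    idx = rng.sample(range(POP), SAMPLE)  # 150 people, without replacement
    f = [female[i] for i in idx]
    t = [tenure[i] for i in idx]
    y = [lwage[i] for i in idx]
    alpha_hats.append(coef_on_female(f, y))     # NOT controlling for tenure
    beta_hats.append(coef_on_female(f, y, t))   # controlling for tenure

print("mean of biased estimates (alphahat_1):  ", mean(alpha_hats))
print("mean of unbiased estimates (betahat_1): ", mean(beta_hats))
```

Plotting kernel densities of the two lists would give a picture in the spirit of Figure 1; here the comparison of means is enough to see the distribution of the biased estimates sitting to the left.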
