Introduction to regression

Regression describes how one variable (the response) depends on another variable (the explanatory variable).

Response variable: the variable of interest; it measures the outcome of a study.
Explanatory variable: explains (or even causes) changes in the response variable.

Examples:
  Hearing difficulties: response - sound level (decibels); explanatory - age (years)
  Real estate market: response - listing price ($); explanatory - house size (sq. ft.)
  Salaries: response - salary ($); explanatory - experience (years), education, sex

Least squares regression, Jan 4, 4 - -
Introduction to regression

Example: Food expenditures and income

Data: Sample of households

[Figure: scatterplot of food expenditure against income]

Questions:
  How does food expenditure (Y) depend on income (X)?
  Suppose we know that X = x; what can we tell about Y?

Linear regression: If the response Y depends linearly on the explanatory variable X, we can use a straight line (the regression line) to predict Y from X.
Least squares regression

How to find the regression line

[Figure: scatterplot of food expenditure against income; a second panel marks the observed $y$, the predicted $\hat{y}$, and the difference $y - \hat{y}$]

Since we intend to predict Y from X, the errors of interest are mispredictions of Y for fixed X.

The least squares regression line of Y on X is the line that minimizes the sum of squared errors.

For observations $(x_1, y_1), \ldots, (x_n, y_n)$, the regression line is given by

  $\hat{Y} = a + bX$

where

  $b = r \, \frac{s_y}{s_x}$  and  $a = \bar{y} - b \, \bar{x}$

($r$: correlation coefficient; $s_x$, $s_y$: standard deviations; $\bar{x}$, $\bar{y}$: means).
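The formulas above can be sketched in plain Python. The function name `least_squares_line` and the five data points are illustrative assumptions, not taken from the slides.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y_hat = a + b * x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    # Sample correlation coefficient r
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x                # slope
    a = y_bar - b * x_bar            # intercept
    return a, b

# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]
a, b = least_squares_line(xs, ys)
print(f"y_hat = {a:.3f} + {b:.3f} x")
```

Computing the slope as r * s_y / s_x mirrors the slide's formula; it is algebraically the same as dividing the sum of cross-products by the sum of squared x deviations.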
Least squares regression

Example: Food expenditure and income

Data (income X, food expenditure Y):
  X  8 6 3 4 54 59 44 3 4 8
  Y  5. 5. 5.6 4.6.3 8. 7.8 5.8 5. 8.
  X  4 58 8 4 47 85 3 6
  Y  4.9.8 5. 4.8 7.9 6.4. 3.7 5..9

The summary statistics are:

  $\bar{x} = 45.5$, $s_x = 23.96$, $\bar{y} = 7.97$, $s_y = 4.66$, $r = 0.946$

The regression coefficients are:

  $b = r \, \frac{s_y}{s_x} = 0.946 \cdot \frac{4.66}{23.96} = 0.184$

  $a = \bar{y} - b \, \bar{x} = 7.97 - 0.184 \cdot 45.5 = -0.40$

[Figure: scatterplot of food expenditure against income with the fitted regression line]
Interpreting the regression model

The response in the model is denoted $\hat{Y}$ to indicate that these are predicted Y values, not the true Y values. The hat denotes prediction.

The slope of the line indicates how much $\hat{Y}$ changes for a unit change in X.

The intercept is the value of $\hat{Y}$ for X = 0. It may or may not have a physical interpretation, depending on whether or not X can take values near 0.

To make a prediction for an unobserved X, just plug it in and calculate $\hat{Y}$.

Note that the line need not pass through the observed data points. In fact, it often will not pass through any of them.
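As a rough check of this interpretation, the sketch below recomputes the slope and intercept from the summary statistics of the food expenditure example and plugs in an income value. It assumes s_x = 23.96, the value consistent with b = r s_y / s_x = 0.184; the income value 60 and the helper name `predict` are hypothetical.

```python
# Summary statistics of the food expenditure example
x_bar, s_x = 45.5, 23.96   # income: mean and standard deviation (s_x is an assumption, see above)
y_bar, s_y = 7.97, 4.66    # food expenditure: mean and standard deviation
r = 0.946                  # correlation coefficient

b = r * s_y / s_x          # slope: change in predicted Y per unit change in X
a = y_bar - b * x_bar      # intercept: predicted Y at X = 0

def predict(x):
    """Predicted food expenditure for income x (hypothetical helper)."""
    return a + b * x

print(f"b = {b:.3f}, a = {a:.2f}")
print(f"predicted food expenditure at income 60: {predict(60):.2f}")
# Raising income by one unit raises the prediction by exactly b.
print(f"predict(61) - predict(60) = {predict(61) - predict(60):.3f}")
```

The intercept here is slightly negative, which illustrates the point about interpretation: a negative food expenditure at income 0 has no physical meaning, so the line should only be read within the range of observed incomes.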
Regression and correlation

Correlation analysis: We are interested in the joint distribution of two (or more) quantitative variables.

Example: Heights of 1,078 fathers and sons

[Figure: scatterplot of son's height (inches) against father's height (inches)]

Points are scattered around the SD line:

  $(y - \bar{y}) = \frac{s_y}{s_x} \, (x - \bar{x})$

  goes through the center $(\bar{x}, \bar{y})$
  has slope $s_y / s_x$

The correlation $r$ measures how much the points spread around the SD line.
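To make the difference between the SD line and the regression line concrete, the sketch below computes both slopes on simulated data. The heights are synthetic (generated with assumed means and spreads), not the 1,078 father-son pairs from the slide.

```python
import random
from statistics import mean, stdev

random.seed(0)
# Synthetic father/son-style heights in inches (hypothetical, not the slide data)
fathers = [random.gauss(68, 2.7) for _ in range(1000)]
sons = [69 + 0.5 * (f - 68) + random.gauss(0, 2.4) for f in fathers]

n = len(fathers)
x_bar, y_bar = mean(fathers), mean(sons)
s_x, s_y = stdev(fathers), stdev(sons)
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(fathers, sons)) / ((n - 1) * s_x * s_y)

sd_slope = s_y / s_x        # SD line: through (x_bar, y_bar) with slope s_y / s_x
reg_slope = r * s_y / s_x   # regression line: same center, flatter by the factor r

print(f"r = {r:.2f}, SD line slope = {sd_slope:.2f}, regression slope = {reg_slope:.2f}")
```

Both lines pass through the point of means; whenever |r| < 1 the regression line is the flatter of the two.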
Regression and correlation

Regression analysis: We are interested in how the distribution of one response variable depends on one (or more) explanatory variables.

Example: Heights of 1,078 fathers and sons

[Figure: scatterplot of son's height (inches) against father's height (inches) with vertical strips marked; density panels show the distribution of son's height (inches) for fathers of a given height, including 64 and 68 inches]

In each vertical strip, the points are distributed around the regression line.
Properties of least squares regression

The distinction between explanatory and response variables is essential. Looking at vertical deviations means that changing the axes would change the regression line.

[Figure: scatterplot of son's height (inches) against father's height (inches) with both regression lines, $\hat{y} = a + bx$ and $\hat{x} = a + by$]

A change of 1 SD in X corresponds to a change of $r$ SDs in the predicted value $\hat{Y}$.

The least squares regression line always passes through the point $(\bar{x}, \bar{y})$.

$r^2$ (the square of the correlation) is the fraction of the variation in the values of y that is explained by the least squares regression on x. When reporting the results of a linear regression, you should report $r^2$.

These properties depend on the least-squares fitting criterion and are one reason why that criterion is used.
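These properties can be checked numerically. The sketch below uses a small hypothetical data set and a helper `fit` (both assumptions for illustration) to confirm that the line passes through the point of means, that r squared equals the explained fraction of variation, and that swapping the axes gives a different line.

```python
from statistics import mean

def fit(xs, ys):
    """Least squares fit of ys on xs; returns (intercept, slope, r)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx                  # algebraically equal to r * s_y / s_x
    a = y_bar - b * x_bar
    r = sxy / (sxx * syy) ** 0.5
    return a, b, r

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 1, 4, 3, 6, 5]

a, b, r = fit(xs, ys)      # regression of y on x
a2, b2, _ = fit(ys, xs)    # regression of x on y (axes swapped)

# 1. The line passes through (x_bar, y_bar).
through_center = abs((a + b * mean(xs)) - mean(ys)) < 1e-12

# 2. r^2 is the fraction of the variation in y explained by the regression.
y_bar = mean(ys)
sst = sum((y - y_bar) ** 2 for y in ys)
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
explained = 1 - sse / sst

# 3. Drawn in the same (x, y) plane, the x-on-y line has slope 1 / b2,
#    which differs from b unless |r| = 1.
print(f"passes through the center: {through_center}")
print(f"r^2 = {r * r:.4f}, explained fraction = {explained:.4f}")
print(f"y-on-x slope = {b:.4f}, x-on-y line drawn in the same plane = {1 / b2:.4f}")
```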
Regression effect

The regression effect

In virtually all test-retest situations, the bottom group on the first test will on average show some improvement on the second test, and the top group will on average fall back. This is the regression effect.

The statistician and geneticist Sir Francis Galton (1822-1911) called this effect "regression to mediocrity".

[Figure: scatterplot of son's height (inches) against father's height (inches)]

Regression fallacy

Thinking that the regression effect must be due to something important, not just the spread around the line, is the regression fallacy.
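A small simulation illustrates the regression effect in a test-retest setting. Everything here is an assumption made for illustration: each person has a fixed ability, and the two test scores add independent noise to it. Nothing changes between the tests, yet the extreme groups move toward the middle.

```python
import random
from statistics import mean

random.seed(1)
n = 10_000
# Hypothetical model: score = stable ability + independent measurement noise
ability = [random.gauss(100, 10) for _ in range(n)]
test1 = [a + random.gauss(0, 10) for a in ability]
test2 = [a + random.gauss(0, 10) for a in ability]

# Rank people by their first test and take the bottom and top 10%
pairs = sorted(zip(test1, test2))
k = n // 10
bottom, top = pairs[:k], pairs[-k:]

bottom_change = mean(t2 - t1 for t1, t2 in bottom)
top_change = mean(t2 - t1 for t1, t2 in top)

print(f"bottom 10% on test 1: average change on test 2 = {bottom_change:+.1f}")
print(f"top 10% on test 1:    average change on test 2 = {top_change:+.1f}")
```

The bottom group improves and the top group falls back purely because of the spread around the line; attributing the change to an intervention would be the regression fallacy.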
Regression in STATA

. infile food income size using food.txt
. graph twoway scatter food income || lfit food income, legend(off)
>   ytitle(food)
. regress food income

      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  1,    18) =  151.97
       Model |   369.5730     1    369.5730            Prob > F      =  0.0000
    Residual |    43.7725    18      2.4318            R-squared     =  0.8941
-------------+------------------------------           Adj R-squared =  0.8882
       Total |   413.3455    19     21.7550            Root MSE      =  1.5594

------------------------------------------------------------------------------
        food |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |     0.1841     0.0149   12.33   0.000       0.1527      0.2155
       _cons |    -0.4120     0.7638   -0.54   0.596      -2.0166      1.1926
------------------------------------------------------------------------------

[Figure: scatterplot of food expenditure against income with the fitted regression line]

This graph has been generated using the graphical user interface of STATA. The complete command is:

. twoway (scatter food income, msymbol(circle) msize(medium) mcolor(black))
>   (lfit food income, range( ) clcolor(black) clpat(solid) clwidth(medium)),
>   ytitle(food expenditure, size(large)) ylabel(, valuelabel angle(horizontal)
>   labsize(medlarge)) xtitle(income, size(large)) xscale(range( ))
>   xlabel((), labsize(medlarge)) legend(off) ysize() xsize(3)
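The summary quantities in the regress output are linked by standard identities, which help when reading the table. The notation below (SS for sums of squares, MS for mean squares, n for the number of observations) is introduced here for illustration and assumes a regression with a single explanatory variable.

```latex
\begin{align*}
R^2 &= \frac{SS_{\text{Model}}}{SS_{\text{Total}}} = r^2, &
F &= \frac{MS_{\text{Model}}}{MS_{\text{Residual}}} = t_{\text{slope}}^2, \\
\text{Root MSE} &= \sqrt{MS_{\text{Residual}}}, &
t &= \frac{\text{Coef.}}{\text{Std. Err.}}, \\
\text{Adj } R^2 &= 1 - (1 - R^2)\,\frac{n - 1}{n - 2}. &&
\end{align*}
```

In particular, the R-squared reported by STATA is the square of the correlation coefficient between the two variables.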
Residual plots

Residuals: the difference between observed and predicted values,

  $e_i = \text{observed } y - \text{predicted } y = y_i - \hat{y}_i = y_i - (a + b \, x_i)$

For a least squares regression, the residuals always have mean zero.

Residual plot

A residual plot is a scatterplot of the residuals against the explanatory variable. It is a diagnostic tool to assess the fit of the regression line.

Patterns to look for:
  Curvature indicates that the relationship is not linear.
  Increasing or decreasing spread indicates that the prediction will be less accurate in the range of explanatory variables where the spread is larger.
  Points with large residuals are outliers in the vertical direction.
  Points that are extreme in the x direction are potential high influence points.

Influential observations are individuals with extreme x values that exert a strong influence on the position of the regression line. Removing them would significantly change the regression line.
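The sketch below computes residuals for a deliberately curved, hypothetical data set (y = x squared) and prints them against x as a crude text version of a residual plot. The helper `fit_line` is an assumption for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least squares intercept and slope of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

# Hypothetical data with a curved relationship
xs = list(range(11))
ys = [x ** 2 for x in xs]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(f"mean residual = {mean(residuals):.6f}")
# Residuals against the explanatory variable
for x, e in zip(xs, residuals):
    print(f"x = {x:2d}   residual = {e:7.2f}")
```

The residuals average to zero, as they always do for a least squares fit, but they are positive at both ends and negative in the middle. That systematic pattern is the curvature signal described above.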
Regression Diagnostics

Example: First data set

[Figure: plot of Y with a companion plot against fitted values]

Residuals are regularly distributed.
Regression Diagnostics

Example: Second data set

[Figure: plot of Y with a companion plot against fitted values]

Functional relationship other than linear.
Regression Diagnostics

Example: Third data set

[Figure: plot of Y with a companion plot against fitted values]

Outlier; the regression line misfits the majority of the data.
Regression Diagnostics

Example: Fourth data set

[Figure: plot of Y with a companion plot against fitted values]

Heteroscedasticity.
Regression Diagnostics

Example: Fifth data set

[Figure: plot of Y with a companion plot against fitted values]

One separate point in the direction of x, highly influential.
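One simple way to see how influential a single point is: refit the line with each observation left out in turn and compare the slopes. The data below are hypothetical (five clustered points plus one point far out in the x direction), and the helper `slope` is an assumption for illustration.

```python
from statistics import mean

def slope(xs, ys):
    """Least squares slope of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    return (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))

# Hypothetical data: the last point is extreme in the x direction
xs = [1, 2, 3, 4, 5, 20]
ys = [2.1, 1.9, 2.0, 2.2, 1.8, 10.0]

b_all = slope(xs, ys)
print(f"slope with all points: {b_all:.3f}")

# Leave each point out in turn and refit
for i in range(len(xs)):
    b_i = slope(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    print(f"without point {i} (x = {xs[i]:2d}): slope = {b_i:.3f}")

b_without_last = slope(xs[:-1], ys[:-1])
```

Dropping any of the clustered points barely moves the slope, while dropping the isolated point changes it from clearly positive to roughly zero. That is the behavior of a highly influential observation.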