First of two parts Joseph Hogan Brown University and AMPATH
Overview
  What is regression?
  Does regression have to be linear?
  Case study: modeling the relationship between weight and CD4 count
    Exploratory analysis
    Linear model (???)
    Nonlinear models: quadratic and exponential
  Summary and practical recommendations
What is regression?
  One explanatory variable
  Functional form is $E(Y \mid X = x) = g(x)$
  Translation: the average value of Y when X = x follows the function g(x)
What is the correct function?
  Most of the time, regression is thought of as linear: $E(Y \mid X = x) = \alpha + \beta x$
  Most of the time, life is not linear!
  Examples
    Growth over time
    PK/PD characteristics of a new drug
Example: Weight vs CD4
  1,100 individuals initiating cART
  At cART initiation, measure weight and CD4 count
  What is the relationship between these variables?
[Figure: scatterplot of weight (kg) against CD4 count / 100]
Fitting a model
  Objective: fit a model to characterize this relationship
  Steps
    Explore the relationship without a model
    Find a function that best characterizes the relationship
    Interpret the model
  Try not to be confined to linearity
Exploration
  Can use model-free methods to estimate the functional relationship
  Key idea
    Within windows of X, compute the mean of Y
    Move the window a little bit at a time
    Connect the dots in a smooth way
  Tool: LOWESS (LOcally WEighted Scatterplot Smoothing)
    Available in Stata
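The window idea can be sketched in a few lines of Python. This is a simplified, unweighted moving average, not LOWESS itself (which fits weighted local regressions within each window). The function name `window_means`, the window width, the step size, and the simulated data are illustrative assumptions and are not part of the original analysis.

```python
import random

def window_means(x, y, width, step):
    """Return (window centre, mean of y) for windows of x slid along by `step`."""
    points = []
    start = min(x)
    while start + width <= max(x) + 1e-9:
        in_window = [yi for xi, yi in zip(x, y) if start <= xi <= start + width]
        if in_window:  # skip windows that contain no observations
            points.append((start + width / 2, sum(in_window) / len(in_window)))
        start += step
    return points

# Simulated (hypothetical) data: CD4 count / 100 on x, weight in kg on y.
random.seed(1)
cd4_100 = [random.uniform(0, 10) for _ in range(500)]
weight = [64 - 10 * 0.6 ** c + random.gauss(0, 8) for c in cd4_100]

# Connecting these points gives a rough, model-free picture of the mean curve.
for centre, mean_wt in window_means(cd4_100, weight, width=2.0, step=1.0):
    print(f"CD4/100 near {centre:4.1f}: mean weight {mean_wt:5.1f} kg")
```

Narrower windows follow the data more closely but give noisier means; wider windows are smoother but can hide curvature.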
[Figure: the same scatterplot of weight (kg) against CD4 count / 100, with a LOWESS smoother overlaid (bandwidth = 0.4)]
[Figure: fitted values and LOWESS curve (lowess weight cd4_100); weight axis zoomed to 50-70 kg, CD4 count / 100 from 0 to 10]
Linear regression (?)
  Regress Y on X (weight on CD4)
  Assumes a linear relationship: $E(Y \mid X = x) = \alpha + \beta x$
    $\alpha$ = intercept (mean weight when CD4 = 0)
    $\beta$ = slope (difference in mean weight between those who differ by 100 CD4 units)
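A minimal sketch of how the intercept and slope of such a line can be estimated by ordinary least squares. The function name `fit_line` and the five data points are hypothetical and chosen only for illustration; they are not the study data.

```python
def fit_line(x, y):
    """Ordinary least squares estimates (alpha, beta) for y = alpha + beta * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx                 # slope
    alpha = mean_y - beta * mean_x   # intercept
    return alpha, beta

# Hypothetical toy data: CD4 count / 100 and weight in kg.
cd4_100 = [1, 2, 3, 5, 8]
weight = [55.0, 57.0, 58.0, 60.0, 62.0]

alpha, beta = fit_line(cd4_100, weight)
print(f"intercept = {alpha:.2f} kg, slope = {beta:.2f} kg per 100 CD4 units")
```

Because CD4 is entered in units of 100, the slope reads directly as a difference in mean weight per 100 CD4 units.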
Fitted linear regression
  $\alpha$ = 53.7
  $\beta$ = 1.21 (SE = 0.14, p < 0.001)
  How well does this model fit?
[Figure: linear regression fitted values and LOWESS curve (lowess weight cd4_100); weight axis zoomed to 50-70 kg, CD4 count / 100 from 0 to 10]
[Figure: the same plot with lower and upper bounds (lpred1, upred1) added around the fitted values]
Brief digression
  What does the slope mean in a linear model?
  $\beta$ = 1.21 (SE = 0.14, p < 0.001)
  (a) If CD4 changes by 100 units, weight will change by 1.21 kg
  (b) Two individuals who differ by 100 CD4 units will differ, on average, by 1.21 kg in weight
Correct answer is (b)
  The first interpretation assumes that if we increase a person's CD4 by 100 units, we will increase that person's weight by 1.21 kg
    Longitudinal comparison; requires repeated measures
  The second interpretation compares the weights of separate individuals who differ by 100 units in CD4 count
    Compares different individuals at a single point in time
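As a worked illustration of interpretation (b), using the fitted linear model ($\alpha$ = 53.7, $\beta$ = 1.21, CD4 in units of 100), consider two groups of individuals with CD4 counts of 300 and 500. The two CD4 values are chosen here only as an example.

```latex
\begin{align*}
E(\mathrm{Wt} \mid \mathrm{CD4} = 300) &= 53.7 + 1.21 \times 3 = 57.33 \text{ kg} \\
E(\mathrm{Wt} \mid \mathrm{CD4} = 500) &= 53.7 + 1.21 \times 5 = 59.75 \text{ kg} \\
\text{Difference in mean weight} &= 59.75 - 57.33 = 2.42 \text{ kg} = 2 \times 1.21 \text{ kg}
\end{align*}
```

This is a statement about the average weights of two different groups at one point in time. It says nothing about how one person's weight would change if that person's CD4 count rose from 300 to 500.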
Think outside the line
  Quadratic model
    Adds curvature
    Can be restrictive
  Exponential model
    Useful for capturing leveling-off behavior
Quadratic model
  $g(x) = \alpha + \beta_1 x + \beta_2 x^2$
  Applied to CD4 and weight: $E(\mathrm{Wt} \mid \mathrm{CD4}) = \alpha + \beta_1 \mathrm{CD4} + \beta_2 \mathrm{CD4}^2$
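Although the curve is not a straight line, the quadratic model is still linear in its parameters, so it can be fitted by ordinary least squares with CD4 and CD4 squared as two predictors. The sketch below solves the normal equations directly; the helper names `solve` and `fit_quadratic` and the toy data are hypothetical, and in practice a statistics package would do this step.

```python
def solve(a, b):
    """Solve the square linear system a * coef = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest remaining entry into the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - tail) / m[r][r]
    return coef

def fit_quadratic(x, y):
    """Least squares estimates (alpha, beta1, beta2) for y = alpha + beta1*x + beta2*x^2."""
    design = [[1.0, xi, xi * xi] for xi in x]
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(3)] for i in range(3)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(3)]
    return tuple(solve(xtx, xty))

# Hypothetical toy data: CD4 count / 100 and weight in kg.
cd4_100 = [0.5, 1, 2, 3, 4, 6, 8]
weight = [54.5, 56.0, 58.5, 60.0, 61.0, 61.5, 60.5]

alpha, beta1, beta2 = fit_quadratic(cd4_100, weight)
print(f"alpha = {alpha:.2f}, beta1 = {beta1:.2f}, beta2 = {beta2:.3f}")
```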
Fitted quadratic model
  Linear term: $\beta_1$ = 2.0 (SE = 0.21, p < 0.001)
    On average, higher CD4 is associated with higher weight
  Quadratic term: $\beta_2$ = -0.26 (SE = 0.05, p < 0.001)
    Implies negative curvature
[Figure: quadratic model fit with lower and upper bounds (l2, u2) and LOWESS curve (lowess weight cd4_100); weight axis zoomed to 50-65 kg, CD4 count / 100 from 0 to 10]
Quadratic model assessment
  Technically the model fits OK
  Does not capture leveling off
  Degree of curvature implies that the highest CD4 counts are associated with lower weights
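The last point can be checked from the fitted coefficients alone. A quadratic with negative curvature peaks at $x = -\beta_1 / (2\beta_2)$ and declines afterwards. The sketch below uses the rounded estimates reported for the quadratic model, so the results are approximate; the intercept is not reported, so only differences in fitted mean weight are computed. The function name `change_from_peak` is illustrative.

```python
beta1 = 2.0    # fitted linear term (rounded, as reported)
beta2 = -0.26  # fitted quadratic term (rounded, as reported)

# Location of the peak of the fitted curve, in units of 100 CD4 cells.
turning_point = -beta1 / (2 * beta2)

def change_from_peak(x):
    """Fitted mean weight at x minus fitted mean weight at the peak, in kg."""
    return beta1 * (x - turning_point) + beta2 * (x ** 2 - turning_point ** 2)

print(f"fitted curve peaks near CD4 = {100 * turning_point:.0f}")
print(f"fitted mean weight at CD4 = 1000 relative to the peak: {change_from_peak(10):.1f} kg")
```

With these estimates the fitted curve turns downward at a CD4 count of roughly 385, which is why the quadratic model predicts lower mean weights at the highest CD4 counts instead of a plateau.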
Exponential model
  Form of regression is nonlinear: $g(x) = \alpha + \beta \phi^{x}$
  In terms of Wt and CD4: $E(\mathrm{Wt} \mid \mathrm{CD4}) = \alpha + \beta \phi^{\mathrm{CD4}}$
Interpretation
  $E(\mathrm{Wt} \mid \mathrm{CD4}) = \alpha + \beta \phi^{\mathrm{CD4}}$
  $\alpha$ = leveling-off point for Wt at high CD4
  $\beta$ = difference in Wt between those with CD4 near zero and the leveling-off point
  $\phi$ = how fast Wt reaches its limiting value
    Quickly if $\phi$ is near zero
    Slowly if $\phi$ is near 1
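The slides do not say how the exponential model was fitted. One simple possibility, sketched here as an assumption and not as the author's method, is a grid search: for each candidate $\phi$ between 0 and 1 the model is linear in $\alpha$ and $\beta$ (with predictor $\phi^{x}$), so those two can be estimated by ordinary least squares, and the $\phi$ with the smallest residual sum of squares is kept. The names `fit_line` and `fit_exponential`, the grid size, and the simulated data are all hypothetical.

```python
import random

def fit_line(z, y):
    """Ordinary least squares estimates (alpha, beta) for y = alpha + beta * z."""
    n = len(z)
    mean_z = sum(z) / n
    mean_y = sum(y) / n
    szz = sum((zi - mean_z) ** 2 for zi in z)
    szy = sum((zi - mean_z) * (yi - mean_y) for zi, yi in zip(z, y))
    beta = szy / szz
    return mean_y - beta * mean_z, beta

def fit_exponential(x, y, grid_size=999):
    """Grid search over phi in (0, 1) for y = alpha + beta * phi**x."""
    best = None
    for k in range(1, grid_size + 1):
        phi = k / (grid_size + 1)
        z = [phi ** xi for xi in x]       # transformed predictor for this phi
        alpha, beta = fit_line(z, y)      # best alpha and beta given phi
        sse = sum((yi - alpha - beta * zi) ** 2 for zi, yi in zip(z, y))
        if best is None or sse < best[0]:
            best = (sse, alpha, beta, phi)
    return best[1], best[2], best[3]

# Simulated (hypothetical) data: CD4 count / 100 and weight in kg.
random.seed(2)
cd4_100 = [random.uniform(0, 10) for _ in range(200)]
weight = [64 - 10 * 0.6 ** c + random.gauss(0, 3) for c in cd4_100]

alpha, beta, phi = fit_exponential(cd4_100, weight)
print(f"alpha = {alpha:.1f} kg, beta = {beta:.1f} kg, phi = {phi:.2f}")
```

A dedicated nonlinear least squares routine would also provide standard errors, which this sketch does not.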
Fitted model
  $\alpha$ = leveling-off Wt for large CD4 values
    Estimate = .4 kg
  $\beta$ = difference in Wt between CD4 near zero and the leveling-off point
    Estimate = -9.9 kg
  $\phi$ = how fast Wt reaches its leveling-off point
    Estimate = 0.59
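To get a feel for what the fitted $\beta$ and $\phi$ imply, the sketch below computes how far the fitted mean weight sits below the leveling-off weight at several CD4 counts, and the CD4 count at which half of that gap has closed. It uses only the reported estimates of $\beta$ and $\phi$ and assumes CD4 is in units of 100, as in the plots; the names `gap_to_limit` and `half_gap` are illustrative.

```python
import math

beta = -9.9  # fitted difference between CD4 near zero and the leveling-off weight (kg)
phi = 0.59   # fitted rate parameter

def gap_to_limit(cd4_100):
    """Fitted mean weight minus the leveling-off weight alpha, in kg."""
    return beta * phi ** cd4_100

# Value of CD4 / 100 at which half of the initial gap has closed: phi**x = 0.5.
half_gap = math.log(0.5) / math.log(phi)

for x in (0, 1, 2, 5, 10):
    print(f"CD4 = {100 * x:4d}: {gap_to_limit(x):6.2f} kg relative to the leveling-off weight")
print(f"half of the gap has closed by CD4 of about {100 * half_gap:.0f}")
```

Under these estimates most of the difference in mean weight occurs at low CD4 counts, and the fitted curve is nearly flat at the upper end of the CD4 range.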
[Figure: exponential model fit (wtpred3) with 95% lower and upper bounds and LOWESS curve (lowess weight cd4_100); weight axis zoomed to 50-65 kg, CD4 count / 100 from 0 to 10]
Summary
  Regression characterizes the mean value of Y as a function of X
  Today's example
    Y = weight in kg
    X = CD4 count
  Regression is a very broad topic
  Today's theme: think outside the line
Practical suggestions
  Use scatterplots and exploratory analysis
  Use LOWESS curves to approximate relationships in the scatterplot
  If the relationship is nonlinear, take this into account
    Especially important for predictive models, for example if you want to predict Wt from CD4 count
Next lecture
  Multiple regression (more than one predictor)
  Focus
    Analysis of change from baseline
    Adjusting for one or more variables when testing a hypothesis about a primary variable