Biostat/Stat 571: Coursework 5

Lee Harris

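Answer Key, 7th February, 2005

Question 1. In general,

$$
\begin{aligned}
\mathrm{cov}(\theta_i, \theta_j) &= E[\theta_i \theta_j] - E[\theta_i]\,E[\theta_j] \\
&= E_\phi\big[E[\theta_i \theta_j \mid \phi]\big] - E_\phi\big[E[\theta_i \mid \phi]\big]\,E_\phi\big[E[\theta_j \mid \phi]\big] \\
&= E_\phi\big[E[\theta_i \mid \phi]\,E[\theta_j \mid \phi]\big] - E_\phi\big[E[\theta_i \mid \phi]\big]\,E_\phi\big[E[\theta_j \mid \phi]\big],
\end{aligned}
$$

where the last line uses $E[\theta_i \theta_j \mid \phi] = E[\theta_i \mid \phi]\,E[\theta_j \mid \phi]$, which holds when $\theta_i$ and $\theta_j$ are conditionally independent given $\phi$. Now, assuming $E[\theta_i \mid \phi] = E[\theta_j \mid \phi] = f(\phi)$, the above may be written

$$
E[f(\phi)^2] - \big(E[f(\phi)]\big)^2 = \mathrm{var}[f(\phi)] \ge 0.
$$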
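The identity in Question 1 can be checked numerically. The sketch below is only an illustration: the distribution of phi, the function f, and the conditional normal model are all assumptions chosen for convenience and are not part of the question.

```r
# Monte Carlo check that cov(theta_i, theta_j) matches var[f(phi)] when
# theta_i and theta_j are conditionally independent given phi and share
# the conditional mean f(phi). All distributions here are assumed.
set.seed(571)
n_sim <- 200000
phi <- rnorm(n_sim)                      # assumed: phi ~ N(0, 1)
f <- function(phi) 2 * phi               # assumed conditional mean f(phi)

# Conditionally independent draws given phi
theta_i <- rnorm(n_sim, mean = f(phi), sd = 1)
theta_j <- rnorm(n_sim, mean = f(phi), sd = 1)

emp_cov <- cov(theta_i, theta_j)         # empirical covariance
var_f <- var(f(phi))                     # empirical var[f(phi)]
print(c(empirical_cov = emp_cov, var_f = var_f))
```

The two printed values should agree up to simulation error, and both are non-negative.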

Question 2. We start with some exploratory looks at the data. Figure 1 shows CD4 count versus time. A linear decline certainly looks feasible, and the counts aren't too close to zero, so assuming normality is perhaps reasonable. The number of observations per individual is given in Table 1. Data from 24 individuals are shown in Figure 2; there is clearly a lot of between-patient variability.

Table 1: Number of individuals with different numbers of observations, n_i, i = 1, ..., 226 (rows: n_i and No. of patients, with a total).

For interest, we carry out a so-called derived variable analysis here. Historically this was used for longitudinal/clustered data when methods for dependent data were not so well developed. The idea is to summarize the curve of each individual by a lower-dimensional summary measure. Here we evaluate the intercept and slope for each individual and examine their variability and their relationship with viral load and age. Figure 3 shows least squares estimates of intercepts and slopes, α̂ and β̂, for all individuals with more than two observations (which from Table 1 is all but 13). Because the data are unbalanced, the uncertainty attached to each pair of estimates is not equal, which greatly reduces the appeal of such analyses. Individual fits can be useful for assessing model assumptions, however.

Figure 1: CD4 count versus month of visit in 226 patients, with local smoother superimposed.

We now turn to the within-person correlation; see Table 2.

          Year 1    Year 2    Year 3    Year 4
Year 1    92,…      …         …         …
Year 2    63,589    81,…      …         …
Year 3    48,798    57,458    75,…      …
Year 4    55,501    63,150    70,…      …,418

Table 2: Variances along the diagonal, correlations above the diagonal, covariances below the diagonal.

(a) Table 3 summarizes the fits. Note that the robust standard error is smaller for the exchangeable correlation structure, which, as we shall see, is more appropriate for these data. In other words, there is a loss of efficiency with the independence working model.
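The derived variable analysis described in Question 2 amounts to one least squares fit per individual. The sketch below shows the idea on simulated data; the data frame `sim`, the number of subjects, and every parameter value are hypothetical and chosen only for illustration, not taken from the CD4 dataset.

```r
# Derived variable analysis on simulated, unbalanced longitudinal data.
# All names and numeric values below are hypothetical.
set.seed(1)
m <- 30                                        # assumed number of subjects
n_i <- sample(3:8, m, replace = TRUE)          # unequal numbers of visits
id <- rep(seq_len(m), times = n_i)
month <- unlist(lapply(n_i, function(n) sort(runif(n, 0, 48))))
a_i <- rnorm(m, 800, 250)                      # subject-specific intercepts
b_i <- rnorm(m, -5, 4)                         # subject-specific slopes
cd4 <- a_i[id] + b_i[id] * month + rnorm(length(id), 0, 100)
sim <- data.frame(id, month, cd4)

# One least squares fit per subject; keep the intercept and slope
coefs <- t(sapply(split(sim, sim$id),
                  function(d) coef(lm(cd4 ~ month, data = d))))
colnames(coefs) <- c("alpha_hat", "beta_hat")

print(head(coefs))
print(apply(coefs, 2, sd))                     # spread of the derived variables
```

Because subjects contribute different numbers of visits, the rows of `coefs` are estimated with different precision, which is the limitation noted in the text.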

Figure 2: CD4 count versus month for first 24 patients (in dataset).

Figure 3: Derived variable plots. Panels: histograms of alpha hat and beta hat; beta hat versus alpha hat; alpha hat versus log(viral load); beta hat versus log(viral load); alpha hat versus age (years).

Figure 4: CD4 versus month, as a function of baseline viral load (panels: low, medium and high viral load).

From the exploratory fits, biological plausibility, and the size of the standard deviation of the slopes (about the same size as the population slope), it looks as though random slopes are needed. With so many individuals, we lose little by including them.

Method   Variance Model      β̂0    s.e.(β̂0)    β̂1    s.e.(β̂1)
GEE      Independence        …     …           …     …
GEE      Exchangeable        …     …           …     …
LMEM     Random Intercept    …     …           …     …
LMEM     Random Int+Slope    …     …           …     …

Table 3: Initial GEE and LMEM estimates.

(b) From Figure 4 it appears that individuals with low viral load have higher initial CD4 values, and individuals with high viral load have the lowest initial CD4 values. The plot is dominated by measurement error, though.

(c) Models modgee4 and modlme4 (see R code and results at end) fit main effects due to month and viral load, and their interaction (the lme fit has random intercepts and slopes). The main effect due to viral load is clearly significant. The interaction is less clear, and its non-monotonic pattern makes it less plausible. Note also that σ̂0 is reduced from … to 247 after inclusion of viral load, so some of the variability has been explained.

(d) Including age doesn't seem to make much difference (model modlme5). Don't look at CD8, as it is another response.

(e) In GEE we make no distributional assumptions and have a working covariance model, so for consistency we don't need the correct variance-covariance model (so long as we use sandwich estimation of the variance). However, if the working model is a long way from the truth, we will lose a lot of efficiency. Hence it is worth looking at mean-variance relationships, as well as the correlation structure, in the residuals. The mean model does have to be appropriate, however, and m has to be large enough for asymptotic normality and for a reliable sandwich estimator. Here, m is large enough, and the mean model looks fine.
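To make the role of the sandwich estimator in (e) concrete, here is a minimal base R sketch for the simplest case, a linear mean model with working independence, where the GEE point estimate coincides with ordinary least squares. The simulated data and all parameter values are assumptions for illustration, and the sketch applies no small-sample correction, so it is not meant to reproduce the output of any particular package.

```r
# Sandwich (cluster-robust) variance for a linear model under working
# independence, on simulated clustered data. All values are hypothetical.
set.seed(2)
m <- 200                                   # assumed number of clusters
n <- 5                                     # assumed visits per cluster
id <- rep(seq_len(m), each = n)
month <- rep(seq(0, 48, length.out = n), times = m)
b0 <- rnorm(m, 0, 200)                     # shared effect induces within-cluster correlation
y <- 800 - 5 * month + b0[id] + rnorm(m * n, 0, 100)

X <- cbind(1, month)
beta_hat <- solve(crossprod(X), crossprod(X, y))
r <- as.vector(y - X %*% beta_hat)         # residuals

bread <- solve(crossprod(X))               # (X'X)^{-1}
scores <- rowsum(X * r, group = id)        # row i holds X_i' r_i for cluster i
meat <- crossprod(scores)                  # sum over clusters of X_i' r_i r_i' X_i
V_robust <- bread %*% meat %*% bread       # sandwich estimate
V_naive <- bread * sum(r^2) / (length(y) - ncol(X))  # assumes independent errors

se_table <- cbind(estimate = as.vector(beta_hat),
                  naive_se = sqrt(diag(V_naive)),
                  robust_se = sqrt(diag(V_robust)))
rownames(se_table) <- c("(Intercept)", "month")
print(se_table)
```

In this simulated setup the naive and robust standard errors differ noticeably, because the independence working model ignores the within-cluster correlation that the sandwich form accounts for.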

(f) For the LMEM we have made distributional assumptions at different levels. With this many total observations (N = 1479) and individuals (m = 226) we probably don't have to worry about normality, at least for the population parameters (for individual random effects, and for predictions for individuals, it's a different story). It is more important to think about whether there are mean-variance relationships and correlations present that we have not modeled. The residuals do not seem to show anything amiss here.

Figure 5: Residuals from LMEM model with random intercepts, slopes and interaction between viral load and month. Panels: normal Q-Q plots of re4[, 1] and re4[, 2]; re4[, 2] versus re4[, 1]; residuals versus month.

R CODE AND RESULTS

Setup and misc.

source("datared.txt")
vcuts <- c(0,40)
viraltert <- rep(1,length(cd4))
viraltert[subset$viral<=vcuts[1]] <- 0
viraltert[subset$viral>vcuts[2]] <- 2
viraltert <- factor(viraltert,labels=c("Low Viral Load","Medium Viral Load","High Viral Load"))
table(viraltert)

GEE with working independence:

> library(gee)
> modgee1 <- gee(cd4 ~ month, corstr="independence")
> summary(modgee1)
(Intercept) …
month …
Estimated Scale Parameter: …
Number of Iterations: 1

GEE with working exchangeable:

> modgee2 <- gee(cd4 ~ month, corstr="exchangeable")
> summary(modgee2)
(Intercept) …
month …
Estimated Scale Parameter: …
Number of Iterations: 2
Working Correlation
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] etc...

LMEM with random intercepts:

> library(nlme)
> modlme1 <- lme( cd4 ~ month, random= ~1 | id )
> summary(modlme1)
Random effects:
 Formula: ~1 | id
        (Intercept) Residual
StdDev: …           …
Fixed effects: cd4 ~ month
(Intercept) …
month …
 Correlation:
      (Intr)
month …
Number of Observations: 1479
Number of Groups: 226

LMEM with random intercepts and slopes:

> library(nlme)
> modlme2 <- lme( cd4 ~ month, random= ~month | id )
> summary(modlme2)
            StdDev Corr
(Intercept) …      (Intr)
month       …      …
Residual    …
Fixed effects: cd4 ~ month
(Intercept) …
month …
 Correlation:
      (Intr)
month …
Number of Observations: 1479
Number of Groups: 226

> modgee3 <- gee(cd4 ~ month+viraltert, corstr="exchangeable")
> summary(modgee3)
(Intercept) …
month …
viraltert …
viraltert …
Estimated Scale Parameter: …
Number of Iterations: 2

> modgee4 <- gee(cd4 ~ month+viraltert+viraltert*month, corstr="exchangeable")
> summary(modgee4)
(Intercept) …
month …
viraltert …
viraltert …
month:viraltert …
month:viraltert …
Estimated Scale Parameter: …
Number of Iterations: 2

> modlme3 <- lme( cd4 ~ month+viraltert, random= ~month | id )
> summary(modlme3)
            StdDev Corr
(Intercept) …      (Intr)
month       …      …
Residual    …
Fixed effects: cd4 ~ month + viraltert
(Intercept) …
month …
viraltert …
viraltert …
 Correlation:
          (Intr) month vrltr1
month     …
viraltert …
viraltert …

> modgee4 <- gee(cd4 ~ month+viraltert+viraltert*month, corstr="exchangeable")
[1] "Beginning Cgee geeformula.q /01/27"
[1] "running glm to get initial regression estimate"
[1] …
[6] …

> summary(modgee4)
(Intercept) …
month …
viraltert …
viraltert …
month:viraltert …
month:viraltert …

> modlme4 <- lme( cd4 ~ month+viraltert+viraltert*month, random= ~month | id )
> summary(modlme4)
            StdDev Corr
(Intercept) …      (Intr)
month       …      …
Residual    …
Fixed effects: cd4 ~ month + viraltert + viraltert * month
(Intercept) …
month …
viraltert …
viraltert …
month:viraltert …
month:viraltert …

> modlme5 <- lme(cd4~month+viraltert+viraltert*month+age, random= ~month | id )
> summary(modlme5)
            StdDev Corr
(Intercept) …      (Intr)
month       …      …
Residual    …
Fixed effects: cd4 ~ month + viraltert + viraltert * month + age
(Intercept) …
month …
viraltert …
viraltert …
age …
month:viraltert …
month:viraltert …

Plots

postscript("fig1.eps",horizontal=FALSE)
plot(month,cd4)
lines(lowess(month,cd4))

postscript("fig2.eps",horizontal=FALSE)
par(mfrow=c(1,1))
smallset <- groupedData( cd4 ~ month | id, data=temp[ 1:sum(n[1:24]),] )
plot(smallset,layout=c(6,4))

postscript("fig3.eps",horizontal=FALSE)
par(mfrow=c(3,2))
hist(alpha[1:subm],xlab="alpha hat",main="")
hist(beta[1:subm],xlab="beta hat",main="")
plot(alpha[1:subm],beta[1:subm],xlab="alpha hat",ylab="beta hat")
plot(log(vfacid[1:subm]),alpha[1:subm],xlab="log(viral load)",ylab="alpha hat")
lines(lowess(log(vfacid[1:subm]),alpha[1:subm]))
plot(log(vfacid[1:subm]),beta[1:subm],xlab="log(viral load)",ylab="beta hat")

lines(lowess(log(vfacid[1:subm]),beta[1:subm]))
plot(ageid[1:subm],alpha[1:subm],xlab="age (years)",ylab="alpha hat")
lines(lowess(ageid[1:subm],alpha[1:subm]))

postscript("fig4.eps",horizontal=FALSE)
newdat <- groupedData( cd4 ~ month | viraltert, outer=~id, order.groups=FALSE )
plot(newdat, layout=c(3,1), aspect=2)

postscript("fig5.eps",horizontal=FALSE)
re4 <- ranef(modlme4)
par(mfrow=c(2,2))
qqnorm(re4[,1])
qqnorm(re4[,2])
plot(re4[,1],re4[,2],type="n")
text(re4[,1],re4[,2])
res4 <- residuals(modlme4)
plot(month,res4)
