Math 215, Lab 7: 5/23/2007

(1) Parametric versus Nonparametric Bootstrap.

Parametric Bootstrap: (Davison and Hinkley, 1997) The data below are 12 times between failures of air-conditioning equipment in a Boeing 720 jet aircraft, for which we wish to estimate the underlying mean or its reciprocal, the failure rate:

3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487

It makes sense to assume that these data come from an exponential distribution. The problem is that we do not know the rate of that distribution, so we cannot write down its pdf. However, the sample average is an efficient estimator of the exponential mean. Consequently, we may work with the exponential distribution whose rate is the reciprocal of the sample average. Formally, this means:

1) Let $Y \sim \mathrm{Exp}(1/\mu)$, so that $f(y) = (1/\mu)\exp(-y/\mu)$.
2) Let $\hat{f} = f_{\hat{\mu}}$, that is, $\hat{f}(y) = (1/\bar{y})\exp(-y/\bar{y})$.

The bootstrap is now straightforward. Obtain a sample $Y_1, \ldots, Y_n$ from $\hat{f}$, calculate the statistic of interest $T_{boot}$ from the sampled data, and repeat this procedure $B$ times to get the distribution of $T_{boot}$.

(a) Perform a parametric bootstrap for the Boeing data to calculate the MSE for the mean, the MSE for the variance, a 95% confidence interval for the mean, and a 95% confidence interval for the variance of the exponential distribution of interest.

Nonparametric Bootstrap: This is essentially what we have been doing so far. We assume that the underlying distribution from which the data were generated is unknown, and we model the data by their empirical distribution, which puts equal probability $n^{-1}$ on each sample value. This is equivalent to estimating the unknown cumulative distribution function (CDF) $F$ by the empirical cumulative distribution function (ECDF) $\hat{F}$, defined as the sample proportion

$$\hat{F}(y) = \frac{\#\{y_j \le y\}}{n}.$$

Note that the ECDF can take only the fixed values $0, \frac{1}{n}, \frac{2}{n}, \ldots, \frac{n}{n}$. So in the nonparametric version, we sample with replacement from the original data set, repeat this $B$ times, and proceed as before.

(b) Perform a nonparametric bootstrap for the Boeing data. Carry out calculations similar to part (a).
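As a rough sketch of how parts (a) and (b) might be set up in R, the block below draws B samples from the fitted exponential and B resamples from the data, then summarizes the bootstrap means and variances. The number of replicates B, the seed, the variable names, the use of percentile intervals, and the choice of var() as the variance statistic are assumptions made for illustration and are not part of the lab statement. The 12 failure times are those of Davison and Hinkley (1997).

```r
set.seed(215)                       # arbitrary seed, for reproducibility only
y <- c(3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487)
n <- length(y)
B <- 2000                           # assumed number of bootstrap replicates
mu.hat <- mean(y)                   # fitted exponential mean; fitted rate is 1/mu.hat

par.mean <- par.var <- np.mean <- np.var <- numeric(B)
for (b in 1:B) {
  y.par <- rexp(n, rate = 1 / mu.hat)     # parametric: simulate from f-hat
  y.np  <- sample(y, n, replace = TRUE)   # nonparametric: resample from the ECDF
  par.mean[b] <- mean(y.par)
  par.var[b]  <- var(y.par)
  np.mean[b]  <- mean(y.np)
  np.var[b]   <- var(y.np)
}

# MSE measured about the value that each bootstrap "world" treats as the truth.
mse.par.mean <- mean((par.mean - mu.hat)^2)
mse.par.var  <- mean((par.var - mu.hat^2)^2)   # exponential variance = mean^2
ecdf.var     <- mean((y - mu.hat)^2)           # variance of the ECDF
mse.np.mean  <- mean((np.mean - mu.hat)^2)
mse.np.var   <- mean((np.var - ecdf.var)^2)

# 95% percentile intervals (one of several possible interval constructions).
ci.par.mean <- quantile(par.mean, c(0.025, 0.975))
ci.par.var  <- quantile(par.var,  c(0.025, 0.975))
ci.np.mean  <- quantile(np.mean,  c(0.025, 0.975))
ci.np.var   <- quantile(np.var,   c(0.025, 0.975))
```

For the exponential model the theoretical MSE of the sample mean is $\mu^2/n$, so mse.par.mean should come out close to mu.hat^2 / n, which gives a quick sanity check on the simulation.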
(2) Consider the following data representing the populations, in thousands, of n = 20 large US cities in 1920 (x) and 1930 (y).

1920 (x)   1930 (y)
   138        143
    93        104
    61         69
   179        260
    48         75
    37         63
    29         50
    23         48
    30        110
     2         50
    38         52
    46         53
    25         57
   298        317
    74         93
    50         58
    76         80
   381        464
   387        459
   507        634

(a) Perform a nonparametric bootstrap to obtain a distribution for Pearson's correlation coefficient between the 1920 and 1930 population data. Note in particular that each bootstrap sample consists of 20 pairs (x, y) drawn with replacement from the 20 observed pairs, so that the two values for each city stay together.

(b) Obtain a 90% confidence interval for the following statistic:

$$Z = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right)$$

(3) Consider the bootstrap procedure for obtaining the t-statistic for the slope of a regression line using its associated least-squares residuals, as explained on page 271. Note that in order to do this, you need to compute the residuals $e = y - \hat{\beta}_0 - \hat{\beta}_1 X$, or fit a separate regression with the $e_i$'s as the response values. Subsequently, you need to sample with replacement from the $e_i$'s to get $B$ repetitions of the slope's t statistic.

(a) Use the data of the previous problem to create a 95% bootstrap confidence interval for $\beta_1$, the slope of the regression line.
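Problems (2) and (3) can be sketched along the following lines. This is a hedged illustration rather than the textbook's procedure: the seed, B, the variable names, the percentile interval for Z, and the bootstrap-t construction of the slope interval are assumptions.

```r
set.seed(215)                       # arbitrary seed
x <- c(138, 93, 61, 179, 48, 37, 29, 23, 30, 2,
       38, 46, 25, 298, 74, 50, 76, 381, 387, 507)    # 1920
y <- c(143, 104, 69, 260, 75, 63, 50, 48, 110, 50,
       52, 53, 57, 317, 93, 58, 80, 464, 459, 634)    # 1930
n <- length(x)
B <- 2000                           # assumed number of replicates

# Problem (2): resample (x, y) pairs with replacement, keeping pairs intact.
r.boot <- numeric(B)
for (b in 1:B) {
  idx <- sample(1:n, n, replace = TRUE)
  r.boot[b] <- cor(x[idx], y[idx])
}
z.boot <- 0.5 * log((1 + r.boot) / (1 - r.boot))   # Z transform of each r
z.ci <- quantile(z.boot, c(0.05, 0.95))            # 90% percentile interval for Z
r.ci <- tanh(z.ci)                                 # back-transformed to the r scale

# Problem (3): resample least-squares residuals and rebuild the response.
fit   <- lm(y ~ x)
e     <- resid(fit)
y.fit <- fitted(fit)
b1    <- unname(coef(fit)[2])
se1   <- summary(fit)$coefficients[2, 2]           # standard error of the slope
t.boot <- numeric(B)
for (b in 1:B) {
  y.star <- y.fit + sample(e, n, replace = TRUE)
  cf <- summary(lm(y.star ~ x))$coefficients
  t.boot[b] <- (cf[2, 1] - b1) / cf[2, 2]          # bootstrap t statistic
}
q <- unname(quantile(t.boot, c(0.025, 0.975)))
slope.ci <- c(b1 - q[2] * se1, b1 - q[1] * se1)    # 95% bootstrap-t interval
```

The interval for Z is mapped back to the correlation scale with tanh, which is the inverse of the Z transform.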
Multiple Regression

Parametric Multiple Regression (Dalgaard, 2002). Dalgaard presents the analysis of a study concerning lung function in patients with cystic fibrosis. The data are in the ISwR package, which can be downloaded from the web. After connecting to the web, all you need to do is type the following in R:

> cyst<-read.table("http://www.csub.edu/~sbehseta/ageheight.txt",header=TRUE)
> cyst<-as.matrix(cyst)

Then, each time you need to use the data sets included in this package, click on Packages and choose ISwR from the Packages menu. Each data set is then available through the command data (see Dalgaard, Appendix A and Section 9.1).

> cyst
   age sex height weight bmp fev1  rv frc tlc pemax
1    7   0    109   13.1  68   32 258 183 137    95
2    7   1    112   12.9  65   19 449 245 134    85
3    8   0    124   14.1  64   22 441 268 147   100
4    8   1    125   16.2  67   41 234 146 124    85
5    8   0    127   21.5  93   52 202 131 104    95
6    9   0    130   17.5  68   44 308 155 118    80
7   11   1    139   30.7  89   28 305 179 119    65
8   12   1    150   28.4  69   18 369 198 103   110
9   12   0    146   25.1  67   24 312 194 128    70
10  13   1    155   31.5  68   23 413 225 136    95
11  13   0    156   39.9  89   39 206 142  95   110
12  14   1    153   42.1  90   26 253 191 121    90
13  14   0    160   45.6  93   45 174 139 108   100
14  15   1    158   51.2  93   45 158 124  90    80
15  16   1    160   35.9  66   31 302 133 101   134
16  17   1    153   34.8  70   29 204 118 120   134
17  17   0    174   44.7  70   49 187 104 103   165
18  17   1    176   60.1  92   29 188 129 130   120
19  17   0    171   42.6  69   38 172 130 103   130
20  19   1    156   37.2  72   21 216 119  81    85
21  19   0    174   54.6  86   37 184 118 101    85
22  20   0    178   64.0  86   34 225 148 135   160
23  23   0    180   73.8  97   57 171 108  98   165
24  23   0    175   51.1  71   33 224 131 113    95
25  23   0    179   71.5  95   52 225 127 101   195

The description of the data set is also given on page 228 of the textbook (Appendix B): The cystfibr data frame has 25 rows and 10 columns. It contains lung function data for cystic fibrosis patients (7-23 years old).

Format: This data frame contains the following columns:

age      a numeric vector. Age in years.
sex      a numeric vector code. 0: male, 1: female.
height   a numeric vector. Height (cm).
weight   a numeric vector. Weight (kg).
bmp      a numeric vector. Body mass (% of normal).
fev1     a numeric vector. Forced expiratory volume.
rv       a numeric vector. Residual volume.
frc      a numeric vector. Functional residual capacity.
tlc      a numeric vector. Total lung capacity.
pemax    a numeric vector. Maximum expiratory pressure.
Preliminary Exploratory Analysis

First of all, by typing:

> plot(cystfibr)

you can obtain a matrix of all pairwise scatterplots for the data set (figure 1). This is an extremely powerful tool for visualizing multivariate data. Secondly, by typing:

> attach(cystfibr)

cystfibr becomes the default data set in R. This is really important because, to see the column labeled height, instead of typing

> cystfibr[, "height"]

I only need to type:

> height
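The different ways of reaching a column can be compared on a small example. The sketch below uses a hypothetical two-column data frame d built from the first three patients in the table, so that it runs without the ISwR package. It also shows with(), which evaluates an expression inside a data frame without attaching it; this is offered as an alternative because attach() can mask other objects that share a column's name.

```r
# Hypothetical miniature data frame: age and height of the first three patients.
d <- data.frame(age = c(7, 7, 8), height = c(109, 112, 124))

h1 <- d[, "height"]        # column selected by name (note the quotes)
h2 <- d$height             # the same column with the $ operator
h3 <- with(d, height)      # the same column, evaluated inside d without attach()

m <- with(d, mean(height)) # any expression can be evaluated this way
```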
[Figure 1: scatterplot matrix with one panel for each pair of age, sex, height, weight, bmp, fev1, rv, frc, tlc, and pemax]

figure 1. This grid of scatterplots reflects the pairwise relationships between all variables of interest.
Also, you can get a matrix of all the pairwise correlations by typing:

> cor(cystfibr)
              age         sex     height     weight        bmp       fev1
age     1.0000000 -0.16712203  0.9260520  0.9058675  0.3777643  0.2944880
sex    -0.1671220  1.00000000 -0.1675482 -0.1904400 -0.1375611 -0.5282571
height  0.9260520 -0.16754816  1.0000000  0.9206953  0.4407623  0.3166636
weight  0.9058675 -0.19043998  0.9206953  1.0000000  0.6725463  0.4488393
bmp     0.3777643 -0.13756107  0.4407623  0.6725463  1.0000000  0.5455204
fev1    0.2944880 -0.52825710  0.3166636  0.4488393  0.5455204  1.0000000
rv     -0.5519445  0.27135157 -0.5695199 -0.6215056 -0.5823729 -0.6658557
frc    -0.6393569  0.18360547 -0.6242769 -0.6172561 -0.4343888 -0.6651149
tlc    -0.4693733  0.02423487 -0.4570819 -0.4184676 -0.3649035 -0.4429945
pemax   0.6134741 -0.28856921  0.5992195  0.6352220  0.2295148  0.4533757
               rv        frc         tlc      pemax
age    -0.5519445 -0.6393569 -0.46937332  0.6134741
sex     0.2713516  0.1836055  0.02423487 -0.2885692
height -0.5695199 -0.6242769 -0.45708185  0.5992195
weight -0.6215056 -0.6172561 -0.41846764  0.6352220
bmp    -0.5823729 -0.4343888 -0.36490350  0.2295148
fev1   -0.6658557 -0.6651149 -0.44299453  0.4533757
rv      1.0000000  0.9106029  0.58913911 -0.3155501
frc     0.9106029  1.0000000  0.70439993 -0.4172078
tlc     0.5891391  0.7043999  1.00000000 -0.1816157
pemax  -0.3155501 -0.4172078 -0.18161570  1.0000000

Note that this will enable us to browse through all the pairwise linear associations.
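One way to browse such a matrix is to pull out a single column and rank its entries. The sketch below copies the pemax column from the output above into a named vector (the name r.pemax is an assumption) and orders the predictors by the absolute size of their correlation with pemax. With the data frame loaded, the same vector could be taken from cor(cystfibr)[, "pemax"] after dropping the pemax entry itself.

```r
# Correlations of each predictor with pemax, copied from the cor() output.
r.pemax <- c(age = 0.6134741, sex = -0.2885692, height = 0.5992195,
             weight = 0.6352220, bmp = 0.2295148, fev1 = 0.4533757,
             rv = -0.3155501, frc = -0.4172078, tlc = -0.1816157)

# Order by strength of linear association, ignoring the sign.
ranked <- r.pemax[order(abs(r.pemax), decreasing = TRUE)]
round(ranked, 3)
```

Weight, age, and height head the list, but they are also strongly correlated with one another (all pairwise correlations above 0.9), which is worth keeping in mind when several of them enter the same regression.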
1 Multiple Regression Modeling

Here, the response variable is pemax. All the other variables are treated as independent variables. A natural starting point is the most complete model, that is, the additive model

$$\text{pemax} = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{sex} + \beta_3\,\text{height} + \beta_4\,\text{weight} + \beta_5\,\text{bmp} + \beta_6\,\text{fev1} + \beta_7\,\text{rv} + \beta_8\,\text{frc} + \beta_9\,\text{tlc} \qquad (1)$$

Here is the output for the full model:

> summary(lm(pemax~age+sex+height+weight+bmp+fev1+rv+frc+tlc))

Call:
lm(formula = pemax ~ age + sex + height + weight + bmp + fev1 + rv + frc + tlc)

Residuals:
    Min      1Q  Median      3Q     Max
-37.338 -11.532   1.081  13.386  33.405

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 176.0582   225.8912   0.779    0.448
age          -2.5420     4.8017  -0.529    0.604
sex          -3.7368    15.4598  -0.242    0.812
height       -0.4463     0.9034  -0.494    0.628
weight        2.9928     2.0080   1.490    0.157
bmp          -1.7449     1.1552  -1.510    0.152
fev1          1.0807     1.0809   1.000    0.333
rv            0.1970     0.1962   1.004    0.331
frc          -0.3084     0.4924  -0.626    0.540
tlc           0.1886     0.4997   0.377    0.711

Residual standard error: 25.47 on 15 degrees of freedom
Multiple R-Squared: 0.6373, Adjusted R-squared: 0.4197
F-statistic: 2.929 on 9 and 15 DF, p-value: 0.03195

Backward elimination by AIC, starting from the full model, is carried out with step:

> test<-lm(pemax~age+sex+height+weight+bmp+fev1+rv+frc+tlc)
> step(test)
Start:  AIC= 169.11
 pemax ~ age + sex + height + weight + bmp + fev1 + rv + frc + tlc
         Df Sum of Sq     RSS   AIC
- sex     1      37.9  9769.2 167.2
- tlc     1      92.4  9823.7 167.3
- height  1     158.3  9889.6 167.5
- age     1     181.8  9913.1 167.6
- frc     1     254.6  9985.8 167.8
- fev1    1     648.4 10379.7 168.7
- rv      1     653.8 10385.0 168.7
<none>                 9731.2 169.1
- weight  1    1441.2 11172.5 170.6
- bmp     1    1480.1 11211.4 170.6

Step:  AIC= 167.2
 pemax ~ age + height + weight + bmp + fev1 + rv + frc + tlc

         Df Sum of Sq     RSS   AIC
- tlc     1     115.9  9885.1 165.5
- height  1     131.2  9900.4 165.5
- age     1     145.6  9914.7 165.6
- frc     1     221.5  9990.7 165.8
- rv      1     636.2 10405.3 166.8
<none>                 9769.2 167.2
- weight  1    1446.2 11215.4 168.7
- bmp     1    1474.7 11243.9 168.7
- fev1    1    1770.4 11539.6 169.4

Step:  AIC= 165.5
 pemax ~ age + height + weight + bmp + fev1 + rv + frc

         Df Sum of Sq     RSS   AIC
- frc     1     133.2 10018.3 163.8
- height  1     215.8 10100.9 164.0
- age     1     252.2 10137.3 164.1
- rv      1     543.5 10428.6 164.8
<none>                 9885.1 165.5
- fev1    1    1727.4 11612.5 167.5
- weight  1    2132.5 12017.6 168.4
- bmp     1    2354.3 12239.4 168.8

Step:  AIC= 163.83
 pemax ~ age + height + weight + bmp + fev1 + rv
         Df Sum of Sq     RSS   AIC
- age     1     145.3 10163.6 162.2
- height  1     158.2 10176.5 162.2
- rv      1     568.1 10586.3 163.2
<none>                10018.3 163.8
- weight  1    2027.2 12045.5 166.4
- bmp     1    2324.1 12342.3 167.0
- fev1    1    2851.2 12869.5 168.1

Step:  AIC= 162.19
 pemax ~ height + weight + bmp + fev1 + rv

         Df Sum of Sq     RSS   AIC
- height  1     191.0 10354.6 160.7
- rv      1     829.0 10992.6 162.2
<none>                10163.6 162.2
- weight  1    2603.5 12767.0 165.9
- bmp     1    2743.5 12907.1 166.2
- fev1    1    3210.9 13374.5 167.1

Step:  AIC= 160.66
 pemax ~ weight + bmp + fev1 + rv

         Df Sum of Sq     RSS   AIC
<none>                10354.6 160.7
- rv      1    1183.6 11538.2 161.4
- bmp     1    3072.6 13427.2 165.2
- fev1    1    3717.1 14071.7 166.3
- weight  1   10930.2 21284.8 176.7

Call:
lm(formula = pemax ~ weight + bmp + fev1 + rv)

Coefficients:
(Intercept)       weight          bmp         fev1           rv
    63.9467       1.7489      -1.3772       1.5477       0.1257
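The AIC values in the step output can be reproduced from the residual sums of squares. For a least-squares fit, step reports AIC up to an additive constant as n log(RSS/n) + 2p, where p counts the estimated coefficients including the intercept. The helper below is a small sketch of that arithmetic; the function name step.aic is made up for illustration and is not an R built-in.

```r
# AIC as used by step() for lm fits, up to an additive constant.
# rss: residual sum of squares; n: sample size; p: number of coefficients.
step.aic <- function(rss, n, p) n * log(rss / n) + 2 * p

n <- 25
aic.full  <- step.aic(9731.2, n, 10)   # full model: 9 predictors + intercept
aic.nosex <- step.aic(9769.2, n, 9)    # after dropping sex
aic.best  <- step.aic(10354.6, n, 5)   # final model: weight, bmp, fev1, rv

round(c(full = aic.full, no.sex = aic.nosex, best = aic.best), 2)
```

These agree with the 169.11, 167.2, and 160.66 shown in the output. Dropping a term always increases the RSS, so a term is removed only when the 2p penalty saved outweighs the resulting increase in n log(RSS/n).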
The Best Model

> summary(lm(formula = pemax ~ weight + bmp + fev1 + rv))

Call:
lm(formula = pemax ~ weight + bmp + fev1 + rv)

Residuals:
   Min     1Q Median     3Q    Max
-39.77 -11.74   4.33  15.66  35.07

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 63.94669   53.27673   1.200 0.244057
weight       1.74891    0.38063   4.595 0.000175 ***
bmp         -1.37724    0.56534  -2.436 0.024322 *
fev1         1.54770    0.57761   2.679 0.014410 *
rv           0.12572    0.08315   1.512 0.146178
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22.75 on 20 degrees of freedom
Multiple R-Squared: 0.6141, Adjusted R-squared: 0.5369
F-statistic: 7.957 on 4 and 20 DF, p-value: 0.000523

> anova(lm(formula = pemax ~ weight + bmp + fev1 + rv))
Analysis of Variance Table

Response: pemax
          Df  Sum Sq Mean Sq F value    Pr(>F)
weight     1 10827.2 10827.2 20.9128 0.0001846 ***
bmp        1  1914.9  1914.9  3.6987 0.0688086 .
fev1       1  2552.4  2552.4  4.9299 0.0381131 *
rv         1  1183.6  1183.6  2.2861 0.1461776
Residuals 20 10354.6   517.7
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
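The summary lines of the best model can be recovered from the anova table alone, which makes the link between the two outputs explicit. The sketch below is only arithmetic on the printed sums of squares; the variable names are assumptions.

```r
n <- 25                                       # patients
p <- 4                                        # predictors in the best model
ss.terms <- c(weight = 10827.2, bmp = 1914.9, fev1 = 2552.4, rv = 1183.6)
rss <- 10354.6                                # residual sum of squares
df.res <- n - p - 1                           # 20 residual degrees of freedom

r2     <- sum(ss.terms) / (sum(ss.terms) + rss)        # multiple R-squared
adj.r2 <- 1 - (1 - r2) * (n - 1) / df.res              # adjusted R-squared
rse    <- sqrt(rss / df.res)                           # residual standard error
f.stat <- (sum(ss.terms) / p) / (rss / df.res)         # overall F statistic
p.val  <- pf(f.stat, p, df.res, lower.tail = FALSE)    # its p-value

round(c(r2 = r2, adj.r2 = adj.r2, rse = rse, f = f.stat), 4)
```

Up to rounding, these match the 0.6141, 0.5369, 22.75, and 7.957 reported by summary.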
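A Naive Bootstrap Analysis of the Best Model

# Assumption: cyst.resid and cyst.predict are the residuals and fitted values
# of the best model; the object name best is introduced here for illustration.
best<-lm(pemax~weight + bmp + fev1 + rv)
cyst.resid<-resid(best)
cyst.predict<-fitted(best)

B<-1000
Tboot<-matrix(0,nrow=B,ncol=5)    # one row of 5 coefficient estimates per replicate
for(i in 1:B) {
  e.star<-sample(cyst.resid,25,replace=TRUE)    # resample residuals with replacement
  y.star<- cyst.predict+e.star                  # bootstrap responses
  Tboot[i,]<-summary(lm(y.star~weight + bmp + fev1 + rv))$coefficients[,1]
  #Tcov<-summary(lm(y.star~weight + bmp + fev1 + rv))$cov
  print(i)                                      # progress counter
}

par(mfrow=c(2,3))
for(i in 1:5) {
  hist(Tboot[,i])
}

> mean(Tboot[,1])
[1] 62.86821
> sd(Tboot[,1])
[1] 48.52069
>
> mean(Tboot[,2])
[1] 1.748008
> sd(Tboot[,2])
[1] 0.3344789
>
> mean(Tboot[,3])
[1] -1.370281
> sd(Tboot[,3])
[1] 0.4970465
>
> mean(Tboot[,4])
[1] 1.552385
> sd(Tboot[,4])
[1] 0.5046212
>
> mean(Tboot[,5])
[1] 0.1271266
> sd(Tboot[,5])
[1] 0.07545277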
[Figure 2: five histograms, each titled "Histogram of Tboot[, i]", with Tboot[, i] on the horizontal axis and Frequency on the vertical axis]

figure 2. The Distribution of the Bootstrap Parameter Estimates.
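The bootstrap replicates can also be summarized as standard errors and percentile intervals for all five coefficients at once. The block below is a self-contained sketch that re-enters the five columns used by the best model, so it does not depend on the ISwR package. The seed, the object names, and the choice of 95% percentile intervals are assumptions, and the numbers it produces will differ slightly from those printed above because the resampling is random.

```r
set.seed(215)                        # arbitrary seed
weight <- c(13.1, 12.9, 14.1, 16.2, 21.5, 17.5, 30.7, 28.4, 25.1, 31.5, 39.9, 42.1, 45.6,
            51.2, 35.9, 34.8, 44.7, 60.1, 42.6, 37.2, 54.6, 64.0, 73.8, 51.1, 71.5)
bmp    <- c(68, 65, 64, 67, 93, 68, 89, 69, 67, 68, 89, 90, 93,
            93, 66, 70, 70, 92, 69, 72, 86, 86, 97, 71, 95)
fev1   <- c(32, 19, 22, 41, 52, 44, 28, 18, 24, 23, 39, 26, 45,
            45, 31, 29, 49, 29, 38, 21, 37, 34, 57, 33, 52)
rv     <- c(258, 449, 441, 234, 202, 308, 305, 369, 312, 413, 206, 253, 174,
            158, 302, 204, 187, 188, 172, 216, 184, 225, 171, 224, 225)
pemax  <- c(95, 85, 100, 85, 95, 80, 65, 110, 70, 95, 110, 90, 100,
            80, 134, 134, 165, 120, 130, 85, 85, 160, 165, 95, 195)

best  <- lm(pemax ~ weight + bmp + fev1 + rv)
e     <- resid(best)                 # least-squares residuals
y.hat <- fitted(best)                # fitted values

B <- 1000
Tboot <- matrix(0, nrow = B, ncol = 5,
                dimnames = list(NULL, names(coef(best))))
for (i in 1:B) {
  y.star <- y.hat + sample(e, length(e), replace = TRUE)
  Tboot[i, ] <- coef(lm(y.star ~ weight + bmp + fev1 + rv))
}

boot.se <- apply(Tboot, 2, sd)                                 # bootstrap SEs
boot.ci <- apply(Tboot, 2, quantile, probs = c(0.025, 0.975))  # percentile CIs
lm.se   <- summary(best)$coefficients[, "Std. Error"]          # model-based SEs

round(rbind(lm.se, boot.se), 4)
t(round(boot.ci, 3))
```

The bootstrap standard errors tend to come out somewhat smaller than the model-based ones, as they do in the output above (for example 0.334 against 0.381 for weight). Resampling raw residuals understates the error variance by roughly a factor of (n - p - 1)/n, which is one sense in which this version of the bootstrap is naive.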