Math 215, Lab 7: 5/23/2007


(1) Parametric versus Nonparametric Bootstrap

Parametric Bootstrap (Davison and Hinkley, 1997): The data below are 12 times between failures of air-conditioning equipment in a Boeing 720 jet aircraft, for which we wish to estimate the underlying mean or its reciprocal, the failure rate:

3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487

It makes sense to assume that these data come from an exponential distribution. The problem is that we do not know the rate of that distribution; hence, we cannot identify its pdf. However, we can rely upon the fact that the sample average is an efficient estimator of the exponential mean. Consequently, we may consider an exponential distribution whose rate is the reciprocal of the sample average. Formally, this means:

1) Let y ~ exp(1/µ), so that f(y) = (1/µ) exp(-y/µ).

2) Let f̂ = f_µ̂, where f_µ̂(y) = (1/ȳ) exp(-y/ȳ).

Now, bootstrapping seems obvious here. Obtain a sample Y1, ..., Yn from f̂, then calculate the statistic of interest T_boot using the sampled data; finally, repeat this procedure B times to get the distribution of T_boot.

(a) Perform a parametric bootstrap for the Boeing data to calculate the MSE for the mean, the MSE for the variance, a 95% confidence interval for the mean, and a 95% confidence interval for the variance of the exponential distribution of interest.

Nonparametric Bootstrap: This is pretty much what we have been doing so far. That is, we assume that the underlying distribution from which the data are generated is unknown. Consequently, the data may very well be modeled via their empirical distribution, which puts equal probability 1/n at each sample value. This assumption is equivalent to estimating the unknown cumulative distribution function (CDF) F with the Empirical Cumulative Distribution Function (ECDF) F̂, defined as the sample proportion:

F̂(y) = #{y_j <= y} / n

Note that the values the ECDF can take are fixed: 0/n, 1/n, 2/n, ..., n/n. So in the nonparametric version, we may sample with replacement from the original dataset B times and proceed as before.

(b) Perform a nonparametric bootstrap for the Boeing data. Carry out calculations similar to part (a).
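Part (a) can be sketched in R as follows, using the 12 failure times from Davison and Hinkley; B = 1000 and the random seed are my choices, not fixed by the handout, and the MSEs here are taken relative to the plug-in estimates from the observed sample.

```r
# A sketch of part (a): parametric bootstrap from the fitted exponential.
y <- c(3, 5, 7, 18, 43, 85, 91, 98, 100, 130, 230, 487)
n <- length(y)
B <- 1000
set.seed(1)  # for reproducibility

boot.mean <- numeric(B)
boot.var  <- numeric(B)
for (b in 1:B) {
  y.star <- rexp(n, rate = 1 / mean(y))  # draw from the fitted exponential
  boot.mean[b] <- mean(y.star)
  boot.var[b]  <- var(y.star)
}

mse.mean <- mean((boot.mean - mean(y))^2)  # MSE for the mean
mse.var  <- mean((boot.var  - var(y))^2)   # MSE for the variance

quantile(boot.mean, c(0.025, 0.975))  # 95% percentile CI for the mean
quantile(boot.var,  c(0.025, 0.975))  # 95% percentile CI for the variance
```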

(2) Consider the following data representing the populations, in thousands, of n = 20 large US cities in 1920 (x) and 1930 (y):

1920 (x): 138  93  61 179  48  37  29  23  30   2  38  46  25 298  74  50  76 381 387 507
1930 (y): 143 104  69 260  75  63  50  48 110  50  52  53  57 317  93  58  80 464 459 634

(a) Perform a nonparametric bootstrap to obtain a distribution for the Pearson's correlation coefficient between the 1920 and 1930 population data. Note in particular that you need to sample 20 pairs with replacement from the paired data.

(b) Obtain a 90% confidence interval for the following statistic:

Z = (1/2) ln((1 + r) / (1 - r))

(3) Consider the bootstrap procedure for obtaining the t-statistic for the slope of a regression line using its associated least-squares residuals, as explained on page 271. Note that in order to do this, you need to fit e = y - β̂0 - β̂1 x, i.e., a separate regression with the e_i's as the response values. Subsequently, you need to sample with replacement from the e_i's to get B repetitions of the slope's t-statistic.

(a) Use the data of the previous problem to create a 95% bootstrap confidence interval for β1, the slope of the regression line.
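A sketch of parts 2(a) and 2(b) in R, using the 20 city pairs listed in the problem; B = 1000 and the seed are my choices.

```r
# Nonparametric bootstrap of Pearson's correlation for the city data.
x <- c(138, 93, 61, 179, 48, 37, 29, 23, 30, 2,
       38, 46, 25, 298, 74, 50, 76, 381, 387, 507)  # 1920 populations
y <- c(143, 104, 69, 260, 75, 63, 50, 48, 110, 50,
       52, 53, 57, 317, 93, 58, 80, 464, 459, 634)  # 1930 populations
n <- length(x)
B <- 1000
set.seed(1)

r.boot <- numeric(B)
for (b in 1:B) {
  idx <- sample(1:n, n, replace = TRUE)  # resample the (x, y) pairs together
  r.boot[b] <- cor(x[idx], y[idx])       # Pearson correlation of the resample
}

# (b) Fisher's z transform of each bootstrap correlation,
# then a 90% percentile confidence interval
z.boot <- 0.5 * log((1 + r.boot) / (1 - r.boot))
quantile(z.boot, c(0.05, 0.95))
```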

Multiple Regression

Parametric Multiple Regression (Dalgaard, 2002). Dalgaard presents the analysis of a study concerning lung function in patients with cystic fibrosis. The data are in the ISwR package, which can be downloaded from the web. After connecting to the web, all you need to do is type the following in R:

> cyst<-read.table("http://www.csub.edu/~sbehseta/ageheight.txt",header=T)
> cyst<-as.matrix(cyst)

Then, each time you need to use the data sets included in the ISwR package in R, you need to click on Packages, choose ISwR from the packages menu, and load each data set with the command data (see Dalgaard, Appendix A and Section 9.1).

> cyst
   age sex height weight bmp fev1  rv frc tlc pemax
1    7   0    109   13.1  68   32 258 183 137    95
2    7   1    112   12.9  65   19 449 245 134    85
3    8   0    124   14.1  64   22 441 268 147   100
4    8   1    125   16.2  67   41 234 146 124    85
5    8   0    127   21.5  93   52 202 131 104    95
6    9   0    130   17.5  68   44 308 155 118    80
7   11   1    139   30.7  89   28 305 179 119    65
8   12   1    150   28.4  69   18 369 198 103   110
9   12   0    146   25.1  67   24 312 194 128    70
10  13   1    155   31.5  68   23 413 225 136    95
11  13   0    156   39.9  89   39 206 142  95   110
12  14   1    153   42.1  90   26 253 191 121    90
13  14   0    160   45.6  93   45 174 139 108   100
14  15   1    158   51.2  93   45 158 124  90    80
15  16   1    160   35.9  66   31 302 133 101   134
16  17   1    153   34.8  70   29 204 118 120   134
17  17   0    174   44.7  70   49 187 104 103   165
18  17   1    176   60.1  92   29 188 129 130   120
19  17   0    171   42.6  69   38 172 130 103   130
20  19   1    156   37.2  72   21 216 119  81    85
21  19   0    174   54.6  86   37 184 118 101    85
22  20   0    178   64.0  86   34 225 148 135   160
23  23   0    180   73.8  97   57 171 108  98   165
24  23   0    175   51.1  71   33 224 131 113    95
25  23   0    179   71.5  95   52 225 127 101   195

The description of the data set is also given on page 228 of the textbook (Appendix B): The cystfibr data frame has 25 rows and 10 columns. It contains lung function data for cystic fibrosis patients (7-23 years old).

Format: this data frame contains the following columns:
age: a numeric vector. Age in years.
sex: a numeric vector code. 0: male, 1: female.
height: a numeric vector. Height (cm).
weight: a numeric vector. Weight (kg).
bmp: a numeric vector. Body mass (% of normal).
fev1: a numeric vector. Forced expiratory volume.
rv: a numeric vector. Residual volume.
frc: a numeric vector. Functional residual capacity.
tlc: a numeric vector. Total lung capacity.
pemax: a numeric vector. Maximum expiratory pressure.

Preliminary Exploratory Analysis

First of all, by typing:

> plot(cystfibr)

you can obtain a matrix of all pairwise scatterplots associated with the data set (figure 1). This is an extremely powerful tool for visualizing multivariate data. Secondly, by typing:

> attach(cystfibr)

the default data set in R becomes cystfibr. This is really important because, to see the column labeled height, instead of typing

> cystfibr[,"height"]

I only need to type:

> height

[Figure 1 appears here: a scatterplot matrix of age, sex, height, weight, bmp, fev1, rv, frc, tlc, and pemax.]

figure 1. This grid of scatterplots reflects the pairwise relationships between all variables of interest.

Also, you can get a matrix of all the pairwise correlations by typing:

> cor(cystfibr)
              age         sex     height     weight        bmp       fev1
age     1.0000000 -0.16712203  0.9260520  0.9058675  0.3777643  0.2944880
sex    -0.1671220  1.00000000 -0.1675482 -0.1904400 -0.1375611 -0.5282571
height  0.9260520 -0.16754816  1.0000000  0.9206953  0.4407623  0.3166636
weight  0.9058675 -0.19043998  0.9206953  1.0000000  0.6725463  0.4488393
bmp     0.3777643 -0.13756107  0.4407623  0.6725463  1.0000000  0.5455204
fev1    0.2944880 -0.52825710  0.3166636  0.4488393  0.5455204  1.0000000
rv     -0.5519445  0.27135157 -0.5695199 -0.6215056 -0.5823729 -0.6658557
frc    -0.6393569  0.18360547 -0.6242769 -0.6172561 -0.4343888 -0.6651149
tlc    -0.4693733  0.02423487 -0.4570819 -0.4184676 -0.3649035 -0.4429945
pemax   0.6134741 -0.28856921  0.5992195  0.6352220  0.2295148  0.4533757
                rv        frc         tlc      pemax
age     -0.5519445 -0.6393569 -0.46937332  0.6134741
sex      0.2713516  0.1836055  0.02423487 -0.2885692
height  -0.5695199 -0.6242769 -0.45708185  0.5992195
weight  -0.6215056 -0.6172561 -0.41846764  0.6352220
bmp     -0.5823729 -0.4343888 -0.36490350  0.2295148
fev1    -0.6658557 -0.6651149 -0.44299453  0.4533757
rv       1.0000000  0.9106029  0.58913911 -0.3155501
frc      0.9106029  1.0000000  0.70439993 -0.4172078
tlc      0.5891391  0.7043999  1.00000000 -0.1816157
pemax   -0.3155501 -0.4172078 -0.18161570  1.0000000

Note that this enables us to browse through all the pairwise linear associations.

1 Multiple Regression Modeling

Here, the response variable is pemax. All the other variables are considered independent variables. One might be interested to start with the most complete model. That is, we consider the additive model

pemax = β0 + β1 age + β2 sex + β3 height + β4 weight + β5 bmp + β6 fev1 + β7 rv + β8 frc + β9 tlc   (1)

Here is the output for the full model:

> summary(lm(pemax~age+sex+height+weight+bmp+fev1+rv+frc+tlc))

Call:
lm(formula = pemax ~ age + sex + height + weight + bmp + fev1 + rv + frc + tlc)

Residuals:
    Min      1Q  Median      3Q     Max
-37.338 -11.532   1.081  13.386  33.405

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 176.0582   225.8912   0.779    0.448
age          -2.5420     4.8017  -0.529    0.604
sex          -3.7368    15.4598  -0.242    0.812
height       -0.4463     0.9034  -0.494    0.628
weight        2.9928     2.0080   1.490    0.157
bmp          -1.7449     1.1552  -1.510    0.152
fev1          1.0807     1.0809   1.000    0.333
rv            0.1970     0.1962   1.004    0.331
frc          -0.3084     0.4924  -0.626    0.540
tlc           0.1886     0.4997   0.377    0.711

Residual standard error: 25.47 on 15 degrees of freedom
Multiple R-Squared: 0.6373, Adjusted R-squared: 0.4197
F-statistic: 2.929 on 9 and 15 DF, p-value: 0.03195

> test<-lm(pemax~age+sex+height+weight+bmp+fev1+rv+frc+tlc)
> step(test)
Start: AIC= 169.11
pemax ~ age + sex + height + weight + bmp + fev1 + rv + frc + tlc

         Df Sum of Sq     RSS   AIC
- sex     1      37.9  9769.2 167.2
- tlc     1      92.4  9823.7 167.3
- height  1     158.3  9889.6 167.5
- age     1     181.8  9913.1 167.6
- frc     1     254.6  9985.8 167.8
- fev1    1     648.4 10379.7 168.7
- rv      1     653.8 10385.0 168.7
<none>               9731.2 169.1
- weight  1    1441.2 11172.5 170.6
- bmp     1    1480.1 11211.4 170.6

Step: AIC= 167.2
pemax ~ age + height + weight + bmp + fev1 + rv + frc + tlc

         Df Sum of Sq     RSS   AIC
- tlc     1     115.9  9885.1 165.5
- height  1     131.2  9900.4 165.5
- age     1     145.6  9914.7 165.6
- frc     1     221.5  9990.7 165.8
- rv      1     636.2 10405.3 166.8
<none>               9769.2 167.2
- weight  1    1446.2 11215.4 168.7
- bmp     1    1474.7 11243.9 168.7
- fev1    1    1770.4 11539.6 169.4

Step: AIC= 165.5
pemax ~ age + height + weight + bmp + fev1 + rv + frc

         Df Sum of Sq     RSS   AIC
- frc     1     133.2 10018.3 163.8
- height  1     215.8 10100.9 164.0
- age     1     252.2 10137.3 164.1
- rv      1     543.5 10428.6 164.8
<none>               9885.1 165.5
- fev1    1    1727.4 11612.5 167.5
- weight  1    2132.5 12017.6 168.4
- bmp     1    2354.3 12239.4 168.8

Step: AIC= 163.83
pemax ~ age + height + weight + bmp + fev1 + rv

         Df Sum of Sq     RSS   AIC
- age     1     145.3 10163.6 162.2
- height  1     158.2 10176.5 162.2
- rv      1     568.1 10586.3 163.2
<none>              10018.3 163.8
- weight  1    2027.2 12045.5 166.4
- bmp     1    2324.1 12342.3 167.0
- fev1    1    2851.2 12869.5 168.1

Step: AIC= 162.19
pemax ~ height + weight + bmp + fev1 + rv

         Df Sum of Sq     RSS   AIC
- height  1     191.0 10354.6 160.7
- rv      1     829.0 10992.6 162.2
<none>              10163.6 162.2
- weight  1    2603.5 12767.0 165.9
- bmp     1    2743.5 12907.1 166.2
- fev1    1    3210.9 13374.5 167.1

Step: AIC= 160.66
pemax ~ weight + bmp + fev1 + rv

         Df Sum of Sq     RSS   AIC
<none>              10354.6 160.7
- rv      1    1183.6 11538.2 161.4
- bmp     1    3072.6 13427.2 165.2
- fev1    1    3717.1 14071.7 166.3
- weight  1   10930.2 21284.8 176.7

Call:
lm(formula = pemax ~ weight + bmp + fev1 + rv)

Coefficients:
(Intercept)      weight         bmp        fev1          rv
    63.9467      1.7489     -1.3772      1.5477      0.1257
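The backward elimination performed by step() above can be sketched on simulated data, so the example runs without the ISwR package; the variable names x1, x2, x3 and all coefficients below are invented for the illustration, not taken from the cystfibr analysis.

```r
# Illustration of step(): backward elimination on AIC with simulated data.
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)                          # pure noise, unrelated to y
y  <- 2 + 3 * x1 - 1.5 * x2 + rnorm(n)  # true model uses only x1 and x2

full <- lm(y ~ x1 + x2 + x3)  # start from the most complete model
best <- step(full, trace = 0) # drop terms while AIC decreases
formula(best)                 # the retained model
```

The trace = 0 argument suppresses the step-by-step tables shown above; leave it at its default to see them.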

The Best Model

> summary(lm(formula = pemax ~ weight + bmp + fev1 + rv))

Call:
lm(formula = pemax ~ weight + bmp + fev1 + rv)

Residuals:
   Min     1Q Median     3Q    Max
-39.77 -11.74   4.33  15.66  35.07

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 63.94669   53.27673   1.200 0.244057
weight       1.74891    0.38063   4.595 0.000175 ***
bmp         -1.37724    0.56534  -2.436 0.024322 *
fev1         1.54770    0.57761   2.679 0.014410 *
rv           0.12572    0.08315   1.512 0.146178
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22.75 on 20 degrees of freedom
Multiple R-Squared: 0.6141, Adjusted R-squared: 0.5369
F-statistic: 7.957 on 4 and 20 DF, p-value: 0.000523

> anova(lm(formula = pemax ~ weight + bmp + fev1 + rv))
Analysis of Variance Table

Response: pemax
          Df  Sum Sq Mean Sq F value    Pr(>F)
weight     1 10827.2 10827.2 20.9128 0.0001846 ***
bmp        1  1914.9  1914.9  3.6987 0.0688086 .
fev1       1  2552.4  2552.4  4.9299 0.0381131 *
rv         1  1183.6  1183.6  2.2861 0.1461776
Residuals 20 10354.6   517.7
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A Naive Bootstrap Analysis of the Best Model

# cyst.resid and cyst.predict below are the residuals and fitted values
# of the best model, e.g.:
# cyst.fit     <- lm(pemax ~ weight + bmp + fev1 + rv)
# cyst.resid   <- resid(cyst.fit)
# cyst.predict <- fitted(cyst.fit)

B<-1000
Tboot<-matrix(0,nrow=B,ncol=5)
for(i in 1:B)
{
  e.star<-sample(cyst.resid,25,replace=T)
  y.star<-cyst.predict+e.star
  Tboot[i,]<-summary(lm(y.star~weight+bmp+fev1+rv))$coefficients[,1]
  #Tcov<-summary(lm(y.star~weight+bmp+fev1+rv))$cov
  print(i)
}
par(mfrow=c(2,3))
for(i in 1:5)
{
  hist(Tboot[,i])
}

> mean(Tboot[,1])
[1] 62.86821
> sd(Tboot[,1])
[1] 48.52069

> mean(Tboot[,2])
[1] 1.748008
> sd(Tboot[,2])
[1] 0.3344789

> mean(Tboot[,3])
[1] -1.370281
> sd(Tboot[,3])
[1] 0.4970465

> mean(Tboot[,4])
[1] 1.552385
> sd(Tboot[,4])
[1] 0.5046212

> mean(Tboot[,5])
[1] 0.1271266
> sd(Tboot[,5])
[1] 0.07545277
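Percentile confidence intervals for each coefficient come from column-wise quantiles of Tboot. To keep the sketch below self-contained, Tboot is replaced by a simulated stand-in built from the bootstrap means and standard deviations printed above (the normal shape is my assumption); with the real Tboot, skip the simulation step.

```r
# 95% percentile confidence intervals from the columns of Tboot.
set.seed(1)
B <- 1000
# Stand-in for Tboot: normal draws matching the printed means and sds.
Tboot <- cbind(rnorm(B, 62.87, 48.52),   # intercept
               rnorm(B,  1.748, 0.334),  # weight
               rnorm(B, -1.370, 0.497),  # bmp
               rnorm(B,  1.552, 0.505),  # fev1
               rnorm(B,  0.127, 0.075))  # rv

ci <- apply(Tboot, 2, quantile, probs = c(0.025, 0.975))
colnames(ci) <- c("(Intercept)", "weight", "bmp", "fev1", "rv")
ci
```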

[Figure 2 appears here: histograms of the five columns of Tboot, i.e. the bootstrap distributions of the intercept and the weight, bmp, fev1, and rv coefficients.]

figure 2. The Distribution of the Bootstrap Parameter Estimates.
