CHILD HEALTH AND DEVELOPMENT STUDY


9. Diagnostics

In this section various diagnostic tools will be used to evaluate the adequacy of the regression model with the five independent variables developed in Section 8. These tools include residual plots to investigate whether the assumptions of linearity of the response variable, and normality and constant variance of the error, appear to be met. They also provide insight into how the predictor variables are related to one another and how they influence the model.

9.1 Checking the Linearity of Birth Weights
9.2 Checking the Constant Variance of the Error Assumption
9.3 Checking the Normality of the Error Assumption
9.4 Multicollinearity of the Independent Variables
9.5 Diagnostics for Outliers and Influential Cases

9.1 Checking the Linearity of Birth Weights

In the previous section we described the relationship between the response variable BWT and the predictors GESTWKS, MNOCIG, MHEIGHT, MPPWT, and FHEIGHT by a multiple linear regression model of the form

BWT = b0 + b1*GESTWKS + b2*MNOCIG + b3*MHEIGHT + b4*MPPWT + b5*FHEIGHT + ERROR,

with the coefficient estimates obtained in Section 8. The random variable ERROR is assumed to follow a normal distribution with a mean of zero and an unknown standard deviation. The standard deviation is constant at all levels of the response variable BWT, that is, under the full range of settings of the independent variables GESTWKS, MNOCIG, MHEIGHT, MPPWT, and FHEIGHT. We have found the estimates of the model coefficients and discussed the significance of the five independent variables in the model.

The model is based on the assumption of a linear relationship between birth weight and the predictor variables. The assumption of linearity is easily examined through plots of the residuals (standardized residuals) against the predicted values and against each independent variable. The standardized residuals are the ordinary residuals divided by the sample standard deviation of the residuals, and so have mean 0 and standard deviation 1.
If the linearity assumption is met, these residual plots should exhibit a random scatter of points about zero. Any consistent curvilinear pattern in the residuals indicates a nonlinear relationship between the dependent variable and the independent variables and calls for a nonlinear regression model. To check the assumption for the above multiple linear regression model with SPSS, we obtain plots of the standardized residuals against the predicted values and against each independent variable. Because the same plots are also used to assess the constant-variance assumption of the error, we discuss the two assumptions together in the next section.
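As a supplement to the SPSS displays, the construction of standardized residuals can be sketched in a few lines. The example below is a minimal sketch on synthetic data (all variable values are invented for illustration and are not the child health data), using plain least squares.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the child health data: 5 predictors plus an
# intercept, with a normally distributed error (values are invented).
n, k = 200, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([7.0, 0.40, -0.10, 0.05, 0.03, 0.06])
y = X @ beta + rng.normal(scale=1.0, size=n)

# Ordinary least-squares fit, fitted values and residuals
bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ bhat
resid = y - fitted

# Standardized residuals: ordinary residuals divided by their sample
# standard deviation.  Because the model contains an intercept the
# residuals sum to zero, so the standardized residuals have mean 0
# and, by construction, standard deviation 1.
std_resid = resid / resid.std(ddof=1)
```

Plotting std_resid against fitted, and against each predictor column of X, reproduces the kind of residual displays discussed in this section.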

9.2 Checking the Constant Variance of the Error Assumption

To see whether the assumption of constant variance is violated, we plot the residuals (standardized residuals) against the fitted values and against each predictor variable. If the assumptions of linearity and constant variance are met, these residual plots should exhibit a random scatter of points with similar spread across all levels of the fitted values and of the independent variables. The plot displayed below shows the scatterplot of the standardized residuals against the corresponding fitted values.

[Plot of Standardized Residuals vs. Predicted Values; Dependent Variable: BWT]

No obvious difficulties are revealed in this display. With the exception of the smallest fitted values, the variability appears quite similar across all levels of the fitted values. A random pattern is apparent in the plot, so the linearity assumption is not violated. The next plot gives a similar display of the standardized residuals plotted against gestation time.

[Plot of Standardized Residuals vs. Gestation Time (in weeks)]

The points in the plot are randomly scattered, and constant variability across all values of the independent variable is supported. The random pattern in the plot does not provide any evidence against the linearity assumption. The next plot gives a similar display for the amount of maternal smoking.

[Plot of Standardized Residuals vs. Number of Cigarettes Smoked per Day]

At first glance, it seems that the variability decreases as the number of cigarettes smoked per day increases. As the change in spread concerns a relatively small fraction of the observations, this plot does not suggest any weakness of the model with respect to nonconstant variability. The next plot shows the standardized residuals versus maternal height.

[Plot of Standardized Residuals vs. Maternal Height (in inches)]

There appear to be some differences in the variability of the residuals at different levels of maternal height, but the effect is not severe. There is not strong enough evidence to question the assumption of equal variance. Now we plot the standardized residuals against maternal pre-pregnancy weight.

[Plot of Standardized Residuals vs. Maternal Pre-pregnancy Weight (in pounds)]

The plot shows the residuals falling randomly, with relatively equal spread about zero and no strong tendency to be either greater or less than zero. Finally, we examine the plot of the standardized residuals against father's height.

[Plot of Standardized Residuals vs. Father's Height (in inches)]

There appear to be some differences in the variability of the residuals at different levels of father's height, but the effect is not severe. There is not strong enough evidence to question the assumptions of linearity and equal variance. In summary, the assumptions of linearity and constant variance of the error seem very reasonable for this choice of regression model.
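A rough numerical counterpart to these visual checks can be sketched as follows: split the cases at the median fitted value and compare the residual spread in the two halves. Under constant variance the ratio of the two standard deviations should be close to 1. The data below are synthetic and homoscedastic by construction (all values invented for illustration).

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic homoscedastic data: the error SD does not depend on x
n = 600
x = rng.normal(size=n)
y = 3.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), x])
bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ bhat
resid = y - fitted

# Compare residual spread below and above the median fitted value;
# a ratio near 1 is consistent with constant error variance.
lo = resid[fitted <= np.median(fitted)]
hi = resid[fitted > np.median(fitted)]
spread_ratio = max(lo.std(ddof=1), hi.std(ddof=1)) / min(lo.std(ddof=1), hi.std(ddof=1))
```

This is only a coarse screen, not a formal test; a plot remains the primary diagnostic, and a formal test such as Breusch-Pagan could be substituted.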

9.3 Checking the Normality of the Error Assumption

To assess whether the normality assumption is violated, the normal P-P plot of the regression standardized residuals is obtained. The P-P plot graphs the cumulative proportions of the standardized residuals against the cumulative proportions of the normal distribution. If the normality assumption holds, the points will cluster around a straight line.

[Normal P-P Plot of Standardized Residuals; Dependent Variable: BWT]

As the points in the plot lie along a straight line, the normality assumption is strongly supported.

9.4 Multicollinearity of the Independent Variables

It is well known that collinearity and multicollinearity can have harmful effects on multiple regression, both in the interpretation of the results and in how they are obtained. In particular, collinearity affects the parameter estimates and their standard errors, and consequently the t ratios: inflated standard errors mean wider confidence intervals for the regression coefficients and a diminished ability of the tests to find significant results. The simplest means of identifying collinearity is an examination of the correlation matrix for the predictor variables. The presence of high correlations (generally .9 and above) is the first indication of substantial collinearity. A lack of high correlation values, however, does not ensure a lack of collinearity, because collinearity may be due to the combined effect of two or more other independent variables; the correlation matrix shows only simple correlations between pairs of variables.
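The correlation-matrix screen described above is easy to automate. The sketch below builds a synthetic set of five predictors (values invented, with one moderately correlated pair planted) and scans the off-diagonal entries against the .9 rule of thumb.

```python
import numpy as np

rng = np.random.default_rng(2)

# Five synthetic predictors; column 4 is built to correlate moderately
# (about .6) with column 0, the rest are independent.
n = 400
P = rng.normal(size=(n, 5))
P[:, 4] = 0.6 * P[:, 0] + 0.8 * rng.normal(size=n)

R = np.corrcoef(P, rowvar=False)       # 5 x 5 correlation matrix

# Largest absolute off-diagonal correlation; values of about .9 or more
# would be a first indication of substantial collinearity.
off_diag = np.abs(R - np.eye(5))
max_corr = float(off_diag.max())
```

As the text notes, a clean correlation matrix alone does not rule out multicollinearity caused by combinations of several predictors; that is what tolerance, VIF, and condition indices address next.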
The correlation matrix for the child health data does not reveal any high correlation between any two predictor variables. The highest correlation of .87 is between age of

Two measures for assessing both pairwise and multiple-variable collinearity available in SPSS are the tolerance and the variance inflation factor (VIF). Tolerance is the proportion of the variability of a given independent variable that is not explained by the other independent variables. It is obtained by taking each independent variable in turn as a dependent variable and regressing it on the remaining independent variables. Tolerance values approaching zero indicate that the variable is highly collinear with the other predictor variables. The variance inflation factor is the reciprocal of the tolerance: VIF = 1 / TOLERANCE. Large VIF values (a usual threshold is 10, which corresponds to a tolerance of .10) indicate a high degree of collinearity or multicollinearity among the independent variables. The following output displays the values of tolerance and VIF for the predictor variables in the case study.

[Table: Variables in the Equation — B, SE B, Beta, Tolerance, VIF, and T for GESTWKS, MNOCIG, MHEIGHT, MPPWT, FHEIGHT, and the constant]

No VIF value approaches the threshold of 10, and the tolerance values show that only a small fraction of each independent variable's variance is explained by the other predictors. There is no evidence of significant collinearity in the problem. SPSS regression collinearity diagnostics also include the condition indices and the regression coefficient variance-decomposition matrix. A large condition index (over 30) indicates a high degree of collinearity. The regression coefficient variance-decomposition matrix shows the proportion of variance of each regression coefficient (and its associated variable) attributable to each condition index. To examine collinearity, we first identify all condition indices above the threshold value of 30.
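Tolerance, VIF, and condition indices can all be computed directly from the design matrix. The sketch below is a plain-numpy illustration on synthetic predictors (names and values invented), following the definitions above and the convention of scaling the columns to unit length before computing condition indices.

```python
import numpy as np

def tolerance_and_vif(X):
    """Tolerance_j = 1 - R^2 from regressing predictor j on the others
    (with an intercept); VIF_j = 1 / Tolerance_j.  X holds predictors
    only, one column per variable, no constant column."""
    n, k = X.shape
    stats = []
    for j in range(k):
        yj = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b, *_ = np.linalg.lstsq(Z, yj, rcond=None)
        e = yj - Z @ b
        r2 = 1.0 - (e @ e) / ((yj - yj.mean()) @ (yj - yj.mean()))
        tol = 1.0 - r2
        stats.append((tol, 1.0 / tol))
    return stats

def condition_indices(X):
    """Condition indices of a design matrix X (constant column included):
    columns are scaled to unit length, and each index is the ratio of the
    largest singular value to that singular value."""
    Xs = X / np.linalg.norm(X, axis=0)
    s = np.linalg.svd(Xs, compute_uv=False)
    return s[0] / s

# Illustration: three independent predictors, then a fourth that is
# nearly a copy of the first (strong collinearity).
rng = np.random.default_rng(0)
n = 300
A = rng.normal(size=(n, 3))
B = np.column_stack([A, A[:, 0] + 0.01 * rng.normal(size=n)])

vif_ok  = [v for _, v in tolerance_and_vif(A)]   # all close to 1
vif_bad = [v for _, v in tolerance_and_vif(B)]   # columns 0 and 3 inflated
ci_bad  = condition_indices(np.column_stack([np.ones(n), B]))
```

In the synthetic example, the three independent predictors give VIFs near 1, while the planted near-copy drives the VIFs of the two involved columns and the largest condition index far past the usual thresholds of 10 and 30.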
Then, for all condition indices exceeding the threshold, we identify the variables with variance proportions above .9. A collinearity problem is indicated when a condition index above the threshold accounts for a substantial proportion of variance (.9 or above) for two or more coefficients. Thus each row of the matrix with proportions exceeding .9 for at least two coefficients indicates significant correlations among the corresponding variables. The collinearity diagnostics table for the child health and development data is displayed below:

[Table: Collinearity Diagnostics — eigenvalues, condition indices, and variance proportions for the constant, GESTWKS, MNOCIG, MHEIGHT, MPPWT, and FHEIGHT]

As the table shows, there are three condition indices exceeding the threshold value of 30. However, none of them accounts for a substantial proportion of variance (.9 or above) for two or more coefficients. Thus we find no support for the existence of multicollinearity.

9.5 Diagnostics for Outliers and Influential Cases

Now we consider diagnostics for outliers and influential cases. An outlier is not necessarily an influential point, nor do all influential points have to be outliers; thus different statistical tools are used to identify outliers and influential observations. Studentized residuals are used for flagging outliers, and leverages and Cook's distances for flagging influential cases. A studentized residual is a residual divided by its estimated standard deviation. The standardization makes the residuals directly comparable, because ordinary residuals do not all have the same standard deviation. The studentized residual is the primary indicator of an observation that is an outlier on the dependent variable. With a fairly large sample size (50 or above), we may use the rule of thumb that studentized residuals smaller than -2 or larger than 2 are substantial; observations falling outside this range can be considered potential outliers. Instead of using cutoffs based on distributional assumptions, many researchers plot the studentized residuals and look for points that stand apart from the others. The following plot shows the studentized residuals versus case number for the child health and development data.

[Plot of Studentized Residual vs. Case Number]

There are several cases with studentized residuals above the 2 or below the -2 threshold. However, there are no points that stand clearly apart from the others. The leverage of a case is a measure of the distance between its explanatory-variable values and the average of the explanatory-variable values in the entire data set. A case with a large leverage can have a substantial impact on the regression results because of its difference from the other observations. Leverages are greater than 1/n and less than 1, and the average of all leverages in a data set is always p/n, where p is the number of regression coefficients. While a large leverage does not necessarily indicate that the case is influential, it does imply that the case has a high potential for influence. Statisticians use (2*p)/n as a lower cutoff point for flagging potentially influential cases if p > 10 and n > 50, and (3*p)/n otherwise. Instead of using cutoffs, many researchers simply look for points that stand apart from the others. In the child health problem, the threshold leverage value is (3*6)/680 = .026. There are several cases with leverage exceeding the threshold value, but two of them are substantially larger than the rest: .957 (case 67) and .569 (case 9).

[Plot of Leverage vs. Case Number]
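Both studentized residuals and leverages come from the hat matrix H = X(X'X)^(-1)X'. The sketch below (synthetic data with one planted response outlier and one planted high-leverage point; all values invented) follows the definitions used in this section.

```python
import numpy as np

def hat_diagonal(X):
    """Leverages: diagonal of the hat matrix H = X (X'X)^{-1} X'."""
    return np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

def studentized_residuals(X, y):
    """Internally studentized residuals e_i / (s * sqrt(1 - h_ii)),
    with s the residual standard deviation of the full fit."""
    n, p = X.shape
    h = hat_diagonal(X)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    s = np.sqrt(e @ e / (n - p))
    return e / (s * np.sqrt(1.0 - h))

# Synthetic data with a planted high-leverage point (case 0, far out
# in x) and a planted outlier in the response (case 10).
rng = np.random.default_rng(3)
n = 120
x = rng.normal(size=n)
x[0] = 8.0                                   # extreme predictor value
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y[10] += 10.0                                # gross outlier in the response

h = hat_diagonal(X)
r = studentized_residuals(X, y)
p = X.shape[1]
```

The leverages sum to p and lie between 1/n and 1; the planted high-leverage case exceeds the 3p/n flag, and the planted outlier is the case with the largest studentized residual.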

To get an overall assessment of influence and to see whether cases 67 and 9 are indeed influential, we look at another case-influence statistic, Cook's distance. Cook's distance measures the overall influence of a single case on the estimated regression coefficients, that is, how much the estimates change when the case is deleted from the estimation process. Large values (usually greater than 1) indicate that the case substantially affects the estimated regression coefficients. However, even if no observation exceeds this threshold, additional attention is warranted if a small set of observations has substantially higher values than the rest. The following plot shows the Cook's distances against case number for the data.

[Plot of Cook's Distance vs. Case Number]

Although the value of Cook's distance for case 56 is only .68, considerably smaller than 1, it is substantially larger than the values for the rest of the cases. It is therefore worthwhile to rerun the regression without this case to see its influence on the regression results. Case 56 is that of a heavy smoker who gave birth to a child with an extremely low birth weight of .5 pounds. Running the multiple regression without the case changes slightly the coefficient of determination (.6 without the case and .65 for all observations), the residual sum of squares (588.69 without the case and 598. for all observations), and the coefficients of the regression equation, which keeps the form

BWT = b0 + b1*GESTWKS + b2*MNOCIG + b3*MHEIGHT + b4*MPPWT + b5*FHEIGHT

with slightly different estimates of the coefficients.
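Cook's distance does not require refitting the model n times; it can be obtained from the leverages and residuals of the full fit. The sketch below (synthetic data, values invented) computes it that way and cross-checks one case against the delete-one definition.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for every case:
       D_i = r_i^2 * h_ii / (p * (1 - h_ii)),
    where r_i is the internally studentized residual, h_ii the leverage,
    and p the number of regression coefficients."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    s2 = e @ e / (n - p)
    r2 = e**2 / (s2 * (1.0 - h))
    return r2 * h / (p * (1.0 - h))

# Synthetic illustration (values invented)
rng = np.random.default_rng(5)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(scale=0.8, size=n)

D = cooks_distance(X, y)

# Cross-check case i = 5 against the definition: refit without the case
# and measure the shift of the coefficient vector in the X'X metric.
i = 5
p = X.shape[1]
b_full, *_ = np.linalg.lstsq(X, y, rcond=None)
b_drop, *_ = np.linalg.lstsq(np.delete(X, i, axis=0), np.delete(y, i), rcond=None)
e = y - X @ b_full
s2 = e @ e / (n - p)
d = b_full - b_drop
D_def = d @ (X.T @ X) @ d / (p * s2)
```

The shortcut and the delete-one definition agree exactly (up to rounding), which is why software can report Cook's distance for every case from a single fit.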