Pitfalls in Linear Regression Analysis


Due to the widespread availability of spreadsheet and statistical software, many of us do not really have a good understanding of how to use regression analysis properly. Indeed, before one can use regression analysis in a proper manner, one should realize the various difficulties associated with using it, namely:

1. Lacking an awareness of the assumptions of least-squares regression
2. Not knowing how to evaluate the assumptions of least-squares regression
3. Not knowing what the alternatives to least-squares regression are if a particular assumption is violated
4. Worse still, using a regression model without knowledge of the subject matter

How can a user be expected to know what the alternatives to least-squares regression are if a particular assumption is violated, when he or she in many instances is not even aware of the assumptions of regression, let alone how the assumptions can be evaluated? Hence, it is necessary to go beyond the basic number-crunching exercise of computing the Y-intercept, the gradient (slope) and r². Let us recall what the three major assumptions of regression and correlation are:

1. Assumption of normality: This requires that the errors around the line of regression be normally distributed at each value of X. Like the t test and the ANOVA F test, regression analysis is fairly robust against departures from the normality assumption. As long as the distribution of the errors around the line of regression at each level of X is not extremely different from a normal distribution, inferences about the line of regression and the regression coefficients will not be seriously affected.

2. Assumption of homoscedasticity: This requires that the variation around the line of regression be constant for all values of X. This means that the errors in the Y-responses vary by the same amount when X is a low value as when X is a high value.
This homoscedasticity assumption is important for using the least-squares method of determining the regression coefficients. If there are serious departures from this assumption, either data transformations or weighted least-squares methods can be applied.
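A minimal sketch of the weighted least-squares remedy, using NumPy's polyfit with its w argument (the data and weights here are hypothetical, chosen so that the spread of the errors grows with X):

```python
import numpy as np

# Hypothetical data whose error spread grows with X (heteroscedastic):
# the noise term is proportional to x.
x = np.arange(1.0, 11.0)
noise = np.array([0.05, -0.1, 0.2, -0.3, 0.1, -0.4, 0.5, -0.2, 0.6, -0.5])
y = 2.0 + 0.5 * x + x * noise

# Ordinary least squares treats every observation equally.
ols_slope, ols_intercept = np.polyfit(x, y, 1)

# Weighted least squares: np.polyfit minimizes sum(w**2 * residual**2),
# so w = 1/x down-weights the noisy large-X observations (error
# standard deviation assumed proportional to x).
wls_slope, wls_intercept = np.polyfit(x, y, 1, w=1.0 / x)

print(f"OLS: Y = {ols_slope:.3f}X + {ols_intercept:.3f}")
print(f"WLS: Y = {wls_slope:.3f}X + {wls_intercept:.3f}")
```

Because the weights discount the high-variance observations, the weighted fit here recovers the underlying slope of 0.5 more closely than the unweighted one.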

3. Assumption of independence of errors: This requires that the errors be independent for each value of X. This assumption is particularly important when data are collected over a period of time. In such a situation, the errors for a particular time period are often correlated with those of the previous time period. If this occurs, alternatives to least-squares regression analysis need to be considered.

This discussion of pitfalls in regression can best be illustrated by referring to Table 1, which shows three sets of artificial data that demonstrate the importance of examining the observations through scatter plots and residual analysis.

Table 1: Three sets of artificial data

     Data Set A     Data Set B     Data Set C
     X     Y        X     Y        X     Y
     4     4.26     4     3.10     4     5.39
     5     5.68     5     4.74     5     5.73
     6     7.24     6     6.13     6     6.08
     7     4.82     7     7.26     7     6.42
     8     6.95     8     8.14     8     6.77
     9     8.81     9     8.77     9     7.11
    10     8.04    10     9.14    10     7.46
    11     8.33    11     9.26    11     7.81
    12    10.84    12     9.13    12     8.15
    13     7.58    13     8.74    13    12.74
    14     9.96    14     8.10    14     8.84

It is interesting to note that these three sets of data yield almost the same statistical parameter values:

Y = 0.50X + 3.00
r² = 0.667
SSR = 27.51
SSE = 13.76
SST = 41.27

Had we stopped our analysis at this point, we would have lost valuable information in the data collected and might have been led to wrong inferences and conclusions. However, if we examine the scatter diagrams of these three sets of data as shown in Figure 1 and the residual plots in Figure 2, we see how different the data sets are.

Figure 1: Scatter Plots of Data Sets A, B and C
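These summary figures can be verified directly. The following sketch (using NumPy, with the data keyed in from Table 1) fits the least-squares line to each data set:

```python
import numpy as np

x = np.arange(4, 15)  # X = 4, 5, ..., 14 for all three data sets
data_sets = {
    "A": [4.26, 5.68, 7.24, 4.82, 6.95, 8.81, 8.04, 8.33, 10.84, 7.58, 9.96],
    "B": [3.10, 4.74, 6.13, 7.26, 8.14, 8.77, 9.14, 9.26, 9.13, 8.74, 8.10],
    "C": [5.39, 5.73, 6.08, 6.42, 6.77, 7.11, 7.46, 7.81, 8.15, 12.74, 8.84],
}

for name, y in data_sets.items():
    y = np.asarray(y)
    slope, intercept = np.polyfit(x, y, 1)   # least-squares straight line
    y_hat = slope * x + intercept
    sse = np.sum((y - y_hat) ** 2)           # error sum of squares
    sst = np.sum((y - y.mean()) ** 2)        # total sum of squares
    r2 = 1 - sse / sst
    print(f"Data Set {name}: Y = {slope:.2f}X + {intercept:.2f}, r² = {r2:.3f}")
```

All three fits agree to two decimal places, which is exactly why the summary statistics alone cannot distinguish the data sets.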

[Scatter plots omitted. The fitted trend lines shown on the plots are:
Data Set A: y = 0.5001x + 3.0001, R² = 0.6665
Data Set B: y = 0.5x + 3.0009, R² = 0.6662
Data Set C: y = 0.4997x + 3.0025, R² = 0.6663]

Figure 2: Residual plots for Data Sets A, B and C

[Residual plots omitted.]

From Figures 1 and 2, we see that the only data set that seems to follow an approximate straight line is data set A. The residual plot for data set A shows a random pattern around zero and does not appear to have

outlying residuals. This is certainly not the case for data sets B and C. The scatter plot of data set B seems to indicate that a quadratic regression model would be more suitable. This observation is reinforced by the clear parabolic form of the residual plot for data set B. Indeed, the quadratic equation Y = -0.1267X² + 2.7808X - 5.9957 fits the points very well, with r² = 1, as shown in Figure 3 below.

Figure 3: Data Set B with a quadratic curve

[Plot omitted. The chart shows both the linear fit (y = 0.5x + 3.0009, R² = 0.6662) and the quadratic fit (y = -0.1267x² + 2.7808x - 5.9957, R² = 1).]

The scatter diagram and the residual plot for data set C clearly depict what may very well be an outlying observation. If this is the case, we may want to remove the obvious outlier and re-estimate the basic model. It should be noted that, upon removal of the outlier, re-estimating the model might lead to a relationship that is very different from the one originally conjectured.

We see here that residual plots are of vital importance to a complete regression analysis. They provide basic information essential to a credible analysis and should therefore always be included as part of a regression analysis.

We summarize a strategy for avoiding the pitfalls of regression as follows:

1. Always start with a scatter plot to observe the possible relationship between X and Y;
2. Check the assumptions of regression after the regression model has been fitted and before moving on to using the results of the model;
3. Plot the residuals versus the independent variable. This chart will enable

us to determine whether the model fitted to the data is an appropriate one and will allow us to check visually for violations of the homoscedasticity assumption;
4. Use a histogram, box-and-whisker plot, or normal probability plot of the residuals to evaluate graphically whether the normality assumption has been seriously violated;
5. If the evaluations in (3) and (4) indicate violations of the assumptions, use methods alternative to least-squares regression or alternative least-squares models (quadratic or multiple regression), depending on what the evaluation indicates;
6. If the evaluations in (3) and (4) do not indicate violations of the assumptions, then proceed with the inferential aspects of the regression analysis: tests for the significance of the regression coefficients can be carried out, and confidence and prediction intervals can be developed.
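The residual check in step (3) and the re-fit in step (5) can be sketched in code for data set B, whose curved pattern was diagnosed above (a sketch using NumPy, with the data keyed in from Table 1):

```python
import numpy as np

x = np.arange(4, 15)
y_b = np.array([3.10, 4.74, 6.13, 7.26, 8.14, 8.77, 9.14, 9.26, 9.13, 8.74, 8.10])

# Steps (2)-(3): fit the straight line and inspect the residuals.
slope, intercept = np.polyfit(x, y_b, 1)
lin_resid = y_b - (slope * x + intercept)
# A systematic sign pattern (negative, then positive, then negative)
# signals curvature that a straight line cannot capture.
print(np.sign(lin_resid).astype(int))

# Step (5): re-fit with a quadratic model.
a, b, c = np.polyfit(x, y_b, 2)
quad_resid = y_b - (a * x**2 + b * x + c)
r2 = 1 - np.sum(quad_resid**2) / np.sum((y_b - y_b.mean())**2)
print(f"Y = {a:.4f}X² + {b:.4f}X {c:+.4f}, r² = {r2:.4f}")
```

The quadratic coefficients reproduce those quoted with Figure 3, and the near-perfect r² confirms that the parabolic residual pattern, not the summary statistics, was the telltale sign.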