Linear Regression in SAS


Suppose we wish to examine factors that predict patients' hemoglobin levels. Simulated data for six patients are used throughout this tutorial.

data hgb_data;
    input id age race $ bmi hgb;
    cards;
21 25 W 18.0 9.3
27 34 W 26.2 11.1
45 21 B 17.9 9.1
12 28 B 22.6 10.5
13 23 W 28.7 11.6
44 32 W 20.4 10.2
;
run;

If we wish to investigate whether BMI can be used to predict hemoglobin levels, we can use simple linear regression. If we also expect age to influence hemoglobin levels, we can instead use multiple linear regression to investigate the relationship of both BMI and age with hemoglobin levels. CAUTION: Sample size is an important consideration when conducting a linear regression. Whenever possible, a sample size calculation should be made prior to data collection. As a rule of thumb, however, the number of observations should be at least 10 times the number of predictors.
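For readers who would like to reproduce the numbers outside SAS, the same six simulated records can be written down in plain Python. This is only a sketch; the tuple layout simply mirrors the input statement.

```python
# The six simulated patients from the SAS data step, as plain Python records.
# Columns mirror the input statement: id, age, race, bmi, hgb.
patients = [
    (21, 25, "W", 18.0,  9.3),
    (27, 34, "W", 26.2, 11.1),
    (45, 21, "B", 17.9,  9.1),
    (12, 28, "B", 22.6, 10.5),
    (13, 23, "W", 28.7, 11.6),
    (44, 32, "W", 20.4, 10.2),
]

bmi = [row[3] for row in patients]
hgb = [row[4] for row in patients]

# Quick summary: sample size and means of the two variables of interest.
print(len(patients), sum(bmi) / len(bmi), sum(hgb) / len(hgb))
```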

Plotting the data: We begin with a scatterplot to visualize the relationship between hemoglobin and BMI. Hemoglobin is represented by the variable hgb and BMI by the variable bmi.

proc sgplot data=hgb_data;
    scatter x=bmi y=hgb;
    title "Relationship of Hemoglobin and BMI";
    xaxis label="bmi";
    yaxis label="hemoglobin";
run;

We can see that higher hemoglobin levels correspond with higher BMI; this can be described as a positive linear relationship.
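The strength of the linear pattern seen in the plot can also be quantified. As a cross-check outside SAS, the Pearson correlation for these six points can be computed in plain Python; this is a sketch, not part of the SAS workflow.

```python
from math import sqrt

# BMI and hemoglobin for the six simulated patients.
bmi = [18.0, 26.2, 17.9, 22.6, 28.7, 20.4]
hgb = [9.3, 11.1, 9.1, 10.5, 11.6, 10.2]

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

print(round(pearson_r(bmi, hgb), 3))
```

The correlation is about 0.98, consistent with the strong positive trend visible in the scatterplot.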

Simple Linear Regression: Two procedures in SAS are typically used to perform linear regression: PROC GLM and PROC REG. We will illustrate the use of PROC GLM to investigate whether BMI can be used to predict hemoglobin levels. The syntax for a simple linear regression is as follows:

proc glm data=hgb_data;
    model hgb=bmi;
run;
quit;

PROC GLM requires a quit statement; otherwise the procedure continues to run in the background. By default, SAS outputs both the number of observations in the dataset (Number of Observations Read) and the number of observations used in the analysis (Number of Observations Used). SAS also includes the overall ANOVA table, the R-square value, the Type I SS analysis, the Type III SS analysis, and, when all covariates are continuous, the Parameter Estimates.

The overall ANOVA table indicates whether the model as a whole explains a significant amount of the variability in the outcome variable. The Type I SS analysis computes F-tests for sequential sums of squares, while the Type III SS analysis computes F-tests for partial sums of squares. In the case of simple linear regression, the overall ANOVA, Type I SS analysis, and Type III SS analysis are equivalent. In simple linear regression, the R-square value represents the proportion of variability in the outcome explained by the predictor; in this example, approximately 96% of the variability in hemoglobin levels can be explained by BMI. The Parameter Estimates can be used to write out the regression equation; in this example, the estimated slope for bmi is 0.218.
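As a cross-check outside SAS, the same least-squares estimates can be computed in a few lines of plain Python. This sketch is an illustration only, not part of the SAS workflow.

```python
# Least-squares slope, intercept, and R-square for hgb on bmi,
# computed from the same six simulated observations.
bmi = [18.0, 26.2, 17.9, 22.6, 28.7, 20.4]
hgb = [9.3, 11.1, 9.1, 10.5, 11.6, 10.2]

n = len(bmi)
mx, my = sum(bmi) / n, sum(hgb) / n
sxx = sum((x - mx) ** 2 for x in bmi)
sxy = sum((x - mx) * (y - my) for x, y in zip(bmi, hgb))
syy = sum((y - my) ** 2 for y in hgb)

slope = sxy / sxx                   # slope of the fitted line
intercept = my - slope * mx         # intercept of the fitted line
r_square = sxy ** 2 / (sxx * syy)   # proportion of variability explained
print(round(slope, 3), round(intercept, 2), round(r_square, 2))
```

This reproduces the slope of about 0.218 and the R-square of about 0.96 reported by PROC GLM.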

This means that an increase of one point in BMI corresponds to a predicted 0.218 increase in hemoglobin level. Also by default, in simple linear regression SAS produces a scatterplot that includes the line of best fit together with its 95% confidence limits and 95% prediction limits. The production of any plots can be suppressed by including the plots=none option:

proc glm data=hgb_data plots=none;
    model hgb=bmi;
run;
quit;

Multiple Linear Regression: The syntax for a multiple linear regression investigating the effects of both BMI and age on hemoglobin levels using PROC GLM is as follows:

proc glm data=hgb_data plots=none;
    model hgb=bmi age;
run;
quit;

As in the simple linear regression case, SAS outputs both the number of observations in the dataset (Number of Observations Read) and the number of observations used in the analysis (Number of Observations Used). The overall ANOVA table, R-square value, Type I SS analysis, Type III SS analysis, and, when all covariates are continuous, the Parameter Estimates are included as well.
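Because these estimates come from a least-squares fit, they can also be reproduced outside SAS. The following plain-Python sketch solves the normal equations for the two-predictor model; it is an illustration only, not part of the SAS workflow.

```python
# Cross-check of the multiple regression of hgb on bmi and age,
# solving the least-squares normal equations for the centered predictors.
age = [25, 34, 21, 28, 23, 32]
bmi = [18.0, 26.2, 17.9, 22.6, 28.7, 20.4]
hgb = [9.3, 11.1, 9.1, 10.5, 11.6, 10.2]

n = len(hgb)
ma, mb, mh = sum(age) / n, sum(bmi) / n, sum(hgb) / n
a = [x - ma for x in age]   # centered age
b = [x - mb for x in bmi]   # centered bmi
h = [y - mh for y in hgb]   # centered hgb

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Normal equations for the centered model:
#   [S_bb S_ab] [beta_bmi]   [S_bh]
#   [S_ab S_aa] [beta_age] = [S_ah]
s_bb, s_aa, s_ab = dot(b, b), dot(a, a), dot(a, b)
s_bh, s_ah, s_hh = dot(b, h), dot(a, h), dot(h, h)
det = s_bb * s_aa - s_ab ** 2
beta_bmi = (s_bh * s_aa - s_ab * s_ah) / det
beta_age = (s_bb * s_ah - s_ab * s_bh) / det
r_square = (beta_bmi * s_bh + beta_age * s_ah) / s_hh
print(round(beta_bmi, 3), round(r_square, 2))
```

The adjusted coefficient for bmi is about 0.210 and the model R-square is about 0.98, matching the PROC GLM output described below.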

The overall ANOVA table indicates whether the entire model explains a significant amount of the variability in the outcome variable. The Type I SS analysis computes F-tests for sequential sums of squares, while the Type III SS analysis computes F-tests for partial sums of squares. In many applications the Type III SS are preferred, because the F-test for each effect is adjusted for all other variables in the model. In this example, that means there is a significant relationship between BMI and hemoglobin levels even after taking the patients' ages into account.

In multiple linear regression, the R-square value represents the proportion of variability in the outcome explained by the model; in this example, approximately 98% of the variability in hemoglobin levels can be explained by the model. The Parameter Estimates can be used to write out the regression equation. For this example, holding age constant, an increase of one point in BMI corresponds to a predicted 0.210 increase in hemoglobin level. Diagnostic Plots: Several diagnostic plots may be requested after fitting a regression model. For example, we may wish to check two key assumptions of the linear regression model: normality of the errors and linearity of the relationship between the outcome and the predictors. The normality of the error terms is most commonly assessed with a Q-Q plot, while linearity is typically checked by plotting the residuals against the fitted values and looking for random scatter about zero. Diagnostic plots are obtained and interpreted in the same manner for both simple and multiple linear regression. To obtain a panel of diagnostic plots for the simple linear regression example:

proc glm data=hgb_data plots=diagnostics;
    model hgb=bmi;
run;
quit;

9 The Q-Q plot is the first plot of the middle row of the panel. The more closely the points follow the diagonal line, the more normally distributed the residual quantiles appear to be. In this example, the assumption of normally distributed errors appears to be a reasonable assumption. The residual plot (where residuals are plotted against the fitted values) is the first plot on the first row of the panel. In this example, the points are randomly scattered around zero so the assumption of linearity also appears to be a reasonable assumption.