Simple Linear Regression One Categorical Independent Variable with Several Categories

Does ethnicity influence total GCSE score?

We've learned that variables with just two categories are called binary variables and are simple to use in regression. However, many variables have more than two categories. Suppose you want to see how the total GCSE score of the respondents is related to their ethnic group. In the YCS dataset, the variable seth2 has five categories (1=White, 2=Black, 3=Asian, 4=Mixed, and 5=Other). Much like with sex discussed above, the codes 1, 2, 3, 4, and 5 assigned to each ethnicity do not represent anything; the order is arbitrary. However, because linear regression assumes all independent variables are numerical, if we were to enter the variable seth2 into a linear regression model, the coded values of the five categories would be interpreted as numerical values of each category. As we found with sgender, using seth2 in a linear regression without changing the coded values of the categories would give us results that would not make sense. To avoid this error, we're going to create dummy variables for seth2. This is done in much the same way that we created the dummies for sex.

However, before we begin, let's check the value labels for seth2 to make sure this variable is ready for analysis. Find seth2 in the Variable View window. Click on the Values cell in the seth2 row to view the values assigned to our variable. You should see that in addition to the five ethnicity categories, there is a category called Not answered, which is labelled as -9.00. Because we are interested in the influence of ethnicity on GCSE score, these unanswered cases are effectively missing data, so we can code -9.00 as missing. To do so, click to open the Missing cell, select Discrete missing values, and enter -9.00 into the text box. Click OK. Now that we've edited seth2, we're ready to build our dummy variables and conduct our linear regression analysis!
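For readers who want to see the missing-value step outside the SPSS menus, here is a minimal Python sketch of the same idea. It is only an illustration: the list of codes is made up, the helper name mark_missing is hypothetical, and using None to stand for a missing value is an assumption of this sketch, not something the YCS dataset or SPSS prescribes.

```python
# Illustrative only: made-up seth2 codes, where -9 stands for "Not answered".
NOT_ANSWERED = -9

def mark_missing(codes, missing_code=NOT_ANSWERED):
    """Return a copy of codes with the missing code replaced by None."""
    return [None if c == missing_code else c for c in codes]

seth2 = [1, 3, -9, 2, 5, 1, -9, 4]
cleaned = mark_missing(seth2)

# Only cases with a real ethnicity code count as valid for the analysis.
valid = [c for c in cleaned if c is not None]
print(cleaned)
print(len(valid), "valid cases out of", len(seth2))
```

The point of the sketch is simply that -9 is a placeholder and not a sixth ethnicity category, so it has to be set aside before any dummy variables are built.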
Dummy Variables

Remember that a dummy variable is a variable created to assign numerical values to the levels of a categorical variable. Each dummy variable represents one category of the explanatory variable and is coded 1 if the case falls in that category and 0 if not. For example, in the dummy variable for Mixed ethnicity, all cases in which the young person's ethnicity is Mixed will be coded as 1 and all other cases are coded as 0. In the dummy variable for White, all cases in which the young person's ethnicity is White will be coded as 1 and all other cases are coded as 0. The same will be done in the Black, the Asian, and the Other ethnicity dummy variables. This allows us to enter the ethnicity values as numerical.

To begin creating our five dummy variables (one for each of the categories in seth2), select Transform and then Recode into Different Variables. Find seth2 in the variable list on the left and move it to the Numeric Variable -> Output Variable text box. Under the Output Variable header, type in the name and label of the first dummy variable you want to create. Because 1=White in the seth2 variable values, we can start by creating the WHITE dummy variable. When you've finished entering the Output Variable name and label, click Change. You should see your output variable (in this case, WHITE) in the dialogue box.

Next, click Old and New Values. Because 1=White in seth2, enter 1 under the Old Value header and 1 under the New Value header. Click Add. You should see 1 -> 1 in the Old -> New text box. Now, because in this dummy variable we want all the other values to be 0, click All other values under the Old Value header and enter 0 under the New Value header. Click Add. Click Continue, and then OK in the original Recode into Different Variables dialogue box.
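The recode rule for a single dummy variable (the target code becomes 1, every other code becomes 0) can be sketched in Python as follows. The data are made up and the function name recode_dummy is hypothetical. One assumption to flag: this sketch keeps missing values (None) as missing and does not turn them into 0, which is a choice made here for illustration only.

```python
def recode_dummy(codes, target):
    """Code 1 where the value equals target, 0 for any other valid value.

    Missing values (None) are left as None; that is an assumption of this sketch.
    """
    return [None if c is None else int(c == target) for c in codes]

# Made-up ethnicity codes: 1=White, 2=Black, 3=Asian, 4=Mixed, 5=Other.
seth2 = [1, 3, None, 2, 5, 1, 4]

white = recode_dummy(seth2, target=1)
print(white)
```

The same function with target=2, 3, 4, or 5 would give the BLACK, ASIAN, MIXED, and OTHER dummies.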

To check that you have successfully created a dummy variable called WHITE, scroll down to the end of the variable list in Variable View. WHITE should be the last variable in the list, as it is the latest variable to be created.

Repeat the above steps for the BLACK, ASIAN, MIXED, and OTHER ethnicity categories. However, remember that when entering these categories, you must use their corresponding values when recoding: 2=Black, 3=Asian, 4=Mixed, and 5=Other. For example, when recoding seth2 into the BLACK dummy variable, you'll use 2 as the Old Value and 1 as the New Value for BLACK, and recode all other values to 0. When creating the ASIAN dummy variable, you will use 3 as the Old Value and 1 as the New Value for ASIAN, while recoding all other values to 0. You'll use 4 as the Old Value and 1 as the New Value for MIXED, while recoding all other values to 0. And, finally, you'll use 5 as the Old Value and 1 as the New Value for OTHER, while recoding all other values to 0. When you've finished, you should have five new dummy variables at the end of your variable list.

Now we're ready to fit a linear regression model for this categorical data! This does seem very long-winded, and it is, but this is the process you need to go through each time you conduct linear regression with a categorical variable that has more than two categories.

Before we begin: when we fit our model in SPSS, we need to select one dummy variable as the baseline category (the category against which we compare all the other categories). In this example, we will use WHITE as the baseline category. As the WHITE variable is now our baseline, we don't have to include it in the linear regression model. We will, however, need to include all of the other dummy variables for ethnicity in the model. Basically, this means that we are comparing each of the other ethnic groups to the White ethnic group.
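To make the role of the baseline category concrete, the following Python sketch fits an ordinary least squares model on made-up scores, using four dummies with White left out as the baseline. Everything here is illustrative: the data, the helper names solve and fit_ols, and the use of the normal equations are assumptions of the sketch and say nothing about how SPSS computes its estimates internally.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_ols(rows, y):
    """Least squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Made-up data: ethnicity codes (1=White ... 5=Other) and made-up scores.
groups = [1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
scores = [400, 390, 410, 350, 340, 405, 415, 395, 405, 420, 430]

# Design matrix: a constant plus dummies for codes 2-5; code 1 (White) is the baseline.
non_baseline = [2, 3, 4, 5]
X = [[1] + [int(g == code) for code in non_baseline] for g in groups]

coefs = fit_ols(X, scores)
for name, b in zip(["Constant", "BLACK", "ASIAN", "MIXED", "OTHER"], coefs):
    print(name, round(b, 3))
```

In this toy example the constant equals the mean score of the baseline group (400), and each dummy coefficient equals that group's mean minus the baseline mean (for instance, 345 - 400 = -55 for BLACK). That is what "comparing each group to the baseline" means in practice.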
To perform simple linear regression, select Analyze, Regression, and then Linear. In the dialogue box that appears, move sgcseptsnew to the Dependent box and MIXED, ASIAN, BLACK, and OTHER to the Independent(s) box. Click OK. You should get the following output:

Variables Entered/Removed (a)

Model  Variables Entered                            Variables Removed  Method
1      Other Ethnic Group, Mixed Ethnic Group,      .                  Enter
       Black Ethnic Group, Asian Ethnic Group (b)

a. Dependent Variable: ks4 pts score on new basis not capped
b. All requested variables entered.

Model Summary

Model  R         R Square  Adjusted R Square  Std. Error of the Estimate
1      .066 (a)  .004      .004               125.53087

a. Predictors: (Constant), Other Ethnic Group, Mixed Ethnic Group, Black Ethnic Group, Asian Ethnic Group

ANOVA (a)

Model          Sum of Squares  df     Mean Square  F       Sig.
1  Regression  943245.567      4      235811.392   14.965  .000 (b)
   Residual    216908848.523   13765  15757.998
   Total       217852094.090   13769

a. Dependent Variable: ks4 pts score on new basis not capped
b. Predictors: (Constant), Other Ethnic Group, Mixed Ethnic Group, Black Ethnic Group, Asian Ethnic Group

Coefficients (a)

                         Unstandardized Coefficients  Standardized Coefficients
Model                    B         Std. Error         Beta                       t        Sig.
1  (Constant)            394.550   1.156                                         341.322  .000
   Black Ethnic Group    -45.857   6.663              -.059                      -6.883   .000
   Asian Ethnic Group    8.000     3.799              .018                       2.106    .035
   Mixed Ethnic Group    4.974     7.211              .006                       .690     .490
   Other Ethnic Group    30.367    12.798             .020                       2.373    .018

a. Dependent Variable: ks4 pts score on new basis not capped

Take a look at the p-values (the Sig. column) calculated for our ethnicity dummy variables. Which of them are statistically significant at the p < 0.05 level? All of the dummy variables are statistically significant except for the Mixed ethnic group variable. The p-value in that case is 0.490, much higher than the 0.05 threshold. This means we cannot conclude that the average GCSE score of the Mixed ethnic group differs from that of the White baseline group.

We can use our SPSS results to write out the fitted regression equation for this model and use it to predict values of sgcseptsnew for given values of seth2. In this case, WHITE is our baseline, and therefore the Constant coefficient value of 394.550 represents the predicted total GCSE score of a respondent in that category. Remember that the dummy variables used in this regression model are coded as Mixed=1, Asian=1, Black=1, and Other=1. This means that when calculating each of their predicted values, we will enter 1 as the value for x in the regression equation. The predicted scores are as follows:

Y = a + bx

sgcseptsnew = 394.550 + (-45.857 x 1) = 348.693 (Black Ethnic Group)
sgcseptsnew = 394.550 + (8.000 x 1) = 402.550 (Asian Ethnic Group)
sgcseptsnew = 394.550 + (4.974 x 1) = 399.524 (Mixed Ethnic Group)
sgcseptsnew = 394.550 + (30.367 x 1) = 424.917 (Other Ethnic Group)

Looking at the predicted total GCSE scores above, we can see that on average, Asian young people earn GCSE scores that are 8 points higher than White students (as WHITE is our baseline category). On average, Black students earn GCSE scores that are how many points lower than White respondents? On average, respondents in the Other ethnic group category earn GCSE scores that are how many points higher than White respondents?

Remember, we can use the R² statistic (shown as R Square in the Model Summary output table) to gauge how much variation in the dependent variable is explained by the independent variable. In our ethnicity example, R² is low at .004, or 0.4%. Only 0.4% of the variation in total GCSE score is explained by ethnicity.

Summary

Here, we've used linear regression to test whether total GCSE scores differ significantly between young people from various ethnic backgrounds. We've created dummy variables in order to use our ethnicity variable, a categorical variable with several categories, in this regression. We've learned that there is, in fact, a statistically significant relationship between GCSE score and ethnicity, and we've predicted GCSE scores using the ethnicity coefficients presented to us in the linear regression output. Now, we may want to see how our predicted scores change if we run a linear regression using both sex and ethnicity as independent variables.

Note: as we are making changes to a dataset we'll continue using for the rest of this section, please make sure to save your changes before you close down SPSS. This will save you having to repeat sections you've already completed!
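As a cross-check on the worked example, the predicted scores can be recomputed in a few lines of Python from the constant and the B coefficients reported in the Coefficients table. This is a minimal sketch: the function name predict and the dictionary layout are hypothetical, and the numbers are copied from the output discussed in this section.

```python
# Constant and unstandardized B coefficients as reported in the Coefficients table.
CONSTANT = 394.550
COEFS = {"Black": -45.857, "Asian": 8.000, "Mixed": 4.974, "Other": 30.367}

def predict(group):
    """Predicted score Y = a + b*x, with x = 1 for the named group.

    White is the baseline, so it has no coefficient and gets the constant alone.
    """
    return CONSTANT + COEFS.get(group, 0.0)

for g in ["White", "Black", "Asian", "Mixed", "Other"]:
    print(f"{g}: {predict(g):.3f}")
```

Because each respondent belongs to exactly one ethnic group, at most one dummy equals 1 at a time, so the prediction is always the constant plus a single coefficient.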