Stat 13, Lab 11-12, Correlation and Regression Analysis

Similar documents
5 To Invest or not to Invest? That is the Question.

Lab 4 (M13) Objective: This lab will give you more practice exploring the shape of data, and in particular in breaking the data into two groups.

CRITERIA FOR USE. A GRAPHICAL EXPLANATION OF BI-VARIATE (2 VARIABLE) REGRESSION ANALYSISSys

Statistical Reasoning in Public Health 2009 Biostatistics 612, Homework #2

Chapter 3: Examining Relationships

LAB ASSIGNMENT 4 INFERENCES FOR NUMERICAL DATA. Comparison of Cancer Survival*

1.4 - Linear Regression and MS Excel

Midterm STAT-UB.0003 Regression and Forecasting Models. I will not lie, cheat or steal to gain an academic advantage, or tolerate those who do.

Introduction to regression

Math 075 Activities and Worksheets Book 2:

bivariate analysis: The statistical analysis of the relationship between two variables.

STAT 135 Introduction to Statistics via Modeling: Midterm II Thursday November 16th, Name:

STAT 201 Chapter 3. Association and Regression

Bangor University Laboratory Exercise 1, June 2008

This tutorial presentation is prepared by. Mohammad Ehsanul Karim

Chapter 3: Describing Relationships

14.1: Inference about the Model

Biology 345: Biometry Fall 2005 SONOMA STATE UNIVERSITY Lab Exercise 5 Residuals and multiple regression Introduction

Preliminary Report on Simple Statistical Tests (t-tests and bivariate correlations)

Section 3.2 Least-Squares Regression

AP Statistics Practice Test Ch. 3 and Previous

AP STATISTICS 2010 SCORING GUIDELINES

Simple Linear Regression the model, estimation and testing

Multiple Regression Analysis

Homework #3. SHORT ANSWER. Write the word or phrase that best completes each statement or answers the question.

Lecture 6B: more Chapter 5, Section 3 Relationships between Two Quantitative Variables; Regression

Multiple Linear Regression Analysis

Use the above variables and any you might need to construct to specify the MODEL A/C comparisons you would use to ask the following questions.

BIOL 458 BIOMETRY Lab 7 Multi-Factor ANOVA

MULTIPLE REGRESSION OF CPS DATA

Bayesian approaches to handling missing data: Practical Exercises

q3_2 MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

Regression CHAPTER SIXTEEN NOTE TO INSTRUCTORS OUTLINE OF RESOURCES

A response variable is a variable that. An explanatory variable is a variable that.

Pitfalls in Linear Regression Analysis

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES

Intro to SPSS. Using SPSS through WebFAS

HW 3.2: page 193 #35-51 odd, 55, odd, 69, 71-78

Chapter 4: Scatterplots and Correlation

Unit 1 Exploring and Understanding Data

M 140 Test 1 A Name SHOW YOUR WORK FOR FULL CREDIT! Problem Max. Points Your Points Total 60

Multiple Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University

STATISTICS & PROBABILITY

Lab 8: Multiple Linear Regression

3.2 Least- Squares Regression

Simple Linear Regression

Linear Regression in SAS

Problem set 2: understanding ordinary least squares regressions

WDHS Curriculum Map Probability and Statistics. What is Statistics and how does it relate to you?

Chapter 14. Inference for Regression Inference about the Model 14.1 Testing the Relationship Signi!cance Test Practice

MEASURES OF ASSOCIATION AND REGRESSION

Chapter 1: Exploring Data

Economics 345 Applied Econometrics

Your Task: Find a ZIP code in Seattle where the crime rate is worse than you would expect and better than you would expect.

Chapter 3 CORRELATION AND REGRESSION

MODEL I: DRINK REGRESSED ON GPA & MALE, WITHOUT CENTERING

isc ove ring i Statistics sing SPSS

MEA DISCUSSION PAPERS

CHAPTER 3 Describing Relationships

TEACHING REGRESSION WITH SIMULATION. John H. Walker. Statistics Department California Polytechnic State University San Luis Obispo, CA 93407, U.S.A.

Lecture 12: more Chapter 5, Section 3 Relationships between Two Quantitative Variables; Regression

STATISTICS INFORMED DECISIONS USING DATA

Reminders/Comments. Thanks for the quick feedback I ll try to put HW up on Saturday and I ll you

Risk Aversion in Games of Chance

Today: Binomial response variable with an explanatory variable on an ordinal (rank) scale.

Understandable Statistics

Class 7 Everything is Related

ANOVA. Thomas Elliott. January 29, 2013

Simple Linear Regression One Categorical Independent Variable with Several Categories

Practice First Midterm Exam

10. LINEAR REGRESSION AND CORRELATION

6. Unusual and Influential Data

7) Briefly explain why a large value of r 2 is desirable in a regression setting.

Statistics and Probability

3.2A Least-Squares Regression

Analysis of Categorical Data from the Ashe Center Student Wellness Survey

Biostatistics II

Case Study: Lead Exposure in Children

IAPT: Regression. Regression analyses

CHAPTER TWO REGRESSION

3. For a $5 lunch with a 55 cent ($0.55) tip, what is the value of the residual?

Carrying out an Empirical Project

Lesson 1: Distributions and Their Shapes

Experiment 1: Scientific Measurements and Introduction to Excel

Further Mathematics 2018 CORE: Data analysis Chapter 3 Investigating associations between two variables

The Effectiveness of Captopril

Content. Basic Statistics and Data Analysis for Health Researchers from Foreign Countries. Research question. Example Newly diagnosed Type 2 Diabetes

Ordinary Least Squares Regression

V. LAB REPORT. PART I. ICP-AES (section IVA)

Homework Linear Regression Problems should be worked out in your notebook

(a) 50% of the shows have a rating greater than: impossible to tell

Answer all three questions. All questions carry equal marks.

A Penny for Your Thoughts: Scientific Measurements and Introduction to Excel

Simple Linear Regression: Prediction. Instructor: G. William Schwert

Problem 1) Match the terms to their definitions. Every term is used exactly once. (In the real midterm, there are fewer terms).

Data Analysis Using Regression and Multilevel/Hierarchical Models

SPSS Correlation/Regression

Psychology of Perception PSYC Spring 2017 Laboratory 2: Perception of Loudness

7) Find the sum of the first 699 positive even integers. 7)

Transcription:

Stat 13, Lab 11-12, Correlation and Regression Analysis Part I: Before Class Objective: This lab will give you practice exploring the relationship between two variables by using correlation, linear regression and graphical techniques. Before starting this lab, you should 1) Be familiar with these terms: - response y (or dependent) and explanatory x (or independent) variables; - slope and intercept in a linear regression equation; - positive and negative correlations. Part II: In-class Activity Suppose you were a Broadway producer. You would want your show to make as much money as possible, and one way of deciding whether or not to invest your time and money into a particular show would be to examine past shows to see how they did. We'll examine a simple question: how does the size of the theater affect box office receipts? To begin, download the data: use http://www.stat.ucla.edu/~dinov/courses_students.dir/stat13_fall01/stat13_fall01/data.dir/broadway Some potentially useful Stata Commands: use filename -> loads a Stata-format file dataset from the Web. If filename is specified without the Stata extension,.dta is assumed. edit -> opens Stata s spreadsheet. label variable varname label -> enables you to attach an extra piece of information (up to 80 characters) about a variable. sort x -> arranges the observations of the current data in ascending order of the values of the variables. There is no limit to the number of the variables in the data. Missing values are interpreted as being larger than any other number and are thus placed last. graph y x -> for a scatterplot of y vs.x. graph y x, xlabel ylabel -> for scatterplots with improved axes (numerical lists). regress y x -> estimates a model from the list of variables using least-squares regression.

quietly regress y x -> suppresses the regression output for the duration of the command. predict newvar-> calculates the predicted values of a variable in a linear regression for each observation. The new values are stored under the name newvar. This command must follow a regress command. predict newvar, residuals -> calculates the residuals from a regression and places them in the variable named newvar. This command must follow a regress command. graph y newvar x, connect (.s) symbol (oi) -> displays a linear regression graph between two variables with fitted values (newvar) connected by a line. corr x y z w-> displays the correlation or covariance matrix for two or more continuous variables, or if they are not specified, for all variables in the data. Observations are excluded from the calculation when values are missing. Does the size of a theater "predict" the average box-office receipts? Your TA might ask the class the following questions, so you should jot down your observations and thoughts for class discussion. 1. Make a scatterplot of the receipts against the capacity graph receipts capacity If you want to reveal the name of the show for any unusual observations, issue this command: graph receipts capacity, symbol([show]) Which show had the highest box office receipts? Which show appeared in a theater with the most seats? 2. Describe the trend: how are receipts and capacity related? Would you say this is a linear relationship? 3. We can quantify the linear relationship with a least squares regression. (This works whether or not the relationship is really linear. If it is not linear, then our least squares regression will be a very poor description -- but we can still compute it.) Note that Stata gives us a lot more information than we are ready for right now. But you'll return to this later in your studies. Type: regress receipt capacity 4. Look in the column headed by "Coef." (Coefficient) to find the estimated intercept and slope. Write the equation of the line here: Interpret the slope. 5. To graph the line on top of the scatterplot, type:

quietly receipts capacity predict preceipts graph receipts preceipts capacity, s(oi) c(.l) The first command quietly regress receipts capacity performs a regression that computes the slope and intercept of the regression line. The next command, predict preceipts, calculates the predicted receipts for each value of capacity. The predicted values all fall on the regression line. The last command, graph receipts preceipts capacity, s(oi) c(.l) does the actual graphing. The command plots receipts vs. capacity and then superimposes a plot of preceipts vs. capacity. The s(oi) sets the symbols for the plot, so that the first plot is done with circles (the o option) and the second plot is done with no symbols (the i for invisible option). The c(.l) option controls how the points are connected; in the first plot the points are not connected (the. option) but in the second plot the points are connected with a line (the l option). 6. What is the interpretation of this regression line? Is capacity a good predictor of average receipts? Explain. 7. An examination of the residuals of a regression can help us discover errors. A residual is the difference between an observed value and the predicted value. In this context, it's the difference between what a show actually made and what the linear regression says it should have made. Put slightly differently, it's the difference between the actual average receipts for a particular show and the average receipts of all shows in theaters with the same capacity. To examine the residuals, type quietly regress receipts capacity predict resids, residuals graph resids capacity, symbol([show]) Name two shows that made more than expected. Name two shows that made less. Which show did the best, in terms of beating expectations? Are we better at predicting for smaller theaters than large? Are there any outliers? 8. Based on the scatterplot of receipts and capacity, what would you guess the correlation between these variables will be? Check your guess by typing corr receipts capacity You can check the correlations between all pairs of variables by typing corr receipts capacity attendnc ratio Before you do, play this guessing game: which pair will have the highest positive correlation? Which pairs, if any, will have a negative correlation?

Part III: Take-home Problem You've probably been told, since the first day you complained about school, that education will help you get a better job. Certainly many jobs require a level of education, but does all that schoolwork pay off? Load this data set into Stata: use http://www.stat.ucla.edu/projects/datasets/twins This is data from a study of twins. You can learn details about the data, including how and why they were collected, at http://www.stat.ucla.edu/projects/datasets/twins-explanation.html. Two variables of interest are hrwageh and hrwagel. These are the hourly wage of twin "1" and "2". (The twins were arbitrarily numbered.) You might want to focus your investigation on the difference in their hourly wage. To create this variable, type gen diffwage = hrwageh - hrwagel Two other interesting variables are educh, the self-reported education level (in years) of the twin who reported earning hrwageh, and educl: the self-reported education level of the twin who reported earning hrwagel. For explanations of the other variables, see http://www.stat.ucla.edu/projects/datasets/twins-explanation.html Are education and income related? Investigate this question with these data. Report on your findings. Your report should include answers to these questions: 1) Do you expect the correlation between the twins' incomes to be positive or negative? High (close to positive or negative 1) or low (close to 0)? Check. 2) Find the correlation matrix for these variables: hrwageh, hrwagel, educl, educh, diffwage, diffeduc. What's the correlation between hrwageh and hrwagel? Interpret. Why does the correlation between hrwagel and diffeduc have a different sign than the correlation between hrwageh and diffeduc? 3) What's the typical difference in hourly wage between twins? Is it what you expected? 4) Describe the distribution of the difference in hourly wage. Are there any unusual features? 5) Make a scatterplot of the difference in income against the education level of either one of the twins. Interpret. Does it matter which twin's education level you chose? 6) Perform a regression of difference in income against a twin's education level.

What does the estimated slope say about the effect of education on income? (You might want to superimpose the regression line on the graph for a clearer picture.) Does your conclusion depend on which twin's education you used to predict the difference in income? 7) Create a new variable, diffeduc, that is the difference in education levels between twins. Perform a regression and use it to answer this question: is there evidence that the twin with more education makes more money? 8) Examine the residuals from this last regression. For what types of twins did the model have the largest error (that is, the greatest difference between the predicted value and the observed value)? Do you see any possible outliers? This Lab was originally created by Prof. R. Gould A and Prof. V. Lew