Lab 4 (M13)

Objective: This lab will give you more practice exploring the shape of data, and in particular in breaking the data into two groups.


Activity 1: Examining Data From Class

Background

Download http://www.stat.ucla.edu/~rgould/datasets/m12s00.dta

This is a Stata object, and so it should load automatically. This data set includes data collected from Prof. Gould's Stats m12 course in Spring 2000. Students were asked seven questions:

1) Gender (m/f)
2) Height (inches)
3) Weight (pounds)
4) Do you smoke? (yes == 1, no == 0)
5) Who do you want for President? Bush, Gore, other.
6) Rate your math ability: (1,2,3,4,5) 1 is much below average, 3 is average, 5 is much above average
7) Rate your math anxiety; 1 is much below average, 3 is average

This provides a nice data set for you to experiment on and learn some Stata techniques. It also provokes (maybe) some questions: Does Gore/Bush have stronger support among men than women? What is the relation between height and weight, and is this relation different for men than for women? Do people who smoke weigh less than those who do not? Are smokers less anxious about math?

Of course the experimental design (which was very haphazard) does not really lend itself to answering these questions with much confidence. But these questions should help motivate you to look at this data to learn how to use Stata to answer such questions when you have a better data set. (What would a better data set be? How would you collect one?)

Commands

    graph x, histogram bin(n)
    sort
    by(varname)
    regress y x
    quietly regress y x
    predict varname
    tabulate x y

    corr x y

Histograms

How does weight for men compare with weight for women? Yes, we all know men tend to weigh more, on average, than women. But what about the distribution of weights? Let's see how at least this class compares.

1. Before looking, what do you think a histogram of weight (men and women combined) would look like?

2. Make a histogram of weight. How wide are the bins? Note that it has 5 bins. Stata always defaults to 5 bins. To change to, say, 10 bins, type

    graph weight, histogram bin(10)

3. Change the histogram for weight so that it has 15 bins. How wide are the bins now?

4. Change the histogram so that it has 2 bins. Notice that there is a trade-off. Fewer bins means less detail. But if you increase the number of bins, you might get too much detail. Change the histogram so that it has 50 bins. Note that it now looks very craggy, and it is hard to see the general shape.

5. We can compare histograms of weights based on gender through the following commands:

    sort gender
    graph weight, by(gender) total

The first command orders the observations so that the m's and f's are together. The "by(gender)" option tells Stata to make a separate histogram for each value of the variable gender. By including the word "total" we also get a histogram with men and women combined. Note: you could also have typed

    sort gender
    graph weight, histogram by(gender) total

If there is only one variable given after the graph command, Stata draws a histogram. Make separate histograms for men and women. What do you observe?

6. Notice that Stata only labels the minimum and maximum values. To fix this, type:

    graph weight, histogram by(gender) bin(10) total xlabel ylabel

7. Approximately what percent of men weigh less than 150 in this class? What percent of women? What percent of everyone in the class?
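The bin-count trade-off in steps 2-4 can also be explored by hand. The following is a rough sketch in Python (not Stata), assuming equal-width bins that span the minimum to the maximum value. The weights are made-up numbers, not the class data, and the helper names `bin_width` and `bin_counts` are hypothetical.

```python
# Hypothetical weights in pounds; NOT the class data from m12s00.dta.
weights = [105, 118, 122, 130, 135, 141, 150, 155, 162, 170, 185, 210]


def bin_width(values, bins):
    """Width of each bin when the range is split into equal-width bins."""
    return (max(values) - min(values)) / bins


def bin_counts(values, bins):
    """Count how many values fall in each equal-width bin."""
    lo = min(values)
    width = bin_width(values, bins)
    counts = [0] * bins
    for v in values:
        # The maximum value belongs in the last bin, not one past it.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts


# Compare a coarse, a default-like, and a fine choice of bins.
for n in (2, 5, 15):
    print(n, "bins, width", bin_width(weights, n), bin_counts(weights, n))
```

With only 2 bins almost all detail disappears, while with 15 bins most bins hold a single value or none, which is the "craggy" look described in step 4.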

Boxplots

Boxplots are another method to compare two distributions. They are somewhat cruder than histograms, but are often easier to read.

1. Make a boxplot to compare heights:

    graph height, box by(gender) ylabel

Note: You might have to type sort gender before this command.

2. How tall is a woman if 50% of the women in the class are taller than she?

3. What is the median height of the men in the class?

4. What is the height of the tallest woman? About what percentage of the men in the class are taller than this?

Tables

Are men more likely than women to vote for Bush? Are women more likely to vote for Gore than Bush? We can see how at least this class might vote. Note that the variables gender and president are categorical. So trying to make a histogram is futile. You can try it, but Stata won't reward you much for your efforts.

1. Instead, we'll make a table. Type

    tabulate president gender, cell

In the first column, you'll see the number of women voting for (from bottom to top) Other, Gore, and Bush. The first row has a dot (.), which means that these are people who had no response. The dot is Stata's symbol for a missing value. In each cell, below the number, there is a percentage. This is the number of people in that cell, divided by the total number of people. So 5 females prefer Bush, and there are 69 people, and so these 5 represent 5/69 * 100% = 7.25% of the sample.

2. What percent of the class are men?

3. What percent of the class prefer Bush for President?

4. What percent of women prefer Bush? To answer this, type

    tabulate president gender, column

Now the cell counts are given as before, but the percentages are now given separately for each column. These are called column percentages. So there are still 5 females for Bush, but now this is out of the 43 women in the class. So 5/43 represents 11.63%.

5. Does this table suggest that women in this class are more likely than the men to vote for Gore? Explain.

6. If you want row percentages, just type "row" where we typed "column". You can include both column and row to get both, or just type

    tabulate president gender, column row cell

to get cell, column, and row percentages.

Scatterplots/Regression

How are heights and weights related? Can the relationship be summarized as linear?

1. Make a scatterplot of the heights and weights with height on the x-axis and weight on the y-axis:

    graph weight height

Print the graph.

2. Describe the trend: how are height and weight related? Would you say this is (roughly) a linear relationship?

3. We can quantify the linear relationship with a least squares regression. (This works whether or not the relationship is really linear. If it is not linear, then our least squares regression will be a very poor description, but we can still compute it.) Note that Stata gives us a lot more information than we are ready for right now. But you'll return to this later in your studies. Type

    regress weight height

Note: the first variable is the response (or dependent) variable, the second is the predictor or explanatory variable. The format is regress y x.

4. Look in the column headed by "Coef." (Coefficient) to find the least squares intercept and slope. Write the equation of the line here:

5. To graph the line on top of the scatterplot, type:

    quietly regress weight height <RETURN>
    predict pweight <RETURN>
    graph weight pweight height, s(oi) c(.l)

Here is how the commands work. The first command, quietly regress weight height, performs a regression that computes the slope and intercept of the regression line. The next command, predict pweight, calculates the predicted values for weight. The predicted values all fall on the regression line. The last command, graph weight pweight height, s(oi) c(.l), does the actual graphing. It plots weight and pweight versus height. The s(oi) option sets the symbols for the plot: weight versus height is drawn with circles (the o option) and pweight versus height uses no symbol (the i option, for invisible). The c(.l) option controls how the points are connected: weight versus height is not connected (the . option) and pweight versus height is connected with a line (the l option).

6. Print the graph. What is the interpretation of this regression line? Is height a good predictor of weight? Explain.

7. Calculate the correlation between height and weight. There are two ways to do this. If you have a calculator, you can take the square root of the number that appears in the regression output beside the words "R-squared = ". Or, type

    corr height weight

Interpret this number.

8. Now fit a linear regression line with height as the response variable and weight as the explanatory variable.

a. Write the equation for the regression line here.

b. Is this equation different from the equation you obtained with weight as the response variable and height as the explanatory variable? Explain why or why not.

c. Write the R-squared value here.

d. Is the R-squared value the same as or different from the R-squared value you obtained when weight was the response variable and height was the explanatory variable? Explain why or why not.

Before you leave, turn in the following:

1) Your answers to 1-3, 5, 7 under Histograms
2) Your answers to 2-4 under Boxplots (you might want to include your boxplot here.)
3) Your answers to 2, 3, 5 under Tables
4) Your answers to 2, 4, 6-8 under Scatterplots/Regression
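To check your Stata output for steps 3-8 under Scatterplots/Regression, the least squares formulas can be worked through directly. The following is a minimal sketch in Python (not Stata) on made-up height and weight pairs, not the class data. The function name `least_squares` is hypothetical. It fits the line both ways, weight on height and height on weight, and prints each equation and R-squared so they can be compared.

```python
import math

# Hypothetical (height in inches, weight in pounds) pairs; NOT the class data.
heights = [62, 64, 65, 67, 68, 70, 71, 73]
weights = [115, 128, 130, 145, 150, 165, 172, 190]


def least_squares(x, y):
    """Return (intercept, slope, r) for the least squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # correlation coefficient
    return intercept, slope, r


# Response listed second here; this mirrors "regress weight height".
a_wh, b_wh, r_wh = least_squares(heights, weights)
# Roles swapped; this mirrors "regress height weight".
a_hw, b_hw, r_hw = least_squares(weights, heights)

print(f"weight = {a_wh:.2f} + {b_wh:.3f} * height, R-squared = {r_wh ** 2:.4f}")
print(f"height = {a_hw:.2f} + {b_hw:.3f} * weight, R-squared = {r_hw ** 2:.4f}")
```

One caution for step 7: the square root of R-squared gives only the size of the correlation. Its sign is the sign of the slope, which is why corr reports a signed value.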

Activity 2: Old Faithful Revisited

Remember the Old Faithful data from Lab 2? Re-load it into Stata. First, you must type

    clear

and then you can load from http://www.stat.ucla.edu/~rgould/datasets/oldfaith.dta

As you may recall, our goal is to give a recently arrived busload of tourists as accurate an estimate of when the geyser will next erupt as we can. A problem with this is that there is quite a bit of spread in the times between eruptions, so it is difficult to predict with much precision. However, there is a theory that says that the time between eruptions is related to the length of the previous eruption. If the previous eruption was very long, then it might take longer to replenish the supply of hot water, to put it as non-technically as possible. Is there evidence of this?

Assignment

Write a report to the Rangers Station at Old Faithful. The Rangers want to predict the time until the next eruption as accurately as possible. Your report should contain:

a) A description of the relationship between the length of an eruption and the time until the next.
b) A means for predicting the time until the next eruption if you know the length of the current eruption.
c) An evaluation of how good or bad this prediction is.

What to turn in: Your report to the Rangers Station.
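One possible way to approach parts b) and c) is to fit a least squares line and then look at how far the observed waits fall from the predicted ones. The sketch below is in Python (not Stata) and uses invented eruption lengths and waiting times, not the values in oldfaith.dta. The names `fit_line`, `predict`, and `typical_error` are hypothetical, and the root mean squared residual is only one of several reasonable measures of prediction quality.

```python
import math

# Hypothetical (eruption length in minutes, wait until next eruption in minutes)
# pairs; NOT the values in oldfaith.dta.
lengths = [1.8, 2.0, 2.3, 3.5, 4.0, 4.3, 4.5, 4.8]
waits = [54, 56, 60, 74, 80, 82, 85, 89]


def fit_line(x, y):
    """Return (intercept, slope) of the least squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict(intercept, slope, x_new):
    """Predicted wait for an eruption of length x_new."""
    return intercept + slope * x_new


def typical_error(x, y, intercept, slope):
    """Root mean squared residual: a rough 'typical miss' in minutes."""
    residuals = [yi - predict(intercept, slope, xi) for xi, yi in zip(x, y)]
    return math.sqrt(sum(e ** 2 for e in residuals) / len(residuals))


intercept, slope = fit_line(lengths, waits)
print("predicted wait after a 4-minute eruption:", predict(intercept, slope, 4.0))
print("typical miss using the line:", typical_error(lengths, waits, intercept, slope))

# Baseline for comparison: always guessing the mean wait, ignoring eruption length.
mean_wait = sum(waits) / len(waits)
baseline = math.sqrt(sum((w - mean_wait) ** 2 for w in waits) / len(waits))
print("typical miss using the mean only:", baseline)
```

Comparing the two "typical miss" figures gives the Rangers a concrete sense of how much knowing the length of the last eruption improves the prediction.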