Lab 4 (M13)

Objective: This lab will give you more practice exploring the shape of data, and in particular in breaking the data into two groups.

Activity 1: Examining Data From Class

Background

Download http://www.stat.ucla.edu/~rgould/datasets/m12s00.dta

This is a Stata data file, and so it should load automatically. This data set includes data collected from Prof. Gould's Stats m12 course in Spring 2000. Students were asked seven questions:

1) Gender (m/f)
2) Height (inches)
3) Weight (pounds)
4) Do you smoke? (yes == 1, no == 0)
5) Who do you want for President? Bush, Gore, other.
6) Rate your math ability: (1,2,3,4,5) 1 is much below average, 3 is average, 5 is much above average
7) Rate your math anxiety: 1 is much below average, 3 is average

This provides a nice data set for you to experiment on and learn some Stata techniques. It also provokes (maybe) some questions: Does Gore/Bush have stronger support among men than among women? What is the relation between height and weight, and is this relation different for men than for women? Do people who smoke weigh less than those who do not? Are smokers less anxious about math?

Of course the experimental design (which was very haphazard) does not really lend itself to answering these questions with much confidence. But these questions should help motivate you to look at this data to learn how to use Stata to answer such questions when you have a better data set. (What would a better data set be? How would you collect one?)

Commands

    graph x, histogram bin(n)
    sort
    by(varname)
    regress y x
    quietly regress y x
    predict varname
    tabulate x y
    corr x y

Histograms

How does weight for men compare with weight for women? Yes, we all know men tend to weigh more, on average, than women. But what about the distribution of weights? Let's see how at least this class compares.

1. Before looking, what do you think a histogram of weight (men and women combined) would look like?

2. Make a histogram of weight. How wide are the bins? Note that it has 5 bins. Stata always defaults to 5 bins. To change to, say, 10 bins, type
    graph weight, histogram bin(10)

3. Change the histogram for weight so that it has 15 bins. How wide are the bins now?

4. Change the histogram so that it has 2 bins. Notice that there is a trade-off. Fewer bins means less detail. But if you increase the number of bins, you might get too much detail. Change the histogram so that it has 50 bins. Note that it now looks very craggy, and it is hard to see the general shape.

5. We can compare histograms of weights based on gender through the following commands:
    sort gender
    graph weight, by(gender) total
The first command orders the observations so that the m's and f's are together. The "by(gender)" option tells Stata to make a separate histogram for each value of the variable gender. By including the word "total" we also get a histogram with men and women combined.
Note: you could also have typed
    sort gender
    graph weight, histogram by(gender) total
If there is only one variable given after the graph command, Stata draws a histogram.
Make separate histograms for men and women. What do you observe?

6. Notice that Stata only labels the minimum and maximum values. To fix this, type:
    graph weight, histogram by(gender) bin(10) total xlabel ylabel

7. Approximately what percent of men weigh less than 150 pounds in this class? What percent of women? What percent of everyone in the class?
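
If you want to check your answers about bin widths in steps 2 and 3, here is a minimal sketch. It assumes that Stata spreads equal-width bins over the range from the smallest to the largest weight, so the width is roughly (max - min) / (number of bins). It also uses summarize, a command that is not on this lab's list, to look up the minimum and maximum.

```stata
* Sketch only: approximate bin width = (max - min) / number of bins.
* Assumption: bins are equal-width and span the observed range of weight.
* Skip the next line if the class data set is already loaded.
use "http://www.stat.ucla.edu/~rgould/datasets/m12s00.dta", clear

* summarize leaves the smallest and largest values in r(min) and r(max)
quietly summarize weight
display "range of weight (pounds): " r(max) - r(min)
display "approx. width, 5 bins:    " (r(max) - r(min)) / 5
display "approx. width, 15 bins:   " (r(max) - r(min)) / 15
```

Compare these numbers with the bin edges you read off your own histograms; if they disagree noticeably, the equal-width assumption above is the first thing to question.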

Boxplots

Boxplots are another method to compare two distributions. They are somewhat cruder than histograms, but are often easier to read.

1. Make a boxplot to compare heights:
    graph height, box by(gender) ylabel
Note: You might have to type
    sort gender
before this command.

2. How tall is a woman if 50% of the women in the class are taller than she?

3. What is the median height of the men in the class?

4. What is the height of the tallest woman? About what percentage of the men in the class are taller than this?

Tables

Are men more likely than women to vote for Bush? Are women more likely to vote for Gore than Bush? We can see how at least this class might vote. Note that the variables gender and president are categorical. So trying to make a histogram is futile. You can try it, but Stata won't reward you much for your efforts.

1. Instead, we'll make a table. Type
    tabulate president gender, cell
In the first column, you'll see the number of women voting for (from bottom to top) Other, Gore, and Bush. The first row has a dot (.), which means that these are people who had no response. The dot is Stata's symbol for a missing value. In each cell, below the number, there is a percentage. This is the number of people in that cell, divided by the total number of people. So 5 females prefer Bush, and there are 69 people, and so these 5 represent 5/69 * 100% = 7.25% of the sample.

2. What percent of the class are men?

3. What percent of the class prefer Bush for President?

4. What percent of women prefer Bush? To answer this, type
    tabulate president gender, column
Now the cell counts are given as before, but the percentages are now given separately for each column. These are called column percentages. So there are still 5 females for Bush, but now this is out of the 43 women in the class. So 5/43 represents 11.63%.

5. Does this table suggest that women in this class are more likely than the men to vote for Gore? Explain.

6. If you want row percentages, just type "row" where we typed "column". You can include both column and row to get both, or just type
    tabulate president gender, column row cell
to get cell, column, and row percentages.

Scatterplots/Regression

How are heights and weights related? Can the relationship be summarized as linear?

1. Make a scatterplot of the heights and weights with height on the x-axis and weight on the y-axis (the x-axis variable goes last):
    graph weight height
Print the graph.

2. Describe the trend: how are height and weight related? Would you say this is (roughly) a linear relationship?

3. We can quantify the linear relationship with a least squares regression. (This works whether or not the relationship is really linear. If it is not linear, then our least squares regression will be a very poor description, but we can still compute it.) Note that Stata gives us a lot more information than we are ready for right now. But you'll return to this later in your studies. Type
    regress weight height
Note: the first variable is the response (or dependent) variable, the second is the predictor or explanatory variable. The format is regress y x.

4. Look in the column headed by "Coef." (Coefficient) to find the least squares intercept and slope. Write the equation of the line here:

5. To graph the line on top of the scatterplot, type
    quietly regress weight height <RETURN>
    predict pweight <RETURN>
    graph weight pweight height, s(oi) c(.l)
Here is how the commands work. The first command, quietly regress weight height, performs a regression that computes the slope and intercept of the regression line. The next command, predict pweight, calculates the predicted values for weight. The predicted values all fall on the regression line. The last command, graph weight pweight height, s(oi) c(.l), does the actual graphing. It plots weight and pweight versus height. The s(oi) option sets the symbols for the plot: weight versus height is drawn with circles (the o option) and pweight versus height uses no symbol (the i option, for invisible). The c(.l) option controls how the points are connected: weight versus height is not connected (the . option) and pweight versus height is connected with a line (the l option).

6. Print the graph. What is the interpretation of this regression line? Is height a good predictor of weight? Explain.

7. Calculate the correlation between height and weight. There are two ways to do this. If you have a calculator, you can take the square root of the number that appears in the regression output beside the words "R-squared = ". Or, type
    corr height weight
Interpret this number.

8. Now fit a linear regression line with height as the response variable and weight as the explanatory variable.
   a. Write the equation for the regression line here.
   b. Is this equation different from the equation you obtained with weight as the response variable and height as the explanatory variable? Explain why or why not.
   c. Write the R-squared value here.
   d. Is the R-squared value the same as or different from the R-squared value you obtained when weight was the response variable and height was the explanatory variable? Explain why or why not.

Before you leave, turn in the following:

1) Your answers to 1-3, 5, 7 under Histograms
2) Your answers to 2-4 under Boxplots (you might want to include your boxplot here.)
3) Your answers to 2, 3, 5 under Tables
4) Your answers to 2, 4, 6-8 under Scatterplots/Regression
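
For questions 7 and 8 under Scatterplots/Regression, you can let Stata do the calculator work. The lines below are a sketch rather than a required part of the lab. They assume the class data set is loaded with variables named height and weight, and they rely on regress leaving its R-squared in e(r2) and its slope in _b[height]. Keep in mind that a square root is never negative, so it gives only the size of the correlation; the sign has to come from the slope.

```stata
* Sketch only: assumes the class data (m12s00.dta) is already loaded
* and contains the variables height and weight.

* weight as the response, height as the explanatory variable
quietly regress weight height
display "R-squared, weight on height: " e(r2)
display "square root of R-squared:    " sqrt(e(r2))
display "slope (its sign is the sign of the correlation): " _b[height]

* height as the response, weight as the explanatory variable
quietly regress height weight
display "R-squared, height on weight: " e(r2)

* the correlation itself, for comparison with the square root above
corr height weight
```

Put the displayed values next to each other when you write up 7 and 8d.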

Activity 2: Old Faithful Revisited

Remember the Old Faithful data from Lab 2? Re-load it into Stata. First, you must type
    clear
and then you can load from
    http://www.stat.ucla.edu/~rgould/datasets/oldfaith.dta

As you may recall, our goal is to give a recently arrived busload of tourists as accurate an estimate as we can of when the geyser will next erupt. A problem with this is that the times between eruptions are quite spread out, so it is difficult to predict with much precision. However, there is a theory that says that the time between eruptions is related to the length of the previous eruption. If the previous eruption was very long, then it might take longer to replenish the supply of hot water, to put it as non-technically as possible. Is there evidence of this?

Assignment

Write a report to the Rangers Station at Old Faithful. The Rangers want to predict the time until the next eruption as accurately as possible. Your report should contain:

a) A description of the relationship between the length of an eruption and the time until the next.
b) A means for predicting the time until the next eruption if you know the length of the current eruption.
c) An evaluation of how good or bad this prediction is.

What to turn in: Your report to the Rangers Station.
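
One possible starting point for parts b) and c) is a least squares line, as in Activity 1. The sketch below is only an illustration. The variable names duration and interval are hypothetical stand-ins, so run describe first and substitute the names that actually appear in oldfaith.dta. The eruption length of 4 is likewise an arbitrary example value, in whatever units the data set uses.

```stata
* Sketch only. duration and interval are HYPOTHETICAL variable names;
* check the real names with describe and substitute them below.
use "http://www.stat.ucla.edu/~rgould/datasets/oldfaith.dta", clear
describe

* time until the next eruption (response) on length of this eruption
regress interval duration

* predicted time until the next eruption after an eruption of length 4
* (4 is an arbitrary example value)
display "predicted time until next eruption: " _b[_cons] + _b[duration]*4

* two rough measures of how well the line predicts
display "R-squared: " e(r2)
display "root mean squared error (typical size of a miss): " e(rmse)
```

A scatterplot with the fitted line drawn on top, made as in step 5 of Scatterplots/Regression, is a natural companion for part a).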