LAB ASSIGNMENT 4
INFERENCES FOR NUMERICAL DATA

In this lab assignment, you will analyze the data from a study to compare survival times of patients of both genders with different primary cancers. First, you will examine the study design and the variables that might affect survival times. Then you will apply graphical, numerical, and inferential tools in StatCrunch to compare the distributions of survival for each type of cancer. You will also compare the survival time for a particular cancer for men and women and examine the relationship between survival time and age for each gender.

Before you start working on the assignment, you should review the course material about designing statistical studies, analysis of variance, comparing two population means, linear regression, and correlation.

Comparison of Cancer Survival*

In Canada, cancer is a major cause of death, second only to heart disease. One-half of cancer deaths in Canada in 2004 were due to four kinds of malignancies: lung, colorectal, female breast, and male prostate. It is widely believed that the average survival times of cancer patients might depend on the organ of origin of the cancer. In order to verify this hypothesis, some researchers conducted a study to compare the survival times of patients who had various types of terminal cancers.

In the study, 100 terminal cancer patients were obtained by random selection from the alphabetical index of cancer patients in a hospital. Survival times were measured from the date the cancer was established to be untreatable. The age and gender of each patient were also recorded.

In this assignment, we will use a subset of the data for 64 patients, who had advanced cancers of the breast, bronchus, colon, ovary, or stomach, to determine if patient survival differed with respect to the organ affected by the cancer. The relationship of survival time with age and gender will also be examined.
The data were originally obtained from an online Data and Story Library (DASL). We have modified the data for the special purpose of this assignment. The data are available in the StatCrunch file lab4.txt located on the STAT 151 Laboratories web site at http://www.stat.ualberta.ca/statslabs/index.htm (click Stat 151 link). In order to download the data for the lab, click on the link Data for lab 4 and follow the instructions. The data are not to be printed in your submission.

The following is the description of the variables in the data file:

Column   Variable Name   Description of Variable
1        SURVIVAL        survival time, in days (since the cancer was established to be untreatable)
2        ORGAN           organ affected by the cancer (breast, bronchus, colon, ovary, or stomach)
3        AGE             age at death (in years)
4        SEX             gender (F = female or M = male)

Answer the following questions using the data.

* NOTE: With respect, this data can be considered sensitive to some individuals, but the intent of this lab is to better understand some aspects of an unfortunate medical issue in society through the use of science; in this case, statistical analysis. Please keep this in mind as you complete the lab assignment.
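The lab itself is carried out in StatCrunch, but the four-column layout described above can also be summarized by hand or in a script as a cross-check. The following is a minimal Python sketch under stated assumptions: the six rows are made-up placeholders (they are not values from lab4.txt), and the function name summarize_by_organ is hypothetical.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Made-up placeholder rows in the documented column order:
# (SURVIVAL in days, ORGAN, AGE in years, SEX).
# These are NOT values from lab4.txt.
ROWS = [
    (100, "colon", 60, "F"),
    (200, "colon", 70, "M"),
    (300, "colon", 65, "F"),
    (50, "stomach", 55, "M"),
    (150, "stomach", 72, "F"),
    (250, "stomach", 68, "M"),
]


def summarize_by_organ(rows):
    """Return {organ: {"n", "mean", "sd", "median"}} for the SURVIVAL column."""
    groups = defaultdict(list)
    for survival, organ, _age, _sex in rows:
        groups[organ].append(survival)
    return {
        organ: {
            "n": len(values),
            "mean": mean(values),
            "sd": stdev(values),  # sample standard deviation (n - 1 divisor)
            "median": median(values),
        }
        for organ, values in groups.items()
    }


summary = summarize_by_organ(ROWS)
for organ, stats in summary.items():
    print(organ, stats)
```

Grouping on ORGAN first and then summarizing SURVIVAL mirrors how the questions below compare the five cancer types; grouping on SEX instead would follow the same pattern.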
1. First, you will analyze the study design.

What is the purpose of the study? What kind of inferences can be made from the study? Can we conclude that any observed differences among survival times for the five cancer groups are due to the cancer type? Can the results of the study be generalized to the populations of all patients with one of the five cancers discussed in the study in all hospitals?

Apart from type of cancer, gender, and age, what other variables might have affected the survival times of the cancer patients? How can you control some of these variables? Provide brief explanations.

Notice that the survival time is measured from the date the cancer was established to be untreatable. The cancer can be determined to be untreatable when a substantial part of the organ has been attacked by the malignancy, the cancer has begun to spread outside the organ, the level of aggressiveness of the cancer is high, etc. Explain why developing consistent criteria for untreatability across the five types of cancer is important for the outcome of the study.

2. Now you will display the relationship between the type of cancer and survival with boxplots.

(a) Obtain the side-by-side boxplots of survival (ignore gender) for the five types of cancer. Check the Use fences to identify outliers option. Paste the side-by-side boxplots into your report. Make sure that the boxplot has properly labelled axes and a title. Do the plots indicate any substantial differences in the survival time for the five types of cancer? Answer the question by comparing the centers and spreads of the five distributions. What is the most likely shape of each distribution? Are there any outliers?

(b) Use the Summary Stats feature in the Stat menu to obtain the summary statistics of survival time for each of the five types of cancer and paste the summaries into your report. Comment on the differences in the means and standard deviations of survival time for the five types of cancer.
(c) Which cancer type has the shortest median survival time and which the longest?

3. In this part, you will compare the survival time of patients diagnosed with colon and stomach cancer.

(a) Is there evidence of a difference in the average survival time for colon and stomach cancer patients? Using α = 0.05, carry out the appropriate test to answer the question. Do not assume equal variances. Paste the output into your report. State the null and alternative hypotheses. (Do not just copy from the output.) Report the value of the appropriate test statistic, the distribution of the test statistic under the null hypothesis, and the P-value of the test. State your conclusion.

(b) Obtain a 95% confidence interval for the difference in the average survival time for colon and stomach cancer patients. Does the interval confirm the outcome of the test in part (a)? Explain briefly.

4. In this question, you will compare the survival by gender for colon cancer.

(a) Obtain the side-by-side boxplots of survival by gender for colon cancer. Check the Use fences to identify outliers option. Paste the side-by-side boxplot into your report.

(b) Do the plots in part (a) indicate any substantial differences in the survival time of colon cancer between men and women? Comment on the centers and spreads of the distributions for each gender. Are there any outliers?
(c) Is there evidence of a difference in the average survival time of men and women with colon cancer? Using α = 0.05, carry out the appropriate test to answer the question. Do not assume equal variances. Paste the output into your report. State the null and alternative hypotheses. (Do not just copy from the output.) Report the value of the appropriate test statistic, the distribution of the test statistic under the null hypothesis, and the P-value of the test. State your conclusion.

(d) Obtain the corresponding 95% confidence interval for the difference in average survival time of men and women with colon cancer. Do not assume equal variances. Paste the appropriate output into your report. Does the interval confirm the outcome of the test in part (c)? Explain briefly.

5. Is there a relationship between survival time and age? You will display the relationship between the two variables with scatterplots and quantify the strength of the relationship with correlation.

(a) Is there a linear relationship between survival time and age for each gender? Obtain a scatterplot of survival time versus age for both genders (one graph). Paste the plot into your report (no need to use a colour printer). You may obtain separate plots for each gender to get a better insight into the relationship for each, but do not paste these plots into your report. Answer the above question.

(b) Obtain the value of the correlation between survival time and age for each gender. Do the values confirm the scatterplot in part (a)? Does one relationship appear substantially stronger than the other? Explain briefly.

6. In this part, you will display and quantify the relationship between survival time for colon cancer (response variable) and age (explanatory variable) for each gender. In particular:

(a) Obtain a scatterplot of survival time for colon cancer versus age for both genders (one graph). Paste the plot into your report (no need to use a colour printer). Does simple linear regression appear appropriate for each gender?
Are there any obvious outliers?

(b) What is the equation of the least-squares regression line for each gender? What is the meaning of the slope of the regression line for each gender?

(c) Use the regression line in part (b) to estimate the survival time in days for a 70-year-old female diagnosed with colon cancer.

7. Is there any evidence that patients affected by certain types of cancer survived longer than patients affected by other types of cancer, on average? Answer the question by running the one-way ANOVA test in StatCrunch.

(a) First, you will check the assumptions necessary for the test. The test requires that the groups are independent of each other, have similar variances, and each follow approximately a normal distribution (the last assumption can be replaced by lack of skewness or outliers). Refer to your analysis in part (b) of Question 2. Does it appear that the assumption of equal variances may be violated?

(b) One possible way to reduce right skewness in the data and make the differences in spreads smaller is to apply the logarithm transformation. The logarithm transformation will compress the upper tails of the distributions while stretching out the lower tails. Use the Transform data command in the Data menu to obtain a new variable, ln(survival), defined as the natural logarithm of the SURVIVAL variable. Obtain and paste the descriptive statistics for each group for the log-transformed data into your report. Compare the averages and the standard deviations of the log-transformed values.
(c) Obtain and paste the side-by-side boxplots of the log-transformed survival time for the five types of cancer. Comment on the shape (symmetric, skewed) of each distribution. Was the log transformation successful in removing most of the skewness present in the data on the original scale of measurement? Are there any outliers in the log-transformed data? Is the normality assumption reasonable for the log-transformed data? Explain.

(d) Run the one-way ANOVA test to determine if there are any differences in log survival times between different types of cancer, on average. Paste the ANOVA output into your report. Define the null and alternative hypotheses in terms of the population parameters of interest that correspond to the above question. What is the pooled estimate of the variance? What is the value of the test statistic, the distribution of the test statistic under the null hypothesis, and the P-value of the test? State your conclusions.

(e) By hand (or equation editor), demonstrate how to obtain the value of the test statistic, given the ANOVA output in part (d).

LAB ASSIGNMENT 4 MARKING SCHEMA

Proper Header: 10 points

Question 1 (12)
Purpose of the study: 2 points
Causal inferences: 2 points
Population inferences: 2 points
Confounding variables: 2 points
How to control the confounding variables: 2 points
Developing consistent criteria for untreatability: 2 points

Question 2 (19)
Side-by-side boxplots: 4 points
Center, spread, and shape of each distribution: 6 points (2 points each)
Summary statistics output: 2 points
Differences in means and standard deviations: 4 points (2 points each)
The shortest median survival time: 1 point
The longest median survival time: 1 point

Question 3 (17)
Two-sample t-test output: 3 points
95% confidence interval for the difference: 2 points
Consistency with the outcome of the test in part (a): 2 points
Question 4 (25)
Side-by-side boxplot of survival time for colon cancer for each gender: 4 points
Medians and spreads of survival time for each gender: 2 points
Outliers: 2 points
Two-sample t-test output: 3 points
95% confidence interval for the difference: 2 points
Consistency with the outcome of the test in part (c): 2 points

Question 5 (11)
Scatterplot: 3 points
Linear relationship: 2 points
Correlation between survival time and age for each gender: 4 points (2 points each)
Comparison of strengths for each gender: 2 points

Question 6 (16)
Scatterplot: 3 points
Appropriateness for each gender: 2 points
Equation of the least-squares regression line for each gender: 4 points (2 points each)
Meaning of the slope for each gender: 4 points (2 points each)
Estimated survival time for 70-year-old female with colon cancer: 2 points

Question 7 (36)
Appropriateness of using ratio for equal variance assumption: 2 points
Descriptive statistics for the log-transformed data: 4 points
Comparison of means and standard deviations of the log-transformed values: 2 points
Side-by-side boxplots for the log-transformed data: 4 points
Comment about skewness: 2 points
Comment about the normality assumption: 2 points
ANOVA output: 3 points
Pooled estimate of the variance: 2 points
Calculation of test statistic: 4 points

TOTAL = 146
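Supplementary sketch: the two test statistics used in Questions 3, 4, and 7 can be reproduced from group means, variances, and sample sizes. The Python sketch below is not StatCrunch output and is not a required part of the report. It assumes the unequal-variance test is the standard Welch two-sample t-test with Welch-Satterthwaite degrees of freedom, and that the ANOVA test statistic is the usual ratio of the between-group mean square to the within-group mean square (the latter being the pooled estimate of the variance). The function names welch_t and one_way_anova_f and all data values are hypothetical; P-values are left to StatCrunch.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Two-sample t statistic and approximate df, not assuming equal variances."""
    a = variance(x) / len(x)  # s1^2 / n1
    b = variance(y) / len(y)  # s2^2 / n2
    t = (mean(x) - mean(y)) / sqrt(a + b)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (a + b) ** 2 / (a ** 2 / (len(x) - 1) + b ** 2 / (len(y) - 1))
    return t, df


def one_way_anova_f(groups):
    """Return (F, pooled variance estimate, (df between, df within))."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((len(g) - 1) * variance(g) for g in groups)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)  # pooled estimate of the variance
    return ms_between / ms_within, ms_within, (k - 1, n_total - k)


# Made-up illustrative values, NOT taken from lab4.txt.
t_stat, t_df = welch_t([1, 2, 3], [2, 4, 6])
f_stat, pooled_var, f_df = one_way_anova_f([[1, 2, 3], [2, 3, 4], [6, 7, 8]])
print(t_stat, t_df)
print(f_stat, pooled_var, f_df)
```

For Question 7 the same F calculation would be applied to the ln(survival) values rather than to the raw survival times, since the ANOVA there is run on the log-transformed data.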