LAB ASSIGNMENT 4 INFERENCES FOR NUMERICAL DATA. Comparison of Cancer Survival*

In this lab assignment, you will analyze the data from a study to compare survival times of patients of both genders with different primary cancers. First, you will examine the study design and the variables that might affect survival times. Then you will apply graphical, numerical, and inferential tools in StatCrunch to compare the distributions of survival for each type of cancer. You will also compare the survival time for a particular cancer for men and women and examine the relationship between survival time and age for each gender.

Before you start working on the assignment, you should review the course material about designing statistical studies, analysis of variance, comparing two population means, linear regression, and correlation.

Comparison of Cancer Survival*

In Canada, cancer is a major cause of death, second only to heart disease. One-half of cancer deaths in Canada in 2004 were due to four kinds of malignancies: lung, colorectal, female breast, and male prostate. It is widely believed that the average survival times of cancer patients might depend on the organ of origin of the cancer.

To test this hypothesis, researchers conducted a study comparing the survival times of patients who had various types of terminal cancer. In the study, 100 terminal cancer patients were selected at random from the alphabetical index of cancer patients in a hospital. Survival times were measured from the date the cancer was established to be untreatable. The age and gender of each patient were also recorded.

In this assignment, we will use a subset of the data for 64 patients, who had advanced cancers of the breast, bronchus, colon, ovary, or stomach, to determine if patient survival differed with respect to the organ affected by the cancer. The relationship of survival time with age and gender will also be examined.
The data were originally obtained from an online Data and Story Library (DASL). We have modified the data for the special purpose of this assignment. The data are available in the StatCrunch file lab4.txt located on the STAT 151 Laboratories web site at http://www.stat.ualberta.ca/statslabs/index.htm (click Stat 151 link). To download the data for the lab, click on the link Data for lab 4 and follow the instructions. The data are not to be printed in your submission.

The following is the description of the variables in the data file:

Column  Variable Name  Description of Variable
1       SURVIVAL       survival time, in days (since the cancer was established to be untreatable)
2       ORGAN          organ affected by the cancer (breast, bronchus, colon, ovary, or stomach)
3       AGE            age at death (in years)
4       SEX            gender (F = female, M = male)

Answer the following questions using the data.

* NOTE: With respect, this data can be considered sensitive to some individuals, but the intent of this lab is to better understand some aspects of an unfortunate medical issue in society through the use of science; in this case, statistical analysis. Please keep this in mind as you complete the lab assignment.
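For readers who want to inspect the data outside StatCrunch, the four-column layout described above could be read along the following lines. This is a minimal sketch and not part of the assignment: it assumes a tab-delimited file with a header row naming the four columns, and the rows shown are made up for illustration rather than taken from lab4.txt. The function names read_patients and survival_by are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only; these are NOT values from lab4.txt.
# Assumption: the file is tab-delimited with a header row.
SAMPLE = (
    "SURVIVAL\tORGAN\tAGE\tSEX\n"
    "120\tstomach\t61\tM\n"
    "450\tcolon\t58\tF\n"
    "300\tcolon\t70\tM\n"
)


def read_patients(handle):
    """Parse rows into dicts with numeric SURVIVAL (days) and AGE (years)."""
    patients = []
    for row in csv.DictReader(handle, delimiter="\t"):
        patients.append({
            "survival": float(row["SURVIVAL"]),
            "organ": row["ORGAN"].strip().lower(),
            "age": float(row["AGE"]),
            "sex": row["SEX"].strip().upper(),  # "F" or "M"
        })
    return patients


def survival_by(patients, key):
    """Group survival times by a categorical field such as 'organ' or 'sex'."""
    groups = defaultdict(list)
    for p in patients:
        groups[p[key]].append(p["survival"])
    return dict(groups)


patients = read_patients(io.StringIO(SAMPLE))
by_organ = survival_by(patients, "organ")
print(by_organ)
```

With a real file, io.StringIO(SAMPLE) would be replaced by an open file handle; the delimiter may need adjusting if the download is not tab-delimited.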

1. First, you will analyze the study design.

   What is the purpose of the study? What kind of inferences can be made from the study? Can we conclude that any observed differences among survival times for the five cancer groups are due to the cancer type? Can the results of the study be generalized to the populations of all patients with one of the five cancers discussed in the study in all hospitals?

   Apart from cancer type, gender, and age, what other variables might have affected the survival times of the cancer patients? How can you control some of these variables? Provide brief explanations.

   Notice that the survival time is measured from the date the cancer was established to be untreatable. A cancer can be determined to be untreatable when a substantial part of the organ has been attacked by the malignancy, the cancer has begun to spread outside the organ, the level of aggressiveness of the cancer is high, etc. Explain why developing consistent criteria for untreatability across the five types of cancer is important for the outcome of the study.

2. Now you will display the relationship between the type of cancer and survival with boxplots.

   Obtain the side-by-side boxplots of survival (ignore gender) for the five types of cancer. Check the Use fences to identify outliers option. Paste the side-by-side boxplots into your report. Make sure that the plot has properly labelled axes and a title.

   Do the plots indicate any substantial differences in the survival time for the five types of cancer? Answer the question by comparing the centers and spreads of the five distributions. What is the most likely shape of each distribution? Are there any outliers?

   Use the Summary Stats feature in the Stat menu to obtain the summary statistics of survival time for each of the five types of cancer and paste the summaries into your report. Comment on the differences in the means and standard deviations of survival time for the five types of cancer.
   Which cancer type has the shortest median survival time, and which has the longest?

3. In this part, you will compare the survival time of patients diagnosed with colon and stomach cancer.

   Is there evidence of a difference in the average survival time for colon and stomach cancer patients? Using α = 0.05, carry out the appropriate test to answer the question. Do not assume equal variances. Paste the output into your report. State the null and alternative hypotheses. (Do not just copy from the output.) Report the value of the appropriate test statistic, the distribution of the test statistic under the null hypothesis, and the P-value of the test. State your conclusion.

   Obtain a 95% confidence interval for the difference in the average survival time for colon and stomach cancer patients. Does the interval confirm the outcome of the test above? Explain briefly.

4. In this question, you will compare the survival by gender for colon cancer.

   (a) Obtain the side-by-side boxplots of survival by gender for colon cancer. Check the Use fences to identify outliers option. Paste the side-by-side boxplots into your report.

   (b) Do the plots in part (a) indicate any substantial differences in the survival time of colon cancer between men and women? Comment on the centers and spreads of the distributions for each gender. Are there any outliers?

   (c) Is there evidence of a difference in the average survival time of men and women with colon cancer? Using α = 0.05, carry out the appropriate test to answer the question. Do not assume equal variances. Paste the output into your report. State the null and alternative hypotheses. (Do not just copy from the output.) Report the value of the appropriate test statistic, the distribution of the test statistic under the null hypothesis, and the P-value of the test. State your conclusion.

   (d) Obtain the corresponding 95% confidence interval for the difference in average survival time of men and women with colon cancer. Do not assume equal variances. Paste the appropriate output into your report. Does the interval confirm the outcome of the test in part (c)? Explain briefly.

5. Is there a relationship between survival time and age? You will display the relationship between the two variables with scatterplots and quantify the strength of the relationship with correlation.

   Is there a linear relationship between survival time and age for each gender? Obtain a scatterplot of survival time versus age for both genders (one graph). Paste the plot into your report (no need to use a colour printer). You may obtain separate plots for each gender to get a better insight into the relationship for each, but do not paste these plots into your report. Answer the above question.

   Obtain the value of the correlation between survival time and age for each gender. Do the correlations confirm the impression given by the scatterplot above? Does one relationship appear substantially stronger than the other? Explain briefly.

6. In this part, you will display and quantify the relationship between survival time for colon cancer (response variable) and age (explanatory variable) for each gender. In particular:

   Obtain a scatterplot of survival time for colon cancer versus age for both genders (one graph). Paste the plot into your report (no need to use a colour printer). Does simple linear regression appear appropriate for each gender?
   Are there any obvious outliers?

   What is the equation of the least-squares regression line for each gender? What is the meaning of the slope of the regression line for each gender?

   Use the regression line obtained above to estimate the survival time in days for a 70-year-old female diagnosed with colon cancer.

7. Is there any evidence that patients affected by certain types of cancer survived longer, on average, than patients affected by other types of cancer? Answer the question by running the one-way ANOVA test in StatCrunch.

   (a) First, you will check the assumptions necessary for the test. The test requires that the groups are independent of each other, have similar variances, and each follow approximately a normal distribution (the last assumption can be replaced by lack of skewness or outliers). Refer to your analysis in Question 2. Does it look as though the assumption of equal variances may be violated?

   (b) One possible way to reduce right skewness in the data and make the differences in spreads smaller is to apply the logarithm transformation. The logarithm transformation will compress the upper tails of the distributions while stretching out the lower tails. Use the Transform data command in the Data menu to obtain a new variable, ln(survival), defined as the natural logarithm of the SURVIVAL variable. Obtain and paste the descriptive statistics for each group for the log-transformed data into your report. Compare the averages and the standard deviations of the log-transformed values.

   (c) Obtain and paste the side-by-side boxplots of the log-transformed survival time for the five types of cancer. Comment on the shape (symmetric, skewed) of each distribution. Was the log transformation successful in removing most of the skewness present in the data on the original scale of measurement? Are there any outliers in the log-transformed data? Is the normality assumption reasonable for the log-transformed data? Explain.

   (d) Run the one-way ANOVA test to determine if there are any differences, on average, in log survival times between different types of cancer. Paste the ANOVA output into your report. Define the null and alternative hypotheses in terms of the population parameters of interest that correspond to the above question. What is the pooled estimate of the variance? What is the value of the test statistic, the distribution of the test statistic under the null hypothesis, and the P-value of the test? State your conclusions.

   (e) By hand (or equation editor), demonstrate how to obtain the value of the test statistic, given the ANOVA output in part (d).

LAB ASSIGNMENT 4 MARKING SCHEMA

Proper Header: 10 points

Question 1 (12)
   Purpose of the study: 2 points
   Causal inferences: 2 points
   Population inferences: 2 points
   Confounding variables: 2 points
   How to control the confounding variables: 2 points
   Developing consistent criteria for untreatability: 2 points

Question 2 (19)
   Side-by-side boxplots: 4 points
   Center, spread, and shape of each distribution: 6 points (2 points each)
   Summary statistics output: 2 points
   Differences in means and standard deviations: 4 points (2 points each)
   The shortest median survival time: 1 point
   The longest median survival time: 1 point

Question 3 (17)
   Two-sample t-test output: 3 points
   95% confidence interval for the difference: 2 points
   Consistency with the outcome of the test: 2 points

Question 4 (25)
   Side-by-side boxplot of survival time for colon cancer for each gender: 4 points
   Medians and spreads of survival time for each gender: 2 points
   Outliers: 2 points
   Two-sample t-test output: 3 points
   95% confidence interval for the difference: 2 points
   Consistency with the outcome of the test: 2 points

Question 5 (11)
   Scatterplot: 3 points
   Linear relationship: 2 points
   Correlation between survival time and age for each gender: 4 points (2 points each)
   Comparison of strengths for each gender: 2 points

Question 6 (16)
   Scatterplot: 3 points
   Appropriateness for each gender: 2 points
   Equation of the least-squares regression line for each gender: 4 points (2 points each)
   Meaning of the slope for each gender: 4 points (2 points each)
   Estimated survival time for 70-year-old female with colon cancer: 2 points

Question 7 (36)
   Appropriateness of using ratio for equal variance assumption: 2 points
   Descriptive statistics for the log-transformed data: 4 points
   Comparison of means and standard deviations of the log-transformed values: 2 points
   Side-by-side boxplots for the log-transformed data: 4 points
   Comment about skewness: 2 points
   Comment about the normality assumption: 2 points
   ANOVA output: 3 points
   Pooled estimate of the variance: 2 points
   Calculation of test statistic: 4 points

TOTAL = 146
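
The hand calculation requested in Question 7(e) can be illustrated with a small sketch. In a one-way ANOVA the test statistic is F = MS(between) / MS(within), where MS(within) is the pooled estimate of the variance; under the null hypothesis it follows an F distribution with k - 1 and N - k degrees of freedom (k groups, N observations in total). The group sizes, means, and standard deviations below are made-up values chosen for easy arithmetic, not results from the lab data, and the function name anova_from_summaries is hypothetical.

```python
# Hypothetical group summaries as (n, mean, standard deviation).
# These are illustrative numbers only, NOT values from the lab data.
groups = [
    (5, 6.0, 1.0),
    (5, 5.0, 1.0),
    (5, 4.0, 1.0),
]


def anova_from_summaries(groups):
    """Build the one-way ANOVA quantities from per-group n, mean, and sd."""
    k = len(groups)
    n_total = sum(n for n, _, _ in groups)

    # Grand mean is the sample-size-weighted average of the group means.
    grand_mean = sum(n * m for n, m, _ in groups) / n_total

    # Between-group and within-group sums of squares.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    ss_within = sum((n - 1) * s ** 2 for n, _, s in groups)

    df_between = k - 1
    df_within = n_total - k

    ms_between = ss_between / df_between
    ms_within = ss_within / df_within  # pooled estimate of the variance

    return {
        "df_between": df_between,
        "df_within": df_within,
        "ms_between": ms_between,
        "pooled_variance": ms_within,
        "f": ms_between / ms_within,
    }


result = anova_from_summaries(groups)
print(result)
```

Substituting the group summaries of ln(survival) from the StatCrunch output for the illustrative tuples should reproduce the mean squares and F value in the ANOVA table, up to rounding.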