Kidane Tesfu Habtemariam, MASTAT, Principle of Stat Data Analysis Project work


1. INTRODUCTION

A food label states the number of calories contained in the food package, that is, the amount of energy in the food. People pay attention to calories because eating more calories than the body uses can lead to weight gain. This project paper presents a statistical analysis of a research problem concerning the accuracy of caloric labeling of diet and health foods, taken from the work published by David B. Allison (September 1993) in the Journal of the American Medical Association (JAMA).

Research setting: Foods were sampled from retail merchants throughout the borough of Manhattan, New York, NY. The researchers sampled 40 different food items across three categories: regionally distributed, nationally advertised and locally prepared. They measured the caloric content of each food item by bomb calorimetry and converted the readings into an estimate of total metabolizable energy. In addition, they calculated the percentage difference between the measured calories and the labeled calories, both per item and per gram.

1.1 Objectives

- Determine the accuracy of caloric labeling of diet and health foods.
- Assess whether the accuracy differs between categories of food suppliers.
- Evaluate whether there is evidence of overall underreporting/overreporting of calories per gram on food labels.
- Assess whether the degree of underreporting/overreporting of calories per gram differs between regional and national foods.
- Evaluate whether the variability of underreporting/overreporting of calories per gram differs between regional and national foods.
- Analyze whether the degree of underreporting/overreporting of calories per item differs across food suppliers.
- Examine whether there is any relationship between the relative frequency of underreporting of calories per gram and the type of food supplier.
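As a minimal sketch of the percentage difference described in the research setting, the following Python snippet computes it for a single food item. The analysis in this report was carried out in R; Python is used here only for illustration. The formula (difference divided by the labeled value), the function names and the example calorie values are assumptions and are not taken from the source data.

```python
def percent_over_label(measured_kcal: float, labeled_kcal: float) -> float:
    """Percentage difference of measured over labeled calories (assumed formula)."""
    if labeled_kcal <= 0:
        raise ValueError("labeled calories must be positive")
    return 100.0 * (measured_kcal - labeled_kcal) / labeled_kcal


def reporting_direction(pct_diff: float) -> str:
    """Positive difference: the label underreports; negative: it overreports."""
    if pct_diff > 0:
        return "underreporting"
    if pct_diff < 0:
        return "overreporting"
    return "accurate"


# Hypothetical item: the label states 100 kcal, calorimetry gives 110 kcal.
example = percent_over_label(110.0, 100.0)
print(example, reporting_direction(example))
```

The sign convention follows the report: measured minus labeled, so a positive value means the label states fewer calories than were measured.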

2. Data and Methodology

Data: A sample of 40 food items, comprising regionally distributed (n = 12), nationally advertised (n = 20) and locally prepared (n = 8) items. The per gram values are missing for all 8 locally prepared foods. The classification variable denotes the three food suppliers (R, N and L). Calories were measured per item and per gram. For each food, the percentage difference between measured and labeled calories (measured minus labeled) was obtained: a positive value indicates underreporting and a negative value indicates overreporting.

Methodology: In the first analysis, descriptive plots (box plots, QQ plots, histograms and bar plots) were used to show the nature and pattern of the dataset. The histogram of the per item values is extremely right skewed, and that of the per gram values is moderately right skewed. The QQ plot for per item is far from normal, and the remedy was a transformation to the log scale, whereas the per gram values behave approximately normally apart from some extreme values in the upper tail of the QQ plot. Because the QQ plot (per gram) is very sensitive to outliers, the solution was to remove them. The 8 missing per gram values make up the whole locally prepared group for that variable; they were omitted, since an entirely missing group leaves no basis for replacing the values.

After the Shapiro test confirmed that the per gram measurements are approximately normally distributed, a one-sample t test of the mean against zero was applied to evaluate whether there is overall overreporting or underreporting. Furthermore, the same t test was performed on the per gram values separately for the regionally distributed and the nationally advertised foods. To examine whether the variability of per gram labeling differs between regional and national foods, an F test of equality of variances was performed.

For caloric labeling per item, the three food suppliers were compared by one-way analysis of variance, after transformation to the log scale and removal of influential outliers (2), and after checking with the Bartlett test that the transformed measurements are approximately normally distributed with the same variance in each of the 3 groups. Post hoc analysis was based on Tukey's Honest Significant Difference method. Food labeling effect estimates were obtained as explained in Section 2.2, and are reported together with Bonferroni-adjusted p-values and 95% confidence intervals. The relationship between the relative frequency of underreporting or overreporting per gram and the classification was examined with the Fisher exact test and the Pearson chi-square test.

In all analyses, normality was examined with QQ plots and the Shapiro test, and homoscedasticity (i.e. constancy of variance) with the F test and the Bartlett test. P-values below 0.05 (or 5%) are termed statistically significant. All analyses were conducted in R version 2.6, using the packages faraway, stats and car.

Results

Section Part I (Els Goetghebeur)

2.1 Descriptive Statistics

The overall mean percentage over label is 24% per item and 5% per gram. For regionally distributed foods, the mean percentage over label per item was 25% (SD = 16%). For nationally advertised foods, the mean percentage over label per item was 0.13% (SD = 11%), whereas for locally prepared foods it was 82% (SD = 84%). Regionally distributed foods had a mean percentage over label per gram of 15% (SD = 19%), while nationally advertised foods had a mean percentage over label per gram of -0.95% (SD = 8%). For locally prepared foods the per gram data are missing (values not reported). More detail is given in Table 2.1.

Table 2.1 Statistical indicators of the % difference over food labeling, by classification (L = locally prepared, N = nationally advertised, R = regionally distributed)

                       Per gram                     Per item
                       L      N        R            L        N        R
  n                    8      20       12           8        20       12
  Mean % over label    NA     -0.95%   14.67%       81.75%   0.125%   25.1%
  Std. deviation       NA     8.10%    18.72%       83.97%   10.5%    16.07%
  Median               NA     -1.0%    1.50%        70%      2.5%     6.50%

Total N = 40; NA = missing.
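The overall per gram mean can be cross-checked against the group means as a size-weighted average. The sketch below is in Python for illustration only (the report used R). It assumes group sizes of 20 and 12, as implied by the 19 and 11 degrees of freedom of the later t tests, and uses the group means from the descriptive table; the function and variable names are hypothetical.

```python
# Per gram group summaries; locally prepared foods are missing for this variable.
groups = {
    "N": {"n": 20, "mean": -0.95},   # nationally advertised (n assumed from df = 19)
    "R": {"n": 12, "mean": 14.67},   # regionally distributed (n assumed from df = 11)
}


def pooled_mean(summaries: dict) -> float:
    """Size-weighted average of the group means."""
    total_n = sum(g["n"] for g in summaries.values())
    weighted_sum = sum(g["n"] * g["mean"] for g in summaries.values())
    return weighted_sum / total_n


overall_per_gram = pooled_mean(groups)
print(round(overall_per_gram, 2))
```

Under these assumptions the weighted average rounds to the 5% quoted for the overall per gram mean.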

Fig 2.1

2.1.1 Check for Normality of per item via plots

The left panel plots (Fig 2.1) show the untransformed per item data. The box plot shows many extreme outliers, and the mean and median differ (Table 2.1). The histogram shows the nature of the skewness: the data are right skewed. Finally, the normal QQ plot was used to assess normality through the linearity of the graph, and it indicates that the per item data are not normally distributed. The remedy is to transform the per item data to the log scale (a standard solution for right-skewed data).
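To illustrate why a log transformation helps with right-skewed data, the following Python sketch (illustration only; the report used R) compares the moment-based sample skewness before and after taking logs. The data values are hypothetical and are not the per item measurements from the study.

```python
import math


def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed, strictly positive values.
raw = [2, 3, 5, 8, 12, 20, 45, 110, 250]
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)
print(f"skewness raw: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

Note that the logarithm is only defined for positive values, so negative or zero percentage differences cannot be log-transformed directly and have to be handled separately.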

2.1.2 Check for Normality of per gram via plots

The right panel plots (Fig 2.1) show the untransformed per gram data. In the box plot the mean and median are almost the same, but the data contain a few outliers. Skewness was checked further with a histogram, and normality was finally inspected with a QQ plot. The data behave approximately normally, although the QQ plot deviates in the upper tail; this signals the presence of outliers, and a possible remedy is to remove them.

2.2 Normality of the data

Fig 2.2

Because the per item data are right skewed, transforming them to the log scale brought them close to normality. For the per gram data, the only measure needed was to remove the outliers; these are very few, so their absence can be tolerated. The removed outliers are the values greater than 25, of which there are only 3. This brings the dataset close to normality.

Section Part II (Stefan Van Aelst)

3.1 Evaluation of overall underreporting/overreporting of labeling per gram

To evaluate whether there is evidence of overall underreporting or overreporting of calories per gram, a one-sample t test of the mean percentage difference against zero was performed. The test assumes normality and independent observations. As explained in the methodology section, the per gram dataset contains extreme values (outliers > 25), and one way to handle this is to remove the outliers and then test for normality. The Shapiro test confirmed that the per gram dataset is approximately normal (P = 0.78), so the t test can be applied. The missing values were omitted from the analysis.

Fig 3.1 (Plot of per gram)

Fig 3.1 depicts the per gram caloric labeling data after removal of the outliers, which supports the normality of the data.

The hypotheses for the t test are

$H_0: \mu_{gram} = 0$ versus $H_1: \mu_{gram} \neq 0$,

where $\mu_{gram}$ represents the average percentage difference of caloric measurement over labeling (measured calories minus labeled calories; a value > 0 indicates underreporting).

The output from R is $t_{28,\,0.05} = 0.68$ with P = 0.5, and the 95% CI is (-2.56, 5.11). The p-value is larger than 0.05 and the 95% confidence interval includes zero. Thus there is insufficient evidence to reject the null hypothesis (H0: the mean percentage difference between measured and labeled calories is zero). With 95% confidence, the true mean % difference of food labeling per gram lies somewhere between -2.56 and 5.11.

Section Part III B.1 & B.2 (Stijn Vansteelandt)

3.2 Evaluation of the degree of underreporting/overreporting of calories per gram

To determine the degree of underreporting/overreporting of calories per gram in regionally distributed foods and in nationally advertised foods, a one-sample t test was performed in each group. To justify the use of this test, the assumption of normality has to be fulfilled. The Shapiro test confirmed the normality of the per gram data for regionally distributed foods (P = 0.36) and for nationally advertised foods (P = 0.99). In both cases the assumption is met and the t test can be applied.

Fig 3.2
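The one-sample t statistic and its confidence interval can be sketched from summary statistics alone. The Python snippet below is for illustration only (the report used R). It uses the national per gram mean (-0.95) and standard deviation (8.10) from the descriptive table and assumes n = 20, as implied by the 19 degrees of freedom reported for that test. The critical value 2.093 is the two-sided 5% point of the t distribution with 19 degrees of freedom taken from a standard t table; the function name is hypothetical.

```python
import math


def one_sample_t(mean, sd, n, t_crit, mu0=0.0):
    """Return the t statistic and the confidence interval for a single mean."""
    se = sd / math.sqrt(n)              # standard error of the mean
    t_stat = (mean - mu0) / se
    return t_stat, (mean - t_crit * se, mean + t_crit * se)


T_CRIT_DF19 = 2.093  # two-sided 5% critical value for 19 df (from a t table)

t_stat, (ci_low, ci_high) = one_sample_t(mean=-0.95, sd=8.10, n=20,
                                         t_crit=T_CRIT_DF19)
print(round(t_stat, 2), round(ci_low, 2), round(ci_high, 2))
```

The same function applies to the regional group once its mean, standard deviation, sample size and the matching critical value are supplied.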

The hypotheses for the t test for nationally advertised foods are

$H_0: \mu_{NA} = 0$ versus $H_1: \mu_{NA} \neq 0$,

where $\mu_{NA}$ represents the average percentage difference of caloric measurement over labeling for nationally advertised foods (measured calories minus labeled calories; a value > 0 indicates underreporting).

The output from R is $t_{19,\,0.05} = -0.52$ with P = 0.61, and the 95% CI is (-4.74, 2.84). The p-value is larger than 0.05 and the 95% confidence interval includes zero. Thus there is insufficient evidence to reject the null hypothesis (H0: the mean percentage difference between measured and labeled calories is zero for nationally advertised food labels). With 95% confidence, the true mean % difference of food labeling per gram of nationally advertised foods lies somewhere between -4.74 and 2.84.

The hypotheses for the t test for regionally distributed foods are

$H_0: \mu_{RG} = 0$ versus $H_1: \mu_{RG} \neq 0$,

where $\mu_{RG}$ represents the average percentage difference of caloric measurement over labeling for regionally distributed foods (measured calories minus labeled calories; a value > 0 indicates underreporting).

The output from R is $t_{11,\,0.05} = 2.71$ with P = 0.02, and the 95% CI is (2.77, 26.56). The p-value is less than 0.05 and the 95% confidence interval excludes zero. Thus the null hypothesis is rejected at the 5% level of significance, and we conclude that the mean percentage difference of calories per gram for regionally distributed foods is not zero. With 95% confidence, the true mean % difference of food labeling per gram of regionally distributed foods lies somewhere between 2.77 and 26.56.

3.3 Evaluation of the variability of overall underreporting/overreporting of calories per gram on food labels

Fig 3.2 (box plot) indicates that the spread differs between the food suppliers, and an F test was performed to examine this variability. The F test compares the variances of two independent samples and tests whether the variance is constant across them.
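The F statistic for comparing two variances is simply the ratio of the sample variances, with degrees of freedom n1 - 1 and n2 - 1. The Python sketch below is for illustration only (the report used R). It uses the per gram standard deviations from the descriptive table (8.10 for national and about 18.7 for regional foods) and assumes sample sizes of 20 and 12; the function name is hypothetical, and the p-value and confidence interval, which require F distribution quantiles, are not computed here.

```python
def variance_ratio(sd_1, n_1, sd_2, n_2):
    """Return the F statistic (ratio of sample variances) and its degrees of freedom."""
    f_stat = sd_1 ** 2 / sd_2 ** 2
    return f_stat, (n_1 - 1, n_2 - 1)


# National (group 1) versus regional (group 2) per gram values.
f_stat, (df_num, df_den) = variance_ratio(sd_1=8.10, n_1=20, sd_2=18.7, n_2=12)
print(round(f_stat, 2), df_num, df_den)
```

A ratio far below 1 indicates that the national values vary much less than the regional ones.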

Let us formulate the hypotheses as follows. Let $\sigma_1^2$ be the variance of nationally advertised foods per gram and $\sigma_2^2$ the variance of regionally distributed foods per gram:

$H_0: \sigma_1^2 = \sigma_2^2$ versus $H_1: \sigma_1^2 \neq \sigma_2^2$.

The F ratio test statistic is the ratio of the sample variances, $F = S_1^2 / S_2^2$.

The output from R is $F_{19,\,11} = 0.19$ with P = 0.001, and the 95% CI is (0.06, 0.52). The p-value is highly significant; hence we reject the null hypothesis and conclude that the variance is not the same for nationally advertised foods per gram and regionally distributed foods per gram.

Section Part III C (Stijn Vansteelandt)

Figure 3.3

Figure 3.3 suggests that the average percentage difference of caloric labeling per item is higher for locally prepared foods than for the regional and national food suppliers.

This is confirmed by the one-way analysis of variance, which reveals a significant difference in the average percentage difference of caloric labeling between the food suppliers. The box plot in Fig 3.3 (green) indicates that the untransformed values of caloric labeling per item deviate extremely from normality. One-way analysis of variance requires homogeneity of variance, and Bartlett's test is highly significant before transformation (P ≈ 3e-07). To meet the normality requirement, the per item variable was transformed to the log scale. Even after this, Bartlett's test still showed a significant difference in variance (P = 0.034). There appear to be some influential outliers in the per item data for locally prepared foods, and one way to handle this is to remove the extreme values, which are 2 observations (a loss that does no harm). After this, Bartlett's test was consistent with constant variance (P = 0.1), so the analysis of variance is appropriate. The results on the log scale are $F_{2,\,25} = 5.91$ with P = 0.0078, which is highly significant.

The hypotheses for the one-way analysis of variance are

$H_0: \mu_L = \mu_R = \mu_N$ versus $H_A$: at least one of the population means differs.

Under $H_0$, the test statistic is

$F = \mathrm{MS}_{between} / \mathrm{MS}_{within} \sim F_{k-1,\,n-k}$.

Post hoc analysis shows that the difference in average % difference in caloric labeling per item, on the log scale, equals -1.28 (95% confidence interval (-2.49, -0.07), P = 0.036) between national and local food suppliers, -0.01 (95% confidence interval (-1.21, 1.19), P = 0.99) between regional and local food suppliers, and 1.27 (95% confidence interval (0.25, 2.28), P = 0.01) between regional and national food suppliers. All of these values are reported on the log scale.
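The F statistic of the one-way analysis of variance can be sketched directly from its definition as the ratio of the between-group to the within-group mean square. The Python snippet below is for illustration only (the report used R's built-in routines). The three small groups are hypothetical values, not the log-transformed per item data, and the function name is an assumption.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: group sizes times squared mean deviations.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares: squared deviations from each group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within, k - 1, n - k


# Hypothetical data for three suppliers.
f_value, df_between, df_within = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_value, df_between, df_within)
```

The degrees of freedom returned are k - 1 and n - k, matching the reference distribution stated above.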

Section Part V (Yves Rossel)

To evaluate a possible relationship between the (relative) frequency of underreporting of calories per gram and the type of food supplier, Fisher's exact test for count data was applied. It shows no significant difference between nationally advertised and regionally distributed foods in the frequency of underreporting of calories per gram (P = 0.06). The test is an exact method suited to small samples. The odds ratio is 0.173 and the 95% CI is (0.015, 1.13). A similar non-significant result was obtained using Pearson's chi-square test (P = 0.056). In sum, there is no significant relationship between the relative frequency of underreporting of calories per gram and the type of food supplier. The degree of association was also examined using Sakoda's method, which gave 0.46 and can be interpreted as a weak relationship. Another technique applied was fitting 3x2 tables (see appendix and R code), in which the missing values for locally prepared foods were fitted as structural zeros. The fitted Poisson log-linear model gave a deviance of G² = 4.89 with 1 degree of freedom.

Conclusion

These findings suggest that food labels may be inadequate sources for caloric monitoring. Health care professionals should consider the accuracy of caloric labeling when advising patients to use food labels to help monitor their caloric intake.

- Locally prepared food labels per item showed a significant difference (underreporting/overreporting) between measured and labeled calories.
- Regionally distributed food labels per item showed a significant difference (underreporting/overreporting) between measured and labeled calories.
- Nationally advertised food labels per item showed no significant difference (underreporting/overreporting) between measured and labeled calories.
- The overall underreporting/overreporting of calories per item differs significantly between regional and national and between national and local foods, whereas the comparison between regional and local foods is not significant.
- The overall underreporting/overreporting of calories per gram differs significantly between regional and national food labels.
- There is no significant association between the relative frequency of underreporting/overreporting of calories per gram and the type of food supplier.
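For reference, the Fisher exact test used in Part V can be sketched for a 2x2 table with the standard library alone. The Python snippet below is for illustration only (the report used R). The counts are hypothetical, since the observed underreporting counts per supplier are not listed in the text, and the function names are assumptions. The two-sided p-value sums the probabilities of all tables with the same margins that are no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-9))


def sample_odds_ratio(a, b, c, d):
    """Cross-product odds ratio; undefined when b * c == 0."""
    return (a * d) / (b * c)


# Hypothetical counts: rows = supplier, columns = underreported / not.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
odds = sample_odds_ratio(3, 1, 1, 3)
print(round(p_value, 4), odds)
```

R's fisher.test reports a conditional maximum likelihood estimate of the odds ratio, which generally differs from the simple cross-product ratio computed here.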

Appendices

Appendix I (The data set)

Appendix II (Bar plot of data by classification)