Kidane Tesfu Habtemariam, MASTAT, Principle of Stat Data Analysis Project work


1. INTRODUCTION

A food label states the number of calories contained in the food package, that is, the amount of energy in the food. People pay attention to calories because eating more calories than the body uses can lead to weight gain. This project paper presents a statistical analysis of a research problem concerning the accuracy of caloric labeling of diet and health foods, taken from the work published by David B. Allison in the Journal of the American Medical Association (JAMA) in September 1993.

Research setting: Foods were sampled from retail merchants throughout the borough of Manhattan, New York, NY. The researchers sampled 40 different food items from three categories: regionally distributed, nationally advertised and locally prepared. They measured the caloric content of each food item via bomb calorimeter and converted the readings into an estimate of total metabolizable energy. In addition, they calculated the percentage difference between the measured calories and the labeled calories, both per item and per gram.

1.1 Objectives

- Determine the accuracy of caloric labeling of diet and health foods.
- Assess whether the accuracy differs between categories of food supplier.
- Evaluate whether there is evidence of overall underreporting or overreporting of calories per gram on food labels.
- Assess whether the degree of underreporting or overreporting of calories per gram differs between regional and national foods.
- Evaluate whether the variability of underreporting or overreporting of calories per gram differs between regional and national foods.
- Analyze whether the degree of underreporting or overreporting of calories per item differs across food suppliers.
- Examine whether there is any relationship between the relative frequency of underreporting of calories per gram and the type of food supplier.
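The percentage difference described above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and dividing by the labeled value is an assumption based on the phrase "percentage over label" used later in the report.

```python
def percent_over_label(measured_kcal: float, labeled_kcal: float) -> float:
    """Percentage by which the measured calories exceed the labeled calories.

    Assumption: the labeled value is the denominator ("percentage over label").
    """
    if labeled_kcal <= 0:
        raise ValueError("labeled_kcal must be positive")
    return 100.0 * (measured_kcal - labeled_kcal) / labeled_kcal


def reporting_direction(pct_diff: float) -> str:
    """Positive difference: the label underreports; negative: it overreports."""
    if pct_diff > 0:
        return "underreporting"
    if pct_diff < 0:
        return "overreporting"
    return "accurate"


# Hypothetical item: labeled 100 kcal, measured 110 kcal.
example_diff = percent_over_label(110.0, 100.0)
example_direction = reporting_direction(example_diff)
```

With these hypothetical numbers the label understates the measured calories by 10%, so the item counts as underreporting.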

1.2 Data and Methodology

Data: The sample consists of 40 food items: regionally distributed (n = 12), nationally advertised (n = 20) and locally prepared (n = 8). The per gram label values are missing for all 8 locally prepared items. The classification variable denotes the three types of food supplier (R, N and L). Calories were measured per item and per gram. For each food, the percentage difference between measured and labeled calories was obtained; a positive value indicates underreporting and a negative value indicates overreporting.

Methodology: In the first analysis, descriptive plots (box plots, QQ plots, histograms and bar plots) were used to show the nature and pattern of the dataset. The overall histogram is extremely right skewed for the per item data and moderately right skewed for the per gram data. The QQ plot for the per item data is far from normal, and the remedy was a log transformation, whereas the per gram data look roughly normal apart from some extreme values in the upper tail of the QQ plot. Because the QQ plot (per gram) is very sensitive to outliers, the outliers were removed. The 8 missing per gram values for the locally prepared foods make up the complete set of values for that category; they were omitted because there is no way to replace them. After the Shapiro test confirmed that the per gram measurements are approximately normally distributed, a t test of the mean against zero was applied to evaluate whether there is overall overreporting or underreporting.
Furthermore, t tests were performed on the per gram data for the regionally distributed and the nationally advertised foods to check for overreporting or underreporting in each group. To examine whether the variability of the per gram labeling differs between regional and national foods, an F test (a test of equality of variances) was performed. For the per item data, caloric labeling was compared between the 3 food suppliers by one-way analysis of variance, after log transformation and removal of influential outliers, and after checking that the transformed measurements are approximately normally distributed and, using the Bartlett test, that they have the same variance in each of the 3 supplier groups. Post-hoc analysis was based on Tukey's Honest Significant Difference method. Food labeling effect estimates were obtained as explained in the results sections and are reported together with Bonferroni-adjusted p-values and 95% confidence intervals. To examine the relationship between the relative frequency of underreporting or overreporting per gram and the classification, Fisher's exact test and Pearson's chi-square test were applied. In all analyses, normality was examined with QQ plots and the Shapiro test, and homoscedasticity (i.e. constancy of variance) with the F test and the Bartlett test. P-values below 0.05 (or 5%) are termed statistically significant. All analyses were conducted in R Version.6, using the packages faraway, stats and car.

Results

Section Part I: Els Goetghebeur

2.1 Descriptive Statistics

The overall mean percentage over label is 24% per item and 5% per gram. For regionally distributed foods, the mean percentage over label per item was 25% (SD = 16%). For nationally advertised foods it was 0.13% (SD = 11%), whereas for locally prepared foods it was 82% (SD = 84%). Per gram, regionally distributed foods had a mean percentage over label of 15% (SD = 19%), while nationally advertised foods had a mean of -0.95% (SD = 8%). For locally prepared foods the per gram data are missing (values not reported).
More details are given in Table 2.1.

Table 2.1 Statistical indicators of the % difference over food labeling, by classification
(L = locally prepared, N = nationally advertised, R = regionally distributed; L = 8, N = 20, R = 12)

                      Per gram                    Per item
                      L     N        R            L        N        R
Mean % over label     NA    -0.95%   14.67%       81.75%   0.15%    25.1%
Std. deviation        NA    8.10%    18.7%        83.97%   10.5%    16.07%
Median                NA    -1.0%    1.50%        70%      .5%      6.50%

Total N = 40; NA = missing
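The indicators in the table (mean, standard deviation and median per classification, with missing values set aside) could be computed along the following lines. This is a minimal Python sketch using only the standard library, not the R code used for the project; the data values and the `summarize` helper are hypothetical.

```python
from statistics import mean, median, stdev


def summarize(values):
    """Mean, sample SD and median of the non-missing values (None = missing)."""
    observed = [v for v in values if v is not None]
    if len(observed) < 2:
        # Too few observations to compute a sample standard deviation.
        return {"n": len(observed), "mean": None, "sd": None, "median": None}
    return {
        "n": len(observed),
        "mean": mean(observed),
        "sd": stdev(observed),
        "median": median(observed),
    }


# Hypothetical per gram percentage differences by classification.
per_gram = {
    "N": [-4.0, 2.0, -1.0, 3.0],
    "R": [10.0, 30.0, 5.0],
    "L": [None, None],  # all values missing, as for the locally prepared foods
}

summary = {group: summarize(values) for group, values in per_gram.items()}
```

A group whose values are all missing yields `None` for every indicator, which mirrors the NA entries in the table.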

Fig 2.1

2.1.1 Check for normality of per item via plots

The left-panel plots (Fig 2.1) show the untransformed per item data. The box plot shows many extreme outliers, and the mean and median differ (Table 2.1). The histogram shows that the data are right skewed. Finally, the normal QQ plot, used to assess normality through the linearity of the plot, indicates that the per item data are not normally distributed. The remedy is to transform the per item data to the log scale, a common solution for right-skewed data.

2.1.2 Check for normality of per gram via plots

The right-panel plots (Fig 2.1) show the untransformed per gram data. In the box plot the mean and median are almost the same, but the data contain a few outliers. Skewness was checked with the histogram and normality with the QQ plot. The data behave roughly normally, although the QQ plot deviates in the upper tail, which signals the presence of outliers; a possible remedy is to remove them.

2.2 Normality of the data

Fig 2.2

Because the per item data are right skewed, transforming them to the log scale brought them close to normality. For the per gram data, the alternative was to remove the outliers; these are very few, so their absence can be tolerated. The removed outliers are the values greater than 5, of which there are only 3. This brings the dataset close to normality.

Section Part II (Stefan Van Aelst)

3.1 Evaluation of overall underreporting/overreporting of calories per gram

To evaluate whether there is evidence of overall underreporting or overreporting of calories per gram, a t test of the mean against zero was performed. The test assumes normality and independent observations. As explained in the methodology section, the per gram dataset contains extreme values (outliers > 5), and one way to handle this is to remove the outliers and then test for normality. The Shapiro test confirmed that the per gram data are approximately normal (P = 0.78), so the t test can be applied. The missing values were omitted from the analysis.

Fig 3.1 (Plot of per gram)

Fig 3.1 depicts the per gram caloric labeling after removal of the outliers, which supports the normality of the data.

The hypotheses for the t test are

$H_0: \mu_{gram} = 0$ versus $H_1: \mu_{gram} \neq 0$,

where $\mu_{gram}$ represents the average percentage difference of the measured calories over the labeled calories. A value of measured calories minus labeled calories greater than 0 indicates underreporting.

The output from R is $t_{28,\,0.05} = 0.68$ with P = 0.5 and a 95% CI of (-2.56, 5.11).

The p-value is larger than 0.05 and the 95% confidence interval includes zero. Thus there is insufficient evidence to reject the null hypothesis that the mean percentage difference between measured and labeled calories is zero. With 95% confidence, the true mean % difference of food labeling per gram lies somewhere between -2.56 and 5.11.

Section Part III B.1 & B.2 (Stijn Vansteelandt)

3.2 Evaluation of the degree of underreporting/overreporting of calories per gram

To determine the degree of underreporting or overreporting of calories per gram for regionally distributed and nationally advertised foods, a t test of the mean against zero was performed for each group. This test requires the normality assumption to hold. The Shapiro test confirmed the normality of the per gram data for regionally distributed foods (P = 0.36) and for nationally advertised foods (P = 0.99). In both cases the assumption is met and the t test can be applied.

Fig 3.2
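The t statistic and 95% confidence interval for each supplier group can be approximately reproduced from the per gram summary statistics in the descriptive table. The following Python sketch is an illustration, not the R code used in the project. The group sizes (20 national, 12 regional) are an assumption inferred from the reported degrees of freedom (19 and 11), and 2.093 and 2.201 are the standard two-sided 5% critical values of the t distribution for those degrees of freedom.

```python
import math


def one_sample_t(sample_mean, sample_sd, n, t_crit, mu0=0.0):
    """t statistic and confidence interval for H0: mean = mu0, from summary statistics."""
    se = sample_sd / math.sqrt(n)       # standard error of the mean
    t_stat = (sample_mean - mu0) / se   # test statistic with n - 1 degrees of freedom
    half_width = t_crit * se            # half-width of the confidence interval
    return t_stat, (sample_mean - half_width, sample_mean + half_width)


# Per gram summary statistics taken from the descriptive table.
# Assumption: n = 20 and n = 12, inferred from the reported df of 19 and 11.
national_t, national_ci = one_sample_t(-0.95, 8.10, 20, t_crit=2.093)
regional_t, regional_ci = one_sample_t(14.67, 18.7, 12, t_crit=2.201)
```

Because the inputs are rounded summary values, the results agree with the R output only up to rounding error.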

The hypotheses for the t test for nationally advertised foods are

$H_0: \mu_{NA} = 0$ versus $H_1: \mu_{NA} \neq 0$,

where $\mu_{NA}$ represents the average percentage difference of the measured calories over the labeled calories for nationally advertised foods. A value of measured calories minus labeled calories greater than 0 indicates underreporting.

The output from R is $t_{19,\,0.05} = -0.52$ with P = 0.61 and a 95% CI of (-4.74, 2.84).

The p-value is larger than 0.05 and the 95% confidence interval includes zero. Thus there is insufficient evidence to reject the null hypothesis that the mean percentage difference between measured and labeled calories is zero for nationally advertised food labels. With 95% confidence, the true mean % difference of food labeling per gram of nationally advertised foods lies somewhere between -4.74 and 2.84.

The hypotheses for the t test for regionally distributed foods are

$H_0: \mu_{RG} = 0$ versus $H_1: \mu_{RG} \neq 0$,

where $\mu_{RG}$ represents the average percentage difference of the measured calories over the labeled calories for regionally distributed foods. A value of measured calories minus labeled calories greater than 0 indicates underreporting.

The output from R is $t_{11,\,0.05} = 2.71$ with P = 0.02 and a 95% CI of (2.77, 26.56).

The p-value is less than 0.05 and the 95% confidence interval excludes zero. Thus the null hypothesis is rejected at the 5% level of significance, and we conclude that the mean percentage difference of calories per gram for regionally distributed foods is not zero. With 95% confidence, the true mean % difference of food labeling per gram of regionally distributed foods lies somewhere between 2.77 and 26.56.

Evaluation of the variability of underreporting/overreporting of calories per gram on food labels

Fig 3.2 (box plot) indicates that the median differs between the food suppliers, and an F test was performed to examine the variability. The F test compares the variances of the two independent samples and tests whether the variance is constant between them.
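As a rough check, the variance ratio used by the F test can be computed directly from the per gram standard deviations in the descriptive table (8.10 for national, 18.7 for regional). This Python sketch is not the R code used for the analysis, and the group sizes of 20 and 12 are an assumption inferred from the reported degrees of freedom.

```python
def variance_ratio(sd_1: float, n_1: int, sd_2: float, n_2: int):
    """Ratio of two sample variances with its numerator and denominator df."""
    f_ratio = sd_1 ** 2 / sd_2 ** 2
    return f_ratio, n_1 - 1, n_2 - 1


# National (sample 1) versus regional (sample 2), per gram.
# Assumption: n = 20 and n = 12, inferred from the reported degrees of freedom.
f_ratio, df_num, df_den = variance_ratio(8.10, 20, 18.7, 12)
```

A ratio well below 1 means the national per gram values vary much less than the regional ones; turning the ratio into a p-value additionally requires the F distribution with the two degrees of freedom.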

We formulate the null hypothesis as follows. Let $\sigma_1^2$ be the variance of nationally advertised foods per gram and $\sigma_2^2$ the variance of regionally distributed foods per gram:

$H_0: \sigma_1^2 / \sigma_2^2 = 1$ versus $H_1: \sigma_1^2 / \sigma_2^2 \neq 1$.

The F ratio test statistic is the ratio of the sample variances, $F = S_1^2 / S_2^2$.

The output from R is $F_{19,\,11} = 0.19$ with P = 0.001 and a 95% CI of (0.06, 0.5).

The p-value is highly significant, and hence we reject the null hypothesis and conclude that the variance is not the same for nationally advertised foods per gram and regionally distributed foods per gram.

Section Part III C (Stijn Vansteelandt)

Figure 3.3

Figure 3.3 suggests that the average percentage difference of caloric labeling per item is higher for the locally prepared foods than for the regional and national food suppliers.

This is confirmed by the one-way analysis of variance, which reveals a significant difference in the average percentage difference of caloric labeling between the food suppliers. The box plot in Fig 3.3 (green) shows that the untransformed per item values deviate extremely from normality. One-way analysis of variance requires homogeneity of variance, and the Bartlett test is highly significant before transformation (P = 3.e-07). To meet the normality requirement, the per item variable was transformed to the log scale. Even then, the Bartlett test still showed a significant difference in variance (P = 0.034). There appear to be some influential outliers in the per item data for locally prepared foods, and one way to handle this is to remove the extreme observations, which is not considered harmful here. After that, the Bartlett test confirmed constant variance and thus the appropriateness of the test (P = 0.1). On the log scale, the result is $F_{2,\,25} = 5.91$ with P = 0.0078, which is highly significant.

The hypotheses for the one-way analysis of variance are

$H_0: \mu_L = \mu_R = \mu_N$ versus $H_A$: at least one of the population means differs.

The test statistic is

$F = \frac{MS_{between}}{MS_{within}} \sim F_{k-1,\,n-k}$ under $H_0$.

Post hoc analysis shows that the difference in average % difference in caloric labeling per item, on the log scale, equals -1.28 (95% confidence interval (-2.49, -0.07), P = 0.036) between national and local food suppliers, -0.01 (95% confidence interval (-1.2, 1.19), P = 0.99) between regional and local food suppliers, and 1.27 (95% confidence interval (0.25, 2.28), P = 0.01) between regional and national food suppliers. All of these values are reported on the log scale.
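The F statistic of the one-way analysis of variance, the ratio of the between-group to the within-group mean square, can be sketched in Python as follows. This is an illustration with hypothetical data, not the R code or the log-scale data used in the project.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F statistic and degrees of freedom (k - 1, n - k) for one-way ANOVA."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical log-scale values for three supplier groups.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_between, df_within = one_way_anova_f(groups)
```

For these hypothetical groups the between-group mean square is 13 and the within-group mean square is 1, so F = 13 on (2, 6) degrees of freedom.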

Section Part V (Yves Rossel)

To evaluate a possible relationship between the (relative) frequency of underreporting of calories per gram and the type of food supplier, Fisher's exact test for count data was applied. It shows no significant relationship (P = 0.06) between supplier type (nationally advertised versus regionally distributed) and the frequency of underreporting of calories per gram. The test is an exact method suited to small samples. The 95% CI for the odds ratio is (0.015, 1.13). A similar non-significant result was obtained with Pearson's chi-square test (P = 0.056). In sum, there is no significant relationship between the relative frequency of underreporting of calories per gram and the type of food supplier. The degree of association was also examined with Sakoda's method, which gave a value of 0.46 and can be interpreted as a weak relationship. Another technique applied was fitting 3x2 tables (see appendix and R code), in which the missing values for locally prepared foods were fitted as structural zeros. The fitted Poisson log-linear model gave a deviance of $G^2 = 4.89$ with 1 degree of freedom.

Conclusion

These findings suggest that food labels may be inadequate sources for caloric monitoring. Health care professionals should consider the accuracy of caloric labeling when advising patients to use food labels to help monitor their caloric intake. Locally prepared food labels showed a significant per item difference (underreporting/overreporting) between measured and labeled calories. Regionally distributed food labels also showed a significant per item difference. Nationally advertised food labels showed no significant per item difference.
The overall underreporting/overreporting of calories per item differs significantly between regional and national and between national and local food labels, whereas the comparison between regional and local is not significant. The underreporting/overreporting of calories per gram differs significantly between the regional and national food labels. There is no significant association between the relative frequency of underreporting/overreporting of calories per gram and the type of food supplier.
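As a companion to the association analysis in Section Part V, the Pearson chi-square statistic and the sample odds ratio for a two-way table of counts can be sketched in Python as follows. The counts are hypothetical, not the project data, and the statistic is computed without a continuity correction, so it need not match R output that applies one.

```python
def pearson_chi_square(table):
    """Uncorrected Pearson chi-square statistic and df for a two-way count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df


def sample_odds_ratio(table):
    """Cross-product odds ratio of a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)


# Hypothetical counts: rows = supplier type, columns = (underreported, not underreported).
counts = [[10, 10], [2, 10]]
chi2, df = pearson_chi_square(counts)
odds = sample_odds_ratio(counts)
```

The cross-product ratio shown here is the simple sample estimate; an exact test may report a conditional estimate of the odds ratio instead, so the two values can differ.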

Appendices

Appendix I (The data set)

Appendix II (Barplot of data by classification)
