MATH 1040 Skittles Data Project

Laura Boren
MATH 1040 Data Project

For our project in MATH 1040, everyone in the class was asked to buy a 2.17-ounce individual-sized bag of Skittles and count the number of candies of each color in the bag. The class data was compiled, and we used it for a number of exercises, each involving a different aspect of statistics. For the first part of the project, we determined the proportion of each color of candy and created a Pareto chart and a pie chart for the total number of candies of each color in the entire class. We compared the class data to our own personal data and noted any similarities or differences. For part 2 of the project, we used the Skittles data to create statistical summaries: the mean, the standard deviation, and the 5-number summary. We made a frequency histogram of the total number of candies per bag as well as a box plot. Individually, I also wrote a paragraph about the significance of different qualitative and quantitative methods of analysis. The last part of the project involved confidence intervals. We found three confidence intervals, one each for the population proportion, mean, and standard deviation, and wrote an analysis of what each confidence interval meant.

Laura Boren, Melissa Oneal, Justin Peck, Nathan Schafer
Math 1040 Class Proportions

Color     Count   Proportion of Total
Red         564   0.199
Orange      564   0.199
Green       566   0.199
Purple      559   0.197
Yellow      586   0.206
Total      2839   1.000

(Total = total number of candies in the class.)
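The proportions in the class table can be checked with a few lines of Python. This is only a minimal sketch: the counts are copied from the class data, while the names `class_counts` and `proportions` are made up for illustration.

```python
# Class color counts, copied from the class data above.
class_counts = {"Red": 564, "Orange": 564, "Green": 566, "Purple": 559, "Yellow": 586}


def proportions(counts):
    """Return each category's share of the total, rounded to three decimals."""
    total = sum(counts.values())
    return {color: round(n / total, 3) for color, n in counts.items()}


class_props = proportions(class_counts)

# Print one row per color: name, count, proportion of total.
for color, p in class_props.items():
    print(f"{color:<7}{class_counts[color]:>5}  {p:.3f}")
```

Each proportion is simply the color count divided by the class total of 2839 candies.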

Laura Boren, Melissa Oneal, Justin Peck, Nathan Schafer

Does the class data represent a random sample?

Yes, the class data does represent a random sample. Each student was asked to buy their own bag of Skittles, so not every bag in the region had an equal chance of being selected, but the distribution of bags from the central plant or warehouse was most likely random. The Skittles company most likely does not count colors as it fills the bags and simply fills them by weight. Assuming students did not make any biased decisions about which bag to grab off the shelf, every bag produced had an equal chance of being shipped to any location in the country and being selected at random by a student in the class.

What would the population be?

In this study, the sample is the class data. Since not everyone in the class is currently living in the same state, the population would be all 2.17-ounce Skittles bags in the United States. Different manufacturing plants operate overseas, so the population can only reasonably be extended to the United States distribution network.

Laura Boren
Math 1040 Data

Color     Class Total   Proportion   My Total   Proportion
Red               564   0.199              16   0.258
Yellow            586   0.206              11   0.177
Orange            564   0.199              10   0.161
Green             566   0.199              15   0.242
Purple            559   0.197              10   0.161
Total            2839                      62

My Skittles bag differed quite a bit from the class data. My bag had a noticeably higher proportion of red and green Skittles than the class total. As in the class data, purple was the least common color in my bag, although it was tied with orange. I had always assumed that red was the most common Skittles color, but that may just be because red is such a vibrant color and gets noticed more. Red was the most common color in my bag, but that was not supported by the class data. I was surprised to see that yellow was the most common color in the class.
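The comparison between my bag and the class can be made concrete by subtracting the two sets of proportions. The sketch below assumes nothing beyond the counts reported above; the variable and function names are hypothetical.

```python
# Counts copied from the comparison above.
class_counts = {"Red": 564, "Yellow": 586, "Orange": 564, "Green": 566, "Purple": 559}
my_counts = {"Red": 16, "Yellow": 11, "Orange": 10, "Green": 15, "Purple": 10}


def share(counts):
    """Return each color's unrounded proportion of the total."""
    total = sum(counts.values())
    return {color: n / total for color, n in counts.items()}


class_share = share(class_counts)
my_share = share(my_counts)

# Positive values mean the color is over-represented in my bag.
diff = {c: round(my_share[c] - class_share[c], 3) for c in class_counts}

for color, d in diff.items():
    print(f"{color:<7}{d:+.3f}")
```

Red and green come out positive and the other three colors negative, which matches the description of the bag above.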

1. Using the total number of candies in each bag in our class sample, compute the following measures for the variable "Total candies in each bag". Report these summary statistics rounded to one decimal place, if needed.

(a) Mean number of candies per bag: the mean number of candies per bag is 59.1 candies.
(b) Standard deviation of the number of candies per bag: the standard deviation is 6.4 candies.
(c) 5-number summary for the number of candies per bag: minimum 34, first quartile 58, median 60, third quartile 62, maximum 71.

Math 1040 Skittle Data 2015

Laura Boren
Data Part 3

1. From these graphs we can conclude that the frequency histogram is skewed to the left, although our boxplot appeared rather symmetrical, likely because the number line did not have small enough increments. This skew is consistent with the summary statistics: the median number of candies per bag is 60, but the mean is only 59.1. One of the main causes of the negative skew is that several of the Skittles bags had only 30-40 candies in them, which is little more than half the median number of candies per bag. Those bags are outliers, and they pull the mean toward the left. My data agrees with the data collected by the whole class because the highest frequency was in the class of 60-65 candies per bag. My bag had 62 candies, which falls right in that class.

2. Categorical variables are also known as qualitative variables. These variables can be sorted into categories, such as model of car, color, gender, etc. Quantitative data is numerical data that can be ordered and measured. The number of candies in a bag of Skittles is quantitative, whereas the color of the candy is categorical. Quantitative data is best graphed with histograms, stem-and-leaf plots, dot plots, bar graphs, and box plots. All of these types of graphs can be used to show the quantity of a certain variable. Categorical data is best graphed using a method that lets you compare the groups to one another. A bar graph can work for both quantitative and categorical data, but a pie chart doesn't make sense for quantitative data because it compares categories to the whole. A pie chart would effectively show the percentage of each color of Skittles in a bag (categorical data), but it cannot effectively show the number of Skittles in a bag (quantitative data). When it comes to calculations, the mean and median only make sense for quantitative data.
The mean is the average value across an entire sample, so it is only a meaningful calculation when applied to quantitative data. The median represents the middle value of the ordered data and, once again, only makes sense when applied to quantitative data. The best measure of central tendency for categorical data is the mode. When looking at the colors of candy in a Skittles bag, you may not be able to find the average color or the median color, but you can establish which color occurs most often. Likewise, when looking at the number of candies in a Skittles bag, the most useful measures of center are the mean and median number of candies.
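As a small illustration of the mode as the natural "center" for categorical data, the sketch below picks the most frequent color from the counts reported earlier in this project. The helper name `modal_category` is made up for this example.

```python
# Color counts reported earlier in the project.
class_counts = {"Red": 564, "Orange": 564, "Green": 566, "Purple": 559, "Yellow": 586}
my_counts = {"Red": 16, "Orange": 10, "Green": 15, "Purple": 10, "Yellow": 11}


def modal_category(counts):
    """Return the category with the highest count, i.e. the mode."""
    return max(counts, key=counts.get)


class_mode = modal_category(class_counts)
my_mode = modal_category(my_counts)

print("Most common color in the class:", class_mode)
print("Most common color in my bag:", my_mode)
```

There is no equivalent calculation for a "mean color", which is the point made above.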

Laura Boren, Nathan Schafer, Justin Peck, Melissa Oneal

99% Confidence Interval estimate for the population proportion of yellow candies

x = 586
n = 2839
z-value for 99% CI = 2.576
Sample proportion p = 586/2839 = 0.206
0.206 ± 2.576 × 0.007596, where 0.007596 is the standard error of the proportion, sqrt(p(1 - p)/n)
0.206 ± 0.01957
99% Confidence Interval Estimate: (0.186, 0.226)

Confidence intervals for a population proportion are used to estimate, with the specified degree of confidence, the proportion of a characteristic found within a population. In relation to the Skittles, we are 99% confident that the proportion of yellow candies among all Skittles in the population falls between 0.186 and 0.226.

95% Confidence Interval estimate for the population mean number of Skittles per bag

n = 49
s = 6.38
Sample mean = 59.15
Standard error of the mean = 6.38/sqrt(49) = 0.9114

To find the t-value, a t-table was consulted using 50 degrees of freedom (the exact degrees of freedom are n - 1 = 48). The t-value is 2.009.

59.15 ± 2.009 × 0.9114
59.15 + 1.83 = 60.98
59.15 - 1.83 = 57.32
95% Confidence Interval Estimate: (57.32, 60.98)

Confidence interval estimates of the population mean use sample data to construct an interval that, with the specified degree of confidence, should contain the mean of the population. In this case, we are 95% confident that the population mean number of Skittles per bag is between 57.32 and 60.98.
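Both intervals above can be reproduced with a short Python sketch. It assumes the usual normal-approximation interval for a proportion and the t-interval for a mean; the function names are hypothetical, and the t critical value 2.009 is passed in by hand from the t-table (as in the calculation above) because the standard library has no t distribution.

```python
import math
from statistics import NormalDist


def proportion_ci(x, n, confidence):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = x / n
    # Two-sided critical value, e.g. about 2.576 for 99% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def mean_ci(sample_mean, s, n, t_crit):
    """t-interval for a population mean, given a table value for t."""
    margin = t_crit * s / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Yellow candies: 586 out of 2839, 99% confidence.
lo_p, hi_p = proportion_ci(586, 2839, 0.99)

# Candies per bag: mean 59.15, s = 6.38, n = 49, t = 2.009 from the table.
lo_m, hi_m = mean_ci(59.15, 6.38, 49, 2.009)

print(f"proportion: ({lo_p:.3f}, {hi_p:.3f})")
print(f"mean:       ({lo_m:.2f}, {hi_m:.2f})")
```

Because this sketch keeps the unrounded sample proportion, its lower bound for the proportion comes out near 0.187 rather than 0.186; the hand calculation rounded the proportion to 0.206 before adding and subtracting the margin of error.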

Laura Boren, Nathan Schafer, Justin Peck, Melissa Oneal

98% confidence interval estimate for the population standard deviation of the number of candies per bag

n = 49
s = 6.378
$s^2 = 40.679$
For 98% confidence, $\alpha = 0.02$, so $1 - \alpha/2 = 0.99$ and $\alpha/2 = 0.01$.

On the chi-square distribution chart, 50 degrees of freedom was used. The value for $\chi^2_{1-\alpha/2}$ was 29.707. For $\chi^2_{\alpha/2}$ it was 76.154.

Each bound is $\sqrt{(n-1)s^2/\chi^2}$, with $n - 1 = 48$ in the numerator:

Lower bound: $\sqrt{48 \times 40.679 / 76.154} = 5.06$
Upper bound: $\sqrt{48 \times 40.679 / 29.707} = 8.11$

Confidence interval estimates of the population standard deviation use the sample standard deviation to generate an interval that the population standard deviation of the number of candies should fall within, with the specified level of confidence. In this case, we are 98% confident that the population standard deviation is between 5.06 and 8.11 candies. The problem with confidence interval estimates based on the sample standard deviation is that the sample standard deviation may be quite different from the actual population standard deviation.
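The bounds above can be checked with a minimal Python sketch of the chi-square interval for a standard deviation. The function name is hypothetical, and the two critical values are the ones read from the chi-square chart above (the standard library has no chi-square distribution, so they are entered by hand).

```python
import math


def std_dev_ci(s, n, chi2_left, chi2_right):
    """Confidence interval for a population standard deviation.

    chi2_left is the smaller critical value (area 1 - alpha/2 to its right),
    chi2_right is the larger one (area alpha/2 to its right).
    """
    sum_sq = (n - 1) * s ** 2
    # Dividing by the larger critical value gives the lower bound.
    return math.sqrt(sum_sq / chi2_right), math.sqrt(sum_sq / chi2_left)


# s = 6.378, n = 49, chart values 29.707 and 76.154 as reported above.
lo_s, hi_s = std_dev_ci(6.378, 49, 29.707, 76.154)

print(f"98% CI for sigma: ({lo_s:.2f}, {hi_s:.2f})")
```

Note that the larger chi-square value produces the lower bound, which is why the two critical values appear to swap places in the formula.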

Laura Boren

The purpose of taking sample data and calculating statistics from it is to apply those statistics to a larger population. Since a population is larger than a sample, how well a sample statistic estimates a population parameter is an issue. A confidence interval helps to address that issue by providing a range of values that the population parameter is likely to fall within. The intervals are constructed with a certain level of confidence, expressed as a percentage such as 95%, 98%, or 99%. This means that if the same population were sampled repeatedly and an interval calculated from each sample, about that percentage of the intervals would contain the true parameter.
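The repeated-sampling interpretation can be illustrated with a small simulation. The sketch below is not part of the class project: it assumes a made-up population in which the true proportion is 0.2, draws many samples, builds a 95% normal-approximation interval from each, and counts how often the interval contains the true value. All names and parameter values are hypothetical.

```python
import math
import random
from statistics import NormalDist


def coverage(true_p, n, confidence, trials, seed=0):
    """Fraction of simulated intervals that contain the true proportion."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    hits = 0
    for _ in range(trials):
        # Draw one sample of size n and count the "successes".
        x = sum(rng.random() < true_p for _ in range(n))
        p_hat = x / n
        margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= true_p <= p_hat + margin:
            hits += 1
    return hits / trials


# Hypothetical population: true proportion 0.2, samples of 500 candies.
rate = coverage(true_p=0.2, n=500, confidence=0.95, trials=2000)
print(f"Share of intervals containing the true proportion: {rate:.3f}")
```

The printed share should land close to the stated confidence level, though it will not match it exactly because the simulation uses a finite number of trials and an approximate interval.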

Laura Boren
Skittles Project Reflection

When I first started the project, I was intimidated by the process of using statistical concepts to interpret real-life data. As the project went on, I became much more comfortable with concepts such as confidence intervals and with creating Pareto charts and frequency histograms. In my volunteer work as a lactation educator, and also as a nursing student, I sometimes find myself reading and interpreting peer-reviewed clinical research. Understanding what confidence intervals are and what makes data significant or unusual is very helpful in interpreting such studies and thinking critically about what the data actually means. There are even some aspects of statistics that I used before taking this class. In Human Physiology we were required to calculate the mean, median, and standard deviation of lung inspiratory volume as part of our laboratory unit on the respiratory system. Taking calculus really helped me to understand real-world math applications, and statistics only reinforced what I already knew about the practicality of math.

Statistics is a fundamental part of scientific literacy and has numerous applications in the world of business and economics. Completing the Skittles project helped me to understand how businesses and corporations might need to use statistics, particularly standard deviations, in order to produce accurate and consistent products. Statistics can also be used to calculate demand, determine shipping and distribution needs, and evaluate product quality and customer satisfaction. In our Skittles project we determined the proportion of each color of Skittles candy in the class sample, as well as a confidence interval for the population proportion of one color. This could be helpful in evaluating customer candy preferences and overall satisfaction based on flavor preference. A company might use similar statistics in real life to ensure product standardization.