DOWNLOAD PDF SUMMARIZING AND INTERPRETING DATA : USING STATISTICS


Florence Hopkins
5 years ago

Chapter 1 : Summarizing Numerical Data Sets Worksheets

Stem and Leaf Activity Sheets with Answers. Students first create the stem and leaf plot, then use it to answer questions. This is a great way to see how stem and leaf plots help us make sense of data quickly.

Boxplot
A boxplot provides a graphical summary of the distribution of a sample. The boxplot shows the shape, central tendency, and variability of the data.

Interpretation
Use a boxplot to examine the spread of the data and to identify any potential outliers. Boxplots are best when the sample size is greater than

Skewed data
Examine the shape of your data to determine whether your data appear to be skewed. When data are skewed, the majority of the data are located on the high or low side of the graph. Often, skewness is easiest to detect with a histogram or boxplot. The boxplot with right-skewed data shows wait times. Most of the wait times are relatively short, and only a few wait times are long. The boxplot with left-skewed data shows failure time data. A few items fail immediately, and many more items fail later.

Outliers
Outliers, which are data values that are far away from other data values, can strongly affect the results of your analysis. Often, outliers are easiest to identify on a boxplot. Try to identify the cause of any outliers. Correct any data-entry errors or measurement errors. Consider removing data values for abnormal, one-time events (also called special causes). Then, repeat the analysis.

CoefVar
The coefficient of variation (CoefVar) is a measure of spread that describes the variation in the data relative to the mean. The coefficient of variation is adjusted so that the values are on a unitless scale. Because of this adjustment, you can use the coefficient of variation instead of the standard deviation to compare the variation in data that have different units or that have very different means.
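As a minimal sketch of the coefficient of variation described above, the Python snippet below divides the sample standard deviation by the mean. The two samples are hypothetical values invented for illustration, and the function name coef_var is an assumption rather than anything defined in this text.

```python
import statistics

def coef_var(values):
    """Coefficient of variation: sample standard deviation divided by the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined when the mean is 0")
    return statistics.stdev(values) / mean

# Hypothetical measurements in different units (illustration only).
lengths_mm = [98.0, 100.0, 102.0]   # mean 100, standard deviation 2
weights_kg = [4.0, 5.0, 6.0]        # mean 5, standard deviation 1

cv_lengths = coef_var(lengths_mm)   # 2 / 100
cv_weights = coef_var(weights_kg)   # 1 / 5
print(f"lengths: {cv_lengths:.1%}, weights: {cv_weights:.1%}")
```

In this made-up case the lengths have the larger standard deviation, yet the weights vary ten times more relative to their mean, which is the kind of comparison the unitless scale is meant to allow.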
Interpretation
The larger the coefficient of variation, the greater the spread in the data. For example, you are the quality control inspector at a milk bottling plant that bottles small and large containers of milk. You take a sample of each product and observe that the mean volume of the small containers is 1 cup with a standard deviation of 0. Although the standard deviation of the gallon container is five times greater than the standard deviation of the small container, their coefficients of variation support a different conclusion. In other words, although the large container has a greater standard deviation, the small container has much more variability relative to its mean.

For this ordered data, the first quartile Q1 is 9.

Histogram, with normal curve
A histogram divides sample values into many intervals and represents the frequency of data values in each interval with a bar.

Interpretation
Use a histogram to assess the shape and spread of the data. Histograms are best when the sample size is greater than

You can use a histogram of the data overlaid with a normal curve to examine the normality of your data. A normal distribution is symmetric and bell-shaped, as indicated by the curve. It is often difficult to evaluate normality with small samples. A probability plot is best for determining the distribution fit.

Individual value plot
An individual value plot displays the individual values in the sample. Each circle represents one observation. An individual value plot is especially useful when you have relatively few observations and when you also need to assess the effect of each observation.

Interpretation
Use an individual value plot to examine the spread of the data and to identify any potential outliers. Individual value plots are best when the sample size is less than

The individual value plot with right-skewed data shows wait times. The individual value plot with left-skewed data shows failure time data.
On an individual value plot, unusually low or high data values indicate possible outliers.

For this ordered data, the interquartile range is 8.

Interpretation
Use the interquartile range to describe the spread of the data. As the spread of the data increases, the IQR becomes larger.

Kurtosis
Kurtosis indicates how the peak and tails of a distribution differ from the normal distribution.

Interpretation
Use kurtosis to initially understand general characteristics about the distribution of your data.

Kurtosis value of 0
Normally distributed data establish the baseline for kurtosis. A kurtosis value of 0 indicates that the data follow the normal distribution perfectly. A kurtosis value that significantly deviates from 0 may indicate that the data are not normally distributed.

Positive kurtosis
A distribution that has a positive kurtosis value indicates that the distribution has heavier tails and a sharper peak than the normal distribution. For example, data that follow a t-distribution have a positive kurtosis value. The solid line shows the normal distribution, and the dotted line shows a distribution that has a positive kurtosis value.

Negative kurtosis
A distribution with a negative kurtosis value indicates that the distribution has lighter tails and a flatter peak than the normal distribution. For example, data that follow a beta distribution with first and second shape parameters equal to 2 have a negative kurtosis value. The solid line shows the normal distribution and the dotted line shows a distribution that has a negative kurtosis value.

Maximum
The maximum is the largest data value. In these data, the maximum is
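The sign of kurtosis described above can be checked with a small simulation. The sketch below is one possible illustration, not a prescribed method: it assumes the moment-based definition of excess kurtosis (fourth central moment divided by the squared variance, minus 3), and the sample size, the seed, and the choice of 10 degrees of freedom for the t-distribution are arbitrary assumptions.

```python
import random
import statistics

def excess_kurtosis(values):
    """Fourth central moment divided by the squared variance, minus 3."""
    mean = statistics.fmean(values)
    m2 = statistics.fmean([(x - mean) ** 2 for x in values])
    m4 = statistics.fmean([(x - mean) ** 4 for x in values])
    return m4 / m2 ** 2 - 3.0

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 20_000               # arbitrary sample size
df = 10                  # assumed degrees of freedom for the t-distribution

normal = [rng.gauss(0.0, 1.0) for _ in range(n)]
beta_2_2 = [rng.betavariate(2.0, 2.0) for _ in range(n)]
# A t-distributed value is a standard normal divided by sqrt(chi-square / df);
# a chi-square with df degrees of freedom is a gamma with shape df/2 and scale 2.
t_values = [
    rng.gauss(0.0, 1.0) / (rng.gammavariate(df / 2, 2.0) / df) ** 0.5
    for _ in range(n)
]

k_normal = excess_kurtosis(normal)    # close to 0
k_beta = excess_kurtosis(beta_2_2)    # negative: lighter tails
k_t = excess_kurtosis(t_values)       # positive: heavier tails
print(f"normal {k_normal:+.2f}, beta(2,2) {k_beta:+.2f}, t({df}) {k_t:+.2f}")
```

Sample kurtosis is a noisy estimate, especially for heavy-tailed data, so only the signs should be read from a run like this.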

Chapter 2 : ViSta: The Visual Statistics System

More explicitly, exactly half of the values in the group are smaller than the median, and the other half of the values in the group are greater than the median. If there is an odd number of measurements, the median is simply equal to the middle value of the group, when the values are arranged in ascending order. If there is an even number of measurements (as here), the median is equal to the mean of the two middle values (again, when the values are arranged in ascending order). For the "without compost" group, the median is equal to the mean of the 3rd and 4th values, which happen to be 4 and 5: (4 + 5) / 2 = 4.5. Notice that, by definition, three of the values (3, 4, and 4) are less than the median, and the other three values (5, 6, and 8) are greater than the median. What is the median of the "with compost" group?

The mode is the value that appears most frequently in the group of measurements. For the "without compost" group, the mode is 4, because that value is repeated twice, while all of the other values are only represented once. What is the mode of the "with compost" group?

It is entirely possible for a group of data to have no mode at all, or for it to have more than one mode. If all values occur with the same frequency (for example, if all values occur only once), then the group has no mode. If more than one value occurs at the highest frequency, then each of those values is a mode. Here is an example of a group of raw data with two modes: The two modes of this data set are 26 and 41, since each of those values appears twice, while all the other values appear only once. A data set with two modes is sometimes called "bimodal."

Mean, Median, or Mode: Which Measure Should I Use?
When would you choose to use one in preference to another? The following illustration shows the mean, median, and mode of the "without compost" data sample on a graph. The x-axis shows the number of leaves per plant. The height of each bar (y-axis) shows the number of plants that had a certain number of leaves. Compare the graph with the data in the table, and you will see that all of the raw data values are shown in the graph.

This graph shows why the mean, median, and mode are all called measures of central tendency. The data values are spread out across the horizontal axis of the graph, but the mean, median, and mode are all clustered towards the center. Each one is a slightly different measure of what happened "on average" in the experiment. The mode (4) shows which number of leaves per plant occurred most frequently. The mean (5) is the arithmetic average of all the data points.

In general, the mean is the descriptive statistic most often used to describe the central tendency of a group of measurements. Of the three measures, it is the most sensitive, because its value always reflects the contributions of each of the data values in the group. The median and the mode are less sensitive to "outliers" (data values at the extremes of a group). Imagine that, for the "without compost" group, the plant with the greatest number of leaves had 11 leaves, not 8. Both the median and the mode would remain unchanged. Check for yourself and confirm that this is true. The mean, however, would now be 5.5, since (3 + 4 + 4 + 5 + 6 + 11) / 6 = 5.5.

On the other hand, sometimes it is an advantage to have a measure of central tendency that is less sensitive to changes in the extremes of the data. For example, if your data set contains a small number of outliers at one extreme, the median may be a better measure of the central tendency of the data than the mean. If your results involve categories instead of continuous numbers, then the best measure of central tendency will probably be the most frequent outcome (the mode).
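The comparison above can be reproduced with Python's standard statistics module. This is only a sketch: the "without compost" values (3, 4, 4, 5, 6, 8) are the ones quoted in the text, while the bimodal sample is hypothetical, invented so that 26 and 41 each appear twice.

```python
import statistics

without_compost = [3, 4, 4, 5, 6, 8]   # leaves per plant, values quoted in the text
with_outlier = [3, 4, 4, 5, 6, 11]     # largest value changed from 8 to 11

for label, data in (("original", without_compost), ("with outlier", with_outlier)):
    print(
        label,
        "mean:", statistics.mean(data),
        "median:", statistics.median(data),
        "mode:", statistics.mode(data),
    )

# Hypothetical bimodal sample (illustration only): 26 and 41 each appear twice.
bimodal = [22, 26, 26, 30, 35, 41, 41, 47]
modes = statistics.multimode(bimodal)
print("modes:", modes)
```

Only the mean moves when the largest value changes, which is the sensitivity to extremes that the text describes.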
For example, imagine that you conducted a survey on the most effective way to quit smoking. A reasonable measure of the central tendency of your results would be the method that works most frequently, as determined from your survey. It is important to think about what you are trying to accomplish with descriptive statistics, not just use them blindly. If your data contain more than one mode, then summarizing them with a simple measure of central tendency such as the mean or median will obscure this fact.

First, what are you trying to describe? Second, what does your data look like? Then, the best measure of central tendency is

Groups, or classes of things. Survey results often fall in this category, such as, "What is the most effective way to quit smoking?"

Position on a ranking scale, such as: The median movie ranking in this survey was 2.

Measures on a linear scale.

The shape of this data is approximately the same on the left and the right side of the graph, so we call this symmetrical data. For symmetrical data, the mean is the best measurement of central tendency.

Notice how the data in this graph is non-symmetrical. The peak of the data is not centered, and the body mass values fall off more sharply on the left of the peak than on the right. When the peak is shifted like this to one side or the other, we call it skewed data. For skewed data, the median is the best choice to measure central tendency.

Notice how this graph has two peaks. We call data with two prominent peaks bimodal data. In the case of a bimodal distribution, you may have two populations, each with its own separate central tendency.

Notice how this graph has three peaks and lots of overlap between the tails of the peaks. We call this multimodal data. There is no single central tendency. It is easiest to describe data like this by referring to the graph.

In this case, the data is scattered all over the place. In some cases, this may indicate that you need to collect more data. In this case there is no central tendency.

Range, Variance, and Standard Deviation
Measures of central tendency describe the "average" of a data set. Another important quality to measure is the "spread" of a data set. For example, these two data sets both have the same mean (5): For which data set would you feel more comfortable using the average description of "5"? It would be nice to have another measure to describe the "spread" of a data set. Such a measure could let us know at a glance whether the values in a data set are generally close to or far from the mean. The descriptive statistics that measure the quality of scatter are called measures of dispersion. When added to the measures of central tendency discussed previously, measures of dispersion give a more complete picture of the data set.
We will discuss three such measurements: the range, the variance, and the standard deviation.

Range
The range of a data set is the simplest of the three measures. The range is defined by the smallest and largest data values in the set. The range gives only minimal information about the spread of the data, by defining the two extremes. It says nothing about how the data are distributed between those two endpoints. Two other related measures of dispersion, the variance and the standard deviation, provide a numerical summary of how much the data are scattered.

When printing this document, you may NOT modify it in any way. For any other use, please contact Science Buddies.
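To make the three measures of dispersion concrete, here is a short Python sketch. Both data sets are hypothetical, chosen only so that they share a mean of 5, and the helper name spread_summary is an assumption.

```python
import statistics

def spread_summary(values):
    """Range, sample variance and sample standard deviation of a data set."""
    return {
        "range": max(values) - min(values),
        "variance": statistics.variance(values),   # divides by n - 1
        "stdev": statistics.stdev(values),
    }

# Two hypothetical data sets with the same mean (5) but different spread.
tight = [4, 5, 5, 5, 6]
wide = [1, 2, 5, 8, 9]

tight_summary = spread_summary(tight)
wide_summary = spread_summary(wide)
print("tight:", tight_summary)
print("wide: ", wide_summary)
```

The means are identical, but every measure of dispersion is larger for the second set, so "5" describes the first set far better than the second.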

Chapter 3 : Summarizing Data

The term statistics refers to the analysis and interpretation of this numerical data. Psychologists use statistics to organize, summarize, and interpret the information they collect.

The median is another measure of central tendency. To get the median you have to order the data from lowest to highest. The median is the number in the middle. If the number of cases is odd, the median is the single middle value; for an even number of cases, the median is the average of the two numbers in the middle. The excel function is:

By age, there are more students 19 years old in the sample than any other group.

The sample variance measures the dispersion of the data from the mean. It is approximately the mean of the squared distances from the mean: the sum of the squared distances divided by n - 1, where n is the number of cases. It is calculated by:

Indicates how close the data is to the mean. The excel formula is:

Skewness, divided by its standard error (SE), gives a rough test for normality in the data. If it is positive, there is more data on the left side of the curve (right skewed; the median and the mode are lower than the mean). A negative value indicates that the mass of the data is concentrated on the right of the curve (the left tail is longer, left skewed; the median and the mode are higher than the mean). A normal distribution has a skew of 0. Skewness can also be estimated with the following function:

The common view of kurtosis argues that it measures the peak of a distribution. According to Peter Westfall, that view is not quite correct. High kurtosis may suggest the presence of outliers. Technically speaking, kurtosis focuses more on the tails of the distribution than the peak, so positive kurtosis indicates heavier tails, that is, more cases far from the mean (leptokurtic), and negative kurtosis indicates lighter tails, that is, fewer such cases (platykurtic). A normal distribution has a kurtosis of 0 given a correction of -3; otherwise it will have a kurtosis of 3.
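The link between the sign of the skewness and the position of the mean relative to the median can be sketched in Python. The function below assumes the plain moment-based coefficient (third central moment divided by the variance raised to the power 1.5); spreadsheet functions may apply a small-sample adjustment, so their values can differ slightly. All three samples are hypothetical.

```python
import statistics

def skewness(values):
    """Moment-based skewness: third central moment / variance ** 1.5."""
    mean = statistics.fmean(values)
    m2 = statistics.fmean([(x - mean) ** 2 for x in values])
    m3 = statistics.fmean([(x - mean) ** 3 for x in values])
    return m3 / m2 ** 1.5

# Hypothetical samples (illustration only).
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]       # long right tail
left_skewed = [-x for x in right_skewed]       # mirror image, long left tail
symmetric = [1, 2, 3, 4, 5]

for label, data in (
    ("right skewed", right_skewed),
    ("left skewed", left_skewed),
    ("symmetric", symmetric),
):
    print(
        f"{label}: skew={skewness(data):+.2f}, "
        f"mean={statistics.fmean(data):.2f}, median={statistics.median(data):.2f}"
    )
```

In the right-skewed sample the single large value pulls the mean above the median, and the sign flips for the mirrored sample.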
The excel function for kurtosis is:

Exploring data using pivot tables
To explore the data by groups you can sort the columns for the variables you want (for example gender, or major, or country, etc.). You can also use pivot tables. In step 2, select the range of all values as in the following picture: On the right side of the wizard layout you can see the list of all variables in the data. The wizard layout should look like this:

This is a crosstabulation between gender and major. Each cell represents the average SAT score for a student according to gender and major. For example, the average SAT score for female students with an econ major is in cell B5 in the picture, while the average for male students also with an econ major is in B6. The overall average SAT score for econ major students is in B7. In general, female students have an average SAT score in this sample of

For more information on pivot tables go to the following site.
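The same kind of crosstabulation can be sketched outside a spreadsheet. The Python snippet below groups scores by gender and major and averages each cell; the records and the function name pivot_mean are hypothetical and are not the data behind the picture referred to above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records (gender, major, SAT score), invented for illustration.
students = [
    ("Female", "Econ", 1800), ("Female", "Econ", 1900),
    ("Male", "Econ", 1700), ("Male", "Math", 1600),
    ("Female", "Math", 2000), ("Male", "Math", 1800),
]

def pivot_mean(records):
    """Average score for every (gender, major) pair, like a pivot table of means."""
    cells = defaultdict(list)
    for gender, major, score in records:
        cells[(gender, major)].append(score)
    return {key: mean(scores) for key, scores in cells.items()}

table = pivot_mean(students)
# Column total: the average over all econ majors, regardless of gender.
econ_overall = mean(score for _, major, score in students if major == "Econ")

for (gender, major), average in sorted(table.items()):
    print(f"{gender:<7}{major:<5}{average}")
print("Econ overall:", econ_overall)
```

Each dictionary entry plays the role of one cell in the pivot table, and the separate econ average corresponds to a column total.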

Chapter 4 : Interpret all statistics and graphs for Descriptive Statistics - Minitab Express

Statistics: the science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making more effective decisions. Descriptive statistics: methods of organizing, summarizing and presenting data in an informative way (describes raw data, chart/graph).

A specific quantile or percentile is a value in the data set that holds a specific percentage of the values at or below it. The median is the 50th percentile, the third quartile is the 75th percentile and the maximum is the 100th percentile. A box-whisker plot is a graphical display of these percentiles. The horizontal lines represent (from the top) the maximum, the third quartile, the median (also indicated by the dot), the first quartile and the minimum. A box-whisker plot is meant to convey the distribution of a variable at a quick glance.

Recall that in the full sample we determined that there were outliers both at the low and the high end (see Table). In Figure 12 the outliers are displayed as horizontal lines at the top and bottom of the distribution. At the low end of the distribution, there are 5 values that are considered outliers. At the high end of the distribution, there are 12 values that are considered outliers. The "whiskers" of the plot (boldfaced horizontal brackets) are the limits we determined for detecting outliers.

Figure 13 below shows side-by-side box-whisker plots of the distributions of weights, in pounds, for men and women in the Framingham Offspring Study. The figure clearly shows a shift in the distributions, with men having much higher weights. In fact, the 25th percentile of the weights in men is approximately equal to the 75th percentile in women. There are many outliers at the high end of the distribution among both men and women. There are two outlying low values among men. There are again many outliers in the distributions in both men and women.
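The limits used for detecting outliers are not given in this text, so the following Python sketch assumes one common convention: values more than 1.5 times the interquartile range below the first quartile or above the third quartile are flagged (often called Tukey fences). The weights are hypothetical, and quartile values depend on the interpolation method chosen.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return the (lower, upper) fences and the values that fall outside them."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [x for x in values if x < lower or x > upper]
    return (lower, upper), outliers

# Hypothetical weights in pounds (illustration only).
weights = [110, 150, 155, 160, 165, 170, 175, 180, 260]

fences, outliers = tukey_outliers(weights)
print("fences:", fences)
print("outliers:", outliers)
```

With these values the quartiles are 155 and 175, so the interquartile range is 20 and the fences sit at 125 and 205; one value falls below the lower fence and one above the upper fence.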
However, when taking height into account by comparing body mass index instead of comparing weights alone, we see that the most extreme outliers are among the women. Some statistical computing packages use the following to determine outliers:

Summary
The first important aspect of any statistical analysis is an appropriate summary of the key analytic variables. This involves first identifying the type of variable being analyzed. This step is extremely important, as the appropriate numerical and graphical summaries depend on the type of variable being analyzed. Variables are dichotomous, ordinal, categorical or continuous. The best numerical summaries for dichotomous, ordinal and categorical variables involve relative frequencies. The best numerical summaries for continuous variables include the mean and standard deviation or the median and interquartile range, depending on whether or not there are outliers in the distribution. The mean and standard deviation or the median and interquartile range summarize central tendency (also called location) and dispersion, respectively.

The best graphical summary for dichotomous and categorical variables is a bar chart, and the best graphical summary for an ordinal variable is a histogram. Both bar charts and histograms can be designed to display frequencies or relative frequencies, with the latter being the more popular display. Box-whisker plots provide a very useful and informative summary for continuous variables. Box-whisker plots are also useful for comparing the distributions of a continuous variable among mutually exclusive groups. The following table summarizes key statistics and graphical displays organized by variable type.

Chapter 5 : Statistics science blog.quintoapp.com

Descriptive statistics are useful for describing the basic features of data, for example, the summary statistics for the scale variables and measures of the data. In a research study with large data, these statistics may help us to manage the data and present it in a summary table.

The first step of data analysis is to accurately summarize all of this data, both graphically and numerically, so that we can understand what the data reveals. To be able to use and interpret the data correctly is essential to making informed decisions. For instance, when you see a survey of opinion about a certain TV program, you may be interested in the proportion of those people who indeed like the program. In this unit, you will learn about descriptive statistics, which are used to summarize and display data. After completing this unit, you will know how to present your findings once you have collected data.

For example, suppose you want to buy a new mobile phone with a particular type of camera. Suppose you are not sure about the prices of any of the phones with this feature, so you access a website that provides you with a sample data set of prices, given your desired features. Looking at all of the prices in a sample can sometimes be confusing. A better way to compare this data might be to look at the median price and the variation of prices. The median and variation are two of several ways that you can describe data. You can also graph the data so that it is easier to see what the price distribution looks like.

In this unit, you will study precisely this; namely, you will learn both numerical and graphical ways to describe and display your data. You will understand the essentials of calculating common descriptive statistics for measuring center, variability, and skewness in data. You will learn to calculate and interpret these measurements and graphs.
Descriptive statistics are, as their name suggests, descriptive. They do not generalize beyond the data considered. Descriptive statistics illustrate what the data shows. Numerical descriptive measures computed from data are called statistics. Numerical descriptive measures of the population are called parameters. Inferential statistics can be used to generalize the findings from sample data to a broader population. Completing this unit should take you approximately 22 hours. Elements of Probability and Random Variables Probabilities affect our everyday lives. In this unit, you will learn about probability and its properties, how probability behaves, and how to calculate and use it. You will study the fundamentals of probability and will work through examples that cover different types of probability questions. These basic probability concepts will provide a foundation for understanding more statistical concepts, for example, interpreting polling results. Though you may have already encountered concepts of probability, after this unit, you will be able to formally and precisely predict the likelihood of an event occurring given certain constraints. Probability theory is a discipline that was created to deal with chance phenomena. For instance, before getting a surgery, a patient wants to know the chances that the surgery might fail; before taking medication, you want to know the chances that there will be side effects; before leaving your house, you want to know the chance that it will rain today. Probability is a measure of likelihood that takes on values between 0 and 1, inclusive, with 0 representing impossible events and 1 representing certainty. The chances of events occurring fall between these two values. The skill of calculating probability allows us to make better decisions. We will also talk about random variables. A random variable describes the outcomes of a random experiment. 
A statistical distribution describes the numbers of times each possible outcome occurs in a sample. The values of a random variable can vary with each repetition of an experiment. Intuitively, a random variable, summarizing a certain chance phenomenon, takes on values with certain probabilities. A random variable can be classified as being either discrete or continuous, depending on the values it assumes. Suppose you count the number of people who go to a coffee shop between 4 p. In this case, the number of people is an example of a discrete random variable and the amount of waiting time they spend is an example of a continuous random variable. Completing this unit should take you approximately 25 hours.

Sampling Distributions
The concept of sampling distribution lies at the very foundation of statistical inference. It is best to introduce sampling distribution using an example here. Suppose you want to estimate a parameter of a population, say the population mean. There are two natural estimators: the sample mean and the sample median. In particular, for a sample of even size n, the median is the mean of the middle two numbers. But which one is better, and in what sense? This involves repeated sampling, and you want to choose the estimator that would do better on average. It is clear that different samples may give different sample means and medians; some of them may be closer to the truth than the others. Consequently, we cannot compare these two sample statistics or, in general, any two sample statistics on the basis of their performance with a single sample. Instead, you should recognize that sample statistics are themselves random variables; therefore, sample statistics should have frequency distributions by taking into account all possible samples. In this unit, you will study the sampling distribution of several sample statistics. This unit will show you how the central limit theorem can help to approximate sampling distributions in general. Completing this unit should take you approximately 15 hours.

Estimation with Confidence Intervals
In this unit, you will learn how to use the central limit theorem and confidence intervals, the latter of which enables you to estimate unknown population parameters. The central limit theorem provides us with a way to make inferences from samples of non-normal populations. This theorem states that given any population with finite variance, as the sample size increases, the sampling distribution of the means approaches a normal distribution. This powerful theorem allows us to assume that given a large enough sample, the sampling distribution of the mean will be approximately normal. You will also learn about confidence intervals, which provide you with a way to estimate a population parameter. Instead of giving just a one-number estimate of a variable, a confidence interval gives a range of likely values for it.
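The behaviour described by the central limit theorem can be seen in a small simulation. The sketch below is an illustration under stated assumptions: the population is exponential with mean 1 (clearly non-normal), the sample size of 50, the 2000 repetitions and the seed are arbitrary, and the interval uses the large-sample normal multiplier 1.96 for roughly 95% confidence.

```python
import math
import random
import statistics

rng = random.Random(42)   # fixed seed so the run is repeatable
SAMPLE_SIZE = 50          # arbitrary choices for the illustration
REPEATS = 2000

def draw_sample(size):
    """One sample from an exponential population with mean 1 and standard deviation 1."""
    return [rng.expovariate(1.0) for _ in range(size)]

# Sampling distribution of the mean: one mean per repeated sample.
sample_means = [statistics.fmean(draw_sample(SAMPLE_SIZE)) for _ in range(REPEATS)]
center = statistics.fmean(sample_means)   # should be near the population mean, 1
spread = statistics.stdev(sample_means)   # should be near 1 / sqrt(SAMPLE_SIZE)

# Approximate 95% confidence interval for the mean from a single sample.
one_sample = draw_sample(SAMPLE_SIZE)
estimate = statistics.fmean(one_sample)
half_width = 1.96 * statistics.stdev(one_sample) / math.sqrt(SAMPLE_SIZE)
interval = (estimate - half_width, estimate + half_width)

print(f"mean of sample means: {center:.3f}")
print(f"stdev of sample means: {spread:.3f} (theory {1 / math.sqrt(SAMPLE_SIZE):.3f})")
print(f"95% interval from one sample: {interval[0]:.3f} to {interval[1]:.3f}")
```

Although individual exponential values are strongly skewed, the means of repeated samples cluster symmetrically around the population mean, with a spread that shrinks as the sample size grows.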
This is useful, because point estimates will vary from sample to sample, so an interval with certain confidence level is better than a single point estimate. After completing this unit, you will know how to construct such confidence intervals and the level of confidence. Completing this unit should take you approximately 10 hours. Hypothesis Test A hypothesis test involves collecting and evaluating data from a sample. The data gathered and evaluated is then used to make a decision as to whether or not the data supports the claim that is made about the population. This unit will teach you how to conduct hypothesis tests and how to identify and differentiate between the errors associated with them. Many times, you need answers to questions in order to make efficient decisions. The process of hypothesis testing is a way of decision-making. In this unit, you will learn to establish your assumptions through null and alternative hypotheses. The null hypothesis is the hypothesis that is assumed to be true and the hypothesis you hope to nullify, while the alternative hypothesis is the research hypothesis that you claim to be true. This means that you need to conduct the correct tests to be able to accept or reject the null hypothesis. You will learn how to compare sample characteristics to see whether there is enough data to accept or reject the null hypothesis. Completing this unit should take you approximately 12 hours. Linear Regression In this unit, we will discuss situations in which the mean of a population, treated as a variable, depends on the value of another variable. One of the main reasons why we conduct such analyses is to understand how two variables are related to each other. The most common type of relationship is a linear relationship. For example, you may want to know what happens to one variable when you increase or decrease the other variable. 
You want to answer questions such as, "Does one variable increase as the other increases, or does the variable decrease?" In this unit, you will also learn to measure the degree of a relationship between two or more variables. Both correlation and regression are measures for comparing variables. Correlation quantifies the strength of a relationship between two variables and is a measure of existing data. On the other hand, regression is the study of the strength of a linear relationship between an independent and dependent variable and can be used to predict the value of the dependent variable when the value of the independent variable is known.
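The difference between correlation and regression can be sketched with the usual least-squares formulas. The Python snippet below computes the slope, intercept and correlation coefficient by hand and then predicts a new value; the data points and the function name fit_line are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept, plus the Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Hypothetical paired observations (illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept, r = fit_line(xs, ys)
predicted = intercept + slope * 6   # regression used to predict y at x = 6
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.4f}, y(6)={predicted:.2f}")
```

The correlation coefficient summarizes how tightly the existing points follow a line, while the fitted slope and intercept are what make a prediction at a new value possible.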

Chapter 6 : Interpreting quartiles (practice) Khan Academy

Part Two Preliminary Skills Needed for Conducting Research
You have already familiarized yourself with the major sources and locations of published research. That was the first phase in learning to use the library profitably.

Remember that the quality of the output depends on the quality of input. Garbage in, garbage out. Data scientists spend much of their time on data preparation before they jump into modelling, because understanding, generating and selecting useful features impacts model performance. It also helps data scientists check the assumptions required for fitting models. Depending on the size and type of data, understanding and interpreting data sets can be challenging. What can be done? Use different exploratory data analysis and visualization techniques to get a better understanding. This includes summarizing main data set characteristics, finding representative or critical points and discovering relevant features. After gaining an overall understanding of the data set, you can think about which observations and features to use in modeling.

Summary statistics with visualization
Summary statistics help to analyze information about the sample data. They describe the continuous (interval) and discrete (nominal) variables in the data set. Analyze those variables individually or together because they can help find: The distribution of feature values across different features can be compared, as can feature statistics for training and test data sets. This helps uncover differences between them. Be careful about summary statistics. Excessive trust in summary statistics can hide problems in the data set. Consider using additional techniques for a full understanding.

Example-based explanations
Assume the data set has millions of observations with thousands of variables.
One approach to solve this problem is to use example-based explanations: techniques that can help pick important observations and dimensions. They can help interpret highly complex big data sets with different distributions. The techniques available to solve this problem include finding observations and dimensions to characterize, to criticize and to distinguish the data set groups.

As humans, we usually use representative examples from the data for categorization and decision making. Those examples, usually called prototypes, are observations that best describe dataset categories. They can be used to interpret categories, since it is hard to make interpretations using all the observations in a certain category. Finding prototypes alone is not sufficient to understand the data, since it overgeneralizes. We need to show exceptions (criticisms) to the rules. Those observations can be considered as minority observations very different from the prototype, but still belonging in the same category. In the illustrations below, robot pictures in each category consist of robots with different head and body shapes. Robots in costumes can also belong to one of those categories, although they can be very different from a typical robot picture. Those pictures are needed to understand the data, since they are important minorities.

Finding representatives may not always be enough. If the number of features is high, it will still be hard to understand the selected observations. This is because humans cannot comprehend long and complicated explanations. The explanations need to be simple. The most important features for those selected observations must be considered. Subspace representation is a solution to that problem. Using the prototype and subspace representation helps in interpretability. For that, find distinguishing dimensions in the data.
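To make the idea of prototypes and criticisms concrete, here is a deliberately simplified Python sketch. It is a stand-in and not the method of any particular paper or library: the prototype is taken to be the medoid of a category (the member with the smallest total distance to all other members), and the criticism is the member farthest from that prototype. The two-dimensional feature vectors are hypothetical.

```python
import math

def prototype_and_criticism(points):
    """Medoid of the points as prototype; the point farthest from it as criticism."""
    def total_distance(candidate):
        return sum(math.dist(candidate, other) for other in points)

    prototype = min(points, key=total_distance)
    criticism = max(points, key=lambda p: math.dist(p, prototype))
    return prototype, criticism

# Hypothetical (head shape, body shape) features for one category of robot pictures.
robots = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (4.0, 3.5)]

prototype, criticism = prototype_and_criticism(robots)
print("prototype:", prototype)
print("criticism:", criticism)
```

The tightly clustered points stand for typical pictures, and the distant one plays the role of a robot in costume: it belongs to the same category but looks very different from the prototype.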
A mind-the-gap model (MGM) combines extractive and selective approaches and reports a global set of distinguishable dimensions to assist with further exploration. In the above example, by looking at the features extracted from different robot pictures, we can say that the shape of the head is a distinguishing dimension.

Embedding techniques

An embedding is a mapping from discrete values, such as words or observations, to vectors. Different embedding techniques help visualize a lower-dimensional representation of the data. Embeddings can have hundreds of dimensions. The common way to understand them is to project them into two or three dimensions. They are useful for many things:

Use them to explore local neighborhoods. It may help to explore the closest points to a given point to make sure that they are related to each other. Select those points and do further analysis.
Use them to understand the behavior of a model.
Use them to analyze the global structure, seeking groups of points. This helps find clusters and outliers.

There are many methods for obtaining embeddings, including:

Principal component analysis (PCA): This is an effective algorithm for reducing the dimensionality of data, especially if strong linear relationships exist among the variables. It can be used to highlight the variations and eliminate dimensions. The remaining principal components account for trivial amounts of variance and should not be retained for interpretability and analysis.

t-distributed stochastic neighbor embedding (t-SNE): It is nonlinear and nondeterministic, and it allows the creation of 2D or 3D projections. t-SNE finds structures that other methods may miss. While preserving local structure, it may distort global structure. If more information is needed about t-SNE, check out a great article at distill.

Topological data analysis (TDA)

Topology studies the geometric features that are preserved when we deform an object without tearing it. Topological data analysis provides tools to study the geometric features of data using topology. This includes detecting and visualizing features, and the statistical measures related to them. Geometric features can be distinct clusters, loops, and tendrils in the data. Mapper algorithms in TDA are useful for data visualization and clustering. A topological network of a data set can be created in which each node is a group of similar observations and an edge connects two nodes if they have an observation in common. If there is a loop in this network, the conclusion is that a pattern occurs periodically.

Conclusion

When it comes to understanding and interpreting data, there is no one solution that fits all. Pick the one that best meets your need.

Ilknur Kaynar Kabul is a scientist and manager at SAS, working at the intersection of computer science, statistics and optimization. Her work involves building scalable machine learning algorithms that help solve big data problems. She holds a doctorate in computer science from the University of North Carolina.
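As a rough illustration of the principal component approach to embeddings described above, the sketch below estimates the first principal direction of a small data set with power iteration and projects each observation onto it. It is a minimal standard-library sketch, not a production implementation: the function names and the sample data are hypothetical, and a real analysis would normally rely on a numerical library. t-SNE and TDA are not shown.

```python
import math


def first_principal_component(rows, iterations=200):
    """Estimate the leading principal direction of `rows` by power iteration.

    Returns (means, direction), where direction is a unit-length list.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v


def project(rows, means, direction):
    """One-dimensional embedding: score of each row on the given direction."""
    return [sum((r[j] - means[j]) * direction[j] for j in range(len(direction)))
            for r in rows]


# Hypothetical data with a strong linear relationship (y is roughly 2x).
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0), (5.0, 10.1)]
means, direction = first_principal_component(data)
scores = project(data, means, direction)
print("direction:", direction)
print("scores:", scores)
```

Because the two variables are almost perfectly linearly related, a single component captures nearly all of the variation, which is the situation in which dropping the remaining components loses little.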

11 Chapter 7 : Course: MA Introduction to Statistics The field of statistics provides principles and methods for collecting, summarizing, and analyzing data, and for interpreting the results. You use statistics to describe data and make inferences. Then, you use the inferences to improve processes and products. Then, you examine the interval plot, individual value plot, and boxplot together to assess the equality of the means. Interpret the residual plots Use residual plots, which are available with many statistical commands, to verify statistical assumptions. Normal Probability Plot Use this plot to detect nonnormality. Points that approximately follow a straight line indicate that the residuals are normally distributed. Histogram Use this plot to detect multiple peaks, outliers, and nonnormality. Look for a normal histogram, which is approximately symmetric and bell-shaped. Versus Fits Use this plot to detect nonconstant variance, missing higher-order terms, and outliers. Look for residuals that are scattered randomly around zero. Versus Order Use this plot to detect the time dependence of the residuals. Inspect the plot to ensure that the residuals display no obvious pattern. For the shipping data, the four-in-one residual plots indicate no violations of statistical assumptions. Note In Minitab, you can display each of the residual plots on a separate page. Interpret the interval plot, individual value plot, and boxplot Examine the interval plot, individual value plot, and boxplot. Each graph indicates that the delivery time varies by shipping center, which is consistent with the histograms from the previous chapter. The boxplot for the Eastern shipping center has an asterisk. The asterisk identifies an outlier. This outlier is an order that has an unusually long delivery time. Examine the interval plot again. Hold the pointer over the points on the graph to view the means. 
The interval plot shows that the Western shipping center has the fastest mean delivery time 2. The Tukey confidence intervals show the following pairwise comparisons:

Eastern shipping center mean minus Central shipping center mean
Western shipping center mean minus Central shipping center mean
Western shipping center mean minus Eastern shipping center mean

Hold the pointer over the points on the graph to view the middle, upper, and lower estimates. The interval for the Eastern minus Central comparison is 0. That is, the mean delivery time of the Eastern shipping center minus the mean delivery time of the Central shipping center is between 0. You interpret the other Tukey confidence intervals similarly. Also, notice the dashed line at zero. If an interval does not contain zero, the corresponding means are significantly different. Therefore, all the shipping centers have significantly different average delivery times.

Minitab provides detailed information about the Session window output and graphs for most statistical commands. On the Standard toolbar, click the Help button. Save all your work in a Minitab project. Navigate to the folder that you want to save your files in. In File name, enter.
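The rule for reading a confidence interval for a difference of means (an interval that does not contain zero indicates a significant difference) can be written as a one-line check. The sketch below is only an illustration: the function name is hypothetical, and the interval endpoints are made-up values, not the results for the shipping data.

```python
def differs_significantly(lower, upper):
    """True when the confidence interval (lower, upper) does not contain zero."""
    return not (lower <= 0.0 <= upper)


# Hypothetical intervals for differences of means (NOT the shipping data).
intervals = {
    "group A minus group B": (0.5, 1.5),    # entirely above zero
    "group C minus group B": (-2.0, -1.0),  # entirely below zero
    "group C minus group D": (-0.3, 0.4),   # straddles zero
}

for name, (lower, upper) in intervals.items():
    verdict = "different" if differs_significantly(lower, upper) else "not shown to differ"
    print(f"{name}: ({lower}, {upper}) -> {verdict}")
```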

12 Chapter 8 : Summarizing and Interpreting Data Sets Worksheets Join Curt Frye for an in-depth discussion in this video, Summarizing data using descriptive statistics, part of Excel Business Statistics. Descriptive statistics Descriptive statistics are tabular, graphical, and numerical summaries of data. The purpose of descriptive statistics is to facilitate the presentation and interpretation of data. Most of the statistical presentations appearing in newspapers and magazines are descriptive in nature. Univariate methods of descriptive statistics use data to enhance the understanding of a single variable; multivariate methods focus on using statistics to understand the relationships among two or more variables. To illustrate methods of descriptive statistics, the previous example in which data were collected on the age, gender, marital status, and annual income of individuals will be examined. Tabular methods The most commonly used tabular summary of data for a single variable is a frequency distribution. A frequency distribution shows the number of data values in each of several nonoverlapping classes. Another tabular summary, called a relative frequency distribution, shows the fraction, or percentage, of data values in each class. The most common tabular summary of data for two variables is a cross tabulation, a two-variable analogue of a frequency distribution. For a qualitative variable, a frequency distribution shows the number of data values in each qualitative category. For instance, the variable gender has two categories: Thus, a frequency distribution for gender would have two nonoverlapping classes to show the number of males and females. A relative frequency distribution for this variable would show the fraction of individuals that are male and the fraction of individuals that are female. Constructing a frequency distribution for a quantitative variable requires more care in defining the classes and the division points between adjacent classes. 
For instance, if the age data of the example above ranged from 22 to 78 years, the following six nonoverlapping classes could be used: 20–29, 30–39, 40–49, 50–59, 60–69, and 70–79. A frequency distribution would show the number of data values in each of these classes, and a relative frequency distribution would show the fraction of data values in each. A cross tabulation is a two-way table with the rows of the table representing the classes of one variable and the columns of the table representing the classes of another variable. To construct a cross tabulation using the variables gender and age, gender could be shown with two rows, male and female, and age could be shown with six columns corresponding to the age classes 20–29, 30–39, 40–49, 50–59, 60–69, and 70–79. The entry in each cell of the table would specify the number of data values with the gender given by the row heading and the age given by the column heading. Such a cross tabulation could be helpful in understanding the relationship between gender and age.

Graphical methods

A number of graphical methods are available for describing data. A bar graph is a graphical device for depicting qualitative data that have been summarized in a frequency distribution. Labels for the categories of the qualitative variable are shown on the horizontal axis of the graph. A bar above each label is constructed such that the height of each bar is proportional to the number of data values in the category. A bar graph of the marital status for the individuals in the above example is shown in Figure 1. There are 4 bars in the graph, one for each class. A pie chart is another graphical device for summarizing qualitative data. The size of each slice of the pie is proportional to the number of data values in the corresponding class. A pie chart for the marital status of the individuals is shown in Figure 2.
A histogram is the most common graphical presentation of quantitative data that have been summarized in a frequency distribution. The values of the quantitative variable are shown on the horizontal axis. A rectangle is drawn above each class such that the base of the rectangle is equal to the width of the class interval and its height is proportional to the number of data values in the class.
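The tabular summaries described in this chapter (frequency distribution, relative frequency distribution, and cross tabulation) can be sketched in a few lines. The records below are hypothetical, and the helper name `age_class` is an assumption made for this illustration; the ten-year classes follow the age example in the text.

```python
from collections import Counter


def age_class(age):
    """Map an age to its ten-year class label, e.g. 34 -> '30-39'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


# Hypothetical (gender, age) records.
people = [("male", 22), ("female", 35), ("female", 38), ("male", 41),
          ("female", 57), ("male", 63), ("female", 78), ("male", 29)]

# Frequency distribution: number of data values in each class.
frequency = Counter(age_class(age) for _, age in people)

# Relative frequency distribution: fraction of data values in each class.
relative = {label: count / len(people) for label, count in frequency.items()}

# Cross tabulation: count for each (gender, age class) cell.
crosstab = Counter((gender, age_class(age)) for gender, age in people)

for label in sorted(frequency):
    print(label, frequency[label], relative[label])
```

The class counts in `frequency` are also the bar heights a histogram of these ages would use, since each class has the same width.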

13 Chapter 9 : Descriptive Statistics Excel/Stata

Summarize, represent, and interpret data on two categorical and quantitative variables. Summarize categorical data for two categories in two-way frequency tables. Interpret relative frequencies in the context of the data (including joint, marginal, and conditional relative frequencies).

Science
Science is based on the empirical method for making observations, that is, for systematically obtaining information. It consists of methods for making observations.

Observations
Observations are the basic empirical "stuff" of science.

Statistics
Statistics is a set of methods and rules for organizing, summarizing, and interpreting information. The methods and rules enable scientific researchers to describe and analyze the observations they have made. Statistical methods are tools for science. Science consists of methods for making observations; Statistics consists of methods for describing and analyzing the observations. We will also refer to populations of scores.

Samples
A sample is a set of individuals selected from a population, usually intended to represent the population in a study. We will also refer to samples of scores. The data we gathered in class are a "sample" of scores obtained with a sample of individuals. The population we sampled from is the population of UNC undergraduates.

Parameters
A Parameter is a value, usually a numerical value, that describes a Population. A Parameter may be obtained from a single measurement, or it may be derived from a set of measurements from the Population.

Statistics
A Statistic is a value, usually a numerical value, that describes a Sample. A Statistic may be obtained from a single measurement, or it may be derived from a set of measurements from the Sample. Here are some "statistics" computed from our sample of data:

Data
Data (plural) are measurements or observations. A data set is a collection of measurements or observations.
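The joint, marginal, and conditional relative frequencies mentioned at the start of this chapter can be computed directly from a two-way frequency table. The sketch below assumes a table stored as a nested dictionary of counts; the function name, the category labels, and the counts are all hypothetical.

```python
def relative_frequencies(table):
    """Compute joint, row-marginal, and row-conditional relative frequencies.

    `table` maps each row category to a dict of column category -> count.
    """
    total = sum(sum(row.values()) for row in table.values())
    # Joint: cell count divided by the grand total.
    joint = {(r, c): n / total
             for r, row in table.items() for c, n in row.items()}
    # Marginal (by row): row total divided by the grand total.
    marginal = {r: sum(row.values()) / total for r, row in table.items()}
    # Conditional (column given row): cell count divided by its row total.
    conditional = {(r, c): n / sum(row.values())
                   for r, row in table.items() for c, n in row.items()}
    return joint, marginal, conditional


# Hypothetical two-way frequency table of counts.
table = {
    "male": {"yes": 10, "no": 30},
    "female": {"yes": 20, "no": 40},
}
joint, marginal, conditional = relative_frequencies(table)
print(joint[("male", "yes")], marginal["male"], conditional[("male", "yes")])
```

The joint frequencies sum to 1 over the whole table, while the conditional frequencies sum to 1 within each row, which is the main thing to keep straight when interpreting them.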
A datum (singular) is a single measurement or observation and is commonly called a data-value, a score, or a raw score.

Descriptive Statistics
Descriptive Statistics are statistical procedures used to summarize, organize, and simplify data. The term also names the branch of statistical activity that focuses on the use of such procedures. These procedures are the focus of chapters 1 through 5.

Statistical Visualization
Recently developed computational statistical procedures used to visually summarize, organize, and simplify data. The statistical system we are using is named ViSta, for "Visual Statistics", because it includes statistical visualization. A statistical visualization of our data is shown below. Higher satisfaction is associated with higher GPA.

Exploratory Statistics
The process of exploring data by using descriptive and visualization methods to "see what the data seem to say". The branch of statistics that focuses on "seeing what the data seem to say" (Tukey, 19??).

Inferential Statistics
Inferential Statistics consist of techniques that allow us to study samples and then to make generalizations about the populations from which the samples were selected. These procedures are the focus of chapters 8 through the remainder of the text. The groundwork for statistical inference is laid in chapters 6 and 7.

Sampling Error
Sampling error is the discrepancy, or amount of error, that exists between a sample statistic and the corresponding population parameter.

The Scientific Method and the Design of Experiments
Science attempts to discover orderliness in the universe, that is, to discover regularity in changes. Something that can change is called a variable.

Variables
A variable is a characteristic or condition that changes or has different values for different individuals. In the data we gathered, the variables include "Gender", "Age", etc. A constant is a characteristic or condition that does not vary and is the same for every individual.
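Sampling error, as defined above, can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions: the population is simulated (hypothetical GPA-like scores), the population mean plays the role of the parameter, and the mean of a random sample of 50 scores plays the role of the statistic.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 scores.
population = [random.gauss(3.0, 0.5) for _ in range(10_000)]
parameter = statistics.mean(population)   # describes the population

# A sample of 50 individuals drawn from that population.
sample = random.sample(population, 50)
statistic = statistics.mean(sample)       # describes the sample

# Sampling error: discrepancy between the statistic and the parameter.
sampling_error = statistic - parameter
print(f"parameter = {parameter:.3f}")
print(f"statistic = {statistic:.3f}")
print(f"sampling error = {sampling_error:.3f}")
```

Drawing a different sample changes the statistic, and therefore the sampling error, while the parameter stays fixed.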
The Correlational Method
The scientific method in which two or more variables are observed without manipulation. The correlational method cannot establish cause-and-effect: Correlation is not causation! The data we gathered are an example of the correlational method.

The Experimental Method
The scientific method that can establish a cause-and-effect relationship between two or more variables. The researcher manipulates one variable and observes what happens to the other. More than one variable may be manipulated or observed. To correctly establish cause-and-effect, the researcher must exercise some control over the experimental situation to ensure that some other variable(s) do(es) not influence the relationship being watched. The experimental conditions must be identical, other than differing on values of the manipulated variable.

Independent Variable (also called the predictor variable)
The variable that is manipulated by the researcher.

Dependent Variable (also called the response variable)
The variable that is observed by the researcher for changes in order to assess the effect of the treatment. The treatment is the manipulation of the predictor variable.

Confounding Variable
An uncontrolled variable that is unintentionally allowed to vary systematically with the independent variable. Confounds the results (bad, bad, bad!).

The control group
This is a condition of the independent variable that does not receive the experimental treatment. Usually, the control group receives either no treatment or a placebo treatment.

The experimental group
This is a condition of the independent variable that does receive an experimental treatment. There may be several experimental groups.

The Quasi-Experimental Method
Examines differences between pre-existing groups of subjects such as men vs.

Hypotheses
A hypothesis is a prediction about the outcome of an experiment. In experimental research, a hypothesis makes a prediction about how the manipulation of the independent (predictor) variable will affect the dependent (response) variable.

Measurement
Data are measurements of observations, which involve categorizing, ordering, or using numbers to characterize amount. Several levels of measurement are involved. These in turn determine what statistics can be computed. Measurements may also be discrete or continuous.

Scales (Levels) of Measurement

Nominal
The nominal level of measurement labels observations so that they fall into different categories. Football jersey numbers and home street addresses are common examples. In ViSta, nominal variables are called "Category" variables.

Ordinal
The ordinal level of measurement consists of categories that are ordered in a sequence. Order of finish in a race is a common example. In ViSta, ordinal variables are called "Ordinal" variables.
Interval
The interval level of measurement consists of ordered categories where all of the categories are intervals of exactly the same size. Temperature is a common example. Here, equal differences between numbers reflect equal differences in magnitude of the observed variable.

Ratio
The ratio level of measurement is an interval scale with an absolute zero point. Length and weight are common examples. Here, ratios of numbers reflect ratios of variable magnitude. In ViSta, interval and ratio variables are called "Numeric" variables.

Discrete and Continuous Variables

Discrete
A discrete variable has separate, indivisible categories. No values can exist in between two neighboring categories.

Continuous
A continuous variable has an infinite number of possible values falling between any two observed values.

Mathematical Notation
In statistical calculations you will constantly be required to add a set of values to find a specific total. We use algebraic expressions to represent the values being added. For example, X means "scores on a variable." Thus, we write $\sum X$ for the sum of the scores. Note the order of operations:

All calculations within parentheses are done first.
Squaring, multiplying, and dividing are done second, and should be completed in order from left to right.
Adding and subtracting (including summation) are third, and should be completed in order from left to right.

The following term, which is called the "squared sum", works as shown: $(\sum X)^2$, where the scores are added first and the total is then squared. Because of the order of operations, the following term, which is called "the sum of squares", works as shown: $\sum X^2$, where each score is squared first and the squared values are then added. Consider how the following summation equation works: On the other hand, the next summation equation works differently: Finally, consider how this last summation equation works:
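The difference between the "squared sum" and the "sum of squares" follows directly from the order of operations, and a short numeric check makes it concrete. The scores below are hypothetical values of X chosen for this illustration.

```python
scores = [1, 2, 3, 4]  # hypothetical values of X

# Sum of X: add the scores.
sum_x = sum(scores)

# Squared sum, (sum of X) squared: add the scores first, then square the total.
squared_sum = sum(scores) ** 2

# Sum of squares, sum of X squared: square each score first, then add.
sum_of_squares = sum(x ** 2 for x in scores)

print(sum_x)           # 10
print(squared_sum)     # 100
print(sum_of_squares)  # 30
```

The two quantities agree only in special cases (for example, a single score), so the placement of the parentheses matters.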
