DOWNLOAD PDF TAKE FULL ADVANTAGE OF DESCRIPTIVE STATISTICS

Claribel Stephens
5 years ago

Chapter 1 : Understanding Descriptive and Inferential Statistics

The above 8 descriptive statistics examples, problems and solutions are simple but aim to help you understand descriptive data better. As you saw, descriptive statistics are used just to describe some basic features of the data in a study.

Statistics
Statistics is the mathematics of the collection, organization, and interpretation of numerical data, especially the analysis of population characteristics by inference from sampling.

Descriptive Statistics
Descriptive statistics include the numbers, tables, charts, and graphs used to describe, organize, summarize, and present raw data.

Advantages Of Descriptive Statistics
Descriptive statistics can:
- be essential for arranging and displaying data
- form the basis of rigorous data analysis
- be much easier to work with, interpret, and discuss than raw data
- help examine the tendencies, spread, normality, and reliability of a data set
- be rendered both graphically and numerically
- include useful techniques for summarizing data in visual form
- form the basis for more advanced statistical methods

Disadvantages Of Descriptive Statistics
Descriptive statistics can:
- be misused, misinterpreted, and incomplete
- be of limited use when samples and populations are small
- demand a fair amount of calculation and explanation
- fail to fully specify the extent to which non-normal data are a problem
- offer little information about causes and effects
- be dangerous if not analyzed completely

Mean
The mean is the average, the most common measure of central tendency. The mean of a population is designated by the Greek letter mu. The mean of a sample is designated by the symbol x-bar. The mean may not always be the best measure of central tendency, especially if the data are skewed. For example, average income is often misleading, since a few individuals with extremely high incomes can raise the overall average.
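The point about skewed income can be made concrete with a small sketch. The income figures below are invented purely for illustration; the only library call used is `statistics.mean` from the Python standard library.

```python
from statistics import mean

# Hypothetical incomes: nine modest earners and one very high earner.
incomes = [30_000] * 9 + [1_000_000]

mean_income = mean(incomes)

# Fraction of people whose income is below the mean.
share_below_mean = sum(x < mean_income for x in incomes) / len(incomes)

print(f"mean income: {mean_income:,.0f}")
print(f"share of people below the mean: {share_below_mean:.0%}")
```

In this made-up data set a single large value pulls the mean well above what nine out of ten people earn, which is why the mean can mislead when data are skewed.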
Median
The median is the value in the middle of the data set when the measurements are arranged in order of magnitude. For example, if 11 individuals were weighed and their weights arranged in ascending or descending order, the sixth value is the median, since five values fall above and five fall below the sixth value.

Mode
The mode is the value occurring most often in the data. If the largest group of people in a sample measuring age were 25 years old, then 25 would be the mode. The mode is the least commonly used measure of central tendency, particularly in large data sets. However, the mode is still important for describing a data set, especially when more than one value occurs frequently.

Variance
Variance is the sum of the squared differences between each observation and the mean, divided by the number of observations. (For a sample, the divisor is usually the sample size minus one.) For populations, it is designated by the square of the Greek letter sigma (sigma squared). For samples, it is designated by the square of the letter s (s squared). Variance is used less frequently than standard deviation as a measure of dispersion. Variance can be used when we want to quickly compare the variability of two or more sets of interval data.

Standard Deviation
Standard deviation is the positive square root of the variance, i.e., sigma for populations and s for samples. It can be read as a typical distance between observed values and the mean. The standard deviation is used when expressing dispersion in the same units as the original measurements. It is used more commonly than the variance in expressing the degree to which data are spread out.

Coefficient Of Variation
The coefficient of variation measures relative dispersion by dividing the standard deviation by the mean and then multiplying by 100 to render a percent. This number is designated as V for populations and v for samples, and it compares the variability of two data sets better than the standard deviation does.
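The dispersion measures above can be sketched with Python's standard `statistics` module. The data values are invented for illustration. Note that `pvariance` and `pstdev` divide by the number of observations (population formulas), while `variance` and `stdev` divide by the sample size minus one.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

mu = mean(data)

pop_var = pvariance(data)   # population variance: divide by N
samp_var = variance(data)   # sample variance: divide by n - 1
pop_sd = pstdev(data)       # square root of the population variance
samp_sd = stdev(data)       # square root of the sample variance

# Coefficient of variation: standard deviation relative to the mean, in percent.
cv_percent = pop_sd / mu * 100

print(f"mean={mu}, population variance={pop_var}, sample variance={samp_var:.3f}")
print(f"population sd={pop_sd}, sample sd={samp_sd:.3f}, CV={cv_percent:.1f}%")
```

The sample versions are slightly larger than the population versions for the same numbers, because the divisor is smaller.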
Range
The range measures the distance between the lowest and highest values in the data set and generally describes how spread out the data are. For example, after an exam, an instructor may tell the class the lowest score (say, 65) and the highest score; the range is the difference between the two.

Percentiles
Percentiles measure the percentage of data points which lie below a certain value when the values are ordered. For example, suppose a student's scorecard informs her that she is in the 90th percentile of students taking the exam. Thus, 90 percent of the students scored lower than she did.

Quartiles
Quartiles group observations such that 25 percent are arranged together according to their values. The top 25 percent of values are referred to as the upper quartile. The lowest 25 percent of the values are referred to as the lower quartile. Often the two quartiles on either side of the median are reported together as the interquartile range.

Measures Of Skew
Measures of skew describe how concentrated data points are at the high or low end of the scale of measurement. Skew is designated by the symbols Sk for populations and Sk for samples. Skew indicates the degree of symmetry in a data set. The more skewed the distribution, the higher the variability of the measures, and the higher the variability, the less reliable are the data. If the distribution is skewed left (negative skew), the mean lies to the left of the median and the mode.

Measures Of Kurtosis
Measures of kurtosis describe how concentrated data are around a single value, usually the mean. Thus, kurtosis assesses how peaked or flat the data distribution is. The more peaked or flat the distribution, the less normally distributed the data. And the less normal the distribution, the less reliable the data. Mesokurtic distributions are, like the normal bell curve, neither peaked nor flat. Platykurtic distributions are flatter than the normal bell curve. Leptokurtic distributions are more peaked than the normal bell curve.

Inferential Statistics
Inferential statistics pertain to the procedures used to make forecasts, estimates, or judgments about a large set of data on the basis of the statistical characteristics of a smaller set (a sample).

Inferential Statistics Are Most Often Used To
Inferential statistics are frequently used to answer cause-and-effect questions and make predictions.

Advantages Of Inferential Statistics
Inferential statistics can:
- provide more detailed information than descriptive statistics
- yield insight into relationships between variables
- reveal causes and effects and make predictions
- generate convincing support for a given theory
- be generally accepted due to widespread use in business and academia

Disadvantages Of Inferential Statistics:

Such tests are normally used with contingency tables, which group observations based on common characteristics. ANOVA does this by comparing the dispersion of samples in order to make inferences about their means. Ideally, variables should move independently of one another, regardless of their means.
Unfortunately, in the real world, groups of observations usually differ on a number of dimensions, making simple analysis of variance tests problematic, since differences in other characteristics could cause the observed differences in the values of the variables of interest.

Correlation
Correlation, like ACOVA, is used to measure the similarity in the changes of values of interval variables, but it is not influenced by the units of measure. Another advantage of correlation is that it is always bounded by the interval from -1 to +1. A value of 0 indicates no relationship.

Regression analysis
Regression analysis is often used to determine the effect of independent variables on a dependent variable. Regression measures the relative impact of each independent variable and is useful in forecasting. It is used most appropriately when both the independent and dependent variables are interval, though some social scientists also use regression on ordinal data. Like correlation, regression analysis assumes that the relationship between variables is linear.

Logistic regression analysis
Logistic regression analysis is used to examine relationships between variables when the dependent variable is nominal, whether the independent variables are nominal, ordinal, interval, or some mixture thereof. For example, one could use several independent variables such as GED completion, job training, post-secondary education and the like to predict the odds of getting a job.

Discriminant analysis
Discriminant analysis is similar to logistic regression in that the outcome variable is categorical. However, here the independent variables must be interval.

Factor analysis
Factor analysis simultaneously examines multiple variables to determine if they reflect larger underlying dimensions. Factor analysis is commonly used when analyzing data from multi-question surveys to reduce the numerous questions to a smaller set of more global issues.

Forecasting
Forecasting exists in many variations. The predictive power of regression analysis can be an effective forecasting tool, but time series forecasting is more common when time is a significant independent variable.

Chapter 2 : Descriptive Statistics

Statistics for Engineers 4. Introduction to Statistics Descriptive Statistics Types of data A variate or random variable is a quantity or attribute whose value may vary from one.

Descriptive statistics are distinguished from inferential statistics (or inductive statistics) in that descriptive statistics aim to summarize a data set quantitatively without employing a probabilistic formulation, rather than use the data to make inferences about the population that the data are thought to represent. Even when a data analysis draws its main conclusions using inferential statistics, descriptive statistics are generally also presented. For example, in a paper reporting on a study involving human subjects, there typically appears a table giving the overall sample size and the sample sizes in important subgroups.

Inferential statistics
Inferential statistics tries to make inferences about a population from the sample data. We also use inferential statistics to make judgments of the probability that an observed difference between groups is a dependable one, or that it might have happened by chance in this study.

Use in statistical analyses
Descriptive statistics provide simple summaries about the sample and the measures. Together with simple graphics analysis, they form the basis of quantitative analysis of data. Descriptive statistics summarize data. For example, the shooting percentage in basketball is a descriptive statistic that summarizes the performance of a player or a team. This number is the number of shots made divided by the number of shots taken. The percentage summarizes or describes multiple discrete events. Or, consider the scourge of many students, the grade point average. This single number describes the general performance of a student across the range of their course experiences. Describing a large set of observations with a single indicator risks distorting the original data or losing important detail.
Despite these limitations, descriptive statistics provide a powerful summary that may enable comparisons across people or other units.

Univariate analysis
Univariate analysis involves the examination across cases of a single variable, focusing on three characteristics: the distribution, the central tendency, and the dispersion. It is common to compute all three for each study variable.

Distribution
The distribution is a summary of the frequency of individual values or ranges of values for a variable. The simplest distribution would list every value of a variable and the number of cases that had that value. For instance, computing the distribution of gender in the study population means computing the percentages that are male and female. The gender variable has only two values, making it possible and meaningful to list each one. However, this does not work for a variable such as income that has many possible values. Typically, specific values are not particularly meaningful (two incomes that differ only slightly are typically not meaningfully different). Grouping the raw scores using ranges of values reduces the number of categories to something more meaningful. For instance, we might group incomes into a handful of ranges.

Frequency distributions are depicted as a table or as a graph. Table 1 shows an age frequency distribution with five categories of age ranges defined. The same frequency distribution can be depicted in a graph as shown in Figure 2. This type of graph is often referred to as a histogram or bar chart.

Central tendency
The central tendency of a distribution locates the "center" of a distribution of values. The three major types of estimates of central tendency are the mean, the median, and the mode. The mean is the most commonly used method of describing central tendency. To compute the mean, take the sum of the values and divide by the count. For example, the mean quiz score is determined by summing all the scores and dividing by the number of students taking the exam.
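As a rough sketch of grouping raw values into ranges and then computing a mean, the example below bins a list of incomes into equal-width ranges. The income values, the bin width, and the helper name `income_range` are all assumptions made for illustration, not part of the original text.

```python
from collections import Counter
from statistics import mean

# Hypothetical incomes for ten survey respondents.
incomes = [12_500, 18_000, 23_400, 27_900, 31_000,
           36_250, 38_000, 44_100, 52_300, 58_700]
bin_width = 20_000  # assumed width of each income range


def income_range(value, width=bin_width):
    """Return the (lower, upper) bounds of the range that contains value."""
    lower = (value // width) * width
    return (lower, lower + width - 1)


# Frequency distribution: number of cases falling into each range.
frequency = Counter(income_range(v) for v in incomes)

for (lower, upper), count in sorted(frequency.items()):
    share = count / len(incomes)
    print(f"{lower:>6,} - {upper:>6,}: {count} cases ({share:.0%})")

# Mean: the sum of the values divided by the count.
mean_income = mean(incomes)
print(f"mean income: {mean_income:,.0f}")
```

Listing ten separate income values would say little, whereas three ranges with their counts give a readable picture of where most cases fall.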
The median is the score found at the middle of a set of values. One way to compute the median is to sort the values in numerical order, and then locate the value in the middle of the list. If the number of values n is even, the median is the average of the two values in positions n/2 and n/2 + 1. If n is odd, the value in position (n + 1)/2 is the median. For example, with 7 sorted test scores the median is the 4th score. If there were an 8th observation, with a value of 25, the median would become the average of the 4th and 5th scores.

The mode is the most frequently occurring value in the set. To determine the mode, compute the distribution as above. The mode is the value with the greatest frequency. In the example, the modal value, 15, occurs three times. In some distributions there is a "tie" for the highest frequency, i.e., there are several modal values. These are called multi-modal distributions.

Notice that the three measures typically produce different results. The term "average" obscures the difference between them and is better avoided. The three values are equal if the distribution is perfectly "normal" (that is, symmetric and bell-shaped).

Dispersion
Dispersion is the spread of values around the central tendency. There are two common measures of dispersion, the range and the standard deviation. The range is simply the highest value minus the lowest value.

Test statistic
In statistical hypothesis testing, a hypothesis test is typically specified in terms of a test statistic, which is a function of the sample; it is considered as a numerical summary of a set of data that reduces the data to one or a small number of values that can be used to perform a hypothesis test. Given a null hypothesis and a test statistic T, we can specify a "null value" T0 such that values of T close to T0 present the strongest evidence in favor of the null hypothesis, whereas values of T far from T0 present the strongest evidence against the null hypothesis. An important property of a test statistic is that we must be able to determine its sampling distribution under the null hypothesis, which allows us to calculate p-values.

For example, suppose we wish to test whether a coin is fair (i.e., has equal probabilities of producing a head or a tail). If we flip the coin a fixed number of times and record the results, the raw data can be represented as a sequence of Heads and Tails. If T is the number of flips that land on one chosen side, the exact sampling distribution of T is the binomial distribution, but for larger sample sizes the normal approximation can be used. Using one of these sampling distributions, it is possible to compute either a one-tailed or two-tailed p-value for the null hypothesis that the coin is fair. Note that the test statistic in this case reduces a set of numbers to a single numerical summary that can be used for testing.

A test statistic shares some of the same qualities of a descriptive statistic, and many statistics can be used as both test statistics and descriptive statistics.
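The coin example can be sketched in a few lines. This is a minimal illustration under stated assumptions: the test statistic is taken to be the number of heads, the two-sided exact p-value sums the probabilities of all outcomes no more likely than the observed one (one common convention among several), and the function names are hypothetical.

```python
import math


def exact_two_sided_p(heads, flips):
    """Exact two-sided p-value for the null hypothesis that the coin is fair.

    Under the null, P(T = i) = comb(flips, i) / 2**flips, so outcomes can be
    ranked by their binomial coefficient. Integer arithmetic avoids rounding
    problems when comparing probabilities.
    """
    observed = math.comb(flips, heads)
    as_or_less_likely = sum(
        math.comb(flips, i)
        for i in range(flips + 1)
        if math.comb(flips, i) <= observed
    )
    return as_or_less_likely / 2**flips


def normal_approx_two_sided_p(heads, flips):
    """Two-sided p-value from the normal approximation (no continuity correction)."""
    z = (heads - flips / 2) / math.sqrt(flips / 4)
    return math.erfc(abs(z) / math.sqrt(2))


p_exact = exact_two_sided_p(60, 100)
p_normal = normal_approx_two_sided_p(60, 100)
print(f"exact p-value: {p_exact:.4f}, normal approximation: {p_normal:.4f}")
```

For 60 heads in 100 flips the two approaches give similar but not identical answers, which is typical: the normal approximation becomes closer as the number of flips grows.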
However, a test statistic is specifically intended for use in statistical testing, whereas the main quality of a descriptive statistic is that it is easily interpretable. Some informative descriptive statistics, such as the sample range, do not make good test statistics since it is difficult to determine their sampling distribution.

Range statistics
In descriptive statistics, the range is the length of the smallest interval which contains all the data. It is calculated by subtracting the smallest observation (sample minimum) from the greatest (sample maximum) and provides an indication of statistical dispersion. It is measured in the same units as the data. Since it only depends on two of the observations, it is a poor and weak measure of dispersion except when the sample size is large. The range, in the sense of the difference between the highest and lowest scores, is also called the crude range. When a new scale for measurement is developed, a potential maximum or minimum will emanate from this scale. This is called the potential crude range. Of course this range should not be chosen too small, in order to avoid a ceiling effect. When the measurement is obtained, the resulting smallest or greatest observation will provide the observed crude range. The midrange point, i.e., the point halfway between the two extremes, is an indicator of the central tendency of the data. Again it is not particularly robust for small samples.

Mathematical statistics
Mathematical statistics is the study of statistics from a mathematical standpoint, using probability theory as well as other branches of mathematics such as linear algebra and analysis. The term "mathematical statistics" is closely related to the term "statistical theory" but also embraces modelling for actuarial science and non-statistical probability theory, particularly in Scandinavia. Statistics deals with gaining information from data. In practice, data often contain some randomness or uncertainty. Statistics handles such data using methods of probability theory.
Introduction
Statistical science is concerned with the planning of studies, especially with the design of randomized experiments and with the planning of surveys using random sampling. The initial analysis of the data from properly randomized studies often follows the study protocol. Of course, the data from a randomized study can be analyzed to consider secondary hypotheses or to suggest new ideas. A secondary analysis of the data from a planned study uses tools from data analysis. Data analysis is divided into descriptive statistics and inferential statistics. For example, inferential statistics involves selecting a model for the data, checking whether the data fulfill the conditions of a particular model, and quantifying the involved uncertainty. While the tools of data analysis work best on data from randomized studies, they are also applied to other kinds of data (for example, from natural experiments and observational studies), in which case the inference is dependent on the model chosen by the statistician, and so subjective. Mathematical statistics has been inspired by and has extended many procedures in applied statistics.

Statistics, mathematics, and mathematical statistics
Mathematical statistics has substantial overlap with the discipline of statistics. Statistical theorists study and improve statistical procedures with mathematics, and statistical research often raises mathematical questions. Statistical theory relies on probability and decision theory. Mathematicians and statisticians like Gauss, Laplace, and C. S. Peirce used decision theory with probability distributions and loss functions (or utility functions). The decision-theoretic approach to statistical inference was reinvigorated by Abraham Wald and his successors.

From Yahoo Answers
Question: What is a regular introductory Statistics class like?
Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data.

Chapter 3 : Descriptive Statistics

Descriptive statistics Although entering a large set of numbers into a spreadsheet can take some time, once they are there you can carry out a wide range of operations and calculations very easily and.

May 5, by April Klazema

Statistical analysis allows you to use math to reach conclusions about various situations. This type of analysis can be performed in several ways, but you will typically find yourself using both descriptive and inferential statistics in order to make a full analysis of a set of data. There are key differences between these two types of analysis, and using them both can help you reach accurate conclusions about your test subjects. So where is this type of analysis applicable? Most people use statistical analysis in many aspects of their life, whether they realize it or not. You can really enhance your understanding of the world with just a little more understanding of statistics and how it works. If you want to learn all there is to know about the subject beyond the basics of descriptive and inferential statistics, then check out the Workshop in Probability Udemy course.

Descriptive Statistics
In order to understand the key differences between descriptive and inferential statistics, as well as know when to use them, you must first understand what each type of statistics does, and what it is used to analyze. Descriptive statistics is a form of analysis that helps you by describing, summarizing, or showing data in a meaningful way. Descriptive statistics only give us the ability to describe what is shown before us. You cannot draw any specific conclusions about a hypothesis with just descriptive statistics. A good example would be grades for a body of students. Using just descriptive statistics, you can find patterns in the test scores, such as a small number of students getting high and low test scores and a large number of students getting average test scores.
If you were to simply present the data as it is, you would not be able to easily visualize what the data is trying to show or tell you. This is even more difficult when you have a lot of data to process.

The first type is the Measures of Central Tendency. This type of statistic describes the central position of a frequency distribution for a specific group of data. In reference to the exam grades, finding out that the average test score is 77 would be a measure of the central tendency of the data.

The other type describes spread. This type of analysis helps people summarize data by describing the way in which the data is spread out. Look at the test scores again, with their median score in mind: this form of analysis evaluates how many students scored in the higher ranges and how many scored in the lower ranges. The different ways to do this form of analysis include finding the absolute deviation, variance, quartiles, standard deviation, and the range. Some of these concepts may seem complicated, but you can learn statistics quickly and easily with Udemy. There are all sorts of courses and tutorials out there to help you master statistics.

Inferential Statistics
Above we explored descriptive analysis, which helps a great deal with summarizing data. The examples regarding the test scores were an analysis of a population. When you use descriptive statistics, you have to have the entire population at your disposal, since descriptive analysis gives you the properties of the population as a whole, such as the mean or the absolute deviation. With inferential statistics, instead of getting the data from the entire school, you would take a small sample, such as the test scores that you already have. The technique you use for inferential statistics is a bit different from the ones you use with descriptive statistics. Inferential statistics involves taking several samples and trying to find one that accurately represents the population as a whole.
You then test that sample and use it to make generalizations about the entire population, which in this case is every student within the school. There are two methods used in inferential statistics. You can come to a close estimate of what the test scores of the population will be like, but you have no way of accurately knowing what the parameters of the test scores truly are without having the data yourself. As with descriptive statistics, you can learn to do inferential statistics with computer software. This can make things a lot easier and will allow you to input data for a much larger set of numbers. This course teaches you everything you need to know about doing inferential statistics with the SPSS software. You will also get a step-by-step guide to help ensure that you are able to learn the concepts with ease.

The Differences of Descriptive and Inferential Statistics
Both descriptive and inferential statistics have their benefits and shortcomings. Descriptive statistics are great for a small population. As mentioned before, you have the accuracy that you may want, but it is all limited to a very small population, at least in comparison to inferential statistics. With inferential statistics, you can make an educated guess about what the parameters of the entire population are, no matter how large it may be. Unfortunately, this does prevent you from having accurate data.

Understanding Key Types of Statistics
As easy as learning the concepts of statistics may seem, it can be difficult to apply them in a real world situation. Having a few dozen pieces of data to analyze may not be much of a task, but when that data reaches into the hundreds or thousands, things can become a bit more difficult. It can teach you many of the key concepts of statistical analysis.

Chapter 4 : Statistic - Wikipedia

Statistics: Descriptive- Chapter 2 Elementary Statistics: Picturing the World 5th edition Larson & Farber Frequency Distribution & their graphs More Graphs and displays Measure of Central Tendency Measures of Variation Measure of Position.

Better numbers exist to summarize location, association and spread. Statistics professors tend to gloss over basic descriptive statistics because they want to spend as much time as possible on margins of error and t-tests and regression. Forget what you think you know about descriptives and let me give you a whirlwind tour of the real stuff.

The average
The arithmetic mean is one of many measures of central tendency. One particularly useful feature of the mean is that, whenever we lack outside information like a scientific theory, it is our best possible guess for what to expect in the future. For example, consider some statistics about Homo sapiens. The problem with the arithmetic mean is that it does not correspond to anything or anyone; it just blends everything together. The median, on the other hand, can be interpreted as a typical sort of value. If you look at the distribution of adult human heights, you will find two bumps at different heights. In the case of human height, the explanation is obvious: once you split the data by gender, the bimodal distribution disappears. However, the typical adult human is not therefore a woman of median height. The median of a dataset with two or more dimensions is not accurately represented by the median of each individual dimension. What you want is the centerpoint or the half-space. With very many dimensions, however, the concept of a central value becomes less and less useful. Interestingly, this is true not just for humans but for machines as well.

The spread
The standard deviation measures how spread out different values are. Why would you square something only to take its square root a couple steps later?
We square the distances to the mean to make them positive. Squaring is a mathematical hack: unlike the absolute value, a square is easy to differentiate. Easy differentiation is nice, but not terribly relevant when all you want to do is describe the spread of your data. The standard deviation lacks an easy interpretation. When communicating how far apart values are, use the mean absolute deviation or the median absolute deviation (MAD). These statistics have the distinct advantage that they stand for what your audience will think they stand for.

An acceptable substitute, also quite easy to interpret, is the interquartile range. Sort the data, put it into four bins of equal size, and return the lower and upper bound of the two bins in the middle, otherwise known as the first and third quartile. Half of your data is in between these goal posts. The interquartile range is the measure of spread you will usually see pictured as the box in a boxplot. The interquartile range is also sometimes communicated as a single number, the difference between the third and first quartiles.

The location
Statisticians and mathematicians are lazy, so instead of devising one statistical method that works for data with a mean of 2 and a variance of 5, and another statistical method for data with a mean of 23 and a variance of 8, they standardize the data so that a single method works for all of it. These standardized numbers are called pivotal quantities, quantities that make no reference to the mean or variance or any other parameter of a statistical distribution, and they are used a lot in statistics. One such pivotal quantity is the z-score. To convert a dataset into z-scores, subtract the mean from each value and then divide each result by the standard deviation. This rescales the data to have a mean of 0 and a standard deviation of 1; it does not change the shape of the distribution. Once in that standardized format, you can run all kinds of statistical tests, in particular Wald tests. Normalized data is also useful when comparing things.
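The spread and location measures just described can be sketched with the standard library. The data are invented, and the quartiles come from `statistics.quantiles` with its default "exclusive" method; other quartile conventions give slightly different cut points.

```python
from statistics import mean, median, pstdev, quantiles

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

# Mean absolute deviation: average distance from the mean.
center = mean(data)
mean_abs_dev = mean([abs(x - center) for x in data])

# Median absolute deviation (MAD): median distance from the median.
mid = median(data)
median_abs_dev = median([abs(x - mid) for x in data])

# Interquartile range: first and third quartile, and their difference.
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1

# z-scores: subtract the mean, then divide by the standard deviation.
sd = pstdev(data)
z_scores = [(x - center) / sd for x in data]

print(f"mean absolute deviation: {mean_abs_dev}")
print(f"median absolute deviation: {median_abs_dev}")
print(f"quartiles: {q1} to {q3} (IQR {iqr})")
print("z-scores:", [round(z, 2) for z in z_scores])
```

After standardizing, the z-scores have mean 0 and standard deviation 1, but their histogram has exactly the same shape as that of the original data.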
If you took a test and got 15 out of 20 questions right, is that above or below average, and exactly how far above or below? Z-scores are great for statistical tests. As a basis for comparisons, however, they are flawed. A more easily interpretable number is the percentile rank. The 50th percentile is our good friend the median. Percentile refers to the actual value; percentile rank is the fraction of the data it corresponds to. You can calculate the percentile rank for any value in a dataset. As with the median, percentile ranks are immune to skew and kurtosis. Strangely, I almost never hear statisticians talk about z-scores but it pops up from time to time in news articles.

The skew
Data is skewed when it contains a disproportionate amount of small or large values, rather than the data being nicely spread out in both directions around the mean. If you graph the distribution, it will look lopsided, with the bulk of the data on one side and a long tail on the other. Negative skewness means the data is skewed to the left, which means it has a fat left tail, and positive skewness shows up as a fat right tail on a histogram. Skewness is another statistic where I sometimes see non-statisticians trying to outdo statisticians. Skewness is a number that is used so little in statistics that even an experienced data scientist would have a hard time drawing a distribution of approximately the right shape if you gave them a skew statistic.

How can we convey skewness if not through a statistic? For a technical audience, a QQ-plot can communicate how two distributions differ in shape. In every other situation, use a histogram. A histogram organizes the data into an arbitrary number of equal intervals, counts how many points fit in each interval, and plots those counts as a bar chart. It takes up more space than a number but you get to see the true shape of the data.

In fact, analysis of just the bottom or top of a dataset opens you up to regression toward the mean, which will invalidate your conclusions. But there are moments when you do need a way to spot anomalies, perhaps to detect fraud or malfunctioning machines. It is common to look for outliers by identifying values that are more than 3 standard deviations from the mean. Intuitively this makes a lot of sense, because the standard deviation and the mean were probably the first things you calculated when you got the data, and a normal distribution has very little density at 3 standard deviations beyond the mean. However, "x deviations from the mean" is a self-defeating heuristic. Instead of hunting for outliers per se, we leave out one observation from the model at a time and check whether this single observation affects the model parameters one way or the other, the idea being that something can only count as outlandish if it has an outlandish impact on how we see the world. Because it adds or removes the entire observation, with all of its component variables, this technique can detect highly unusual observations that at first sight look perfectly normal.
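A minimal sketch of the leave-one-out idea follows, using the mean as a stand-in for "the model" (an assumption made to keep the example short; a real analysis would refit a regression or similar model). The readings are invented. The sketch also shows one plausible reading of why the 3-standard-deviation rule can defeat itself: the extreme value inflates both the mean and the standard deviation that are used to judge it.

```python
from statistics import mean, pstdev

# Hypothetical sensor readings with one clearly unusual value.
readings = [10, 11, 9, 10, 12, 10, 11, 60]

full_mean = mean(readings)
full_sd = pstdev(readings)

# The 3-standard-deviation rule, applied to the unusual value.
z_of_extreme = (60 - full_mean) / full_sd
print(f"z-score of 60: {z_of_extreme:.2f} (flagged by the 3-sd rule: {abs(z_of_extreme) > 3})")

# Leave-one-out influence: how far does the mean move when one reading is dropped?
influence = []
for i, value in enumerate(readings):
    rest = readings[:i] + readings[i + 1:]
    influence.append((abs(mean(rest) - full_mean), i, value))

shift, index, value = max(influence)
print(f"most influential reading: {value} at index {index}, shifts the mean by {shift:.2f}")
```

In this small sample the value 60 is not flagged by the 3-standard-deviation rule, yet dropping it moves the mean far more than dropping any other reading.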
The correlation

Take a daily aspirin and you are less likely to succumb to a heart attack. Higher temperatures, maintained for longer, kill more bacteria. Machinery subjected to heavier loads will break down sooner. A relationship between any two variables is an association; an association between two quantities (not gender or color, but distance or weight) is a correlation. Negative correlations simply mean that as one thing goes up, the other goes down. The longer folks have to wait for the bus, the worse their mood will be.

Digging a little deeper, we see that the Pearson product-moment correlation is a measure of linear association. It works by drawing lines. Statisticians can do all sorts of crazy things with lines that make them not lines anymore while they get to pretend that they still are. The squiggly curves of polynomial regression still count as linear regression, for one. Fundamentally, though, a correlation is still just a line, and not every relationship between two variables can be captured by a linear relationship that states "for each additional x, increase y by this amount". Toxins are generally harmless below a certain threshold and then very quickly become dangerous. Cheaper goods sell more, but below a certain price point other factors weigh more heavily on our purchasing decisions.

Or you might remember from an intro to stats that taking the square root or the logarithm of the dependent variable in a hockey stick graph will straighten it out, and Pearson may live to fight another day. Not really though, Karl Pearson died in 1936. But why do you want a number at all? Instead, just draw a scatterplot, which shows the relationship in all its messy glory, no matter how bendy or how straight. Still not happy and absolutely want a number? You would do well to shun correlations even so. While statisticians are generally quite good at estimating a correlation from a picture and vice versa, most people are not.
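To make the "hockey stick" remark concrete, here is a minimal sketch that computes the Pearson correlation by hand, once on a sharply curving relationship and once after taking the logarithm of the dependent variable. The dose and response numbers are invented for illustration, and `pearson_r` is a hypothetical helper, not a function from the original text.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: the response doubles with every unit of dose.
dose = list(range(1, 11))
response = [2 ** d for d in dose]

r_raw = pearson_r(dose, response)                          # curved relationship
r_log = pearson_r(dose, [math.log(v) for v in response])   # straightened by the log

print(f"raw: {r_raw:.2f}, after log transform: {r_log:.2f}")
```

The relationship is perfectly predictable in both cases, yet the raw correlation comes out well below 1 because a straight line fits the curve poorly. After the log transform it is essentially 1.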
Communicate the linear relationship between two variables through its slope instead, the "for each additional x, y will increase by this amount" thing we mentioned earlier. To calculate the slope of a linear association, multiply the correlation by the standard deviation of the variable on the y-axis and divide it by the standard deviation of the variable on the x-axis.

The discipline we call statistics is a two-headed beast. Descriptive statistics is the attempt to make sense of large amounts of data. Each observation brings its own idiosyncrasies, so we must distill the data down to easier-to-read summaries, charts and comparisons between groups. Inferential statistics then takes these summaries and judges whether they are likely to hold true in general or whether they contain quirks, patterns that are particular to just your data. Descriptive statistics is when you ask five people and they all tell you coffee makes them sleepy. Means and medians are descriptive; hypotheses and margins of error are inferential.

Statistics attracts people from many different backgrounds, but above all it attracts mathematicians. Descriptive statistics is a matter of communication, cognition, numeracy, even user experience. Inferential statistics, on the other hand, is a theoretical delight. Similarly, much of Bayesian statistics relies on brute force simulation in lieu of the elegant little theorems of frequentist statistics. This sheds some light on the psychology of the statistician who turns sour at the first mention of a posterior probability.

The disdain of statisticians for descriptive work has contributed to a peculiar situation where innovations in visualization are generally the work of outsiders and fringe figures like John Tukey, William Cleveland and Edward Tufte. Another consequence is that the descriptive statistics we use so much (the mean, the standard deviation, the correlation) are our go-to numbers not because they are the nicest way to describe a dataset, but because they are useful building blocks for statistical inference. It would be nice to have numbers that can do double duty, statistics that work equally well for description and inference.
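The slope recipe given earlier in this passage (correlation times the standard deviation of y, divided by the standard deviation of x) can be checked against the ordinary least-squares slope. The sketch below does that on invented bus-waiting data; the variable names and numbers are assumptions made purely for illustration.

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson correlation computed from sums of squares and cross-products."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def least_squares_slope(xs, ys):
    """Slope of the best-fitting straight line, computed directly."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov / var_x

# Hypothetical data: minutes spent waiting for the bus and a 0-10 mood score.
wait_minutes = [0, 5, 10, 15, 20]
mood = [9, 8, 6, 5, 2]

r = pearson_r(wait_minutes, mood)
slope_from_r = r * statistics.stdev(mood) / statistics.stdev(wait_minutes)
slope_direct = least_squares_slope(wait_minutes, mood)

print(f"slope from correlation: {slope_from_r:.2f}")
print(f"least-squares slope:    {slope_direct:.2f}")
```

Both routes give the same number, and the result reads directly as "each extra minute of waiting lowers the mood score by about a third of a point" for this made-up data.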

Chapter 5 : Descriptive and Inferential Statistics: How to Analyze Your Data

In descriptive statistics, we simply state what the data shows and tells us. Interpreting the results and trends beyond this involves inferential statistics, which is a separate branch altogether. Descriptive statistics provide simple summaries about the sample and the measures. Together with simple graphics analysis, they form the basis of virtually every quantitative analysis of data.

Descriptive statistics are typically distinguished from inferential statistics. With descriptive statistics you are simply describing what is, or what the data shows. With inferential statistics, you are trying to reach conclusions that extend beyond the immediate data alone. For instance, we use inferential statistics to try to infer from the sample data what the population might think. Or, we use inferential statistics to make judgments of the probability that an observed difference between groups is a dependable one or one that might have happened by chance in this study.

Descriptive statistics are used to present quantitative descriptions in a manageable form. In a research study we may have lots of measures. Or we may measure a large number of people on any measure. Descriptive statistics help us to simplify large amounts of data in a sensible way. Each descriptive statistic reduces lots of data into a simpler summary. For instance, consider a simple number used to summarize how well a batter is performing in baseball, the batting average. This single number is simply the number of hits divided by the number of times at bat, reported to three significant digits. A batter who is hitting. The single number describes a large number of discrete events. This single number describes the general performance of a student across a potentially wide range of course experiences.
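As a small illustration of how one number can summarize many discrete events, the batting average described above can be written as a one-line calculation. The hit and at-bat counts below are invented, and `batting_average` is a hypothetical helper name.

```python
def batting_average(hits, at_bats):
    """Hits divided by times at bat, rounded to three decimal places."""
    if at_bats <= 0:
        raise ValueError("at_bats must be positive")
    return round(hits / at_bats, 3)

# Hypothetical season: 45 hits in 150 at-bats.
average = batting_average(45, 150)
print(f"{average:.3f}")
```

The 150 separate plate appearances collapse into a single figure, which is exactly the trade of detail for convenience that the passage describes.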
Every time you try to describe a large set of observations with a single indicator, you run the risk of distorting the original data or losing important detail. Even given these limitations, descriptive statistics provide a powerful summary that may enable comparisons across people or other units.

Univariate Analysis

Univariate analysis involves the examination across cases of one variable at a time. There are three major characteristics of a single variable that we tend to look at: the distribution, the central tendency, and the dispersion.

The distribution is a summary of the frequency of individual values or ranges of values for a variable. The simplest distribution would list every value of a variable and the number of persons who had each value. For instance, a typical way to describe the distribution of college students is by year in college, listing the number or percent of students at each of the four years. Or, we describe gender by listing the number or percent of males and females. In these cases, the variable has few enough values that we can list each one and summarize how many sample cases had the value. But what do we do for a variable like income or GPA? With these variables there can be a large number of possible values, with relatively few people having each one. In this case, we group the raw scores into categories according to ranges of values. For instance, we might look at GPA according to the letter grade ranges. Or, we might group income into four or five ranges of income values.

One of the most common ways to describe a single variable is with a frequency distribution. Depending on the particular variable, all of the data values may be represented, or you may group the values into categories first; in that case the values are grouped into ranges and the frequencies determined. Frequency distributions can be depicted in two ways, as a table or as a graph. Table 1 shows an age frequency distribution with five categories of age ranges defined.
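A grouped frequency distribution of the kind just described can be built by counting how many values fall into each range. The sketch below is illustrative only: the ages and the five age ranges are invented and are not the contents of Table 1.

```python
from collections import Counter

def frequency_table(values, bins):
    """Count how many values fall in each inclusive (low, high) range."""
    counts = Counter()
    for value in values:
        for low, high in bins:
            if low <= value <= high:
                counts[(low, high)] += 1
                break
    return [(f"{low}-{high}", counts[(low, high)]) for low, high in bins]

# Hypothetical ages and five hypothetical age ranges.
ages = [19, 23, 27, 31, 34, 35, 38, 42, 44, 47, 51, 53, 58, 62, 67]
age_bins = [(18, 29), (30, 39), (40, 49), (50, 59), (60, 69)]

table = frequency_table(ages, age_bins)
percents = [(label, round(100 * n / len(ages), 1)) for label, n in table]

for (label, n), (_, pct) in zip(table, percents):
    print(f"{label:>6}  {n:>2}  {pct:>5}%")
```

The same counts can be reported as percentages, which makes distributions from samples of different sizes easier to compare.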
The same frequency distribution can be depicted in a graph as shown in Figure 1. This type of graph is often referred to as a histogram or bar chart.

Figure 1. Frequency distribution bar chart.

Distributions may also be displayed using percentages. For example, you could use percentages to describe the:

The central tendency of a distribution is an estimate of the "center" of a distribution of values. There are three major types of estimates of central tendency: the mean, the median, and the mode.

The Mean or average is probably the most commonly used method of describing central tendency. To compute the mean, all you do is add up all the values and divide by the number of values. For example, the mean or average quiz score is determined by summing all the scores and dividing by the number of students taking the exam. For example, consider the test score values:

The Median is the score found at the exact middle of the set of values. One way to compute the median is to list all scores in numerical order, and then locate the score in the center of the sample. For example, in an ordered list of scores, the score halfway down the list would be the median. If we order the 8 scores shown above, we would get:

Since both of these scores are 20, the median is 20. If the two middle scores had different values, you would have to interpolate to determine the median.

The mode is the most frequently occurring value in the set of scores. To determine the mode, you might again order the scores as shown above, and then count each one. The most frequently occurring value is the mode. In our example, the value 15 occurs three times and is the mode. In some distributions there is more than one modal value. For instance, in a bimodal distribution there are two values that occur most frequently. Notice that for the same set of 8 scores we got three different values for the mean, median and mode. If the distribution is truly normal

Dispersion refers to the spread of the values around the central tendency. There are two common measures of dispersion, the range and the standard deviation. The range is simply the highest value minus the lowest value. The Standard Deviation is a more accurate and detailed estimate of dispersion because an outlier can greatly exaggerate the range, as was true in this example where the single outlier value of 36 stands apart from the rest of the values. The Standard Deviation shows the relation that a set of scores has to the mean of the sample. Again let's take the set of scores:

We know from above that the mean is

So, the differences from the mean are:
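The calculations described above can be followed step by step in code. The original score list did not survive in this copy, so the eight scores below are an assumption, chosen only to match the properties the text mentions (three 15s, two middle scores of 20, and a single outlier of 36). Treat the resulting numbers as illustrative rather than as the author's own figures.

```python
import math
import statistics

# Assumed scores, picked to fit the description in the text.
scores = [15, 20, 21, 20, 36, 15, 25, 15]

mean = statistics.mean(scores)        # sum of the values divided by their count
median = statistics.median(scores)    # average of the two middle scores when n is even
mode = statistics.mode(scores)        # most frequently occurring value
value_range = max(scores) - min(scores)

# Standard deviation, spelled out:
deviations = [s - mean for s in scores]             # 1. differences from the mean
squared = [d ** 2 for d in deviations]              # 2. square each difference
variance = sum(squared) / (len(scores) - 1)         # 3. sum and divide by n - 1
std_dev = math.sqrt(variance)                       # 4. take the square root

print(f"mean={mean}, median={median}, mode={mode}, range={value_range}")
print(f"standard deviation={std_dev:.4f}")
```

Dividing by n - 1 rather than n gives the sample standard deviation, which is also what `statistics.stdev` computes. The range is dominated by the single value of 36, whereas the standard deviation uses every score.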

Chapter 6 : Comparative Study Between Descriptive Statistics

Statistical analysis allows you to use math to reach conclusions about various situations. This type of analysis can be performed in several ways, but you will typically find yourself using both descriptive and inferential statistics in order to make a full analysis of a set of data. There are key.

The first subject received Treatment 1, and had Outcome 1. X and Y are the values of two measurements on each subject. We were unable to get a measurement for Y on the second subject, or for X on the last subject, so these cells are blank. The subjects are entered in the order that the data became available, so the data is not ordered in any particular way. We used this data to do some simple analyses and compared the results with a standard statistical package. The comparison considered the accuracy of the results as well as the ease with which the interface could be used for bigger data sets.

It includes a variety of choices including simple descriptive statistics, t-tests, correlations, 1 or 2-way analysis of variance, regression, etc. Two other Excel features are useful for certain analyses, but the Data Analysis tool pack is the only one that provides reasonably complete tests of statistical significance. Pivot Table in the Data menu can be used to generate summary tables of means, standard deviations, counts, etc. Also, you could use functions to generate some statistical measures, such as a correlation coefficient. Functions generate a single number, so using functions you will likely have to combine bits and pieces to get what you want. Even so, you may not be able to generate all the parts you need for a complete analysis.

In order to check a variety of statistical tests, we chose the following tasks:

Get means and standard deviations of X and Y for the entire group, and for each treatment group.
Get the correlation between X and Y.
Do a two-sample t-test to test whether the two treatment groups differ on X and Y.
Do a paired t-test to test whether X and Y are statistically different from each other.
Compare the number of subjects with each outcome by treatment group, using a chi-squared test.

All of these tasks are routine for a data set of this nature, and all of them could be easily done using any of the above listed statistical packages.

Look in the Tools menu. If you do not have a Data Analysis item, you will need to install the Data Analysis tools. Search Help for "Data Analysis Tools" for instructions.

Missing Values

A blank cell is the only way for Excel to deal with missing data. If you have any other missing value codes, you will need to change them to blanks.

Data Arrangement

Different analyses require the data to be arranged in various ways. If you plan on a variety of different tests, there may not be a single arrangement that will work. You will probably need to rearrange the data several ways to get everything you need.

The typical dialog box will have the following items:

Input Range - Type the upper left and lower right corner cells. You can only choose adjacent rows and columns. Unless there is a checkbox for grouping data by rows or columns (and there usually is not), all the data is considered as one glop.
Labels - There is sometimes a box you can check off to indicate that the first row of your sheet contains labels. If you have labels in the first row, check this box, and your output MAY be labeled with your label. Then again, it may not.
Output location - New Sheet is the default. Or, type in the cell address of the upper left corner of where you want to place the output in the current sheet. New Worksheet is another option, which I have not tried. Ramifications of this choice are discussed below.
Other items, depending on the analysis.
Output location

The output from each analysis can go to a new sheet within your current Excel file (this is the default), or you can place it within the current sheet by specifying the upper left corner cell where you want it placed. Either way is a bit of a nuisance. If each output is in a new sheet, you end up with lots of sheets, each with a small bit of output.

You will want to make this column wide in order to be able to read the labels. But if a simple Frequency output is right underneath, then the column displaying the values being counted, which may just contain small integers, will also be wide.

Results of Analyses

Descriptive Statistics

The quickest way to get means and standard deviations for an entire group is using Descriptives in the Data Analysis tools. You can choose several adjacent columns for the Input Range (in this case the X and Y columns), and each column is analyzed separately. The labels in the first row are used to label the output, and the empty cells are ignored. If you have more, non-adjacent columns you need to analyze, you will have to repeat the process for each group of contiguous columns. The procedure is straightforward, can manage many columns reasonably efficiently, and empty cells are treated properly.

To get the means and standard deviations of X and Y for each treatment group requires the use of Pivot Tables (unless you want to rearrange the data sheet to separate the two groups).

Finally, drag X in one more time, leaving it as Count of X. This will give us the average, standard deviation and number of observations in each treatment group for X. Do the same for Y, so we will get the average, standard deviation and number of observations for Y also. This will put a total of six items in the Data box (three for X and three for Y). As you can see, if you want to get a variety of descriptive statistics for several variables, the process will get tedious.

A statistical package lets you choose as many variables as you wish for descriptive statistics, whether or not they are contiguous. You can get the descriptive statistics for all the subjects together, or broken down by a categorical variable such as treatment. You can select the statistics you want to see once, and it will apply to all variables chosen.

Correlations

Using the Data Analysis tools, the dialog for correlations is much like the one for descriptives: you can choose several contiguous columns, and get an output matrix of all pairs of correlations. Empty cells are ignored appropriately. The output does NOT include the number of pairs of data points used to compute each correlation (which can vary, depending on where you have missing data), and does not indicate whether any of the correlations are statistically significant. If you want correlations on non-contiguous columns, you would either have to include the intervening columns, or copy the desired columns to a contiguous location.

A statistical package would permit you to choose non-contiguous columns for your correlations. The output would tell you how many pairs of data points were used to compute each correlation, and which correlations are statistically significant.
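For comparison with the pivot-table route, here is a minimal sketch of the same two tasks in plain Python: per-group means, standard deviations and counts, and a correlation that reports how many complete pairs it used. The data table from the original did not survive in this copy, so every value below is invented; only the pattern of missing cells (Y missing for the second subject, X missing for the last) follows the description above. `summarize` and `pearson_r` are hypothetical helper names.

```python
import math
import statistics
from collections import defaultdict

# (treatment, x, y); None marks a missing measurement. All values are hypothetical.
rows = [
    (1, 10.5, 9.8), (1, 11.2, None), (2, 12.1, 11.0), (1, 9.7, 9.1),
    (2, 13.4, 12.2), (2, 12.8, 12.0), (1, 10.9, 10.1), (2, 11.9, 11.3),
    (1, 10.1, 9.5), (2, None, 11.8),
]

def summarize(values):
    """Mean, sample SD and count, ignoring missing values."""
    present = [v for v in values if v is not None]
    return {"mean": statistics.mean(present),
            "sd": statistics.stdev(present),
            "n": len(present)}

def pearson_r(pairs):
    """Pearson correlation over a list of complete (x, y) pairs."""
    xs, ys = zip(*pairs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Descriptive statistics broken down by treatment group.
by_group = defaultdict(lambda: {"x": [], "y": []})
for treatment, x, y in rows:
    by_group[treatment]["x"].append(x)
    by_group[treatment]["y"].append(y)
summary = {t: {var: summarize(vals) for var, vals in cols.items()}
           for t, cols in by_group.items()}

# Correlation on complete pairs only, with the pair count reported alongside.
complete_pairs = [(x, y) for _, x, y in rows if x is not None and y is not None]
n_pairs = len(complete_pairs)
r = pearson_r(complete_pairs)

for treatment in sorted(summary):
    print(treatment, summary[treatment])
print(f"r = {r:.3f} based on {n_pairs} complete pairs")
```

Reporting `n_pairs` next to the correlation is the piece the text notes is absent from the spreadsheet output.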
Two-Sample T-test

This test can be used to check whether the two treatment groups differ on the values of either X or Y. In order to do the test you need to enter a cell range for each group. Since the data were not entered by treatment group, we first need to sort the rows by treatment. Be sure to take all the other columns along with treatment, so that the data for each subject remains intact. After the data is sorted, you can enter the range of cells containing the X measurements for each treatment. Do not include the row with the labels, because the second group does not have a label row. Therefore your output will not be labeled to indicate that this output is for X. If you want the output labeled, you have to copy the cells corresponding to the second group to a separate column, and enter a row with a label for the second group. The empty cells are ignored, and other than the problems with labeling the output, the results are correct.

A statistical package would do this task without any need to sort the data or copy it to another column, and the output would always be properly labeled to the extent that you provide labels for your variables and treatment groups. It would also allow you to choose more than one variable at a time for the t-test.

Paired t-test

The paired t-test is a method for testing whether the difference between two measurements on the same subject is significantly different from 0. In this example, we wish to test the difference between X and Y measured on the same subject. The important feature of this test is that it compares the measurements within each subject. If you scan the X and Y columns separately, they do not look obviously different. But if you look at each X-Y pair, you will notice that in every case, X is greater than Y. The paired t-test should be sensitive to this difference. In the two cases where either X or Y is missing, it is not possible to compare the two measures on a subject.
Hence, only 8 rows are usable for the paired t-test. When you run the paired t-test on this data, you get a t-statistic of 0. The test does not find any significant difference between X and Y. Looking at the output more carefully, we notice that it says there are 9 observations. As noted above, there should only be 8. It appears that Excel has failed to exclude the observations that did not have both X and Y measurements. To get the correct results, copy X and Y to two new columns and remove the data in the cells that have no value for the other measure. Now re-run the paired t-test. This time the t-statistic is 6. The conclusion is completely different!

Of course, this is an extreme example. But the point is that Excel does not calculate the paired t-test correctly when some observations have one of the measurements but not the other. Although it is possible to get the correct result, you would have no reason to suspect the results you get unless you are sufficiently alert to notice that the number of observations is wrong. There is nothing in online help that would warn you about this issue.

Apparently the functions and the Data Analysis tools are not consistent in how they deal with missing cells. Nevertheless, I cannot recommend the use of functions in preference to the Data Analysis tools, because the result of using a function is a single number, in this case the 2-tail probability of the t-statistic. The function does not give you the t-statistic itself, the degrees of freedom, or any
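The paired t-test discussed in this section depends on dropping every subject that lacks either measurement before the differences are taken. The sketch below shows that step explicitly and returns the t-statistic, the degrees of freedom and the number of pairs together. The X and Y values are invented (the original data table is not available here), with one missing Y and one missing X to mirror the situation described; `paired_t` is a hypothetical helper, and no p-value is computed.

```python
import math
import statistics

def paired_t(xs, ys):
    """Paired t-statistic using only subjects with both measurements present.

    Returns (t, degrees_of_freedom, number_of_pairs).
    """
    diffs = [x - y for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    t = statistics.mean(diffs) / standard_error
    return t, n - 1, n

# Hypothetical measurements; None marks a missing value.
x = [10.5, 11.2, 12.1, 9.7, 13.4, 12.8, 10.9, 11.9, 10.1, None]
y = [9.8, None, 11.0, 9.1, 12.2, 12.0, 10.1, 11.3, 9.5, 11.8]

t_stat, df, n_pairs = paired_t(x, y)
print(f"t = {t_stat:.2f}, df = {df}, complete pairs = {n_pairs}")
```

Printing the pair count alongside the statistic makes the kind of mismatch described above (9 observations reported where only 8 are usable) easy to spot.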
