Statistics: Interpreting Data and Making Predictions. Interpreting Data 1/50


Last Time
Last time we discussed central tendency; that is, notions of the middle of data. More specifically, we discussed the mean (= average) and the median. Depending on the situation, one of these may be more indicative of the middle of the data than the other. We will see today that talking about the middle of the data tells only part of the story. On Tuesday we discussed some weather data for San Francisco. Let's look at weather data for both San Francisco and Las Cruces. We will also calculate the mean and median for both data sets.

Histograms (= Bar Charts)
Since numbers often do not resonate with people, visual representations often make data more meaningful and understandable. We will talk about a couple of ways to represent data visually; one of them we saw last time. A histogram, or bar chart, is the most common way to represent numerical data. We'll illustrate histograms with weather data for high temperatures on January 1 in San Francisco. We'll see some different pictures representing the same data.

Here is a histogram we saw last time.

Each value, in this case a temperature, is drawn with a vertical bar. The height of the bar represents how many times that value occurs. The values are listed on the horizontal axis in increasing order from left to right. We can get a different look at the data by graphing ranges of temperature values rather than each individual value. The following chart does this.

Clicker Question
Estimate the average high temperature from this chart and enter your estimate with your clicker.

Answer
Any answer between 57 and 59 would be reasonable. The actual average is 58.5. Because this chart graphs ranges of temperatures, we can't read the exact average from it, but we can get a reasonable estimate fairly quickly.
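One way to make such an estimate systematic is to pretend every value sits at the midpoint of its range. The sketch below does this in Python. The ranges and counts are made up for illustration and are not the lecture's data, and `estimate_mean` is a hypothetical helper name.

```python
# Hypothetical binned counts (NOT the lecture's data): each entry is
# (low end of range, high end of range, number of years in that range).
bins = [(50, 54, 4), (55, 59, 15), (60, 64, 10), (65, 69, 2)]

def estimate_mean(bins):
    """Estimate the mean by treating every value as its range's midpoint."""
    total_count = sum(count for _, _, count in bins)
    weighted_sum = sum((low + high) / 2 * count for low, high, count in bins)
    return weighted_sum / total_count

print(round(estimate_mean(bins), 1))
```

The result is only an estimate, because the chart hides where the values fall inside each range.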

One thing that is often done with visual displays is to compare two sets of data. Let's look at some more weather data. We will show the high temperatures on January 1 from 1977 to 2007 for both San Francisco and Las Cruces. We will also calculate the mean and median for both data sets. We will look at histograms for both sets of data. We can graph each set separately or both together. We'll see both.

[Table: high temperatures on January 1, 1977 to 2007, for San Francisco and Las Cruces]

Trying to see how temperatures in the two cities compare is really hard with the table of data above. We'll see that we can get some sense of the difference with histograms. One interesting point of this data is the following calculation of central tendency, which we'll demonstrate with an Excel spreadsheet. The results are

            San Francisco   Las Cruces
Mean             58.5          58.9
Median           58.0          58.5

The mean and median are virtually identical for the two cities. We will now draw histograms in a similar way as we did earlier.

San Francisco and Las Cruces High Temperatures, Jan. 1


Clicker Question
What stands out to you about these charts and could also be significant?
A. One uses red and one uses blue
B. The Las Cruces data is more spread out
C. The biggest bars for SF are larger than those for LC
D. A and B
E. B and C

Answer
It is true, and significant, that the Las Cruces data is more spread out. That means there is more variation in high temperatures for Las Cruces than for San Francisco. This is due, in part, to the ocean's moderation of the San Francisco weather versus Las Cruces being in the desert. It is also true that the highest bars for San Francisco are higher than those for Las Cruces. This is actually a consequence of the previous observation. We'll see other graphs that share this property.

While the mean and median of the two sets of temperature data are nearly the same, the graphical representation makes the data look very different. The data for Las Cruces is spread out much more than that of San Francisco. A calculation of the middle of the data presents only part of the story. The dispersion or deviation of the data is also an important part of the data. While there are several measures of deviation, the most common one is called standard deviation.

Standard Deviation
The most basic property of standard deviation is: the larger the standard deviation, the more spread out the data. That is, the larger the deviation, the farther the data is from the middle. We won't formally define standard deviation or give a formula for how to calculate it. Spreadsheets have built-in functions to calculate standard deviation.

The point of measuring deviation is to give a sense of how far data is from the middle, or the average. Standard deviation approximately measures the average of how far data is from the middle. For example, the standard deviations of the San Francisco and Las Cruces weather data are 4.0 and 8.0, respectively. The standard deviation for Las Cruces is much larger than for San Francisco, reflecting that the Las Cruces data is more spread out.
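Just as spreadsheets have built-in functions for this, so does Python's standard `statistics` module. The sketch below uses two made-up lists (not the lecture's temperature data) that share the same mean and median but differ in spread. It uses `statistics.stdev`, the sample standard deviation; which variant the lecture's spreadsheet used is not stated, so that choice is an assumption.

```python
import statistics

# Hypothetical temperatures (NOT the lecture's data), chosen so that both
# lists share the same mean and median but differ in spread.
clustered = [54, 56, 57, 58, 58, 59, 60, 62]
spread_out = [46, 50, 54, 58, 58, 62, 66, 70]

for name, data in (("clustered", clustered), ("spread out", spread_out)):
    print(name, statistics.mean(data), statistics.median(data),
          round(statistics.stdev(data), 1))
```

Both lists have the same middle, yet the second list's standard deviation is several times larger, which is the pattern seen with the two cities.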

Some Coin Flipping Data
Let's look at the experiment of flipping a coin repeatedly. We will simulate this with Excel and the computer program Maple. In the experiment we simulate a bunch of people each flipping a coin 100 times and determining the percentage of flips that came up heads. We'll first look at an Excel spreadsheet, Coin Flip Distribution.xlsx.

Excel isn't the best way to simulate a large amount of data. The following charts were created with the program Maple. We used this program to demonstrate RSA encryption at the beginning of the semester.

Simulation of flipping 100 coins 1,000 times

Simulation of flipping 100 coins 10,000 times

Simulation of flipping 100 coins 100,000 times

Simulation of flipping 100 coins 500,000 times

Simulation of flipping 100 coins 1,000,000 times
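A simulation of this kind can also be sketched in a few lines of Python. This is not the Excel or Maple code used in class; the function name `simulate`, the fixed seed, and the text-based bars are choices made here for illustration.

```python
import random
from collections import Counter

def simulate(people, flips=100, seed=0):
    """Count how many simulated people got each percentage of heads."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    counts = Counter()
    for _ in range(people):
        heads = sum(rng.random() < 0.5 for _ in range(flips))
        counts[round(100 * heads / flips)] += 1
    return counts

counts = simulate(people=2000)
# Print a rough sideways histogram for every fifth percentage from 35 to 65.
for percent in range(35, 66, 5):
    print(f"{percent:3d}% {'#' * (counts[percent] // 4)}")
```

Raising `people` plays the role of the larger and larger simulations in the charts above.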

As the number of repetitions gets larger and larger, the graph looks more and more regular. Note that the graphs are roughly symmetric, and that the middle of each graph is at 50, the expected percentage of heads.
Q: Does the shape of the graphs, especially the later ones, look at all familiar?
A. Yes
B. No

The Bell Curve (or Normal Curve)

The importance of the bell curve is that as the number of trials gets larger and larger, histograms generally look more and more like a bell curve. The particular shape of the bell curve reflects the mean and the standard deviation. The center of the curve represents the mean. How wide or thin the curve is indicates the standard deviation. The larger the standard deviation, the wider the curve.

Bell Curves with Different Standard Deviations
Standard Deviation = 5        Standard Deviation = 2
For both of these graphs the mean is 50.

The blue and red graphs have mean 0 and the green graph has mean 2. The blue graph has the smallest standard deviation, followed by the green graph, and finally by the red graph, which has the largest standard deviation.

Standard Deviation and Number of Coins Flipped
Let's go back to the coin flipping experiment. How do things change if each person changes the number of flips? The following graphs represent a simulation of 50,000 people flipping a coin repeatedly. Each person determines the percentage of heads on their flips. The first graph represents each person flipping a coin 10 times and recording the percentage of heads. The second graph represents each person flipping 100 times.

Clicker Question
Which graph shows more variability in the data?
A. The graph on the left (10 flips per person)
B. The graph on the right (100 flips per person)

Answer
The first graph has more variability. While nearly all the data of the second graph is between 40 and 60, a lot of the data in the first graph is outside that range.
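The difference in spread can also be computed directly. The lecture does not give a formula, so the one below is brought in from standard probability: for a fair coin, the standard deviation of the proportion of heads in n flips is the square root of 0.5 × 0.5 / n. The function name `sd_of_percent_heads` is made up for this sketch.

```python
import math

def sd_of_percent_heads(flips, p=0.5):
    """Standard deviation, in percentage points, of the percentage of heads."""
    return 100 * math.sqrt(p * (1 - p) / flips)

for flips in (10, 100):
    print(flips, round(sd_of_percent_heads(flips), 1))
```

Ten flips per person gives a standard deviation of roughly 16 percentage points, while 100 flips gives 5, which matches the much narrower second graph.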

Pie Charts
We will come back to uses of standard deviation, but we'll talk about another visual representation of data first. Pie charts are used to break a whole into pieces, typically with percentages, according to different categories. Here is an example.

Pie charts are often used to give visual displays of data. The size of a wedge is proportional to the size of the data it represents. For example, to represent 25%, you draw a wedge that is 25%, or 1/4, of the full circle. Pie charts are typically useful for showing the relative sizes of a few categories. When some categories are close in size, pie charts lose some of their effectiveness. Trying to make a pie chart fancier is a good way to make the data less clear, and even to encourage misconceptions about the data.
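The proportionality rule is simple arithmetic: a full circle is 360 degrees, so each category gets its percentage of 360. A small sketch, with made-up category names and percentages:

```python
# Hypothetical category percentages (they must sum to 100).
shares = {"A": 25, "B": 25, "C": 40, "D": 10}

def wedge_angles(shares):
    """Convert each percentage of the whole into a wedge angle in degrees."""
    return {name: 360 * percent / 100 for name, percent in shares.items()}

angles = wedge_angles(shares)
print(angles)
```

A 25% category becomes a 90-degree wedge, one quarter of the circle.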

Clicker Question
Between the blue and the green regions, which represents the larger number?
A. The green region
B. The blue region
C. They represent the same number
D. I can't tell

Answer
Whether or not it appears to be the case, the blue and green regions both represent 25%. These were created by Excel. The spreadsheet Pie Charts.xlsx shows two variants of a pie chart for the same data.

For another example of a badly designed pie chart, how easy is it to get information from the following chart?

Misleading Charts
Here are some webpages demonstrating poorly designed, misleading, or incorrect charts.
www.damirsystems.com/?p=99
lovestats.wordpress.com/2009/03/11/pie-charts-our-evil-friend
www.huffingtonpost.com/2011/10/03/nancy-pelosis-debt-chartobama-bush n 989425.html
obamapacman.com/2011/12/google-android-director-presentsmisleading-chart
wonkette.com/wp-content/uploads/2009/11/193.jpg

Back to Standard Deviation
In terms of the normal curve, standard deviation can be interpreted approximately with the following rules of thumb:
68% of all data is within 1 standard deviation of the mean.
95% of all data is within 2 standard deviations of the mean.
99.7% of all data is within 3 standard deviations of the mean.

The letter σ is an abbreviation for the standard deviation, and µ for the mean. Finding the area under a curve was one of the problems that led to the development of calculus in the 17th century.
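The three rules of thumb are areas under the normal curve, and they can be checked numerically. The sketch below uses the error function from Python's `math` module, which gives the area within k standard deviations of the mean as erf(k / √2). The helper name `within` is made up here.

```python
import math

def within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, f"{100 * within(k):.1f}%")
```

The computed values are slightly more precise than the rules of thumb, which round them to 68%, 95%, and 99.7%.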

How to Tell if a Coin is Unfair?
We'll use the normal distribution to get some sense of how to tell if a coin is fair. Suppose you flip a coin and get at least 60% heads. Can you conclude the coin is unfair?

Clicker Question
Q: Suppose you flip a coin 10 times and get 6 heads. Do you think this is good evidence to conclude the coin is unfair?
A. Yes
B. No
Answer: No, it really isn't. If you flip a fair coin 10 times, you will get 6 or more heads quite often. In fact, the probability of getting 6 or more heads is about 38%, which we can estimate with our spreadsheet.
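For a small number of flips this probability can be counted exactly rather than estimated. The sketch below counts the sequences of 10 flips with at least 6 heads and divides by the total number of sequences, assuming a fair coin. The function name `prob_at_least` is made up for this example.

```python
from math import comb

def prob_at_least(heads, flips):
    """Exact probability of at least `heads` heads in `flips` fair coin flips."""
    favorable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favorable / 2 ** flips

print(prob_at_least(6, 10))
```

Out of 1,024 equally likely sequences, 386 have 6 or more heads, so this outcome is far too common to count as evidence against the coin.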

Let's suppose we flip a coin 100 times and get at least 60% heads. We can ask what the probability is that this happens. Let's imagine we do this many, many times. As our simulations indicate, we can consider the distribution of trials as giving us a bell curve. Based on the data from Coin Flip Distribution.xlsx, the mean for this curve is 50% and the standard deviation is 5%. Then 60% is two standard deviations to the right of the mean. Since 95% of the data is within 2 standard deviations of the mean, the remaining 5% is split evenly between the two ends, so the amount of data to the right of 60% is approximately 2.5% of the data. So there is only roughly a 2% chance that flipping a coin 100 times results in at least 60% heads, but that is not so small. Unless you had a reason to think the coin might be unfair, it would probably be hard to argue from this data that the coin is unfair, even though it is not too likely to get at least 60% heads.
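Both the bell-curve estimate and the exact answer can be computed directly. In the sketch below, the first number is the area of a normal curve more than 2 standard deviations above the mean, and the second is an exact count for a fair coin. Neither calculation comes from the class spreadsheet; they are added here as a check.

```python
import math
from math import comb

# Normal estimate: area more than 2 standard deviations above the mean.
normal_tail = 0.5 * math.erfc(2 / math.sqrt(2))

# Exact count for a fair coin: at least 60 heads in 100 flips.
exact_tail = sum(comb(100, k) for k in range(60, 101)) / 2 ** 100

print(f"{normal_tail:.4f} {exact_tail:.4f}")
```

The two values differ a little because the bell curve is only an approximation to the coin-flip distribution, but both land in the 2% to 3% range.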

To think a little further about it, suppose a class of 100 students each flipped a coin 100 times. Even if everybody had a fair coin, we'd expect, on average, 2 of the 100 students to get at least 60% heads. Therefore, while getting this many heads isn't too likely for any one person, with enough people it will happen.
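The arithmetic behind this can be written out, along with a related question: how likely is it that at least one student in the class sees such a result? The sketch takes the 2% figure from the text and assumes the students flip independently.

```python
p = 0.02          # chance one student gets at least 60% heads (from the text)
students = 100

expected = students * p                     # average number of such students
at_least_one = 1 - (1 - p) ** students      # assumes students flip independently

print(expected, round(at_least_one, 3))
```

Under these assumptions it is quite likely that somebody in the class gets at least 60% heads, even though every coin is fair.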

Heights
A common example of a data set that looks like a bell curve is heights of adult males (or females). The webpage www.johndcook.com/blog/mixture distribution says that the average height of an adult male is about 70 inches (5'10") with a standard deviation of 3 inches. Let's explore some questions about this data. Recall the following rules of thumb:
68% of all data is within 1 standard deviation of the mean.
95% of all data is within 2 standard deviations of the mean.
99.7% of all data is within 3 standard deviations of the mean.

Clicker Question
Q: What is the probability of an adult male having height between 5'7" and 6'1"? Enter a two-digit number to represent the probability as a percentage.
A: The answer is approximately 68%; since the average is 5'10" and the standard deviation is 3 inches, we are asking for the percentage of men within 1 standard deviation of the mean.

Clicker Question
Q: What is the probability of an adult male having height between 6'1" and 6'4"? Enter a two-digit number to represent the probability as a percentage.
A: The answer is approximately 13.6%; we are asking for the percentage between 1 and 2 standard deviations above the mean. By the rules of thumb this is half of 95% - 68%, or 13.5%; a more exact normal calculation gives 13.6%.
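Both height questions can be checked with the `NormalDist` class in Python's standard `statistics` module, using the mean of 70 inches and standard deviation of 3 inches quoted in the text. The helper name `prob_between` is made up for this sketch.

```python
from statistics import NormalDist

heights = NormalDist(mu=70, sigma=3)  # inches, values quoted in the text

def prob_between(low, high):
    """Probability that a height falls between low and high (in inches)."""
    return heights.cdf(high) - heights.cdf(low)

print(round(100 * prob_between(67, 73), 1))  # 5'7" to 6'1"
print(round(100 * prob_between(73, 76), 1))  # 6'1" to 6'4"
```

This treats heights as exactly normal, which is only a model of the real data.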

Clicker Question
If a person is 6'10" tall (82 inches), they are 4 standard deviations above the mean, since 70 + 4 × 3 = 82. The probability of having height that much or more is about 0.00003. This can be found from the website stattrek.com/online-calculator/normal.aspx
Q: If there are about 2 billion adult males on the planet, what would be the expected number at least 6'10" tall?
A: The expected number is 2,000,000,000 × 0.00003 = 60,000.
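The tail probability and the expected count can be reproduced without the online calculator. The sketch below computes the chance of being at least 4 standard deviations above the mean under a normal model, then multiplies by the 2 billion figure from the text, once with the rounded probability 0.00003 and once with the unrounded value.

```python
from statistics import NormalDist

adult_males = 2_000_000_000           # figure used in the text
tail = 1 - NormalDist().cdf(4)        # chance of being 4+ SDs above the mean

print(f"{tail:.7f}")
print(round(adult_males * 0.00003))   # using the rounded probability
print(round(adult_males * tail))      # using the unrounded probability
```

With numbers this far out in the tail, small changes in the probability shift the expected count by thousands of people.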

Do you believe this number? The website http://www.ask.com/world-view/many-people-7-feettall-cb4f9973908981f1 estimates that there are around 2,800 people over 7 feet tall. However, the website http://www.answers.com claims there are 20,000 people over 7 feet tall. One thing this shows is that information on the web is not always accurate. A more subtle issue is that the assumption that heights follow a normal distribution may be approximately correct, but extreme values may not fit the model as well as values near the average.

Next Week and Homework #10
Next week is the final week of the semester. On Tuesday we will discuss fractals and focus on their artistic nature. On Thursday we'll wrap up the semester by watching a clip from The Simpsons' Treehouse of Horror VI. In that episode lots of math and science are mentioned in the background, and we'll talk about that. Homework #10 is on the class website. It is due next Thursday by midnight.