V. Gathering and Exploring Data

With the language of probability in our vocabulary, we're now ready to talk about sampling and analyzing data.

Data Analysis We can divide statistical methods into roughly two categories: exploratory data analysis and inference. Exploratory analysis answers the question: What do we observe? Inference answers the question: Based upon what we observe, what conclusions can we draw about the underlying population?

Example V.A According to a recently available General Social Survey (see http://www.norc.org/projects/general+social+survey.htm), 26.2% of 4492 randomly polled Americans attend church at least weekly. Does this result provide any evidence that only a minority of Americans attend weekly religious services? What do we observe? We observe that 1177 of 4492 randomly sampled adults (26.2%) go to church at least weekly. What can we infer about the underlying population? What does this sample say about the proportion of all Americans who attend weekly services? Does this survey provide evidence that a majority do not attend weekly? Can we make a reliable judgment about general opinion using a sample of this size?

The Purpose of Statistics [Diagram: a population, described by parameters, and a sample drawn from it, described by estimates (statistics).] Our objective is to characterize the underlying population from which the sample was taken; i.e., we want to infer something about the population using information from the sample.

Key Terminology A parameter is a quantity that characterizes the population (such as the population mean or population variance). A statistic is a quantity computed from the sample. Statistics are often used to estimate population parameters. For example, the sample mean $\bar{X}$ is used to estimate the population mean $\mu$. A random sample is one for which every subject in the population has some chance of selection. A simple random sample (or SRS) is a random sample for which each subject in the population has an equal chance of selection.

Representative Samples No statistical methods, no matter how sophisticated, will help in understanding the underlying population if the sample is not representative of the population of interest! A sample is representative if it is random. A random sample with n = 5 is better than a nonrandom (i.e., biased) sample with n = 10,000! Example V.B Read the handout Samples in History. How did nonrandom sampling lead to biased observations?

Controlled Experiments One way of gathering data is through a controlled experiment, such as a randomized trial. For example, in order to test the efficacy of a new drug or treatment, we may randomly divide the study subjects into two groups: treatment and control. Controlled experiments are often considered the gold standard, in the sense that they eliminate spurious associations due to confounding.
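As a rough illustration of the random division into treatment and control groups described above, the following Python sketch shuffles a list of subjects and splits it in half. The function name randomize, the subject labels, and the fixed seed are hypothetical choices made for this example only; a real trial would follow its own randomization protocol.

```python
import random


def randomize(subjects, seed=None):
    """Randomly split subjects into two groups of (nearly) equal size.

    Hypothetical helper for illustration: returns (treatment, control).
    """
    rng = random.Random(seed)      # a seed makes the split reproducible
    shuffled = list(subjects)      # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Ten made-up subject labels, purely for demonstration.
subjects = [f"subject_{i}" for i in range(1, 11)]
treatment, control = randomize(subjects, seed=0)
print("treatment:", sorted(treatment))
print("control:  ", sorted(control))
```

Because chance alone decides each assignment, any confounder should on average be balanced between the two groups.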

Confounding Factors Confounding is best understood through examples. It is easy to demonstrate, for instance, that incidents of drowning increase when more ice cream is consumed. Also, drunk driving deaths increase with increased chocolate sales. Does eating ice cream cause drowning, or does eating chocolate cause accidents due to impaired driving? Can you think of explanations for these associations? The confounder in each case is a third variable that is associated with both the supposed cause and the effect.

Observational Studies Sometimes, it is unethical or infeasible to randomize subjects to treatment groups. For example, how might we carry out a randomized study to determine the effects of smoking? In the absence of a controlled study, we rely on observational data (or a survey). However, we must always use caution when interpreting the results of an observational study. Remember: ASSOCIATION does not necessarily imply CAUSATION.

Example V.C For years, investigators reported, based on observational data, that hormone replacement therapy (HRT) among postmenopausal women leads to other health benefits, including lowering the risk of heart disease. Note that this assumption has widespread repercussions: HRT was prescribed almost universally to help ease uncomfortable symptoms of menopause. Recently, a randomized HRT trial reached a seemingly opposite conclusion: women on HRT were at higher risk of heart disease (on average) than women on placebo. For example, see: http://www.nhlbi.nih.gov/health/women/pht_facts.pdf Can you think of an explanation for these seemingly contradictory findings?

Example V.D Read the handout titled Readings on Controlled Experiments. Consider these questions: Why might a controlled experiment be more ethical than an observational study? When is a controlled experiment not ethical? Considering the last question, how should a researcher design a study to ensure she is following ethical practices?

Exploratory Data Analysis: Visualizing Data Recall that there are two types of variables we can measure: Categorical Variables (e.g., gender, race, political party affiliation). Continuous Variables (e.g., height, weight, age, income). For categorical variables, we can visualize data using a bar chart or pie chart. For continuous variables, we often visualize the distribution of the data using a histogram, stem-and-leaf plot, or boxplot.

Bar Charts A bar chart is a plot where each possible category of the variable is represented by a bar. The height of a given bar is proportional to the number of observations from the sample that fall in the associated category. Example V.E Sketch a bar chart for gender, using our class as a sample.

Pie Charts A pie chart is a plot where the whole sample is represented by a circle, or pie. The slice of pie associated with a given category is sized proportionally to the number of subjects that fall in that category. For example, if 1/3 of the sample subjects fall in a given category, then the interior angle of the slice for that category is (1/3)(360) = 120 degrees. Example V.F Sketch a pie chart for gender, using our class as a sample.

Histograms A histogram helps us to visualize the distribution of a sample of continuous data. We construct a histogram by following these steps: Divide the range of possible values of the variable into intervals, called bins. The width of the intervals is sometimes referred to as the binwidth. Count the number of observations that fall within each interval. Represent each interval with a bar, whose height is proportional to the number of observations falling within that interval, or bin.

Additional Comments on Histograms Don't confuse a histogram with a bar chart. The horizontal axis of a histogram contains intervals over a continuum, whereas the bar chart has categories that are possibly unordered (like race, for example). The binwidth is somewhat arbitrary. The width should be chosen so that the plot is aesthetic: not overly smooth (width too wide) and not too ragged (width too narrow). Statistical software packages will select a default binwidth, but will generally also give you the option of explicitly choosing a different binwidth.

Example V.G The following table shows the number of O-ring incidents (failures) versus temperature in space shuttle take-offs up to and including the Challenger disaster of 1986.

O-ring Incidents   Temperature (degrees Fahrenheit)
None               66 67 67 67 68 68 70 70 72 73 75 76 76 78 79 80 81
One                57 58 63 70 70
Two                75
Three              53

Example V.G, cont'd Construct three histograms for all of the launch temperatures combined: one with a binwidth of 2.5 degrees, a second with a binwidth of 5 degrees, and a third with a binwidth of 10 degrees. Start the first bin of all three histograms at 50 degrees. Which of the three do you prefer, and why?
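The histogram construction steps can be sketched in a few lines of Python. This is one possible approach, not a prescribed one: the function name bin_counts is made up for this example, and the convention that each bin includes its left endpoint but not its right endpoint is an assumption (software packages differ on this). The temperatures are those listed in Example V.G.

```python
import math

# Launch temperatures from Example V.G, all incident groups combined.
temps = [66, 67, 67, 67, 68, 68, 70, 70, 72, 73, 75, 76, 76, 78, 79, 80, 81,
         57, 58, 63, 70, 70,
         75,
         53]


def bin_counts(data, start, width):
    """Count observations per bin [start + k*width, start + (k+1)*width).

    Assumes every value in data is >= start (left-closed, right-open bins).
    """
    n_bins = math.floor((max(data) - start) / width) + 1
    counts = [0] * n_bins
    for x in data:
        counts[int((x - start) // width)] += 1
    return counts


# Text histograms for the three binwidths, each starting at 50 degrees.
for width in (2.5, 5, 10):
    print(f"binwidth {width}:")
    for k, count in enumerate(bin_counts(temps, 50, width)):
        lo = 50 + k * width
        print(f"  [{lo:g}, {lo + width:g})  {'*' * count}")
```

Printing one row of asterisks per bin gives a sideways histogram, which is enough to compare how ragged or smooth each binwidth looks.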

Interpreting a Histogram The purpose of a histogram is to visually assess the distribution of your sample. We are generally interested in four characteristics: 1. Symmetry versus skewness: is the distribution roughly symmetric? If not, is it right-skewed (or positively skewed)? Is it left-skewed (negatively skewed)? 2. Center: around what approximate value do the observations cluster? 3. Mode: roughly how many peaks do we observe in the distribution? One (is it unimodal)? Two (is it bimodal)? 4. Outliers: are there any observations that lie relatively far away from the main body of data?

Skewness A symmetric distribution appears to have roughly equal numbers of observations on either side of the midpoint of the distribution. A sample from a normally distributed population (such as height or blood pressure measurements, for example) would appear to have a symmetric distribution. A right (or positively) skewed distribution appears to have the bulk of the data clustered toward the lower end of the distribution, with the proportion of larger values tailing off to the right. Measurements such as annual income tend to have positively skewed distributions. A left (or negatively) skewed distribution appears to have the bulk of the data clustered toward the upper end of the distribution, with the proportion of smaller values tailing off to the left. Homework scores in this class have a left-skewed distribution: most students tend to score high (between 20 and 25), with the proportion of lower scores tailing off to the left.

Example V.H The data used for the following three histograms were randomly generated using computer software. Can you describe the distribution in each case?

Example V.H, cont'd [Figures: the histograms for this example are not reproduced in the transcription.]

Example V.I Given the histogram for the space shuttle temperatures from Example V.G (seen below with a binwidth of 5), describe the distribution.

Stem-and-Leaf Plots A type of plot closely related to the histogram is the stem-and-leaf plot (or stemplot, for short). Both plots have the same purpose: to help us to quickly characterize the distribution of our sample. The stemplot, however, allows us to view the actual measurements in the sample while still providing a graphical description of their distribution. In a stemplot, the stems are analogous to the bins of a histogram. The leaves represent individual observations belonging to each stem. The more leaves in a stem, the more observations fall within that range of the data. THE IDEA: Choose stems that correspond to a base digit of the measurements (such as the 10's or 100's digit, for example).

Example V.J Construct a stem-and-leaf plot for the shuttle launch temperatures in Example V.G. Use a stem corresponding to the tens digit in each temperature, with ten leaves per stem. Construct another stemplot that splits the stems of your first plot in two (i.e., five leaves per stem). Construct a third stemplot that splits the stems of the first plot into five (i.e., two leaves per stem). Which do you prefer, and why?
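One way to build such stemplots programmatically is sketched below in Python. The function name stemplot and its splits parameter are hypothetical, introduced only for this illustration; the sketch assumes two-digit non-negative integer data with the tens digit as the stem, as in this example.

```python
from collections import defaultdict

# Launch temperatures from Example V.G, all incident groups combined.
temps = [66, 67, 67, 67, 68, 68, 70, 70, 72, 73, 75, 76, 76, 78, 79, 80, 81,
         57, 58, 63, 70, 70,
         75,
         53]


def stemplot(data, splits=1):
    """Return the lines of a stem-and-leaf plot with tens-digit stems.

    splits is how many rows each stem is divided into (1, 2, or 5),
    so each row can hold 10 // splits distinct leaf digits.
    """
    leaves_per_row = 10 // splits
    rows = defaultdict(list)
    for x in sorted(data):
        stem, leaf = divmod(x, 10)
        rows[(stem, leaf // leaves_per_row)].append(leaf)
    first, last = min(rows), max(rows)
    lines = []
    for stem in range(first[0], last[0] + 1):
        for part in range(splits):
            if not first <= (stem, part) <= last:
                continue  # skip rows before the first or after the last value
            leaves = "".join(str(leaf) for leaf in rows.get((stem, part), []))
            lines.append(f"{stem} | {leaves}")
    return lines


for splits in (1, 2, 5):
    print(f"{10 // splits} leaves per stem:")
    print("\n".join(stemplot(temps, splits)))
```

Rows with no observations between the smallest and largest value are still printed, so that gaps in the data remain visible, just as empty bins do in a histogram.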

Choosing Stems You can see that the same issues regarding binwidth for histograms apply to stemplots: you don't want to split stems too much, or the plot will look too ragged. Not splitting enough results in an oversmooth plot. Either extreme results in a graph that is difficult to interpret.

Exploratory Data Analysis: Summarizing Data The purpose of exploratory data analysis is to describe the distribution of the sample. Summary statistics help to reduce the data to just a handful of numbers that help an investigator to quickly characterize the distribution. We will discuss two types of summary statistics: Measures of Center: mean, median, mode. Measures of Spread (Variability): variance, interquartile range, range.

Summary Statistics: Measures of Center The sample mean is defined as $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. The sample median is defined as the middle observed value. In other words, let $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ represent the sample, sorted from the smallest to the largest value. If n is odd, then the sample median is the middle observation, or $X_{([n+1]/2)}$. If n is even, then the sample median is the average of the middle two ordered observations, or $(X_{(n/2)} + X_{([n/2]+1)})/2$. The sample mode is the most frequently observed value in the data set.

Example V.K We observe the data 4, 3, 4, 2, 8. What are the sample mean, median, and mode?
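The three measures of center can be checked with a short Python sketch that follows the definitions given for the sample mean, median, and mode. The function names are illustrative only, and returning every tied value for the mode is an assumption made here, since the notes do not say how ties are handled.

```python
from collections import Counter

data = [4, 3, 4, 2, 8]  # the sample from Example V.K


def sample_mean(xs):
    """Sum of the observations divided by the sample size n."""
    return sum(xs) / len(xs)


def sample_median(xs):
    """Middle ordered value (n odd) or average of the middle two (n even)."""
    s = sorted(xs)
    n = len(s)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]          # X_((n+1)/2), shifted to 0-based
    return (s[n // 2 - 1] + s[n // 2]) / 2  # average of X_(n/2) and X_(n/2+1)


def sample_modes(xs):
    """All values tied for the highest frequency (assumed tie rule)."""
    counts = Counter(xs)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


print("mean:  ", sample_mean(data))
print("median:", sample_median(data))
print("mode:  ", sample_modes(data))
```

For these data the sum is 21, so the mean is 21/5 = 4.2; the ordered sample is 2, 3, 4, 4, 8, so the median and the mode are both 4.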

Interpreting and Comparing the Sample Mean and Median The sample median is often described as robust. This simply means that it is not greatly affected by extreme or outlying observations. The sample mean, on the other hand, is not robust: it can be highly sensitive to extreme observations. Suppose, for example, that we also observe the value 1000 in addition to the data sampled in Example V.K. How does this affect the values of the sample mean and median?
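A quick numerical check of this sensitivity, using Python's standard statistics module on the Example V.K data with and without the extra observation of 1000 (the variable names are chosen only for this illustration):

```python
from statistics import mean, median

data = [4, 3, 4, 2, 8]          # the sample from Example V.K
with_outlier = data + [1000]    # same sample plus one extreme observation

for label, xs in (("original", data), ("with 1000", with_outlier)):
    print(f"{label:>10}: mean = {mean(xs):.2f}, median = {median(xs)}")
```

The mean jumps from 4.2 to 1021/6, or roughly 170.17, while the median stays at 4, which is what robustness means in practice.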

Interpreting and Comparing the Sample Mean and Median Because the sample mean is more sensitive to extreme observations than the sample median, how do you think the mean and median compare (i.e., which will be larger) under the following circumstances? Distribution is right-skewed. Distribution is left-skewed. Distribution is symmetric. When reporting about individual or household income, why do researchers most often use median income as a measure of center?

The Trimmed Mean The sample trimmed mean is a sort of compromise between the sample mean and median. The α100% sample trimmed mean is the sample mean of the middle (1 − 2α)100% of the ordered observations, where α is some proportion of the data that will be trimmed from each tail. The quantity α is typically chosen to be something like 0.10 or 0.20. What is the 20% trimmed mean for Example V.K? While the sample trimmed mean is an interesting way of producing a sample mean that is more robust, it is actually not often used in practice.
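A minimal Python sketch of the trimmed mean follows. The function name trimmed_mean is made up for this example, and rounding αn down to a whole number of observations when it is not an integer is an assumption; the notes do not specify a rule for that case.

```python
import math


def trimmed_mean(xs, alpha):
    """Mean of the ordered sample after dropping a fraction alpha per tail.

    Assumption: the number trimmed from each tail is floor(alpha * n).
    """
    if not 0 <= alpha < 0.5:
        raise ValueError("alpha must satisfy 0 <= alpha < 0.5")
    s = sorted(xs)
    # The tiny offset guards against floating-point error in alpha * n.
    k = math.floor(alpha * len(s) + 1e-9)
    kept = s[k:len(s) - k]
    return sum(kept) / len(kept)


data = [4, 3, 4, 2, 8]  # the sample from Example V.K
print("20% trimmed mean:", trimmed_mean(data, 0.20))
print("with 1000 added: ", trimmed_mean(data + [1000], 0.20))
```

With n = 5 and α = 0.20, one observation is dropped from each tail, leaving 3, 4, 4, whose mean is 11/3. Adding the observation 1000 changes the result only to 4.75 under the same rounding rule, since the extreme value is trimmed away.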

Summary Statistics: Measures of Spread The sample variance $s^2$ is defined as $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right)$. The interquartile range (or IQR) is defined as IQR = Q3 − Q1, where Q1 is the first quartile (sample 25th percentile) and Q3 is the third quartile (sample 75th percentile). Note that the second quartile Q2 is the sample median. The range is defined as the largest observed value less the smallest observed value, or $X_{(n)} - X_{(1)}$.

Computing the Quartiles Note that the pth sample percentile is the value (based on the sample) such that p% of the sample lies below that value. The quartiles Q1 and Q3 are hence the sample 25th and 75th percentiles. How do we find these quartiles? Divide the ordered data set in half (e.g., if n = 20, then we use $X_{(1)}, \ldots, X_{(10)}$ to find Q1 and $X_{(11)}, \ldots, X_{(20)}$ to find Q3; if n = 21, then we use $X_{(1)}, \ldots, X_{(10)}$ to find Q1 and $X_{(12)}, \ldots, X_{(21)}$ to find Q3). To determine Q1, if the number of observations comprising the first half of the ordered data is odd, then Q1 is the middle of these. If the number of observations is even, then Q1 is computed as the value lying 25% of the distance between the middle two of these observations. Q3 is found in a similar manner, applying the same logic to the second half of the ordered observations.

Example V.L What are the sample variance $s^2$, the standard deviation s (the square root of the variance), the IQR, and the range for the data in Example V.K? How do these values change if an additional observation equal to 1000 is observed?
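The variance, standard deviation, and range can be checked with the Python sketch below, which also confirms numerically that the definitional and shortcut forms of the sample variance agree. The function names are illustrative only. The IQR is left out on purpose, because its value depends on the quartile convention used.

```python
import math


def sample_variance(xs):
    """Definitional form: squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


def sample_variance_shortcut(xs):
    """Shortcut form: (sum of squares - n * mean squared), divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return (sum(x * x for x in xs) - n * xbar ** 2) / (n - 1)


def sample_range(xs):
    """Largest observed value less the smallest observed value."""
    return max(xs) - min(xs)


data = [4, 3, 4, 2, 8]  # the sample from Example V.K
for label, xs in (("original", data), ("with 1000", data + [1000])):
    var = sample_variance(xs)
    print(f"{label:>10}: s^2 = {var:.4f} "
          f"(shortcut {sample_variance_shortcut(xs):.4f}), "
          f"s = {math.sqrt(var):.4f}, range = {sample_range(xs)}")
```

For the original data the squared deviations sum to 20.8, giving a variance of 20.8/4 = 5.2 and a range of 6. A single observation of 1000 inflates the variance to about 165,274 and the range to 998, so neither measure is robust.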

The Boxplot A boxplot (or box-and-whisker plot) allows us to visualize the distribution of our sample via the so-called five-number summary, which consists of the three quartiles along with the minimum and the maximum. The diagram below shows how we construct such a chart. [Diagram: a box-and-whisker plot with its landmarks labeled, from left to right, $X_{(1)}$, Q1, Q2, Q3, $X_{(n)}$.] Computer software packages sometimes add other features, like an asterisk that indicates the location of the sample mean.

Example V.M Construct a boxplot for the data in Example V.K.

Example V.N Side-by-side boxplots are often useful in comparing the distributions of a continuous variable between two groups. The boxplots below show the distribution of launch temperatures for shuttle missions (through 1986) with zero O-ring failures, versus the distribution for missions with at least one O-ring failure. What do you observe? [Figure: side-by-side boxplots of launch temperature, on an axis running from 55 to 80 degrees, for missions with no failures and missions with at least one failure.]

Stat 3000 Statistics for Scientists and Engineers

Example V.O The following data are pre-AZT serum antigen levels measured in a study of 20 AIDS patients.

Patient   Serum Antigen Level (pg/ml)     Patient   Serum Antigen Level (pg/ml)
1         149                             11        180
2         0                               12        0
3         0                               13        84
4         259                             14        89
5         106                             15        212
6         255                             16        554
7         0                               17        500
8         52                              18        424
9         340                             19        112
10        65                              20        2600

Example V.O, cont'd What are the mean, median, and mode? What are the variance, standard deviation, and IQR? Draw a boxplot for these data.