Statistics Mathematics 243

Similar documents
Introduction to Statistics

Ch. 1 Collecting and Displaying Data

Math 124: Modules 3 and 4. Sampling. Designing. Studies. Studies. Experimental Studies Surveys. Math 124: Modules 3 and 4. Sampling.

THE ROLE OF THE COMPUTER IN DATA ANALYSIS

Chapter 1: The Nature of Probability and Statistics

Math 124: Module 3 and Module 4

Lecture Slides. Elementary Statistics Eleventh Edition. by Mario F. Triola. and the Triola Statistics Series 1.1-1

INTRODUCTION TO STATISTICS SORANA D. BOLBOACĂ

Data = collections of observations, measurements, gender, survey responses etc. Sample = collection of some members (a subset) of the population

Basic Concepts in Research and DATA Analysis

Statistics are commonly used in most fields of study and are regularly seen in newspapers, on television, and in professional work.

Sampling and Data Collection

CHAPTER 3 DATA ANALYSIS: DESCRIBING DATA

Data collection, summarizing data (organization and analysis of data) The drawing of inferences about a population from a sample taken from

STA Module 1 The Nature of Statistics. Rev.F07 1

STA Rev. F Module 1 The Nature of Statistics. Learning Objectives. Learning Objectives (cont.

A Probability Puzzler. Statistics, Data and Statistical Thinking. A Probability Puzzler. A Probability Puzzler. Statistics.

full file at

EXPERIMENTAL DESIGN Page 1 of 11. relationships between certain events in the environment and the occurrence of particular

STA Module 1 Introduction to Statistics and Data

Population. population. parameter. Census versus Sample. Statistic. sample. statistic. Parameter. Population. Example: Census.

Sta 309 (Statistics And Probability for Engineers)

Vocabulary. Bias. Blinding. Block. Cluster sample

factors that a ect data e.g. study performance of college students taking a statistics course variables include

Unit 1 Exploring and Understanding Data

Chapter 11. Experimental Design: One-Way Independent Samples Design

Lecture 1 An introduction to statistics in Ichthyology and Fisheries Science

Suppose we tried to figure out the weights of everyone on campus. How could we do this? Weigh everyone. Is this practical? Possible? Accurate?

I. Introduction and Data Collection B. Sampling. 1. Bias. In this section Bias Random Sampling Sampling Error

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data

Results & Statistics: Description and Correlation. I. Scales of Measurement A Review

A point estimate is a single value that has been calculated from sample data to estimate the unknown population parameter. s Sample Standard Deviation

Chapter 7: Descriptive Statistics

Sampling. (James Madison University) January 9, / 13

Types of Variables. Chapter Introduction. 3.2 Measurement

Chapter 1 Data Types and Data Collection. Brian Habing Department of Statistics University of South Carolina. Outline

1 The conceptual underpinnings of statistical power

A point estimate is a single value that has been calculated from sample data to estimate the unknown population parameter. s Sample Standard Deviation

AP Psychology -- Chapter 02 Review Research Methods in Psychology

Math Workshop On-Line Tutorial Judi Manola Paul Catalano

Introduction to biostatistics & Levels of measurement

Statistical Decision Theory and Bayesian Analysis

Chapter 01 What Is Statistics?

On the diversity principle and local falsifiability

Handout 16: Opinion Polls, Sampling, and Margin of Error

Political Science 15, Winter 2014 Final Review

For each of the following cases, describe the population, sample, population parameters, and sample statistics.

On the purpose of testing:

Chapter 6: Counting, Probability and Inference

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

Math Workshop On-Line Tutorial Judi Manola Paul Catalano. Slide 1. Slide 3

UNIT I SAMPLING AND EXPERIMENTATION: PLANNING AND CONDUCTING A STUDY (Chapter 4)

9.63 Laboratory in Cognitive Science

CHAPTER 5: PRODUCING DATA

Artificial Intelligence Programming Probability

AP Statistics Exam Review: Strand 2: Sampling and Experimentation Date:

Measuring the User Experience

Chapter 1 Introduction to Educational Research

Statistics. Nur Hidayanto PSP English Education Dept. SStatistics/Nur Hidayanto PSP/PBI

Stat 13, Intro. to Statistical Methods for the Life and Health Sciences.

Introduction: Statistics, Data and Statistical Thinking Part II

Creative Commons Attribution-NonCommercial-Share Alike License

Do not copy, post, or distribute

Probability and Statistics Chapter 1 Notes

Measurement and meaningfulness in Decision Modeling

Statistical Techniques. Masoud Mansoury and Anas Abulfaraj

CHAPTER 1 SAMPLING AND DATA

An InTROduCTIOn TO MEASuRInG THInGS LEvELS OF MEASuREMEnT

Chapter 1: Data Collection Pearson Prentice Hall. All rights reserved

Chapter 5: Producing Data

Math 140 Introductory Statistics

Statistics Concept, Definition

Title of Resource Fast Food Fun: Understanding Levels of Measurement Author(s) Karla M. Batres Institution Caldwell University

Section Introduction

Analysis of Environmental Data Conceptual Foundations: En viro n m e n tal Data

The Institute of Chartered Accountants of Sri Lanka

Chapter 1: Introduction to Statistics

Interpretation of Data and Statistical Fallacies

Exploration and Exploitation in Reinforcement Learning

Introduction to Program Evaluation

1 Introduction to statistics and simple descriptive statistics

Handout on Perfect Bayesian Equilibrium

Two-sample Categorical data: Measuring association

P. 266 #9, 11. p. 289 # 4, 6 11, 14, 17

UNIVERSITY OF THE FREE STATE DEPARTMENT OF COMPUTER SCIENCE AND INFORMATICS CSIS6813 MODULE TEST 2

6. Distinguish between variables that are discrete or continuous.

Sample size calculation a quick guide. Ronán Conroy

Chapter 1: Introduction to Statistics

Statistics for Psychology

OCW Epidemiology and Biostatistics, 2010 David Tybor, MS, MPH and Kenneth Chui, PhD Tufts University School of Medicine October 27, 2010

3 CONCEPTUAL FOUNDATIONS OF STATISTICS

Statistics Lecture 4. Lecture objectives - Definitions of statistics. -Types of statistics -Nature of data -Levels of measurement

Introduction to Biostatics.

Statistical Methods and Reasoning for the Clinical Sciences

Quiz 4.1C AP Statistics Name:

Correlational Research. Correlational Research. Stephen E. Brock, Ph.D., NCSP EDS 250. Descriptive Research 1. Correlational Research: Scatter Plots

Statistical Methods Exam I Review

Chapter 4: Defining and Measuring Variables

Transcription:

Statistics Mathematics 243 Michael Stob February 2, 2005 These notes are supplementary material for Mathematics 243 and are not intended to stand alone. They should be used in conjunction with the textbook and the classroom discussion. Please inform the author of any errors, typographical or otherwise that you find in them. 1 Types of Data The OED defines data as facts and statistics used for reference or analysis. (And the OED notes that while the word data is technically the plural of datum, it is often used with a singular verb and that usage is now generally deemed to be acceptable.) For our purposes, the sort of data that we will use comes to us in collections or datasets. A dataset consists of a set of objects, variously called observations, items, instances, units, or subjects, together with a record of the value of a certain variable or variables defined on the items. Definition 1.1. A variable is a function defined on the set of objects. For example, we record the SAT scores and GPAs of 30 seniors. The 30 seniors are the items. The SAT score of a given senior can be thought of as the value of a function defined on the set of 30 seniors. In this instance then, a variable is simply a numerical characteristic of the items. We will usually use uppercase letters to denote variables (e.g., X) and lowercase letters (e.g., x) to denote specific values of the variables. A number of distinctions among types of variables are important. 1.1 Categorical and Numerical Variables Some variables, like the SAT and GPA variables mentioned above, are numerical variables. In other words, the range of these variables is a set of real numbers. Note that these numbers almost always come with units. Other variables are nonnumerical and for the purpose of statistical analyses simply serve to divide the subjects up into categories. For example, the variable CLASS when applied to Calvin students, has values FR, SO, JR, SR and serves to divide Calvin undergraduates into four categories. Often categorical variables are transformed into numerical variables by numbering the categories (as in 1,2,3,4 in the example of the variable CLASS). As you might expect, we usually ask different kinds of questions about numerical and categorical variables and do different kinds of analyses. 1

1.2 Continuous and Discrete Variables This distinction refers to the possible values that a variable that have, i.e., the range of the variable. Definition 1.2. A variable is continuous if for every ɛ > 0 and possible value x of the variable, there is a possible value y of the variable such x y < ɛ. Almost always, our continous variables have as their range an interval of real numbers. For example, AGE is a variable defined on all living persons and is a real number in the interval (0, ) (once we specify the units such as minutes). Definition 1.3. A variable is discrete if there is ɛ > 0 such that if x, y are two possible values of the variable then x y > ɛ. The variable SAT described above is discrete since the possible values of SAT are the numbers in the set {200, 210, 220,..., 780, 790, 800}. In this instance, ɛ = 5 suffices. Discrete variables usually have values that are natural numbers (or sometimes rational numbers with fixed denominator) such as counts. Not all variables are either discrete or continuous. For example, a variable with possible values in {0,..., 1 5, 1 4, 1 3, 1 2, 1} is neither. We will have different methods of analyses depending on whether a variable is discrete or continuous. However, given the precision of our measuring instruments, most variables that are theoretically continuous are practically discrete. HEIGHT (in inches) is a continuous variable but in practice we report heights to the nearest half or quarter of inch. HEIGHT (to the nearest quarter inch) is a discrete variable. Nevertheless we we will usually assume that the variable is continuous and often assume that our measurement is exact. Similarly, sometimes a discrete variable will be treated as continous if it has too many different values. We will even profit from treating SAT as a continuous variable even though the definition of continuous in this case is not satisfied for ɛ = 10! 1.3 Measurement Scales Numerical variables are generally used to measure something, some underlying property of the object on which the variable is defined. If x 1 and x 2 are values of a variable X on objects O 1 and O 2 respectively, we make inferences about the differences between O 1 and O 2 on the underlying property that variable X measures. For example, if X is AGE, we say someone aged 10 is twice as old as someone aged 5. However we do not say that someone with SAT score 400 did twice as well on the SAT test as someone who scored 200. This distinction has to do with the scale of values used to measure these variables. 1. Nominal scale. A measurement scale for a variable X is a nominal scale if x 1 x 2 means only that O 1 and O 2 are different (with respect to X). A variable with a nominal scale is essentially a categorical variable for which numbers are the names of the categories. For example, for the variable GENDER we could use 1 to mean female and 0 to mean male (or vice versa) or we could use the numbers 37 and 25 respectively. The actual values of the number carry no other significance about gender. 2. Ordinal scales. A measurement scale for the variable X is an ordinal scale if x 1 < x 2 implies that O 1 comes before O 2 in a natural ordering of the variables by what X measures. For example, FINISH-ORDER of runners in a race is an ordinal variable (as is any ranking variable) since the order of the values of the variable is significant. 2

3. Interval scales. A measurement scale for a variable X is an interval scale if x 1 x 2 = r means that object X 1 has r more of some quantity than object O 2. For example, TEMPERATURE as a measure of heat is an interval variable since a difference of 10 degrees means a certain heat difference regardless of where on the scale it falls. The information in an interval scale is preserved by any linear function but not by a nonlinear function. FINISH-ORDER is not an interval variable (as a measure of speed, at least) as the difference in speed or time to finish between runners 1 and 2 is not necessarily the same as that between runners 2 and 3. 4. Ratio variables. A measurement scale for a variable X is a ratio scale if x 1 /x 2 = r implies that O 1 has r times as much of some quantity as does object O 2. HEIGHT is a ratio variable since twice as tall means twice of something (linear space consumed). TEMPERATURE is not a ratio variable (as a measure of heat) since 20 degrees is not twice as hot as 10 degrees. Ratio scales have a natural 0 whereas interval scales do not. (We could add 10 to all temperatures and preserve the information in the relative temperature readings.) Many of the analyses we perform will assume that we are dealing with a variable that is on an interval scale. Sometimes ratio variables will be assumed. Sometimes we will pretend the variable is on an interval scale (SAT scores is a prime example) when it is really only on an ordinal scale. (SAT scores only order students on the basis of how many questions they got right.) Exercises 1.1 Classify the following variables as discrete or continuous and as nominal, ordinal, interval or ratio a. CLASS b. CLASS-RANK c. GPA d. WEIGHT e. GENDER f. FT-PCT (the free-throw percentage of a basketball player) g. RED-BLOOD-CELL-COUNT (the number of red blood cells per cc of blood) h. SCORE (the score of a golfer in a particular round of tournament golf) 2 Sampling Many of the datasets that statisticians analyze arise from sampling from some population. Definition 2.1. A population is a (well-defined) set of objects. As with any mathematical sets, sometimes we define a population by a census or enumeration of the elements of the population. The registrar can easily produce an enumeration of the population of all currently registered Calvin students. Other times, we define a population by properties that determine membership in the population. (In mathematics, we define sets like this all the time since many sets 3

in mathematics are infinite and so do not admit enumeration.) For example, the set of all Michigan registered voters is a population even though a census of the population would be very difficult to produce. It is perfectly clear for an individual whether that individual is a Michigan registered voter or not. Definition 2.2. A subset S of population P is called a sample from P. Quite typically, we are studying a population P but have only a sample S and have the values of one or several variables for each element of S. The canonical problem of (inferential) statistics is: Problem 2.3. Given a sample S from population P and values of variables X 1,... X k on elements of S, make inferences about the values of X 1,..., X k on the elements of P. Obviously, our success at solving this problem will depend to a large extent on how representative S is of the whole population P with respect to the properties measured by X 1,..., X k. In turn, the representativeness of the sample will depend how the sample is chosen. A convenience sample is a sample chosen simply by locating units that conveniently present themselves. A convenience sample of Calvin students could be produced by grabbing the first 40 students that come through the doors of Johnny s. It s pretty obvious that in this case, and for convenience samples in general, there is no guarantee that the sample is likely to be representative of the whole population. In fact we can predict some ways in which a Johnny s sample would not be representative of the whole student population. One might suppose that we could construct a representative sample by carefully choosing the sample according to the important characteristics of the units. For example, to choose a sample of 40 Calvin students, we might ensure that the sample contains 22 females and 18 males. Continuing, we would then ensure a representative proportion of first-year students, dorm-livers, etc. There are two problems with this strategy. First, there are usually so many characteristics that we might consider that we would have to take too large a sample so as to get enough subjects to represent all the possible combinations of characteristics in the proportions that we desire. Second, even if we list many characteristics, it might be the case that the sample will be unrepresentative according to some other characteristic that we didn t think of and that characteristic might turn out to be important for the problem at hand. Statisticians have settled on using sampling procedures that employ chance mechanisms. The simplest such procedure (and also by far the most important) is known as simple random sampling (SRS). Definition 2.4. A simple random sample of size k from a population of size n is a sample that results from a procedure for which every subset of size k has the same chance to be the sample chosen. For example, to pick a random sample of size 30 of Calvin students, we might write the names of all Calvin students on index cards and choose 30 of these cards from a well-mixed bag of all the cards. In practice, random samples are often picked by computers that produce random numbers. (A computer can t really produce random numbers since a computer can only execute a deterministic algorithm. However computers can produce numbers that look random. More on this issue later in the course.) In this case, we would number all students from 1 to 4135 and then choose 30 numbers from 1 to 4135 in such a way that any set of 30 numbers has the same chance of occurring. The R command sample(1:4135,30,replace=f) will choose such a set of 30 numbers. Now it is certainly possible that a random sample is unrepresentative in some significant way. Since all possible samples are equally likely to be chosen, by definition it is possible that we choose a 4

bad sample. For example, a random sample of Calvin students might fail to have any seniors in it. However the fact that a sample is chosen by simple random sampling enables us to make quantitative statements about the likelihood of certain kinds of nonrepresentativeness. This in turn will enable us to make inferences about the population and to make statements about how likely it is that our inferences are accurate. The concept of random sampling can be extended to produce samples other than simple random samples. Definition 2.5. A random sample of size k from a population of size n is a sample that results from a procedure for which every object has a fixed likelihood of being in the sample chosen. This definition differs from that of a simple random sample in two ways. First, it does not requires that each object has the same likelihood of being the sample chosen. In opinion polls, sometimes specific groups are oversampled. For example, we might want to ensure in a sample of Calvin students that there are enough international students so that we can make some inferences about subgroups. To do this we might construct the sampling method to make it more likely that an international student is chosen in the sample. Second, it does not require that equal likelihood extends to groups. A sampling method that we might employ given a list of Calvin students is to choose one of the first 30 students in the list and then choose every 30th student thereafter. Obviously some subsets can never occur as the sample since two students whose names are next to each other in the list can never be in the same sample. Such a sample might indeed be representative however. Stratified random sampling as described in the textbook is an example of a random sampling method that does not result in simple random samples. Exercises 2.1 Often, we take a sample by some convenient method (a convenience sample) and hope that the sample behaves like a random sample. For each of the following convenient methods for sampling Calvin students, indicate in what ways that the sample is likely not to be representative of the population of all Calvin students. a. The students in Mathematics 243A. b. The students in Nursing 329. c. The first 30 students who walk into the FAC west door after 12:30 PM today. d. The first 30 students you meet on the sidewalk outside Hiemenga after 12:30 PM today. e. The first 30 students named in the bod book. f. The men s basketball team. 2.2 For the convenience samples in the previous problem, which do you suppose is likely to behave most like a random sample? 2.3 Donald Knuth, the famous computer scientist wrote a book entitled 3:16. This book was a Bible study book that studied the 16th verse of the 3rd chapter of each book of the Bible (that had a 3:16). Knuth s thesis was that a Bible study of random verses of the Bible might be 5

edifying. The sample was of course not a random sample of Bible verses and Knuth had ulterior motives in choosing 3:16. Describe a method for choosing a random sample of 60 verses from the Bible. 2.4 Consider the set of natural numbers P = {1, 2,..., 30} to be a population. a. How many prime numbers are there in the population? b. If a sample of size 10 is representative of the population, how many prime numbers would we expect to be in the sample? How many even numbers would we expect to be in the sample? c. Using R choose 5 different samples of size 10 from the population P. Record how many prime numbers and how many even numbers are in each sample. Make any comments about the results that strike you as relevant. 6