Chapter 1 Where Do Data Come From?


Emery Cooper
5 years ago


Understanding Data

The purpose of this class: to be able to read the newspaper and know what the heck they're talking about! To be able to go to the casino and know why the house always wins.

Statistics: the study of how to collect, organize, analyze, and interpret information.
The advantage of statistics is that it gives a process for making decisions, when faced with uncertainties, without prejudice.
Statistics are used in many fields. Examples:
  Medical: What are the chances a patient will go into remission with a certain cancer treatment?
  Education: Does writing material down increase the ability to remember facts?

Population: the collection of individuals or items of interest.
Examples:
  All residents of Kentucky
  All admits to hospitals in the U.S.
Maybe you want to know what Wayne State students like to do on a Friday night. Your population: Wayne State students.
The population is defined in terms of our desire for knowledge.

Census: measurements from the entire population are used.
Every 10 years the U.S. conducts a census.
  They attempt to reach every resident in the United States.
  Some people are difficult to reach.
Often, it is not feasible to study the entire population. Instead, we take measurements from a subset.

Sample: the subset of the population on which we make measurements.
We call the measurements data.
We don't ask everyone in the population. We ask part of the population: a sample.


Where do data come from?

Individuals: the objects described by a set of data (can be people, animals, or things).
Variable: any characteristic of an individual that can take different values for different individuals (we collect data on the variables that we are interested in).
We conduct a study to collect and process data.

Types of studies:

Observational study: observes individuals and measures variables of interest but does not attempt to influence responses.

Sample survey: a type of observational study in which a sample is selected and asked to respond to questions.
Examples:
  Public opinion polls
  Pre-election polls
  Teacher evaluations

Experiment: deliberately imposes some treatment on individuals in order to observe their responses (the purpose is to study whether the treatment causes a change in the response).
Examples:
  Medical study: patients are given drugs at various dosage levels to study effectiveness.
  Change a container from one kind to another and see if individuals notice the difference.

There are two main parts in the science of statistics:
1) Descriptive Statistics: methods of summarizing a set of data
2) Inferential Statistics:


methods of making inference about a population based on the information in a sample

Chapter 2 Samples, Good and Bad

Bias: a prejudice in one direction.
Going to the Democratic National Convention and asking who each person voted for would create bias.

Common ways of creating bias:

Convenience sampling: uses results or data that are conveniently and readily obtained (runs the risk of being severely biased!).
Asking your friends is an example of convenience sampling.

Example: voluntary response samples. These often overrepresent people with strong opinions.
Example: restaurant comment cards. People who are willing to volunteer an opinion usually have a strong opinion, and it's usually not good. Not many people take time to say "I'm happy!"
This type of data is biased and should not be generalized to the overall population.

For a sample to be useful, it needs to represent the population! This is important since we usually want to extend the results to the population.
The sample should be similar to the population in terms of demographics and other variables.
One way to do this is with a random sample: a sample determined completely by chance.

A simple random sample (SRS) of n measurements from a population is one selected in such a manner that every sample of size n from the population has equal probability of being selected.

With a random sample:
  Our sample will typically be similar to our population with respect to demographic characteristics.
  We can control the probability of making a mistake (or probability of error).
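The definition of a simple random sample can be sketched in a few lines of Python. This is only an illustration: the population of 500 student labels and the function name simple_random_sample are made up for the example, and the standard library call random.sample is used because it draws without replacement so that every subset of size n is equally likely.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct individuals; every subset of size n is equally likely."""
    rng = random.Random(seed)  # a seed makes the draw repeatable for demonstration
    return rng.sample(list(population), n)

# Hypothetical population: 500 students identified by a label.
students = [f"student_{i}" for i in range(1, 501)]
sample = simple_random_sample(students, 10, seed=1)
print(sample)
```

Leaving the seed out gives a different sample on every run, which is what an actual study would want.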


Chapter 3 What Do Samples Tell Us?

Statistic: a numerical characteristic of a sample.
Example: Of 100 people surveyed, 37 said they would rather take a train than drive. Our statistic is 37% of the people surveyed (in this case, our sample).
This value is known when we take our sample, but it will vary from sample to sample.

Parameter (p): a numerical characteristic of a population.
This value is a fixed number, but when doing inferential statistics we will not know its value (unless we take a census).
Since the population is often not available, we use statistics to estimate parameters.

Variability: describes the spread of the values of the statistic.
We can control this variability, since a larger sample will force less variation.
To reduce bias we should use random sampling.
From the sampling variability we can calculate the margin of error.

The margin of error gives us a way of estimating the parameter, given a statistic.
If the margin of error is ±2% and the statistic is that 58% of the people surveyed hold a given belief, then in 95% of samples the statistic falls within 2 percentage points of the parameter. So we can say with 95% confidence that between 56% and 60% of all people hold that belief (plus or minus 2% from our statistic).

We can say with 95% confidence that the amount by which a proportion obtained from a sample will differ from the population proportion will not exceed $1/\sqrt{n}$, where n is the number of people in the sample.

Example: a sample proportion of 50% and a sample of 1600 people.
The margin of error? $1/\sqrt{1600} = 1/40 = 0.025 = 2.5\%$
The confidence statement? We are 95% confident that the population parameter is between 47.5% and 52.5%.

How large a sample is large enough? What are the factors?
  How confident do we want to be in our conclusions?
  How much variability is in our data?
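The quick margin-of-error rule above is easy to check numerically. The sketch below assumes only the 1/sqrt(n) rule stated in the notes; the function names are made up for the example.

```python
import math

def quick_margin_of_error(n):
    """Quick 95% margin of error for a sample proportion: 1 / sqrt(n)."""
    return 1 / math.sqrt(n)

def confidence_interval(p_hat, n):
    """Sample proportion plus or minus the quick margin of error."""
    moe = quick_margin_of_error(n)
    return p_hat - moe, p_hat + moe

moe = quick_margin_of_error(1600)
low, high = confidence_interval(0.50, 1600)
print(f"margin of error: {moe:.1%}")
print(f"95% confidence interval: {low:.1%} to {high:.1%}")
```

Because the margin shrinks with the square root of n, cutting the margin of error in half takes four times as many people.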

Chapter 5 Experiments, Good and Bad

"Back off, man. I'm a scientist."

To conduct a study properly, you must do the following:
  Get a representative sample
  Get a large enough sample
  Decide whether the study should be an observational study or an experiment

A response variable (dependent variable) is a variable that measures an outcome or result of a study.
An explanatory variable (independent variable) is a variable that attempts to explain, or causes, changes in the response variable.

This just in: According to statisticians, there is a link between having a hangover and doing poorly on an exam. While the experiment is not conclusive, statisticians recommend not trying to take an exam after a night of drinking.
Response variable: exam scores. Explanatory variable: whether or not you have a hangover.

In an experiment, we create differences in the explanatory variable and then examine the results.


In an observational study, we observe differences in the explanatory variable and then notice whether these are related to differences in the response variable.

Observational study: We observe that televisions with a larger screen size weigh more.
Experiment: We noticed that when patients were given a placebo, they rated their mood higher than if they did not take the placebo. Not to be confused with the placebo effect (a lurking variable for most drug-related experiments).

The individuals studied in an experiment are often called subjects.
A treatment is one or a combination of explanatory variables assigned by the experimenter.
A treatment diagram is useful in determining whether all combinations of the explanatory variables have been used.
Suppose we are conducting an experiment on weight loss and the explanatory variables we want to use are diet (fat-free and Atkins) and exercise (swimming and walking).

                      Diet
Exercise      Fat-free    Atkins
Swimming         1           2
Walking          3           4

When conducting an experiment, it is important to randomly assign individuals to one of the treatment groups (random assignment is equivalent to flipping a coin to decide group membership).
Placebos, which look similar to the treatment being given in the experiment, are given to some subjects.
Control groups are used to help control lurking variables (variables that have an effect on the response variable but are not part of the study).
A confounding variable is a variable whose effect on the response variable cannot be separated from the effect of an explanatory variable. Confounding variables are examples of lurking variables.
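The treatment diagram and the random assignment step can be sketched together. This is a minimal illustration, not a prescribed procedure: the 20 subject labels and the helper randomly_assign are hypothetical, and shuffling then dealing subjects out in turn is just one simple way to get equal-sized random groups.

```python
import itertools
import random

diets = ["fat-free", "Atkins"]
exercises = ["swimming", "walking"]

# Number the treatments as in the diagram: rows are exercise, columns are diet.
treatments = {
    number: combo
    for number, combo in enumerate(itertools.product(exercises, diets), start=1)
}

def randomly_assign(subjects, treatment_ids, seed=None):
    """Shuffle the subjects, then deal them out to the treatments in turn."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatment_ids}
    for i, subject in enumerate(shuffled):
        groups[treatment_ids[i % len(treatment_ids)]].append(subject)
    return groups

subjects = [f"subject_{i}" for i in range(1, 21)]  # hypothetical subjects
groups = randomly_assign(subjects, list(treatments), seed=7)
for number, members in groups.items():
    print(number, treatments[number], members)
```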


An interaction occurs when the effect of one explanatory variable on the response variable depends on what is happening with another explanatory variable.
An observed effect so large that it would rarely occur by chance is called statistically significant.

Chapter 6 Experiments in the Real World

Nonadherers: subjects who participate but do not follow the experimental treatments.
In a study that tries to determine if a cough medicine helps the patient with pain, a nonadherer may be the person who continues to make hot toddies for their cold while remaining a subject of the experiment. The experimenter will not be able to tell whether the medicine or the hot toddy was relieving the pain.

Dropouts: subjects who begin the experiment but do not complete it.

Blinding:
  Single blind: only the administrator knows if the subjects receive the treatment or placebo.
  Double blind: neither the subjects nor the administrator know what is being given.

Completely randomized experimental design: all the subjects are allocated at random among all treatment groups. Did someone say SRS?

A block is a group of subjects that are known before the experiment to be similar in some way that is expected to affect the response to the treatments.
Example: Some of the subjects are pregnant.

In a block design, the random assignment of subjects to treatments is carried out separately within each block.
Example: Do an SRS for the pregnant women and an SRS for the rest of the subjects.


Chapter 8 Measuring

Measure: assign a number to represent a property.
Instrument: something used to measure.
Units: the type of values our measurements take.

Validity: a measurement is valid if it is relevant or appropriate when representing a property.

Story: Goober loved to measure the weight of different chocolate cakes, but Goober is a few candles short of a birthday. Goober didn't have a problem with his instrument; he used a calibrated scale he bought at his favorite store, WEIGH STUFF MART. His units were not bad either; Goober weighed his cakes in ounces. Goober's problem was that he weighed the cakes by holding them and stepping on the scale. Goober thinks the average cake weighs approximately 2112 ounces. You could say his measurements are not entirely valid.

Often a rate (percent) at which something occurs is a more valid measure than a frequency.
  NYC resident (population: 8.3 million): "We have 328 chocolate shops in my city!"
  Ionia resident (population: 11.4 thousand): "WOW! I wish we had that many; we only have 38 chocolate shops!"
Rates may be a better option for comparing.

Reliability: a measurement is reliable if it is the same time after time when taken on the same individual.
  Variability
  Consistency across measures
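To see why a rate can be more valid than a raw count, the chocolate shop figures above can be converted to shops per 10,000 residents. The choice of "per 10,000" is an assumption made for readability; any common base works.

```python
def shops_per_10k(shops, population):
    """Convert a raw count into a rate per 10,000 residents."""
    return shops / population * 10_000

nyc = shops_per_10k(328, 8_300_000)
ionia = shops_per_10k(38, 11_400)
print(f"NYC:   {nyc:.2f} shops per 10,000 residents")
print(f"Ionia: {ionia:.2f} shops per 10,000 residents")
```

On a per-resident basis, Ionia has far more chocolate shops than NYC, the opposite of what the raw counts suggest.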


It might be best, when trying to get Garfield's weight, to weigh him more than once and see that the measurement is consistent.
If Garfield weighs 122.1, 122.1, 122.1, 122.2, 122.1 after stepping on the scale 5 times, we see that the variance is small and that this is a reliable measurement.

Variance: a value used to determine if random error is small (so that our measurement is reliable).

Types of data

Qualitative (categorical) variable: places responses into categories with no logical ordering.
Quantitative variable: numeric values that can be ordered and on which mathematical operations, such as finding an average, can be performed.
  Discrete variable: things that can be counted.
  Continuous variable: things that are measured.
Run a race in 2:03? Time is continuous; nobody counted your seconds.
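As a quick check of the Garfield example, the sample variance of the five listed readings can be computed with Python's standard statistics module. The units are not stated in the notes, so none are assumed here.

```python
import statistics

# The five readings listed for Garfield.
weights = [122.1, 122.1, 122.1, 122.2, 122.1]

mean_weight = statistics.mean(weights)
variance = statistics.variance(weights)  # sample variance (divides by n - 1)
print(f"mean = {mean_weight:.2f}, variance = {variance:.4f}")
```

A variance this close to zero says the readings barely move from one weighing to the next, which is what reliability means.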


Chapter 10 Graphs, Good and Bad

The distribution of a variable tells what values it takes and how often it takes these values.

Pie chart: displays the division of a total quantity.
  Used only for qualitative data
  Should not include too many categories
  The number of degrees for each wedge should be proportional to the percentage
  The total percentage must add to 100%
Two categories: pie eaten and pie not yet eaten. Notice that the percentage of pie eaten is approximately 33% and the associated angle is 0.33(360), or approximately 120 degrees.

Bar graph: displays the frequency or percentage of items in each category.
  Can be used for more than one categorical variable
  The bars can be vertical or horizontal
  The bars should be of uniform width and uniformly spaced
  The length of a bar represents the quantity we wish to compare
I found this bar graph online and thought the title was amusing. I imagined a study where someone asked children "What is your favorite juice?" and most students replied "Yellow!" I'm stumped: what is yellow juice?

Line graph (time plot): shows the relationship between a quantitative variable and time.
  Time is the horizontal scale (x-axis)
  The quantitative variable being measured is the vertical scale (y-axis)
This next example seems contradictory to my beliefs, but it was also kind of amusing.

A time series is a record of a variable over time.
A steady change over time is called a trend.
A seasonal component in a time series means that the variable tends to be higher at certain points in time and lower at certain points in time.
All other variation can be explained by irregular cycles and random fluctuations.

Chapter 11 Displaying Distributions with Graphs

Extreme value or outlier: observations that are separated from the rest of the data set by some margin.
Imagine a study where you asked people at an M&M conference how many M&Ms they ate each day and the results were:
32, 33, 45, 67, 28, 32, 40, 0, 32, 41, 879, 33
We see that 0 and 879 are outliers. Who goes to an M&M conference without eating M&Ms? I'd be a little nervous about that person. I'd also be a little nervous about the person who consumes 879 M&Ms!


Shape: the pattern displayed when the graph is created.

Stem and leaf: separates data entries into leading digits (stems) and trailing digits (leaves).
Features:
  A device that organizes and groups data but allows us to recover the original data if desired
  Good for spotting extreme values and identifying shape
14 male weights in pounds:
139, 153, 179, 201, 163, 168, 157, 170, 172, 165, 145, 155, 161, 151
  stem: tens of pounds
  leaf: ones of pounds
A stem and leaf plot for inches of snow per day for the first week of May?

Frequency distribution: a summary table in which the data are arranged into conveniently established class groupings.
  Should have between 5 and 15 classes
  Each class grouping should be of equal width
  Overlapping of the classes must be avoided
  Useful when dealing with very large data sets
  Through the grouping process the original data is lost
Class midpoint: the point halfway between the boundaries of each class.
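A stem and leaf plot for the 14 male weights can be built with a short script. This is a sketch under the convention stated above (stem = tens of pounds, leaf = ones of pounds); the function name stem_and_leaf is made up for the example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group each value's last digit (leaf) under its leading digits (stem)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)

weights = [139, 153, 179, 201, 163, 168, 157, 170, 172, 165, 145, 155, 161, 151]
plot = stem_and_leaf(weights)

# Print every stem in the range, including empty ones, so gaps stay visible.
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```

Printing the empty stems 18 and 19 makes the gap before 201 visible, which is how the plot helps spot extreme values.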

Weight                     Frequency
130 but less than 140          1
140 but less than 150          1
150 but less than 160          4
160 but less than 170          4
170 but less than 180          3
180 but less than 190          0
190 but less than 200          0
200 but less than 210          1
Total                         14

Histogram: a picture of a frequency distribution.

Shapes of histograms:
  Symmetrical: both sides are the same when the graph is folded vertically.
  Uniform: every class has equal frequency.
  Skewed left or skewed right: one tail is stretched longer than the other. The direction of the skewness is on the side of the longer tail.
  Bimodal: the two classes with the largest frequencies are separated by at least one class.

Chapter 12 Describing Distributions with Numbers

Measure of Central Tendency: Description of Average (Typical Value)

Sample mean (simple average): the sum of the data values divided by the sample size,
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
where n is the sample size and $x_1, x_2, \ldots, x_n$ are the observations.
Select 4 students and ask, "How many brothers and sisters do you have?"
Answers: 2, 3, 1, 3. The mean is (2 + 3 + 1 + 3)/4 = 2.25.
Notice that if the fourth person had responded that they had 10 brothers instead of 3, the mean would be 4 instead. This shows that the mean is influenced by extreme values.

Here is something that is not influenced by extreme values:

Sample median (middle score):
  Rank the data from smallest to largest
  If n is odd, the median is the middle score
  If n is even, the median is the average of the two middle scores
Observations (number of siblings): 2, 3, 1, 3
Ordered: 1, 2, 3, 3, so the median is (2 + 3)/2 = 2.5

Observations (with the fourth responder saying 10 instead of 3): 2, 3, 1, 10
Ordered: 1, 2, 3, 10, so the median is still (2 + 3)/2 = 2.5

Sample mode: the most frequent score.
Observations: 2, 3, 1, 3. Mode = 3.
What if there is no mode?! If no number occurs more than once, we say there is no mode, but if two numbers tie for the number of occurrences, then each of them gets the title of mode.
  Does not always exist, and there can be more than one
  Unstable (if we start rounding, the mode can change drastically)
  Can be used with qualitative data

Measures of Dispersion (Variability)

Distribution #1 and Distribution #2: the mean, median, and mode are all 35 in both distributions, but there is a big difference between the two distributions!
How we measure the differences:

Sample range: (highest observation) − (lowest observation)
Years of experience of faculty: 1, 30, 22, 10, 5
Sample range = 30 − 1 = 29 years
This is easy to compute but highly sensitive to extreme scores.

Sample variance: measures (roughly) the average squared distance from the mean.
Sample standard deviation: the square root of the sample variance; it measures (roughly) the typical distance from the mean.
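The sibling and faculty examples can be reproduced with a short script. The helper modes is written for this illustration so that it follows the convention in these notes (no mode when nothing repeats, several modes on a tie); it is not a standard library function.

```python
import statistics
from collections import Counter

def modes(data):
    """Most frequent value(s); an empty list when no value repeats."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:
        return []
    return [value for value, count in counts.items() if count == top]

siblings = [2, 3, 1, 3]
siblings_extreme = [2, 3, 1, 10]  # fourth student answers 10 instead of 3

for data in (siblings, siblings_extreme):
    print(data, "mean:", statistics.mean(data),
          "median:", statistics.median(data), "mode:", modes(data))

experience = [1, 30, 22, 10, 5]  # years of experience of faculty
sample_range = max(experience) - min(experience)
print("range:", sample_range)
```

The extreme answer of 10 pulls the mean from 2.25 up to 4 while the median stays at 2.5.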

Standard deviation is incredibly important to this class; we will discuss the formulas and how to compute it in class.

Measures of Position

Quartiles: divide the data into four equally sized parts.
  First quartile, Q1: 25% of the data lies below and 75% of the data lies above
  Second quartile (median), Q2: 50% of the data lies below and 50% of the data lies above
  Third quartile, Q3: 75% of the data lies below and 25% of the data lies above

Procedure to compute quartiles:
1) Order the data from smallest to largest
2) Find the median. This is the 2nd quartile, Q2
3) Q1 is the median of the lower half of the data
4) Q3 is the median of the upper half of the data
If there is an odd number of observations, take out the median before computing the first and third quartiles.
If there is an even number of observations, leave all the observations in when computing the first and third quartiles.

5 number summary: Min, Q1, median, Q3, Max
Interquartile range (IQR) = Q3 − Q1 = range of the middle 50% of the data

            Students    Faculty
Min             0          10
Q1              1          15
Median          5          25
Q3              7          31
Max            10          73
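The quartile procedure above translates directly into code. This sketch follows the rule given in these notes (drop the median from both halves when n is odd); statistical software often uses other quartile conventions, so results may differ slightly from a calculator or spreadsheet. The function name five_number_summary is made up for the example, and the data are the faculty experience values used earlier for the range.

```python
import statistics

def five_number_summary(data):
    """Min, Q1, median, Q3, Max using the median-of-halves procedure."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd count, the median itself is left out of both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], statistics.median(lower), statistics.median(ordered),
            statistics.median(upper), ordered[-1])

experience = [1, 30, 22, 10, 5]  # odd number of observations
summary = five_number_summary(experience)
iqr = summary[3] - summary[1]
print("five number summary:", summary)
print("IQR:", iqr)
```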

Boxplots: Procedure
1) Draw a scale to include the lowest and highest data value (USE EVEN INCREMENTS!)
2) Draw a box from Q1 to Q3
3) Draw a solid line through the box at the median
4) Draw solid lines, called whiskers, from Q1 to the lowest value and from Q3 to the highest value

Chapter 13 Normal Distributions

As the class widths for a histogram become smaller and smaller, the top of the histogram becomes more curvelike.
We set up these curves so that the area under the curve represents the proportion of observations.
This is known as the density curve and is the most common way of representing a population.

Another way to determine shape is by comparing the mean and median.
The median of a density curve is the point that divides the area in half.


The mean of a density curve is the balance point of the density function.
Because of this:
  If the mean and median are equal, then the distribution is symmetric
  If the mean is greater than the median, then the curve is skewed right
  If the mean is less than the median, then the curve is skewed left
If the curve follows a normal distribution (Gaussian distribution), then it will be a bell-shaped curve.

Density curves are useful in determining what proportion or percentage of the population falls within an interval.
The area under the curve represents this proportion, and the total area is 1.

The normal distribution is characterized by $\mu$ (population mean) and $\sigma$ (population standard deviation).
A normal curve with $\mu = 0$ and $\sigma = 1$ is called the standard normal curve.

A percentile represents the position of your measurement in comparison with everyone else's and gives the percentage of the population that falls below you.
To find a percentile we will use standardized scores (z-scores), denoted z:
$z = \frac{x - \mu}{\sigma}$
The z-score for an observation is just the number of standard deviations the observation is above the mean.
Example: If your height is 70 inches, and the heights of the class are normally distributed with $\mu = 65$ and $\sigma = 5$, then you have $z = (70 - 65)/5 = 1$. That is, your height is 1 standard deviation above the mean.
z-scores allow us to transform any normal curve into a standard normal curve.

Empirical Rule
  Approximately 68% of the data fall within 1 standard deviation of the mean: $(\bar{x} - s, \bar{x} + s)$
  Approximately 95% of the data fall within 2 standard deviations of the mean: $(\bar{x} - 2s, \bar{x} + 2s)$
  Approximately 99.7% of the data fall within 3 standard deviations of the mean: $(\bar{x} - 3s, \bar{x} + 3s)$
For a normal distribution, the empirical rule percentages are very close to the true values.
If an observation is not 1, 2, or 3 standard deviations from the mean, we cannot use the rule. To determine percentages, we use a z-score table, like the one on the next page.
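A z-score table lookup can also be done in software. The sketch below uses NormalDist from Python's standard statistics module (available in Python 3.8 and later) for the height example and to check the empirical rule; the function names z_score and proportion_within are made up for the illustration.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Number of standard deviations that x lies above the mean."""
    return (x - mean) / sd

standard_normal = NormalDist(mu=0, sigma=1)

z = z_score(70, 65, 5)
percentile = standard_normal.cdf(z)  # proportion of the population below you
print(f"z = {z}, percentile = {percentile:.1%}")

def proportion_within(k):
    """Area under the standard normal curve within k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {proportion_within(k):.2%}")
```

A height of 70 inches lands at roughly the 84th percentile, and the three areas come out close to the 68%, 95%, and 99.7% of the empirical rule.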

Chapter 14 Describing Relationships: Scatterplots and Correlation

Scatterplot or scatter diagram: displays the relationship between two quantitative variables.
  x-axis: independent variable (explanatory variable)
  y-axis: dependent variable (response variable)
Example: Age (x), Height (y)

Correlation: a measure of association that tests whether a relationship exists between two variables.
In general we will be looking for linear correlations, i.e. how closely the data follow a line when plotted.

The correlation coefficient (denoted r) is a value which measures correlation and indicates both the strength of the association and its direction.
Positive r suggests that as the x values increase, so do the y values. It also happens that as the x values decrease, so do the y values.
Negative r suggests the opposite: when the x values increase, the y values decrease, and when the x values decrease, the y values increase.
We will always have $-1 \le r \le 1$:
  r close to 1: the data are close to a straight line with positive slope
  r close to −1: the data are close to a straight line with negative slope
  r ≈ 0: no linear relationship
The stronger the linear relationship, the closer r is to −1 or 1.
Generally, we will say there is a strong relationship if $|r| \ge 0.75$.
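The notes do not give a formula for r, so the sketch below assumes the standard Pearson correlation coefficient, which is the usual meaning of r in an introductory course. The age and height values are made up for illustration.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: age in years (x) and height in inches (y).
ages = [2, 4, 6, 8, 10]
heights = [34, 40, 45, 50, 55]

r = correlation(ages, heights)
print(f"r = {r:.4f}")
print("strong relationship" if abs(r) >= 0.75 else "not a strong relationship")
```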


Note that it is never possible to prove causality just based on the relationship between two variables.
There is a strong statistical correlation over months of the year between ice cream consumption and the number of assaults in the U.S.
Does this mean ice cream manufacturers are responsible for crime? No! The correlation occurs statistically because the hot temperatures of summer increase both ice cream consumption and assaults.
Thus, correlation does NOT imply causation.
Other factors besides cause and effect can create an observed correlation.

To establish whether two variables are causally related you must establish:
1) Time order: the cause must have occurred before the effect
2) Co-variation (statistical association): the correlation coefficient must show a strong relationship between the dependent and independent variable
3) Rationale: there must be a logical and compelling explanation for why these two variables are related
4) Non-spuriousness: it must be established that the independent variable X, and only X, was the cause of changes in the dependent variable Y; rival explanations must be ruled out

This type of research is very complex, and the researcher can never be completely certain that there are no other factors influencing the causal relationship.
To help identify a relationship as cause and effect, a study is often performed many times. The study should yield the same results every time it is conducted (if this occurs, it helps rule out rival explanations).

Chapter 15 Describing Relationships: Regression, Prediction, and Causation

Linear Regression

Purpose of linear regression: to predict the value of a difficult-to-measure variable, y (response variable), based on an easy-to-measure variable, x (explanatory variable).
Example: Predict the finishing time of the men's 100 meter dash in 2032.
We do this by using a line that fits the data, called a regression line.
Lines have equations of the form
$y = mx + b$
where b is the y-intercept and m is the slope.
In order to use linear regression, make sure the model is reasonable (the scatter plot and r should indicate a strong correlation).
We need a line that is the best fit for our data. To accomplish this we will use the method of least squares.

To find the least squares regression line, we are essentially looking to minimize the total area of the squares created between a possible regression line and our observations.
While we do not compute it this way in practice, the least squares regression line can be found from formulas for its slope m and y-intercept b.
We will do an example in class (be sure to include it in your notes).

Interpolation: predicting y values for x values that are within the range of the scatter plot (this is what regression should be used for).
Extrapolation: predicting y values for x values beyond the range of the observations (in general, this should not be done, as it can pose a problem).
It is possible to create a scatter plot where the explanatory variable is age and the response variable is height. When comparing a child's height to her age, it may seem as if the data have a strong linear correlation. By using a regression line to extrapolate, we might find that at the age of 28 we would expect a height of 7 feet. This problem arises because our growth over time is not typically linear.
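As a sketch of least squares and of the danger of extrapolation, the code below fits a line to made-up childhood age and height data. The slope and intercept formulas are the standard textbook ones (slope = sum of products of deviations divided by sum of squared x deviations; intercept = mean of y minus slope times mean of x), which is an assumption about what the in-class example uses. The data values are hypothetical and chosen to be exactly linear so the arithmetic is easy to follow.

```python
def least_squares_line(xs, ys):
    """Slope m and y-intercept b of the least squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

# Hypothetical data: a child's age in years and height in inches.
ages = [2, 4, 6, 8, 10]
heights = [32, 36, 40, 44, 48]

m, b = least_squares_line(ages, heights)

def predict(age):
    return m * age + b

print(f"height = {m} * age + {b}")
print("interpolation, age 7:", predict(7), "inches")
print("extrapolation, age 28:", predict(28), "inches")
```

The prediction at age 7 is reasonable because it lies inside the observed ages. The prediction at age 28 is 84 inches (7 feet), which shows how a line fit to childhood data fails outside its range.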


Chapter 17 & 18 Thinking about Chance & Probability

When thinking about chance, we consider outcomes, or the possible things that can happen.
When rolling 2 dice, an example of something that can happen is rolling snake eyes (both land on one). Rolling 2 ones is an example of an outcome when rolling 2 dice.
We call the collection of outcomes we care about an event.
When playing cards we might like to consider the chance of getting a royal flush. Our event would be getting a royal flush, which has 4 outcomes (one for each suit).


The probability of an event A, denoted P(A), is the expected proportion of occurrences of A if the experiment were performed a large number of times.
In general we can compute probability if the chance of each outcome is equal. In this case:
$P(A) = \frac{\text{number of outcomes in } A}{\text{number of outcomes in the sample space}}$

The set of all possible outcomes is called the sample space.
Examples:
  Roll a d20 (20-sided die). Sample space: {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20}
  Flip a coin. Sample space: {H,T}
Usually it will be hard to list the entire sample space. For instance, listing out all of the possible card hands would take a very long time. We therefore resort to counting principles.

Counting principles

How many ways are there to arrange the letters in the word SUNDAY?
Here we have 6 letters which cannot have repeats. We have 6 choices for the first letter; this leaves us with 5 choices for the second letter, 4 for the third, and so on, until finally there is one choice for the sixth letter, giving us 6 × 5 × 4 × 3 × 2 × 1 = 720 ways to arrange the letters.

If a burglar alarm system has a 3 digit code and each digit is a number between 0 and 9, then how many possible codes are there?
Here we have 10 different numbers and we are okay with repeats, so there are 10 choices for each of the 3 digits, giving us 10 × 10 × 10 = 1000 different codes.

If we are interested in looking at a series of operations, then a device called a tree diagram is useful for determining the sample space.
Flip a penny, a nickel, and a dime: in this tree we see that there are 8 possible outcomes.
This gives us the ability to compute probabilities:
  If A is the event of getting all three tails, what is P(A)?
  If B is the event of getting exactly two tails, what is P(B)?
  If C is the event of getting exactly one tail, what is P(C)?
  If D is the event of getting no tails, what is P(D)?

Thinking about the above counting techniques, we see that there should be some formulas at our disposal.
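The counting examples and the three-coin tree can be checked by brute force. This sketch enumerates the equally likely outcomes and counts the favorable ones; the helper name probability_of_tails is made up for the example, and exact fractions are used so the answers appear as eighths.

```python
import itertools
import math
from fractions import Fraction

# Arrangements of the 6 distinct letters in SUNDAY: 6 x 5 x 4 x 3 x 2 x 1.
arrangements = math.factorial(6)

# 3-digit codes with repeats allowed: 10 choices for each digit.
codes = 10 ** 3

# The leaves of the tree diagram for flipping a penny, a nickel, and a dime.
outcomes = list(itertools.product("HT", repeat=3))

def probability_of_tails(k):
    """P(exactly k tails) when all outcomes are equally likely."""
    favorable = sum(1 for outcome in outcomes if outcome.count("T") == k)
    return Fraction(favorable, len(outcomes))

print("arrangements of SUNDAY:", arrangements)
print("possible codes:", codes)
print("outcomes in the tree:", len(outcomes))
for name, k in (("A", 3), ("B", 2), ("C", 1), ("D", 0)):
    print(f"P({name}) = {probability_of_tails(k)}")
```

The four probabilities add to 1 because the events A, B, C, and D together cover every outcome exactly once.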


The complement of an event A, denoted \( \bar{A} \), is all outcomes not in A.

The complement rule:
\( P(\bar{A}) = 1 - P(A) \)

The addition rule:
\( P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B) \)

If we are interested in knowing if event A occurred given that we know event B occurred, this is known as conditional probability, denoted \( A \mid B \) or A given B.

The conditional probability rule:
\( P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)} \)

The multiplication rule:
\( P(A \text{ and } B) = P(A) \, P(B \mid A) \)

We say the events A and B are mutually exclusive or disjoint if they cannot occur together:
\( P(A \text{ and } B) = 0 \)

Two events are said to be independent if the occurrence (or nonoccurrence) of one does not affect the probability of occurrence of the other:
\( P(A) = P(A \mid B) \)

Events that are not independent are dependent.
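These rules can be checked on a small sample space. The sketch below reuses the d20 from earlier; the events A (roll is even) and B (roll is above 15) are invented for illustration, and the function name P is just a convenience.

```python
from fractions import Fraction

space = set(range(1, 21))  # one roll of a d20, all faces equally likely

def P(event):
    """Probability of an event (a set of faces) under equally likely outcomes."""
    return Fraction(len(event), len(space))

A = {n for n in space if n % 2 == 0}  # hypothetical event: roll is even
B = {n for n in space if n > 15}      # hypothetical event: roll is above 15

complement_ok = P(space - A) == 1 - P(A)          # complement rule
addition_ok = P(A | B) == P(A) + P(B) - P(A & B)  # addition rule

p_A_given_B = P(A & B) / P(B)                     # conditional probability rule
p_B_given_A = P(A & B) / P(A)
multiplication_ok = P(A & B) == P(A) * p_B_given_A

independent = P(A) == p_A_given_B                 # independence check

print(complement_ok, addition_ok, multiplication_ok)
print(P(A), p_A_given_B, independent)
```

Here P(A) is 1/2 but P(A given B) is 3/5, so knowing the roll is above 15 changes the chance that it is even: these two events are dependent.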

For dependent events, \( P(A) \neq P(A \mid B) \).

Example
Draw two cards without replacement.
A = {first card is an ace}
B = {second card is an ace}
A and B are dependent.
\( P(A \text{ and } B) = P(A) \, P(B \mid A) = \left(\frac{4}{52}\right)\left(\frac{3}{51}\right) \)

Suppose we return the first card and thoroughly shuffle before we draw the second.
A and B are independent.
\( P(A \text{ and } B) = P(A) \, P(B \mid A) = \left(\frac{4}{52}\right)\left(\frac{4}{52}\right) \)

We can also use a density curve when our outcomes are not discrete, but continuous.
Example: determining the probability that a sample statistic with a sample size of 100 is within 10% of the parameter.
AT THIS TIME YOU MIGHT BE THINKING MARGIN OF ERROR, MARGIN OF ERROR, MARGIN OF ERROR
It turns out that in this case there is a 95% chance, since the margin of error with a sample size of 100 is 10%.
This type of probability model is continuous. We compute probabilities using areas under the density curve. We also require that the total area under the density curve is 1.
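The two-ace example can be worked out exactly with fractions. This sketch assumes a standard 52-card deck with 4 aces; the variable names are made up.

```python
from fractions import Fraction

ACES, DECK = 4, 52  # assumption: a standard 52-card deck

# Without replacement: after one ace is drawn, 3 aces remain among 51 cards.
without_replacement = Fraction(ACES, DECK) * Fraction(ACES - 1, DECK - 1)

# With replacement: the second draw sees the full deck again.
with_replacement = Fraction(ACES, DECK) * Fraction(ACES, DECK)

print(without_replacement, with_replacement)
```

Putting the first card back makes two aces slightly more likely, because the second draw is no longer hurt by the first ace being gone.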


Chapter 20 The House Edge: Expected Values

Mean and Standard Deviation of a Discrete Probability Distribution

One of the most asked questions for probability: If the probability that I win $10 is ¼, $20 is ¼ and $0 is ½, what will I win on average?

The mean (denoted \( \mu \)) of a probability model is the outcome one would expect on average. The equation for the mean is not much different from the old equation for the mean. If you win $10 ¼ of the time, $20 ¼ of the time and $0 ½ of the time, you would expect out of every 4 plays to get $10, $20, $0 and $0, so your mean would be
\( \frac{10 + 20 + 0 + 0}{4} = 7.5 \)
but this is equal to
\( 10\left(\tfrac{1}{4}\right) + 20\left(\tfrac{1}{4}\right) + 0\left(\tfrac{1}{2}\right) = 7.5 \)

The standard deviation (denoted \( \sigma \)) of a probability model measures the weighted average distance of each outcome from the mean (where the weighting is given by the probabilities).
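The mean and standard deviation of the $10 / $20 / $0 game can be computed directly from the probabilities. The sketch below assumes the usual definition of the standard deviation of a probability model (the square root of the probability-weighted average of squared distances from the mean), since the notes do not spell the formula out.

```python
from math import sqrt

# Winnings and their probabilities from the example in the notes.
winnings = {10: 0.25, 20: 0.25, 0: 0.50}

# Mean: each outcome weighted by its probability.
mean = sum(x * p for x, p in winnings.items())

# Assumed formula: variance is the weighted average squared distance from the mean.
variance = sum(p * (x - mean) ** 2 for x, p in winnings.items())
sd = sqrt(variance)

print(mean, variance, sd)
```

The mean comes out to $7.50 per play, matching the "out of every 4 plays" reasoning above.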

Chapter 21 What is a Confidence Interval?

In Chapter 3 we talked about 95% confidence statements.
Reminder: A statistic from a sample of size n has a margin of error of approximately \( \frac{1}{\sqrt{n}} \).
This is because the statistics of samples of size n are normally distributed with a mean equal to the parameter and a standard deviation of approximately \( \frac{1}{2\sqrt{n}} \).

While what we did in Chapter 3 was great, something worth noting is that the standard deviation of the distribution of statistics should also be based on the parameter. You wouldn't expect to allow a confidence interval of something like: I'm 95% confident that between 90 and 110 percent of people like brownies. This comes about because we could have a parameter of 100%, but our sample size is small. I mean, who doesn't like brownies?

Let the new standard deviation be defined to be
\( \sqrt{\frac{p(1-p)}{n}} \)
where p is the statistic from a sample of n individuals.
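To see why the newer standard deviation helps, compare it with the quick Chapter 3 method on a lopsided sample. This sketch assumes the standard deviation meant here is sqrt(p(1-p)/n) and that a 95% interval is the statistic plus or minus 2 standard deviations; the brownie numbers (90% of 100 people) are made up.

```python
from math import sqrt

def proportion_interval(p_hat, n):
    """Approximate 95% interval: statistic +/- 2 standard deviations (assumed rule)."""
    sd = sqrt(p_hat * (1 - p_hat) / n)  # assumed formula for the new standard deviation
    return p_hat - 2 * sd, p_hat + 2 * sd

# Hypothetical sample: 90% of 100 people say they like brownies.
low, high = proportion_interval(0.90, 100)

# Quick method from Chapter 3: margin of error of about 1/sqrt(n).
quick_margin = 1 / sqrt(100)
quick_low, quick_high = 0.90 - quick_margin, 0.90 + quick_margin

print(round(low, 3), round(high, 3))
print(round(quick_low, 3), round(quick_high, 3))
```

The interval based on p is narrower than the quick one and stays below 100%, which is the point of the brownie example.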

Confidence intervals are about to blow your mind!

Sampling Distribution of the Sample Mean

Suppose you don't just want to know what percentage of the population has brown eyes or other characteristic variables like that. Suppose you want to know the average IQ of the American population; you want to know the mean of some quantitative variable.

So far we only have the tools to approximate the percentage of the population that has some characteristic. If we wanted to answer a quantitative question, we could only say things like "54% of the population has more than 2 cats," when we would like to say things like "the average person has 1.7 cats." You kind of have to feel bad for the 7/10 of a cat running around.

To approximate the mean we notice the following:
If we have a sample of size n, we can compute the mean value \( \bar{x} \) of the observations.
If we consider the collection of mean values from all samples of size n (just like we did when we looked at statistics), we see that the values that \( \bar{x} \) takes are normally distributed with a mean, which we will call \( \mu \), and a standard deviation of \( \frac{\sigma}{\sqrt{n}} \), where \( \sigma \) is the actual standard deviation of all observations from the population.

What can we do with this?

Suppose we have a sample of size n.
We can compute the mean, which we will call \( \bar{x} \).
We can compute the standard deviation, which we will call s.

From this we can use the fact that the sample means are normally distributed and approximate the mean for the population to be \( \bar{x} \), and approximate the standard deviation of this distribution to be \( \frac{s}{\sqrt{n}} \), which gives us that we can make a 95% confidence interval:
I am 95% confident that the population mean is between \( \bar{x} - \frac{2s}{\sqrt{n}} \) and \( \bar{x} + \frac{2s}{\sqrt{n}} \).
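Here is how the recipe might look on actual numbers. The cat counts below are invented, and the sketch assumes the interval is the sample mean plus or minus 2 times s divided by the square root of n (the "2 standard deviations covers 95%" rule).

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample: number of cats owned by each of 16 people.
cats = [0, 1, 1, 2, 0, 3, 2, 1, 4, 2, 1, 0, 2, 3, 1, 2]

n = len(cats)
x_bar = mean(cats)   # sample mean
s = stdev(cats)      # sample standard deviation
se = s / sqrt(n)     # approximate standard deviation of the sample means

# Assumed 95% rule: within 2 standard deviations of the sample mean.
low, high = x_bar - 2 * se, x_bar + 2 * se

print(f"I am 95% confident the population mean is between {low:.2f} and {high:.2f}")
```

A larger sample shrinks s divided by the square root of n, so the interval tightens as n grows.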


Chapter 22 What is a Test of Significance?

Statistical hypotheses: statements about a population parameter.

Suppose you think people can't tell the difference between sugar and artificial sweeteners.
Your hypothesis: 50% of people would say they like sugar better in a blind taste test.
Notice: Hypotheses are not necessarily correct!

In statistics, we test one hypothesis against another:
The hypothesis that we want to prove is called the alternative hypothesis, \( H_a \). We might want to show that sugar is actually preferred, or that our parameter is greater than 50%.
The hypothesis that is contradictory to \( H_a \) is called the null hypothesis, \( H_0 \).

To determine if \( H_a \) is believable, we conduct a study with a sample and either
Reject \( H_0 \) and believe \( H_a \)
Or
Fail to reject \( H_0 \) because there was not sufficient evidence to reject it.

Example: Suppose it is believed by others that there is no difference between sugar and artificial sweeteners, but you believe that sugar is better liked.
Null hypothesis \( H_0 \): half of the population would like sugar better.
Alternative hypothesis \( H_a \): more than half of the population would like sugar better.

Now suppose we ask a sample of 100 people and see that 63% of them like sugar better. This certainly suggests that the null hypothesis is wrong, but could it have been coincidental?

Notice that if \( H_0 \) is true, then we would expect the sample statistics of samples of size 100 to be normally distributed with a mean of .5 and a standard deviation of \( \sqrt{\frac{.5(1-.5)}{100}} = .05 \), so 68% of samples should have statistics between 45% and 55%, and 95% of samples should have statistics between 40% and 60%. Because of this we can see how unlikely it is that we got one of the few samples with a statistic as large as 63%.

In fact, we can determine the probability that such a thing would occur: notice the z-score of 63% is
\( \frac{.63 - .5}{.05} = 2.6 \)
and the associated percentile is 99.53%, so the likelihood of getting a statistic as high as 63% is 100% - 99.53% = .47%.

The likelihood of getting a statistic at least this extreme if the null hypothesis is true is referred to as the P-value.

Because it is highly unlikely that we would get a statistic of 63% if the null hypothesis were true, we reject the null hypothesis. Since the statistic satisfied the alternative hypothesis, we also use this as evidence that the alternative hypothesis is in fact true.

If we want to be more sure that the null hypothesis is untrue, we can lower the P-value we require. The level that the P-value must be under is referred to as the level of significance. For instance, if we were testing to a level of significance of .1%, we would not reject \( H_0 \) in the above example, but we would if the level of significance were 1%.
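The z-score and P-value in the sugar example can be reproduced with Python's standard library. This sketch assumes the standard deviation under the null hypothesis is sqrt(p(1-p)/n) with p = .5, which gives the 5% implied by the 45% to 55% range above; the variable names are made up.

```python
from math import sqrt
from statistics import NormalDist

p0, n, observed = 0.50, 100, 0.63  # null value, sample size, sample statistic

sd = sqrt(p0 * (1 - p0) / n)       # assumed formula; equals .05 here
z = (observed - p0) / sd           # how many standard deviations above the mean

# P-value: chance of a statistic at least this large if the null hypothesis is true.
p_value = 1 - NormalDist().cdf(z)

reject_at_1_percent = p_value < 0.01
reject_at_tenth_percent = p_value < 0.001

print(round(z, 2), round(p_value, 4))
print(reject_at_1_percent, reject_at_tenth_percent)
```

The P-value is a bit under half a percent, so the null hypothesis is rejected at the 1% level but not at the .1% level.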
