Objectives Samling methods Simle random samles (used to avoid a bias in the samle) More reading (Section 1.3): htts://www.oenintro.org/stat/textbook.h?stat_book=os Chaters 1.3.2 and 1.3.3.
Toics: Samling Learning objectives: Be able to identify bad samling methods Know what a reresentative samle is. Understand what a simle random samle is.
What is statistical inference? In statistics we rely on a samle this a very small subset of a larger set, which is called a oulation. Examle: Suose you want to know the roortion of the US oulation who have attended university. It would be rohibitive to ask every American adult this question. Instead we query a relatively small number of Americans, and draw inference on the entire country based on this small samle. The rocedure, whereby we convert information from the samle into intelligent guesses about the oulation, is called statistical inference. How to take a samle: A samle is a tiny subset of the oulation. Therefore it must be reresentative of the oulation, in the sense that no suboulation is over or under-reresented. If this haens it will induce a bias in the samle. On the next few slides we discuss bad samling strategies and then the samling method that gives a good reresentation of the oulation and under which statistical inference is based.
Bad samling methods Many eole try to justify a oint of view based on anecdotal evidence. Anecdotal evidence is based on individual cases, often selected because they are unusual in some way. They tend not to be reresentative of any larger grou of cases and by definition tend to be biased. This is comletely oosite to the statistical way of thinking. Examles: q q Smoking is good for you My great grandfather smoked like a chimney his entire life and live until he was 120. There is no need to ban the use of cell hones. While driving I always use a cell while I drive and I have yet to have an accident. q In many resects, anecdotal evidence is the oosite of statistical inferences. Whereas statistical inference looks at tyical (reresentative) behavior, anecdotal evidence is notable because it is atyical.
Anecdotal evidence is one tye of Hahazard samling: Examles: A substitute teacher wants to know how well students have done in an exam, so he asks the students in the front row their grades. Ask about gun control on the street in a large city in California or in a small town in Texas and you are likely to get very different answers. Even within the same area, answers would tend to differ if you did the survey on a university camus or at a ranchers cooerative. q A study on high school SAT grades selects nearby schools because they are easy to reach. Hahazard samles are always biased: resonses are limited to individuals who just haen to have been available. There is a very real danger of hahazard samling in an observational study (which we define in a few slides).
Voluntary Resonse Samling: Individuals choose to be involved. These samles are highly suscetible to being biased because eole are motivated to resond (or not) for reasons often related to the resonse. Desite their common use, they are not considered to be valid or scientific. Examles: To access the fitness of students, volunteers are asked to run 5K. After buying a roduct you are requested to submit a review. On a Have Your Say forum: 55% of the articiants wrote in to say that immigration was bad for the country. But letters to forums are often written by disgruntled eole. In a random samle of adults, only 30% thought that immigration was bad for the country. They are biased because the samling method, can, systematically favor an outcome different from the truth.
Good samling methods A Simle Random Samle (SRS) is made of randomly selected individuals. This means each individual in the oulation has the same robability of being in the samle. All ossible samles of size n have the same chance of being the samle we select. A SRS is comletely unbiased. In ractice this can be very difficult, so statisticians have many sohisticated methods for obtaining random samles. In this course, we will always assume we have an SRS. A SRS is exactly what we are doing in class when selecting a samle of students to give their heights.
Size of oulation In a SRS there is always a chance that an individual will be selected more than once. In ractice this usually cannot haen (a erson does not like being interviewed twice). But if the oulation size is sufficient large as comared with the samle size, the chance that an individual can be samled twice in a SRS is extremely small. Therefore, the imact of not using an exact SRS scheme is minimal. Since most data is not collected using an exact SRS scheme, a rule of thumb is that the samle size is at most 5% of the total oulation. If it is more than 5%, then most statistical analyses are not valid.
Collecting data through olls Polling is an easy and fun way to collect data. One can use web based alications such as easyolls.com to collect the data. However, one needs to be a little cautions when analyzing such data. We recall that olling data is voluntary resonse, so eole who resond often have a strong oinion on the subject. If you send a oll out to friends and relatives, you need to be wary about the oulation you are analyzing. Since they are your friends or relatives they will be reresentative of a secific demograhic such as young eole, eole of a articular race, eole with a articular olitical view. Thus the oulation you are samling from are not the citizens of the world but a far smaller suboulation.
Objectives Statistical studies and exeriments Collecting data and roblems with confounders Observational vs. exerimental studies Samle vs. oulation htts://www.oenintro.org/stat/textbook.h?stat_book=os Section 1.3.4, 1.4.1 and 1.5.1
Toics: Study Design Learning objectives Understand what an observational study is Understand how confounders (hidden variables) may lead to wrong conclusions Understand what an exerimental study is, and how can be used to remove the effect of hidden variables
Finding answers To reliably answer secific questions set in a articular context, researchers must conduct an exerimental or observational study. When the study is designed and conducted roerly, statistical analysis can be used to formulate conclusions. Data comes in many forms and from many sources. Both the design of the study and the statistical analysis are required to have objective, reliable and justifiable conclusions. Often data is ublically available: collected in the ast and made ublic for future uroses. This data is often roduced by government organizations, and is not biased (as secial interest grous are not involved). Check out websites such as the EPA, FDA, and data.gov (you can get data from these sites for your future roject).
There are two kinds of studies Observational study: Record data on individuals without attemting to influence the resonses. Examle: Based on observations you make in nature, you susect that female crickets choose their mates on the basis of their health. Method 1: Observe and record the health of male crickets that mated. Exerimental study: Deliberately imose a treatment on individuals and record their resonses. Influential factors can be controlled. Method 2: Deliberately infect some males with intestinal arasites and see whether females tend to choose healthy rather than ill males.
Examle 1: Observational studies Observational studies are essential sources of data on a variety of toics. Examle: To see whether child care had an influence on behavior, 1000 children were observed over a eriod of 12 years. It was `found that the more time children sent at day care between zero and 4.5 years of age, the more disrutive the children tend to be. This examle, could be used to justify measures to encourage more mothers to stay at home. But extreme caution is required before we jum to such conclusions: This examle illustrates a roblem with observational studies. It is not clear whether daycare had an influence on the child s behavior or other factors/variables (such as low income families, stressed working arents, single arents) which may be more revalent amongst arents who send their children to daycare.
Definition: Confounding The Daycare examle illustrates the idea of confounding. It is not clear whether daycare or other hidden factors (low income, stress etc.) have increased the chance of disrutiveness in a child. This mix-u of variables between the factor (variable) we want to investigate and the other hidden factors (variables) is known as confounding. Examle: An ice-cream comany had a huge sales camaign in Aril. It was found that ice-cream sales had leat in May, June and July. Does this mean the camaign worked? Answer: May-July are tyically the warm months in the year, where traditionally sales are high. Thus there are two factors which have could have contributed to the lea in sales (1) the sales camaign (2) the warm months. There is confounding between these two factors. Without additional information, it is imossible to disentangle them and know whether the camaign worked.
Examle 2: Observational studies A study was done in the early 90s to see if the mortality rates between left and right handed eole were the same. To do this, the sychologists collected the death records of 2000 eole who had died in Southern California in May, 1990. They rang the families and asked whether the erson was left or right handed. They also categorized the eole into two grous: those who were above 60 when they died and those who were below 60. This is the data: Below 60 Above 60 Left handed 150 250 400 Right handed 300 1300 1600 450 1550 2000 Looking at the data we see that 37.5% of the left handed eole died before 60, whereas 18.75% of the right handed eole died before 60. This is a rather large difference. Does this mean if you are left handed you should be worried?
What the study did not take into account was that the roortion of eole who are left handed has been increasing over time, due to shifts in social attitude. 60 years ago there was a olicy to `change to the right side all children who showed signs of left handedness. Left handed eole were turned into Right handed students In more recent times this measure has been hased out. To understand how this would effect the data consider the following extreme case: Before 1930 no one was left handed. This means that all the left handed eole who had died in 1990 must be below 60 (since here are no left handed eole over 60). Without taking this change into account, the numbers would suggest that if you were left handed then you would die before 60! This of course is not the case, it is simly because there were no left handed eole before 1930. The hidden factor is that there is an increase in the roortion of left handed eole over time. This change in roortion is giving the erceived increase in mortality of left handed eole at a younger age.
Question Time
Exerimental Studies These examles illustrate that it is often better (if ossible) to design the exeriment from which we collect the data. In an exerimental study we deliberately imose some treatments on some individuals and observe their resonse. When our goal is to understand cause and effect, exeriment studies are the only way to have fully convincing data. Examle 1: Do smaller class sizes have an influence on exam results? An observational study will have too many confounders (hidden factors), because large class sizes tend to be in areas which are oorer and oorer areas have more social roblems which can contribute to worse examination results. In the Tennessee exeriment, researchers tried to revent confounding by randomly allocating 6385 children to a normal class (arox 25), small class (arox 15) or normal class with extra teacher, and then follow these children over some years. This removes confounders such as income from the study.
Exerimental Studies Examle 2: To understand if a certain treatment has an influence on sick atients we need to randomly allocate a treatment or lacebo to each atient. We should not give the sickest atients the treatment and the least sick atients the lacebo, this would bias the results. Examle 3: The CO2 obesity study discussed in the introduction is an exerimental study. Different treatments were given to the rats and their weights/food consumtion/hormone levels were observed.
Exerimental Studies: roblems Though ideal, sometimes it is imossible to conduct an exerimental study Examle 4: To see whether daycare has an influence on a child s behavior, at birth we randomly assign each child to either daycare or no daycare. In ractice this is imossible! This examle illustrates situations where it is infeasible or indeed unethical to do an exerimental study. Examle 5: During the 1940s scientists observed that the incidence of lung cancer were higher amongst eole who smoked. This lead scientists to believe that a causality may be involved. That is smoking could directly be linked to an increase in cancer. Ronald Fisher argued that such a claim could not be made on data collected in this way. He argued that an equally lausible exlanation was that eole who tended to smoke also tended to have a disosition for lung cancer. Though we now know that smoking does increase dramatically the chance of lung cancer, this line of argument is not wrong.
Question Time What is the most effective method (may not be realistic) for establishing a causality between lactose consumtion and the incidence of fraternal (nonidentical) twins? What do you think may be a realistic method for looking for a relationshi?
Question Time OkCuid, an online dating website, has sent a decade collecting and analyzing data from eole who use the website. The team randomly selects 5000 users and comares the average attractiveness scores they received (based on eole who viewed them) with the number of messages they were sent in a month. This is an examle of: (A) Exerimental Study (B) Observational Study
Case study: antibiotics in farms htts://www.scientificamerican.com/article/how-drug-resistant-bacteriatravel-from-the-farm-to-your-table/ Drug resistant bacteria (such as MRSA) is on the rise. The rise of this bacteria is due to the mass use of antibiotics (which causes the resistance). Antibiotics are widely used in farming To increase the weight of animals (using low doses of antibiotics reduce the gut flora and increase its weight). To reduce infection amongst animals in industrial farms (where conditions are cramed and infections easily caught and sread). An observational study In a study conducted in Pennsylvania, eole who were the most heavily exosed to cro fields treated with ig manure for instance, because they lived near to them had more than a 30% greater chance of develoing MRSA infections comared with eole who were the least exosed. This study (and many similar studies, read the above article) suggest that the wide sread use of antibiotics in farms is driving drug resistant infections in humans.
However, it has often been argued that farms use antibiotics that humans do not use and resistance that develos to these nonhuman drugs will not ose a risk to eole. Several studies have been done in labs to understand the influence feeding antibiotics have on drug resistance. They have shown that the above line of argument is incorrect. However, it is still not fully understood how drug resistance sreads on large scale industrial farms and how this gets into the food chain. The only way to understand this sread is to collect data using an exerimental study on large scale farms.
Statistical Terminology The individuals in a study are the exerimental units. We often call them subjects, esecially if they are human. Collectively, the units are known as the samle. Each variable we measure on the units is called a resonse. It can be numerical or categorical. The urose of the study is to comare the resonses of different grous of subjects. The characteristic distinguishing the grous is called a factor or a treatment. Examle: To understand how exercise and diet effect blood ressure, one random grou of eole may be laced on a diet/exercise rogram for six months (treatment), and their blood ressure (resonse variable) would be comared with another grou of eole who did not diet or exercise. q The larger (and often concetual) grou from which we select the samle is called the oulation. It usually is much larger.
Key oint we don t see everyone The samle is usually far smaller than the oulation. But the oulation is what we are interested in, not just the individuals in the samle. The objective of statistics is to extraolate from the evidence of the samle to the oulation (objectively and reliably). This is called statistical inference. Sometimes the samle is selected from a suboulation of the oulation. If the suboulation and oulation differ substantially then our results may be biased. Examle: To understand how nutrition may influence the birth weight of a child, a simle random samle was taken from maternity hositals in Texas. Problem: By excluding other states in the study certain arts of the oulation may not be reresented in the samle. Thus leading to a biased samle. This illustrates that the samling method is critical.
Accomanying roblems associated with this Chater Quiz 1 (Q1 and Q2)