STATISTICS: METHOD TO GET INSIGHT INTO VARIATION IN A POPULATION
- If every unit in the population had the same value (say, everyone had the same income or the same blood pressure), there would be no need for statistics.
Statistics draws conclusions about a population, not about individual units in the population.
Examples:
- 40% of the student population wear glasses (says nothing about individual students).
- The U.S. unemployment rate is 7% (no mention of who is unemployed).
- In 70% of cases, Lipitor lowers cholesterol.
Suppose I want to know the percentage of students at MSU who use:
- Microsoft Windows
- Apple iOS
- Linux
- Others
Method 1
- Query ALL students about the software they use.
- This gives information on the entire population.
- Collecting data on an entire population is called a CENSUS.
- Generally infeasible in large populations.
Method 2
- Draw a sample of students from the entire population.
- Collect data on the sample.
- Use it to draw conclusions about the population.
Statistical Inference
- Need to draw a representative sample.
- We cannot judge directly whether a sample is representative of the population (we have no information on the population).
- Instead, draw a random sample and calculate the probability that the sample is (approximately) representative.
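The last point can be illustrated with a small simulation. This is only a sketch under stated assumptions: the 60% Windows share, the sample sizes, and the 5-point margin are hypothetical, and the population is assumed large enough that each sampled student can be treated as an independent draw. The function name `share_within_margin` is made up for this example.

```python
import random

def share_within_margin(true_share, n, margin, trials=1000, seed=0):
    """Estimate the probability that a random sample of size n gives a
    sample share within `margin` of the true population share."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Each sampled student uses Windows with probability true_share
        # (assumption: population is much larger than the sample).
        count = sum(rng.random() < true_share for _ in range(n))
        if abs(count / n - true_share) <= margin:
            hits += 1
    return hits / trials

# Hypothetical population: 60% of students use Windows.
p_small = share_within_margin(0.60, n=50, margin=0.05)
p_large = share_within_margin(0.60, n=1000, margin=0.05)
print("n = 50:  ", p_small)
print("n = 1000:", p_large)
```

Under these assumptions, the larger random sample lands close to the true share far more often than the smaller one, which is the sense in which the probability of being "approximately representative" can be calculated.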
A representative sample exhibits characteristics typical of those possessed by the population of interest. A simple random sample of n experimental units is a sample selected from the population in such a way that every different sample of size n has an equal chance of selection.
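A minimal sketch of drawing a simple random sample, assuming the population is available as a list (the roster of 50,000 student IDs is hypothetical). Python's `random.sample` selects without replacement and gives every subset of size n the same chance of selection, which matches the definition above.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Return n distinct units chosen so that every subset of size n
    is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Hypothetical roster of student IDs.
student_ids = list(range(1, 50001))
sample = simple_random_sample(student_ids, 100, seed=1)
print(sample[:10])
```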
Issues in sampling
- Logistics (non-statistical).
- Ensure that the sample is drawn randomly from the population of interest.
- If not, this leads to:
  - Bias
  - Other sampling errors
How a sample is selected from a population is of vital importance in statistical inference because the probability of an observed sample will be used to infer the characteristics of the sampled population.
Statistics is used in various situations
- to infer some feature of a population based on a sample
- to test the effect of a treatment
- to estimate the effect of a treatment
- to classify an object
There is a whole theory of Sampling and Design of Experiments
- Simple random sampling
- Designed trials to control variability
- Designed trials to avoid bias
Case Study: sampling from a population
The Literary Digest Poll
- 1936, Roosevelt versus Landon.
- Campaign centered on economic policies.
Literary Digest prediction
- Landon 57%, FDR 43%
- Used a sample of 2.4 million people
Actual result
- FDR 60.8%
The Literary Digest's method for choosing its sample was as follows: Based on every telephone directory in the United States, lists of magazine subscribers, rosters of clubs and associations, and other sources, a mailing list of about 10 million names was created. Every name on this list was mailed a mock ballot and asked to return the marked ballot to the magazine.
There were two basic causes of the Literary Digest's downfall:
- selection bias
- nonresponse bias
Bias
- The sample was selected from telephone directories, club membership lists, etc.
- These sources were biased toward upper-class voters and excluded lower-income voters.
- The Literary Digest mailing list was far from being a representative cross-section of the population.
Nonresponse
- Out of 10 million on the mailing list, only about 2.4 million responded to the survey.
- People who respond to surveys are different from people who don't, not only in the obvious way (their attitude toward surveys) but also in more subtle and significant ways.
- Nonresponse is difficult to handle.
Moral
- A badly chosen big sample is much worse than a well-chosen small sample.
- Watch out for selection bias and nonresponse bias.
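The moral can be checked with a toy simulation. Every number here is a hypothetical assumption chosen for illustration, not Literary Digest data: 200,000 voters, 30% of whom appear on the "mailing list", with list members supporting FDR at 40% and everyone else at 70%.

```python
import random

rng = random.Random(1936)

# Build a hypothetical electorate of (on_list, votes_fdr) pairs.
voters = []
for _ in range(200_000):
    on_list = rng.random() < 0.30           # assumed share on the list
    p_fdr = 0.40 if on_list else 0.70       # assumed support levels
    voters.append((on_list, rng.random() < p_fdr))

def fdr_share(group):
    """Fraction of the group voting for FDR."""
    return sum(vote for _, vote in group) / len(group)

true_share = fdr_share(voters)

# Big but badly chosen: everyone on the list (selection bias).
big_biased = [v for v in voters if v[0]]
# Small but well chosen: a simple random sample of 1,000 voters.
small_random = rng.sample(voters, 1000)

print("true FDR share:      ", round(true_share, 3))
print("big biased sample:   ", len(big_biased), round(fdr_share(big_biased), 3))
print("small random sample: ", len(small_random), round(fdr_share(small_random), 3))
```

In this setup the biased sample is tens of times larger than the random one, yet its estimate sits far from the true share, because sample size does nothing to correct a sample drawn from the wrong part of the population.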
Quiz 01 Not graded
Testing for effect of treatment
- Want to study the effect of one variable on another.
- Examples: the effect of fertilizer on yield, or of an ad campaign.
- Ensure that whatever effect is seen on yield is attributable only to the fertilizer.
- Conduct the trial on similar plots, in similar environments, etc.
Case study: Twins
Studies of twins reared apart are one of the most powerful tools that scholars have to analyze the relative contributions of heredity and environment to the makeup of individual human natures. Such a study might not set to rest the quarrel over the relative importance of nature versus nurture, but few other experiments could be more pertinent.
Polio Vaccine Trial
- 1954 Salk polio vaccine trials
- Biggest public health experiment ever
- Polio epidemics hit the U.S. in the 20th century
- Struck hardest at children
- Responsible for 6% of deaths among 5- to 9-year-olds
Polio Vaccine Trial
- Polio is rare, but the virus itself is common.
- Children from higher-income families were more vulnerable to polio!
- Children in less hygienic surroundings contract mild polio early in childhood, while still protected by their mothers' antibodies. They develop immunity early.
- Children from more hygienic surroundings don't develop such antibodies.
Case study: Polio Vaccine
- The polio rate of occurrence is about 50 per 100,000.
- Suppose the vaccine was 50% effective and 10,000 subjects were recruited for each of the control and treatment groups.
- You would expect 5 polio cases in the control group and 2-3 in the treatment group.
- Such a difference could be attributed to random variation.
- Clinical trials were needed on a massive scale.
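The arithmetic behind this slide can be sketched as follows. As an assumption of this example (not stated on the slide), case counts in each group are modeled as independent Poisson variables, a common approximation for rare events. The function names are hypothetical, and the 200,000 group size is used only as a round figure for a "massive" trial.

```python
import math

RATE = 50 / 100_000      # polio rate from the slide
EFFECTIVENESS = 0.5      # assumed vaccine effectiveness from the slide

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), computed in log space."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def prob_treatment_not_lower(n_per_group):
    """P(treatment cases >= control cases) under the Poisson model,
    i.e. the chance the trial shows no reduction despite a real effect."""
    lam_control = n_per_group * RATE
    lam_treat = lam_control * (1 - EFFECTIVENESS)
    upper = int(lam_control + 10 * math.sqrt(lam_control) + 20)
    prob = 0.0
    cdf_below = 0.0  # running P(treatment cases < c)
    for c in range(upper + 1):
        prob += poisson_pmf(c, lam_control) * max(0.0, 1.0 - cdf_below)
        cdf_below += poisson_pmf(c, lam_treat)
    return prob

expected_control = 10_000 * RATE                       # 5 cases
expected_treat = expected_control * (1 - EFFECTIVENESS)  # 2.5 cases
p_small = prob_treatment_not_lower(10_000)
p_large = prob_treatment_not_lower(200_000)
print(expected_control, expected_treat)
print("10,000 per group: ", p_small)
print("200,000 per group:", p_large)
```

With 10,000 children per group, a working vaccine would still fail to show fewer cases than the control group a substantial fraction of the time under this model; with groups in the hundreds of thousands that probability becomes negligible, which is why the trial had to be so large.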
Case study: Polio Vaccine
- Why not just distribute the vaccine to some and see if it lowered the polio rate?
- A yearly drop might mean the vaccine was effective, or that that year was not an epidemic year.
[Figure: Number of polio cases in the U.S., 1930 to 1955 (x-axis: year; y-axis: number of cases, 0 to 60,000)]
The NFIP study
- Vaccinate all children in grade 2 whose parents give consent.
- Leave grades 1 and 3 unvaccinated.
- Compare the incidence of polio in grade 2 to that in the other grades.
- Vaccinated group: treatment group
- Grades 1 and 3: control group
The observed control experiment suffers from
- selection bias
- diagnostic bias
Selection Bias
In terms of selection, the treatment and control groups differ with respect to at least two variables:
- age
- parental consent for vaccination
Parents who consented to vaccination were on average better educated and more affluent, and lived in more sanitary conditions.
The critical issue is whether these variables are related to the variable of interest, namely contracting polio.
Diagnostic Bias
In terms of diagnosis, the problem is that mild cases of polio resemble influenza and other common diseases. Doctors, who generally believe in the value of vaccines, would tend to diagnose polio slightly more often for the unvaccinated children in the control group than for the vaccinated children in the treatment group.
Randomized Study
- Of the children whose parents give consent, randomly allocate half to a treatment group and the other half to a control group.
- Treatment group gets the vaccine.
- Control group gets a placebo.
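A minimal sketch of the random allocation step, assuming the consenting children are identified by a list of IDs (the IDs and the helper name `randomize` are hypothetical). Shuffling and splitting in half leaves the assignment entirely to chance, so neither parents nor investigators influence who ends up in which group.

```python
import random

def randomize(consented_ids, seed=None):
    """Randomly split consenting children into two halves:
    (treatment group, control group)."""
    rng = random.Random(seed)
    shuffled = list(consented_ids)   # copy so the input is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical IDs of children whose parents gave consent.
consented = list(range(1, 1001))
treatment, control = randomize(consented, seed=1954)
print(len(treatment), "get the vaccine;", len(control), "get the placebo")
```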
Double Blinding
- Neither children nor parents know whether the child has received the vaccine or the placebo.
- Doctors also don't know.
Field Trial Data: Polio Cases

Experiment        Study Group               Population   Paralytic   Non-Paralytic   False Reports
Randomized Study  Vaccinated                   200,745          33              24              25
                  Placebo                      201,229         115              27              20
                  Not Inoculated               338,778         121              36              25
                  Incomplete Vaccinations        8,484           1               1               0
Observed Control  Vaccinated                   221,998          38              18              20
                  Controls                     725,173         330              61              48
                  Grade 2 Not Inoculated       123,605          43              11              12
                  Incomplete Vaccinations        9,904           4               0               0
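Raw counts are hard to compare because the groups differ in size, so one natural step is to convert them to rates per 100,000. The sketch below uses only the population and paralytic-case figures from the field trial data above; the group labels in the dictionary and the helper name `rate_per_100k` are chosen for this example.

```python
def rate_per_100k(cases, population):
    """Cases per 100,000 children."""
    return cases / population * 100_000

# (paralytic cases, population) taken from the field trial data
groups = {
    "randomized: vaccinated": (33, 200_745),
    "randomized: placebo": (115, 201_229),
    "observed control: vaccinated": (38, 221_998),
    "observed control: controls": (330, 725_173),
}

rates = {name: rate_per_100k(c, n) for name, (c, n) in groups.items()}
for name, rate in rates.items():
    print(f"{name:30s} {rate:5.1f} per 100,000")
```

In both designs the vaccinated children show a lower paralytic rate than the comparison group. The gap is narrower in the observed control design, which is at least consistent with the selection bias described for that design.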
Statistics is used in various situations
- to infer some feature of a population based on a sample
- to test the effect of a treatment
  - Three or more treatments: Analysis of Variance
- to estimate the effect of a treatment
  - Regression Problems
- to classify an object
  - Classification or Pattern Recognition Problems
Dewey Vs Truman (1948)
- The Gallup, Roper, and Crossley polls all predicted that Dewey would defeat Truman by a significant margin.
- The Crossley, Gallup, and Roper organizations all used quota sampling.
- Each interviewer was assigned a specified number of subjects to interview.
- Moreover, the interviewer was required to interview specified numbers of subjects in various categories, based on residential area, sex, age, race, economic status, and other variables.
Dewey Vs Truman (1948)
Predicted and actual share of the vote (percent)

Candidate   Crossley   Gallup   Roper   Election Results
Truman            45       44      38                 50
Dewey             50       50      53                 45
Others             5        6       9                  5
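A short sketch of how far off each poll was, using only the percentages in the table above. The "margin error" here is defined, for this example only, as the predicted Dewey-minus-Truman margin minus the actual margin.

```python
# Percentages from the 1948 table
polls = {
    "Crossley": {"Truman": 45, "Dewey": 50},
    "Gallup": {"Truman": 44, "Dewey": 50},
    "Roper": {"Truman": 38, "Dewey": 53},
}
result = {"Truman": 50, "Dewey": 45}

actual_margin = result["Dewey"] - result["Truman"]  # negative: Truman won

margin_errors = {}
for name, p in polls.items():
    predicted_margin = p["Dewey"] - p["Truman"]
    margin_errors[name] = predicted_margin - actual_margin

for name, err in margin_errors.items():
    print(f"{name}: margin off by {err} points in Dewey's favor")
```

All three errors point in the same direction, which is the signature of a systematic bias in the sampling method rather than of chance variation alone.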
Dewey Vs Truman (1948)
- In quota sampling, the subjects are hand-picked to resemble the population with respect to some key characteristics.
- Quota sampling SEEMS reasonable because it ensures that the sample will resemble the population with respect to some of the important characteristics related to voting behavior.
- BUT: quota sampling does not work very well, due to unintentional bias on the part of the interviewers.
- Probability methods use objective chance procedures to select samples. They guard against bias because they leave no discretion to the interviewer.
Quiz 01 (not graded)