Math 1680 Class Notes. Chapters: 1, 2, 3, 4, 5, 6


Chapter 1. Controlled Experiments

Salk vaccine field trial: a randomized controlled double-blind design

1. Suppose they gave the vaccine to everybody, and the incidence of polio went down. Would that show the vaccine was effective?
2. What if they put the consent group in treatment and the no-consent group in control?
   (a) Would the difference in group sizes matter?
   (b) How do the two groups (consent and no-consent) differ? (Polio is a disease of hygiene.)
3. What about the NFIP design (grade 2 in treatment, grades 1 and 3 in control)?
4. In a proper controlled experiment, should the assignment be done by the toss of a coin, or by expert judgment?

Whenever possible, the control group is given a placebo, which is neutral but resembles the treatment. The response should be to the treatment itself rather than to the idea of treatment. In a double-blind experiment, the subjects do not know whether they are in treatment or in control; neither do those who evaluate the response. This guards against bias, either in the responses or in the evaluations. Conventional wisdom dictates that the investigator should control the key variables and randomize the rest.
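Assignment by the toss of a coin can be sketched in a few lines. This is only an illustration: the subject IDs, the function name assign_by_coin_toss, and the fixed seed are assumptions made for the example, not details of the Salk trial.

```python
import random

def assign_by_coin_toss(subjects, seed=None):
    """Assign each subject to treatment or control by an independent fair coin toss."""
    rng = random.Random(seed)  # seeded only so the example is reproducible
    treatment, control = [], []
    for subject in subjects:
        if rng.random() < 0.5:   # "heads": treatment
            treatment.append(subject)
        else:                    # "tails": control
            control.append(subject)
    return treatment, control

# Hypothetical subject IDs, for illustration only
subjects = [f"child_{i}" for i in range(1000)]
treatment, control = assign_by_coin_toss(subjects, seed=0)
print(len(treatment), len(control))
```

Because the coin ignores everything about the subject, the two groups should be similar on average in every respect except the treatment, which is what expert judgment cannot guarantee.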

Chapter 2. Observational Studies

In an observational study, the subjects are assigned to treatment through a process outside the control of the investigator. (Studies on the effects of smoking, for instance, are necessarily observational: nobody is going to smoke for ten years just to please a statistician.) In an observational study, it is harder to draw conclusions about cause-and-effect relationships than in a controlled experiment. The apparent cause and the effect may both be the result of some hidden third factor, a confounder.

A confounder is not just any alternative explanation for an effect. The idea is more subtle: in order for X to confound the association between Y and Z, X has to be associated both with Y and with Z. Note that if X causes Y, we can also say that X is associated with Y. But if X is associated with Y, we cannot conclude that X causes Y. (A8)

In an observational study, a confounding factor can sometimes be controlled for by comparing smaller groups which are relatively homogeneous with respect to the factor. (Sex Bias in Graduate Admissions on p. 17)

Relationships between percentages in subgroups (for instance, admissions rates for men and women in each department separately) can be reversed when the subgroups are combined. This is called Simpson's paradox.

(Town X) Not an example of Simpson's paradox

Democrat
Ward             A       B       C       All
Total number     750     200     150     1100
Number voting    150     160     110     420
Rate             20%     80%     73.3%   38.2%

Republican
Ward             A       B       C       All
Total number     700     230     120     1050
Number voting    112     164     78      354
Rate             16%     71.3%   65%     33.7%

(Town Y) An example of Simpson's paradox

Democrat
Ward             A       B       C       All
Total number     650     250     155     1055
Number voting    140     190     105     435
Rate             21.5%   76%     67.7%   41.2%

Republican
Ward             A       B       C       All
Total number     120     730     220     1070
Number voting    20      500     140     660
Rate             16.7%   68.5%   63.6%   61.7%
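The two towns can be checked mechanically. The sketch below recomputes the ward and overall turnout rates from the counts in the tables and tests for one particular reversal: Democrats ahead in every ward but behind overall. The function names are made up for this example.

```python
# (total number, number voting) per ward, copied from the Town X and Town Y tables
town_x = {
    "Democrat":   {"A": (750, 150), "B": (200, 160), "C": (150, 110)},
    "Republican": {"A": (700, 112), "B": (230, 164), "C": (120, 78)},
}
town_y = {
    "Democrat":   {"A": (650, 140), "B": (250, 190), "C": (155, 105)},
    "Republican": {"A": (120, 20), "B": (730, 500), "C": (220, 140)},
}

def rate(total, voting):
    """Turnout rate in percent."""
    return 100.0 * voting / total

def overall_rate(wards):
    """Turnout rate with all wards combined."""
    total = sum(t for t, _ in wards.values())
    voting = sum(v for _, v in wards.values())
    return rate(total, voting)

def shows_simpsons_paradox(town):
    """True if Democrats lead in every ward but trail when wards are combined."""
    dem, rep = town["Democrat"], town["Republican"]
    dem_leads_every_ward = all(rate(*dem[w]) > rate(*rep[w]) for w in dem)
    dem_leads_overall = overall_rate(dem) > overall_rate(rep)
    return dem_leads_every_ward and not dem_leads_overall

for name, town in (("Town X", town_x), ("Town Y", town_y)):
    print(name, "Simpson's paradox:", shows_simpsons_paradox(town))
```

In Town Y the reversal arises because most Republicans live in the high-turnout wards B and C, while most Democrats live in the low-turnout ward A; the ward is associated with both party and turnout.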

Chapter 3. The Histogram

Class Intervals

In a histogram, percentages are represented by areas. In this setup, the height of a histogram over a class interval shows crowding or density: it is percent per unit length (e.g., % per $1000). The point of sketching the histogram is usually to show some qualitative feature, such as the weight in the tails. For this, a smooth curve is just as good as the histogram, and is easier on the eye.
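Since area represents percentage, the height of each block is the percentage in the class interval divided by the interval's width. A minimal sketch, using made-up class intervals and percentages (not data from the text):

```python
def heights(intervals):
    """intervals: list of (left, right, percent).
    Returns the block height (percent per unit length) for each class interval."""
    return [percent / (right - left) for left, right, percent in intervals]

# Hypothetical income class intervals, in thousands of dollars,
# with the percent of families falling in each
intervals = [(0, 10, 20.0), (10, 25, 45.0), (25, 50, 25.0), (50, 100, 10.0)]

for (left, right, percent), h in zip(intervals, heights(intervals)):
    print(f"${left}k to ${right}k: {percent}% of families, height {h:.2f}% per $1000")

# The areas (height times width) add back up to 100%
total_area = sum(h * (right - left)
                 for (left, right, _), h in zip(intervals, heights(intervals)))
print(total_area)
```

Note that the widest interval holds 10% of the families but has the lowest block: families there are spread thinly, which is what the height is meant to show.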

Variables:
1. Qualitative
2. Quantitative
   (a) Discrete
   (b) Continuous

Chapter 4. The Average and the Standard Deviation

The average of a list of numbers equals their sum, divided by how many there are.

In a cross-sectional study, different subjects are compared to each other at one point in time. In a longitudinal study, subjects are followed over time, and compared with themselves at different points in time. (There is evidence to suggest that, over time, Americans have been getting taller. This is called the secular trend in height, and its effect is confounded with the effect of aging in figure 3. Most of the two-inch drop in height seems to be due to the secular trend: the people age 65-74 were born around 50 years before those age 18-24, and are an inch or two shorter for that reason.)

A histogram balances when supported at the average. The median of a histogram is the value with half the area to the left and half to the right.

The average is to the right of the median whenever the histogram has a long right-hand tail. When dealing with long-tailed distributions, statisticians might use the median rather than the average, if the average pays too much attention to the extreme tail of the distribution.

The SD measures how far away, on the whole, the numbers are from their average. It is the typical departure from average. For many lists of numbers, about 68% of the entries are within one SD of average, and about 95% are within two SDs. Although this rule isn't exact or universal, it works surprisingly well for many data sets that don't follow the normal curve at all (footnote 9 to the chapter).

The root-mean-square (r.m.s.) operation measures the typical size of the numbers in a list.

SD = r.m.s. deviation from average (E9, E10)
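The definitions of the average, the r.m.s. operation, and the SD translate directly into code. The list below is hypothetical, chosen to have a long right-hand tail; the SD divides by the number of entries, following the r.m.s. definition above.

```python
import math
import statistics

def average(xs):
    """Sum of the entries divided by how many there are."""
    return sum(xs) / len(xs)

def rms(xs):
    """Root-mean-square: square the entries, take the mean, then the square root."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))

def sd(xs):
    """SD = r.m.s. deviation from average."""
    avg = average(xs)
    return rms([x - avg for x in xs])

# Hypothetical list with a long right-hand tail
data = [1, 2, 2, 3, 3, 4, 20]

print("average:", average(data))
print("median: ", statistics.median(data))
print("SD:     ", round(sd(data), 2))
```

The single large entry pulls the average to the right of the median, as described for long-tailed histograms. This sd agrees with statistics.pstdev (the population SD), not statistics.stdev, which divides by one less than the number of entries.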

Chapter 5. The Normal Approximation for Data

This chapter ties together histograms, the average, the SD, and the normal curve. The normal curve was discovered around 1720 by Abraham de Moivre, while he was developing the mathematics of chance. The normal curve has an equation:

$$y = \frac{100\%}{\sqrt{2\pi}} \, e^{-x^2/2}, \quad \text{where } e = 2.71828\ldots$$

The normal curve is shown in the following figure:

[Figure: Normal Curve. Vertical axis: percent per standard unit (0 to 45). Horizontal axis: standard units (-4 to 4).]

- The area under the normal curve between -1 and +1 is about 68%;
- the area under the normal curve between -2 and +2 is about 95%;
- the area under the normal curve between -3 and +3 is about 99.7%.

Many histograms for data are similar in shape to the normal curve, provided they are drawn to the same scale. Making the horizontal scales match up involves standard units.
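The three areas can be checked numerically. The sketch below uses the error function math.erf from the Python standard library to get areas under the normal curve; the helper names curve_height and normal_area are chosen for this example.

```python
import math

def curve_height(z):
    """Height of the normal curve at z standard units, in percent per standard unit."""
    return 100.0 / math.sqrt(2.0 * math.pi) * math.exp(-z * z / 2.0)

def normal_area(lo, hi):
    """Percent of the area under the normal curve between lo and hi (standard units)."""
    def area_to_left(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return 100.0 * (area_to_left(hi) - area_to_left(lo))

print(f"peak height: {curve_height(0):.1f}% per standard unit")
for k in (1, 2, 3):
    print(f"area between -{k} and +{k}: {normal_area(-k, k):.2f}%")
```

The computed areas are close to, but not exactly, 68%, 95%, and 99.7%; the rule quotes rounded values.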

A value is converted to standard units by seeing how many SDs it is above or below the average.

In figure 2, the shaded area under the histogram between 61 inches and 66 inches represents the percentage of women with heights in that range, which is the interval within 1 SD of the average. By inspection, the shaded area is about equal to the area under the normal curve between -1 and +1. This last area is 68%, justifying the 68% rule.

(The two vertical scales match up in the following way:

$$10\% \text{ per inch} = \frac{10\%}{1 \text{ inch}} = \frac{10\% \times 2.5}{1 \text{ inch} \times 2.5} = \frac{25\%}{2.5 \text{ inches}} = \frac{25\%}{1 \text{ standard unit}} = 25\% \text{ per standard unit}$$

) (R5)

Normal approximation (examples 8-9 on pp. 85-87)
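The conversion to standard units and the normal approximation can be sketched as follows. The average of 63.5 inches and SD of 2.5 inches are inferred from the statement that 61 to 66 inches is the interval within 1 SD of the average; the function names are assumptions of this example.

```python
import math

def to_standard_units(value, avg, sd):
    """How many SDs the value is above (+) or below (-) the average."""
    return (value - avg) / sd

def normal_percent_between(lo, hi, avg, sd):
    """Normal approximation: percent of entries between lo and hi."""
    def area_to_left(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    z_lo = to_standard_units(lo, avg, sd)
    z_hi = to_standard_units(hi, avg, sd)
    return 100.0 * (area_to_left(z_hi) - area_to_left(z_lo))

# Inferred from the 61-66 inch interval: average 63.5 inches, SD 2.5 inches
AVG, SD = 63.5, 2.5

print(to_standard_units(61, AVG, SD), to_standard_units(66, AVG, SD))
print(f"{normal_percent_between(61, 66, AVG, SD):.0f}% between 61 and 66 inches")

# Vertical scale: percent per inch times inches per standard unit
print(10.0 * SD, "% per standard unit")
```

The same two steps (convert the endpoints to standard units, then read off the area under the normal curve) apply to any interval, not just the one within 1 SD of the average.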

For reasons of their own, statisticians call de Moivre's curve normal. This gives the impression that other curves are abnormal. Not so. Many histograms follow the normal curve very well, and many others, like the income histogram, do not. Later in the book, we will present a mathematical theory which helps explain when histograms should follow the normal curve.

Finding percentiles for the normal curve. (p. 265, #9)

Change of scale:
1. Adding the same number to every entry on a list adds that constant to the average; the SD does not change.
2. Multiplying every entry on a list by the same positive number multiplies the average and the SD by that constant.
3. These changes of scale do not change the standard units.
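The three change-of-scale facts can be verified on a small list. The list, the added constant 10, and the multiplier 3 are arbitrary choices for this illustration; the SD used is the r.m.s. deviation from average (statistics.pstdev).

```python
import statistics

def standard_units(xs):
    """Convert every entry to standard units: (entry - average) / SD."""
    avg = statistics.mean(xs)
    sd = statistics.pstdev(xs)  # SD as r.m.s. deviation from average
    return [(x - avg) / sd for x in xs]

data = [1.0, 3.0, 4.0, 5.0, 7.0]   # hypothetical list: average 4, SD 2
shifted = [x + 10 for x in data]   # add the same number to every entry
scaled = [3 * x for x in data]     # multiply every entry by a positive number

for name, xs in (("original", data), ("shifted", shifted), ("scaled", scaled)):
    print(name, statistics.mean(xs), statistics.pstdev(xs), standard_units(xs))
```

Adding 10 moves the average from 4 to 14 and leaves the SD at 2; multiplying by 3 gives average 12 and SD 6; in all three cases the standard units are the same.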

Chapter 6. Measurement Error

The SD of a series of repeated measurements gives the likely size of the chance error in each one.

individual measurement = exact value + chance error

The variability in repeated measurements reflects the variability in the chance errors, and both are gauged by the SD of the data. Mathematically, the SD of the chance errors must equal the SD of the measurements: adding the exact value is just a change of scale.

With outliers, many histograms just do not follow the normal curve (figure 2). There is a hard choice to make when investigators see an outlier. Either they ignore it, or they have to concede that their measurements don't follow the normal curve. The prestige of the curve is so high that the first choice is the usual one: a triumph of theory over experience.

Bias affects all measurements the same way, pushing them in the same direction. Chance errors change from measurement to measurement, sometimes up and sometimes down. The basic equation has to be modified when each measurement is thrown off by bias as well as chance error:

individual measurement = exact value + bias + chance error

Usually, bias cannot be detected just by looking at the measurements themselves. Instead, the measurements have to be compared to an external standard or to theoretical predictions.
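A small simulation illustrates why bias does not show up in the measurements themselves. Everything here is an assumption made for the sketch: the exact value, the bias, the size of the chance errors, and the choice to draw chance errors from a normal curve.

```python
import random
import statistics

def simulate_measurements(exact_value, bias, chance_sd, n, seed=0):
    """individual measurement = exact value + bias + chance error.
    Chance errors are drawn from a normal curve (an assumption of this sketch)."""
    rng = random.Random(seed)  # same seed gives the same chance errors
    return [exact_value + bias + rng.gauss(0.0, chance_sd) for _ in range(n)]

# Hypothetical values, for illustration only
unbiased = simulate_measurements(exact_value=10.0, bias=0.0, chance_sd=0.2, n=10_000)
biased = simulate_measurements(exact_value=10.0, bias=0.5, chance_sd=0.2, n=10_000)

for name, xs in (("unbiased", unbiased), ("biased", biased)):
    print(name, "average:", round(statistics.mean(xs), 3),
          "SD:", round(statistics.pstdev(xs), 3))
```

Both series have the same SD, since adding a constant bias is just a change of scale. The averages differ by the bias, but that difference is visible only because the exact value is known here, playing the role of the external standard.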