Chapter 15: Continuation of probability rules

Example: HIV-infected women attending an infectious disease clinic in Bangkok were screened for high-risk HPV and received a Pap test; those with abnormal cervical cytologies were referred for diagnosis and treatment.¹ The table below shows some of the data.

¹ Sexually Transmitted Diseases: February 2007, Volume 34, Issue 2, pp 104-107.

Table 1: Data extracted from Table 2 of Screening HIV-Infected Women for Cervical Cancer in Thailand: Findings From a Demonstration Project, Sirivongrangson, et al.

                   Abnormal cytology
High-risk HPV      Yes     No    Total
Negative             6    123      129
Positive            37     44       81
Total               43    167      210

This data set will be used to develop some more probability rules besides those presented in Chapter 14. The 210 women cross-classified in the table will be treated as a population.

1. Let A denote the event that a person selected at random from the population tests positive for high-risk HPV. One individual is to be drawn at random from among the 210, and each is equally likely to be selected. Since 81 of the 210 tested positive for high-risk HPV, $P(A) = 81/210 = .386$.

2. Let B denote the event that an individual selected at random from the population has an abnormal cytology. The probability of B is the number found to have abnormal cytology out of the number screened; hence, $P(B) = 43/210 = .205$.

3. The probability that an individual tests positive for high-risk HPV and was found to have abnormal cytology is $P(A \text{ and } B) = 37/210 = .176$.

The probability that a randomly selected individual tests positive for high-risk HPV or is found to have abnormal cytology is the proportion of the 210 that fall in one of three groups:

1. Individuals that tested positive for high-risk HPV and were not found to have abnormal cytology (there are 44 individuals in this group),

2. Individuals that tested negative for high-risk HPV and were found to have abnormal cytology (there are 6 individuals in this group), and

3. Individuals that tested positive for high-risk HPV and were found to have abnormal cytology (there are 37 individuals in this group).

Thus, the probability that an individual tests positive for high-risk HPV or is found to have abnormal cytology is

$$P(A \text{ or } B) = \frac{44 + 6 + 37}{210} = .414.$$

Another method of computing $P(A \text{ or } B)$ uses the general addition rule:

$$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B).$$

Why subtract $P(A \text{ and } B)$? Because the individuals in this group were counted once in computing $P(A)$ and again in computing $P(B)$; they've been counted twice. Subtracting $P(A \text{ and } B)$ corrects for the double-counting. With the counts from Table 1, the rule gives $(81 + 43 - 37)/210 = 87/210 = .414$, in agreement with the first calculation.

The first calculation avoids over-counting by decomposing the event of interest (A or B) into three mutually exclusive events. (Two mutually exclusive events have no individuals in common.)

[Venn diagram: two overlapping circles, "Cytology abnormal" and "HPV positive", inside a box. Region counts: 6 (abnormal cytology only), 37 (both), 44 (HPV positive only), 123 (neither).]

The Venn diagram above illustrates the situation. Note that the box represents the entire population, and that the 2 circles partition the box into 4 non-overlapping regions. The regions correspond to the sub-groups of the population.

Remark: A useful mathematical result is that an event A can be written as the union of two mutually exclusive events. Let B denote another event in the sample space. Then, $A = (A \text{ and } B) \text{ or } (A \text{ and } B^c)$. As the events $(A \text{ and } B)$ and $(A \text{ and } B^c)$ are mutually exclusive,

$$P(A) = P(A \text{ and } B) + P(A \text{ and } B^c). \tag{1}$$
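The two ways of computing the probability of "A or B", and the decomposition in equation (1), can be checked numerically. The following is a minimal sketch using the counts of Table 1; the dictionary layout, the labels, and the helper name `prob` are illustrative choices, not part of the original text.

```python
from fractions import Fraction

# Counts from Table 1, keyed by (high-risk HPV status, cytology result).
# The string labels are illustrative names chosen for this sketch.
counts = {
    ("hpv_neg", "abnormal"): 6,
    ("hpv_neg", "normal"): 123,
    ("hpv_pos", "abnormal"): 37,
    ("hpv_pos", "normal"): 44,
}
total = sum(counts.values())  # 210 women, treated as the population


def prob(event):
    """Probability that a woman drawn at random satisfies event(hpv, cyt)."""
    favorable = sum(n for (hpv, cyt), n in counts.items() if event(hpv, cyt))
    return Fraction(favorable, total)


p_a = prob(lambda hpv, cyt: hpv == "hpv_pos")                      # P(A)
p_b = prob(lambda hpv, cyt: cyt == "abnormal")                     # P(B)
p_a_and_b = prob(lambda hpv, cyt: hpv == "hpv_pos" and cyt == "abnormal")
p_a_and_not_b = prob(lambda hpv, cyt: hpv == "hpv_pos" and cyt != "abnormal")

# Method 1: count the three mutually exclusive groups directly.
p_a_or_b_direct = prob(lambda hpv, cyt: hpv == "hpv_pos" or cyt == "abnormal")

# Method 2: general addition rule, subtracting the double-counted overlap.
p_a_or_b_rule = p_a + p_b - p_a_and_b

print(float(p_a_or_b_direct), float(p_a_or_b_rule))
```

Exact fractions are used so that the two methods can be compared without rounding error.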
Rearranging equation (1) provides another useful result:

$$P(A \text{ and } B) = P(A) - P(A \text{ and } B^c).$$

Salk polio vaccine trials of 1954

One of the largest randomized experiments ever (in terms of numbers of subjects) used 401,983 volunteer children. Of these, 200,745 received the Salk vaccine and 201,229 received a placebo. Of those that received the vaccine, 33 contracted paralytic polio² during the 1954 polio season, and of those that received the placebo, 115 contracted paralytic polio. Let V denote the event that a child received the vaccine and I denote the event that a child contracted polio. The data are summarized in Table 2.

² Paralytic polio does not include diagnosed polio cases without paralysis - there were 83 total diagnosed cases.

Table 2: Experimental results of the 1954 Salk vaccine trials.

                             Vaccinated    Placebo      Total
Number of children              200,745    201,229    401,983
Number of paralytic cases            33        115        148

For this population of 401,983 volunteer children, find the probabilities of the following events using the results above.³

³ The population is sufficiently large that some of the probabilities below may (in principle) be treated as probabilities applying to the larger population of all U.S. resident children during the 1950s.

1. A child selected at random contracted polio.

2. A child selected at random contracted polio and received the vaccine.

3. A child selected at random contracted polio and did not receive the vaccine.

The probabilities are

1. $P(I) = 148/401{,}983 = .000368$ is the probability that a child selected at random contracted paralytic polio.

2. $P(V \text{ and } I) = 33/401{,}983 = .000082$ is the probability that a child selected at random contracted paralytic polio and received the vaccine.

3. $P(V^c \text{ and } I)$ is the probability that a child selected at random contracted paralytic polio and did not receive the vaccine. We can compute it from the rearranged form of equation (1): $P(V^c \text{ and } I) = P(I) - P(V \text{ and } I) = .000368 - .000082 = .000286$.
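The three probabilities can be reproduced with a few lines of arithmetic. This is a minimal sketch; the variable names are illustrative, and the total is taken as reported in Table 2.

```python
# Figures as reported in Table 2.
total_children = 401_983
cases_vaccinated = 33   # paralytic cases among vaccinated children
cases_placebo = 115     # paralytic cases among placebo children

# P(I): a randomly selected child contracted paralytic polio.
p_i = (cases_vaccinated + cases_placebo) / total_children

# P(V and I): contracted paralytic polio and received the vaccine.
p_v_and_i = cases_vaccinated / total_children

# P(V^c and I) from the rearranged equation (1): P(I) - P(V and I).
p_vc_and_i = p_i - p_v_and_i

# Cross-check against the direct count of placebo cases.
p_vc_and_i_direct = cases_placebo / total_children

print(f"P(I)         = {p_i:.6f}")
print(f"P(V and I)   = {p_v_and_i:.6f}")
print(f"P(V^c and I) = {p_vc_and_i:.6f}")
```

The subtraction and the direct count agree because the vaccinated and unvaccinated cases are mutually exclusive and together make up all 148 cases.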
How much more likely is it to select at random an unvaccinated child that contracted paralytic polio than it is to select at random a vaccinated child that contracted paralytic polio? It is $.000286/.000082 = 3.49$ times more likely. Was this vaccine a great success?⁴

Conditional probability

An excellent approach to assessing the effectiveness of vaccines is through conditional probability. We discussed conditional distributions in Chapter 3 (a conditional distribution was obtained from a two-way table by considering the distribution of a variable when the data were limited to a single level of the second variable). The approach is extended now to probabilities.

Suppose that we examine the probability of contracting paralytic polio given that the child received the Salk vaccine. The probability is expressed as $P(I \mid V)$, and it is the proportion of vaccinated children that contracted the disease:

$$P(I \mid V) = \frac{33}{200{,}745} = .000164. \tag{2}$$

In comparison, the probability of contracting paralytic polio given that the child received the placebo is

$$P(I \mid V^c) = \frac{115}{201{,}229} = .000571,$$

and the chance that a child contracts paralytic polio is (approximately) $.000571/.000164 = 3.48$ times greater if the child is not vaccinated than if the child is vaccinated. Note that even if the vaccine caused some of the polio cases, it's better to be vaccinated than not.

The conditional probability of A given B can be computed from $P(B)$ and $P(A \cap B)$, the probability that both A and B occur. The principle behind the calculation is that if A is to happen given that B has happened, then $A \cap B$ must happen. $P(A \cap B)$ is not itself the conditional probability, since $P(A \cap B)$ is, in the Salk vaccine example, the fraction of children that contract paralytic polio and have been vaccinated out of all children, including those that were not vaccinated. These additional children should not be counted, since we are interested only in the proportion of vaccinated children that contract polio out of all vaccinated children.
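The difference between the joint probability and the conditional probability comes down to the choice of denominator. The sketch below, with illustrative variable names and the figures of Table 2, computes both and shows that dividing the joint probability by P(V) recovers the conditional probability.

```python
# Figures as reported in Table 2.
total_children = 401_983
vaccinated = 200_745
placebo = 201_229
cases_vaccinated = 33
cases_placebo = 115

# Conditional probabilities: the denominator is the conditioning group only.
p_i_given_v = cases_vaccinated / vaccinated   # P(I | V)
p_i_given_vc = cases_placebo / placebo        # P(I | V^c)

# Joint probability: the denominator is the whole population.
p_i_and_v = cases_vaccinated / total_children  # P(I and V)
p_v = vaccinated / total_children              # P(V)

# Dividing the joint probability by P(V) rescales it to the vaccinated group.
p_i_given_v_rescaled = p_i_and_v / p_v

# How many times greater the chance of paralytic polio is without the vaccine.
ratio = p_i_given_vc / p_i_given_v

print(f"P(I | V)   = {p_i_given_v:.6f}")
print(f"P(I | V^c) = {p_i_given_vc:.6f}")
print(f"ratio      = {ratio:.2f}")
```

Computing the ratio from the unrounded probabilities gives a value that rounds to the same two decimals as the ratio of the rounded figures.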
To convert $P(A \cap B)$ to a conditional probability, it should be divided by the probability of the conditioning event, $P(B)$. Dividing by $P(B)$ serves to scale $P(A \cap B)$ to reflect the likelihood of the event B. The formula for the conditional probability is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \tag{3}$$

provided $P(B) > 0$. If $P(B) = 0$, then it is impossible for B to occur, and the conditional probability of A given that B has occurred is meaningless.

⁴ Upon hearing the results of the study people openly wept with relief, p. 203, Polio, An American Story, D.M. Oshinsky, 2005.

For example, let's compute $P(I \mid V)$ again using the conditional probability formula. First, $P(V) = 200745/401983 = .499387$; then,

$$P(I \mid V) = \frac{P(I \cap V)}{P(V)} = \frac{.000082}{.499387} = .000164,$$

which is the same as the result above (equation 2).

Examine the last calculation again. Recall that $P(I \cap V) = 33/401983$ and $P(V) = 200745/401983$. Thus,

$$P(I \mid V) = \frac{P(I \cap V)}{P(V)} = \frac{33}{401983} \times \frac{401983}{200745} = \frac{33}{200745},$$

which is the proportion of the vaccinated children that contracted paralytic polio. Dividing by the probability of the conditioning event (V in this example) changes the denominator in formula (3) from the number of all children to the number in the conditioning set (vaccinated children).

The conditional probability formula (equation 3) can be rearranged to provide a formula for computing the probability of $A \cap B$ from the probability of B and the conditional probability of A given B:

$$P(A \cap B) = P(A \mid B)P(B). \tag{4}$$

Also, $P(A \cap B) = P(B \mid A)P(A)$.

Independence: A mathematical definition of independence can be formulated at this point. Events A and B are independent if $P(A \mid B) = P(A)$. In other words, conditioning on B, or knowing that B has occurred, does not alter the probability of A; no information is gained about the likelihood of A by knowing that B has occurred. For example, if A is the event that a randomly selected individual has black hair, then knowing that the individual is female does not change the probability of A. If A is the event that the randomly selected individual is at least 6 feet tall, then knowing that the individual is male does alter the probability of A.⁵

⁵ Six feet is approximately the 80th percentile of 20-year-old male heights, whereas 6 feet is approximately the 97th percentile of 20-year-old female heights. Suppose we have 200 randomly selected 20-year-old individuals, about 100 male and 100 female. About $100 \times 23/200 = 11.5\%$ will be at least 6 feet tall, and the conditional probabilities of being at least 6 feet tall are .2 for males and .03 for females. The conditional probability of being at least 6 feet tall given that the individual is male is $.2/.115 = 1.74$ times larger than the unconditional probability of being at least 6 feet tall.

If events A and B are independent, then the following mathematical statements are true:

$$P(A \mid B) = P(A)$$
$$P(B \mid A) = P(B)$$
$$P(A \cap B) = P(A)P(B).$$

Suppose two events are disjoint (they cannot happen simultaneously). For instance, suppose that A is the event that a randomly selected individual is male and B is the event that the individual is able to give birth. Are A and B independent?⁶

⁶ No, since $P(A) = .5 \neq 0 = P(A \mid B)$, where $P(A \mid B)$ is the probability that an individual is male given that they are able to give birth.

Example

Consider the Titanic data. Is there an association between the ticket class of the passenger and survivorship? Table 3 is a contingency table showing the cross-classification of passengers by survival status and ticket class.

Table 3: Titanic data from Table 2, p. 23, Intro Stats.

            Class
            First   Second   Third   Crew   Total
Survived      203      118     178    212     711
Died          122      167     528    673    1490
Total         325      285     706    885    2201

The conditional distributions of survivorship (conditioning on ticket class) give the relative frequency of survivorship given ticket class. For example, given that the ticket class was first-class, the relative frequency of survival is $203/325 = .625$, and the relative frequency of death is $1 - .625 = 122/325 = .375$.
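The first-class calculation extends to every column of Table 3: divide each cell by its column total. A minimal sketch follows, with dictionary names chosen for illustration.

```python
# Counts from Table 3, by ticket class (crew included).
survived = {"First": 203, "Second": 118, "Third": 178, "Crew": 212}
died = {"First": 122, "Second": 167, "Third": 528, "Crew": 673}

# Conditional distribution of survival status given ticket class:
# each count is divided by the total for its own class.
conditional = {}
for ticket_class in survived:
    class_total = survived[ticket_class] + died[ticket_class]
    conditional[ticket_class] = {
        "survived": survived[ticket_class] / class_total,
        "died": died[ticket_class] / class_total,
    }

for ticket_class, dist in conditional.items():
    print(f"{ticket_class:<7} survived {dist['survived']:.3f}  died {dist['died']:.3f}")
```

Within each class the two relative frequencies sum to 1, because every person in the class either survived or died.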
The event that a passenger holds a first-class ticket is denoted by A, and S denotes the event that a passenger survives. The conditional probability of survival is $P(S \mid A) = .625$, and the conditional probability of death is $P(S^c \mid A) = 1 - P(S \mid A) = .375$. Table 4 shows the conditional probabilities of survival and death by ticket class (including crew).

Table 4: Conditional probabilities of survival and death for Titanic passengers.

            Class
            First   Second   Third   Crew
Survived     .625     .414    .252   .240
Died         .375     .586    .748   .760

Let's ignore the crew and concentrate on passengers (there are 1316 passengers tabulated in Table 3). If a passenger is randomly selected, then the probability that the passenger is first-class is $P(A) = 325/1316 = .247$; the probability that the passenger is second-class is $P(B) = 285/1316 = .217$; and the probability that the passenger is third-class is $P(C) = 1 - .247 - .217 = .536$. The probability that a randomly selected passenger survives is

$$P(S) = \frac{203 + 118 + 178}{1316} = .379,$$

and the probability that a randomly selected passenger does not survive is $P(D) = 1 - .379 = .621$, where $D = S^c$ denotes the event that the passenger died.
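These passenger-only figures also allow a check of the definition of independence: if survival were independent of ticket class, each conditional probability of survival would equal P(S). The sketch below uses illustrative variable names. It also confirms that weighting each conditional probability by the class probability, as in equation (4), and summing over the three mutually exclusive classes recovers P(S).

```python
import math

# Passenger counts from Table 3 (crew excluded).
survived = {"First": 203, "Second": 118, "Third": 178}
class_total = {"First": 325, "Second": 285, "Third": 706}

n_passengers = sum(class_total.values())  # 1316

# Probability of each ticket class and conditional probability of survival.
p_class = {c: class_total[c] / n_passengers for c in class_total}
p_s_given_class = {c: survived[c] / class_total[c] for c in class_total}

# Unconditional probability of survival and of death.
p_s = sum(survived.values()) / n_passengers
p_d = 1 - p_s

# Sum of P(S | class) * P(class) over the mutually exclusive classes.
p_s_recombined = sum(p_s_given_class[c] * p_class[c] for c in class_total)

# Independence would require P(S | class) = P(S) for every class.
independent = all(
    math.isclose(p_s_given_class[c], p_s, abs_tol=1e-9) for c in class_total
)

print(f"P(S) = {p_s:.3f}, P(D) = {p_d:.3f}")
for c in class_total:
    print(f"P({c}) = {p_class[c]:.3f}, P(S | {c}) = {p_s_given_class[c]:.3f}")
print("Survival independent of ticket class:", independent)
```

Since the 1316 passengers are treated as the whole population, any difference between a conditional probability and P(S) counts as dependence here; no statistical test is involved.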