Strategies for Mapping Heterogeneous Recessive Traits by Allele- Sharing Methods

Size: px

Start display at page:

Download "Strategies for Mapping Heterogeneous Recessive Traits by Allele- Sharing Methods"

Damian Ray
6 years ago
Views:

Am. J. Hum. Genet. 60:965-978, 1997 Strategies for Mapping Heterogeneous Recessive Traits by Allele- Sharing Methods Eleanor Feingold' and David 0.

1 Am. J. Hum. Genet. 60: , 1997 Strategies for Mapping Heterogeneous Recessive Traits by Allele- Sharing Methods Eleanor Feingold' and David 0. Siegmund2 'Department of Biostatistics, Emory University, Atlanta; and 2Department of Statistics, Stanford University, Stanford Summary We investigate strategies for detecting linkage of recessive and partially recessive traits, using sibling pairs and inbred individuals. We assume that a genomewide search is being conducted and that locus heterogeneity of the trait is likely. For sibling pairs, we evaluate the efficiency of different statistics under the assumption that one does not know the true degree of recessiveness of the trait. We recommend a sibling-pair statistic that is a linear compromise between two previously suggested statistics. We also compare the power of sibling pairs to that of more distant relatives, such as cousins. For inbred individuals, we evaluate the power of offspring of different types of matings and compare them to sibling pairs. Over a broad range of trait etiologies, sibling pairs are more powerful than inbred individuals, but for traits caused by very rare alleles, particularly in the case of heterogeneity, inbred individuals can be much more powerful. The models we develop can also be used to examine specific situations other than those we look at. We present this analysis in the idealized context of a dense set of highly polymorphic markers. In general, incorporation of real-world complexities makes inbred individuals, particularly offspring of distant relatives, look slightly less useful than our results imply. Introduction There have been a number of recent successes in using inbred individuals and/or sibling pairs to map recessive traits. For example, Hugot et al. (1996) used 25 sibling pairs to map a locus for Crohn disease, and Gschwend et al. (1996) used 23 pedigrees involving primarily offspring of first cousins to map a locus for Fanconi anemia. Studies of this type increasingly involve wholegenome searches and relatively complex (at least hetero- Received December 20, 1995; accepted for publication January 23, Address for correspondence and reprints: Dr. Eleanor Feingold, Department of Biostatistics, Emory University, 1518 Clifton Road, Room 314, Atlanta, GA feingold@sph.emory.edu X 1997 by The American Society of Human Genetics. All rights reserved /97/ $02.00 geneous) disorders. A number of data-analysis tools for such studies are now in place. Many different methods and software packages exist for analyzing sibling-pair data, and a practical computational method for carrying out analyses with inbred individuals was developed by Kruglyak et al. (1995), but there is not much literature that is of help to an investigator trying to design such a study. Are sibling pairs or inbred individuals more efficient? Are offspring of closer relatives best? What is the best statistic for sibling pairs? The purpose of this paper is to start to answer some of these kinds of studydesign questions. We compare mapping strategies under varying assumptions about phenocopies, locus heterogeneity, and degree of recessiveness. The main issue for sibling pairs is choosing the best test statistic, given that the amount of dominance variance is likely to be known only approximately. For inbred individuals, the issue is what types of matings give the most mapping power. We also compare the utility of inbred individuals, sibling pairs, sibling triples, and cousin pairs for various trait etiologies. We assume that the whole genome, or a large portion thereof, is being searched, probably for multiple trait loci. Our calculations are done under the assump, tion that whole-genome identity-by-descent (IBD) and homozygosity-by-descent (HBD) information are available, as from a dense set of highly polymorphic markers. This idealized assumption allows our questions to be answered for many interesting cases. The strategies thus developed are generally applicable to more realistic mapping situations, with some caveats (see Discussion). We begin our discussions by developing the necessary extensions of previously published models. Most of the details and formulas are presented in appendices so that they can be ignored by readers who wish to do so. However, it should be noted that the models we present can be used to examine many specific situations other than those we have focused on, and the appendices do include enough information to allow others to apply the models. All of the models assume the Haldane mapping function, equal lengths of the male and female maps, and dense, completely informative markers yielding continuous information about regions of IBD and HBD. We also assume throughout that there is no more than one linked locus on a given chromosome. 965

2 966 Models Scanning the Genome Feingold (1993), Feingold et al. (1993), Lander and Schork (1994), Dupuis et al. (1995), and Lander and Kruglyak (1995) described statistical methods for analyzing whole-genome IBD data for affected relative pairs. The essence is to model the number of pairs sharing IBD at any given point on the chromosome as a stochastic process, with the "time" parameter being genetic distance along the chromosome. The methods are applicable to a wide variety of simple and complex modes of inheritance (cf. Dupuis et al. 1995), but, for the important case of sibling pairs, most of this literature has concentrated on additive traits (those with zero dominance variance of the penetrances). Here, we extend the previous methods to traits with nonzero dominance variance and to affected inbred individuals. The basic model for sibling pairs was described by Feingold (1993). The Markov process X, = (X0,t, X1,t, X2,,) gives the number of sibling pairs that share zero, one, and two alleles IBD at point t on a chromosome. The three components sum to the total number of pairs, N. On a chromosome with no trait locus, Xt has for all t a multinomial distribution with parameters (N, 1/4, 1/2, 1/4), although the values of X, fluctuate along the length of the chromosome. If there is exactly one trait locus on the chromosome, the probabilities for the multinomial will be different near the point where the gene lies, with more pairs sharing two alleles IBD and fewer sharing zero. The probabilities of finding zero, one, or two alleles IBD at the trait locus can be parameterized as (1 - a)/4, (1-8)/2, and (1 + a + 26)/4. The values of the parameters a and 6 depend on the specifics of the trait etiology and are discussed in detail below. In the case where N is large, following Feingold et al. (1993), we can instead work with the approximately Gaussian process Zt = (ZO,t, Zl,t, Z2,t) = N-1/2 (Xo,t - N/4, Xlt - N/2, X2,t - N/4). The calculations for this approximating model are always simpler than for the Markov chain but may not be as accurate for small sample sizes. The Markov assumption, which applies to both the Markov chain models and the Gaussian models, is equivalent to assuming the Haldane mapping function. In fact, the mathematical approximations that we use require only that the process be locally Markov, which is true under any standard mapping function. The effect of positive interference is to shorten the genome relative to the results we show, so the false positive rate based on the Haldane mapping function is conservative, but this effect is small. This is discussed further in Feingold et al. (1993) and I.-P. Tu and D. Siegmund (unpublished data). Am. J. Hum. Genet. 60: , 1997 The models for HBD information from inbred individuals are very similar to those used for IBD data from affected pairs. A stochastic process can be constructed to describe each type of consanguineous mating, and P values and power can be calculated by the same methods used for sibling pairs. The process, Yt, records the number of offspring of a particular mating type who are HBD at each marker locus. The properties of the processes Y, for different types of inbred individuals are described in appendix C. If there is no linked gene on the chromosome, Y, has at any point on the chromosome a binomial (N, F) distribution, where F is the coefficient of inbreeding. If there is a single trait locus on the chromosome, the distribution at the locus is binomial [N, F + y (1 - F)]. The parameter y takes values between 0 and 1, indicating the amount of excess HBD. It actually depends on F and is discussed further below. Gaussian models for inbred individuals can be defined similarly to those for sibling pairs. However, since inbred individuals tend to be most useful for rare traits, which often require only small sample sizes to detect linkage, and because F is often small, the Gaussian approximation is not very accurate. It can be vastly improved by transforming the data. This theory is not discussed further in this paper. Sample size calculations for inbred individuals in this paper were performed under the Markov chain models, not the approximating Gaussian models. We test for linkage by looking for extreme values of X, or Zt. One way to do this is to look for extreme values of a stochastic process that is an appropriate onedimensional summary of X, or Zt. That is, we want to find the appropriate weightings of the deviations from expectation in the numbers of siblings sharing 0, 1, and 2 alleles IBD. In our previous work, we examined the (equivalently 2X2,t + X1,t), the mean IBD statistic, which is very powerful for dominant traits (Blackwelder and Elston 1985). We also examined Z2,t + Z1,t (equivalently X2, + X1,t), which counts is and 2s equally and is appropriate if one and two alleles IBD cannot be distinguished (Nelson et al. 1993) but is a very poor statistic for recessive traits. Here we also consider Z2,t (equivalently X2,t), which is useful for recessive traits. It will be helpful to introduce the more general process T,(c) = Z2,t + cz1,t. Our test statistic is the maximum value attained by Tt(c) over the length of the genome, max Tt(c). We shall see below that the value c process Z2,t + 1/2Z1,, = 1/4 provides a useful compromise between c = 0 and C = 1/2 when the degree of dominance is unknown. Formulas for P values and power of these tests are given in appendix A for the Markov chain versions and in appendix B for the Gaussian versions. Other statistics that have been suggested in the literature are also discussed briefly in the Discussion.

Feingold and Siegmund: Strategies to Map Recessive Traits Models for a, 6, and y The parameters a, 6, and y introduced above do not have direct genetic interpretations.

3 Feingold and Siegmund: Strategies to Map Recessive Traits Models for a, 6, and y The parameters a, 6, and y introduced above do not have direct genetic interpretations. They are statistical conveniences that are useful for exploring mapping strategies. In order to use them, however, we need to know something about the values that these parameters take under various genetic models. The alternative hypothesis for sibling pairs is defined by the parameters a and 6. The parameter a must lie between 0 and 1 and essentially contains the information about how important the locus is. The parameter 6 must lie between 0 and a and contains, in a sense, the information about how recessive or dominant the disease allele is. If more than one locus is involved in the trait, each locus will have its own values of the parameters; ai and 6i will be associated with the ith trait susceptibility locus. Suarez et al. (1978) calculated the values of a and 6 for a single-gene, two-allele trait as a = (VA/2 + VD/4)/ (K2 + VA/2 + VD/4), and 6 = (VD/4)/(K2 + VA/2 + VD/4), where K is the population prevalence of the trait and VA and VD are the additive and dominance variances of the penetrances. It can easily be shown that the two-allele assumption is not necessary for these formulas to hold. Risch (1990b) did similar calculations in terms of relative risks to different kinds of relatives and got a = ('s - 1)/Xs, 6 = (Xs - So)/Xs, (1) where Xs is the relative risk to siblings of affecteds and X0 is the relative risk to offspring of affecteds. These latter representations have the advantage that, in principle, they can be estimated directly from pedigree data; their disadvantage is that they do not generalize simply and naturally to pedigrees containing more than two affecteds. These representations by Risch (1990b) and Suarez (1978) are appropriate in the presence of phenocopies, but they are not correct when there are other major genetic loci involved with the trait. Risch (1990a) (cf. also Dupuis et al. 1995) introduced models for complex disease involving more than one susceptibility locus. A particularly interesting special case that serves as an approximate model for locus heterogeneity of a rare trait is the additive model, where the total penetrance is assumed to be the sum of penetrances of susceptibility alleles at unlinked loci in linkage equilibrium. The degree of dominance can vary from one locus to the next. Associated with the ith locus are parameters ai and 6i, expressions for which are given in appendix D. Equation (1) continues to hold with a replaced by Xai and 6 replaced by Mi. A two-allele (per locus) model can be useful for simplifying the heterogeneity formulas in order to examine specific situations. Such a model can be 967 parameterized by assigning the three possible genotypes at the ith locus to have probabilities q2, 2pq, and p2. where p is the frequency of the susceptibility allele. The penetrances can be written as fo, fo + f + d, and fo + 2f. Equations for a and 6 under this model are given in appendix D. The parameter a equals one only in the limiting case of a single-gene trait as the allele frequency approaches 0. In the presence of phenocopies, a is smaller, approaching zero as the proportion of cases attributable to phenocopies approaches one. Under heterogeneity, the preceding remark applies to lai; each individual ai will be smaller. The sample sizes necessary to achieve a desired power are strongly dependent on a, regardless of what test is used. Two examples that are illustrative here are chronic obstructive pulmonary disease (COPD) and late-onset Alzheimer disease (AD). In the case of COPD, the PiZ allele is one of many possible causes (Khoury et al. 1993). Khoury et al. give values of.05 for the trait frequency,.02 for the allele frequency, and 20 for the risk ratio to homozygotes, yielding ai =.035. The allele is relatively rare, but the level of phenocopies/ heterogeneity is very high, so the value of ai is very low. This value of ai would require a sample size on the order of thousands, or even tens of thousands, to have a high probability of detecting the locus. If a trait is carefully defined to eliminate heterogeneity and phenocopies as much as possible, ai increases and the sample size is smaller. For AD and the APOE-e4 allele, Tsai et al. (1994) give odds ratios of 3.6 for heterozygotes and 18.3 for homozygotes, and an allele frequency of.13. Given a population prevalence on the order of 10%, it seems possible that a value of ai ;.5 could be achieved, requiring a sample size of -100 or 200 for a hypothetical attempt to map this locus. (This calculation was done under an approximating two-allele model.) The parameter 6 takes the value 0 in the case where VD = 0, and Xs = Xo. This additive case corresponds approximately to a relatively rare single-locus dominant trait. It takes its maximum value of a in the limiting case where VA<< VD and Xo <<Xs, which is a good approximation to many recessive traits. The choice of the best statistic for finding a disease locus using sibling pairs depends on "how recessive" the locus is, as expressed in terms of the ratio 6/a (see Results). Under locus heterogeneity, the Sis of the individual loci will be smaller, but so will the ais so that the ratio 8i/ai has about the same range of values as for a single-locus trait. The two-allele model for locus i gives further insight into exactly what characteristics give the large values of 6i/ai that make a locus "recessive" for the purposes of choosing the best statistic. It can be seen, from the equations in appendix D, that the phenocopy rate, fo, does not play a role in determining the size of this ratio, while

968 d and f affect the ratio only through the ratio dif. If the locus is completely recessive in the usual sense (heterozygote penetrance equals nonmutant homozygote penetrance), then -dif = 1.

4 968 d and f affect the ratio only through the ratio dif. If the locus is completely recessive in the usual sense (heterozygote penetrance equals nonmutant homozygote penetrance), then -dif = 1. In this case, 6i/a, is > 1/2 as long as P <.2. Thus, a fully recessive locus is best found by a "recessive statistic" designed for large 6/a, as long as the susceptibility allele is moderately rare. A common allele that slightly increases the likelihood of disease, even if that increase acts recessively, will not give a large value of /ac. If the locus is not fully recessive, i.e., if -dif < 1, the picture changes markedly. For dif = -.9, the maximum value of 8/a is.52, and it can be much lower. For dif = -.8, the maximum value of 6/a is only.31. In the case of COPD described above, the PiZ allele acts, apparently, completely recessively, giving it the quite high value of 6i/ai =.92. For AD, depending on one's assumption about the overall trait frequency, one gets -dif of at most.7, and thus 6i/a, of at most.19. An example of an intermediate value of 6i/ai arises in Kruglyak and Lander's (1995) analysis of the type 1 diabetes data of Davies et al. (1994). They calculate =i/ai =.59 at the HLA locus. = The alternative hypothesis for inbred individuals is defined by the parameter y, which varies between 0 and 1, indicating the amount of excess HBD. At an unlinked locus, an inbred individual is HBD with probability F, the coefficient of inbreeding. At a trait locus, the probability of HBD is F + (1 - F)y. The parameter y depends on the inbreeding type. Appendix D gives formulas for y under the general model and under a two-allele model. For a given trait, y is highest for offspring of close relatives, but that does not necessarily imply greater mapping power for closer relatives. However, for a given inbreeding type, traits with higher values of y will require smaller sample sizes than those with lower values of y. In general, y is high when the trait allele is rare and the level of phenocopies is low. For example, for COPD, y for first-cousin offspring is only.02, because the level of phenocopies is so high. A trait with the same allele frequency in which only half the cases are phenocopies would give y =.61, and with no phenocopies would give y =.75. Under heterogeneity, values of y, decrease as the number of loci increases, with A} never exceeding 1. If there are very few phenocopies and n equally contributing recessive loci each with a very rare trait allele, yj 1/n. In order to compare the mapping power achievable using a sample of inbred individuals to that achiev- - able with a sample of sibling pairs, it is necessary to understand the relationships between the parameters y, a, and 6 under various genetic models. This is difficult in the general case, but appendix D gives formulas for the two-allele case with one or more loci. Table 1 gives additional examples of values of y, a, and 6 under a twoallele model for various values of the genetic parameters. Am. J. Hum. Genet. 60: , 1997 The theory developed here for inbred individuals refers to an otherwise outbred population. For an inbred individual in an inbred population, the probability of HBD at an arbitrary locus would be, in Wright's notation (cf. Crow and Kimura 1970, p. 106), FIT = Fs + (1 - FIS)FsT. Here F1s is the coefficient of inbreeding due to the immediate family relation, which we have been calling F (h1/6 for a child of first cousins), while FST is the coefficient of background inbreeding of the population. To consider an admittedly extreme example, for the Hutterite population the value of FST has been estimated to be.04 (cf. Crow and Kimura 1970, p. 106). In such a population, the probability of HBD - in a child of first cousins is.1, or 50 % larger than the.0625 of an outbred population. To have a valid test of the null hypothesis of no linkage, one should in principle use this corrected HBD probability. Bodmer and Cavalli-Sforza (1976, p. 371) give estimated inbreeding coefficients for a number of subgroups in different parts of the world. Their estimates are often.001, which are small enough to be ignored. In a few cases of isolated populations they are -.01, which probably can be ignored when dealing with offspring of first cousins but might cause problems if one's data involved largely second cousins, where F = 1/ , so FIT ;.026. Results Comparison of Sibling Pair Statistics We have defined the general test statistic T,(c) = Z2,1 + cz,,,. We wish to determine which value(s) of c are best for mapping different kinds of traits. One interesting value is c = (1 -K)/2(1 + K), which gives the score statistic when 6/a = K. Usually this ratio will be unknown, so this statistic cannot be used in practice, but it provides an ideal against which to evaluate the performance of the other statistics. According to the Gaussian approximation, the relative sample sizes for different values of c, and thus the choice of the best statistic, depends on only K = 6/a, not on a, although the absolute sample sizes do of course depend on a. For a falsepositive error rate of.05, power of.90, and a genome length of 3,000 cm, sample sizes for the optimal c = (1 - K)/2(1 + K) range from 8.5/a2 for 6/a = 1 to 50.1/a2 for 6/a = 0. Figure 1 shows the performance of the statistic T,(c) for c = 1/2, 1/4, and 0, expressed in terms of the approximate percentage increase needed in the sample size compared to the optimal value. As expected, the "recessive" statistic T,(0) performs very well when K is close to 1, and the "dominant" statistic Tt( 1/2) performs very well when K is close to 0. The compromise statistic T,( 1/4) requires a sample size no more than 12% larger than the optimum over the entire range of K to achieve an arbitrary amount of power. This

5 Feingold and Siegmund: Strategies to Map Recessive Traits 969 Table 1 Example Sizes and Parameter Values for Sibling Pairs and Inbred Individuals, under Various Two-Allele Monogenic and Polygenic Models No. Sibling First Second Line of Loci fo + 2f' -d/f" Kc fo/kd ae f 5 1 Y2g Pairs Cousins Cousins s S S NOTES.-The column labeled "sibling pairs" gives the number of pairs necessary to obtain 90% power. The columns labeled "first cousins" and "second cousins" give the corresponding numbers of offspring of the indicated matings. The sample sizes for sibling pairs are computed using the Markov chain version of the statistic T,(O). They do not depend on the two-allele model or the number of loci, only on the values of a and B. The corresponding sample sizes for inbred individuals are specific to the two-allele model and the number of loci. The calculations behind the multilocus sample sizes assume that all loci have equal allele frequencies and penetrances. When there is more than one locus, power refers to the probability to detect a particular locus. a f0 + 2f is the penetrance of the homozygote. Lines 9 and 10 show the effect of reduced penetrance. b -dif is the degree of recessiveness under the two-allele model. -dif = 1 indicates a fully recessive gene, i.e., no heterozygote liability above the phenocopy level. Lines 11 and 12 show the effect of such heterozygote liability. c K is the trait frequency. d fo/k is the fraction of cases due to phenocopies. 'a is the sib-pair parameter indicating (roughly) the importance of the locus. It varies from 0 to 1. f 8 is the sib-pair parameter indicating (roughly) the degree of recessiveness of the disease allele. It varies from 0 to a. g yi and y2 are the parameters for offspring of first and second cousins, respectively. They vary from 0 to 1. would seem to be a reasonable choice when the mode of inheritance is unknown. The simpler T,(O) performs very well when the trait is at least moderately recessive (IC > /2). Remark 1.-The simple comparisons of figure 1 are based on the Gaussian approximations, which to some extent make T,(O) look better than it actually is when the sample size is small to moderate. The reasons are that on unlinked chromosomes the distribution of T,(O) is skewed, so the correct threshold is slightly larger than the Gaussian approximation suggests, and at a trait locus the true variance of T,(O) is larger than the variance used in the Gaussian approximation, so the power is actually less than the approximation suggests. This approximation is, nevertheless, useful, because it gives us a simple qualitative picture based on the single parameter K instead of a more accurate but more difficult to interpret description involving N, a, and 6. Remark 2.-When the penetrance and trait incidence are both large, affected-unaffected sibling pairs can be quite powerful for mapping (Rogus and Krolewski 1996). The parameterization for discordant pairs is exactly the same as for affected pairs, although the values of a and 6 are now negative and satisfy the constraint a < 6 0. Although the values of a and 6 are different - for discordant pairs than for affected pairs, the value of a/6 is the same. Thus, for a given trait, whatever test statistic of the form T,(c) is appropriate for a sample of

$970 en \ 0-10S/1c=1/2.0 30 (D) E >E 0 N * F. of)th/ttsi tc 2t+clti... hw o / "oiat Figure 1 IC Comparison of sibling-pair test statistics.$

6 970 en \ 0-10S/1c=1/ (D) E >E 0 N * F. of)th/ttsi tc 2t+clti... hw o / "oiat Figure 1 IC Comparison of sibling-pair test statistics. Performance of the statistic T1(c) Z2,1 = + cz,,, is shown for c "2 ("dominant" statistic), c = 1/4 ("compromise" statistic), and c = 0 ("recessive" statistic). Performance is expressed as the approximate percentage increase in sample size needed over what is required for the statistic that is optimal if the degree of recessiveness (the parameter ic) is known. affected pairs, the negative of that statistic is appropriate for a sample of discordant pairs. Comparing Sibling Pairs to More Distant Relatives Several authors (e.g., Risch 1990b; Elston 1992; Feingold et al. 1993) have noted that for additive traits, defined by the condition Xs = X0, pairs of more distant relatives are more efficient than sibling pairs when Xs = X0 is large. For completely recessive traits, where Xs >> Xo, clearly unilineal relative pairs are not useful, but it is interesting to see how they compare to sibling pairs when a trait is partially recessive. Figure 2 shows the relative efficiency of sibling pairs to first-cousin pairs. For additive traits (Xs = X0), sibling pairs are more powerful than cousin pairs only when Xs is small (approximately Xs < 2), a result that agrees with the authors above. However, when Xo = (1 + Xs)/2, which by (1) would be equivalent to K = 1/2 for a monogenic trait, sibling pairs are more efficient than cousin pairs up to about Xs = 13 and are much more efficient for small Xs. The relative efficiencies given in the figure also apply for a heterogeneous trait, provided 8i = a,/2 at the locus under consideration. (This means the degree of recessiveness observed at the pedigree level for the trait as a whole is the same as at the particular locus. The situation would be quite different if the overall intermediate mode of inheritance resulted from some loci behaving dominantly and others behaving recessively.) Am. J. Hum. Genet. 60: , 1997 Comparing Sibling Pairs to Inbred Individuals Table 1 gives examples of parameter values (a, 8, and y) and of sample sizes for sibling pairs and for offspring of first and second cousins. The parameter values are computed for the two-allele models that are shown in the table. The calculations for multilocus traits assume that all loci are equally contributing, unlinked, and in linkage equilibrium. Power to detect any particular locus is 90%. The sample sizes for sibling pairs do not depend on the two-allele model or the number of loci, only on the values of a and 8. The corresponding sample sizes for inbred individuals are specific to the two-allele model and the number of loci. This means that the relation between (a, 8) and (Ti, a2), hence the relative value of sibling pairs and inbred individuals, depends on the number of loci. All sample sizes are computed using the Markov chain models to give the best possible accuracy. The sample sizes for sibling pairs are computed using the Markov chain equivalent of the statistic T,(O) (see appendix A). For a fully recessive (-dif = 1) single locus with homozygote penetrance fo + 2f = 1, parameter values vary with the phenocopy level and the trait frequency (lines 1-8 of table 1). For a given trait frequency, the presence of phenocopies decreases a slightly, increases 8 slightly, and reduces Ty and T2 moderately (see, for example, lines 4 and 5 or lines 6 and 7), indicating that phenocopies reduce the power of inbred individuals more than that of sibling pairs. Rarer traits have higher values of all parameters and thus smaller sample sizes. If the homozygote penetrance is only.5, parameter values are somewhat lowered, but not dramatically so (compare lines 1 and 5 to lines 9 and 10). However, if the risk to heterozygotes is elevated even a small amount above the phenocopy rate (e.g., -dif =.9), values of 8, Ti, and T2 are quite low, indicating that "recessive-trait" mapping strategies will be less useful (compare lines 1 and 5 to lines 11 and 12). It is important to note that, in the case of elevated heterozygote risk, the value of 8 varies nonmonotonically with the allele frequency (see appendix D) and thus can be somewhat unpredictable (e.g., line 11). Except for a near 1 (few phenocopies and a very rare allele), sibling pairs are substantially more powerful than inbred individuals, and offspring of closer relatives are more powerful than offspring of distant relatives. Lander and Botstein (1987) obtained similar results, although they focused on marker heterozygosity and distance between markers rather than on phenocopies and heterogeneity as factors governing the difficulty of detecting linkage. The pattern is similar for heterogeneous traits, but the import is somewhat different. As long as there is no heterozygote liability and few phenocopies and the trait is moderately rare (e.g., lines 13, 14, 16, and 19-21),

Feingold and Siegmund: Strategies to Map Recessive Traits inbred individuals are more efficient than sibling pairs. The advantage increases as the number of loci increases.

7 Feingold and Siegmund: Strategies to Map Recessive Traits inbred individuals are more efficient than sibling pairs. The advantage increases as the number of loci increases. In addition, if the trait is quite rare, 6i is about equal to ai, and offspring of more distant relatives can be more powerful (lines and 19-20). Again, their advantage increases with the number of loci. If there is a high level of phenocopies (lines 15 and 18) or even a small amount of heterozygote liability, sibling pairs regain much of their relative efficiency, followed by offspring of closer relatives. In interpreting the sample sizes for heterogeneous traits in table 1, one must keep in mind that the sample size is that required to detect linkage to a given traitsusceptibility locus and that all loci are assumed to contribute equally to the trait. The results would be qualitatively similar, although quantitatively different if a small number of loci account for a comparatively large part of the incidence of the trait or if our requirement were different, e.g., to detect at least one of the trait loci or to detect all trait loci. The assumption that all trait loci contribute equally is a convenient conceptual device to simplify an intrinsically complex situation, but it is unlikely to be even approximately satisfied when several loci are involved. To explore the full range of possibilities there does not seem to be any substitute for performing a large number of computations under a variety of assumptions. For example, Gschwend et al. (1996) >, c) 0._ a) a) Figure 2 Relative statistical efficiency of sibling pairs to cousin pairs for additive and partially recessive traits. Values for the additive case, Xs X0, = are computed using the sibling statistic T,(1/2). Values for the partially recessive case, X0 (1 = + Xs)/2, are computed using Tt( '/4). scan the genome to detect linkage for Fanconi anemia, which through complementation analysis is known to be heterogeneous and to involve at least five loci. Because of the rarity of the disease, inbreds can be expected to be more powerful than sibling pairs and offspring of second cousins more powerful than offspring of first cousins. Gschwend et al.'s sample of inbred affecteds, which was selected not to contain families segregating at a previously linked trait locus on chromosome 9 (Strathdee et al. 1992), consisted of 23 pedigrees that involved primarily offspring of first cousin matings. Assuming for the sake of discussion that, after the chromosome 9 locus is excluded, there are exactly four equally contributing loci and that the sample consists of exactly 23 offspring of first-cousin matings, we find that the power to detect linkage to any given locus would be.45, while the power to detect at least one of the four loci would be -1 - (1 -.45)4, or -.9. In fact, according to Gschwend et al. (1996), one locus appears responsible for a substantial majority of the total incidence, so the power to detect that locus is large, while the power to detect the other loci is very small. Sibling Triples We have extended our models (not shown) to briefly consider the value of sibling triples. We have considered only the case of a completely recessive trait, using a statistic which counts the number of triples where all three siblings are IBD on both chromosomes. In this case, sibling triples are more valuable than pairs, but not as much more valuable as in the case of a dominant trait of comparable incidence (Teng and Siegmund 1997, in this issue). In the examples we have examined, sibling triples are always more valuable than offspring of first cousins, even in the cases in which offspring of cousins are better than sibling pairs. The number of sibling triples necessary to achieve 90% power for lines 6, 7, 15, 16, and 17 of table 1 are respectively 10, 16, 22, 33, and 41. Discussion 971 Recommendations on Study Design Our overall conclusion about study design is that sibling pairs are likely to be more useful than inbred individuals for a fairly wide range of recessive and partially recessive traits, especially if there is only one locus. In the presence of heterogeneity, inbred individuals can be much more useful than sibling pairs, as long as phenocopies are rare and the disease allele is purely recessive. This is an important case, which includes such traits of current interest as Fanconi anemia, xeroderma pigmentosum, and some loci implicated in retinitis pigmentosa. Efficiency is particularly important in the case

8 972 of a heterogeneous trait, because of the large sample sizes required and hence the possibility of large savings. In contrast, efficiency is not as critical for a monogenic trait with few phenocopies, because small sample sizes will provide sufficient power. We also find that the previously known advantage of more distant relative pairs over sibling pairs for mapping additive traits does not hold for partially recessive traits unless the effect of the locus is quite large. For inbred individuals, we find that offspring of closer relatives provide greater mapping power in most cases but that, if the level of phenocopies is low, the allele is rare, and there is no heterozygote liability, offspring of more distant relatives provide greater power, with the advantage increasing as the amount of heterogeneity increases. If there is a sparsity of markers or if unaffected family members are unavailable for typing, sibling pairs are likely to be more useful, and offspring of distant relatives less useful, than the above results indicate (see Marker Spacing and Incomplete Polymorphism, below). Sibling-Pair Statistics To compare the usefulness of different sibling-pair statistics, we measure the degree of recessiveness in terms of the parameter Kc = 8/a, which varies between 0 and 1, and is large when a double dose of a relatively rare allele causes a large increase in liability. We find the statistic T,( /4) to be quite efficient over almost all values of K, and the statistic T,(0) to be very good as long as K > 1/2. In the presence of elevated heterozygote liability (partial recessiveness), values of K can be quite low, and T,( '/2) is likely to be a better choice. Several other statistics have been suggested for the situation where the mode of inheritance is unknown. Perhaps the most appealing is the likelihood ratio or "possible triangle" test (Holmans 1993). Faraway (1993) also proposed essentially the same test. Within the Gaussian framework, this likelihood-ratio test is equivalent to estimating the optimal value of c in the statistic T,(c) from the marker data at the putative trait locus and using that estimated value to define an adaptive statistic. A detailed analysis is given in J. Dupuis and D. Siegmund (unpublished data). Although we have not included this test in the simple comparative analysis of figure 1, numerical calculations indicate that over a range of conditions it requires a sample size ranging from -4-% to -10% above the minimum sample size. Overall, it behaves much like the test based on Tt( 1/4). While Tt( 1/4) would be optimal in the case 8 = cr13, or equivalently VD = VA, the likelihood-ratio test does not favor any particular value of K. As a result, the test based on TP(/4) is slightly more efficient than the likelihoodratio test when the mode of inheritance is indeed intermediate between additive and recessive and less efficient Am. J. Hum. Genet. 60: , 1997 when the mode of inheritance is closer to one of the two extremes. The minimum power for both tests occurs at the extreme models 8 = 0 and 8 = a. A rather different test, at least in spirit, is Schaid and Nick's (1990) suggestion to use max(tj0], T,['/2]). This test will perform very well if one of the two extreme models is correct but less well for an intermediate case. Some rough calculations based on a Bonferroni correction indicate that it performs about as well as the likelihood ratio test or the test using T,(1/4); it does slightly better at the extremes and slightly less well in between. Since none of these tests seems uniformly preferable to the others, our preference for T,( 1/4) is to some extent arbitrary. A secondary consideration favoring statistics of the form T,(c) is that they are more easily combined with data from pedigrees containing other affected relative pairs, affected-unaffected pairs, or affected children of inbreds, when such data are available. Our results may at first glance appear inconsistent with the results of Knapp et al. (1994a, 1994b), who show that the statistic T,( 1/2) is in a certain sense optimal for detecting linkage of a recessive trait. However, their result is based on the assumptions of a monogenic recessive trait without phenocopies and a test of a single marker. Under these conditions, T,( 1/2) can be quite efficient. If the trait is rare, only a small sample is needed and efficiency is not an important issue. If the trait is not rare, then K is not close to one, so in that case as well T,(1/2) may be reasonable. For example, for a completely recessive trait without phenocopies and with an allele frequency p =.25, a =.84, and 8 =.36; the required sample sizes to achieve 90% power are 32 for TP('/2) and 37 for T,(0). However, for a rare heterogeneous trait T,(0) can be substantially more powerful than T,(1/2). Suppose, for example, that four loci act recessively and make equal contributions, so a 8 ;.25 at all loci. To have 90% power to detect a given locus, one would need -204 observations when using T,(1/2) and -148 when using T,(0) (line 19 of table 1). Similar results would be obtained for a rare recessive trait with a high level of phenocopies. Our analyses also put in context the observation of Knapp (1996), who notes that there may be evidence to favor linkage even when one observes "fewer alleles IBD than expected" (i.e., T,['/2] less than expected). He gives the hypothetical example of 2/3 of the pairs sharing zero alleles IBD and the remaining 1/3 sharing two alleles IBD. For this data set, Tt( 1/2) is less than would be expected under the null hypothesis, but T,(0) is quite large. Marker Spacing and Incomplete Polymorphism The models we have used in this paper assume that all possible information is extracted from the study population, i.e., that complete genomewide information on

9 Feingold and Siegmund: Strategies to Map Recessive Traits IBD and HBD is available. In practice, of course, this is not the case. Typical studies involve typing markers every 5, 10, or 20 cm to determine identity-by-state (IBS), which is then converted to IBD probabilities by use of estimated allele frequencies, although sometimes only unequivocal IBD information is used. In general, regions that generate high LOD scores are then saturated with more markers. This sometimes allows haplotypes to be examined and IBD to be established conclusively. When siblings are used without parents, of course, only IBS information is available. All of these situations amount to losses of varying amounts of information, adding noise to the processes we have modeled. Multipoint mapping methods can regain quite a bit of that information by looking at adjoining markers to estimate IBD probabilities more accurately (Kruglyak and Lander 1995; Kruglyak et al. 1996). The fact that real studies extract something less than the full IBD information does have implications for the interpretation of our results. The implications are quite difficult to quantify precisely, which is the reason for looking at the idealized situation as we did. Future work by J. Teng and D. Siegmund will address some of these issues quantitatively. However, several general qualitative statements are possible now. Kruglyak and Lander (1995) suggest a measure of the amount of information extracted from a sibling-pair analysis and use simulations to estimate the amount of information extracted for varying values of marker spacing and polymorphism. Their results suggest that studies using moderately closely spaced markers and multipoint methods are probably extracting very close to the full information, and thus our results as stated apply. The definition of "moderately closely spaced markers" in the above statement depends, however, on the amount of IBD that is expected. For offspring of distant relatives, there will be a greater loss of information from distant marker spacing than for offspring of closer relatives and sibling pairs. There can be a considerable loss of information for siblings without parents, even when multipoint methods are used, but the use of inbred individuals without additional family members would involve even more information loss. Thus, in any situation where less than the full information is being extracted, inbred individuals and distant relatives will be relatively less useful than our results imply. 973 Other Study Design Considerations Several factors we did not consider can complicate the comparison of the relative efficiency of sibling pairs and inbred individuals. In many, cases one could use or extend our models to examine these factors. For example, because the task of finding affected inbred individuals for linkage analysis is often facilitated by concentrating on isolated or inbreeding populations, it may be the case that, even if both sibling pairs and inbred individuals are available, the inbred individuals will come from a distinct subpopulation that is not directly comparable to the outbred population. In this case, the disease etiology parameters for the two different populations must be considered separately. Another possible complicating factor is familial aggregation of phenocopies, which would essentially mean that the population of sibling pairs has a higher phenocopy rate than does the population of inbred individuals. Qualitatively, this would make sibling pairs relatively less valuable than our results imply. The quantitative effect could be examined more precisely by extending our models. Another issue is that the relative utility of outbred siblings and inbred individuals for gene mapping depends to some extent on their frequencies within the population. For a monogenic recessive disease with a negligible level of phenocopies, caused by homozygosity of a rare allele of frequency p, the probability that both outbred parents of a pair of siblings are carriers is - (2p)2. The conditional probability that both siblings are affected is 146, so the probability that a randomly selected pair of outbred siblings are homozygous for the disease allele is _p2/4. Similarly, the probability that one or the other of the common grandparents of a pair of cousins is a carrier of the allele is G-p. The conditional probability that the cousins are both carriers is -'146, and the conditional probability they would both pass this allele to their child is 1/4. Hence, if ntc denotes the frequency of first-cousin matings, the probability that a child is the homozygous product of a first-cousin mating is approximately equal to pncr/16. Roughly speaking then, affected children of first-cousin matings will be sufficiently common to be useful if the frequency of first-cousin matings is large compared to the allele frequency p. Of course, the choice of whether to use inbred individuals or sibling pairs (or some other type of data) for a study is often made simply on the basis of the availability of a study population. In this case, our results will not be used to make the choice, but they can still give useful guidance in study design. For example, if inbred individuals are used, it is important to define the trait and the population carefully, in order to eliminate phenocopies as much as possible. We have not discussed the issue of combining sibling pairs and inbred individuals into a single analysis. In principle, this is quite possible. A naive statistic is to simply add up the number of pairs and individuals showing IBD and HBD at each locus. Choosing an optimal statistic is mathematically similar to the problem of choosing the best sibling-pair statistic; it is a matter of finding weightings that are robust over a wide range of

10 974 trait etiologies. One obvious caution is that, if the sibling pairs and inbred individuals are not really from the same population, it is probably not advisable to combine them, particularly in the situation of likely heterogeneity that we have presumed. A final issue is that the genome search whose properties are described in table 1 is fundamentally a search for a single trait locus, although it will, of course, detect multiple loci if the observed statistic exceeds the threshold b at several places along the genome. It is also possible to design procedures specifically to search for multiple loci simultaneously (Lander and Botstein 1986; Dupuis et al. 1995) or sequentially (Dupuis et al. 1995; Gschwend et al. 1996). These require that one be prepared to make assumptions about the number of major loci and their mode of interaction. The resulting procedures can be considerably more powerful than a singlelocus search when those assumptions are correct. Acknowledgments We wish to thank David Botstein and Michele Gschwend, for discussions of homozygosity mapping and in particular of Fanconi anemia, and Leonid Krugylak and Stephanie Sherman, for their comments on preliminary versions of the manuscript. This work is supported in part by NIH grant R01-HG00848-OlA1. Appendix A Markov Chain Models for Affected Relative Pairs The Markov process X, = (X0,t, X1,t, X2,t) gives the number of sibling pairs that share zero, one, and two alleles IBD at point t on a chromosome. One can test for linkage by looking for extreme values of a onedimensional summary of X,. In our previous work, we examined the process X2,t + X1,, which is appropriate if one and two alleles IBD cannot be distinguished, but is a very poor statistic for recessive traits. Others of interest include 2X2,t + X1,t, which is the score statistic for an additive model (6 = 0), and X2,t, which is the score statistic when 5 = a. One can also use an arbitrary weighting, analogous to the Gaussian process Tt(c). The calculations for the arbitrary weighting of the Markov chain are complex and are not discussed here, but they can be derived by extensions of the methods described by Feingold (1993). For any of these processes, the test statistic is the maximum value attained by the process over the length of the genome. The P value associated with an observed peak of height b is the probability that the maximum exceeds b under the null hypothesis that there is no Am. J. Hum. Genet. 60: , 1997 linked gene on the genome. Feingold (1993) discussed appropriate theory for finding such P values. For a large class of affected relative pair statistics, an approximation to the P value is 1- expl l('')p1 (1 p)n (b - Npo)1, (Al) where the binomial probability (N)pb(l _ po)n-b is the stationary probability that the process is in state b, I is the total length of the portion of the genome being examined, and P is a constant associated with the process (see appendix B). The case of X2,, + X1i,, for which (Al) applies with the values po = 3/4 and P = 16X/3 was fully described by Feingold (1993). The case of X2,, is very similar: the appropriate parameters for equation (Al) are po = 1/4 and P = 16V3. The case of 2X2,, + X1i, can also be handled similarly by noting that a single sibling pair is equivalent to the sum of two independent half-sibling processes (under the null hypothesis of no linked gene on the chromosome), since the maternal and paternal meioses are independent. Thus, N sibling pairs can be treated exactly as 2N half-sibling pairs, a case that was covered by Feingold (1993). The approximate P value is a special case of (Al) with po = 1/2, 5 = 4k, and with N replaced by 2N. The power of any of these tests can be written as P{Yr, b} + PIYr < b, Yt b for some t * r), where r - is the trait locus and Y is the one-dimensional process of interest. This expression omits the probability of finding a significant peak on chromosomes other than the one containing the trait locus r and should thus be interpreted as the power to find a peak on the true chromosome. The first term of this expression is generally easy to evaluate. The statistics X2,t + X1,t and X2,t have binomial distributions. For 2X2,, + X1,t, a normal approximation can be used. The first term by itself often provides a reasonable approximation to the power of the test; the more complicated second term usually adds -5%-10% to the total. The second term of the power expression can be approximated by b-l X P{Yr = i)(2vi - Vi) E i=o where Vi is approximately equal to ( Q(b - 15 b) Ab-i' Q(b- 1,b -2J (A2)

11 Feingold and Siegmund: Strategies to Map Recessive Traits which comes from assuming that Y acts like a simple random walk near b. The quantities Q(b - 1, b) and Q(b - 1, b - 2) are the instantaneous average transition rates away from state b - 1, as described by Feingold (1993), with the exception that they must be calculated under the alternative hypothesis. For the statistic X2,, Q(b - 1, b - 2) = 4X(b - 1) and Q(b - 1, b) = 4X(N - b + 1)[(1-6)/(3 - a - 26)]. For X2,, + X1,, Q(b - 1, b - 2) = 4X(b - 1)[(1-6)/(3 + a)] and Q(b - 1, b) = 4X(N - b + 1). For 2X2,1 + X1,,, Q(b - 1, b - 2) = 2X(b - 1) and Q(b - 1, b) = 2X(2N - b + 1), the same as under the null hypothesis. If the sample size is reasonably large, the computation of (A2) can be simplified with a continuous approximation, which is especially useful when P{Yr = i } is not a binomial probability. The continuous approximation is 2 2bfl/2 f 1 1 Ek( - )vd i1)vxdx _00-2 exp[p(- b) + 1/2 Pt2d] - exp[2p(,u - b) + 2p2t2] X - /2-22p] where g and T are the mean and standard deviation of Yr and p nq(b - 1, b - 2)0 P=ln( Q(b-lb) 2)). Appendix B Properties of the Gaussian Process Tt(c) If there is a single trait locus, r, on the chromosome the expected value of T,(c) can be written as 4R,(t - r), where 4 = N142[a + 26(1 - c)]/4 and RI(x) = a + 6 Ie-4XjxI - (2c - 1) e-8xx a +26(1- c) a +28(1 -c) The function R,(x) can be approximated as 1 -,1 x I, when x is close to zero, where Pi = 4X[a + 6(3-4c)]1 [a + 26(1 - c)]. The covariance of T,(c) and T,(c), where s and t are any two points on the same side of r, can be written as i2r(t - s), where G2 = L2 + (2c - 1)21/16 and - 1)2-8XIxI R(x)= 22 + e-4xjxi (2c 2+2c-1)22 +(2cl)2e The function R(x) can be approximated as 1 - Pjx, where,b = (2c - 1)21/12 + (2c - 1)21. (This constant IP is the same as that used for the corresponding Markov chains.) An approximation to the P value (Feingold et al. 1993; Lander and Schork 1994) for a test based on T,(c) is given by P(max Tt(c)/a - a) 1 - exp[-n(l - F(a)) - Ila4(a)], where and are the standard normal density and distribution, respectively, 1 is the total genetic length of the region of the genome being searched, and n is the number of chromosomes. For a genome of 23 chromosomes and a total genetic length of 30 Morgans, the value a ; 4.1 gives a P value of -.O. An approximation to the power of the test (Feingold et al. 1993) is given by 1 (-a o) + 0(a The parameter 4, which is a function of N, a, and 6, is the only part of the power formula that depends on the sample size, so we do sample size calculations by numerically finding the value of 4 that gives the desired level of power, setting that equal to the algebraic expression for 4, and solving for N. Appendix C 975 Markov Chain Models for HBD in Inbred Individuals Our test statistic is the maximum over the genome of the process, Yt, that records the number of offspring of a particular mating type who are HBD at each marker locus. This process is not a Markov chain, but it can be described as a function of an underlying,

12 976 Figure C1 child mating I IE_ X \\\\\R I chromosome 1 chromosome 2 Schematic of pattern of HBD in offspring of a parent/ unseen Markov chain in the following way. The underlying Markov chains are different for each type of consanguineous mating (e.g., parent/child, first cousin, second cousin). For the ith individual, let yui) equal 1 when there is HBD and 0 when there is not. Then Yt = E Ylti). Figure C1 shows a sample pattern of HBD on one pair of chromosomes in the offspring of a parent/child mating. It can be seen from the diagram that the offspring is HBD at a given locus only if (1) the offspring and its maternal grandmother do not share IBD at that locus and (2) the offspring and its mother do share IBD on their respective paternally derived chromosomes. This information allows us to construct an underlying Markov chain of which y(i) is a function. Let X(') = (X(ia), X(ib)), where X(ia) is the IBD process between the paternally derived chromosomes of the mother and offspring that equals one if those two chromosomes are IBD and zero if they are not, and X(lb) is the IBD process between the inbred offspring and its maternal grandmother. The processes X(ia) and X(ib) are independent Markov chains and are described in more detail by Feingold (1993) as half-sibling and grandparent/grandchild processes, Am. J. Hum. Genet. 60: , 1997 respectively. The process we actually observe, y(i) equals one only when X(') = (1, 0). This representation allows us to calculate an approximate P value by the methods of Feingold (1993). The approximate P value is given by equation (Al) with po = 1/4 and D = 4X. For a half-sibling mating, a similar analysis shows that (Al) applies with po = '/8 and,3 = 32V7. The other cases we have examined do not require the description of new underlying chains, because they are identical to affected-pair processes described by Feingold (1993). For example, the HBD process for a child of siblings is exactly the same as the IBD process for first cousins, because the child of siblings can be regarded as its own first cousin. The P values can be approximated by (Al) with parameters as given in table Cl. Power and sample size calculations can be done in the same manner as for sibling pairs, by use of the fact that the distribution of Yr is binomial [N, F + y(l - F)]. Appendix D Formulas for a, 8, and y Under the additive model for locus heterogeneity described by Risch (1990a), the parameters ct,, 5i associated with the ith trait susceptibility locus can be expressed in terms of locus-specific additive and dominance variances of the penetrances: A.i = VAI/2 + VDi/4 = K2 + E [VA//2 + VD,/4] VD,/4 K2 + E [VAj/2 + VD,/4] Under a two-allele per locus model, the prevalence, K, of the trait is fo + 2pqd + 2pf, the dominance variance is VD, = 4p2q2d2 (Kempthorne 1957) and the additive variance is VA, = 2pq[f + d(q p)]2. _ Then Table C1 P-Value Parameters for Different Mating Types Inbreeding Type Affected-Pair Type po (=F) 0 Sibling Cousin 1/4 16X/3 Avuncular Cousin once removed l/8 40X/7 Cousin Second cousin '/I6 32X/5 Cousin once removed Second cousin once removed 1/32 224X/31 Second cousin Third cousin l/64 512X/63

13 Feingold and Siegmund: Strategies to Map Recessive Traits 977 8iloci pq(d/f)2 = pq(d/f)2 + [1 + (d/f)(q p)]2 _ trait locus is qfo + p(fo + 2f) = fo + 2pf = K + V7 - for d 0. Since VD = 4p2q2d2 = 4K26/(1 - a), we have If -dif = 1, this gives 8i/cti = (1 - p)/(1 + 3p). If -dif < 1, the ratio 6i/ai takes a maximum value of (d/f)2/l4-3(d/f)21, achieved at allele frequency p =(1 + df)12. To derive formulas for y, it is useful to start with a general one-locus n-allele model. This sets the penetrance of an individual with alleles i and j equal to K + f, + fj + dij (Kempthorne 1957), where K is the prevalence of the trait in the outbred population, fi is the net additive effect of allele i, and dij is the net effect of the interaction between alleles i and j. Writing the frequency of allele i as pi, we have Xi pif, = 0 and Yi pidi, = 0. Then the probability that an inbred individual is HBD given that he or she is affected is so F + y(l F) -P{HBD}P~affected HBD} Plaffected) pidii) K + F E pidii _ F(K + FYE pidii K + FE pidii Thus, the relationship between values of y for different types of inbreeding depends on F, on K, and on I pidii, making it fairly complicated. We gain some simplification under the two-allele model, where X pidii = -2pqd. The above formula for P{HBD affected) is correct in the presence of both phenocopies and of heterogeneity with other loci that behave additively. If there is heterogeneity with another recessive locus, we have and thus F + yi(1 - F) P{HBD at locus 1 affected) F(K + E pidii + F E p'd,!i) K + F +pid+ F pfd, aly, F E pidii K + F z pidii + FY pids, where the primes indicate the values for locus 2. We derive formulas for relationships between y, a, and 6 only under the two-allele model. For the onelocus, two-allele model, the probability that an inbred individual is affected given that he or she is HBD at the P{HBD affected} = F(K + v%) F(K + VVD) + (1 - F)K =F + (1 - F) A- ). 1 +2Fe For a heterogeneous trait, modeled by assuming additivity of penetrances between, for example, two loci in linkage equilibrium, the probability of HBD at both trait loci is F2 + F(1 - F)(y1 + y2); the probability of HBD at the first but not the second trait locus is F(1 - F)[1 + (1 - F)y1/F -Y2], etc. When both trait loci are diallelic, References 2F[8i/(l - al -a22 = 1 + 2F[( &'2)/(1 - al - a2)2] Blackwelder WC, Elston RC (1985) A comparison of sib-pair linkage tests for disease susceptibility loci. Genet Epidemiol 2:85-97 Bodmer WF, Cavalli-Sforza LL (1976) Genetics, evolution, and man. WH Freeman, San Francisco Crow JF, Kimura M (1970) An introduction to population genetics theory. Harper & Row, New York Davies JL, Kawaguchi Y, Bennett ST, Copeman JB, Cordell HJ, Pritchard LE, Reed PW, et al (1994) A genome-wide search for human type 1 diabetes susceptibility genes. Nature 371: Dupuis J. Brown PO, Siegmund D (1995) Statistical methods for linkage analysis of complex traits from high resolution maps of identity by descent. Genetics 140: Elston RC (1992) Designs for the global search of human genome by linkage analysis. Proc Int Biometric Conf Faraway JJ (1993) Improved sib-pair linkage test for disease susceptibility loci. Genet Epidemiol 10: Feingold E (1993) Markov processes for modeling and analyzing a new genetic mapping method. J Appl Prob 30: Feingold E, Brown PO, Siegmund D (1993) Gaussian models for genetic linkage analysis using complete high resolution maps of identity by descent. Am J Hum Genet 53: Gschwend M, Levran 0, Kruglyak L, Ranade K, Verlander PC, Shen S. Faure S, et al (1996). A locus for Fanconi anemia on 16q determined by homozygosity mapping. Am J Hum Genet 59: Holmans P (1993) Asymptotic properties of affected-sib-pair linkage analysis. Am J Hum Genet 52: Hugot JP, Laurent-Puig P, Gower-Rousseau C, Olson JM, Lee

978 Am. J. Hum. Genet. 60:965-978, 1997 JC, Beaugerie L, Naom I, et al (1996) Mapping of a susceptibility locus for Crohn's disease on chromosome 16.

14 978 Am. J. Hum. Genet. 60: , 1997 JC, Beaugerie L, Naom I, et al (1996) Mapping of a susceptibility locus for Crohn's disease on chromosome 16. Nature 379: Kempthorne 0 (1957) An introduction to genetic statistics. John Wiley & Sons, New York Khoury MJ, Beaty TH, Cohen BH (1993) Fundamentals of genetic epidemiology. Oxford University Press, New York Knapp M (1996) Even a deficit of shared marker alleles in affected sib pairs can yield evidence for linkage. Am J Hum Genet 59: Knapp M, Seuchter SA, Baur MP (1994a) Linkage analysis in nuclear families. 1. Optimality criteria for affected sib-pair tests. Hum Hered 44:37-43 (1994b) Linkage analysis in nuclear families. 2. Relationship between affected sib-pair tests and lod score analysis. Hum Hered 44:44-51 Kruglyak L, Daly MJ, Lander ES (1995) Rapid multipoint linkage analysis of recessive traits in nuclear families, including homozygosity mapping. Am J Hum Genet 56: Kruglyak L, Daly MJ, Reeve-Daly MP, Lander ES (1996) Parametric and nonparametric linkage analysis: a unified multipoint approach. Am J Hum Genet 58: Kruglyak L, Lander ES (1995) Complete multipoint sib-pair analysis of qualitative and quantitative traits. Am J Hum Genet 57: Lander ES, Botstein D (1986) Strategies for studying heterogeneous genetic traits in humans by using a linkage map of restriction fragment length polymorphisms. Proc Natl Acad Sci USA 83: (1987) Homozygosity mapping: a way to map human recessive traits with the DNA of inbred children. Science 236: Lander ES, Kruglyak L (1995) Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results. Nat Genet 11: Lander ES, Schork NJ (1994) Genetic dissection of complex traits. Science 265: Nelson SF, McCusker JH, Sander MA, Kee Y. Modrich P, Brown PO (1993) Genomic mismatch scanning: a new approach to genetic linkage mapping. Nat Genet 4:11-18 Risch N (1990a) Linkage strategies for genetically complex traits. I. Multilocus models. Am J Hum Genet 46: (1990b) Linkage strategies for genetically complex traits. II. The power of affected relative pairs. Am J Hum Genet 46: Rogus JJ, Krolewski AS (1996) Using discordant sib pairs to map loci for qualitative traits with high sibling recurrence risk. Am J Hum Genet 59: Schaid DJ, Nick TG (1990) Sib-pair linkage tests for disease susceptibility loci: common tests vs. the asymptotically most powerful test. Genet Epidemiol 7: Strathdee CA, Duncan AMV, Buchwald M (1992) Evidence for at least four Fanconi anaemia genes including FACC on chromosome 9. Nat Genet 1: Suarez BK, Rice J, Reich, T (1978) The generalized sib pair IBD distribution: its use in the detection of linkage. Ann Hum Genet 44:87-94 Teng J. Siegmund D (1997) Combining information within and between pedigrees in allele sharing methods of linkage analysis. Am J Hum Genet 60: (in this issue) Tsai M-S, Tangalos EG, Petersen RC, Smith GE, Schaid DJ, Kokmen E, Ivnik RJ, et al (1994) Apolipoprotein E: risk factor for Alzheimer disease. Am J Hum Genet 54:

Non-parametric methods for linkage analysis

BIOSTT516 Statistical Methods in Genetic Epidemiology utumn 005 Non-parametric methods for linkage analysis To this point, we have discussed model-based linkage analyses. These require one to specify a