STATISTICAL ISSUES IN THE DESIGN AND ANALYSIS OF GENE-DISEASE ASSOCIATION STUDIES. Duncan C. Thomas


Revised 5/6/02

To appear in: Human Genome Epidemiology: Scientific Foundation for Using Genetic Information to Improve Health and Prevent Disease. Edited by MJ Khoury, J Little, W Burke

INTRODUCTION

Genetic epidemiology comprises two broad types of activity that entail the use of biological specimens: gene discovery and gene characterization. Studies of familial aggregation and segregation analysis are aimed at establishing a genetic component to a disease and inferring the mode of inheritance, but do not use any DNA analysis. Once evidence for the existence of one or more major genes has been found, geneticists use linkage analysis to localize these genes by identifying genetic markers at known chromosomal locations which appear to be transmitted within families in a manner that parallels the transmission of the disease. Once a gene has been localized in this manner, association studies can be used either to test hypotheses about possible candidate genes within the region or to narrow the region further using linkage disequilibrium. Association studies are also used once a causal gene has been cloned to characterize its age-specific penetrance function and interactions with other factors. Following a brief review of methods for gene discovery to set the stage for a unified approach to gene discovery and gene characterization, the remainder of this chapter focuses on design and analysis issues in these various types of association studies. Throughout, we limit attention to binary disease traits, although many of the design and analysis issues also apply to continuous traits.

Although there is obviously a continuous spectrum of gene effects, we are accustomed to thinking in terms of two general types of genes that are potentially detectable by genetic epidemiologists: major susceptibility genes having a high penetrance, such mutations usually being rare in the general population; and common low penetrance genes, such as those
involved in metabolic activation and detoxification of carcinogens, DNA repair, and other complex pathways involving multiple genes and interactions with environmental agents. It is increasingly being recognized that the classical positional cloning approaches (i.e., linkage analysis) are more effective for discovery of major susceptibility genes than common low penetrance genes, and genome-wide association studies are now being suggested as an approach to detection of the latter type (1).

A variety of study designs are available for these various purposes. Linkage analysis of necessity requires family studies, typically either sib-pair designs (affected sib pairs for a binary disease trait) or extended pedigrees. Association studies, on the other hand, can use either family designs or population-based designs involving unrelated individuals. These various options will be discussed in greater detail below, with the suggestion that it is possible to design efficient population-based family studies that can be used for both linkage and association. We will also discuss the use of collections of high risk families in gene characterization studies and discuss a general population-based framework for discovery and characterization. Finally, we touch briefly on a number of modeling issues and future challenges that are likely to arise in studying complex diseases.

METHODS FOR GENE DISCOVERY

Linkage analysis entails a search across the genome for markers that are associated with the disease within families, i.e., markers for which pairs of affected relatives tend to have the same alleles at that marker locus. Between families, however, different alleles may be associated
with the disease, because the marker allele per se has no causal role in the disease; the marker simply travels with the disease allele from parents to offspring because the two are close together on a chromosome. Thus, it is possible for a marker to be linked to a disease gene but not associated with it. Once one or more markers in a region have been found to be linked, additional nearby markers are then tested in an effort to localize the disease gene more precisely. Efficient multistage testing strategies for such genome scans have been discussed by Brown et al. (2) and Elston et al. (3). Multipoint linkage analyses might use several markers jointly for greater precision in estimating that location than can be obtained from a series of two-point linkage analyses one marker at a time (4).

Linkage analyses can be model-based (parametric or "lod score") or model-free ("nonparametric"). The lod score method is based on the likelihood of the observed marker and disease data in a pedigree under a model for the distribution of the unobserved disease gene. Typically, the parameters of such a model (e.g., the age- and sex-specific penetrance function and disease allele frequency) are assumed to be known from earlier segregation analyses and the likelihood is maximized with respect to the recombination fraction θ (or the location of the disease gene in a multipoint analysis). This approach can be applied to nuclear families or extended pedigrees. Because the method models the conditional distribution of markers given disease phenotypes, no assumption that the families were ascertained by any systematic statistical sampling scheme is needed; indeed, heavily loaded families identified through genetic counseling clinics are often used for this purpose and are typically the most informative for linkage analysis, even though they are in no sense population-based. The lod score method is the most powerful approach if the genetic model is correctly specified, but can lose
power or even produce false evidence of linkage under some kinds of misspecification. In contrast, nonparametric approaches do not require any assumptions about the genetic model and thus are robust to model misspecification, but are generally less powerful than lod score methods; furthermore, the choice of the optimal nonparametric test will still depend upon the presumed mode of inheritance. Nonparametric methods are based on a comparison of the proportion of alleles shared identical by descent (IBD) by pairs of affected relatives against the proportion expected based solely on their relationship (e.g., one quarter of sibling pairs would be expected to share zero, one half to share one, and one quarter to share two alleles IBD). This approach is most commonly applied to affected sib pairs (where possible their parents are also included to aid in the determination of IBD status), although it is possible to include other relative types in other forms of analysis.

The limit of resolution of linkage analysis is generally felt to be not much smaller than about 1 cM (1% recombination, corresponding to roughly one million base pairs (bp)), even with the largest pedigree studies. Thus, other techniques are needed to further localize a disease gene before undertaking massive sequencing in search of mutations. These techniques typically entail use of a very dense panel of markers, perhaps spaced on the order of 10 kb (0.01 cM) apart, which can be used in various ways. The simplest of these is to search one-by-one for markers that appear to be associated with the trait across the population, a phenomenon known as linkage disequilibrium (LD). LD can arise in a number of ways (new mutations, genetic drift, admixture, etc.) but will tend to decay over G generations by a factor of (1 - θ)^G; thus, after many generations from the event that created the LD initially, only very nearby pairs of loci will remain associated.
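
The decay factor (1 - θ)^G quoted above can be made concrete with a small calculation. The following is a minimal sketch, not part of the original chapter: the function names and the example values (θ = 0.01, i.e., about 1 cM, and 100 generations) are illustrative assumptions, and the model ignores drift, selection, and new mutation.

```python
import math


def ld_after_generations(d0: float, theta: float, generations: int) -> float:
    """Expected LD after `generations` generations: D_G = D_0 * (1 - theta)**G.

    d0 is the initial disequilibrium and theta the recombination fraction
    between the two loci (both hypothetical inputs for illustration).
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return d0 * (1.0 - theta) ** generations


def generations_to_fraction(theta: float, fraction: float) -> float:
    """Generations needed for LD to fall to `fraction` of its initial value."""
    if not 0.0 < theta <= 0.5 or not 0.0 < fraction <= 1.0:
        raise ValueError("need 0 < theta <= 0.5 and 0 < fraction <= 1")
    return math.log(fraction) / math.log(1.0 - theta)


if __name__ == "__main__":
    # Loci 1 cM apart (theta = 0.01) versus 0.01 cM apart (theta = 0.0001).
    for theta in (0.01, 0.0001):
        remaining = ld_after_generations(1.0, theta, 100)
        print(f"theta={theta}: fraction of LD left after 100 generations = {remaining:.3f}")
```

Under these assumptions, loci 1 cM apart retain roughly a third of their initial LD after 100 generations, whereas loci 0.01 cM apart retain nearly all of it, which is the sense in which only very nearby loci remain associated.
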
The extent of detectable LD in various human populations and its usefulness as a mapping tool is an
active area of research, but in most outbred populations LD is generally believed to be very short range, ~100 kb or less. (In recently admixed populations, LD will generally extend over a much wider interval, making them potentially useful for genome scans (5); LD tends to be larger in magnitude and more consistent in population isolates, making them more useful for fine mapping (6).) Even within a region of significant LD, its magnitude will be extremely variable across pairs of nearby loci, owing to chance mutation, recombination, and coalescent events in the ancestral history of the population. Therefore, there is now great interest in using haplotypes (sequences of adjacent marker alleles on a single chromosome) as a tool for fine mapping. A variety of approaches have been proposed for doing this, including searching for segments that are frequently shared by pairs of cases (7-10), testing for association between specific haplotypes and disease (11, 12), or using coalescent methods to model the ancestry and evolution of mutation-carrying haplotypes (13-15).

DESIGNS FOR ASSOCIATION STUDIES

Whether the aim is fine mapping by LD, testing an association with a candidate gene, or characterizing a cloned gene, a number of different study designs could be used, their relative merits depending upon the context. The most important distinction between these designs concerns whether families or unrelated individuals are studied. We therefore begin by discussing the standard population-based epidemiologic case-control and cohort designs using unrelated individuals and then survey a range of family-based designs: case-control, case-parent trios, kin-cohort, and use of heavily loaded pedigrees.

Population-Based Case-Control and Cohort Designs

The majority of disease traits studied in genetic epidemiology are relatively rare, so that case-control designs are natural to consider. The design of such studies is essentially no different for a genetic risk factor than for environmental risk factors and follows well recognized principles discussed in standard epidemiologic textbooks (e.g., (16-18)). Thus, for example, cases should be representative of all cases in the population and controls should be representative of the source population of cases. This is most easily accomplished in situations where a population-based disease registry exists, such as the SEER registries in the U.S., and where there is some means of sampling from the total population. The latter is more difficult in the U.S., although some other countries maintain voter registration or other databases that are available for epidemiologic research. Absent such a register, some imagination may be needed to construct a suitable control selection procedure: techniques such as neighborhood censuses, random digit dialing, sampling from birth registries (for childhood diseases) or Medicare files (for diseases of old age), prepaid health maintenance organization rosters, or hospital controls have been used in various epidemiologic studies and their advantages and disadvantages have been widely discussed. Cases and controls are frequently individually or stratum-matched on potential confounding variables, such as age, gender, race, and possibly other established risk factors. See (19) in this volume for further discussion of some of the issues of validity and efficiency that can arise in such studies.
In some respects, one of the major drawbacks of case-control studies in conventional risk factor epidemiology (recall bias due to retrospective collection of exposure information) is not as much a concern in genetic epidemiology, since a subject's constitutional genotype does not vary over time and is not subject to the vagaries of an individual's memory. (Of
course, phenotypic assays of genotype could be distorted by the disease process, and other confounding or modifying factors could be misclassified.) Indeed, Clayton and McKeigue (20) have argued that because the transmission of genes from parents to offspring is random, a gene association study carries the same interpretability in terms of causality as a randomized controlled trial, at least in terms of freedom from bias and residual confounding. Thus such associations would reflect a causal effect of either the variant under study or a nearby one in linkage disequilibrium with it. This "Mendelian randomization" argument is directly applicable to the case-parent trio design discussed below, but they also apply it to ordinary case-control and cohort designs involving unrelated individuals. However, this extension of the principle requires that one address the problem of population stratification using one of the approaches discussed below.

Cohort studies have well recognized advantages and disadvantages (20). For the purpose of genetic epidemiology, few investigators would contemplate initiating a new prospective cohort study for any but the most common diseases, but there are now in excess of a million persons enrolled in various existing cohorts for whom biological specimens have already been obtained and stored. Some of these cohorts have already accrued decades of follow-up time and represent a rich resource for genetic association studies (21). The cost of genotyping everyone in a large cohort would likely be prohibitive, even with recently developed high throughput technologies, but this can be avoided using nested case-control (22) or case-cohort (23) designs. These entail comparison of cases arising in the cohort with a sample of suitably selected controls drawn from the cohort, thereby capitalizing on the inferential advantages of a cohort design at greatly reduced cost. The design of such nested studies is in principle no different when studying a
genetic association than any other risk factor and is discussed in standard textbooks (24) and review articles (25). However, a number of options for efficient sampling are available, such as multistage sampling (26-29) and countermatching (30, 31). Multistage sampling might entail an initial random sampling of cases and controls on whom a surrogate for some risk factor (e.g., family history as a surrogate for genotype) is obtained. Subjects are then subsampled using this information for the more expensive determination of genotype and perhaps other risk factors. Countermatching aims to improve the efficiency of a matched case-control design by increasing the proportion of pairs that are discordant for the risk factor(s) of interest through systematically mismatching them on a surrogate for the factor: for example, in a genetic study, a case with a positive family history might be matched with a family-history-negative control and vice versa. The inherent bias in both these designs is then accounted for by including suitable weights in the analysis.

Whether case-control, cohort, or any of these nested designs are used, any association study based on unrelated individuals potentially suffers from a form of confounding known in the genetics literature as population stratification (32). If the population comprises two or more subgroups with different allele frequencies and different baseline rates of disease, then confounding can occur (see Figure), leading to increased risk of false positive associations, and biasing relative risk estimates upwards or downwards, depending upon the direction of these two associations.

[Figure: diagram with the nodes Ethnicity, Allele frequency, Candidate Gene, Baseline disease risk, and Disease.]

If these subgroups were identifiable, standard techniques such as matching or statistical adjustment
could be used to control this problem; indeed, epidemiologic studies are routinely matched or adjusted for race/ethnicity. The difficulty is that even within the broad categories of race/ethnicity that are conventionally used, there can be strong gradients in allele frequencies and baseline risks. Some authors (19, 33-35) have questioned the practical importance of this concern, at least for studies of common polymorphisms in non-Hispanic whites of European descent, arguing that any correlation between baseline rates and allele frequencies that would give rise to confounding tends to disappear as larger numbers of subgroups are combined. Hence, they argue that adherence to standard principles of sound epidemiologic study design should be adequate to address it. However, in multiethnic populations, particularly heavily admixed populations such as African-Americans or Hispanics or those with a high prevalence of multi-racial individuals as in Southern California, individuals can be difficult to classify or match appropriately (36).

A different approach to this problem which is gaining some theoretical attention (but remains to be applied widely), known as genomic control, is based on using a panel of markers unlinked to the gene under study to infer the hidden population structure and adjust for it. One such approach uses the distribution of test statistics for the markers to estimate an inflation factor by which to adjust the naïve chi-square test for the overdispersion caused by population stratification (37, 38). Another approach uses Bayesian clustering or latent class analysis methods to assign cases and controls probabilistically to strata defined by their markers and then performs a stratified analysis within these strata (39, 40).
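
The inflation-factor form of genomic control described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the procedure of any particular reference: it assumes the inflation factor is estimated as the median of the 1-df chi-square statistics at the unlinked markers divided by the median of a central chi-square distribution with 1 df (about 0.455), and that the estimate is floored at 1 so the candidate-gene test is never made more significant. The function names and input values are hypothetical.

```python
import statistics

# Median of a central chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(null_marker_stats):
    """Estimate the inflation factor from unlinked ("null") marker statistics.

    Assumption: median-based estimator, floored at 1.0.
    """
    stats = list(null_marker_stats)
    if not stats:
        raise ValueError("at least one null marker statistic is required")
    lam = statistics.median(stats) / CHI2_1DF_MEDIAN
    return max(lam, 1.0)


def adjusted_statistic(candidate_stat, null_marker_stats):
    """Divide the naive chi-square statistic by the estimated inflation factor."""
    return candidate_stat / inflation_factor(null_marker_stats)


if __name__ == "__main__":
    # Hypothetical 1-df chi-square statistics at unlinked markers.
    null_stats = [0.2, 0.5, 0.9, 1.4, 2.6, 0.7, 1.1]
    lam = inflation_factor(null_stats)
    print(f"estimated inflation factor: {lam:.2f}")
    print(f"adjusted candidate statistic: {adjusted_statistic(9.0, null_stats):.2f}")
```

The adjusted statistic is then referred to the usual chi-square distribution with 1 df; the median is used here because it is less sensitive than the mean to a few markers that happen to show real association.
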
Although not the focus of this chapter, all the designs discussed in this section and the family-based designs section which follows can also be used for testing gene-environment and gene-gene interactions (see (19) for more discussion of this topic). Another approach to testing such
interaction effects is known as the case-only or case-case design (41-43). In this approach, a series of unrelated cases is used and the association between genotype and environment (or two genes) is tested. If the two factors were independently distributed in the source population, then any such association in cases would be evidence of departure from a multiplicative model for their joint effects on disease risk. Of course, with this design it is not possible to test the main effects of either factor and careful consideration is needed to judge whether the assumption of independence in the source population is tenable (44), but if it is, the design is more powerful for testing that interaction than a conventional case-control design.

Family-Based Designs

The use of family-member case-control designs is appealing because family members have a common gene pool and hence the problem of population stratification is overcome by matching. The two main variants of this idea involve the use of sibling controls or case-parent trios (also known as "parental controls" or "pseudo-siblings"). In a case-sibling study, affected individuals are the cases and their unaffected siblings the controls, the data being analyzed as a matched case-control study using standard conditional logistic regression methods. (Unaffected cousins could also be used instead of siblings, although the protection against population stratification would no longer be absolute, since they would have only one pair of grandparents in common and the other grandparents might come from different ethnic groups.) In the case-parent-trio design, cases and their parents are genotyped, but the parents themselves are not used as controls; instead, one forms the set of hypothetical "pseudo-siblings" comprising the three other genotypes that could have been transmitted from the parents; the case-pseudosib sets are then
analyzed as a 1:3 matched case-control design using conditional logistic regression (45). A variant of this design, known as the Transmission-Disequilibrium Test (TDT) (46), compares each of the two alleles for the case separately against the other allele not transmitted by that parent, as two independent contributions to a 1:1 matched case-control comparison of alleles rather than a single genotype comparison. The two procedures are mathematically equivalent under a multiplicative model for allele contributions to risk, but would differ under a dominant or recessive model. Both the case-sib and the case-parent-trio designs test an alternative hypothesis of linkage and association, i.e., they will detect associations only with causal genes or with genes that are in linkage disequilibrium with a causal gene.

There are a number of drawbacks to the use of these designs. Because cases and their siblings are more likely to share genotypes (and environmental factors), their comparison tends to be less efficient than using unrelated controls (depending upon the genetic model, about 50% as efficient, meaning that double the sample size would be required to obtain the same statistical precision or power). For gene-environment interactions, however, the use of sib controls can be more efficient (47). In general, siblings should have attained the age of the case's diagnosis to rule out the possibility that he or she might still have been affected prior to the case (48), effectively limiting the pool of potential controls to older siblings for many cases; this lack of comparability on birth order, however, risks introduction of other biases, particularly if time-dependent environmental variables are to be included in the model or if a substantial proportion of cases need to be excluded for lack of a suitable sib control (47).
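
The allele-level 1:1 matched comparison that defines the TDT can be illustrated with a short sketch. It assumes a biallelic marker with genotypes coded as the number of copies of a reference allele A (0, 1, or 2) and uses the familiar McNemar-type statistic (b - c)^2 / (b + c), where b and c count heterozygous parents who did and did not transmit A to the affected child. The function name and the toy data are hypothetical and not taken from the chapter.

```python
def tdt_statistic(trios):
    """Compute TDT counts and the McNemar-type chi-square (1 df).

    `trios` is an iterable of (father, mother, child) genotypes, each coded
    as the number of copies of allele A (0, 1 or 2). Only heterozygous
    parents are informative; homozygous parents always transmit the same
    allele and so contribute nothing to the comparison.
    """
    transmitted = 0      # heterozygous parents who transmitted A
    untransmitted = 0    # heterozygous parents who transmitted the other allele
    for father, mother, child in trios:
        parents = (father, mother)
        if any(g not in (0, 1, 2) for g in parents + (child,)):
            raise ValueError("genotypes must be coded 0, 1 or 2")
        n_het = sum(g == 1 for g in parents)
        n_hom_a = sum(g == 2 for g in parents)
        # Copies of A in the child that must have come from heterozygous parents.
        from_het = child - n_hom_a
        if not 0 <= from_het <= n_het:
            raise ValueError("genotypes are not Mendelian-consistent")
        transmitted += from_het
        untransmitted += n_het - from_het
    informative = transmitted + untransmitted
    if informative == 0:
        raise ValueError("no heterozygous parents, so the test is undefined")
    chi2 = (transmitted - untransmitted) ** 2 / informative
    return transmitted, untransmitted, chi2


if __name__ == "__main__":
    toy_trios = [(1, 0, 1), (1, 0, 1), (1, 0, 1), (1, 0, 0), (1, 1, 2)]
    b, c, chi2 = tdt_statistic(toy_trios)
    print(f"transmitted={b}, untransmitted={c}, chi-square={chi2:.3f}")
```

Counting transmissions per heterozygous parent, rather than per child genotype, is what makes this an allele comparison; the 1:3 pseudo-sibling analysis would instead compare the child's whole genotype with the three genotypes the parents could otherwise have produced.
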
There are other subtleties if multiple cases or multiple controls are selected from the same family, since the possible permutations of disease status against genotypes are not equally likely under the null hypothesis
that the gene is not itself causal but is linked to a causal gene (49). Furthermore, the familial relationships amongst cases and controls must not give away who must have been the case: for example, if cousins were used as controls and one were drawn from each side of the family, it would be obvious which was the case because he or she would be the only one who was a blood relative of both of the other two. While some authors have advocated limiting such comparisons by selecting a single case and a single control from each family (for example, taking the pair with the maximally different genotypes (49)), there has been a rapidly developing literature on valid family-based association tests (FBATs) that would exploit all the possible comparisons within a family (50-52).

The case-parent-trio design generally does not suffer from the efficiency loss that the case-sib design does, and indeed can be more powerful than using unrelated controls for a recessive gene (47). However, it does require that both parents be available for genotyping, which makes it difficult to apply to diseases of middle or old age. Although some information is contained in the transmission from a single parent, if that is all that is available, care must be taken to avoid bias, since the subset of transmissions for which parental sources can be inferred unambiguously is not random (53, 54). As with the case-sib design, families with multiple cases do not contribute independent information under the null hypothesis of linkage but no association, so more sophisticated techniques are required (55). Although the trio design cannot be used to test for the main effects of environmental factors, it can in principle test for gene-environment interactions by comparing the genetic relative risks in exposed and unexposed cases (56).
This comparison, however, involves an assumption that genes and environments are independently distributed within families (i.e., conditional on parental genotypes) (43, 57), an assumption that
is similar to that for the case-only design, but somewhat weaker because it applies within families, not between families, and thus would not be influenced by such factors as family history that could potentially induce such an association. See (58, 59) for a log-linear models approach to case-parent-trio data, with particular application to testing maternal genotype effects and imprinting. For example, birth defects could involve a direct or interactive effect of maternal alleles, in which case the deleterious allele would tend to have a higher frequency in mothers than fathers.

A more complex design is the case-control-family study. Here, population-based series of unrelated cases and controls are identified, possibly matched on various factors as in a traditional case-control study, and their family members are also recruited as study participants. While more commonly used for testing familial aggregation (e.g., (60)) or segregation analysis (e.g., (61)) without use of any molecular data, the design can also be used for testing candidate gene associations or characterizing cloned genes (e.g., (62, 63)). Used in this way, it can be seen as an extension of the kin-cohort design discussed below, but its real advantage lies in its population-based nature and ability to serve as the basis for an integrated approach to gene discovery and gene characterization.

The kin-cohort design (64) entails ascertainment of a series of probands unselected with respect to family history and obtaining their genotypes and their family history in first-degree relatives (but relatives' genotypes are not needed in this approach). The probands themselves could be affected or unaffected and need not necessarily be representative of the population (provided they are not biased with respect to family history). For example, in a study of the
penetrance of the BRCA1 and BRCA2 ancestral mutations in Ashkenazi Jews, Struewing et al. (65) enrolled volunteers from Jewish community organizations in the Washington, D.C. area. The cumulative incidence curves in first-degree relatives of carrier and noncarrier probands are then estimated using standard Kaplan-Meier survival analysis methods. Because first-degree relatives of carriers have a roughly 50% probability of carrying the same mutation while first-degree relatives of noncarriers have only half the population probability of being a carrier, it is then possible to decompose the observed cumulative incidence curves into their constituent penetrance curves (cumulative incidence in carrier and noncarrier individuals) by a simple algebraic manipulation.

This design has been extended in a number of ways. The relatively simple analysis described above does not exploit all the information in the sample, so Gail et al. proposed a maximum likelihood analysis, similar to segregation analysis but conditioning on the observed genotypes (66, 67). Using this likelihood, it then becomes straightforward to extend the design to include more distant relatives, as well as measured genotype information on other relatives; they call this general approach the Genotyped Proband Design. Siegmund et al. (68) have considered the question of which members of a pedigree would be the most informative to genotype and concluded that for a common low penetrance dominant gene, genotyping additional relatives per family was more efficient than genotyping a single proband in a larger number of families (for the same total genotyping costs), while for a rare major dominant gene, the reverse was true.

Since in the process of gene discovery, the heavily loaded families typically used are not representative of all cases in the population, it is natural to inquire whether any useful
information about penetrance or modifying factors can be obtained from them at the gene characterization phase. Their great advantage, beyond simply the cost efficiency from having already collected the pedigree information and biological specimens, is that such families will tend to have a higher prevalence of mutations, particularly for rare high penetrance genes, so that smaller sample sizes should be required. On the other hand, since such families were collected specifically because they had many cases, a naïve analysis that ignored the ascertainment process would greatly overestimate both penetrance and allele frequency, compared to their true values in the general population. While in principle, one might be able to construct a maximum likelihood analysis which conditioned on the ascertainment scheme, in practice such collections seldom can be described in terms of any well defined statistical sampling plan, and even if they could, the analysis of complex sampling schemes would likely be computationally intractable.

Fortunately, an alternative approach is available that theoretically should allow valid estimates of population parameters even from samples that are not population-based. Known as the mod score (69-71) or retrospective likelihood (72) approach, the analysis is based on the conditional likelihood of the measured genotypes given the observed distribution of phenotypes in the family. By conditioning on the phenotypes in this manner, their ascertainment is automatically controlled for, assuming that families were ascertained solely on the basis of their phenotypes, not their genotypes. This is the approach that was used in the initial estimation of BRCA1/2 penetrance from the Breast Cancer Linkage Consortium families (73, 74), which led to an estimate of risk of breast or ovarian cancer by age 70 of 83%. Subsequent estimates based on the kin-cohort and population-based case-control-family designs have been substantially lower (63, 65, 75).
This difference cannot be explained simply as an artifact of ascertainment bias, because the mod score approach was used to analyze the high risk families. However, by
limiting that analysis to the linked families (done to address the problem of genetic heterogeneity, i.e., some families' disease being due to genes other than the one under study), the assumption that families were ascertained solely on the basis of their phenotype was violated; this has been shown to lead to upwardly biased estimates of penetrance (76). Other explanations that have been offered to explain the discrepancy between the clinic-based and population-based estimates are that the former may also be segregating other modifying factors (other genes or environmental factors) leading to truly higher penetrance in such families, or that the penetrance varies by specific mutations, with the more commonly occurring mutations in the general population (e.g., the Ashkenazi founder mutations) having lower penetrance than those occurring in the heavily loaded families.

INTEGRATED DESIGNS FOR DISCOVERY AND CHARACTERIZATION

With this brief tour of approaches to discovering and characterizing genes, we now turn to the question of whether it is possible and efficient to try to design a resource that can be used for both purposes. The experience from the use of heavily loaded clinic-based collections of pedigrees to estimate BRCA1/2 penetrance should be somewhat cautionary about the limitations of relying exclusively on series that are not population-based. On the other hand, they are arguably the most efficient way to assemble pedigrees that are highly informative for linkage analysis. In an attempt to bridge this gap, Zhao et al. (77, 78) have proposed a general framework based on the case-control-family design described above. Since the initial ascertainment of families is population-based, there would be no difficulty in estimating population parameters from such a design. Of course, the yield of rare major genes would be
relatively low, but multistage sampling of probands based on family history (as discussed earlier (28)) could be used to home in on the families most likely to be segregating mutations, and this could be extended following the principles of sequential sampling of pedigrees (79). Here, the basic idea is that at each stage of pedigree extension, one is entitled to use all the phenotype information already collected systematically as well as knowledge of the pedigree structure (but not anecdotal information about phenotypes) in branches not yet explored, to decide whether and in what direction to extend the pedigree; once extended, all the phenotype information obtained must then be included in the analysis, whether additional cases were identified or not. Following these simple rules, Cannings and Thompson show that the likelihood for the pedigree need be conditioned solely on the initial ascertainment of probands, not on all the decisions made subsequently.

Still, for mapping a very rare gene, it is unclear whether this process can yield a sufficient number of highly informative pedigrees, even using the most efficient approaches to multistage sampling and sequential extension of pedigrees, without requiring enrollment of a prohibitive number of probands. For genes with mutations that are not extremely rare, however, there is great merit in this approach, as it will not only provide a basis for mapping genes and then characterizing them in the same sample, but it will also provide a resource for continuing the search for additional genes after some have been discovered. For example, Antoniou et al. (80) and Cui et al. (81), using such approaches, have provided evidence for an additional major gene for breast cancer, possibly a more common recessive gene, after removing the families attributable to BRCA1 and BRCA2. Their approaches differ somewhat, Antoniou et al.
fitting a multilocus model which includes the measured genes and all families in the analysis, Cui et al. 18

19 excluding the families known to be segregating one of the two measured genes. On the basis of such segregation analysis results, one might then feel confident to launch a further genome scan to localize such a gene, now using the more powerful lod score approaches which require a population-based estimate of the genetic model. Absent such knowledge, one would be forced to use the affected sib pair approach, first screening all pairs to exclude those that were carrying a known mutation. It is this general philosophy that underlies the establishment of the NCI s Cooperative Family Registries for Breast and Colorectal Cancer Research (82-85). In order to address the aims of both discovery and characterization, this multi-center resource comprises population-based and clinic-based series of families. The population-based series are ascertained through affected probands from population-based cancer registries, stratified in various ways. Some are unselected consecutive series, some restricted or sampled by age, race, or family history in firstdegree relatives; a few registries have used multistage sampling (29, 86). The clinic-based registries are intended to provide a large series of multiple-case families for gene discovery purposes, but would not be included in analyses aimed at characterization, except perhaps using the mod-score approach. Whatever the mode of ascertainment, all probands provide a standardized risk factor questionnaire, including extended family history, and blood samples that are being stored for genotyping and creation of cell lines. Participating centers differ in the specifics of their protocols for developing extended pedigrees, but in general as many surviving family members (affected and unaffected) as possible are enrolled as participants, providing the same risk factor information and blood samples which are also being stored. To date, over 6000 breast and 6000 colorectal cancer families have been enrolled, comprising over 100,000 19

individuals in each registry. Depending upon the specific scientific aims, these families might be sampled in various ways for genotyping. A variety of studies aimed at using this resource for gene discovery and characterization are currently underway.

MODELS FOR COMPLEX DISEASES

Whether parametric linkage analysis or association analysis is planned, some form of statistical model of penetrance is needed. Among the complexities that must be considered are variable age at onset; the role of polygenes, other major genes, and environmental factors, including their possible interactions; residual familial aggregation due to unmeasured factors; and heterogeneity of effect for genes with multiple mutations or polymorphisms. One might also wish to take account of somatic events, such as loss of heterozygosity, genomic instability, and DNA methylation, and of gene expression data. Genome-wide association studies are also being proposed as a means of gene discovery, perhaps requiring something on the order of a million statistical tests (1), introducing yet another level of statistical complexity. In this brief section, we can only outline a general approach to model building, leaving the details to other papers.

For binary disease traits with variable age at onset, the techniques of survival analysis provide a natural framework for modeling penetrance. Letting $\lambda(t)$ denote the incidence rate of disease at age $t$ (the "hazard function") and $S(t) = \exp\left(-\int_0^t \lambda(u)\,du\right)$ the probability of surviving to age $t$ free of disease (the "survival function"), the likelihood contribution for a case diagnosed at age $t$ is $\lambda(t)\,S(t)$, and the contribution for a subject last seen at age $t$ and disease-free at that time is simply $S(t)$. If we assumed that, conditional on all the measured risk factors, the outcomes of all subjects $i = 1, \ldots, n$ were independent, then the overall likelihood of the data would be simply

$$L = \prod_{i=1}^{n} \lambda_i(t_i)^{d_i}\, S_i(t_i),$$

where $d_i$ is an indicator for affected (1) or not (0). The conditional independence assumption would not pose any difficulty for unrelated individuals (e.g., a population-based case-control study), but is more problematic for family data. If not all family members have been genotyped for a major gene, then a likelihood contribution for the entire family must be constructed by summing over the possible genotypes of all the untyped individuals that are compatible with the available genotype information on other family members. This is essentially a segregation analysis, but conditional on partially measured genotype information (63). Additional familial dependencies might be caused by other as yet unidentified genes, by unmeasured environmental factors, or by correlated measurement errors in measured risk factors. Such dependencies might be taken into account using regressive models (87), latent variable approaches like frailty models (88, 89), or marginal models using generalized estimating equations (GEE) methods (90).

By whatever means the likelihood is constructed, a model is needed for the hazard function in relation to the various measured risk factors, genetic and environmental. One possibility is the proportional hazards model (91), which might be written as

$$\lambda(t, G, Z) = \lambda_0(t) \exp(\beta_G + Z'\gamma + \cdots),$$

where $G$ represents the major gene(s), $\beta_G$ the log relative risk associated with genotype $G$, $Z$ the measured environmental covariates, $\gamma$ their regression coefficients, $\lambda_0(t)$ an unspecified function representing the baseline risk as a function only of age, and "$\cdots$" indicates the possibility of adding interaction terms (e.g., gene-environment, gene-gene, gene-age, etc.). However, a number of major genes such as BRCA1 seem to have much stronger effects at younger ages on a relative risk scale. While this could be addressed by adding age × genotype interaction terms, it might be preferable to reformulate the model as

$$\lambda(t, G, Z) = \lambda_G(t) \exp(Z'\gamma + \cdots),$$

i.e., with separate age-specific baseline rates for each genotype, but still assuming that environmental factors act multiplicatively on these baseline rates unless specific interaction terms are added to the model. In either of these approaches, the form of the baseline rates might be left completely unspecified, as in the Cox partial likelihood approach (92), or some parametric form could be adopted; for example, the S.A.G.E. package assumes a logistic distribution for the ages at onset among the affected, coupled with a logistic model for the lifetime risk of disease, either of which could depend upon genotype and/or covariates, as in an application to smoking-gene interactions for lung cancer (93).

Other mathematical models might also be considered for the joint effects of age and genotype, such as an additive model of the form $\lambda(t, G) = \lambda_0(t) + \beta_G$ or an accelerated failure time model of the form $S(t, G) = S_0(t\,e^{\beta_G})$. For example, Peto and Mack (94) have suggested that the rate of breast cancer in co-twins of affected twins, or of second cancer in the contralateral breast, is virtually constant as a function of age or time since diagnosis of the first, suggesting that an additive model for genetic effects might be appropriate.

The coding of $\beta_G$ would depend upon what is assumed about dominance. For a dominant gene, with wild-type allele $a$ and mutant allele $A$, one would set $\beta_{aa} = 0$ and constrain $\beta_{aA} = \beta_{AA}$; likewise, for a recessive gene, one would set $\beta_{aa} = \beta_{aA} = 0$; for a codominant gene, one would estimate both $\beta_{aA}$ and $\beta_{AA}$. For a multiallelic gene, one might have many more parameters to estimate. Most analyses of BRCA1 penetrance have treated all mutations as equivalent, but there is some evidence that different mutations confer different risks of breast vs. ovarian cancer (95), and it remains an open question whether certain common polymorphisms in the gene also have an effect on penetrance (96). For genes like BRCA1 with hundreds of rare mutations, the prospects of ever having direct estimates of penetrance for any one of them are virtually nonexistent, so some kind of modeling approach is needed to test for systematic influences of broad classes of mutations (truncating or not, by location, etc.) as well as random between-mutation heterogeneity in effect. Hierarchical models (97) provide a natural framework for addressing such questions. Bayesian approaches to smoothing the effects of many haplotypes (sequences of alleles on a single chromosome) within a gene have also been suggested (12). This entails the use of a multi-level model, in which the first level would be a conventional logistic model for disease as a function of a set of relative risks for all possible haplotypes, and the second level would be a model for the prior means and covariances of haplotype relative risks in terms of their structural similarities to each other.
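As a rough illustration of the penetrance likelihood and genotype coding discussed in this section, the following Python sketch evaluates the log-likelihood (the sum over subjects of d_i log λ_i(t_i) minus the cumulative hazard at t_i) for unrelated, fully genotyped subjects under a proportional hazards model. The Weibull baseline hazard, its shape and scale values, the example data, and all function names are assumptions introduced only for illustration; none of them come from the chapter, and family data with untyped members would additionally require the summation over compatible genotypes described earlier.

```python
import math

# Assumed (hypothetical) Weibull baseline: cumulative hazard (t / scale) ** shape.
def baseline_hazard(t, shape, scale):
    return (shape / scale) * (t / scale) ** (shape - 1)

def baseline_cumulative_hazard(t, shape, scale):
    return (t / scale) ** shape

def genotype_log_rr(genotype, beta_het, beta_hom):
    """Log relative risk for genotype 'aa', 'aA' or 'AA'; 'aa' is the reference (0)."""
    return {"aa": 0.0, "aA": beta_het, "AA": beta_hom}[genotype]

def log_likelihood(subjects, beta_het, beta_hom, shape=3.0, scale=100.0):
    """subjects: iterable of (age, affected, genotype), affected being 1 or 0.

    Assumes independent outcomes given genotype (unrelated individuals).
    """
    total = 0.0
    for age, affected, genotype in subjects:
        rr = math.exp(genotype_log_rr(genotype, beta_het, beta_hom))
        hazard = baseline_hazard(age, shape, scale) * rr
        cumulative = baseline_cumulative_hazard(age, shape, scale) * rr
        # log[lambda(t)^d * S(t)] = d * log(lambda(t)) - cumulative hazard
        total += affected * math.log(hazard) - cumulative
    return total

def dominant(beta):
    """Dominant coding: carriers of one or two copies share the same log RR."""
    return beta, beta

def recessive(beta):
    """Recessive coding: only AA homozygotes have an elevated log RR."""
    return 0.0, beta

# Made-up data: (age last seen or at diagnosis, affected indicator, genotype).
subjects = [(45.0, 1, "aA"), (60.0, 0, "aa"), (52.0, 1, "AA"), (70.0, 0, "aA")]
for label, (b_het, b_hom) in [("dominant", dominant(1.5)), ("recessive", recessive(1.5))]:
    print(label, round(log_likelihood(subjects, b_het, b_hom), 3))
```

A codominant analysis would simply pass two separate values for the heterozygote and homozygote log relative risks; a genotype-specific baseline could be sketched by letting the shape and scale depend on genotype instead.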

Increasingly, gene characterization efforts have been directed towards trying to understand complex pathways involving multiple genes and multiple exposures jointly, particularly for common polymorphisms in low-penetrance metabolic genes. For example, hypothesized causes of colorectal polyps and cancer include polycyclic aromatic hydrocarbons (PAHs) and heterocyclic amines (HCAs), which derive from tobacco smoke and well-done red meat (98). The metabolic activation and detoxification of these compounds are regulated by a number of genes, including several cytochrome P-450 enzymes (such as CYP1A1 and CYP1A2), various glutathione-S-transferases (such as GSTM3), N-acetyltransferases (NAT1 and NAT2), and microsomal epoxide hydrolase (mEH, also known as EPHX1) (99). The complexity of these pathways makes it difficult to examine the effects of these exposures or these genes one at a time, or even in pairwise interactions, without allowing for the influence of the other factors, but the problems of sparse data and multiple comparisons preclude standard approaches based on multi-way stratification. Cortessis and Thomas (100) have proposed a Bayesian approach to such problems using physiologically based pharmacokinetic (PBPK) models. In essence, the approach entails estimating the concentrations of the various intermediate metabolites for each subject as a function of the measured exposures and a set of unmeasured metabolic rates, which are in turn determined by the subject's genotypes at the relevant loci, and relating the estimated concentrations of the relevant metabolites to the disease risk. The distributions of the various individual parameters are determined by a set of population parameters that are the primary object of inference, e.g., regression coefficients for the contributions of exposures to pathways or of pathways to disease, means and variances of metabolic rates as a function of genotype, etc.
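The structure of such a pathway model (exposure and genotype-dependent metabolic rates determine a metabolite concentration, which in turn drives disease risk) can be conveyed by a deliberately simplified, deterministic toy. This is not the model of Cortessis and Thomas (100): there is no Bayesian estimation and no between-subject variation in rates, and the single activation/detoxification step, the "slow"/"fast" genotype labels, the rate values, and the logistic coefficients are all hypothetical.

```python
import math

# Hypothetical genotype-specific mean metabolic rates (arbitrary units).
ACTIVATION_RATE = {"slow": 0.5, "fast": 2.0}
DETOX_RATE = {"slow": 0.5, "fast": 2.0}

def steady_state_metabolite(exposure, activation_genotype, detox_genotype):
    """Steady-state level of an activated metabolite in a one-step toy pathway.

    The metabolite is formed at rate exposure * k_act and cleared at rate
    k_detox * concentration, so at steady state the concentration is
    exposure * k_act / k_detox.
    """
    k_act = ACTIVATION_RATE[activation_genotype]
    k_detox = DETOX_RATE[detox_genotype]
    return exposure * k_act / k_detox

def disease_risk(metabolite, intercept=-3.0, slope=0.5):
    """Logistic model relating metabolite concentration to disease probability."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * metabolite)))

# Same exposure, different genotype combinations (illustrative values only).
exposure = 2.0
for act in ("slow", "fast"):
    for detox in ("slow", "fast"):
        m = steady_state_metabolite(exposure, act, detox)
        print(act, detox, round(m, 2), round(disease_risk(m), 3))
```

In a full analysis the individual rates would be treated as unobserved quantities with genotype-dependent distributions, and the population-level means, variances, and regression coefficients would be estimated rather than fixed as they are here.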

SUMMARY

Both population-based and family-based designs have their uses in testing candidate gene associations and characterizing genes once their causal connection to a disease has been established. Appropriately designed, such studies can also be a useful resource for discovering other genes that may be involved. Non-Mendelian disorders may involve a complex interplay between multiple genes and multiple environmental factors, as well as age and other time-dependent factors, requiring sophisticated methods of analysis. While survival analysis techniques, such as Cox regression, can provide a flexible framework for empirical modeling of penetrance functions, mechanistic models such as PBPK models for complex metabolic networks can also be useful. Stochastic models of carcinogenesis, which have long been used to describe exposure-time-response relationships for environmental exposures, might usefully be extended to incorporate the influence of germline mutations or such epigenetic phenomena as microsatellite instability and DNA methylation.

REFERENCES

1. Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science 1996;273:
2. Brown D, Gorin M, Weeks D. Efficient strategies for genomic searching using the affected-pedigree-member method of linkage analysis. Am J Hum Genet 1994;54:
3. Elston R, Guo X, Williams L. Two-stage global search designs for linkage analysis using pairs of affected relatives. Genet Epidemiol 1996;13:
4. Kruglyak L, Lander E. Complete multipoint sib-pair analysis of qualitative and quantitative traits. Am J Hum Genet 1995;57:
5. Stephens J, Briscoe D, O'Brien S. Mapping by admixture linkage disequilibrium in human populations: limits and guidelines. Am J Hum Genet 1994;55:
6. Jorde L. Linkage disequilibrium as a gene-mapping tool. Am J Hum Genet 1995;56:
7. Houwen R, Baharloo S, Blankenship K, et al. Genome screening by searching for shared segments: mapping a gene for benign recurrent intrahepatic cholestasis. Nature Genet 1994;8:
8. Qian D, Thomas D. Genome scan of complex traits by haplotype sharing correlation. Genet Epidemiol 2001;21:S582-S
9. Te Meerman G, Van Der Meulen M. Genomic sharing surrounding alleles identical by descent: effects of genetic drift and population growth. Genet Epidemiol 1997;14:
10. Bourgain C, Genin E, Holopainen P, et al. Use of closely related affected individuals for the genetic study of complex diseases in founder populations. Am J Hum Genet 2001;68:
11. Chiano M, Clayton D. Fine genetic mapping using haplotype analysis and the missing data problem. Ann Hum Genet 1998;62:
12. Thomas D, Morrison J, Clayton D. Bayes estimates of haplotype effects. Genet Epidemiol 2001;21 (Suppl 1):S712-S
13. McPeek M, Strahs A. Assessment of linkage disequilibrium by the decay of haplotype sharing, with application to fine-scale genetic mapping. Am J Hum Genet 1999;65:
14. Morris A, Whittaker J, Balding D. Bayesian fine-scale mapping of disease loci, by hidden Markov models. Am J Hum Genet 2000;67:
15. Niu T, Qin ZS, Xu X, Liu JS. Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am J Hum Genet 2002;70:
16. Breslow NE, Day NE. Statistical methods in cancer research. I. The analysis of case-control studies. Lyon: IARC Scientific Publications,
17. Rothman KJ, Greenland S. Modern epidemiology. Philadelphia: Lippincott-Raven,
18. Kleinbaum DG, Kupper LL, Morgenstern H. Epidemiologic research: Principles and quantitative methods. Belmont, CA: Lifetime Learning Publications,
19. Garcia-Closas M, Wacholder S, Caporaso N, Rothman N. Inference issues in epidemiological studies of genetic effects and gene-environment interactions. In: Khoury MJ, Little J, Burke W, eds. Human Genome Epidemiology: Scientific basis for using genetic information to improve health and prevent disease, 2002:Chapter 7.
20. Clayton DG, McKeigue PM. Epidemiological methods for studying genes and environmental factors in complex diseases. Lancet 2001;358:
21. Langholz B, Rothman N, Wacholder S, Thomas D. Cohort studies for characterizing measured genes. Monogr Natl Cancer Inst 1999;26:
22. Mantel N. Synthetic retrospective studies and related topics. Biometrics 1973;29:
23. Prentice R. A case-cohort design for epidemiologic studies and disease prevention trials. Biometrika 1986;73:
24. Breslow NE, Day NE. Statistical methods in cancer research. II. The design and analysis of cohort studies. Lyon: IARC Scientific Publications,
25. Thomas DC. New approaches to the analysis of cohort studies. Epidemiol Rev 1998;14:
26. White J. A two stage design for the study of the relationship between a rare exposure and a rare disease. Am J Epidemiol 1982;1982:
27. Breslow N, Cain K. Logistic regression for two-stage case-control data. Biometrika 1988;75:
28. Whittemore A, Halpern J. Multi-stage sampling in genetic epidemiology. Stat Med 1997;16:
29. Siegmund K, Whittemore A, Thomas D. Multistage sampling for disease family registries. Monogr Natl Cancer Inst 1999;26:
30. Langholz B, Borgan O. Counter-matching: a stratified nested case-control sampling method. Biometrika 1995;82:
31. Andrieu N, Goldstein A, Langholz B, Thomas D. Counter-matching in gene-environment interaction studies: efficiency and feasibility. Am J Epidemiol 2001;153:
32. Lander ES, Schork NJ. Genetic dissection of complex traits. Science 1994;265:
33. Caporaso N, Rothman N, Wacholder S. Case-control studies of common alleles and environmental factors. Monogr Natl Cancer Inst 1999;26:
34. Wacholder S, Rothman N, Caporaso N. Population stratification in epidemiologic studies of common genetic variants and cancer: quantification of bias. J Natl Cancer Inst 2000;92:
35. Wacholder S, Rothman N, Caporaso N. Counterpoint: Bias from population stratification is not a major threat to the validity of conclusions from epidemiologic studies of common polymorphisms and cancer. Cancer Epidemiol Biomarkers Prev 2002;11:in press.
36. Thomas D, Witte J. Population stratification: A problem for case-control studies of candidate gene associations? Cancer Epidemiol Biomarkers Prev 2001:Under review.
37. Devlin B, Roeder K. Genomic control for association studies. Biometrics 1999;55:
38. Reich DE, Goldstein DB. Detecting association in a case-control study while correcting for population stratification. Genet Epidemiol 2001;20:
39. Pritchard JK, Stephens M, Rosenberg NA, Donnelly P. Association mapping in structured populations. Am J Hum Genet 2000;67:
40. Satten GA, Flanders WD, Yang Q. Accounting for unmeasured population substructure in case-control studies of genetic association using a novel latent-class model. Am J Hum Genet 2001;68:
41. Umbach D, Weinberg C. Designing and analysing case-control studies to exploit independence of genotype and exposure. Stat Med 1997;16:
42. Khoury M, Flanders W. Nontraditional epidemiologic approaches in the analysis of gene-
