Flexible Matching in Case-Control Studies of Gene-Environment Interactions


American Journal of Epidemiology, Vol. 159, No. 1. Copyright 2004 by the Johns Hopkins Bloomberg School of Public Health. All rights reserved. Printed in U.S.A. DOI: 10.1093/aje/kwg250

ORIGINAL CONTRIBUTIONS

Catherine L. Saunders and Jennifer H. Barrett

From the Genetic Epidemiology Division, Cancer Research UK Clinical Centre at Leeds, Leeds, United Kingdom.

Received for publication January 4, 2003; accepted for publication May 23, 2003.

Because of the lack of power of case-control study designs to detect gene-environment interactions, flexible matching has recently been proposed as a method of improving efficiency. In this paper, the authors consider a large-sample approximation method that allows estimation of the most efficient matching strategy when genotype and exposure are either independent or associated. The authors provide tables of the sample sizes required to detect gene-environment interactions if this flexible matching strategy is followed, and they make brief comparisons with other study designs.

case-control studies; epidemiologic methods; interaction; research design; statistics

The potential of matching strategies to improve statistical power for detection of gene-environment interactions has been debated (1–4). Detecting the departure from multiplicative joint effects of two risk factors for disease (as in gene-environment interaction) is important in understanding how risk factors act together in complex diseases (5) and in identifying high-risk groups. The power of a case-control study to detect interactions is low compared with the power to detect main effects (1). This has resulted in many different designs being proposed as strategies for improving power. In addition to matching strategies, because genetic risk factors are often under study, family designs have also been considered for studies of gene-gene or gene-environment interactions (6–8).
Unlike family studies of risk factor main effects (9), these designs have been found to have the potential to improve power to detect interactions in some situations. A design that samples only cases has also been proposed (10). The improvement in power for this design is large. However, if the risk factors under study are not independent in the population from which the cases are sampled, the false-positive rate with this design can become greatly inflated (11). Therefore, matching strategies are one of several approaches to improving the power to detect gene-environment interactions.

Stürmer and Brenner (3) recently proposed the use of flexible matching to address this problem. By increasing the prevalence of the environmental exposure in controls above the prevalence in cases, the authors showed that this method could offer a substantial improvement in statistical power. There are many scenarios in which environmental exposure, or at least a proxy thereof, may be measured in a relatively large set of potential controls. For example, in a case-control study of the interaction between genetic risk factors and smoking in relation to bladder cancer, potential controls could be screened by means of a simple question asking whether or not they had ever been a regular smoker. When interest is in interaction rather than the main effect of smoking, controls could then be selected for genotyping and detailed exposure evaluation by sampling according to their response to this question. Similarly, in a case-control study of sun exposure and the genes involved in melanoma risk, potential controls who had lived for some time in a hot country might be oversampled to improve power to test for gene-environment interactions. Using frequency matching strategies, the researchers would sample controls to have the same exposure frequency as the cases, whereas with a flexible matching strategy they would seek to sample exposure among controls at the frequency that maximized the power to test for interactions.
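The screening-then-sampling step described above can be sketched in a few lines. This is only an illustration: the function name `sample_flexible_controls`, the `(control_id, exposed)` record layout, and the pool of 10,000 screened people are hypothetical and do not come from the paper.

```python
import random


def sample_flexible_controls(pool, n_controls, target_exposure_freq, seed=0):
    """Draw controls so that a chosen fraction of them is exposed.

    pool: list of (control_id, exposed) pairs from the screening question.
    """
    rng = random.Random(seed)
    exposed = [person for person in pool if person[1]]
    unexposed = [person for person in pool if not person[1]]

    n_exposed = round(n_controls * target_exposure_freq)
    n_unexposed = n_controls - n_exposed
    if n_exposed > len(exposed) or n_unexposed > len(unexposed):
        raise ValueError("screened pool is too small for this exposure frequency")

    # Simple random sampling without replacement within each exposure stratum.
    return rng.sample(exposed, n_exposed) + rng.sample(unexposed, n_unexposed)


# Hypothetical screened pool: 10,000 potential controls, 10 percent ever-smokers.
pool = [(i, i % 10 == 0) for i in range(10_000)]
controls = sample_flexible_controls(pool, n_controls=500, target_exposure_freq=0.5)
exposed_fraction = sum(1 for _, is_exposed in controls if is_exposed) / len(controls)
```

Only the selected controls would then go forward to genotyping and detailed exposure assessment; the target frequency is a design choice, which the remainder of the paper addresses.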
Stürmer and Brenner's simulations showed that the optimal degree of matching for exposure could be found in different situations. However, they concluded, "Given the strong dependence of the power and efficiency gains by matching on the multiple parameters, general recommendations as to the best degree of matching in all settings are difficult, if not impossible" (3, p. 599).

Correspondence to Dr. Jennifer Barrett, Genetic Epidemiology Division, Cancer Research UK Clinical Centre at Leeds, Leeds LS9 7TF, United Kingdom (e-mail: jenny.barrett@cancer.org.uk).

Am J Epidemiol 2004;159:17–22

In this paper, we use a large-sample approximation of the variance of the interaction odds ratio to show that the exposure frequency among flexibly matched controls that minimizes the variance of the interaction odds ratio, and thus maximizes the power of this design, can be estimated.

METHODS

Using the notation of Stürmer and Brenner (3), let $p_{ij}$, $p_{ijc}$, and $p_{ijm}$ be the proportions of persons with level $i$ of the environmental exposure (the matching factor; $i = 1$ (0) if the environmental exposure is present (absent)) and genetic susceptibility $j$ ($j = 1$ (0) if the genetic susceptibility is present (absent)) in the population, in cases, and in matched controls, respectively. In the same way, let $n_{ij}$, $n_{ijc}$, and $n_{ijm}$ be the numbers of persons in each group in a study. The variance of the log of the interaction odds ratio for departure from multiplicative joint effects can be estimated as follows for a population-based case-control study (12):

$$\sum_{i,j} \frac{1}{n_{ijc}} + \sum_{i,j} \frac{1}{n_{ij}}. \tag{1}$$

By a similar argument, the variance of the interaction effect from a study using flexible matching can be estimated by

$$\sum_{i,j} \frac{1}{n_{ijc}} + \sum_{i,j} \frac{1}{n_{ijm}}. \tag{2}$$

Here, the contribution to the variance of the log of the interaction odds ratio due to the population-based controls in equation 1, $\sum_{i,j} 1/n_{ij}$, is replaced by the contribution from the flexibly matched controls, $\sum_{i,j} 1/n_{ijm}$, in equation 2. Therefore, the degree of matching that optimizes the efficiency of the flexible matching design will be the degree that minimizes this variance. Because the flexible matching technique samples population-based cases, the variance in the interaction term that is due to the cases is unaffected by the matching strategy. Thus, the optimum strategy can be determined by finding the frequency of the environmental factor among controls that minimizes $\sum_{i,j} 1/n_{ijm}$ or, equivalently, that minimizes $1/p_{00m} + 1/p_{01m} + 1/p_{10m} + 1/p_{11m}$.

Let $M_E$ be the frequency of the matching factor (exposure) among flexibly matched controls, and let $P_G$ be the frequency of the genotype in the source population. When the two risk factors are independent in the source population, this term can be written as

$$\frac{1}{(1 - M_E)(1 - P_G)} + \frac{1}{(1 - M_E)P_G} + \frac{1}{M_E(1 - P_G)} + \frac{1}{M_E P_G},$$

which simplifies to $1/[P_G(1 - P_G)M_E(1 - M_E)]$. Finding the value for $M_E$ that minimizes this variance is equivalent to finding a maximum for $P_G(1 - P_G)M_E(1 - M_E)$, which can be done by differentiating with respect to $M_E$ and setting the derivative to zero. Unsurprisingly, the variance is minimized when $M_E = 0.5$.

When the two risk factors are not independent, the most efficient frequency for the exposure sampling depends on both the odds ratio for the association between the genotype and the exposure in the source population (see the Appendix in Stürmer and Brenner (3)) and the frequencies of the two risk factors. The optimum frequency at which to sample exposure among controls ($M_E$) can be estimated using the following equation, where $P_E$ is the population exposure frequency and $p_{00}$, $p_{01}$, $p_{10}$, and $p_{11}$ are, as before, the proportions of the population/unmatched controls with the different exposure/genotype combinations:

$$M_E = \frac{P_E\sqrt{p_{00}\,p_{01}}}{P_E\sqrt{p_{00}\,p_{01}} + (1 - P_E)\sqrt{p_{10}\,p_{11}}}. \tag{3}$$

Further details are given in the Appendix.

The sample size required to detect interactions is calculated using the method of Self et al. (13–15). Briefly, the likelihood ratio test statistic for the interaction asymptotically follows a noncentral chi-squared distribution under the alternative hypothesis. A large exemplary data set with the risk factor frequencies among cases and controls expected under the alternative hypothesis is analyzed using standard statistical software. The likelihood ratio test statistic is the noncentrality parameter for this distribution. The required sample size is simply inversely proportional to this noncentrality parameter, which allows the application of this method to a wide range of designs.

RESULTS

Table 1 shows the exposure frequency that maximizes the efficiency of a study to detect interactions over a range of control group genotype frequencies and magnitudes of risk factor associations (odds ratio for the association between genotype and exposure ($\mathrm{OR}_{GE}$)). We give the optimum exposure frequencies at particular matched control genotype frequencies rather than for specific population exposure and genotype frequencies. In practice, when exposure is sampled at a specific frequency among controls, this will also affect the frequency of the genotype, unless risk factors are independent; thus, the genotype frequency among flexibly matched controls will not always be the same as that in the source population. (This is reflected in table 3, where the optimum matching frequencies for exposure are expressed with respect to the population genotype frequency and are slightly different from those in table 1.)

TABLE 1. Optimum flexible matching exposure frequencies when risk factors are not independent

            Genotype frequency (proportion) among flexibly matched controls
OR_GE*      0.01    0.1     0.3     0.5     0.7     0.9
0.5         0.66    0.63    0.57    0.5     0.43    0.37
0.8         0.55    0.54    0.52    0.5     0.48    0.46
0.9         0.53    0.52    0.51    0.5     0.49    0.48
1           0.5     0.5     0.5     0.5     0.5     0.5
1.1         0.48    0.48    0.49    0.5     0.51    0.52
1.2         0.46    0.46    0.48    0.5     0.52    0.54
1.3         0.44    0.45    0.47    0.5     0.53    0.55
1.4         0.42    0.43    0.47    0.5     0.53    0.57
1.5         0.40    0.42    0.46    0.5     0.54    0.58
1.8         0.36    0.39    0.44    0.5     0.56    0.61
2           0.34    0.37    0.43    0.5     0.57    0.63
5           0.17    0.23    0.37    0.5     0.63    0.77
10          0.10    0.17    0.34    0.5     0.66    0.83

* OR_GE, odds ratio for the association between genotype and exposure (defined as $p_{00}p_{11}/(p_{01}p_{10})$).

When exposure/genotype combination frequencies are known among unmatched controls, applying equation 3 is the simplest way to calculate the optimum exposure frequency if risk factors are not independent. When the association between risk factors is small or the genotype frequency among controls is close to 0.5, a frequency for the matching factor of 0.5 remains the most efficient. In addition, the values shown in table 1 confirm the finding in Stürmer and Brenner (3) that when there is a strong positive association between risk factors and genotype frequency is low, the optimum degree of matching is smaller than when there is less association or no association.

To consider the practical use of the flexible matching design, we calculated required sample sizes under this optimal matching strategy for a range of magnitudes of risk factor effects and frequencies. These complement the relative efficiencies presented by Stürmer and Brenner (3).
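Equation 3 is straightforward to evaluate once the four population cell proportions are known. The sketch below is a minimal illustration, not the authors' code: the helper `cell_proportions` recovers the cells from $P_E$, $P_G$, and $\mathrm{OR}_{GE}$ by solving the usual quadratic for a 2 x 2 table with fixed margins, which is an assumption made here because the text defers that step to Stürmer and Brenner (3). Function names are hypothetical.

```python
from math import sqrt


def cell_proportions(p_e, p_g, or_ge):
    """Return (p00, p01, p10, p11); first index is exposure, second is genotype.

    Assumption: cells are recovered from the margins and the odds ratio
    or_ge = p00 * p11 / (p01 * p10) by solving a quadratic for p11.
    """
    if abs(or_ge - 1.0) < 1e-12:
        p11 = p_e * p_g  # independence
    else:
        a = or_ge - 1.0
        b = 1.0 + a * (p_e + p_g)
        c = or_ge * p_e * p_g
        p11 = (b - sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    p10 = p_e - p11
    p01 = p_g - p11
    p00 = 1.0 - p_e - p_g + p11
    return p00, p01, p10, p11


def optimum_exposure_frequency(p_e, p_g, or_ge):
    """Equation 3: exposure frequency among controls that minimizes the variance."""
    p00, p01, p10, p11 = cell_proportions(p_e, p_g, or_ge)
    numerator = p_e * sqrt(p00 * p01)
    return numerator / (numerator + (1.0 - p_e) * sqrt(p10 * p11))


m_independent = optimum_exposure_frequency(0.3, 0.1, 1.0)
m_associated = optimum_exposure_frequency(0.5, 0.2, 2.0)
```

With independent risk factors the function returns 0.5 regardless of the margins; for $P_E = 0.5$, $P_G = 0.2$, and $\mathrm{OR}_{GE} = 2$ it returns a value that rounds to 0.45, in line with the corresponding entry of table 3.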
Sample sizes needed (number of cases required, assuming equal numbers of cases and controls) for a statistical power of 80 percent and a two-sided significance level of 0.05 are presented in table 2. Situations in which exposure and genotype are independent are considered first; therefore, in the flexible matching design, exposure frequency among controls is simply sampled at 50 percent, and required sample sizes are provided for comparison with an unmatched population-based study and a case-only design. We consider a situation with a rare disease (population frequency = 0.1 percent) and genotype main effect (relative risk of disease among unexposed people with the susceptibility genotype compared with people exposed to neither risk factor) equaling 2. Required sample sizes are provided for a range of genotype and exposure frequencies and magnitudes of interaction and main effects.

It can be seen from table 2 that the sample size requirements for the flexible matching design are always lower, and can be substantially lower, than those for the population-based case-control design, especially when the exposure is relatively rare (frequency 0.1). Although the sample size requirements for the case-only design are lower still, the flexible matching design does not require the assumption of independence of risk factors that makes the case-only design untenable in many situations.

Situations where exposure and genotype are not independent are also shown. Sample size requirements are not presented for the case-only design, because it would not be an appropriate strategy in these situations. The flexible matching strategy, however, still shows a significant reduction in the required sample size in comparison with the population-based controls in all situations. Table 3 shows the optimal frequencies at which exposure is sampled for table 2. When genotype and exposure are independent, this frequency is 50 percent.
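For a rough feel of how the control exposure frequency drives the required number of cases, the variance in equations 1 and 2 can be plugged into a Wald-type sample size formula. This is a deliberately simpler technique than the likelihood ratio method of Self et al. used for table 2, so its results are not expected to agree exactly with the table. It also assumes a rare disease, so that case proportions are proportional to population proportions times relative risk. All function names are hypothetical.

```python
from math import ceil, log
from statistics import NormalDist


def case_proportions(pop, rr_e, rr_g, rr_int):
    """Cell proportions among cases under a rare-disease assumption.

    pop = (p00, p01, p10, p11); first index is exposure, second is genotype.
    """
    p00, p01, p10, p11 = pop
    weights = [p00, p01 * rr_g, p10 * rr_e, p11 * rr_e * rr_g * rr_int]
    total = sum(weights)
    return [w / total for w in weights]


def cases_required(case_props, control_props, rr_int, alpha=0.05, power=0.80):
    """Wald-type approximation with equal numbers of cases and controls."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    # Variance of the log interaction odds ratio for one case and one control.
    unit_variance = sum(1 / p for p in case_props) + sum(1 / p for p in control_props)
    return ceil(z * z * unit_variance / log(rr_int) ** 2)


# Independent risk factors: exposure frequency 0.1, genotype frequency 0.2.
population = (0.72, 0.18, 0.08, 0.02)
flexible = (0.40, 0.10, 0.40, 0.10)  # exposure resampled to 50 percent
cases = case_proportions(population, rr_e=1.0, rr_g=2.0, rr_int=3.0)

n_population = cases_required(cases, population, rr_int=3.0)
n_flexible = cases_required(cases, flexible, rr_int=3.0)
```

Only the control term of the variance changes between the two calls, which is why oversampling exposed controls lowers the requirement without touching the case series.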
Because changing the frequency at which exposure is sampled will also alter the genotype frequency among flexibly matched controls when genotype and exposure are not independent, both genotype and exposure population frequencies, as well as the magnitude of their association, affect the optimal matching frequency for exposure.

TABLE 2. Numbers of cases required to detect interactions for the flexible matching, case-control, and case-only study designs*

Columns: exposure frequency; exposure main effect; interaction effect; then, for each genotype frequency (0.01, 0.1, and 0.2), the number of cases required under the flexible matching, case-control, and case-only designs.

OR_GE† = 1: 0..5 39,549 69,506 2,76 4,98 8,328 2,862 3,55 5,4,978 3 4,320 7,454,855 554 943 274 367 6 203 2.5 3,75 6,479 2,982 3,853 7,236,770 2,46 4,386,233 3 3,767 6,822,7 467 847 79 30 54 36 0.3.5 28,486 3,306 9,369 3,403 3,736,294 2,03 2,303 9 3 3,568 3,739 877 435 463 40 277 298 2.5 27,485 30,287 8,3 3,267 3,599,39 2,0 2,20 85 3 3,558 3,728 795 435 463 35 279 300 2 0.5.5 27,556 27,556 8, 3,278 3,278,44 2,09 2,09 82 3 3,582 3,582 802 439 439 38 283 283 6 2.5 29,043 29,043 9,347 3,494 3,494,340 2,79 2,79 976 3 3,86 3,86 96 482 482 76 320 320 53

OR_GE = 1.5: 0..5 32,09 52,4 4,230 6,740 2,887 4,503 3 3,666 5,743 496 788 348 554 2.5 26,734 46,563 3,399 5,889 2,246 3,849 3 3,299 5,320 427 73 290 49 0.3.5 26,880 28,282 3,257 3,456 2,05 2,92 3 3,474 3,58 429 444 279 293 2.5 26,494 27,89 3,90 3,388,994 2,35 3 3,54 3,560 436 45 285 300 0.5.5 28,927 29,309 3,4 3,438 2,085 2,093 3 3,847 3,949 468 475 299 30 2.5 3,0 3,472 3,70 3,736 2,29 2,298 3 4,54 4,252 523 529 345 347

OR_GE = 2: 0..5 28,523 43,60 3,907 5,973 2,787 4,233 3 3,363 4,95 47 75 344 532 2.5 24,357 39,303 3,92 5,240 2,87 3,62 3 3,085 4,597 4 650 289 474 0.3.5 26,80 27,555 3,248 3,379 2,057 2,68 3 3,536 3,527 436 445 285 296 2.5 26,799 27,546 3,225 3,356 2,023 2,33 3 3,609 3,603 449 458 295 306 0.5.5 3,03 32,040 3,63 3,680 2,79 2,99 3 4,95 4,44 504 59 37 322 2.5 33,889 34,805 3,984 4,049 2,424 2,444 3 4,570 4,782 569 584 37 375

* In the flexible matching and case-control designs, the number of controls is equal to the number of cases.
† OR_GE, odds ratio for the association between genotype and exposure (defined as $p_{00}p_{11}/(p_{01}p_{10})$).

DISCUSSION

The relative efficiency under the four scenarios given in Stürmer and Brenner's (3) table 2 can also be estimated by the ratio of the variances calculated using this large-sample approximation method. Although, for each scenario, both methods (simulation in the paper by Stürmer and Brenner (3) and approximation here) gave the same degree of matching as the most efficient, the magnitudes of the relative efficiencies were slightly different. Discrepancies can be attributed to equation 2, which, though widely used, only calculates an asymptotic approximation to the variance of the interaction odds ratio.

TABLE 3. Flexible matching exposure frequencies for table 2

                        Genotype frequency
Exposure frequency      0.01    0.1     0.2
OR_GE* = 1
  0.1                   0.5     0.5     0.5
  0.3                   0.5     0.5     0.5
  0.5                   0.5     0.5     0.5
OR_GE = 1.5
  0.1                   0.45    0.46    0.47
  0.3                   0.45    0.46    0.47
  0.5                   0.45    0.46    0.47
OR_GE = 2
  0.1                   0.42    0.44    0.46
  0.3                   0.42    0.43    0.45
  0.5                   0.42    0.43    0.45

* OR_GE, odds ratio for the association between genotype and exposure (defined as $p_{00}p_{11}/(p_{01}p_{10})$).

Strategies similar to flexible matching for interactions have been discussed previously. Cain and Breslow (16) discussed a strategy similar to the one detailed above for improving power to detect interactions and main effects. They considered a situation where exposure information on cases and controls was available before sampling of the particular controls for which more detailed information would be collected (in this case, genotyping). They advocated a strategy in which controls are sampled with balanced numbers from each exposure stratum. Cain and Breslow found that the balanced design is always much more powerful than the unstratified design for detecting interactions. Indeed, the only time they found the strategy less efficient was when there is a strong negative correlation between the variables that are measured in the first and second stages; this is also reflected here in the case where the optimum sampling frequency for the exposure is potentially greater or less than 50 percent when the two risk factors are strongly associated.

Breslow and Cain (17) similarly recognized for the two-stage design that unbiased estimates of the interaction parameter can be obtained from an unmatched analysis even though the exposure is used as a matching factor, in the same way as for the flexible matching design. However, estimates of the population exposure frequency can also be used to additionally allow estimation of the exposure main effects.
This is an aspect that could also be applied to the flexible matching design if, at the control sampling stage of the study, an estimate of the population exposure frequency could be made, or if the controls were being sampled from a preexisting cohort for which exposure information was available. At the analysis stage, the log of the exposure group frequency (i.e., exposed or unexposed) is used as an offset in the logistic regression model, to retrieve unbiased estimates of exposure main effects. One advantage of this result is that the offset has no effect on the power of the design to detect interactions (17). Thus, if this information is not available, this does not detract from the strength of the design for detecting interactions.

Understanding how the power of the flexible matching design can be optimized is helpful in understanding comparisons between different designs that have been proposed as strategies for detecting interactions. Table 2 reflects well that although the exposure frequency among controls is chosen to minimize the variance, the decrease in the required sample size is still small in comparison with the case-only design, where there is no component of variance in the interaction estimate due to the controls. The inappropriateness of the case-only design in the presence of risk factor association and concerns about the false-positive rate when the independence assumption is violated (11, 18) mean that alternative strategies are still attractive and should be explored.

By considering the large-sample approximation to the variance of the interaction parameter for the flexible matching design, one can see why using family controls has the potential to improve the power to detect interactions (7, 8).
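The point about the offset made earlier in this section can be checked numerically with expected cell proportions. The sketch below assumes, as an interpretation of the offset described in the text, that the correction amounts to the log ratio of the exposure odds among sampled controls to the exposure odds in the population; the scenario values and variable names are hypothetical, and a rare disease is assumed when deriving the case proportions.

```python
from math import exp, isclose, log


def odds(p):
    return p / (1.0 - p)


def cross_ratio(cells):
    """p00 * p11 / (p01 * p10) for cells ordered (p00, p01, p10, p11)."""
    p00, p01, p10, p11 = cells
    return p00 * p11 / (p01 * p10)


p_e, m_e = 0.1, 0.5                       # population and sampled exposure frequency
population = (0.72, 0.18, 0.08, 0.02)     # independent factors, genotype frequency 0.2
rr_e, rr_g, rr_int = 2.0, 2.0, 3.0        # hypothetical true effects

weights = (population[0], population[1] * rr_g,
           population[2] * rr_e, population[3] * rr_e * rr_g * rr_int)
cases = tuple(w / sum(weights) for w in weights)

# Flexibly matched controls: rescale each exposure stratum (see the Appendix).
controls = (population[0] * (1 - m_e) / (1 - p_e), population[1] * (1 - m_e) / (1 - p_e),
            population[2] * m_e / p_e, population[3] * m_e / p_e)

# The interaction odds ratio is untouched by the exposure-based sampling.
interaction_or = cross_ratio(cases) / cross_ratio(controls)

# The crude exposure odds ratio (among noncarriers) is distorted by the sampling.
crude_exposure_or = (cases[2] / cases[0]) / (controls[2] / controls[0])
offset = log(odds(m_e)) - log(odds(p_e))
corrected_exposure_or = crude_exposure_or * exp(offset)
```

In this constructed example the interaction odds ratio equals the true value without any correction, while the exposure main effect is recovered only after the offset is applied, which matches the statement that missing population exposure information does not weaken the design for detecting interactions.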
When risk factors are rare (and this is the situation in which most improvement in power from family designs has been observed), the exposure frequencies among controls are raised above the population levels towards the optimal frequency of 50 percent due to within-family correlation of genetic, and to a lesser extent environmental, risk factors. Similar arguments can be considered for other designs, such as the design that compares case subjects who have two primary cancers with cases who have only one primary cancer (19). This sampling strategy will increase the prevalence of rare risk factors among all study participants, again decreasing the variance of the interaction parameter and increasing power.

Matching strategies such as flexible matching are often the most rational approach to choosing an efficient design for detecting interactions, if the assumption of independence of genotype and exposure that is required for the case-only design proves untenable (11). The strategies described here can be used to find the most informative risk factor frequencies. If the population exposure frequency is known, the theory from two-stage designs can be incorporated at the analysis stage to estimate the main effects of the matching variables. This further increases the attractiveness of these designs.

REFERENCES

1. Smith PG, Day NE. The design of case-control studies: the influence of confounding and interaction effects. Int J Epidemiol 1984;13:356–65.
2. Thomas DC, Greenland S. The efficiency of matching in case-control studies of risk-factor interactions. J Chronic Dis 1985;38:569–74.
3. Stürmer T, Brenner H. Flexible matching strategies to increase power and efficiency to detect and estimate gene-environment interactions in case-control studies. Am J Epidemiol 2002;155:593–602.

4. Stürmer T, Brenner H. Potential gain in efficiency and power to detect gene-environment interactions by matching in case-control studies. Genet Epidemiol 2000;18:63–80.
5. Brennan P. Gene-environment interaction and aetiology of cancer: what does it mean and how can we measure it? Carcinogenesis 2002;23:381–7.
6. Gauderman WJ. Sample size requirements for association studies of gene-gene interaction. Am J Epidemiol 2002;155:478–84.
7. Gauderman WJ. Sample size requirements for matched case-control studies of gene-environment interaction. Stat Med 2002;21:35–50.
8. Witte JS, Gauderman WJ, Thomas DC. Asymptotic bias and efficiency in case-control studies of candidate genes and gene-environment interactions: basic family designs. Am J Epidemiol 1999;149:693–705.
9. Schaid DJ, Rowland C. Use of parents, sibs, and unrelated controls for detection of associations between genetic markers and disease. Am J Hum Genet 1998;63:1492–506.
10. Piegorsch WW, Weinberg CR, Taylor JA. Non-hierarchical logistic models and case-only designs for assessing susceptibility in population-based case-control studies. Stat Med 1994;13:153–62.
11. Albert PS, Ratnasinghe D, Tangrea J, et al. Limitations of the case-only design for identifying gene-environment interactions. Am J Epidemiol 2001;154:687–93.
12. Cuzick J. Interaction, subgroup analysis and sample size. In: Boffetta P, Caporaso N, Cuzick J, et al, eds. Metabolic polymorphisms and susceptibility to cancer. Lyon, France: International Agency for Research on Cancer, 1999:109–21. (IARC Scientific Publication no. 148).
13. Self SG, Mauritsen RH, Ohara J. Power calculations for likelihood ratio tests in generalized linear models. Biometrics 1992;48:31–9.
14. Brown BW, Lovato J, Russell K. Asymptotic power calculations: description, examples, computer code. Stat Med 1999;18:3137–51.
15. Longmate JA. Complexity and power in case-control association studies. Am J Hum Genet 2001;68:1229–37.
16. Cain KC, Breslow NE. Logistic regression analysis and efficient design for two-stage studies. Am J Epidemiol 1988;128:1198–206.
17. Breslow NE, Cain KC. Logistic regression for two-stage case-control data. Biometrika 1988;75:11–20.
18. Saunders CL, Gooptu C, Bishop DT, et al. The use of case only studies for the detection of interactions, and the non-independence of genetic and environmental risk factors for disease. (Abstract). Genet Epidemiol 2001;21:174.
19. Begg CB, Berwick M. A note on the estimation of relative risks of rare genetic susceptibility markers. Cancer Epidemiol Biomarkers Prev 1997;6:99–103.

APPENDIX

As before, let $p_{ij}$ and $p_{ijm}$ be the proportions of persons with level $i$ of the environmental exposure (the matching factor; $i = 1$ (0) if the environmental exposure is present (absent)) and genetic susceptibility $j$ ($j = 1$ (0) if the genetic susceptibility is present (absent)) in the population and in matched controls, respectively. Let $P_E$ be the exposure frequency in the source population, and let $M_E$ be the exposure frequency among flexibly matched controls.

The $p_{ij}$'s are calculated following the method of Stürmer and Brenner (3). They depend on the genotype and exposure frequencies and the magnitude of the association between the two factors. Alternatively, if an unmatched control group were available, then the values of the proportions $p_{ij}$ could be observed directly. The proportions of persons with each genotype/exposure combination when controls are selected under a flexible matching scheme, $p_{ijm}$, are then calculated as follows, such that the frequency of exposure among controls is $M_E$:

$$p_{00m} = p_{00}\,\frac{1 - M_E}{1 - P_E}, \qquad p_{01m} = p_{01}\,\frac{1 - M_E}{1 - P_E}, \qquad p_{10m} = p_{10}\,\frac{M_E}{P_E}, \qquad p_{11m} = p_{11}\,\frac{M_E}{P_E}.$$

Therefore, the variance of the log of the interaction odds ratio due to the flexibly matched controls can be estimated by

$$\frac{1 - P_E}{p_{00}(1 - M_E)} + \frac{1 - P_E}{p_{01}(1 - M_E)} + \frac{P_E}{p_{10} M_E} + \frac{P_E}{p_{11} M_E}.$$
By differentiating this function with respect to $M_E$ and finding the value of $M_E$ when this is zero, one can find the value of $M_E$ that minimizes this variance. After some simple algebra, setting the derivative to zero is equivalent to setting the following expression to zero:

$$\frac{p_{01u}\,p_{00u}\,(1 - M_E)^2}{(1 - P_E)^2} - \frac{p_{10u}\,p_{11u}\,M_E^2}{P_E^2},$$

where the subscript $u$ denotes the proportions among unmatched (population) controls. This equation can be solved for $M_E$ by factorization, since the expression is the difference of two squares, providing the solution in equation 3.
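The closed-form solution can be sanity-checked numerically by evaluating the control variance term from the Appendix over a fine grid of $M_E$ values and comparing the grid minimizer with equation 3. The cell proportions below are a made-up example with associated risk factors, and the function names are hypothetical.

```python
from math import sqrt


def control_variance_term(m_e, p_e, pop):
    """Appendix expression for the control contribution to the variance."""
    p00, p01, p10, p11 = pop
    return ((1 - p_e) / (p00 * (1 - m_e)) + (1 - p_e) / (p01 * (1 - m_e))
            + p_e / (p10 * m_e) + p_e / (p11 * m_e))


def equation_3(p_e, pop):
    """Closed-form optimum exposure frequency among controls."""
    p00, p01, p10, p11 = pop
    numerator = p_e * sqrt(p00 * p01)
    return numerator / (numerator + (1 - p_e) * sqrt(p10 * p11))


# Hypothetical population cells (p00, p01, p10, p11); exposure frequency is 0.30.
pop = (0.60, 0.10, 0.20, 0.10)
p_e = pop[2] + pop[3]

grid = [k / 10_000 for k in range(1, 10_000)]
grid_minimizer = min(grid, key=lambda m: control_variance_term(m, p_e, pop))
closed_form = equation_3(p_e, pop)
```

Because the variance term is convex in $M_E$ on (0, 1), the grid minimizer and the closed-form value should agree to within the grid spacing.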