White Paper Estimating Genotype-Specific Incidence for One or Several Loci

White Paper 23-01
Estimating Genotype-Specific Incidence for One or Several Loci

Authors: Mike Macpherson, Brian Naughton, Andro Hsu, Joanna Mountain
Created: September 5, 2007
Last Edited: November 18, 2007

Summary: We wish to estimate the incidence of a given trait based on an individual's genotype at one or more SNPs associated with the trait and any available phenotypic data for that individual. Here we describe the methods used to estimate these incidences, first for the case of a single SNP, then for the case of multiple SNPs.

Single Locus Calculation

In practice, we provide an estimate of trait incidence conditional on the individual's genotype and available phenotypic information. The calculations below, however, hold whether incidence or prevalence estimates are computed. We henceforth use the generic term "risk" instead of "incidence" to indicate this fact.

We assume a binary trait $D$ and a single associated locus at which there are three possible genotypes $G_1$, $G_2$, and $G_3$, ordered so that $G_1$ is the lower-risk homozygote and $G_3$ is the higher-risk homozygote. The individual for whom we wish to estimate the incidence has genotype $G_m$, $m \in \{1, 2, 3\}$. The quantity we wish to compute is $\Pr(D \mid G_m)$, the probability that the individual is affected given their genotype.

The unconditional risk for the trait, denoted $\Pr(D)$, is assumed known for a subpopulation of which the individual is a member. For instance, this might be an estimate of Type 2 diabetes prevalence for Asian subjects between the ages of 20 and 40. We further assume that estimates of the three genotype frequencies $\Pr(G_i)$ are available for the same subpopulation. Lastly, we assume that estimates of the three genotype-specific odds ratios $OR_1$, $OR_2$, and $OR_3$ are available, where $OR_1 = 1$ by definition.

Under these assumptions, we may compute the $\Pr(D \mid G_i)$ by solving the following system of equations:

\[ \Pr(D) = \Pr(D \mid G_1)\Pr(G_1) + \Pr(D \mid G_2)\Pr(G_2) + \Pr(D \mid G_3)\Pr(G_3) \tag{1} \]

\[ OR_2 = \frac{\Pr(D \mid G_2)/(1 - \Pr(D \mid G_2))}{\Pr(D \mid G_1)/(1 - \Pr(D \mid G_1))} \tag{2} \]

\[ OR_3 = \frac{\Pr(D \mid G_3)/(1 - \Pr(D \mid G_3))}{\Pr(D \mid G_1)/(1 - \Pr(D \mid G_1))} \tag{3} \]

Equation 1 is the law of total probability, and Equations 2 and 3 follow from the definition of an odds ratio. The estimated genotype-specific risk at the single locus is then $\Pr(D \mid G_m)$.

To convey the genotype-specific risk relative to the average risk in the population, we compute the quantity $OR^*_m = \mathrm{odds}(D \mid G_m)/\mathrm{odds}(D)$, where, for brevity, we have introduced the function $\mathrm{odds}(X) = \Pr(X)/(1 - \Pr(X))$. The inverse odds function, which maps an odds value $x$ back to a probability, will be used later on, and is $\mathrm{odds}^{-1}(x) = x/(1 + x)$. The superscript asterisk on $OR^*_m$ is a convention we adopt to distinguish an odds ratio computed relative to the average odds from one computed relative to the odds of the lower-risk homozygote, which is the standard definition.
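The system in Equations 1 to 3 can be reduced to a one-dimensional root-finding problem in Pr(D | G_1), because the population risk implied by Equation 1 increases monotonically with Pr(D | G_1) once the odds ratios are fixed. The Python sketch below is one possible way to solve it and is not necessarily the method used in practice. It uses plain bisection; the function names and the numeric inputs in the usage lines are hypothetical and chosen only for illustration.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


def inv_odds(o):
    """Convert odds back to a probability."""
    return o / (1.0 + o)


def genotype_risks(pr_d, freqs, odds_ratios, iters=100):
    """Solve Equations 1-3 for [Pr(D|G_1), Pr(D|G_2), Pr(D|G_3)].

    pr_d        -- unconditional risk Pr(D), strictly between 0 and 1
    freqs       -- genotype frequencies Pr(G_i), summing to 1
    odds_ratios -- genotype-specific odds ratios, with OR_1 = 1
    """
    def total_risk(p1):
        # Right-hand side of Equation 1 for a trial value of Pr(D|G_1),
        # with Pr(D|G_i) implied by Equations 2 and 3.
        base = odds(p1)
        return sum(f * inv_odds(r * base) for f, r in zip(freqs, odds_ratios))

    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if total_risk(mid) < pr_d:
            lo = mid
        else:
            hi = mid
    base = odds((lo + hi) / 2.0)
    return [inv_odds(r * base) for r in odds_ratios]


def recentered_odds_ratios(pr_d, risks):
    """Odds ratios relative to the population-average odds (the starred ORs)."""
    return [odds(p) / odds(pr_d) for p in risks]


# Hypothetical inputs, for illustration only.
example_freqs = [0.49, 0.42, 0.09]
example_ors = [1.0, 1.5, 2.25]
example_risks = genotype_risks(0.10, example_freqs, example_ors)
print(example_risks)
print(recentered_odds_ratios(0.10, example_risks))
```

Bisection was chosen here only because it needs no external libraries; any standard root finder would serve equally well.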

Complications

Odds Ratio Estimates: In the association study literature, odds ratios are commonly estimated via logistic regression assuming an additive model. This means that the logarithm of the odds ratio is assumed to relate linearly to the number of copies of the higher-risk allele (cf. Jewell, 2003). When the odds ratio is estimated in this way, the higher-risk homozygote's estimated log odds ratio is, by definition, exactly twice that of the heterozygote, from which it follows that the higher-risk homozygote's odds ratio is the square of the heterozygote's odds ratio. Thus the odds ratio reported when the additive model is assumed is that associated with one copy of the higher-risk allele, which could be called an allele-specific odds ratio. In such cases we set the heterozygous genotype's odds ratio, $OR_2$, equal to the allele-specific odds ratio, and set the higher-risk homozygote's odds ratio to $OR_3 = (OR_2)^2$. In cases for which genotype-specific odds ratios are estimated separately, we use those estimates directly.

Prevalence v. Incidence: We typically report genotype-specific risk in terms of incidence. However, the formulas in this white paper apply equally well to prevalence and incidence data. Specifically, the value $\Pr(D)$ may represent either a prevalence or an incidence datum.

Mismatched Datasets: In practice, we rely on association studies that are most often based on individuals of European descent for our odds ratio estimates. We obtain genotype frequency estimates primarily from HapMap, which provides one population of European descent, one of Yoruban descent, and one of Asian (Chinese & Japanese) descent. We obtain prevalence/incidence estimates from a variety of sources, and these estimates often pertain to individuals of European descent alone.
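To make the additive-model convention described under Odds Ratio Estimates concrete, the short Python sketch below expands a single reported allele-specific odds ratio into the three genotype-specific odds ratios. The function name and the input value are hypothetical; the sketch only restates the rule that the log odds ratio is linear in the number of higher-risk allele copies.

```python
import math


def genotype_odds_ratios(allele_or):
    """Expand an allele-specific odds ratio into (OR_1, OR_2, OR_3).

    Under the additive model, the log odds ratio is linear in the number
    of copies (0, 1, or 2) of the higher-risk allele.
    """
    beta = math.log(allele_or)  # log odds ratio per copy of the risk allele
    return tuple(math.exp(copies * beta) for copies in (0, 1, 2))


# Hypothetical allele-specific odds ratio, for illustration only.
print(genotype_odds_ratios(1.3))
```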
Ideally, we would have perfectly matched odds ratio, genotype frequency, and prevalence estimates, meaning that, if the prevalence estimate pertains to Asian subjects between 20 and 40, we would also have genotype frequency estimates and odds ratio estimates for Asians between 20 and 40. It is most often the case that we do not have such matched estimates. At this writing, we assume that the overall HapMap genotype frequency estimates apply across all age ranges, i.e. that the genotype frequencies within any age range are identical to the overall frequencies. We also assume that odds ratios apply across age ranges.

We record whether an odds ratio estimate derives from a European, African, or Asian population, and do the same for prevalence estimates. By "derives" we mean that we consider the population from which a given odds ratio or prevalence estimate was obtained, and (subjectively) decide whether that population is near enough to one of the three HapMap populations for us to use the estimate. We only report genotype-specific risk estimates when the underlying estimates match at the population level. For example, when an odds ratio estimate exists for an Asian population, a genotype frequency estimate exists for an Asian population, and a prevalence estimate exists for an Asian population, we will report the genotype-specific risk estimates, even if the prevalence estimate is age-range specific. At this writing, we have not studied how this policy affects the accuracy of the estimates.

Higher Moments: At this writing, we do not provide estimates of our certainty in the reported genotype-specific risk point estimate. It is standard practice to provide confidence intervals for odds ratios reported in the literature. We have sample sizes for the HapMap frequency data, and often have sample sizes for prevalence estimates, so producing risk confidence intervals should be relatively straightforward, and would be informative to our users.

Multiple Locus Calculation

For many traits, associations exist between the trait and several SNPs. We wish to combine the genotype-specific risk estimates from multiple loci into a single, genotype-specific composite risk estimate for an individual's genotype. Here we assume that the composite odds ratio, $OR_C$, is given by the product of the individual's odds ratios at each locus. This is very similar to what is done in multiple logistic regression under an additive model, where each copy of a risk allele at a given locus is assumed to add one unit of the log odds ratio specific to that locus to the overall log odds ratio (cf. Jewell, 2003; Risch, 1990). It is not exactly the same, because a multiple regression would be performed simultaneously on all the data, whereas we use individual odds ratios obtained from separate studies, which may, for instance, differ widely in sample size.

We assume that there are $K$ loci of interest, and denote the odds ratio of the $i$th genotype at the $k$th locus $OR_{i,k}$.
Similarly, the recentered odds ratios are denoted $OR^*_{i,k}$. We denote the genotypes at the $k$th locus $G_{1,k}$, $G_{2,k}$, $G_{3,k}$, and denote the individual's genotype at the $k$th locus $G_{m_k,k}$. Thus

\[ OR_C = \prod_{k=1}^{K} OR^*_{m_k,k} \tag{4} \]

The quantity $OR_C$ has the interpretation

\[ OR_C = \mathrm{odds}\Big(D \,\Big|\, \bigcap_{k=1}^{K} G_{m_k,k}\Big) \Big/ \mathrm{odds}(D), \tag{5} \]

where $\mathrm{odds}\big(D \mid \bigcap_{k=1}^{K} G_{m_k,k}\big)$ is the odds of the trait given the individual's multilocus genotype. In computing the product $OR_C$, we implicitly assume that the point where $\log(\mathrm{odds}(D))$ intersects the respective logistic regression line for each locus is the same point in each of the individual regression calculations, as would be true if a multiple logistic regression had been performed. Then

\[ \Pr\Big(D \,\Big|\, \bigcap_{k=1}^{K} G_{m_k,k}\Big) = \mathrm{odds}^{-1}\big[OR_C \cdot \mathrm{odds}(D)\big]. \tag{6} \]

Complications

Higher Moments: As in the single-locus case, we do not calculate confidence intervals for the estimates we provide. Given confidence intervals for individual risk at a single locus, it is straightforward to compute them for the composite risk.

References

Jewell, Nicholas P. 2003. Statistics for Epidemiology. Chapman & Hall/CRC.

Risch, N. 1990. Linkage strategies for genetically complex traits. I. Multilocus models. Am J Hum Genet, 46(2), 222-228.
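Illustrative sketch of the multiple-locus calculation: the Python fragment below shows one way Equations 4 to 6 could be evaluated once the recentered odds ratio for the individual's genotype at each locus is known. It is a minimal sketch under the independence assumption stated in the text; the function names and the numeric inputs are hypothetical and not taken from any real study.

```python
import math


def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


def inv_odds(o):
    """Convert odds back to a probability."""
    return o / (1.0 + o)


def composite_risk(pr_d, recentered_ors):
    """Composite risk for a multilocus genotype.

    pr_d           -- unconditional risk Pr(D) for the subpopulation
    recentered_ors -- one recentered (starred) odds ratio per locus,
                      each taken for the individual's genotype at that locus
    """
    or_c = math.prod(recentered_ors)    # Equation 4: product over loci
    return inv_odds(or_c * odds(pr_d))  # Equation 6: back to a probability


# Hypothetical per-locus recentered odds ratios, for illustration only.
print(composite_risk(0.10, [1.2, 0.9, 1.4]))
```

With an empty list of loci the product is 1, so the function returns the population risk unchanged, which is a quick sanity check on the recentering.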