Human population sub-structure and genetic association studies

Stephanie A. Santorico, Ph.D., Department of Mathematical & Statistical Sciences, Stephanie.Santorico@ucdenver.edu

[Figure: Global Similarity Map from 23andme.com]

Besides being cool, this information can help us think about genetic influence on complex traits.
Why is information about ancestry important in disease mapping studies?
How can we measure ancestry from genetic data?
How is this information used in the context of genetic association studies and, more specifically, in genome-wide association studies?

Motivating data:
Sample of 4,920 Native Americans of the Pima and Papago tribes [1]
Disease: type 2 diabetes
Marker: haplotype from the Gm 3;5,13,14 system of human immunoglobulin G
Question of interest: Is there association between the haplotype and disease?

$H_0$: haplotype frequency in cases = haplotype frequency in controls
The haplotype appears to be protective: 8% diabetic among individuals with the haplotype versus 29% diabetic among individuals without it.

A BRIEF DETOUR

Confounding
[Diagram: the confounder is linked to both the factor of interest and the disease.]
A confounding variable is an extraneous variable in a statistical model that correlates with both the factor of interest (the independent variable) and the outcome (the dependent variable; here, disease).

Public service announcement: Confounding is everywhere and is a good phenomenon to be aware of in any scientific study.

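Here, confounding is due to admixture. HOW? The haplotype is a marker of European admixture [1]: individuals who carry it tend to have more European ancestry, and diabetes prevalence is lower with more European ancestry, so ancestry is related to both the haplotype and the disease.
More generally, we can have confounding due to population substructure or population stratification. Population stratification is the presence of a systematic difference in allele frequencies between subpopulations in a population.
This is why great care is taken in genetic association studies to match and/or control for ancestry.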

This example motivated a good deal of research on methods for correcting for confounding due to population substructure. The issue is not specific to any one complex trait. The problem does not go away with a bigger sample or more markers. However, with more markers, we can use that information to adjust for population structure.

Context: genome-wide association studies (GWAS)
Box 2, i of [4]: an example from the type 2 diabetes component of the Wellcome Trust Case Control Consortium study.
From here on, we will assume that tests have been conducted for each SNP over the genome using a chi-squared test statistic, e.g., $X_1^2, \ldots, X_{1,817,348}^2$.

Methods for Dealing with Population Substructure in GWAS
1. Genomic control
2. Principal components analysis
3. Structured association methods
4. Family-based studies
5. Mixed models
With the exception of family-based studies, these are in order of increasing complexity. For each method, we will go through an overview and pros/cons for their usage. [2] is a good review of methods and further literature.

1. Genomic Control
Devlin and Roeder [3] suggested the use of a genomic inflation factor, denoted by λ.
Concept: population substructure inflates significance.

Figure 1 of [2]: The figure shows simulated P-P plots under three scenarios for genome-wide scans with no causal markers.
(a) No stratification: p-values fit the expected distribution.
(b) Stratification without unusually differentiated markers: p-values exhibit modest genome-wide inflation.
(c) Stratification with unusually differentiated markers: p-values exhibit modest genome-wide inflation and severe inflation at a small number of markers.

1. Genomic Control
Devlin and Roeder [3] suggested the use of a genomic inflation factor, denoted by λ.
Concept: population substructure inflates significance.
Measure inflation using the median of the chi-squared test statistics divided by the median of the corresponding distribution under no inflation (about 0.456 for a chi-squared distribution with 1 degree of freedom):
$\lambda = \mathrm{median}(X_i^2) / 0.456$
If λ > 1, this indicates the presence of substructure.
Adjust the chi-squared statistics by dividing by this factor. This reduces the inflation.
Pros: Exceptionally easy.
Cons: A single adjustment for all SNPs may not be appropriate.
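The calculation on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from [3]: the function names are hypothetical, and the constant 0.456 is taken directly from the slide as the median of a 1-degree-of-freedom chi-squared distribution.

```python
from statistics import median

# Median of a chi-squared distribution with 1 df, as quoted on the slide.
CHI2_1DF_MEDIAN = 0.456


def genomic_inflation_factor(chi2_stats):
    """Return lambda: the observed median statistic over the null median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control_adjust(chi2_stats):
    """Divide every test statistic by the single genome-wide factor lambda."""
    lam = genomic_inflation_factor(chi2_stats)
    return [x / lam for x in chi2_stats]


# Toy input (hypothetical values, for illustration only).
example_stats = [0.1, 0.2, 0.912, 1.5, 3.0]
example_lambda = genomic_inflation_factor(example_stats)
example_adjusted = genomic_control_adjust(example_stats)
```

Because every statistic is divided by the same λ, the ranking of SNPs is unchanged; only the overall scale of the statistics is corrected, which is the limitation noted under "Cons".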

Nowadays, GC is more often used as an indicator that appropriate adjustments have been made.
Example with no adjustments made, from [2]: based on 99,900 SNPs + 100 unusually differentiated SNPs added for =0.6

2. Principal Components Analysis (PCA)
PCA is a general, well-studied and widely used statistical method that dates to 1901.
PCA finds uncorrelated combinations of variables that maximally explain variation.
In GWAS, PCA is used to derive continuous axes of genetic variation.
These PCs can be used to detect outliers.
Population substructure can be explored.
Information from PCs can be used to match cases and controls based on ancestry or as a covariate adjustment in linear or logistic regression.

Genetic PCs correlated with ancestry Figure 1a from [5]

PCA usage in GWAS:
Conduct a PCA with study samples together with samples from public sources representing diverse world populations, such as the 1000 Genomes Project.
Determine whether study samples match their self-reported ancestry, removing samples as needed so that the study sample is homogeneous.
PCs computed on the study samples can be used as covariates in subsequent analysis or for matching individuals based on ancestry.
Software note: most statistical software packages will perform a PCA. A commonly used package specific to this application is EIGENSOFT, which includes the smartpca program.
Pros: Relatively easy with existing software. Powerful exploratory technique.
Cons: Requires some pre-treatment of SNPs, e.g., pruning SNPs based on linkage disequilibrium and filtering on minor allele frequency.
Figure: Principal component analysis of 3557 study subjects with 1194 HapMap controls. Color-coding distinguishes HapMap groups and the study subjects (TZ). Axis labels indicate the percentage of variance explained by each eigenvector.
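To make the idea of a genetic PC concrete, here is a toy sketch in pure Python. It is not how EIGENSOFT works internally; it assumes genotypes coded as allele counts (0, 1, 2), estimates allele frequencies from the sample, skips monomorphic SNPs, and uses simple power iteration (a swapped-in technique chosen for brevity) to find the leading axis of variation. All function names are hypothetical.

```python
import math


def standardize(genotypes):
    """Center and scale each SNP column; drop monomorphic SNPs.

    genotypes: list of individuals, each a list of allele counts (0, 1, 2).
    Returns an individuals-by-SNPs matrix of standardized values.
    """
    n = len(genotypes)
    cols = []
    for k in range(len(genotypes[0])):
        p = sum(row[k] for row in genotypes) / (2.0 * n)  # sample allele frequency
        if p <= 0.0 or p >= 1.0:
            continue  # monomorphic SNP carries no information
        scale = math.sqrt(2.0 * p * (1.0 - p))
        cols.append([(row[k] - 2.0 * p) / scale for row in genotypes])
    return [[col[i] for col in cols] for i in range(n)]


def first_pc(genotypes, n_iter=200):
    """Leading eigenvector of the individual-by-individual similarity matrix."""
    z = standardize(genotypes)
    n, m = len(z), len(z[0])
    gram = [[sum(z[i][k] * z[j][k] for k in range(m)) / m for j in range(n)]
            for i in range(n)]
    # Start from a non-constant vector: the constant vector lies in the
    # null space of the similarity matrix because columns are centered.
    v = [float(i + 1) for i in range(n)]
    for _ in range(n_iter):
        w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v
```

In a toy data set with two genetically distinct groups, the entries of the returned vector take opposite signs for the two groups (the overall sign of a PC is arbitrary). Real analyses would use dedicated software and LD-pruned SNPs, as noted on the slide.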

3. Structured association methods
These methods model or infer underlying population structure and assign individuals to these sub-populations or clusters.
Some established methods:
GEM (Genetic Matching) [6]
STRUCTURE [7]
ADMIXTURE [8]

3. Structured association methods: Genetic Matching
Figure 1 A&B from [6]: Flowchart for Genetic-Matching Algorithm Illustrated with Portions of the T1D Data. Distances between individuals are determined by the major axes of variation in the EVD representation. Outlier removal, illustrated by (A), is critical for revealing the subtle variability between individuals of similar ancestry. After major outliers are removed, clustering is used for discovery of homogeneous clusters; four distinct clusters are displayed here (B), plotted as principal component axes.

After matching cases and controls, testing is done within each stratum of matched cases and controls, and then evidence is combined over the strata, e.g., with the Cochran-Mantel-Haenszel test or conditional logistic regression.
Stratum   Allele   Case   Control
1         A        32     74
1         a        16     60
...
K         A        21     62
K         a         9     40
Stratum 1: $n_{111} - \mu_{111} = 32 - 28.0$, $\mathrm{Var}(n_{111}) = 8.6$
Stratum K: $n_{11K} - \mu_{11K} = 21 - 18.9$, $\mathrm{Var}(n_{11K}) = 5.4$
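The per-stratum quantities above can be reproduced with a short Python sketch. It assumes the standard hypergeometric mean and variance for the top-left cell of each 2x2 table and combines strata without a continuity correction; the function names are hypothetical, and only the two strata shown on the slide are used, so the combined statistic is purely illustrative.

```python
def stratum_moments(case_A, control_A, case_a, control_a):
    """Expected value and variance of the (allele A, case) cell in one stratum."""
    n = case_A + control_A + case_a + control_a
    row_A = case_A + control_A      # total A alleles
    row_a = case_a + control_a      # total a alleles
    col_case = case_A + case_a      # total case alleles
    col_control = control_A + control_a
    mu = row_A * col_case / n
    var = row_A * row_a * col_case * col_control / (n * n * (n - 1))
    return mu, var


def cmh_statistic(strata):
    """Cochran-Mantel-Haenszel statistic (no continuity correction).

    strata: list of (case_A, control_A, case_a, control_a) tuples.
    """
    diff_sum = 0.0
    var_sum = 0.0
    for table in strata:
        mu, var = stratum_moments(*table)
        diff_sum += table[0] - mu
        var_sum += var
    return diff_sum ** 2 / var_sum


# The two strata shown on the slide (strata 2..K-1 are not given).
slide_strata = [(32, 74, 16, 60), (21, 62, 9, 40)]
mu_1, var_1 = stratum_moments(*slide_strata[0])
mu_K, var_K = stratum_moments(*slide_strata[1])
```

Under these assumptions the sketch gives values that agree with the slide to within rounding. The combined statistic is compared to a chi-squared distribution with 1 degree of freedom.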

After matching cases and controls
Testing is done within each stratum of matched cases and controls, and then evidence is combined over the strata, e.g., with the Cochran-Mantel-Haenszel test or conditional logistic regression.
Since testing is done within strata prior to combining information, inflation due to population substructure is controlled.
Pros: Allows for control of fine-level structure. Enables rigorous use of public controls.
Cons: A multi-step process that requires a fair amount of computation, though software exists for such analyses.

In [10], the TDT was used to demonstrate association of class 1 alleles of the insulin gene 5' VNTR with insulin-dependent diabetes. Previous case-control tests had found significant association, but linkage studies were not able to find significant linkage for this marker.
$H_0$: the proportion of class 1 alleles transmitted to diseased individuals from heterozygous (1/2) parents is 0.5.
Transmitted   Not transmitted   Count
Class 1       Other             78
Other         Class 1           46
$T = \frac{(n_{12} - n_{21})^2}{n_{12} + n_{21}} = \frac{(78 - 46)^2}{78 + 46} = 8.3$
The p-value, computed from a chi-squared distribution with 1 degree of freedom, is approximately 0.004.
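The TDT arithmetic can be checked with a few lines of Python. This is a minimal sketch with hypothetical function names; the p-value uses the identity that, for a chi-squared variable with 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).

```python
import math


def tdt_statistic(n12, n21):
    """TDT statistic from transmission counts of heterozygous parents.

    n12: times allele 1 was transmitted (allele 2 not transmitted).
    n21: times allele 2 was transmitted (allele 1 not transmitted).
    """
    return (n12 - n21) ** 2 / (n12 + n21)


def chi2_1df_pvalue(x):
    """Upper-tail probability of a chi-squared distribution with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


# Counts from the slide: 78 class 1 transmissions, 46 other transmissions.
t_stat = tdt_statistic(78, 46)
p_value = chi2_1df_pvalue(t_stat)
print(round(t_stat, 1), p_value)
```

Only heterozygous parents contribute to the counts, which is why the statistic depends on the two off-diagonal cells alone.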

4. Family-based studies
Family-based association tests focus on within-family information.
Popular starting in the 1990s with the transmission disequilibrium test (TDT) and numerous extensions such as the FBAT, PDT and QTDT.
Matching is done by focusing on transmitted versus non-transmitted alleles from heterozygous parents.
Since matching is internal to the family, population substructure is appropriately taken into account.
Pros: Statistics are easy to compute and easy to interpret. Does not require genome-wide data. Tests incorporate linkage information.
Cons: These tests do not use between-family information and hence are not fully powered. Information from homozygous parents is not used at all.
$H_0$: the proportion of class 1 alleles transmitted to diseased individuals from heterozygous (1/2) parents is 0.5.

Mixed model example, from [9]

5. Mixed Models
These models are "mixed" because they use a fixed component for the SNP effect and a random component reflecting underlying structure, e.g., population substructure and cryptic relatedness.
They adjust for underlying structure by using genetic data to measure correlation between individuals:
$y_i = \beta_0 + \beta_k X_{ik} + \eta_i + \varepsilon_i$
$\mathrm{var}(Y) = \sigma_a^2 S_N + \sigma_e^2 I$
$S_N$ is derived from the matrix $K = (k_{ij})$, where
$k_{ij} = \frac{1}{M} \sum_{k=1}^{M} \frac{(n_{ik} - 2p_k)(n_{jk} - 2p_k)}{2 p_k (1 - p_k)}$
Pros: Adjusts for fine-level structure, including relatedness.
Cons: Much more computationally intensive, but feasible with existing software packages such as EMMAX [9] as well as with the use of computing clusters.
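The matrix K can be computed directly from genotype data, as in the following pure-Python sketch. The slide does not define the symbols, so this assumes that n_ik is the allele count (0, 1, or 2) of individual i at SNP k and that p_k is the allele frequency at SNP k, estimated here from the sample itself. Monomorphic SNPs are skipped to avoid dividing by zero. The function name is hypothetical, and this is not the EMMAX implementation.

```python
def kinship_matrix(genotypes):
    """Genetic similarity matrix K = (k_ij) from allele counts.

    genotypes: list of individuals, each a list of allele counts (0, 1, 2).
    Allele frequencies are estimated from the sample (an assumption).
    """
    n = len(genotypes)
    m = len(genotypes[0])
    freqs = [sum(row[k] for row in genotypes) / (2.0 * n) for k in range(m)]
    used = [k for k in range(m) if 0.0 < freqs[k] < 1.0]  # polymorphic SNPs only

    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            total = 0.0
            for k in used:
                p = freqs[k]
                total += ((genotypes[i][k] - 2.0 * p)
                          * (genotypes[j][k] - 2.0 * p)
                          / (2.0 * p * (1.0 - p)))
            K[i][j] = K[j][i] = total / len(used)  # K is symmetric
    return K


# Toy genotypes (hypothetical): two individuals with opposite homozygous genotypes.
toy_K = kinship_matrix([[0, 2], [2, 0]])
```

Large positive off-diagonal entries indicate pairs of individuals who are genetically more similar than average, which is the correlation the random component of the mixed model is meant to absorb.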

Concluding remarks and test
Confounding presents itself in genetic studies through ____, which is a concern because ____.
This can be detected through use of a summary measure, ____, which can be used as a quality control step and/or as an adjustment to test statistics.
Beyond family-based methods, there are a number of methods that allow for adjustment of substructure. In increasing order of complexity, four such methods are ____.

References
1. Knowler WC, Williams RC, Pettitt DJ, Steinberg AG. Gm3;5,13,14 and type 2 diabetes mellitus: an association in American Indians with genetic admixture. Am J Hum Genet. 1988;43:520-6.
2. Price AL, Zaitlen NA, Reich D, Patterson N. New approaches to population stratification in genome-wide association studies. Nat Rev Genet. 2010;11(7):459-63.
3. Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999;55(4):997-1004.
4. McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JPA, et al. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet. 2008;9(5):356-69.
5. Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, et al. Genes mirror geography within Europe. Nature. 2008;456(7218):98-101.
6. Luca D, Ringquist S, Klei L, Lee AB, Gieger C, Wichmann H-E, et al. On the use of general control samples for genome-wide association studies: genetic matching highlights causal variants. Am J Hum Genet. 2008;82(2):453-63.
7. Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155:945-59.
8. Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009;19:1655-64.
9. Kang HM, Sul JH, Service SK, Zaitlen NA, Kong S-Y, Freimer NB, et al. Variance component model to account for sample structure in genome-wide association studies. Nat Genet. 2010;42(4):348-54.
10. Spielman RS, McGinnis RE, Ewens WJ. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet. 1993;52:506-16.