Human population sub-structure and genetic association studies

Human population sub-structure and genetic association studies Stephanie A. Santorico, Ph.D. Department of Mathematical & Statistical Sciences Stephanie.Santorico@ucdenver.edu

Global Similarity Map from 23andme.com

Besides being cool, this information can help us think about genetic influence on complex traits.
Why is information about ancestry important in disease mapping studies?
How can we measure ancestry from genetic data?
How is this information used in the context of genetic association studies and, more specifically, in genome-wide association studies?

Motivating data: a sample of 4,920 Native Americans of the Pima and Papago tribes [1].
Disease: type 2 diabetes.
Marker: a haplotype from the Gm 3;5,13,14 system of human immunoglobulin G.
Question of interest: is there an association between the haplotype and the disease?

$H_0$: haplotype frequency in cases = haplotype frequency in controls.
The haplotype appears to be protective: 8% of individuals with the haplotype are diabetic versus 29% of those without it.

A BRIEF DETOUR

Confounding
[Diagram: a confounder linked to both the factor of interest and the disease.]
A confounding variable is an extraneous variable in a statistical model that correlates with both the factor of interest (the independent variable) and the outcome (the dependent variable, here disease).

Public service announcement: confounding is everywhere and is a good phenomenon to be aware of in any scientific study.

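Here, confounding is due to admixture. How? In [1], the haplotype is a marker of European admixture, and diabetes prevalence is lower among individuals with more European ancestry, so ancestry is correlated with both the haplotype and the disease.
More generally, we can have confounding due to population substructure or population stratification. Population stratification is the presence of a systematic difference in allele frequencies between subpopulations in a population. This is why great care is taken in genetic association studies to match and/or control for ancestry.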
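To make the mechanism concrete, here is a minimal sketch of how pooling two subpopulations can create an association that exists in neither one. All numbers are invented for illustration and are not the data from [1]; the helper names `chi2_2x2` and `stratum_table` are hypothetical. Within each subpopulation, carrier status and disease are constructed to be independent.

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic (1 df, no continuity correction)
    for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def stratum_table(n, carrier_freq, prevalence):
    """Expected counts when carrier status and disease are independent.
    Order: carrier/diseased, carrier/healthy, non-carrier/diseased, non-carrier/healthy."""
    return (
        n * carrier_freq * prevalence,
        n * carrier_freq * (1 - prevalence),
        n * (1 - carrier_freq) * prevalence,
        n * (1 - carrier_freq) * (1 - prevalence),
    )


# Hypothetical subpopulations: haplotype frequency and disease prevalence
# both differ between them, but the haplotype has no effect on disease.
pop1 = stratum_table(n=2000, carrier_freq=0.60, prevalence=0.10)
pop2 = stratum_table(n=2000, carrier_freq=0.10, prevalence=0.40)
pooled = tuple(x + y for x, y in zip(pop1, pop2))

chi2_pop1 = chi2_2x2(*pop1)
chi2_pop2 = chi2_2x2(*pop2)
chi2_pooled = chi2_2x2(*pooled)

print("within subpopulation 1:", round(chi2_pop1, 3))
print("within subpopulation 2:", round(chi2_pop2, 3))
print("pooled sample:         ", round(chi2_pooled, 3))
```

The within-subpopulation statistics are essentially zero, while the pooled statistic is large: the haplotype looks protective only because it is more common in the subpopulation with less disease.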

This example motivated a good deal of research on methods for correcting for confounding due to population substructure. The issue is not specific to any one complex trait. The problem does not go away with a bigger sample or more markers. However, with more markers, we can use that information to adjust for population structure.

Context: genome-wide association studies (GWAS)
Box 2, i of [4]: an example from the type 2 diabetes component of the Wellcome Trust Case Control Consortium study.
From here on, we will assume that tests have been conducted for each SNP over the genome using a chi-squared test statistic, e.g., $X_1^2, \ldots, X_{1,817,348}^2$.

Methods for Dealing with Population Substructure in GWAS
1. Genomic control
2. Principal components analysis
3. Structured association methods
4. Family-based studies
5. Mixed models
With the exception of family-based studies, these are in order of increasing complexity. For each method, we will go through an overview and the pros and cons of its use. [2] is a good review of methods and further literature.

1. Genomic Control
Devlin and Roeder [3] suggested the use of a genomic inflation factor, denoted by $\lambda$.
Concept: population substructure inflates significance.

Figure 1 of [2]: simulated P-P plots under three scenarios for genome-wide scans with no causal markers.
(a) No stratification: p-values fit the expected distribution.
(b) Stratification without unusually differentiated markers: p-values exhibit modest genome-wide inflation.
(c) Stratification with unusually differentiated markers: p-values exhibit modest genome-wide inflation and severe inflation at a small number of markers.

Measure inflation using the median of the chi-squared test statistics divided by the median of the corresponding distribution under no inflation:
$$\lambda = \frac{\operatorname{median}(X_i^2)}{0.456}$$
Here 0.456 approximates the median of a chi-squared distribution with 1 degree of freedom (about 0.455).
If $\lambda > 1$, this indicates the presence of substructure. Adjust the chi-squared statistics by dividing by this factor. This reduces the inflation.
Pros: exceptionally easy.
Cons: a single adjustment for all SNPs may not be appropriate.
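The adjustment described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the test statistics are invented, and the function names `genomic_control` and `chi2_pvalue_1df` are hypothetical. The denominator 0.456 is taken from the slide.

```python
import math
import statistics

CHI2_1DF_MEDIAN = 0.456  # value used on the slide; the exact median is about 0.455


def genomic_control(chi2_stats):
    """Return the inflation factor lambda and the statistics divided by lambda."""
    lam = statistics.median(chi2_stats) / CHI2_1DF_MEDIAN
    adjusted = [x / lam for x in chi2_stats]
    return lam, adjusted


def chi2_pvalue_1df(x):
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Invented 1-df chi-squared statistics from a hypothetical, inflated scan.
chi2_stats = [0.1, 0.3, 0.6, 0.9, 1.2, 2.5, 4.0]

lam, adjusted = genomic_control(chi2_stats)
print("lambda:", round(lam, 3))
print("largest statistic, p before:", chi2_pvalue_1df(max(chi2_stats)))
print("largest statistic, p after: ", chi2_pvalue_1df(max(adjusted)))
```

After the adjustment, the median of the statistics equals the reference median by construction, which is also why a single factor cannot correct markers that are unusually differentiated.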

Nowadays, GC is more often used as an indicator that appropriate adjustments have been made.
[Figure panel: no adjustments made. From [2], based on 99,900 SNPs + 100 unusually differentiated SNPs added for =0.6]

2. Principal Components Analysis (PCA)
PCA is a general, well-studied and widely used statistical method that dates to 1901. PCA finds uncorrelated combinations of variables that maximally explain variation.
In GWAS, PCA is used to derive continuous axes of genetic variation. These PCs can be used to detect outliers and to explore population substructure. Information from PCs can be used to match cases and controls based on ancestry or as a covariate adjustment in linear or logistic regression.

Genetic PCs correlated with ancestry Figure 1a from [5]

PCA usage in GWAS:
Conduct a PCA with study samples and samples from public sources representing diverse world populations, such as the 1000 Genomes Project.
Determine whether study samples match their self-reported ancestry, and remove those that do not fit a homogeneous study sample.
PCs on study samples can be used as covariates in subsequent analysis or for matching individuals based on ancestry.
Software note: most statistical software packages will perform a PCA. A commonly used package specific to this application is EIGENSOFT, which includes the smartpca program.
Pros: relatively easy with existing software. Powerful exploratory technique.
Cons: requires some pre-treatment of SNPs, e.g., pruning of SNPs based on linkage disequilibrium and minor allele frequency.
Figure: principal component analysis of 3557 study subjects with 1194 HapMap controls. Color-coding distinguishes HapMap groups and the study subjects (TZ). Axis labels indicate the percentage of variance explained by each eigenvector.
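The core computation behind these steps can be sketched as follows. This is a minimal illustration assuming NumPy and simulated genotypes; it is not the EIGENSOFT implementation, the function name `genotype_pcs` is hypothetical, and the scaling by $\sqrt{2p(1-p)}$ is one common normalization chosen here as an assumption. LD pruning and minor allele frequency filtering are omitted.

```python
import numpy as np


def genotype_pcs(genotypes, n_components=2):
    """Return PC scores and variance fractions for an (individuals x SNPs) 0/1/2 matrix."""
    g = np.asarray(genotypes, dtype=float)
    p = g.mean(axis=0) / 2.0                      # sample allele frequency per SNP
    keep = (p > 0) & (p < 1)                      # drop SNPs with no variation
    g, p = g[:, keep], p[keep]
    z = (g - 2 * p) / np.sqrt(2 * p * (1 - p))    # centre and scale each SNP
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2 / np.sum(s ** 2))[:n_components]
    return scores, explained


# Simulated data (hypothetical): two groups of 30 individuals whose allele
# frequencies differ at the first 100 of 200 SNPs.
rng = np.random.default_rng(0)
freq_a = np.full(200, 0.2)
freq_b = freq_a.copy()
freq_b[:100] = 0.8
genotypes = np.vstack([
    rng.binomial(2, freq_a, size=(30, 200)),
    rng.binomial(2, freq_b, size=(30, 200)),
])

scores, explained = genotype_pcs(genotypes)
pc1_a, pc1_b = scores[:30, 0], scores[30:, 0]
separated = pc1_a.max() < pc1_b.min() or pc1_b.max() < pc1_a.min()
print("PC1 separates the two groups:", separated)
print("variance fraction of PC1 and PC2:", explained)
```

In a real analysis the columns of `scores` would be the covariates added to the regression model, or the coordinates used to match cases and controls.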

3. Structured association methods
These methods model or infer underlying population structure and assign individuals to these sub-populations or clusters.
Some established methods:
GEM (Genetic Matching) [6]
STRUCTURE [7]
ADMIXTURE [8]

3. Structured association methods: Genetic Matching
Figure 1 A and B from [6]: flowchart for the genetic-matching algorithm, illustrated with portions of the T1D data. Distances between individuals are determined by the major axes of variation in the EVD representation. Outlier removal, illustrated by (A), is critical for revealing the subtle variability between individuals of similar ancestry. After major outliers are removed, clustering is used for discovery of homogeneous clusters; four distinct clusters are displayed here (B), plotted as principal component axes.

After matching cases and controls, testing is done within each stratum of matched cases and controls, and then evidence is combined over the strata, e.g., with the Cochran-Mantel-Haenszel test or conditional logistic regression.

Stratum 1
Allele   Case   Control
A        32     74
a        16     60

...

Stratum K
Allele   Case   Control
A        21     62
a        9      40

Stratum 1: $n_{111} - \mu_{111} = 32 - 28.0$, $\mathrm{Var}(n_{111}) = 8.6$
Stratum K: $n_{11K} - \mu_{11K} = 21 - 18.9$, $\mathrm{Var}(n_{11K}) = 5.4$
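The expected counts and variances quoted for the two strata follow from the hypergeometric distribution of the top-left cell given the table margins. The sketch below reproduces them and combines the strata into a Cochran-Mantel-Haenszel statistic. It makes two assumptions: no continuity correction is applied, and only the two strata shown on the slide are used as if they were the complete set (the strata between 1 and K are not given). The function names are hypothetical.

```python
import math


def stratum_moments(table):
    """Expected value and variance of the top-left cell of a 2x2 table,
    conditional on its margins. table = ((a, b), (c, d))."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    mu = row1 * col1 / n
    var = row1 * row2 * col1 * col2 / (n ** 2 * (n - 1))
    return mu, var


def cmh_test(strata):
    """Cochran-Mantel-Haenszel chi-squared statistic (1 df, no continuity
    correction) and its p-value."""
    diff_sum = 0.0
    var_sum = 0.0
    for table in strata:
        mu, var = stratum_moments(table)
        diff_sum += table[0][0] - mu
        var_sum += var
    stat = diff_sum ** 2 / var_sum
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-squared upper tail, 1 df
    return stat, p_value


# Rows: alleles A and a. Columns: case and control. Counts are from the slide.
strata = [((32, 74), (16, 60)), ((21, 62), (9, 40))]

for table in strata:
    mu, var = stratum_moments(table)
    print("expected:", round(mu, 1), "variance:", round(var, 1))

stat, p_value = cmh_test(strata)
print("CMH statistic:", round(stat, 2), "p-value:", round(p_value, 3))
```

Because each stratum contributes only its own observed-minus-expected difference, allele frequency differences between strata do not enter the statistic.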

Since testing is done within strata prior to combining information, inflation due to population substructure is controlled.
Pros: allows for control of fine-level structure. Enables rigorous use of public controls.
Cons: a multi-step process that requires a fair amount of computation, though software exists for such analyses.

4. Family-based studies
Family-based association tests focus on within-family information. They became popular in the 1990s with the transmission disequilibrium test (TDT) and numerous extensions such as the FBAT, PDT and QTDT.
Matching is done by focusing on transmitted versus non-transmitted alleles from heterozygous parents. Since matching is internal to the family, population substructure is appropriately taken into account.
$H_0$: the proportion of 1 alleles transmitted to diseased individuals from 1/2 parents is 0.5.

In [10], the TDT was used to demonstrate association of class 1 alleles of the insulin gene 5' VNTR with insulin-dependent diabetes. Previous case-control tests had found significant association, but linkage studies were not able to find significant linkage for this marker.
$H_0$: the proportion of 1 alleles transmitted to diseased individuals from 1/2 parents is 0.5.

                 NOT TRANSMITTED
TRANSMITTED      Class 1    Other
Class 1          -          78
Other            46         -

$$T = \frac{(n_{12} - n_{21})^2}{n_{12} + n_{21}} = \frac{(78 - 46)^2}{78 + 46} = 8.3$$
The p-value can be computed from a chi-squared distribution with 1 degree of freedom to be $0.00811427 / 2 \approx 0.0041$.
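The arithmetic for this example can be checked with a short standard-library sketch. The function name `tdt` is hypothetical; the counts 78 and 46 are the transmissions from heterozygous parents quoted from [10].

```python
import math


def tdt(n12, n21):
    """Transmission disequilibrium test statistic and its p-value.

    n12: heterozygous parents transmitting the class 1 allele
    n21: heterozygous parents transmitting the other allele
    """
    stat = (n12 - n21) ** 2 / (n12 + n21)
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-squared upper tail, 1 df
    return stat, p_value


stat, p_value = tdt(78, 46)
print("T =", round(stat, 2))
print("p =", round(p_value, 4))
```

The statistic depends only on the two discordant cells, which is why parents who are homozygous contribute no information to the test.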

Pros of family-based studies: statistics are easy to compute and easy to interpret. Does not require genome-wide data. Tests incorporate linkage information.
Cons: these tests do not use between-family information and hence are not fully powered. Information from homozygous parents is not used at all.

5. Mixed Models
These models are "mixed" because they use a fixed component for a SNP effect and a random component reflecting underlying structure, e.g., population substructure and cryptic relatedness. They adjust for underlying structure by using genetic data to measure correlation between individuals.
$$y_i = \beta_0 + \beta_k X_{ik} + \eta_i + \varepsilon_i$$
$$\operatorname{var}(Y) = \sigma_a^2 S_N + \sigma_e^2 I$$
$S_N$ is derived from the matrix $K = (k_{ij})$, where
$$k_{ij} = \frac{1}{M} \sum_{k=1}^{M} \frac{(n_{ik} - 2p_k)(n_{jk} - 2p_k)}{2p_k(1 - p_k)}$$
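The matrix $K$ can be computed directly from a genotype matrix. The sketch below assumes NumPy and reads the symbols in the formula as follows, which is an interpretation rather than something stated on the slide: $n_{ik}$ is the allele count (0, 1 or 2) of individual $i$ at SNP $k$, $p_k$ is the allele frequency at SNP $k$ (estimated here from the sample), and $M$ is the number of SNPs. The function name `kinship_matrix` and the toy genotypes are hypothetical. Fitting the mixed model itself, as done by packages such as EMMAX [9], is not shown.

```python
import numpy as np


def kinship_matrix(genotypes):
    """Genetic relationship matrix K with
    k_ij = (1/M) * sum_k (n_ik - 2 p_k)(n_jk - 2 p_k) / (2 p_k (1 - p_k))."""
    n = np.asarray(genotypes, dtype=float)      # n[i, k]: allele count, individual i, SNP k
    p = n.mean(axis=0) / 2.0                    # sample allele frequency p_k
    polymorphic = (p > 0) & (p < 1)             # avoid division by zero
    n, p = n[:, polymorphic], p[polymorphic]
    z = (n - 2 * p) / np.sqrt(2 * p * (1 - p))  # standardized genotypes
    return z @ z.T / z.shape[1]


# Toy genotypes (invented): 4 individuals, 5 SNPs.
genotypes = np.array([
    [0, 1, 2, 1, 0],
    [0, 1, 2, 1, 1],
    [2, 1, 0, 1, 1],
    [2, 0, 0, 2, 2],
])

K = kinship_matrix(genotypes)
print(np.round(K, 3))
```

In this toy example the first two individuals have nearly identical genotypes, so their off-diagonal entry is large and positive, while the entry for the first and last individuals is negative.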

Mixed model example, from [9]

Pros of mixed models: adjusts for fine-level structure, including relatedness.
Cons: much more computationally intensive, but feasible with existing software packages such as EMMAX [9] as well as with the use of computing clusters.

Concluding remarks and test
Confounding presents itself in genetic studies through ______, which is a concern because ______.
This can be detected through use of a summary measure, ______, which can be used as a quality control step and/or as an adjustment to test statistics.
Beyond family-based methods, there are a number of methods that allow for adjustment of substructure. In increasing order of complexity, four such methods are ______.

References
1. Knowler WC, Williams RC, Pettitt DJ, Steinberg AG. Gm3;5,13,14 and type 2 diabetes mellitus: an association in American Indians with genetic admixture. Am J Hum Genet. 1988;43:520-6.
2. Price AL, Zaitlen NA, Reich D, Patterson N. New approaches to population stratification in genome-wide association studies. Nat Rev Genet. 2010;11(7):459-63.
3. Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999;55(4):997-1004.
4. McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JPA, et al. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet. 2008;9(5):356-69.
5. Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, et al. Genes mirror geography within Europe. Nature. 2008;456(7218):98-101.
6. Luca D, Ringquist S, Klei L, Lee AB, Gieger C, Wichmann H-E, et al. On the use of general control samples for genome-wide association studies: genetic matching highlights causal variants. Am J Hum Genet. 2008;82(2):453-63.
7. Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155:945-59.
8. Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009;19:1655-64.
9. Kang HM, Sul JH, Service SK, Zaitlen NA, Kong S-Y, Freimer NB, et al. Variance component model to account for sample structure in genome-wide association studies. Nat Genet. 2010;42(4):348-54.
10. Spielman RS, McGinnis RE, Ewens WJ. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet. 1993;52:506-16.