Tutorial on Genome-Wide Association Studies

Similar documents
New Enhancements: GWAS Workflows with SVS

Genome-wide Association Analysis Applied to Asthma-Susceptibility Gene. McCaw, Z., Wu, W., Hsiao, S., McKhann, A., Tracy, S.

CS2220 Introduction to Computational Biology

Dan Koller, Ph.D. Medical and Molecular Genetics

Supplementary Figures

Genome-wide association studies (case/control and family-based) Heather J. Cordell, Institute of Genetic Medicine Newcastle University, UK

Introduction to the Genetics of Complex Disease

Introduction to linkage and family based designs to study the genetic epidemiology of complex traits. Harold Snieder

GENOME-WIDE ASSOCIATION STUDIES

BST227 Introduction to Statistical Genetics. Lecture 4: Introduction to linkage and association analysis

Quality Control Analysis of Add Health GWAS Data

During the hyperinsulinemic-euglycemic clamp [1], a priming dose of human insulin (Novolin,

Supplementary Figure 1: Attenuation of association signals after conditioning for the lead SNP. a) attenuation of association signal at the 9p22.

Human population sub-structure and genetic association studies

An Introduction to Quantitative Genetics I. Heather A Lawson Advanced Genetics Spring2018

Nature Genetics: doi: /ng Supplementary Figure 1

LTA Analysis of HapMap Genotype Data

CNV PCA Search Tutorial

Quantitative genetics: traits controlled by alleles at many loci

For more information about how to cite these materials visit

Statistical power and significance testing in large-scale genetic studies

Research Article Power Estimation for Gene-Longevity Association Analysis Using Concordant Twins

5/2/18. After this class students should be able to: Stephanie Moon, Ph.D. - GWAS. How do we distinguish Mendelian from non-mendelian traits?

Big Data Training for Translational Omics Research. Session 1, Day 3, Liu. Case Study #2. PLOS Genetics DOI: /journal.pgen.

New methods for discovering common and rare genetic variants in human disease

AN ANALYSIS OF GENOME-WIDE ASSOCIATION STUDIES TO PRODUCE EVIDENCE USEFUL IN GUIDING THEIR REPORTING AND SYNTHESIS

Reviewers' comments: Reviewer #1 (Remarks to the Author):

Imaging Genetics: Heritability, Linkage & Association

Supplementary Figure 1. Principal components analysis of European ancestry in the African American, Native Hawaiian and Latino populations.

Introduction of Genome wide Complex Trait Analysis (GCTA) Presenter: Yue Ming Chen Location: Stat Gen Workshop Date: 6/7/2013

SUPPLEMENTARY DATA. 1. Characteristics of individual studies

Discontinuous Traits. Chapter 22. Quantitative Traits. Types of Quantitative Traits. Few, distinct phenotypes. Also called discrete characters

Interaction of Genes and the Environment

Example HLA-B and abacavir. Roujeau 2014

Introduction to genetic variation. He Zhang Bioinformatics Core Facility 6/22/2016

Heritability and genetic correlations explained by common SNPs for MetS traits. Shashaank Vattikuti, Juen Guo and Carson Chow LBM/NIDDK

Supplementary information. Supplementary figure 1. Flow chart of study design

Statistical Genetics : Gene Mappin g through Linkag e and Associatio n

Statistical Tests for X Chromosome Association Study. with Simulations. Jian Wang July 10, 2012

MultiPhen: Joint Model of Multiple Phenotypes Can Increase Discovery in GWAS

Missing Heritablility How to Analyze Your Own Genome Fall 2013

Complex Multifactorial Genetic Diseases

Estimating genetic variation within families

EXPLORING THE GENETIC ARCHITECTURE OF LATE-ONSET ALZHEIMER DISEASE IN AN AMISH POPULATION. Anna Christine Cummings

Global variation in copy number in the human genome

HST.161 Molecular Biology and Genetics in Modern Medicine Fall 2007

Assessing Gene-Environment Interactions in Genome-Wide Association Studies: Statistical Approaches

MULTIFACTORIAL DISEASES. MG L-10 July 7 th 2014

Genetics and Pharmacogenetics in Human Complex Disorders (Example of Bipolar Disorder)

Lecture 1 Mendelian Inheritance

Mendelian Randomization

Pedigree Construction Notes

Nature and nurture in. Department of Biological Psychology, Neuroscience Campus / EMGO + institute VU University, Amsterdam The Netherlands

Summary & general discussion

PopGen4: Assortative mating

Mendelian Inheritance. Jurg Ott Columbia and Rockefeller Universities New York

Nature Neuroscience: doi: /nn Supplementary Figure 1. Missense damaging predictions as a function of allele frequency

Problem set questions from Final Exam Human Genetics, Nondisjunction, and Cancer

Multifactorial Inheritance

Genetics and Genomics in Medicine Chapter 8 Questions

The Loss of Heterozygosity (LOH) Algorithm in Genotyping Console 2.0

The Inheritance of Complex Traits

GWAS of HCC Proposed Statistical Approach Mendelian Randomization and Mediation Analysis. Chris Amos Manal Hassan Lewis Roberts Donghui Li

Using Imputed Genotypes for Relative Risk Estimation in Case-Parent Studies

Mendelian & Complex Traits. Quantitative Imaging Genomics. Genetics Terminology 2. Genetics Terminology 1. Human Genome. Genetics Terminology 3

BST227: Introduction to Statistical Genetics

The genetic architecture of type 2 diabetes appears

Title: Pinpointing resilience in Bipolar Disorder

Supplementary Online Content

Epigenetics. Jenny van Dongen Vrije Universiteit (VU) Amsterdam Boulder, Friday march 10, 2017

Lecture 17: Human Genetics. I. Types of Genetic Disorders. A. Single gene disorders

Multiple Approaches to Studying Gene-Environment Interplay. Sara Jaffee University of Pennsylvania

Whole-genome detection of disease-associated deletions or excess homozygosity in a case control study of rheumatoid arthritis

The Chromosomal Basis Of Inheritance

White Paper Guidelines on Vetting Genetic Associations

On Missing Data and Genotyping Errors in Association Studies

Heritability of surface area and cortical thickness: a comparison between the Human Connectome Project and the UK Biobank dataset

IS IT GENETIC? How do genes, environment and chance interact to specify a complex trait such as intelligence?

INTRODUCTION TO GENETIC EPIDEMIOLOGY (EPID0754) Prof. Dr. Dr. K. Van Steen

Transmission Disequilibrium Methods for Family-Based Studies Daniel J. Schaid Technical Report #72 July, 2004

Supplementary figures

Mendel Short IGES 2003 Data Preparation. Eric Sobel. Department of of Human Genetics UCLA School of of Medicine

Chapter 15: The Chromosomal Basis of Inheritance

Genetics of common disorders with complex inheritance Bettina Blaumeiser MD PhD

Title: A robustness study of parametric and non-parametric tests in Model-Based Multifactor Dimensionality Reduction for epistasis detection

Role of Genomics in Selection of Beef Cattle for Healthfulness Characteristics

White Paper Estimating Complex Phenotype Prevalence Using Predictive Models

Lecture 6 Practice of Linkage Analysis

Today s Topics. Cracking the Genetic Code. The Process of Genetic Transmission. The Process of Genetic Transmission. Genes

INTRODUCTION TO GENETIC EPIDEMIOLOGY (1012GENEP1) Prof. Dr. Dr. K. Van Steen

Using ancestry estimates as tools to better understand group or individual differences in disease risk or disease outcomes

Supplementary Methods, Tables and Figures

Developing and evaluating polygenic risk prediction models for stratified disease prevention

University of Groningen. Metabolic risk in people with psychotic disorders Bruins, Jojanneke

Chapter 15: The Chromosomal Basis of Inheritance

Supplementary Figures

Performing. linkage analysis using MERLIN

Ch. 23 The Evolution of Populations

ASSOCIATION OF KCNJ1 VARIATION WITH CHANGE IN FASTING GLUCOSE AND NEW ONSET DIABETES DURING HCTZ TREATMENT

Identification of regions with common copy-number variations using SNP array

Transcription:

Tutorial on Genome-Wide Association Studies Assistant Professor Institute for Computational Biology Department of Epidemiology and Biostatistics Case Western Reserve University

Acknowledgements Dana Crawford Holli Dilks-Hutchinson Marylyn Ritchie William S. Bush 2014 2

Key References www.ploscollections.org/translationalbioinformatics William S. Bush 2014 3

Overview Common Study Designs for GWAS Quality Control Procedures for GWAS Data Statistical Analysis Replication William S. Bush 2014 4

Goals of a Genetic Study Determine if there is a genetic component (heritability) Describe mode of inheritance (segregation analysis) Determine the effect size of the genetic component Identify the gene causing the disease William S. Bush 2014 5

Family Studies Allows estimation of genetic component Allows examination of mode of inheritance Difficult to collect Power derived from the number of families William S. Bush 2014 6

Population Studies Easier to collect Larger sample sizes Assumes there is a genetic component Power derived from ratio of controls to cases William S. Bush 2014 7

Retrospective vs. Prospective Manolio et al. Nature Reviews Genetics 7, 812 820 (October 2006) Exposure is a genotype William S. Bush 2014 8

Choosing a Study Design What samples are available? Is a genetic component known? Details of the trait being studied (age at onset, disease frequency, penetrance, etc.) Interest in other factors of disease (environmental exposures, survival, effect size) William S. Bush 2014 9

Choosing a Study Design Pearson, T. A. et al. JAMA 2008;299:1335 1344. William S. Bush 2014 10

Assumptions of GWAS Examines only the Common Disease Common Variant hypothesis Relies on dense sets of genetic markers Exploits linkage disequilibrium to make indirect associations Goal: Identify markers with significant associations to disease William S. Bush 2014 11

Recombination William S. Bush 2014 12

Linkage Disequilibrium William S. Bush 2014 13

Exploiting Linkage Disequilibrium William S. Bush 2014 14

How GWAS Data is Generated William S. Bush 2014 15

Quality Control of GWAS Data Variable Genotyping Call Rate Genotyping Quality Sex concordance Sample Relatedness Comments Low call rate often correlates with error. Some low call rate SNPs or samples may still be good. Worse quality score (GenCall) correlates strongly with error rate Check expectations for X marker heterozygosity and Y marker positive results. Can estimate error rate. Check for related samples (expected or unexpected) Mendelian Inheritance Errors Replicate concordance For trio/family data, can identify problem samples and families. Can estimate error rate. Check for consistent genotype calls in duplicate samples Batch effects Check for genotyping call differences due to plate Hardy Weinberg Equilibrium Population Stratification Violation across all sample groups may indicate error, but can also be a good test of association Check for population substructure using the genome wide data William S. Bush 2014 16

Marker and Sample Call Rate William S. Bush 2014 17

Sex Concordance Check emerge_id Pedsex SNPsex PLINK_F Note 16230834 2 0 0.4746 CIDR comment after review of B allele freq and Log R ratio plots for all chromosomes: This sample has large loss of heterozygosity (LOH) blocks on X (and other autosomes). The sample is definitely female (2 X chromosomes by intensities). 16228083 2 0 0.2654 Same as above 16231930 2 0 0.4376 Same as above 16233764 2 0 0.2603 Same as above 16221112 2 0 0.2048 XX/XO mosaic not caught by initial check completed by CIDR 16222319 2 0 0.7452 Annotation by CIDR at data release: Appears to be XX/XO mosaic 16228204 2 1 1 Annotation by CIDR at data release: Appears to be XX/XO mosaic 16233113 1 0 0.4752 Annotation by CIDR at data release: Appears to be XXY 16214881 1 2 0.136 Annotation by CIDR at data release: Appears to be XXY/XY mosaic Female: pedsex=2, SNPsex=2 Male: pedsex= 1, SNPsex=1 A male call is made if the F (actual X chromosome inbreeding estimate) is more than 0.8; a female call is made if the F is less than 0.2. William S. Bush 2014 18

Normal Chromosome 1 Sex Concordance Check William S. Bush 2014 19

Possible XXY/XY Mosaic Sex Concordance William S. Bush 2014 20

Sample Relatedness Z0 Z1 Z2 Kinship Relationship 0.0 0.0 1.0 1.0 MZ twin or duplicate 0.0 1.0 0.0 0.50 Parent offspring 0.25 0.50 0.25 0.50 Full siblings 0.50 0.50 0.0 0.25 Half siblings 0.75 0.25 0.0 0.125 Cousins 1.0 0.0 0.0 0.0 Unrelated William S. Bush 2014 21

Sample Relatedness http://www.sph.umich.edu/csg/abecasis/publications/11524377.html William S. Bush 2014 22

Mendelian Inheritance Errors Even with Case/Control data, HapMap trios are typically plated with study samples for QC Number Mendelian Errors Number SNPs pre QC Number SNPs post marker QC 0 558821 552346 1 1519 1353 2 97 64 3 5 1 William S. Bush 2014 23

Sample Replicate Concordance emerge Samp1 samp2 discordant total concordance_rate 16231453 A B 171 558882 0.99969 16223704 A B 137 557783 0.99975 16216270 A B 133 559711 0.99976 16230108 A B 69 559341 0.99987 16224359 A B 67 558868 0.99988 16234120 A B 43 560202 0.99992 16232463 A B 42 560355 0.99992 16234233 A B 33 560384 0.99994 16216349 A B 30 559345 0.99994 16215309 A B 12 560041 0.99997 16224779 A B 7 560412 0.99998 16231724 A B 5 560427 0.99999 16233841 A B 4 560519 0.99999 16221647 A B 2 560457 0.99999 16230404 A B 2 560309 0.99999 16226433 A B 2 560500 0.99999 16234367 A B 2 560373 0.99999 16224635 A B 1 560560 0.99999 16219214 A B 1 560535 0.99999 16231219 A B 1 560547 0.99999 16220060 A B 0 560580 1 William S. Bush 2014 24

Hardy Weinberg Equilibrium All individuals threshhold below exp_below excess_below 0.05 37690 28022 9668 0.01 12774 5604 7170 0.001 4766 560 4206 1.00E 04 2949 56 2893 1.00E 05 2337 5 2332 1.00E 06 2004 0 2004 1.00E 07 1785 0 1785 All cases threshold below exp_below excess_below 0.05 34646 28022 6624 0.01 10843 5604 5239 0.001 3642 560 3082 1.00E 04 2194 56 2138 1.00E 05 1792 5 1787 1.00E 06 1563 0 1563 1.00E 07 1394 0 1394 All controls threshold below exp_below excess_below 0.05 30557 28022 2535 0.01 8859 5604 3255 0.001 2614 560 2054 1.00E 04 1517 56 1461 1.00E 05 1180 5 1175 1.00E 06 982 0 982 1.00E 07 860 0 860 William S. Bush 2014 25

Population Stratification Principal Component Analysis (PCA) Can cause confounding Genes Mirror Geography in Europe, Novembre, Nature Genetics, 2008 William S. Bush 2014 26

Batch Effects Evidence that associations can result due to allele frequency difference due to plate effects Careful consideration when creating plate maps Plate cases and controls together Randomize by race, gender, age, BMI, others After genotyping look for plate effects MAF differences by plate Call rate by plate Association tests (one plate versus all others) William S. Bush 2014 27

Importance of QC Pre QC Thresholds Post QC Thresholds Many false positives disappear after QC William S. Bush 2014 28

GWAS Analysis Consider 500,000 SNPs across the human genome Each SNP has its own statistical test Each SNP has a different statistical power (depending on allele frequency) William S. Bush 2014 29

Analysis of GWAS Data Basic Statistical Methods are usually applied Linear Regression (continuous trait) Logistic Regression (dichotomous trait) Adjustments are critical to avoid confounding William S. Bush 2014 30

Software Tools PLINK - http://pngu.mgh.harvard.edu/~purcell/plink/ PLATO https://ritchielab.psu.edu/software/platodownload R http://www.r-project.org/ Bioconductor - http://www.bioconductor.org/ William S. Bush 2014 31

Logistic Regression Examines differences between two groups (cases and controls) Transforms the Y- Axis of a typical regression using a logit function Produces a probability of case or control status (Odds Ratio) William S. Bush 2014 32

QQ Plot Systematic deviation from the line indicates population stratification / genomic inflation William S. Bush 2014 33

Multiple Testing Perform 500,000 analyses Type I error set at 5%, we can expect 25,000 false positive results Bonferroni correction False Discovery Rate (FDR) Gene-based correction (principal components) Genome-wide significance is p<10-8 Can be problematic for non-european Populations William S. Bush 2014 34

Winner s Curse Consider an item with a fixed value (pashmina) If there are ten American tourists bidding on the same item, the bids will average around the item s true value By definition, the winner will ALWAYS overpay (Dana) William S. Bush 2014 35

Winner s Curse in GWAS Similarly when running a GWAS and discovering a SNP association, you will OVERESTIMATE the strength of the association Power calculations use an effect size to know how many samples you need to detect this effect If the effect size is actually SMALLER than you think, you ll need MORE samples to see you effect again William S. Bush 2014 36

Replication Now required for consideration in top journals Second sample, preferably with larger sample sizes to increase power Ideally should be interchangeable with the first sample in every way Need all the covariates you used in the first dataset William S. Bush 2014 37

Great GWAS Examples Multiple Sclerosis GWAS Trio design, extensive QC http://www.ncbi.nlm.nih.gov/pubmed/17660530 Type II Diabetes GWAS http://www.ncbi.nlm.nih.gov/pubmed/17463246?do pt=abstract&holding=npg William S. Bush 2014 38