Rare Variant Burden Tests. Biostatistics 666


Last Lecture: Analysis of Short Read Sequence Data
- Low pass sequencing approaches
  - Modeling haplotype sharing between individuals allows accurate variant calls for shared variants
- Assembly based analyses
  - Conveniently allow many different types of variation to be analyzed in the same framework

Variants Discovered in Low Pass Analysis As Function of Allele Frequency In 1000 Genomes Project Phase I (1094 samples @ 4x), Hyun Min Kang

Exome Sequencing Today
- Association analysis of rare coding variants
  - Single variant analysis
  - Burden tests
  - Weighted burden tests
  - Allowing for direction of effect
- Example of an exome sequencing study

Why Study Rare Variants?
COMPLETE GENETIC ARCHITECTURE OF EACH TRAIT
- Are there additional susceptibility loci to be found?
- What is the contribution of each identified locus to a trait?
- Sequencing, imputation and new arrays describe variation more fully
- Rare variants are plentiful and should identify new susceptibility loci
UNDERSTAND FUNCTION LINKING EACH LOCUS TO A TRAIT
- Do we have new targets for therapy? What happens in gene knockouts?
- Use sequencing to find rare human knockout alleles
  - Good: Results may be clearer than for animal studies
  - Bad: Naturally occurring knockout alleles are extremely rare
Coding variants are especially useful!

The Scale of Rare Variation

Lots of Rare Functional Variants to Discover
SET                   # SNPs    Singletons      Doubletons     Tripletons     >3 Occurrences
Synonymous            270,263   128,319 (47%)   29,340 (11%)   13,129 (5%)    99,475 (37%)
Nonsynonymous         410,956   234,633 (57%)   46,740 (11%)   19,274 (5%)    110,309 (27%)
Nonsense              8,913     6,196 (70%)     926 (10%)      326 (4%)       1,465 (16%)
Non-Syn / Syn Ratio             1.8 to 1        1.6 to 1       1.4 to 1       1.1 to 1
There is a very large reservoir of extremely rare, likely functional, coding variants.
NHLBI Exome Sequencing Project

Allele Frequency Spectrum (After Sequencing 12,000+ Individuals)
[Figure: variant count (log scale, 1 to 1,000,000) against minor allele count (log scale, 1 to 10,000), plotted for nonsynonymous, splice and stop variants.]
http://genome.sph.umich.edu/wiki/exome_chip_design

How Much Variation Might Rare Variants Explain?
- All variation neutral, population size constant
  - MAF<0.1% variants explain 0.2% of heritability
  - MAF<1.0% variants explain 2.0% of heritability
  - MAF<5.0% variants explain 10% of heritability
- Nonsynonymous frequency spectrum from 12,000 exomes
  - MAF<0.1% variants explain 3.6% of heritability
  - MAF<1.0% variants explain 10.6% of heritability
  - MAF<5.0% variants explain 22.7% of heritability
- Assuming rare-variant effect sizes are ~2x larger on average
  - Above estimates increase to about 8.6, 25.4 and 54.0%
- Assuming rare-variant effect sizes are ~3x larger on average
  - Above estimates increase to about 11.6, 34.1 and 72.6%
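The arithmetic behind these percentages can be sketched in a few lines. This is a hedged reconstruction, not the lecturer's calculation. For the neutral case it assumes a folded frequency spectrum proportional to 1/(p(1-p)) and per-variant variance proportional to 2p(1-p), so the heritability contribution is uniform in MAF over (0, 0.5]. For the "2x" and "3x" cases it assumes that every variant with MAF<5% has its effect size multiplied by the stated factor, so its variance grows by the square of that factor. The function names are illustrative only.

```python
def neutral_share(maf_threshold):
    """Share of heritability from variants with MAF below the threshold,
    assuming neutrality and constant population size (uniform in MAF)."""
    return maf_threshold / 0.5


def rescaled_shares(shares, rare_share, fold):
    """Rescale heritability shares when every variant in the rare class
    (total share `rare_share`) has its effect size multiplied by `fold`.
    ASSUMPTION: the rare class is MAF<5%; this is inferred, not stated."""
    boost = fold ** 2  # variance explained scales with effect size squared
    total = boost * rare_share + (1.0 - rare_share)
    return [boost * s / total for s in shares]


# Shares observed for nonsynonymous variants (MAF<0.1%, <1%, <5%)
observed = [0.036, 0.106, 0.227]

for maf in (0.001, 0.01, 0.05):
    print(f"neutral, MAF<{maf:.1%}: {neutral_share(maf):.1%}")
for fold in (2, 3):
    scaled = rescaled_shares(observed, rare_share=0.227, fold=fold)
    print(f"{fold}x effects:", ", ".join(f"{s:.1%}" for s in scaled))
```

Under these assumptions the neutral shares come out at exactly twice the MAF threshold, and the rescaled shares land within a few tenths of a percentage point of the figures quoted on the slide.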

Do Rare Variants Have Large Effects?
- The main driver is natural selection
- Most variants that impact function are expected to be deleterious
- Natural selection will prevent them from becoming common
- Good evidence that non-synonymous variants are depleted among common variant lists

Rare Variants Have Large Effects More Often
[Figure: lipid associated variants, effect size (SD) against allele frequency. Results from analysis of >190,000 individuals.]
Sengupta et al (unpublished)

Genome Scale Approaches To Study Rare Variation
- Deep whole genome sequencing
  - Can only be applied to limited numbers of samples
  - Most complete ascertainment of variation
- Exome capture and targeted sequencing (our focus for today)
  - Can be applied to moderate numbers of samples
  - SNPs and indels in the most interesting 1% of the genome
- Low coverage whole genome sequencing
  - Can be applied to moderate numbers of samples
  - Very complete ascertainment of shared variation
- New genotyping arrays and/or genotype imputation
  - Examine low frequency coding variants in 100,000s of samples
  - Current catalogs include 97-98% of sites detectable by sequencing an individual

SNPs Per Individual
Primarily European Ancestry
            # SNP    # HET   # ALT   # Singletons   Ts/Tv
SILENT      10127    6174    3953    38.2           5.10
MISSENSE    8541     5184    3357    72.2           2.16
NONSENSE    86       57      29      2.1            1.70
Primarily African Ancestry
            # SNP    # HET   # ALT   # Singletons   Ts/Tv
SILENT      12028    8038    3990    53.2           5.19
MISSENSE    9870     6502    3367    94.2           2.16
NONSENSE    92       57      35      2.4            1.57

Rare Variant Association Testing
- Consider a variant with frequency of ~0.001
- Significance level of 5x10^-6
  - Corresponds to ~100,000 independent tests
- Disease prevalence of ~10%
- Detecting a two-fold increase in risk requires ~33,000 cases and ~33,000 controls!
- Detecting a three-fold increase in risk requires ~11,000 cases and ~11,000 controls!
- Power depends both on frequency and on effect size
- Even with large effects, rare variants can only be detected in large samples
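A rough sample size calculation shows where numbers of this magnitude can come from. The sketch below is not the lecturer's calculation. It assumes a two-sided allelic test comparing risk-allele frequencies between equal numbers of cases and controls, 80% power, a multiplicative per-allele relative risk, and controls drawn from unaffected individuals. The 80% power target, the haploid approximation and the function names are all assumptions added for illustration.

```python
from statistics import NormalDist


def allele_freqs(p, rr, prevalence):
    """Risk-allele frequency in cases and in controls.
    ASSUMPTION: each allele is treated as an independent unit whose carrier
    has `rr` times the baseline risk (haploid approximation)."""
    baseline = prevalence / (1.0 + p * (rr - 1.0))  # risk without the allele
    p_case = p * rr * baseline / prevalence
    p_ctrl = p * (1.0 - rr * baseline) / (1.0 - prevalence)
    return p_case, p_ctrl


def cases_needed(p, rr, prevalence, alpha, power):
    """Cases (with an equal number of controls) needed for a two-sided
    two-proportion z-test on allele frequencies."""
    p1, p2 = allele_freqs(p, rr, prevalence)
    z_a = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2.0
    numerator = (z_a * (2.0 * p_bar * (1.0 - p_bar)) ** 0.5
                 + z_b * (p1 * (1.0 - p1) + p2 * (1.0 - p2)) ** 0.5) ** 2
    alleles_per_group = numerator / (p1 - p2) ** 2
    return alleles_per_group / 2.0  # two alleles per individual


for rr in (2.0, 3.0):
    n = cases_needed(p=0.001, rr=rr, prevalence=0.10, alpha=5e-6, power=0.80)
    print(f"relative risk {rr:g}: about {n:,.0f} cases and as many controls")
```

With these inputs the sketch returns values of the same order as the ~33,000 and ~11,000 quoted on the slide. A different power target or control definition would shift the exact figures.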

Alternatives to Single Variant Tests

Collapsing Rare Variants
- Instead of testing rare variants individually, group variants likely to have similar function
- Score presence or absence of rare variants per individual
- Use rare variant score to predict trait values
- If all variants are causal, leads to large increase in power
- In practice, success depends on:
  - Number of associated variants
  - Number of neutral variants diluting signals
  - Whether direction of effect is consistent within gene
Li and Leal (2008) Am J Hum Genet 83:311-321
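The collapsing idea can be illustrated with a minimal sketch: each individual is reduced to a single carrier indicator, and carrier rates are compared between cases and controls. This is not the exact procedure of Li and Leal (2008). The Pearson chi-square test, the toy genotypes and the function names are assumptions chosen to keep the example short.

```python
import math


def collapse(genotypes):
    """Return 1 if the individual carries any rare allele in the gene.
    `genotypes` is a list of minor allele counts, one per rare variant."""
    return int(any(g > 0 for g in genotypes))


def carrier_test(cases, controls):
    """Pearson chi-square test (1 df) on carrier vs non-carrier counts."""
    a = sum(collapse(g) for g in cases)      # carrier cases
    b = len(cases) - a                       # non-carrier cases
    c = sum(collapse(g) for g in controls)   # carrier controls
    d = len(controls) - c                    # non-carrier controls
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # no carriers (or everyone a carrier): no evidence
    chi2 = n * (a * d - b * c) ** 2 / denom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square tail, 1 df
    return chi2, p_value


# Toy data: 3 rare variants, 20 cases and 20 controls (illustrative only)
cases = ([[1, 0, 0]] * 2 + [[0, 1, 0]] * 2 + [[0, 0, 1]] * 2
         + [[0, 0, 0]] * 14)
controls = [[0, 1, 0]] + [[0, 0, 0]] * 19

chi2, p_value = carrier_test(cases, controls)
print(f"chi-square = {chi2:.2f}, p = {p_value:.3f}")
```

No single variant in the toy data is seen more than twice in cases, yet the collapsed indicator separates the groups. With counts this small, an exact test would be a more careful choice than the chi-square approximation.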

Burden vs. Single Variant Tests
Scenario                                                        Single Variant Test   Combined Test
10 variants / all have risk 2 / all have frequency .005         .05                   .86
10 variants / all have risk 2 / unequal frequencies             .20                   .85
10 variants / average risk is 2, but varies / frequency .005    .11                   .97
Power tabulated in collections of simulated data, for 250 cases and 250 controls.
- Combining variants can greatly increase power
- Currently, appropriately combining variants is expected to be a key feature of rare variant studies
Li and Leal (2008) Am J Hum Genet 83:311-321

Impact of Null Alleles
Scenario                                                Single Variant Test   Combined Test
10 disease associated variants                          .05                   .86
10 disease associated variants + 5 null variants        .04                   .70
10 disease associated variants + 10 null variants       .03                   .55
10 disease associated variants + 20 null variants       .03                   .33
Power tabulated in collections of simulated data.
- Including non-disease variants reduces power
- Power loss is manageable, combined test remains preferable to single marker tests
Li and Leal (2008) Am J Hum Genet 83:311-321

Impact of Missing Disease Alleles
Scenario                                       Single Variant Test   Combined Test
10 disease associated variants                 .05                   .86
10 disease associated variants, 2 missed       .05                   .72
10 disease associated variants, 4 missed       .05                   .52
10 disease associated variants, 6 missed       .04                   .28
10 disease associated variants, 8 missed       .03                   .08
Power tabulated in collections of simulated data.
- Missing disease associated variants loses power
Li and Leal (2008) Am J Hum Genet 83:311-321

Refining Rare Variant Tests
- The original Li and Leal (2008) test simply collapses rare variants into one allele
- Multiple refinements have been proposed since:
  - Counting the number of rare variants per individual
  - Weighting rare variants according to frequency
  - Weighting rare variants according to function
  - Including imputed variants in the analysis
- Each of these methods may improve power, but few practical examples provide guidance

CMAT: Combined Minor Allele Test
Consider a gene with k variants in a sample of N cases and N controls. For polymorphism i define:
- $w_i$, a weight based on functional annotation, minor allele frequency, imputation accuracy
- $g_{ij}$, the expected posterior minor allele count in individual j
Set
$$m_A = \sum_{i=1}^{k} w_i \sum_{j \in \text{case}} g_{ij}, \qquad M_A = \sum_{i=1}^{k} w_i \sum_{j \in \text{case}} (2 - g_{ij}),$$
with $m_U$ and $M_U$ defined in the same way over controls. The test statistic is then
$$\text{CMAT} = \frac{(m_A M_U - m_U M_A)^2}{N (m_A + m_U)(M_A + M_U)}$$
Significance of the test statistic is evaluated by permutation of affection status.
Zawistowski et al (2010)
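A minimal sketch of a CMAT-style calculation with a permutation p-value follows. It assumes the statistic takes the chi-square-like form (m_A M_U - m_U M_A)^2 / (N (m_A + m_U)(M_A + M_U)), where m and M are weighted minor and major allele counts in cases (A) and controls (U). That form is a reading of the slide, not a verified transcription of Zawistowski et al (2010). Constant scaling factors do not change the permutation p-value. The toy data and function names are illustrative.

```python
import random


def cmat_statistic(genotypes, weights, is_case):
    """genotypes[j][i] is the expected minor allele count (0 to 2) of
    individual j at variant i; weights[i] is the weight of variant i."""
    m_a = m_u = big_a = big_u = 0.0
    n_cases = sum(is_case)
    for row, case in zip(genotypes, is_case):
        minor = sum(w * g for w, g in zip(weights, row))
        major = sum(w * (2.0 - g) for w, g in zip(weights, row))
        if case:
            m_a += minor
            big_a += major
        else:
            m_u += minor
            big_u += major
    denom = n_cases * (m_a + m_u) * (big_a + big_u)
    if denom == 0:
        return 0.0
    return (m_a * big_u - m_u * big_a) ** 2 / denom


def permutation_p(genotypes, weights, is_case, n_perm=1000, seed=0):
    """Permute affection status and count statistics at least as large
    as the observed one (add-one correction keeps p above zero)."""
    rng = random.Random(seed)
    observed = cmat_statistic(genotypes, weights, is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if cmat_statistic(genotypes, weights, labels) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data: 2 variants, 4 cases carrying variant 0, 4 controls without it
genotypes = [[1, 0]] * 4 + [[0, 0]] * 4
is_case = [1] * 4 + [0] * 4
weights = [1.0, 1.0]

print("CMAT-style statistic:", cmat_statistic(genotypes, weights, is_case))
print("permutation p-value:", permutation_p(genotypes, weights, is_case))
```

Because the genotype entries are expected counts, imputed dosages can be passed in directly alongside sequenced genotypes, with imputation accuracy reflected in the weights.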

Weights
- Use computational algorithms to prioritize functional variants
  - Based on conservation
  - Based on biochemical properties
- Frequency is an independent predictor of functional consequence.

Imputation in Rare Variant Burden Tests Zawistowski et al (2010)

Power as a Function of No. of Sequenced and Genotyped Samples Zawistowski et al (2010)

Maximizing the Power
- Power depends on summed frequency: choose the threshold for defining "rare" carefully.
- Enriching functional variants in cases increases power, perhaps by focusing on loss of function variants only.
- For quantitative traits, focus on individuals with extreme trait values.
- For discrete traits, focus on individuals with family history of disease.

Enriching based on familial risk Classic Case-Control Familial enrichment

Benefits of Favoring Family History of Disease

Practical Example: Exome Sequencing and Burden Tests
NHLBI Exome Sequencing Project
University of Washington and Broad Institute
Cristen Willer and Leslie Lange

Exome Sequencing Project
- The NHLBI Exome Sequencing Project is studying heart, lung and blood related traits
- One of the traits of interest is LDL, a major risk factor for cardiovascular disease
- Let's review their preliminary findings, in analysis of:
  - 400 individuals selected from the top and bottom 2% of the population
  - 1,600 individuals selected without consideration of LDL

LDL Results: Burden Test, MAF < 5%
(logistic regression adjusted by PC1, PC2, age, gender, center)
UNFILTERED:    PCSK9 (2nd) p = 5x10^-7    LDLR (162nd) p = 0.009
PASS-FILTER:   PCSK9 (1st) p = 5x10^-7    LDLR (75th) p = 0.006
Cristen Willer and Leslie Lange, NHLBI Exome Sequencing Project

LDL Results: Burden Test, MAF < 0.1%
(logistic regression adjusted by PC1, PC2, age, gender, center)
UNFILTERED:    LDLR (1st) p = 3x10^-6    PCSK9 (31st) p = 0.004
PASS-FILTER:   LDLR (1st) p = 3x10^-6    PCSK9 (30th) p = 0.004
Cristen Willer and Leslie Lange, NHLBI Exome Sequencing Project

LDL Results: Burden Test, MAF < 0.5%
(logistic regression adjusted by PC1, PC2, age, gender, center)
UNFILTERED:    NPC1L1 (2nd) p = 7x10^-5
PASS-FILTER:   NPC1L1 (2nd) p = 7x10^-5
Cristen Willer and Leslie Lange, NHLBI Exome Sequencing Project

Variable Threshold Tests
- Different definitions of "rare" lead to different signals
- Conducting multiple analyses quickly becomes hard to manage
- What to do?
  - Variable threshold tests consider all possible thresholds for each gene and search for the maximum test statistic
  - Evaluate significance by permutation
Price et al (2010) AJHG 86:832-838

Variable Threshold Tests
- Price et al (2010) originally suggested using permutations for evaluating significance of variable threshold association tests
- Lin and Tang (2011) showed that statistics using different thresholds could be described using a multivariate normal distribution, allowing for p-value calculation without permutations
Lin and Tang (2011) AJHG 89:354-367
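The permutation version of a variable threshold test can be sketched as follows. The burden z-score used here is a simple score-type statistic and taking its absolute value is a choice made for the sketch. Neither is claimed to match the exact statistic of Price et al (2010). Thresholds are taken on observed minor allele counts, and all names and toy data are illustrative.

```python
import random


def burden_z(counts, phenotype):
    """Score-type statistic relating per-individual rare allele counts
    to a phenotype (case/control coded 1/0, or a quantitative trait)."""
    mean_y = sum(phenotype) / len(phenotype)
    num = sum((y - mean_y) * c for y, c in zip(phenotype, counts))
    den = sum(c * c for c in counts) ** 0.5
    return num / den if den > 0 else 0.0


def vt_statistic(genotypes, phenotype):
    """Maximum |z| over all thresholds on the minor allele count."""
    n_var = len(genotypes[0])
    allele_counts = [sum(row[i] for row in genotypes) for i in range(n_var)]
    best = 0.0
    for threshold in sorted({c for c in allele_counts if c > 0}):
        keep = [i for i in range(n_var) if 0 < allele_counts[i] <= threshold]
        counts = [sum(row[i] for i in keep) for row in genotypes]
        best = max(best, abs(burden_z(counts, phenotype)))
    return best


def vt_permutation_p(genotypes, phenotype, n_perm=1000, seed=0):
    """Permute phenotypes and repeat the search over thresholds each time,
    so the p-value accounts for having taken the maximum."""
    rng = random.Random(seed)
    observed = vt_statistic(genotypes, phenotype)
    shuffled = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if vt_statistic(genotypes, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: 10 individuals, 3 variants with minor allele counts 1, 1 and 4
genotypes = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1], [0, 0, 0],
             [0, 0, 0], [0, 0, 1], [0, 0, 0], [0, 0, 1], [0, 0, 0]]
phenotype = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

stat, p_value = vt_permutation_p(genotypes, phenotype)
print(f"max |z| = {stat:.2f}, permutation p = {p_value:.3f}")
```

Repeating the maximization inside every permutation is what keeps the test valid. Comparing the maximum against a single-threshold null distribution would overstate significance.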

Additional Complications!
- What to do if a gene includes some rare alleles that increase risk, others that decrease it?
- What sort of signal do you expect?
- What sort of strategies might identify these signals?

Summary
- Analysis of individual rare variants requires very large samples.
- Power may be increased substantially by combining information across variants.
- Strategy for combining information across variants allows for many tweaks.
- This is an extremely active research area.

Recommended Reading
- Li and Leal (2008) Am J Hum Genet 83:311-321
- Zawistowski M, Gopalakrishnan S, Ding J, Li Y, Grimm S, Zöllner S (2010) Am J Hum Genet 87:604-617