TADA: Analyzing De Novo, Transmission and Case-Control Sequencing Data

Each person inherits mutations from their parents, some of which may predispose the person to certain diseases. In addition, new mutations may arise spontaneously during reproduction, and when such de novo mutations disrupt key genes, they may increase the risk of disease. TADA (Transmission And De novo Association test) is a Bayesian model that combines data from de novo mutations, inherited variants in families, and standing variants in the population (identified through case-control studies). This approach significantly increases the power of gene discovery, as we demonstrated in studies of exome sequencing data for Autism Spectrum Disorder (ASD).

Website: http://wpicr.wpic.pitt.edu/wpiccompgen/

Author: Xin He <xinhe2@gmail.com>, Lane Center of Computational Biology, Carnegie Mellon University

Reference: Integrated Model of De Novo and Inherited Genetic Variants Yields Greater Power to Identify Risk Genes, Xin He, et al., PLoS Genetics, 2013

TADA-Denovo: TADA can also be used to analyze only the de novo mutations from exome sequencing data. This makes the analysis considerably easier to run: the program is easier to parameterize and much faster. We created a specialized version of TADA for this purpose, called TADA-Denovo. Below we describe the use of TADA and TADA-Denovo in two separate sections, so you can decide which program best suits your needs.

The files in the package include:
TADA.R: R functions of TADA.
TADA_demo.R: R code demonstrating the use of TADA, using the data of Autism Spectrum Disorder (ASD).
TADA_denovo.pdf: explains the advantages of using TADA-Denovo for analyzing de novo mutations.
TADA_denovo_demo.R: R code demonstrating the use of TADA-Denovo.
ASC_2231trios_1333trans_1601cases_5397controls.csv: the ASD data used for the demonstration code.
known_asd_genes.csv: a short list of 20 published ASD genes.
TADA_results.csv: the results of running TADA on the ASD data.
TADA_denovo_results.csv: the results of running TADA-Denovo on the ASD data.

Background

In this section, we explain the background you need to use the software. Note that if you plan to use TADA-Denovo only, you can skip the explanations in this section about variant counts in the transmission and case-control data.

Variant collapsing and categories:

In TADA, all mutations/variants of a given type (e.g. loss-of-function, or LoF) in a gene are collapsed and effectively treated as a single variant. We can therefore talk about the relative risk (called gamma in the model) and the allele frequency (called q) of this variant. TADA generally considers two types of variants: LoF and missense. In our experiments, we further restrict missense variants to those predicted to be "probably damaging" to protein function by PolyPhen-2 (denoted as mis3 variants).

Variant counts:

The main input of the TADA function (see below) is the variant counts of each gene to be tested. For LoF variants, the counts of a gene consist of three numbers: the number of de novo LoF mutations in trios, the number of LoF variants in cases, and the number of LoF variants in controls. In the sample file, ASC_2231trios_1333trans_1601cases_5397controls.csv, each row provides the counts of one gene. The columns are named dn.lof, case.lof, ctrl.lof.

Transmission data are easily incorporated in TADA. The number of transmitted variants is treated the same as the case count, and the number of nontransmitted variants is treated the same as the control count. If you have transmission data, add the transmitted count to the case count and the nontransmitted count to the control count before calling the TADA function, and modify the sample sizes accordingly. If you do not have transmission data, simply skip this step.

TADA-Denovo

When one has only de novo mutations from family data, TADA-Denovo is the program to use. A simple approach to analyzing de novo data is the Poisson test on the number of de novo mutations in a gene (comparing it with the number expected from the estimated mutation rate).
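For comparison, the Poisson test mentioned above can be sketched in a few lines of R. This is a minimal illustration, not code from the package: the function name poisson.denovo.test and the example counts are hypothetical, and the expected count is taken to be 2*N*mu (two transmitted chromosomes per child), which is an assumption about how the per-gene mutation rate is defined.

```r
# Hypothetical helper (not part of TADA.R): one-sided Poisson test for an
# excess of de novo mutations in a single gene.
#   x  : observed number of de novo mutations in the gene
#   N  : number of families (trios)
#   mu : per-gene mutation rate (assumed per chromosome per generation)
poisson.denovo.test <- function(x, N, mu) {
  expected <- 2 * N * mu  # assumption: two chromosomes per child
  # P(X >= x) under Poisson(expected); ppois(x - 1, ...) gives P(X > x - 1)
  ppois(x - 1, lambda = expected, lower.tail = FALSE)
}

# Example with made-up numbers: 3 de novo hits in 2231 trios
p.value <- poisson.denovo.test(x = 3, N = 2231, mu = 2e-5)
```

This test treats every mutation in the gene alike, which is the limitation that the annotation-aware model described next is meant to address.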
The main benefit of TADA-Denovo is that it can take advantage of the functional annotations of the mutations; for example, a de novo nonsense mutation is weighted more heavily than a de novo missense mutation. We explain the rationale and the model details of TADA-Denovo in the file TADA_denovo.pdf. This package includes code that illustrates the use of TADA-Denovo; please see the file TADA_denovo_demo.R.

Running TADA-Denovo

In the section "Application of TADA-Denovo" of the demo file, we compute Bayes factors (BFs) and p-values for a set of genes. This code can be slightly modified for your analysis. The main function is:

TADA.denovo(counts, N, mu, mu.frac, gamma.mean, beta)

counts: the count data, an m x K matrix, where m is the number of genes and K is the number of mutational categories. counts[i,j] is the number of de novo mutations in the j-th category of the i-th gene.
N: the sample size, i.e. the number of families.
mu: the mutation rates of the genes (an m-dimensional vector).
mu.frac: a K-dimensional vector; each element is multiplied by the gene-level mutation rate to obtain the mutation rate of the corresponding mutational category.
gamma.mean: the mean relative risks (RR), one value per mutational category.
beta: the other parameter of the RR distribution. The RR of a gene follows the Gamma distribution Gamma(gamma.mean*beta, beta), i.e. shape gamma.mean*beta and rate beta, so that the mean is gamma.mean.
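To make the role of gamma.mean and beta concrete, the sketch below computes a Bayes factor for one gene and one mutational category. It is an illustration under stated assumptions, not the code in TADA.R: it assumes the de novo count is Poisson with mean 2*N*mu under the null and Poisson with mean 2*N*mu*RR under the alternative, with RR drawn from the Gamma prior above, so that the marginal distribution under the alternative is negative binomial. The function name denovo.bf.sketch and the example values are hypothetical.

```r
# Hypothetical sketch (not part of TADA.R): Bayes factor for the de novo
# count of one gene in one mutational category.
# Assumed model:
#   null:        x ~ Poisson(2 * N * mu)
#   alternative: x ~ Poisson(2 * N * mu * RR), RR ~ Gamma(gamma.mean * beta, beta)
denovo.bf.sketch <- function(x, N, mu, gamma.mean, beta) {
  lambda0 <- 2 * N * mu
  lik0 <- dpois(x, lambda0)
  # Integrating the Poisson rate over the Gamma prior gives a negative binomial
  lik1 <- dnbinom(x, size = gamma.mean * beta, prob = beta / (beta + lambda0))
  lik1 / lik0
}

# Example with made-up values: 0, 1 and 2 de novo mutations in 1000 families
bf <- denovo.bf.sketch(x = 0:2, N = 1000, mu = 1e-5, gamma.mean = 20, beta = 1)
```

Under these assumptions a gene with no de novo mutations receives a BF slightly below 1, and each additional mutation raises the BF sharply when gamma.mean is large.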

The results of running this function are the BFs of all input genes, in exactly the same order. It is possible to obtain p-values, though we recommend the Bayesian FDR control procedure described below. The function

TADAp.denovo(counts, N, mu, mu.frac, gamma.mean, beta, l=100)

computes the p-values by generating random mutational data. In other words, for each gene, we use its mutation rate to sample the number of de novo mutations in this gene, assuming it is not a susceptibility gene. This sampling procedure is repeated l times, and we apply TADA-Denovo to the sampled data to obtain the null distribution of BFs. Typically l = 100 should be sufficient for whole-exome sequencing data. The minimum p-value that can be obtained is approximately 1/(20000 × 100) = 5 × 10^-7 (assuming a total of 20,000 human genes).

To control the FDR, we use a Bayesian approach called the Direct Posterior Approach [1], which determines the threshold of BFs at a given FDR. We provide code in the software for the convenience of users:

Bayesian.FDR(BF, pi0)

BF: BFs sorted in decreasing order.
pi0: the prior probability that the null hypothesis is true.

The results (in the field "FDR") are the q-values of the input BFs, in the same order.

Model parameterization

The section "Estimation of de novo parameters using Method of Moment approach" of the demo file explains how a user could set the parameters of TADA-Denovo.

First, the mutation rate of a gene is defined as its total single-nucleotide substitution rate. The mutation rates of the input genes should be provided in the input file. In our analysis of the ASD data, the mutation rates of all human genes were based on [2]; users may of course obtain the rates from other sources. In addition, since TADA works on each type of mutation (LoF or missense) separately, we need to specify the rate of each type of mutation as a fraction of the total gene-level mutation rate.
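The scaling of gene-level rates by category fractions can be illustrated as follows. This is a minimal sketch: the gene names and gene-level rates are hypothetical, the fractions 0.074 and 0.32 are the ASD estimates quoted in this section, and the factor 2*N in the expected counts is an assumption (two chromosomes per child).

```r
# Hypothetical gene-level mutation rates (made-up genes and values)
mu <- c(GENE_A = 2.0e-5, GENE_B = 5.0e-6)

# Fractions of the gene-level rate assigned to each category
# (LoF and mis3 values as reported for the ASD analysis)
mu.frac <- c(lof = 0.074, mis3 = 0.32)

# m x K matrix of category-specific mutation rates: mu[i] * mu.frac[j]
mu.by.category <- outer(mu, mu.frac)

# Expected de novo counts per gene and category in N families under the null
# (assumption: two chromosomes per child)
N <- 2231
expected.dn <- 2 * N * mu.by.category
```

The resulting matrix has one row per gene and one column per category, matching the layout of the counts matrix passed to TADA.denovo.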
In our analysis of the ASD data, we used the number of de novo mutations in a control dataset (unaffected siblings) to obtain these relative fractions (see the Methods section of our paper). For LoF mutations, the fraction is 0.074 of the total gene mutation rate, and for mis3 mutations (missense mutations predicted to be probably damaging by PolyPhen-2), it is 0.32 of the gene mutation rate.

Next, we estimate the two parameters related to the RR, gamma.mean and beta, for each variant category. This is explained in the demo code, and we encourage users to read TADA_denovo.pdf for the details of how they should be estimated. The key function is:

denovo.mom(N, mu, C, beta, k)

N: the number of families.
C: the observed number of de novo mutations (for a given category).
beta: the beta parameter of the RR distribution.
k: the number of susceptibility genes.
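One way such a method-of-moments step could work is sketched below. It is not the packaged implementation, and several things are assumptions: that mu holds the category-specific rates, that expected counts are 2*N*mu, that a fraction k/m of genes are risk genes whose RR follows the Gamma prior, and that gamma.mean is obtained by matching the total observed count C to its expectation. The name mom.sketch and all numbers are hypothetical.

```r
# Hypothetical sketch (not the packaged denovo.mom).
#   N    : number of families
#   mu   : category-specific mutation rates of all genes (assumed)
#   C    : observed total number of de novo mutations in the category
#   beta : rate parameter of the Gamma prior on RR
#   k    : assumed number of susceptibility genes
mom.sketch <- function(N, mu, C, beta, k) {
  m <- length(mu)
  pi.risk <- k / m                       # fraction of risk genes
  nu <- C / (2 * N * sum(mu))            # observed / expected under the null
  gamma.mean <- (nu - 1) / pi.risk + 1   # solves nu = 1 + pi.risk * (gamma.mean - 1)
  lambda0 <- 2 * N * mu
  # P(at least two hits) for a non-risk gene (Poisson) and a risk gene
  # (negative binomial, i.e. Poisson mixed over the Gamma prior)
  p0 <- ppois(1, lambda0, lower.tail = FALSE)
  p1 <- pnbinom(1, size = gamma.mean * beta,
                prob = beta / (beta + lambda0), lower.tail = FALSE)
  list(gamma.mean = gamma.mean, M = sum(pi.risk * p1 + (1 - pi.risk) * p0))
}

# Made-up example: 1000 genes, LoF rates, 20 observed LoF de novo mutations,
# 2 observed multiple-hit genes; scan k and keep the best match.
mu.lof <- rep(c(5e-6, 1e-5, 2e-5, 4e-5), times = 250) * 0.074
k.grid <- seq(20, 200, by = 20)
M.expected <- sapply(k.grid, function(k) mom.sketch(2231, mu.lof, C = 20, beta = 1, k = k)$M)
k.best <- k.grid[which.min(abs(M.expected - 2))]
```

This is only one reading of the procedure; for actual analyses, use the packaged function denovo.mom.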

The results of denovo.mom are the expected number of genes with more than one de novo mutation in the given category, or simply multiple-hit genes (the field "M"), and the mean relative risk for the given parameters (the field "gamma.mean"). The basic strategy of parameter estimation is to run this function at different values of k and choose the value that minimizes the difference between the expected and the observed number of multiple-hit genes.

Finally, we also need the value of pi0, the prior probability that the null hypothesis is true. This follows from the previous step, which estimates k, the number of susceptibility genes: k divided by the total number of genes gives (1-pi0). Note that pi0 only needs to be estimated once, for LoF mutations.

Simulation

In the section "Simulation to assess the power of TADA.denovo" of the demo code, we illustrate how to use simulation for power analysis. The main function is:

eval.tada.denovo(N, mu, mu.frac, pi, gamma.mean, beta, gamma.mean.est, beta.est, FDR=0.1)

N: the number of families.
mu.frac: the constants by which the total mutation rates are multiplied.
pi: the fraction of susceptibility genes.
gamma.mean, beta: the parameters of the RR distribution used in generating the simulation data.
gamma.mean.est, beta.est: the parameters used by the TADA-Denovo function.
FDR: the desired FDR level.

The function returns the expected number of discoveries at the given FDR level.

TADA

When one has both de novo mutations and inherited data (either from transmitted variants called from sequencing data of families, or from case-control data, or both), TADA is able to take advantage of all the data. We encourage users to read the section on TADA-Denovo first, as a number of points are shared between the two programs, and we believe it is always good to run TADA-Denovo first even if one has the full data.
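One step shared by both programs is the Bayesian FDR control performed by Bayesian.FDR(BF, pi0). The sketch below shows one way this step could work; it is not the packaged code. It assumes that each BF is converted to a posterior probability of the null, pi0 / (pi0 + (1 - pi0) * BF), and that the q-value of a gene is the average of these probabilities over all genes ranked at or above it. The function name bayesian.fdr.sketch and the example values are hypothetical.

```r
# Hypothetical sketch (not the packaged Bayesian.FDR): q-values from BFs.
#   BF  : Bayes factors, in any order
#   pi0 : prior probability that a gene is not a risk gene
bayesian.fdr.sketch <- function(BF, pi0) {
  ord <- order(BF, decreasing = TRUE)
  # Posterior probability of the null for each gene, strongest evidence first
  q0 <- pi0 / (pi0 + (1 - pi0) * BF[ord])
  # Expected false discovery proportion when calling the top i genes
  fdr.sorted <- cumsum(q0) / seq_along(q0)
  # Return the q-values in the original input order
  fdr <- numeric(length(BF))
  fdr[ord] <- fdr.sorted
  fdr
}

# Made-up example: pi0 derived from an assumed k = 2000 risk genes out of 20000
pi0 <- 1 - 2000 / 20000
q.values <- bayesian.fdr.sketch(BF = c(1, 100, 10), pi0 = pi0)
significant <- q.values < 0.1
```

Unlike the packaged function, this sketch does not require the BFs to be sorted beforehand, since it sorts internally and restores the input order.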
In our experience, de novo data are generally more reliable and informative than inherited data, probably because (1) de novo mutations tend to have higher relative risks, and (2) case-control data are susceptible to population stratification. This package includes code that illustrates the use of TADA; please see the file TADA_demo.R.

Running TADA

The section "Application of TADA" in the demo file illustrates how to use TADA to obtain BFs for a given set of genes. The main function is:

TADA(counts, N, mu, mu.frac, hyperpar)

counts: an m x 3K matrix, where m is the number of genes and K is the number of variant categories. Each category has three numbers: de novo, case and control.
N: the sample sizes, with three values for de novo, case and control, respectively.
mu.frac: a K-dimensional vector; each element is multiplied by the gene-level mutation rate to obtain the mutation rate of the corresponding mutational category.
hyperpar: an 8 x K matrix whose rows are the eight parameters (gamma.mean.dn, beta.dn, gamma.mean.cc, beta.cc, rho1, nu1, rho0, nu0) and whose columns correspond to the variant categories.

The eight parameters are:

gamma.mean.dn, beta.dn: the parameters of the RR distribution of de novo mutations. The RR of a de novo mutation in a given category follows the distribution Gamma(gamma.mean.dn*beta.dn, beta.dn).
gamma.mean.cc, beta.cc: the parameters of the RR distribution of inherited variants, defined in the same way as the de novo parameters above.
rho1, nu1: the parameters of the distribution of q (the frequency of a certain type of variant) under the alternative model (the gene is a risk gene). The prior distribution is Gamma(rho1, nu1).
rho0, nu0: the parameters of the distribution of q under the null model.

The results of running this function are the BFs of all input genes, in the same order. FDR control can be implemented using the Bayesian procedure explained in the TADA-Denovo section.

To obtain p-values, use the function

TADAp(counts, N, mu, mu.frac, hyperpar, l=100)

This is similar to the TADAp.denovo() function described in the previous section, except that we also generate randomized inherited data (equivalent to permuting the case-control labels) in addition to randomized de novo mutations. See the relevant part of the TADA-Denovo section.

Model parameterization

The section "Estimation of parameters of the prior distributions" of the demo file explains how a user could set the parameters of TADA.
Please also read the section "Transmission And De novo Association test (TADA)" in the Supplement of our paper (to be added). For the parameters related to de novo mutations, we refer users to the relevant part of the TADA-Denovo section above.

For the RR parameters of the inherited variants, we assume that a set of genes known to be involved in the disease of interest is available. We then use the fold enrichment of the variants in cases vs. controls as the approximate mean RR (gamma.mean.cc). The method is generally not very sensitive to the parameter beta.cc, so we suggest choosing a value such that the prior RR distribution falls in a reasonable range (e.g. most of the probability mass lies between 1 and 5). However, if there is no evidence that the inherited variants of a certain category are enriched in cases over controls for the known risk genes (or evidence of transmission disequilibrium), we suggest simply ignoring this type of variant by setting gamma.mean.cc=1 and beta.cc=1000 (some arbitrarily large number).

For the prior parameters of q, we suggest estimating the mean frequency of a variant category; this equals rho1/nu1 and rho0/nu0 (we assume they are equal). We then choose nu1 and nu0 to be numbers that are small relative to the sample size, e.g. 100 or 200.
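The guidelines above can be turned into a hyperparameter matrix along the following lines. This is a sketch with hypothetical values throughout (the RR means, beta values and mean frequencies are placeholders, not estimates from any dataset); only the settings gamma.mean.cc=1, beta.cc=1000 for an ignored category and nu of 200 come from the recommendations in this section. It assumes the Gamma distributions are parameterized by shape and rate.

```r
# Hypothetical values for K = 2 categories (LoF and mis3); placeholders only
gamma.mean.dn <- c(lof = 20, mis3 = 4.7)
beta.dn       <- c(lof = 1,  mis3 = 1)

# Inherited variants: LoF uses a fold enrichment of 2.3 (made up);
# mis3 is effectively ignored via gamma.mean.cc = 1 and a large beta.cc
gamma.mean.cc <- c(lof = 2.3, mis3 = 1)
beta.cc       <- c(lof = 4,   mis3 = 1000)

# Check that the LoF prior on RR puts most of its mass between 1 and 5
shape <- gamma.mean.cc[["lof"]] * beta.cc[["lof"]]
rate  <- beta.cc[["lof"]]
mass.1.to.5 <- pgamma(5, shape = shape, rate = rate) -
  pgamma(1, shape = shape, rate = rate)

# Prior on q: mean frequency = rho / nu, with nu small relative to sample size
mean.q <- c(lof = 1e-4, mis3 = 5e-4)   # made-up mean frequencies
nu1 <- c(lof = 200, mis3 = 200)
rho1 <- mean.q * nu1
nu0 <- nu1
rho0 <- rho1

# 8 x K matrix: rows are the eight parameters, columns are the categories
hyperpar <- rbind(gamma.mean.dn, beta.dn, gamma.mean.cc, beta.cc,
                  rho1, nu1, rho0, nu0)
```

If mass.1.to.5 turns out to be small for a candidate beta.cc, increasing beta.cc concentrates the prior more tightly around gamma.mean.cc.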

Simulation

In the section "Simulation to assess the power of TADA" of the demo code, we illustrate how to use simulation for power analysis. The main function is:

eval.tada(N, mu, mu.frac, pi, gamma.mean.dn, beta.dn, gamma.mean.cc, beta.cc, rho1, nu1, rho0, nu0, hyperpar.est, FDR=0.1, tradeoff=TRUE)

N: the number of families.
mu.frac: the constants by which the total mutation rates are multiplied.
pi: the fraction of susceptibility genes.
gamma.mean.dn, beta.dn: the parameters of the RR distribution of de novo mutations used in generating the simulation data.
gamma.mean.cc, beta.cc: the parameters of the RR distribution of inherited variants used in generating the simulation data.
rho1, nu1, rho0, nu0: the parameters of the distribution of q (the frequency of a certain type of variant).
hyperpar.est: the parameters used by the TADA function on the simulated data.
FDR: the desired FDR level.
tradeoff: whether to implement the relationship between q and RR during simulation (i.e. variants with higher RR are likely to have lower frequency). Recommended to be TRUE. See the section "Transmission And De novo Association test (TADA)" in the Supplement of our paper.

References

1. Newton, M.A., et al., Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics, 2004. 5(2): p. 155-76.
2. Sanders, S.J., et al., De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature, 2012. 485(7397): p. 237-41.