TADA: Analyzing De Novo, Transmission and Case-Control Sequencing Data

Each person inherits mutations from their parents, some of which may predispose the person to certain diseases. Meanwhile, new mutations may occur spontaneously during the reproductive process, and if they disrupt key genes, such de novo mutations may increase the risk of disease. TADA (Transmission And De novo Association test) is a Bayesian model that combines data from de novo mutations, inherited variants in families, and standing variants in the population (identified with case-control studies). This approach significantly increases the power of gene discovery, as we demonstrated through studies of exome sequencing data of Autism Spectrum Disorder (ASD).

Website:
Author: Xin He <xinhe2@gmail.com>, Lane Center of Computational Biology, Carnegie Mellon University
Reference: Integrated Model of De Novo and Inherited Genetic Variants Yields Greater Power to Identify Risk Genes, Xin He, et al., PLoS Genetics, 2013

TADA-Denovo: It is possible to use TADA to analyze only the de novo mutations from exome sequencing data. This makes the analysis considerably easier to run: the program is easier to parameterize and much faster. We created a specialized version of TADA for this purpose, called TADA-Denovo. Below we describe the use of TADA and TADA-Denovo in two separate sections, and you can decide which program best suits your needs.

The files in the package include:
TADA.R: R functions of TADA.
TADA_demo.R: R code demonstrating the use of TADA, using the data of Autism Spectrum Disorder (ASD).
TADA_denovo.pdf: explains the advantages of using TADA-Denovo for analyzing de novo mutations.
TADA_denovo_demo.R: R code demonstrating the use of TADA-Denovo.
ASC_2231trios_1333trans_1601cases_5397controls.csv: the ASD data used for the demonstration code.
known_asd_genes.csv: a short list of 20 published ASD genes.
TADA_results.csv: the results of running TADA on the ASD data.
TADA_denovo_results.csv: the results of running TADA-Denovo on the ASD data.

Background

In this section, we explain the background you need to understand in order to use the software. Note that if you plan to use TADA-Denovo only, you can skip the explanations in this section about variant counts in the transmission and case-control data.

Variant collapsing and categories:

In TADA, all mutations/variants of a given type (e.g. loss-of-function, or LoF) in a gene are collapsed and effectively treated as a single variant. We can therefore talk about the relative risk (called gamma in the model) and the allele frequency (called q) of this variant. TADA generally considers two types of variants: LoF and missense. In our experiments, we further restrict the missense variants to those that are predicted to be "probably damaging" to the protein function by PolyPhen 2 (denoted as mis3 variants).

Variant counts:

The main input of the TADA function (see below) is the variant counts of each gene to be tested. For LoF variants, the counts of a gene consist of three numbers: the number of de novo LoF mutations in trios, the number of LoF variants in cases, and the number of LoF variants in controls. Transmission data are easily incorporated in TADA: the number of transmitted variants is treated the same as the case count (it is added to the case count), and the number of non-transmitted variants is treated the same as the control count (it is added to the control count). If you do not have transmission data, simply ignore this step.

In the sample file, ASC_2231trios_1333trans_1601cases_5397controls.csv, each row provides the counts of one gene. The columns are named: dn.lof, case.lof, ctrl.lof. If you have transmission data, then before calling the TADA function the number of transmitted alleles should be combined with the case count, and the number of non-transmitted alleles should be combined with the control count. The sample sizes need to be modified accordingly.

TADA-Denovo

When one has only de novo mutations from family data, TADA-Denovo is the program to use. The simple approach to analyzing de novo data is the Poisson test on the number of de novo mutations in a gene (comparing it with the expected number based on the estimated mutation rate).
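For orientation, the simple Poisson test mentioned above can be sketched in a few lines of base R. This is a minimal illustration and not part of the TADA package: the helper name poisson.denovo.p, the factor of 2 (two copies of each gene per child) and all numeric values are assumptions made for the example.

```r
# Poisson test for an excess of de novo mutations in a single gene.
# poisson.denovo.p is a hypothetical helper; all numbers are placeholders.
poisson.denovo.p <- function(x, N, mu) {
  # Assumption: each child carries two copies of the gene, so the expected
  # number of de novo mutations under the null is 2 * N * mu.
  expected <- 2 * N * mu
  # Upper-tail probability P(X >= x) for X ~ Poisson(expected).
  ppois(x - 1, lambda = expected, lower.tail = FALSE)
}

N  <- 1000   # number of trios (hypothetical)
mu <- 1e-5   # mutation rate of the gene (hypothetical)

# p-values for observing at least 0, 1, 2 or 3 de novo mutations.
p.values <- sapply(0:3, poisson.denovo.p, N = N, mu = mu)
print(signif(p.values, 3))
```

The test uses only the total count in a gene, so a nonsense mutation and a missense mutation contribute equally to the evidence.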
The main benefit of TADA-Denovo is that it can take advantage of the functional annotations of the mutations. For example, a de novo nonsense mutation will be weighted more heavily than a de novo missense mutation. We explain the rationale and the model details of TADA-Denovo in the file TADA_denovo.pdf. We include in this package some code that illustrates the use of TADA-Denovo. Please see the file TADA_denovo_demo.R.

Running TADA-Denovo

In the section "Application of TADA-Denovo" of the demo file, we compute Bayes Factors (BFs) and p-values of a set of genes. This code can be slightly modified for your analysis. The main function is:

    TADA.denovo(counts, N, mu, mu.frac, gamma.mean, beta)

counts: the count data, an m x K matrix, where m is the number of genes and K is the number of mutational categories. counts[i,j] is the number of de novo mutations in the j-th category of the i-th gene.
N: the sample size, i.e. the number of families.
mu: the mutation rates of the genes (an m-dimensional vector).
mu.frac: a K-dimensional vector. Each element of this vector is multiplied by the gene-level mutation rate to obtain the mutation rate of a specific mutational category.
gamma.mean: the mean relative risks (RR), one value per mutational category.
beta: the other parameter of the RR distribution. The RR of a gene follows the Gamma distribution Gamma(gamma.mean*beta, beta), whose mean is gamma.mean.
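To make the role of gamma.mean and beta concrete, here is a minimal sketch of a single-category Bayes factor for one gene. It is not the package's TADA.denovo function. It assumes, in line with the published model but not stated in this document, that the de novo count is Poisson with mean 2*N*mu*gamma, that gamma = 1 under the null, and that Gamma(gamma.mean*beta, beta) is parameterized by shape and rate. The helper name bf.denovo.sketch and all numeric values are hypothetical.

```r
# Sketch of a per-gene, single-category Bayes factor for de novo counts.
# bf.denovo.sketch is a hypothetical helper, not part of the TADA package.
bf.denovo.sketch <- function(x, N, mu, gamma.mean, beta) {
  # Assumption: expected count under the null is 2 * N * mu.
  lambda <- 2 * N * mu
  # Integrating the Poisson likelihood over
  # gamma ~ Gamma(shape = gamma.mean * beta, rate = beta)
  # gives a negative binomial marginal likelihood.
  lik1 <- dnbinom(x, size = gamma.mean * beta, prob = beta / (beta + lambda))
  # Under the null the relative risk is fixed at 1.
  lik0 <- dpois(x, lambda)
  lik1 / lik0
}

gamma.mean <- 20   # prior mean RR (placeholder)
beta <- 1          # second parameter of the RR prior (placeholder)

# Central 90% interval of the RR prior.
print(qgamma(c(0.05, 0.95), shape = gamma.mean * beta, rate = beta))

# Bayes factors for 0 to 3 de novo mutations; mu here is already the
# category-specific rate (gene-level rate times the category fraction).
bf <- bf.denovo.sketch(0:3, N = 1000, mu = 1e-6, gamma.mean = gamma.mean, beta = beta)
print(signif(bf, 3))
```

In this sketch a gene with no mutations gets a BF slightly below 1, and each additional mutation multiplies the evidence. A larger beta concentrates the prior around gamma.mean.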

The results of running this function are the BFs of all input genes, in exactly the same order.

It is possible to obtain p-values, though we recommend the Bayesian FDR control procedure described below. The function

    TADAp.denovo(counts, N, mu, mu.frac, gamma.mean, beta, l=100)

computes the p-values by generating random mutational data. In other words, for each gene, we use its mutation rate to sample the number of de novo mutations in this gene, assuming it is not a susceptibility gene. This sampling procedure is repeated l times, and we apply TADA-Denovo to the sampled data to obtain the null distribution of BFs. Typically l = 100 should be sufficient for whole exome sequencing data. The minimum p-value that can be obtained is approximately 1/(100 x 20,000) = 5 x 10^-7 (assuming a total of 20,000 human genes).

To control the FDR, we use a Bayesian approach called the Direct Posterior Approach [1], which determines the threshold of BFs at a given FDR. We provide code in the software for the convenience of users:

    Bayesian.FDR(BF, pi0)

BF: BFs sorted in decreasing order.
pi0: the prior probability that the null hypothesis is true.

The results (in the field FDR) are the q-values of the input BFs, in the same order.

Model parameterization

The section "Estimation of de novo parameters using Method of Moment approach" of the demo file explains how a user could set the parameters of TADA-Denovo. First, the mutation rate of a gene is defined as the total single nucleotide substitution rate. The mutation rates of the input genes should be provided in the input file. In our analysis of ASD data, the mutation rates of all human genes were based on [2]. Users can of course obtain the rates from other resources. In addition, since TADA works on each type of mutation (LoF or missense) separately, we need to specify the rate of each type of mutation, as a fraction of the total gene-level mutation rate.
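As a small illustration of how the gene-level rates and the category fractions fit together, the following base R sketch builds the per-category mutation rates for a few genes. The gene names, rates and fractions are placeholders chosen for the example, not values from the ASD analysis.

```r
# Per-category mutation rates from gene-level rates.
# Gene names, rates and fractions are illustrative placeholders.
mu <- c(GENE_A = 2.0e-5, GENE_B = 5.0e-6, GENE_C = 1.2e-5)  # gene-level rates
mu.frac <- c(lof = 0.1, mis3 = 0.3)                         # hypothetical fractions

# outer() multiplies every gene-level rate by every category fraction,
# giving an m x K matrix: one row per gene, one column per category.
mu.by.category <- outer(mu, mu.frac)
print(mu.by.category)
```

The resulting matrix has the same m x K shape as the counts matrix, so each count can be compared with the rate of its own gene and category.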
In our analysis of ASD data, we use the number of de novo mutations in a control dataset (unaffected siblings) to obtain these relative fractions (see the Methods section of our paper). For LoF mutations, this is [value missing] of the total gene mutation rate, and for mis3 (probably damaging mutations predicted by PolyPhen), this is 0.32 of the gene mutation rate.

Next, we estimate the two parameters related to the RR, gamma.mean and beta, for each variant category. This is explained in the demo code, and we encourage users to read TADA_denovo.pdf for the details of how they should be estimated. The key function is:

    denovo.mom(N, mu, C, beta, k)

N: the number of families.
C: the observed number of de novo mutations (for a given category).
beta: the beta parameter of the RR distribution.
k: the number of susceptibility genes.

The results of this function are: the expected number of genes with more than one de novo mutation in the given category, or simply multiple-hit genes (the field M), and the mean relative risk for the given parameters (the field gamma.mean). The basic strategy of parameter estimation is to run this function at different values of k and choose the value that minimizes the difference between the expected and the observed number of multiple-hit genes.

Finally, we also need the value of pi0, the prior probability that the null hypothesis is true. This follows from the previous step that estimates k, the number of susceptibility genes: the value of k divided by the total number of genes gives (1-pi0). Note that pi0 only needs to be estimated once, for LoF mutations.

Simulation

In the section "Simulation to assess the power of TADA.denovo" of the demo code, we illustrate how to use simulation to do power analysis. The main function is:

    eval.tada.denovo(N, mu, mu.frac, pi, gamma.mean, beta, gamma.mean.est, beta.est, FDR=0.1)

N: the number of families.
mu.frac: the constants multiplied by the total mutation rates.
pi: the fraction of susceptibility genes.
gamma.mean, beta: the parameters of the RR distribution used in generating the simulation data.
gamma.mean.est, beta.est: the parameters used by the TADA-Denovo function.
FDR: the desired FDR level.

The function returns the expected number of discoveries at the given FDR level.

TADA

When one has both de novo mutations and inherited data (either from transmitted variants called from sequencing data of families, or from case-control data, or both), TADA is able to take advantage of all the data. We encourage users to read the section on TADA-Denovo first, as a number of points are shared between the two, and we believe it is always good to run TADA-Denovo first even if one has the full data.
Our experience is that the de novo data are generally more reliable and informative than the inherited data, probably because (1) the de novo mutations tend to have higher relative risks; (2) the case-control data are susceptible to population stratification. We include in this package some code that illustrates the use of TADA. Please see the file TADA_demo.R.

Running TADA

The section "Application of TADA" in the demo file illustrates how to use TADA to obtain the BFs of a given set of genes. The main function is:

    TADA(counts, N, mu, mu.frac, hyperpar)

counts: m x 3K matrix, where m is the number of genes and K is the number of variant categories. Each category has three numbers: de novo, case and control.
N: sample sizes, with three values for de novo, case and control, respectively.
mu.frac: a K-dimensional vector. Each element of this vector is multiplied by the gene-level mutation rate to obtain the mutation rate of a specific mutational category.
hyperpar: 8 x K matrix, where each row is a vector of 8 parameters: (gamma.mean.dn, beta.dn, gamma.mean.cc, beta.cc, rho1, nu1, rho0, nu0), and each column corresponds to one variant category.

The eight parameters are:

gamma.mean.dn, beta.dn: the parameters of the RR distribution of de novo mutations. The RR of a de novo mutation in a given category follows the distribution Gamma(gamma.mean.dn*beta.dn, beta.dn).
gamma.mean.cc, beta.cc: the parameters of the RR distribution of inherited variants, similar to the de novo parameters defined above.
rho1, nu1: the parameters of the distribution of q (the frequency of a certain type of variant) under the alternative model (the gene is a risk gene). The prior distribution is Gamma(rho1, nu1).
rho0, nu0: the parameters of the distribution of q under the null model.

The results of running this function are the BFs of all input genes, in the same order. The FDR control can be implemented using the Bayesian procedure explained before. To obtain p-values, one can use the function TADAp(counts, N, mu, mu.frac, hyperpar, l=100). This is similar to the TADAp.denovo() function described in the previous section, except that we also generate randomized inherited data (equivalent to permutation of case-control labels) in addition to randomized de novo mutations. See the relevant part of the previous section about TADA-Denovo.

Model parameterization

The section "Estimation of parameters of the prior distributions" of the demo file explains how a user could set the parameters of TADA.
Please also read the section "Transmission And De novo Association test (TADA)" in the Supplement of our paper (to be added). For the parameters related to de novo mutations, we refer users to the relevant part of the TADA-Denovo section above.

For the RR parameters of the inherited variants, we assume that a set of genes known to be involved in the disease of interest is available. We then simply use the fold-enrichment of the variants in cases vs. controls as the approximate mean RR (gamma.mean.cc). The method is generally not very sensitive to the parameter beta.cc, so we suggest choosing a value such that the prior RR distribution falls in a reasonable range (e.g. most of the probability mass lies above 1 but below 5). However, if there is no evidence that the inherited variants of a certain category are enriched in cases over controls for the known risk genes (and no evidence of transmission disequilibrium), we suggest simply ignoring this type of variant by setting gamma.mean.cc=1 and beta.cc=1000 (some arbitrarily large number).

For the prior parameters of q, we suggest estimating the mean frequency of a variant category, which should equal both rho1/nu1 and rho0/nu0 (we assume they are equal). We then choose nu1 and nu0 to be numbers that are small relative to the sample size, e.g. 100 or 200.
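The two rules of thumb above can be checked numerically with a short base R sketch. It is an illustration under stated assumptions, not the procedure in the demo file: the way the mean frequency is estimated (variants per gene per sample), the helper name prior.mass.1.to.5, the shape/rate reading of the Gamma prior, and all counts and sample sizes are assumptions made for the example.

```r
# Prior parameters for the variant frequency q: fix nu, then set rho so
# that rho / nu equals the mean frequency. All numbers are placeholders.
n.case <- 1000                 # number of cases (hypothetical)
n.ctrl <- 1000                 # number of controls (hypothetical)
total.variants <- 600          # variants of one category, summed over genes
n.genes <- 20000

# Assumption: the mean frequency is estimated as variants per gene per sample.
q.mean <- total.variants / (n.genes * (n.case + n.ctrl))
nu  <- 200                     # small relative to the sample size
rho <- q.mean * nu             # so that rho / nu equals q.mean

# Share of prior mass with RR between 1 and 5 under
# Gamma(shape = gamma.mean.cc * beta.cc, rate = beta.cc).
# prior.mass.1.to.5 is a hypothetical helper for choosing beta.cc.
prior.mass.1.to.5 <- function(gamma.mean.cc, beta.cc) {
  shape <- gamma.mean.cc * beta.cc
  pgamma(5, shape = shape, rate = beta.cc) -
    pgamma(1, shape = shape, rate = beta.cc)
}

# Compare a few candidate values of beta.cc for a mean RR of 2.
candidates <- c(1, 4, 16)
mass <- sapply(candidates, function(b) prior.mass.1.to.5(2, b))
print(data.frame(beta.cc = candidates, mass.between.1.and.5 = round(mass, 3)))
```

With a fixed mean RR, a larger beta.cc narrows the prior, so more of the mass falls inside the target range. The same helper can be used to see how little mass lies away from 1 when gamma.mean.cc=1 and beta.cc is very large.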

Simulation

In the section "Simulation to assess the power of TADA" of the demo code, we illustrate how to use simulation to do power analysis. The main function is:

    eval.tada(N, mu, mu.frac, pi, gamma.mean.dn, beta.dn, gamma.mean.cc, beta.cc, rho1, nu1, rho0, nu0, hyperpar.est, FDR=0.1, tradeoff=TRUE)

N: the number of families.
mu.frac: the constants multiplied by the total mutation rates.
pi: the fraction of susceptibility genes.
gamma.mean.dn, beta.dn: the parameters of the RR distribution of de novo mutations used in generating the simulation data.
gamma.mean.cc, beta.cc: the parameters of the RR distribution of inherited variants used in generating the simulation data.
rho1, nu1, rho0, nu0: the parameters of the distribution of q (the frequency of a certain type of variant).
hyperpar.est: the parameters used by the TADA function on the simulated data.
FDR: the desired FDR level.
tradeoff: whether to implement the relationship between q and RR during simulation (i.e. variants with a higher RR are likely to have a lower frequency). Recommended to be TRUE. See the section "Transmission And De novo Association test (TADA)" in the Supplement of our paper.

References

1. Newton, M.A., et al., Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics, (2).
2. Sanders, S.J., et al., De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature, (7397).

Integrated Model of De Novo and Inherited Genetic Variants Yields Greater Power to Identify Risk Genes Xin He 1, Stephan J. Sanders 2, Li Liu 3, Silvia De Rubeis 4,5, Elaine T. Lim 6,7, James S. Sutcliffe