TADA: Analyzing De Novo, Transmission and Case-Control Sequencing Data

Each person inherits mutations from parents, some of which may predispose the person to certain diseases. Meanwhile, new mutations may occur spontaneously during the reproductive process, and if they disrupt key genes, such de novo mutations may increase the risk of disease. TADA (Transmission And De novo Association test) is a Bayesian model that combines data from de novo mutations, inherited variants in families, and standing variants in the population (identified with case-control studies). This approach significantly increases the power of gene discovery, as we demonstrated through studies of exome sequencing data of Autism Spectrum Disorder (ASD).

Website:

Author: Xin He <xinhe2@gmail.com>, Lane Center of Computational Biology, Carnegie Mellon University

Reference: Integrated Model of De Novo and Inherited Genetic Variants Yields Greater Power to Identify Risk Genes, Xin He, et al., PLoS Genetics, 2013

TADA-Denovo: It is possible to use TADA to analyze only the de novo mutations from exome sequencing data. This makes the analysis considerably easier to run: the program is easier to parameterize and much faster. We created a specialized version of TADA for this purpose, called TADA-Denovo. Below we describe the use of TADA and TADA-Denovo in two separate sections, so you can decide which program best suits your needs.

The files in the package include:

TADA.R: R functions of TADA.
TADA_demo.R: R code demonstrating the use of TADA, using the data of Autism Spectrum Disorder (ASD).
TADA_denovo.pdf: explains the advantages of using TADA-Denovo for analyzing de novo mutations.
TADA_denovo_demo.R: R code demonstrating the use of TADA-Denovo.
ASC_2231trios_1333trans_1601cases_5397controls.csv: the ASD data used for the demonstration code.
known_asd_genes.csv: a short list of 20 published ASD genes.
TADA_results.csv: the results of running TADA on the ASD data.
TADA_denovo_results.csv: the results of running TADA-Denovo on the ASD data.

Background

In this section, we explain the background you need to understand to use the software. Note that if you plan to use TADA-Denovo only, you can skip the explanations in this section about variant counts in the transmission and case/control data.

Variant collapsing and categories:

In TADA, all mutations/variants of a given type (e.g. loss-of-function, or LoF) in a gene are collapsed and effectively treated as a single variant. We can therefore talk about the relative risk (called gamma in the model) and the allele frequency (called q) of this variant. TADA generally considers two types of variants: LoF and missense. In our experiments, we further restrict missense variants to those predicted to be "probably damaging" to protein function by PolyPhen-2 (denoted as mis3 variants).

Variant counts: The main input of the TADA function (see below) is the variant counts of each gene to be tested. For LoF variants, the counts of a gene consist of three numbers: the number of de novo LoF mutations in trios, the number of LoF variants in cases, and the number of LoF variants in controls. Counts from transmission data are readily added in TADA: the number of transmitted variants is treated the same as the case count (add it to the case count), and the number of nontransmitted variants is treated as the control count (add it to the control count). If you do not have transmission data, simply ignore them. In the sample file, ASC_2231trios_1333trans_1601cases_5397controls.csv, each row provides the counts of one gene. The columns are named: dn.lof, case.lof, ctrl.lof. If you have transmission data, combine the counts in this way before calling the TADA function, and modify the sample sizes accordingly.

TADA-Denovo

When one only has de novo mutations from family data, TADA-Denovo is the program to use. The simple approach to analyzing de novo data is a Poisson test on the number of de novo mutations in a gene (comparing it with the expected number based on the estimated mutation rate).
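As a rough illustration of the simple Poisson test mentioned above, the following R sketch computes a one-sided p-value for a single gene. It assumes that the expected count under the null is 2 * N * mu (two transmitted haplotypes per child). The function name poisson.denovo.test and the example numbers are hypothetical and are not part of the TADA package.

```r
# Sketch of a one-sided Poisson test for an excess of de novo mutations in a gene.
# Assumption: under the null, the count follows Poisson(2 * N * mu), where N is the
# number of trios and mu is the mutation rate of the gene for the category tested.
# poisson.denovo.test is a hypothetical helper, not a function in TADA.R.
poisson.denovo.test <- function(x, N, mu) {
  expected <- 2 * N * mu
  # P(X >= x) = 1 - P(X <= x - 1)
  p.value <- ppois(x - 1, lambda = expected, lower.tail = FALSE)
  list(expected = expected, p.value = p.value)
}

# Example with made-up numbers: 3 de novo mutations observed in 2231 trios,
# category-specific mutation rate 2e-6.
res <- poisson.denovo.test(x = 3, N = 2231, mu = 2e-6)
print(res)
```

This test treats every mutation in the category equally, which is the limitation that the weighting by functional annotation in TADA-Denovo is meant to address.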
The main benefit of TADA-Denovo is that it can take advantage of the functional annotations of the mutations: for example, a de novo nonsense mutation is weighted more than a de novo missense mutation. We explain the rationale and the model details of TADA-Denovo in the file TADA_denovo.pdf. We include in this package some code that illustrates the use of TADA-Denovo; please see the file TADA_denovo_demo.R.

Running TADA-Denovo

In the section "Application of TADA-Denovo" of the demo file, we compute Bayes Factors (BFs) and p-values of a set of genes. This code can be slightly modified for your analysis. The main function is:

TADA.denovo(counts, N, mu, mu.frac, gamma.mean, beta)

counts: the count data, an m x K matrix, where m is the number of genes and K is the number of mutational categories. counts[i,j] is the number of de novo mutations in the j-th category of the i-th gene.
N: the sample size, i.e. the number of families.
mu: the mutation rates of the genes (m-dimensional vector).
mu.frac: a K-dimensional vector. Each element is multiplied by the gene-level mutation rate to obtain the mutation rate of a specific mutational category.
gamma.mean: the mean relative risks (RR), one value per mutational category.
beta: the other parameter of the RR distribution. The RR of a gene follows the Gamma distribution: Gamma(gamma.mean*beta, beta).
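To make the role of gamma.mean and beta more concrete, here is a minimal R sketch of a de novo Bayes factor under a Poisson-Gamma model. It is not the code in TADA.R. It assumes that the count in a category follows Poisson(2 * N * mu * mu.frac * RR), that RR follows Gamma(shape = gamma.mean * beta, rate = beta) under the alternative and equals 1 under the null, that beta has one value per category, and that categories are combined by multiplying their BFs. The name bf.denovo.sketch and all numbers in the example are hypothetical.

```r
# Sketch of a de novo Bayes factor under a Poisson-Gamma model (assumptions above).
# With a Gamma prior on the relative risk, the marginal distribution of the count
# under the alternative is negative binomial; under the null it is Poisson.
# bf.denovo.sketch is a hypothetical name, not a function in TADA.R.
bf.denovo.sketch <- function(counts, N, mu, mu.frac, gamma.mean, beta) {
  counts <- as.matrix(counts)
  BF <- rep(1, nrow(counts))
  for (j in seq_len(ncol(counts))) {
    lambda <- 2 * N * mu * mu.frac[j]  # expected null count of each gene
    lik1 <- dnbinom(counts[, j], size = gamma.mean[j] * beta[j],
                    prob = beta[j] / (beta[j] + lambda))
    lik0 <- dpois(counts[, j], lambda)
    BF <- BF * lik1 / lik0             # assumed: categories combine by product
  }
  BF
}

# Made-up example: two genes, two categories (the fraction 0.05 is a placeholder).
counts <- rbind(c(0, 0), c(2, 1))
bf <- bf.denovo.sketch(counts, N = 2231, mu = c(1e-5, 1e-5),
                       mu.frac = c(0.05, 0.32),
                       gamma.mean = c(20, 5), beta = c(1, 1))
print(bf)
```

In this example, the gene without any de novo mutation gets a BF slightly below 1, and the gene with three mutations gets a BF far above 1.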

The results of running this function are the BFs of all input genes, in exactly the same order. It is possible to obtain p-values, though we recommend the Bayesian FDR control procedure described below. The function

TADAp.denovo(counts, N, mu, mu.frac, gamma.mean, beta, l=100)

computes the p-values by generating random mutational data. In other words, for each gene, we use its mutation rate to sample the number of de novo mutations in this gene, assuming it is not a susceptibility gene. This sampling procedure is repeated l times, and we apply TADA-Denovo to the sampled data to obtain the null distribution of BFs. Typically l = 100 should be sufficient for whole exome sequencing data. The minimum p-value that can be obtained is approximately 1/(100 x 20,000) = 5 x 10^-7 (assuming a total of 20,000 human genes).

To control the FDR, we use a Bayesian approach, called the Direct Posterior Approach [1], which determines the threshold of BFs at a given FDR. We provide code in the software for the convenience of users:

Bayesian.FDR(BF, pi0)

BF: BFs sorted in decreasing order.
pi0: the prior probability that the null hypothesis is true.

The results (in the field "FDR") are the q-values of the input BFs, in the same order.

Model parameterization

The section "Estimation of de novo parameters using Method of Moment approach" of the demo file explains how a user could set the parameters of TADA-Denovo. First, the mutation rate of a gene is defined as the total single nucleotide substitution rate. The mutation rates of the input genes should be provided in the input file. In our analysis of ASD data, the mutation rates of all human genes were based on [2]. Users could of course obtain the rates from other resources. In addition, since TADA works on each type of mutation (LoF or missense) separately, we need to specify the rate of each type of mutation as a fraction of the total gene-level mutation rate.
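As a small illustration of how gene-level rates and per-category fractions fit together, the following R sketch builds an m x K matrix of category-specific mutation rates. The gene names, the gene-level rates and the fractions used here are made-up placeholders, and the expected counts assume two transmitted haplotypes per child (2 * N * rate).

```r
# Sketch: category-specific mutation rates from gene-level rates and mu.frac.
# All values below are made-up placeholders for illustration only.
mu <- c(GENE_A = 1.0e-5, GENE_B = 2.5e-5, GENE_C = 4.0e-6)  # gene-level rates
mu.frac <- c(lof = 0.05, mis3 = 0.30)                      # placeholder fractions

# m x K matrix: entry [i, j] is the mutation rate of category j in gene i
mu.cat <- outer(mu, mu.frac)

# Expected de novo counts in N trios if no gene carried any risk,
# assuming an expected count of 2 * N * rate
N <- 2231
expected.null <- 2 * N * mu.cat
print(signif(expected.null, 3))
```

Summing a column of expected.null over all genes gives the total number of de novo mutations of that category expected by chance, which can be compared with the observed total as a sanity check on the chosen fractions.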
In our analysis of ASD data, we use the number of de novo mutations in a control dataset (unaffected siblings) to obtain these relative fractions (see the Methods section of our paper). For LoF mutations, this is [...] of the total gene mutation rate, and for mis3 (probably damaging mutations predicted by PolyPhen), this is 0.32 of the gene mutation rate.

Next, we estimate the two parameters related to the RR, gamma.mean and beta, for each variant category. This is explained in the demo code, and we encourage users to read TADA_denovo.pdf for the details of how they should be estimated. The key function is:

denovo.mom(N, mu, C, beta, k)

N: the number of families.
C: the observed number of de novo mutations (for a given category).
beta: the beta parameter of the RR distribution.
k: the number of susceptibility genes.

The results of this function are: the expected number of genes with more than one de novo mutation in the given category, or simply multiple-hit genes (the field "M"), and the mean relative risk for the given parameters (the field "gamma.mean"). The basic strategy of parameter estimation is to run this function at different values of k and choose the value that minimizes the difference between the expected and the observed number of multiple-hit genes.

Finally, we also need the value of pi0, the prior probability that the null hypothesis is true. This follows from the previous step, which estimates k, the number of susceptibility genes: the value of k divided by the total number of genes gives (1-pi0). Note that pi0 only needs to be estimated once, for LoF mutations.

Simulation

In the section "Simulation to assess the power of TADA.denovo" of the demo code, we illustrate how to use simulation to do power analysis. The main function is:

eval.tada.denovo(N, mu, mu.frac, pi, gamma.mean, beta, gamma.mean.est, beta.est, FDR=0.1)

N: the number of families.
mu.frac: the constants multiplied by the total mutation rates.
pi: the fraction of susceptibility genes.
gamma.mean, beta: the parameters of the RR distribution used in generating the simulation data.
gamma.mean.est, beta.est: the parameters used by the TADA-Denovo function.
FDR: the desired FDR level.

The function returns the expected number of discoveries at the given FDR level.

TADA

When one has both de novo mutations and inherited data (either from transmitted variants called from sequencing data of families, or from case-control data, or both), TADA is able to take advantage of all the data. We encourage users to read the section on TADA-Denovo first, as a number of points are shared between the two, and we believe it's always good to run TADA-Denovo first even if one has the full data.
Our experience is that the de novo data are generally more reliable and informative than the inherited data, probably because (1) the de novo mutations tend to have higher relative risks; (2) the case-control data are susceptible to population stratification. We include in this package some code that illustrates the use of TADA; please see the file TADA_demo.R.

Running TADA

The section "Application of TADA" in the demo file illustrates how to use TADA to obtain BFs of a given set of genes. The main function is:

TADA(counts, N, mu, mu.frac, hyperpar)

counts: m x 3K matrix, where m is the number of genes and K is the number of variant categories. Each category has three numbers: de novo, case and control.
N: sample sizes, with three values for de novo, case and control, respectively.
mu.frac: a K-dimensional vector. Each element is multiplied by the gene-level mutation rate to obtain the mutation rate of a specific mutational category.
hyperpar: a matrix of hyperparameters (given as 8 x K), holding one vector of 8 parameters per variant category: (gamma.mean.dn, beta.dn, gamma.mean.cc, beta.cc, rho1, nu1, rho0, nu0).

The eight parameters are:

gamma.mean.dn, beta.dn: the parameters of the RR distribution of de novo mutations. The RR of a de novo mutation in a given category follows the distribution Gamma(gamma.mean.dn*beta.dn, beta.dn).
gamma.mean.cc, beta.cc: the parameters of the RR distribution of inherited variants, similar to the de novo parameters defined above.
rho1, nu1: the parameters of the distribution of q (the frequency of a certain type of variant) under the alternative model (the gene is a risk gene). The prior distribution is Gamma(rho1, nu1).
rho0, nu0: the parameters of the q distribution under the null model.

The results of running this function are the BFs of all input genes, in the same order. FDR control can be implemented using the Bayesian procedure explained before. To obtain p-values, one can use the function

TADAp(counts, N, mu, mu.frac, hyperpar, l=100)

This is similar to the TADAp.denovo() function described in the previous section, except that we also generate randomized inherited data (equivalent to permutation of case-control labels) in addition to randomized de novo mutations. See the relevant part of the previous section about TADA-Denovo.

Model parameterization

The section "Estimation of parameters of the prior distributions" of the demo file explains how a user could set the parameters of TADA.
Please also read the section "Transmission And De novo Association test (TADA)" in the Supplement of our paper (to be added). For the parameters related to de novo mutations, we refer users to the relevant part of the TADA-Denovo section above.

For the RR parameters of the inherited variants, we assume that a set of genes known to be involved in the disease of interest is available. We then simply use the fold-enrichment of the variants in cases vs. controls as the approximate mean RR (gamma.mean.cc). The method is generally not very sensitive to the parameter beta.cc, so we suggest choosing a value such that the prior RR distribution falls in a reasonable range (e.g. most of the probability mass lies between 1 and 5). However, if there is no evidence that the inherited variants of a certain category are enriched in cases over controls for the known risk genes (and no evidence of transmission disequilibrium), we suggest simply ignoring this type of variant by setting gamma.mean.cc=1 and beta.cc=1000 (some arbitrarily large number).

For the prior parameters of q, we suggest estimating the mean frequency of a variant category, which should equal both rho1/nu1 and rho0/nu0 (we assume they are equal). We then choose nu1 and nu0 to be numbers that are small relative to the sample size, e.g. 100 or 200.
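The heuristics above for the inherited-variant parameters can be sketched in R as follows. This is only an illustration under stated assumptions: fold-enrichment is computed as the ratio of per-sample variant rates in cases and controls, the RR prior is taken as Gamma(shape = gamma.mean.cc * beta.cc, rate = beta.cc), and the helper names prior.cc.sketch and rr.mass, as well as all numbers, are hypothetical and not part of TADA.R.

```r
# Sketch of the heuristics for the inherited-variant hyperparameters.
# All counts, sample sizes and frequencies are made up.
prior.cc.sketch <- function(case.count, ctrl.count, N.case, N.ctrl,
                            q.mean, nu = 200) {
  # Fold-enrichment in cases vs. controls over the known risk genes,
  # used as the approximate mean relative risk
  gamma.mean.cc <- (case.count / N.case) / (ctrl.count / N.ctrl)
  # Choose rho so that the prior mean of q, rho / nu, equals the mean frequency
  rho <- q.mean * nu
  c(gamma.mean.cc = gamma.mean.cc, rho1 = rho, nu1 = nu, rho0 = rho, nu0 = nu)
}

# Prior probability that the relative risk lies between lower and upper
# for a candidate value of beta.cc
rr.mass <- function(gamma.mean.cc, beta.cc, lower = 1, upper = 5) {
  shape <- gamma.mean.cc * beta.cc
  pgamma(upper, shape = shape, rate = beta.cc) -
    pgamma(lower, shape = shape, rate = beta.cc)
}

pars <- prior.cc.sketch(case.count = 30, ctrl.count = 40,
                        N.case = 2000, N.ctrl = 6000, q.mean = 1e-4)
mass <- rr.mass(pars[["gamma.mean.cc"]], beta.cc = 4)
print(pars)
print(mass)
```

Trying a few values of beta.cc with rr.mass is one way to check whether most of the prior mass falls in the intended range before running the full analysis.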

Simulation

In the section "Simulation to assess the power of TADA" of the demo code, we illustrate how to use simulation to do power analysis. The main function is:

eval.tada(N, mu, mu.frac, pi, gamma.mean.dn, beta.dn, gamma.mean.cc, beta.cc, rho1, nu1, rho0, nu0, hyperpar.est, FDR=0.1, tradeoff=TRUE)

N: the number of families.
mu.frac: the constants multiplied by the total mutation rates.
pi: the fraction of susceptibility genes.
gamma.mean.dn, beta.dn: the parameters of the RR distribution of de novo mutations used in generating the simulation data.
gamma.mean.cc, beta.cc: the parameters of the RR distribution of inherited variants used in generating the simulation data.
rho1, nu1, rho0, nu0: the parameters of the distribution of q (the frequency of a certain type of variant).
hyperpar.est: the parameters used by the TADA function on the simulated data.
FDR: the desired FDR level.
tradeoff: whether to implement the relationship between q and RR during simulation (i.e. variants with higher RR are likely to have lower frequency). Recommended to be TRUE. See the section "Transmission And De novo Association test (TADA)" in the Supplement of our paper.

References

1. Newton, M.A., et al., Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics, (2).
2. Sanders, S.J., et al., De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature, (7397).


Supplementary Figure 1: Features of IGLL5 Mutations in CLL: a) Representative IGV screenshot of first Supplementary Figure 1: Features of IGLL5 Mutations in CLL: a) Representative IGV screenshot of first intron IGLL5 mutation depicting biallelic mutations. Red arrows highlight the presence of out of phase

More information

Population Genetics Simulation Lab

Population Genetics Simulation Lab Name Period Assignment # Pre-lab: annotate each paragraph Population Genetics Simulation Lab Evolution occurs in populations of organisms and involves variation in the population, heredity, and differential

More information

Package xseq. R topics documented: September 11, 2015

Package xseq. R topics documented: September 11, 2015 Package xseq September 11, 2015 Title Assessing Functional Impact on Gene Expression of Mutations in Cancer Version 0.2.1 Date 2015-08-25 Author Jiarui Ding, Sohrab Shah Maintainer Jiarui Ding

More information

A Likelihood-Based Framework for Variant Calling and De Novo Mutation Detection in Families

A Likelihood-Based Framework for Variant Calling and De Novo Mutation Detection in Families A Likelihood-Based Framework for Variant Calling and De Novo Mutation Detection in Families Bingshan Li 1 *, Wei Chen 2, Xiaowei Zhan 3, Fabio Busonero 3,4, Serena Sanna 4, Carlo Sidore 4, Francesco Cucca

More information

PSYCH-GA.2211/NEURL-GA.2201 Fall 2016 Mathematical Tools for Cognitive and Neural Science. Homework 5

PSYCH-GA.2211/NEURL-GA.2201 Fall 2016 Mathematical Tools for Cognitive and Neural Science. Homework 5 PSYCH-GA.2211/NEURL-GA.2201 Fall 2016 Mathematical Tools for Cognitive and Neural Science Homework 5 Due: 21 Dec 2016 (late homeworks penalized 10% per day) See the course web site for submission details.

More information

Integrated Bayesian analysis of rare exonic variants to identify risk genes for schizophrenia and neurodevelopmental disorders

Integrated Bayesian analysis of rare exonic variants to identify risk genes for schizophrenia and neurodevelopmental disorders Nguyen et al. Genome Medicine (217) 9:114 DOI 1.1186/s1373-17-497-y RESEARCH Open Access Integrated Bayesian analysis of rare exonic variants to identify risk genes for schizophrenia and neurodevelopmental

More information

Introduction to linkage and family based designs to study the genetic epidemiology of complex traits. Harold Snieder

Introduction to linkage and family based designs to study the genetic epidemiology of complex traits. Harold Snieder Introduction to linkage and family based designs to study the genetic epidemiology of complex traits Harold Snieder Overview of presentation Designs: population vs. family based Mendelian vs. complex diseases/traits

More information

Reporting TP53 gene analysis results in CLL

Reporting TP53 gene analysis results in CLL Reporting TP53 gene analysis results in CLL Mutations in TP53 - From discovery to clinical practice in CLL Discovery Validation Clinical practice Variant diversity *Leroy at al, Cancer Research Review

More information

SubLasso:a feature selection and classification R package with a. fixed feature subset

SubLasso:a feature selection and classification R package with a. fixed feature subset SubLasso:a feature selection and classification R package with a fixed feature subset Youxi Luo,3,*, Qinghan Meng,2,*, Ruiquan Ge,2, Guoqin Mai, Jikui Liu, Fengfeng Zhou,#. Shenzhen Institutes of Advanced

More information

Introduction to Bayesian Analysis 1

Introduction to Bayesian Analysis 1 Biostats VHM 801/802 Courses Fall 2005, Atlantic Veterinary College, PEI Henrik Stryhn Introduction to Bayesian Analysis 1 Little known outside the statistical science, there exist two different approaches

More information

Package cssam. February 19, 2015

Package cssam. February 19, 2015 Type Package Package cssam February 19, 2015 Title cssam - cell-specific Significance Analysis of Microarrays Version 1.2.4 Date 2011-10-08 Author Shai Shen-Orr, Rob Tibshirani, Narasimhan Balasubramanian,

More information

Naïve Bayes classification in R

Naïve Bayes classification in R Big-data Clinical Trial Column age 1 of 5 Naïve Bayes classification in R Zhongheng Zhang Department of Critical Care Medicine, Jinhua Municipal Central Hospital, Jinhua Hospital of Zhejiang University,

More information

Answers to end of chapter questions

Answers to end of chapter questions Answers to end of chapter questions Chapter 1 What are the three most important characteristics of QCA as a method of data analysis? QCA is (1) systematic, (2) flexible, and (3) it reduces data. What are

More information

Research Methods in Forest Sciences: Learning Diary. Yoko Lu December Research process

Research Methods in Forest Sciences: Learning Diary. Yoko Lu December Research process Research Methods in Forest Sciences: Learning Diary Yoko Lu 285122 9 December 2016 1. Research process It is important to pursue and apply knowledge and understand the world under both natural and social

More information

Burning debate: What s the best way to nab real autism genes?

Burning debate: What s the best way to nab real autism genes? OPINION, VIEWPOINT Burning debate: What s the best way to nab real autism genes? BY BRIAN O'ROAK 27 JUNE 2017 Over the past 10 years researchers have made tremendous progress in understanding the genetic

More information

Mediation Analysis With Principal Stratification

Mediation Analysis With Principal Stratification University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 3-30-009 Mediation Analysis With Principal Stratification Robert Gallop Dylan S. Small University of Pennsylvania

More information

Epigenetics. Jenny van Dongen Vrije Universiteit (VU) Amsterdam Boulder, Friday march 10, 2017

Epigenetics. Jenny van Dongen Vrije Universiteit (VU) Amsterdam Boulder, Friday march 10, 2017 Epigenetics Jenny van Dongen Vrije Universiteit (VU) Amsterdam j.van.dongen@vu.nl Boulder, Friday march 10, 2017 Epigenetics Epigenetics= The study of molecular mechanisms that influence the activity of

More information

Hands-On Ten The BRCA1 Gene and Protein

Hands-On Ten The BRCA1 Gene and Protein Hands-On Ten The BRCA1 Gene and Protein Objective: To review transcription, translation, reading frames, mutations, and reading files from GenBank, and to review some of the bioinformatics tools, such

More information

1. Create a mutation rate table from intergenic SNPs for all possible trinucleotide to trinucleotide changes

1. Create a mutation rate table from intergenic SNPs for all possible trinucleotide to trinucleotide changes 1. Create a mutation rate table from intergenic SNPs for all possible trinucleotide to trinucleotide changes ATCGGCTGG ATCGACTGG CCTAGCTAA CCTGGCTAA CTCACCGGA CTCACTGGA Change AAA ACA AAA AGA AAA ATA AAC

More information

Practical Bayesian Design and Analysis for Drug and Device Clinical Trials

Practical Bayesian Design and Analysis for Drug and Device Clinical Trials Practical Bayesian Design and Analysis for Drug and Device Clinical Trials p. 1/2 Practical Bayesian Design and Analysis for Drug and Device Clinical Trials Brian P. Hobbs Plan B Advisor: Bradley P. Carlin

More information

4. Model evaluation & selection

4. Model evaluation & selection Foundations of Machine Learning CentraleSupélec Fall 2017 4. Model evaluation & selection Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr

More information

Chapter 8: Two Dichotomous Variables

Chapter 8: Two Dichotomous Variables Chapter 8: Two Dichotomous Variables On the surface, the topic of this chapter seems similar to what we studied in Chapter 7. There are some subtle, yet important, differences. As in Chapter 5, we have

More information

PSSV User Manual (V2.1)

PSSV User Manual (V2.1) PSSV User Manual (V2.1) 1. Introduction A novel pattern-based probabilistic approach, PSSV, is developed to identify somatic structural variations from WGS data. Specifically, discordant and concordant

More information

Transmission Disequilibrium Methods for Family-Based Studies Daniel J. Schaid Technical Report #72 July, 2004

Transmission Disequilibrium Methods for Family-Based Studies Daniel J. Schaid Technical Report #72 July, 2004 Transmission Disequilibrium Methods for Family-Based Studies Daniel J. Schaid Technical Report #72 July, 2004 Correspondence to: Daniel J. Schaid, Ph.D., Harwick 775, Division of Biostatistics Mayo Clinic/Foundation,

More information

Quantitative genetics: traits controlled by alleles at many loci

Quantitative genetics: traits controlled by alleles at many loci Quantitative genetics: traits controlled by alleles at many loci Human phenotypic adaptations and diseases commonly involve the effects of many genes, each will small effect Quantitative genetics allows

More information

Nature Genetics: doi: /ng Supplementary Figure 1. Mutational signatures in BCC compared to melanoma.

Nature Genetics: doi: /ng Supplementary Figure 1. Mutational signatures in BCC compared to melanoma. Supplementary Figure 1 Mutational signatures in BCC compared to melanoma. (a) The effect of transcription-coupled repair as a function of gene expression in BCC. Tumor type specific gene expression levels

More information

Gene Expression Analysis Web Forum. Jonathan Gerstenhaber Field Application Specialist

Gene Expression Analysis Web Forum. Jonathan Gerstenhaber Field Application Specialist Gene Expression Analysis Web Forum Jonathan Gerstenhaber Field Application Specialist Our plan today: Import Preliminary Analysis Statistical Analysis Additional Analysis Downstream Analysis 2 Copyright

More information

Bayesian Prediction Tree Models

Bayesian Prediction Tree Models Bayesian Prediction Tree Models Statistical Prediction Tree Modelling for Clinico-Genomics Clinical gene expression data - expression signatures, profiling Tree models for predictive sub-typing Combining

More information

Mutation Detection and CNV Analysis for Illumina Sequencing data from HaloPlex Target Enrichment Panels using NextGENe Software for Clinical Research

Mutation Detection and CNV Analysis for Illumina Sequencing data from HaloPlex Target Enrichment Panels using NextGENe Software for Clinical Research Mutation Detection and CNV Analysis for Illumina Sequencing data from HaloPlex Target Enrichment Panels using NextGENe Software for Clinical Research Application Note Authors John McGuigan, Megan Manion,

More information

Section 6: Analysing Relationships Between Variables

Section 6: Analysing Relationships Between Variables 6. 1 Analysing Relationships Between Variables Section 6: Analysing Relationships Between Variables Choosing a Technique The Crosstabs Procedure The Chi Square Test The Means Procedure The Correlations

More information

Field wide development of analytic approaches for sequence data

Field wide development of analytic approaches for sequence data Benjamin Neale Field wide development of analytic approaches for sequence data Cohort Allelic Sum Test (CAST; Hobbs, Cohen and others) Li and Leal (AJHG) Madsen and Browning (PLoS Genetics) C alpha and

More information

Lecture 20. Disease Genetics

Lecture 20. Disease Genetics Lecture 20. Disease Genetics Michael Schatz April 12 2018 JHU 600.749: Applied Comparative Genomics Part 1: Pre-genome Era Sickle Cell Anaemia Sickle-cell anaemia (SCA) is an abnormality in the oxygen-carrying

More information

IN SILICO EVALUATION OF DNA-POOLED ALLELOTYPING VERSUS INDIVIDUAL GENOTYPING FOR GENOME-WIDE ASSOCIATION STUDIES OF COMPLEX DISEASE.

IN SILICO EVALUATION OF DNA-POOLED ALLELOTYPING VERSUS INDIVIDUAL GENOTYPING FOR GENOME-WIDE ASSOCIATION STUDIES OF COMPLEX DISEASE. IN SILICO EVALUATION OF DNA-POOLED ALLELOTYPING VERSUS INDIVIDUAL GENOTYPING FOR GENOME-WIDE ASSOCIATION STUDIES OF COMPLEX DISEASE By Siddharth Pratap Thesis Submitted to the Faculty of the Graduate School

More information

Package HAP.ROR. R topics documented: February 19, 2015

Package HAP.ROR. R topics documented: February 19, 2015 Type Package Title Recursive Organizer (ROR) Version 1.0 Date 2013-03-23 Author Lue Ping Zhao and Xin Huang Package HAP.ROR February 19, 2015 Maintainer Xin Huang Depends R (>=

More information

Asingle inherited mutant gene may be enough to

Asingle inherited mutant gene may be enough to 396 Cancer Inheritance STEVEN A. FRANK Asingle inherited mutant gene may be enough to cause a very high cancer risk. Single-mutation cases have provided much insight into the genetic basis of carcinogenesis,

More information

Comparison of Gene Set Analysis with Various Score Transformations to Test the Significance of Sets of Genes

Comparison of Gene Set Analysis with Various Score Transformations to Test the Significance of Sets of Genes Comparison of Gene Set Analysis with Various Score Transformations to Test the Significance of Sets of Genes Ivan Arreola and Dr. David Han Department of Management of Science and Statistics, University

More information

ARTICLE RESEARCH. Macmillan Publishers Limited. All rights reserved

ARTICLE RESEARCH. Macmillan Publishers Limited. All rights reserved Extended Data Figure 6 Annotation of drivers based on clinical characteristics and co-occurrence patterns. a, Putative drivers affecting greater than 10 patients were assessed for enrichment in IGHV mutated

More information

In-house* validation of Qualitative Methods

In-house* validation of Qualitative Methods Example from Gilbert de Roy In-house* validation of Qualitative Methods Aspects from a non forensic but analytical chemist *In-house in your own laboratory Presented at ENFSI, European Paint group meeting

More information

(ii) The effective population size may be lower than expected due to variability between individuals in infectiousness.

(ii) The effective population size may be lower than expected due to variability between individuals in infectiousness. Supplementary methods Details of timepoints Caió sequences were derived from: HIV-2 gag (n = 86) 16 sequences from 1996, 10 from 2003, 45 from 2006, 13 from 2007 and two from 2008. HIV-2 env (n = 70) 21

More information

Bayes Factors for t tests and one way Analysis of Variance; in R

Bayes Factors for t tests and one way Analysis of Variance; in R Bayes Factors for t tests and one way Analysis of Variance; in R Dr. Jon Starkweather It may seem like small potatoes, but the Bayesian approach offers advantages even when the analysis to be run is not

More information

What can genetic studies tell us about ADHD? Dr Joanna Martin, Cardiff University

What can genetic studies tell us about ADHD? Dr Joanna Martin, Cardiff University What can genetic studies tell us about ADHD? Dr Joanna Martin, Cardiff University Outline of talk What do we know about causes of ADHD? Traditional family studies Modern molecular genetic studies How can

More information

Lab 5: Testing Hypotheses about Patterns of Inheritance

Lab 5: Testing Hypotheses about Patterns of Inheritance Lab 5: Testing Hypotheses about Patterns of Inheritance How do we talk about genetic information? Each cell in living organisms contains DNA. DNA is made of nucleotide subunits arranged in very long strands.

More information

BST227 Introduction to Statistical Genetics. Lecture 4: Introduction to linkage and association analysis

BST227 Introduction to Statistical Genetics. Lecture 4: Introduction to linkage and association analysis BST227 Introduction to Statistical Genetics Lecture 4: Introduction to linkage and association analysis 1 Housekeeping Homework #1 due today Homework #2 posted (due Monday) Lab at 5:30PM today (FXB G13)

More information

CS2220 Introduction to Computational Biology

CS2220 Introduction to Computational Biology CS2220 Introduction to Computational Biology WEEK 8: GENOME-WIDE ASSOCIATION STUDIES (GWAS) 1 Dr. Mengling FENG Institute for Infocomm Research Massachusetts Institute of Technology mfeng@mit.edu PLANS

More information

Reducing INDEL calling errors in whole genome and exome sequencing data.

Reducing INDEL calling errors in whole genome and exome sequencing data. Reducing INDEL calling errors in whole genome and exome sequencing data. Han Fang November 8, 2014 CSHL Biological Data Science Meeting Acknowledgments Lyon Lab Yiyang Wu Jason O Rawe Laura J Barron Max

More information

Introduction to Computational Neuroscience

Introduction to Computational Neuroscience Introduction to Computational Neuroscience Lecture 11: Attention & Decision making Lesson Title 1 Introduction 2 Structure and Function of the NS 3 Windows to the Brain 4 Data analysis 5 Data analysis

More information

New Enhancements: GWAS Workflows with SVS

New Enhancements: GWAS Workflows with SVS New Enhancements: GWAS Workflows with SVS August 9 th, 2017 Gabe Rudy VP Product & Engineering 20 most promising Biotech Technology Providers Top 10 Analytics Solution Providers Hype Cycle for Life sciences

More information