Single SNP/Gene Analysis. Typical Results of GWAS Analysis (Single SNP Approach)


High-Throughput Sequencing Course
Gene-Set Analysis
Biostatistics and Bioinformatics, Summer 2018

Section 1 Introduction

What is Gene Set Analysis?
Many names for gene set analysis:
- Pathway analysis
- Gene set enrichment analysis
- GO-term analysis
- Gene list enrichment analysis

Single SNP/Gene Analysis
- SNP/Gene: $X_1, X_2, \ldots, X_p$
- Phenotype: $Y$
- Study the relationship between $X_i$ and $Y$:
  $Y = \beta_{i0} + \beta_{i1} X_i + Z$, or $\mathrm{logit}\{P(Y = 1)\} = \beta_{i0} + \beta_{i1} X_i$, or other GLMs
- Obtain the p-value $P_i$ corresponding to the significance level of $\beta_{i1}$
- Threshold the p-values

Typical Results of GWAS Analysis (Single SNP Approach)
Figure: An example from Dumitrescu et al. (2011)
Figure: An example from Gibson (2010)

Gene Set Analysis (GSA)
An analysis that investigates the relationship between a disease phenotype and a set of genes grouped on the basis of shared biological or functional properties.
- Gene set: a set of genes, for example:
  - Genes involved in a pathway
  - Genes corresponding to a Gene Ontology term
  - Genes reported in a paper to share certain similarities

Goal of GSA
- Goal: give one number that measures the significance of a gene set as a whole
- Are many genes in the pathway differentially expressed (up-regulated/down-regulated)?
- What is the probability of observing these changes just by chance?

Why GSA?
- Single SNP approach: list the top 2-5 most-significant SNPs and their neighboring genes
  - Assumption 1: a single gene acts alone to substantially increase disease susceptibility
  - Assumption 2: the most associated gene is the best candidate for therapeutic intervention
- GSA approach: list the pathways whose genes show a consistent trend in affecting the phenotype
  - Assumption 1: multiple genes in the same pathway work together to confer disease susceptibility
  - Assumption 2: targeting susceptibility pathways has clinical implications for finding additional drug targets

Why GSA?
- Interpretation of genome-wide results
  - Gene sets are (typically) fewer than all the genes and have more descriptive names
  - A long list of significant genes is difficult to manage
- Integrates external information into the analysis
- Less prone to false positives at the gene level
- The top genes might not be the interesting ones; several coordinated smaller changes can matter more
- Detects patterns that would be difficult to discern by manually going through, e.g., the list of differentially expressed genes

Section 2 Statistical Issues

Two Types of Nulls
- Self-contained analysis: none of the genes in the gene set are associated with the phenotype
- Competitive analysis: the genes in the gene set are no more strongly associated with the phenotype than the genes outside the gene set

Figure: Schematic of the two-tier structure of GSA, de Leeuw et al. (2016)

Underlying Mechanism
Figure: de Leeuw et al. (2016)

Self-contained Tests Inflate Type I Error

Section 3 Method: Gen-Gen/GSEA

Gen-Gen/GSEA
- Gen-Gen: Kai Wang, Mingyao Li, and Maja Bucan (Dec 2007). Pathway-based approaches for analysis of genomewide association studies. In: Am J Hum Genet 81(6), pp. 1278-1283. doi: 10.1086/522374
- GSEA: Aravind Subramanian et al. (Oct 2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. In: Proc Natl Acad Sci U S A 102(43), pp. 15545-15550. doi: 10.1073/pnas.0506580102

Microarray Data
Table: one row per subject, with columns Subject ID, Disease phenotype, and Normalized gene expression (cell values not recoverable)
- Calculation: chi-square statistic
- Large chi-square statistics indicate stronger association

Single Nucleotide Polymorphism Data
Table: one row per subject, with columns Subject ID, Disease phenotype, and SNP genotype (cell values not recoverable)
- Calculation: chi-square statistic
- Large chi-square statistics indicate stronger association
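The chi-square calculation for one SNP can be sketched as follows. This is a minimal illustration, not the slides' own code: it assumes a binary phenotype coded 0/1 and genotypes coded 0/1/2 (minor-allele counts), and the function names `genotype_table` and `chi_square_statistic` are hypothetical.

```python
def genotype_table(phenotypes, genotypes):
    """Build a 2 x 3 count table: rows = phenotype 0/1, columns = genotype 0/1/2."""
    table = [[0, 0, 0], [0, 0, 0]]
    for y, g in zip(phenotypes, genotypes):
        table[y][g] += 1
    return table


def chi_square_statistic(table):
    """Pearson chi-square statistic: sum of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:  # skip empty genotype columns
                stat += (observed - expected) ** 2 / expected
    return stat


# Toy data (made up for illustration): 4 cases followed by 4 controls.
phenotypes = [1, 1, 1, 1, 0, 0, 0, 0]
genotypes = [2, 2, 1, 2, 0, 0, 1, 0]
stat = chi_square_statistic(genotype_table(phenotypes, genotypes))
print(stat)
```

Larger values of `stat` indicate stronger association between genotype and phenotype; converting the statistic to a p-value would additionally need the chi-square distribution with the appropriate degrees of freedom.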

Summarize SNP Association on One Gene
- Map SNP $V_i$ to gene $j$ ($G_j$) if the SNP is located within the gene or if the gene is the closest gene to the SNP; in total there are $N$ genes
- When one SNP is located within shared regions of two overlapping genes, the SNP is mapped to both genes
- For each gene, assign the highest statistic value among all SNPs mapped to the gene as the statistic value of the gene: $r_j = \max_{V_i \in G_j} t_i$

Enrichment Score
- A given gene set $S$, with $\mathrm{Card}(S) = N_H$
- Calculate the association chi-square statistics $r_j$, $j = 1, \ldots, N$
- The larger $r_j$ is, the more strongly gene $G_j$ is associated with the phenotype
- Rank the association statistics from the largest to the smallest, denoted by $r_{(1)} \ge r_{(2)} \ge \cdots \ge r_{(N)}$
- Calculate a weighted Kolmogorov-Smirnov-like running sum statistic

$$\mathrm{ES}(S) = \max_{1 \le j \le N} \left\{ \sum_{G_{j^*} \in S,\ j^* \le j} \frac{|r_{(j^*)}|^p}{N_R} - \sum_{G_{j^*} \notin S,\ j^* \le j} \frac{1}{N - N_H} \right\}, \qquad N_R = \sum_{G_{j^*} \in S} |r_{(j^*)}|^p$$

- $p$ is a parameter that gives higher weight to genes with extreme statistics
- Common choice: $p = 1$
- $p = 0$ leads to the regular KS statistic, which is usually not as powerful as $p = 1$
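The two steps above (gene-level maximum, then the weighted running sum) can be sketched in a few lines. This is an illustrative sketch under stated assumptions: SNP statistics and the SNP-to-gene map are passed as plain dictionaries, and the names `gene_statistics` and `enrichment_score` are hypothetical, not taken from any package.

```python
def gene_statistics(snp_stats, snp_to_genes):
    """r_j = max of t_i over all SNPs V_i mapped to gene G_j."""
    gene_stat = {}
    for snp, t in snp_stats.items():
        for gene in snp_to_genes.get(snp, ()):
            gene_stat[gene] = max(t, gene_stat.get(gene, float("-inf")))
    return gene_stat


def enrichment_score(gene_stat, gene_set, p=1.0):
    """Weighted Kolmogorov-Smirnov-like running sum, maximised over ranks."""
    ranked = sorted(gene_stat, key=gene_stat.get, reverse=True)
    in_set = [g in gene_set for g in ranked]
    n_hit = sum(in_set)              # N_H
    n_miss = len(ranked) - n_hit     # N - N_H
    n_r = sum(abs(gene_stat[g]) ** p for g, hit in zip(ranked, in_set) if hit)
    if n_hit == 0 or n_miss == 0 or n_r == 0:
        raise ValueError("gene set must be a non-trivial subset with nonzero weight")
    running, best = 0.0, float("-inf")
    for g, hit in zip(ranked, in_set):
        if hit:
            running += abs(gene_stat[g]) ** p / n_r   # step up at a gene in S
        else:
            running -= 1.0 / n_miss                   # step down otherwise
        best = max(best, running)
    return best


# Toy example (made-up statistics); rs2 lies in two overlapping genes.
snp_stats = {"rs1": 9.0, "rs2": 4.0, "rs3": 1.0, "rs4": 6.0, "rs5": 2.0}
snp_to_genes = {"rs1": ["A"], "rs2": ["A", "B"], "rs3": ["C"],
                "rs4": ["D"], "rs5": ["C"]}
r = gene_statistics(snp_stats, snp_to_genes)
print(enrichment_score(r, {"A", "D"}))
```

Setting `p=0` makes every gene in the set contribute the same step size, which corresponds to the unweighted KS-type statistic.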

Normalized Enrichment Score
- The enrichment score $\mathrm{ES}(S)$ relies on the maximum statistic, so a larger gene set $S$ tends to produce a larger $\mathrm{ES}(S)$
- Two-step normalization procedure:
  1. Permute the phenotype labels of all samples
  2. During each permutation $\pi$, repeat the calculation of the enrichment score $\mathrm{ES}(S, \pi)$
- Then

$$\mathrm{NES}(S) = \frac{\mathrm{ES}(S) - \mathrm{mean}\{\mathrm{ES}(S, \pi)\}}{\mathrm{sd}\{\mathrm{ES}(S, \pi)\}}$$

- The NES adjusts for different sizes of genes
- The NES preserves correlations between SNPs on the same gene

Type I Error Rate
$H_l$: gene set $S_l$ is not associated with the phenotype, $l = 1, \ldots, m$

                Claim significant   Claim non-significant   Total
  True nulls    N_{10}              N_{00}                  m_0
  False nulls   N_{11}              N_{01}                  m_1
  Total         R                   m - R                   m

- $\mathrm{FDR} = E\{N_{10} / (R \vee 1)\}$
- $\mathrm{FWER} = P(N_{10} \ge 1)$

Control FDR
- $\mathrm{NES}^*$: the normalized enrichment score in the observed data

$$\mathrm{FDR} = \frac{\%\ \text{of all}\ (S, \pi)\ \text{with}\ \mathrm{NES}(S, \pi) \ge \mathrm{NES}^*}{\%\ \text{of observed}\ S\ \text{with}\ \mathrm{NES}(S) \ge \mathrm{NES}^*}$$

- Rationale: $\mathrm{FDR} = E\{N_{10} / (R \vee 1)\}$
  - $N_{10}/m$: estimated by the % of all $(S, \pi)$ with $\mathrm{NES}(S, \pi) \ge \mathrm{NES}^*$
  - $R/m$: estimated by the % of observed $S$ with $\mathrm{NES}(S) \ge \mathrm{NES}^*$
- A larger $\mathrm{NES}^*$ corresponds to a smaller FDR
- If $\mathrm{FDR} \le \alpha$, claim the corresponding gene set significant
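The normalization and the permutation-based FDR estimate can be sketched as below. The sketch assumes the enrichment scores have already been computed for the observed data and for each phenotype permutation; the dictionary layout and the names `normalized_scores` and `fdr_estimate` are assumptions made for illustration, and the sample standard deviation is used for sd.

```python
import statistics


def normalized_scores(observed_es, permuted_es):
    """Return (NES(S), NES(S, pi)) using the permutation mean and sd per gene set.

    observed_es: {gene_set_name: ES(S)}
    permuted_es: {gene_set_name: [ES(S, pi) for each permutation pi]}
    """
    nes_obs, nes_perm = {}, {}
    for name, es in observed_es.items():
        mu = statistics.mean(permuted_es[name])
        sd = statistics.stdev(permuted_es[name])
        if sd == 0:
            raise ValueError("permuted scores are constant for " + name)
        nes_obs[name] = (es - mu) / sd
        nes_perm[name] = [(e - mu) / sd for e in permuted_es[name]]
    return nes_obs, nes_perm


def fdr_estimate(nes_star, nes_obs, nes_perm):
    """Fraction of permuted NES >= NES* divided by fraction of observed NES >= NES*."""
    all_perm = [v for values in nes_perm.values() for v in values]
    frac_perm = sum(v >= nes_star for v in all_perm) / len(all_perm)
    frac_obs = sum(v >= nes_star for v in nes_obs.values()) / len(nes_obs)
    if frac_obs == 0:
        raise ValueError("no observed gene set reaches nes_star")
    return min(1.0, frac_perm / frac_obs)


# Toy scores (made up): two gene sets, three permutations each.
observed_es = {"S1": 5.0, "S2": 2.0}
permuted_es = {"S1": [1.0, 2.0, 3.0], "S2": [1.0, 2.0, 3.0]}
nes_obs, nes_perm = normalized_scores(observed_es, permuted_es)
for name, nes_star in nes_obs.items():
    print(name, nes_star, fdr_estimate(nes_star, nes_obs, nes_perm))
```

In practice many more permutations would be used; three are shown only to keep the arithmetic easy to follow.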

Control FWER
- $\mathrm{NES}^*$: the normalized enrichment score in the observed data
- $\mathrm{FWER}$ = % of all $\pi$ with the highest $\mathrm{NES}(S, \pi) \ge \mathrm{NES}^*$
- Rationale: $\mathrm{FWER} = P(N_{10} \ge 1) = E\{I(N_{10} \ge 1)\}$
  - Each permutation $\pi$ can be viewed as a realization of the event
  - If the highest $\mathrm{NES}(S, \pi) \ge \mathrm{NES}^*$, then there is a false rejection
- A larger $\mathrm{NES}^*$ corresponds to a smaller FWER
- If $\mathrm{FWER} \le \alpha$, claim the corresponding gene set significant

Section 4 Method: GAGE

GAGE
- GAGE: Luo et al. (2009)
- Gene expression data: RNA-Seq or microarray

GAGE Method Overview

Setting
- Gene: $i \in \{1, \ldots, N\}$
- Condition/Phenotype: $s \in \{0, 1\}$
  - Paired (1-on-1): e.g., one condition vs. another condition
  - Unpaired (grp-on-grp): e.g., one phenotype vs. another phenotype
- Subject:
  - Paired: $k \in \{1, \ldots, K\}$
  - Unpaired: $k \in \{1, \ldots, K_1\}$ for cases and $k \in \{1, \ldots, K_0\}$ for controls
- Gene expression $G_{s,k,i}$:
  - Microarray: transcription level of gene $i$
  - RNA-Seq: read counts of gene $i$ / total counts

log2 Fold Change
Compare the gene expression between two conditions or two phenotypes:
- Paired (1-on-1): $X_{k,i} = \log_2(G_{1,k,i} / G_{0,k,i})$
- Unpaired (grp-on-grp): $X_i = \log_2(\bar{G}_{1,i} / \bar{G}_{0,i})$
- Efficient but not recommended (1-on-grp): $X_{k,i} = \log_2(G_{1,k,i} / \bar{G}_{0,i})$
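The paired and unpaired comparisons can be sketched as follows, assuming the fold change is taken on the log2 scale as the slide title indicates, that expression values are strictly positive, and that condition 1 goes in the numerator. The function names and the nested-list layout (`data[k][i]` for subject `k`, gene `i`) are hypothetical choices for this illustration.

```python
import math
from statistics import mean


def paired_log2_fc(cond1, cond0):
    """1-on-1: X[k][i] = log2(G_1[k][i] / G_0[k][i]) for each matched pair k."""
    return [[math.log2(a / b) for a, b in zip(row1, row0)]
            for row1, row0 in zip(cond1, cond0)]


def unpaired_log2_fc(cases, controls):
    """grp-on-grp: X[i] = log2(mean over cases / mean over controls) per gene i."""
    case_means = [mean(col) for col in zip(*cases)]
    control_means = [mean(col) for col in zip(*controls)]
    return [math.log2(a / b) for a, b in zip(case_means, control_means)]


# Toy expression values (made up): 2 subjects per group, 2 genes.
cond1 = [[8.0, 2.0], [4.0, 1.0]]
cond0 = [[2.0, 2.0], [2.0, 4.0]]
print(paired_log2_fc(cond1, cond0))
print(unpaired_log2_fc(cond1, cond0))
```

The paired version returns one fold-change vector per subject, which is why a per-subject P-value appears later in the 1-on-1 setting; the unpaired version returns a single vector for the whole comparison.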

Gene Set and T-statistic
- Gene set of interest: $S$
- Mean fold change: $m = \mathrm{mean}_{i \in S}(X_i)$ (gene set) vs. $M = \mathrm{mean}_{i \in \{1, \ldots, N\}}(X_i)$ (all genes)
- Standard deviation of fold change: $s = \mathrm{sd}_{i \in S}(X_i)$ (gene set) vs. $S = \mathrm{sd}_{i \in \{1, \ldots, N\}}(X_i)$ (all genes)
- Number of genes: $n$ (gene set) vs. $N$ (all genes)
- T-statistic:

$$T = \frac{m - M}{\sqrt{s^2/n + S^2/n}}$$

- Remark: this is a two-sample t-test between the gene set of interest, containing $n$ genes, and a virtual random set of the same size derived from the background
- Subscript $k$ is left out for simplicity; the 1-on-1 setting (with subscript $k$) is discussed later

P-Value
- Degrees of freedom of $T$ under the null:

$$df = (n - 1)\,\frac{(s^2 + S^2)^2}{s^4 + S^4}$$

- P-value:
  - Two-sided: pathway set (genes may be heterogeneously regulated in either direction)
  - One-sided: experimental set (genes are regulated in the same direction)
- Alternative choice of $T$: rank-based test (Wilcoxon Mann-Whitney test)

Summarizing P-Values
- Recall that in the 1-on-1 (paired) setting, the P-value for gene set $S$ and subject $k$ is $P_k(S)$
- $X(S) = \sum_k -\log P_k(S)$
- Under the null, the $P_k(S)$ independently follow $\mathrm{Unif}(0, 1)$, and then $X(S)$ follows $\mathrm{Gamma}(K, 1)$
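The T-statistic, its degrees of freedom, and the Gamma-based summary of per-subject P-values can be sketched as below. This is an illustration under stated assumptions rather than the GAGE package's implementation: fold changes are passed as a dictionary keyed by gene, sample standard deviations are used, the logarithm is natural, and the names `gage_t_and_df` and `combine_p_values` are hypothetical. For integer shape K, the upper tail of Gamma(K, 1) has the closed form used in the code.

```python
import math
from statistics import mean, stdev


def gage_t_and_df(fold_changes, gene_set):
    """T = (m - M) / sqrt(s^2/n + S^2/n) and df = (n-1)(s^2+S^2)^2 / (s^4+S^4)."""
    in_set = [fold_changes[g] for g in gene_set]
    background = list(fold_changes.values())
    n = len(in_set)
    m, M = mean(in_set), mean(background)
    s, S = stdev(in_set), stdev(background)
    t = (m - M) / math.sqrt(s ** 2 / n + S ** 2 / n)
    df = (n - 1) * (s ** 2 + S ** 2) ** 2 / (s ** 4 + S ** 4)
    return t, df


def combine_p_values(p_values):
    """X = sum_k -log(P_k); under the null X ~ Gamma(K, 1).

    Returns (X, P(Gamma(K, 1) >= X)), using the integer-shape tail formula
    exp(-x) * sum_{j < K} x^j / j!.
    """
    K = len(p_values)
    x = -sum(math.log(p) for p in p_values)
    tail = math.exp(-x) * sum(x ** j / math.factorial(j) for j in range(K))
    return x, tail


# Toy fold changes (made up) for five genes; the gene set is {g1, g2}.
fold_changes = {"g1": 2.0, "g2": 4.0, "g3": 0.0, "g4": -2.0, "g5": 1.0}
t, df = gage_t_and_df(fold_changes, ["g1", "g2"])
print(t, df)
print(combine_p_values([0.01, 0.20, 0.03]))
```

Turning `t` and `df` into a one-sided or two-sided P-value would require the Student t distribution, which is left out here to keep the sketch free of third-party dependencies.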

Controlling FDR
If multiple gene sets are of interest, multiple testing methods are applied to control the FDR:
- fdrtool: Korbinian Strimmer (July 2008). A unified approach to false discovery rate estimation. In: BMC Bioinformatics 9, p. 303. doi: 10.1186/1471-2105-9-303
- Benjamini and Hochberg (BH) procedure: Y. Benjamini and Y. Hochberg (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. In: Journal of the Royal Statistical Society Series B (Methodological) 57(1), pp. 289-300. issn: 0035-9246. url: http://www.jstor.org/stable/2346101

Section 5 References

Benjamini, Y. and Y. Hochberg (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. In: Journal of the Royal Statistical Society Series B (Methodological) 57(1), pp. 289-300. issn: 0035-9246. url: http://www.jstor.org/stable/2346101

Dumitrescu, Logan et al. (June 2011). Genetic determinants of lipid traits in diverse populations from the population architecture using genomics and epidemiology (PAGE) study. In: PLoS Genet 7(6), e1002138. doi: 10.1371/journal.pgen.1002138

Gibson, Greg (July 2010). Hints of hidden heritability in GWAS. In: Nat Genet 42(7), pp. 558-560. doi: 10.1038/ng0710-558

Leeuw, Christiaan A. de et al. (June 2016). The statistical properties of gene-set analysis. In: Nature Reviews Genetics 17(6), pp. 353-364. issn: 1471-0064. doi: 10.1038/nrg.2016.29. url: http://dx.doi.org/10.1038/nrg.2016.29

Strimmer, Korbinian (July 2008). A unified approach to false discovery rate estimation. In: BMC Bioinformatics 9, p. 303. doi: 10.1186/1471-2105-9-303

Subramanian, Aravind et al. (Oct 2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. In: Proc Natl Acad Sci U S A 102(43), pp. 15545-15550. doi: 10.1073/pnas.0506580102

Wang, Kai, Mingyao Li, and Maja Bucan (Dec 2007). Pathway-based approaches for analysis of genomewide association studies. In: Am J Hum Genet 81(6), pp. 1278-1283. doi: 10.1086/522374