Predication-based Bayesian network analysis of gene sets and knowledge-based SNP abstractions

Similar documents
Single SNP/Gene Analysis. Typical Results of GWAS Analysis (Single SNP Approach) Typical Results of GWAS Analysis (Single SNP Approach)

A quick review. The clustering problem: Hierarchical clustering algorithm: Many possible distance metrics K-mean clustering algorithm:

AVENIO ctdna Analysis Kits The complete NGS liquid biopsy solution EMPOWER YOUR LAB

A Versatile Algorithm for Finding Patterns in Large Cancer Cell Line Data Sets

Using Bayesian Networks to Analyze Expression Data. Xu Siwei, s Muhammad Ali Faisal, s Tejal Joshi, s

Breast cancer. Risk factors you cannot change include: Treatment Plan Selection. Inferring Transcriptional Module from Breast Cancer Profile Data

Mapping evolutionary pathways of HIV-1 drug resistance using conditional selection pressure. Christopher Lee, UCLA

Expanded View Figures

Application of Resampling Methods in Microarray Data Analysis

Study the Evolution of the Avian Influenza Virus

Supervised Learner for the Prediction of Hi-C Interaction Counts and Determination of Influential Features. Tyler Yue Lab

Reliability of Ordination Analyses

Pathway analysis of bladder cancer genome-wide association study identifies novel pathways involved in bladder cancer development Chen et al

CS2220 Introduction to Computational Biology

Case Studies on High Throughput Gene Expression Data Kun Huang, PhD Raghu Machiraju, PhD

Evolutionary Programming

Comparison of Gene Set Analysis with Various Score Transformations to Test the Significance of Sets of Genes

CHAPTER 3 PROBLEM STATEMENT AND RESEARCH METHODOLOGY

Evaluating Classifiers for Disease Gene Discovery

Title: Human breast cancer associated fibroblasts exhibit subtype specific gene expression profiles

Chapter 1. Introduction

Personalized Physiological Models in Intensive-Care Medicine

White Paper Estimating Complex Phenotype Prevalence Using Predictive Models

New Enhancements: GWAS Workflows with SVS

Introduction to Computational Neuroscience

National Academies Next Generation SAMPLE Researchers TITLE Initiative HERE

Nutrigenomics and Personalised Nutrition. John Hesketh

Personalized, Evidence-based, Outcome-driven Healthcare Empowered by IBM Cognitive Computing Technologies. Guotong Xie IBM Research - China

AVENIO family of NGS oncology assays ctdna and Tumor Tissue Analysis Kits

genomics for systems biology / ISB2020 RNA sequencing (RNA-seq)

Overcoming challenges in DNA sample collection

Nature Genetics: doi: /ng Supplementary Figure 1. SEER data for male and female cancer incidence from

Computer Age Statistical Inference. Algorithms, Evidence, and Data Science. BRADLEY EFRON Stanford University, California

T. R. Golub, D. K. Slonim & Others 1999

Title: Vitamin D Receptor Gene Associations with Pulmonary Tuberculosis in a Tibetan Chinese population

Gene and Pathway Analysis of Metabolic Traits in Dairy Cows

Lung Cancer Screening

Probabilistic Reasoning for Medical Decision Support. Omolola Ogunyemi, PhD Director, Center for Biomedical Informatics Charles Drew University

How to Create Better Performing Bayesian Networks: A Heuristic Approach for Variable Selection

Screening for novel oncology biomarker panels using both DNA and protein microarrays. John Anson, PhD VP Biomarker Discovery

Sample Size Estimation for Microarray Experiments

ChIP-seq data analysis

Detecting gene signature activation in breast cancer in an absolute, single-patient manner

AI for Health: A Partnership between Clinician, Patient & Data

2) Cases and controls were genotyped on different platforms. The comparability of the platforms should be discussed.

New methods for discovering common and rare genetic variants in human disease

SNPrints: Defining SNP signatures for prediction of onset in complex diseases

3. Model evaluation & selection

Gene Ontology 2 Function/Pathway Enrichment. Biol4559 Thurs, April 12, 2018 Bill Pearson Pinn 6-057

4. Model evaluation & selection

Introduction to Computational Neuroscience

Nature Methods: doi: /nmeth.3115

Biostatistical modelling in genomics for clinical cancer studies

Identifying the Zygosity Status of Twins Using Bayes Network and Estimation- Maximization Methodology

Package AbsFilterGSEA

On Missing Data and Genotyping Errors in Association Studies

Digitizing the Proteomes From Big Tissue Biobanks

Title: Pinpointing resilience in Bipolar Disorder

NGS IN ONCOLOGY: FDA S PERSPECTIVE

Research in Hepatology What is hot and what is not!

A Vision-based Affective Computing System. Jieyu Zhao Ningbo University, China

* To whom correspondence should be addressed.

Comparing Complex Diagnoses: A Formative Evaluation of the Heart Disease Program

Six Sigma Glossary Lean 6 Society

Dan Koller, Ph.D. Medical and Molecular Genetics

Detection of copy number variations in PCR-enriched targeted sequencing data

MULTIFACTORIAL DISEASES. MG L-10 July 7 th 2014

Metabolomic Data Analysis with MetaboAnalystR

Cancer Gene Panels. Dr. Andreas Scherer. Dr. Andreas Scherer President and CEO Golden Helix, Inc. Twitter: andreasscherer

Modeling gene expression using five histone modifications

Progress in Risk Science and Causality

Supplementary Methods

The Foundations of Personalized Medicine

Innovation from the Foundation Sonya Dumanis, PhD

Large-scale identity-by-descent mapping discovers rare haplotypes of large effect. Suyash Shringarpure 23andMe, Inc. ASHG 2017

Edith T. Moricz COVER PAGE. MBA Founder & CEO of BeyondSuccessOnline.com

Introduction to the Partners Biobank Portal. December 2016

Comparing Multifunctionality and Association Information when Classifying Oncogenes and Tumor Suppressor Genes

Gene Ontology and Functional Enrichment. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein

SAPLING: A Tool for Gene Network Analysis focusing on Psychiatric Genetics

Use Survey Happiness Complete. SAV file on Blackboard Assigned to Mistake 8

The use of diagnostic FFPE material in cancer epidemiology research

Mutational Impact on Diagnostic and Prognostic Evaluation of MDS

Predicting Breast Cancer Survivability Rates

Chapter 1 : Genetics 101

TITLE: A Data-Driven Approach to Patient Risk Stratification for Acute Respiratory Distress Syndrome (ARDS)

Review: Genome assembly Reads

What can we contribute to cancer research and treatment from Computer Science or Mathematics? How do we adapt our expertise for them

Mayuri Takore 1, Prof.R.R. Shelke 2 1 ME First Yr. (CSE), 2 Assistant Professor Computer Science & Engg, Department

Research and Innovation Roadmap Annual Projects

Epigenetics. Jenny van Dongen Vrije Universiteit (VU) Amsterdam Boulder, Friday march 10, 2017

ABSTRACT I. INTRODUCTION. Mohd Thousif Ahemad TSKC Faculty Nagarjuna Govt. College(A) Nalgonda, Telangana, India

Challenges and opportunities for humanmachine collaboration at beyond human scales

fasting blood glucose [mg/dl]

Identification of Tissue Independent Cancer Driver Genes

Mammogram Analysis: Tumor Classification

Research Strategy: 1. Background and Significance

NGS ONCOPANELS: FDA S PERSPECTIVE

Genetics/Genomics: role of genes in diagnosis and/risk and in personalised medicine

Transcription:

Predication-based Bayesian network analysis of gene sets and knowledge-based SNP abstractions Skanda Koppula Second Annual MIT PRIMES Conference May 20th, 2012 Mentors: Dr. Gil Alterovitz and Dr. Amin Zollanvari

Outline of Presentation 1. Introduction A Predictive Model for Disease Diagnosis Alcohol Dependency and Lung Cancer 2. Methodology An Overview of Prediction Based Analysis SNP to Gene Mapping Gene Set and Training Data + Parallelization 3. Results AUROCS for Alcohol Dependence AUROCS for Lung Cancer 4. Discussion 5. Future Work Rational Methodology Results Discussion Future Work 2

A Predictive Model of Disease Diagnosis Top-Level Goals for Genome Wide Association Studies (GWAS) Understand underlying mechanisms behind disease Infer diagnosis from patient s biological data Introduction Methodology Results Discussion Future Work 3

A Predictive Model of Disease Diagnosis Top-Level Goals for Genome Wide Association Studies (GWAS) Understand underlying mechanisms behind disease Infer diagnosis from patient s biological data Methods Gene Set Enrichment Analysis [1] Finds genes in set S (~cellular pathway) in top/bottom of ranked list of genes ordered by importance in classifying Predictive-Based Gene Set Analysis Finds predictive accuracy of S via probabilistic relations between of S to disease (Bayesian Network) Is this accuracy is significant compared to random prediction? If so, network can be used in disease diagnosis. Introduction Methodology Results Discussion Future Work 4

Alcohol Dependence and Lung Cancer Goal Use predictive-based analysis for both gene and SNP expression data Analyze alcohol dependency and lung cancer Introduction Methodology Results Discussion Future Work 5

Alcohol Dependence and Lung Cancer Goal Use predictive-based analysis for both gene and SNP expression data Analyze alcohol dependency and lung cancer Alcohol Dependence (Alcoholism) Underlying biological pathways not identified Difficult to overcome once dependence is initiated: Affects 140 million people Lung Cancer Other diagnostic methods may be invasive Early accurate diagnosis improves chances of survival Causes of death for 160,000 people per year Introduction Methodology Results Discussion Future Work 6

Alcohol Dependence and Lung Cancer Goal Use predictive-based analysis for both gene and SNP expression data Analyze alcohol dependency and lung cancer Alcohol Dependence (Alcoholism) Underlying biological pathways not identified Difficult to overcome once dependence is initiated: Affects 140 million people Lung Cancer Other diagnostic methods may be invasive Early accurate diagnosis improves chances of survival Causes of death for 160,000 people per year Focus on robustness and accuracy same sets identified as significant, across independently collected clinical data Introduction Methodology Results Discussion Future Work 7

An Overview of Prediction Based Analysis General Algorithm Input Training Data 1 Generate predictive [Bayesian] model for gene set S based on training data 2 Estimate AUROC (~accuracy) for model Determine AUROCs of models made by randomly permuting gene labels Compute p-values, comparing the predictive model s AUROC with a distribution of AUROCS from random permutations Apply multi-test correction factors 3 to limit the multiple attribute testing error [5] Obtain significant gene/snp sets and corresponding diagnostic models 1 Patients biological data. Ex. the set of gene expression values for each patient 2 Network creation from training data done via machine learning tool WEKA 3 Applied corrections: Benjamini-Hochberg FDR, Bonferroni, and Storey FDR Implementation of algorithm and WEKA in Java Rationale Methodology Results Discussion Future Work 8

A SNP to Gene Mapping Can we analyze SNP data to create diagnostic models? SNP is a single nucleotide polymorphism: two character DNA mutation We can determine specific SNP profiles that imply disease Yes: consider SNP sets to be analogous to gene sets and apply previous algorithm implementation Rationale Methodology Results Discussion Future Work 9

A SNP to Gene Mapping Can we analyze SNP data to create diagnostic models? SNP is a single nucleotide polymorphism: two character DNA mutation We can determine specific SNP profiles that imply disease Yes: consider SNP sets to be analogous to gene sets and apply previous algorithm implementation To create such sets, we can develop a 1:1 map from a SNP to a particular gene The mapping pairs each SNP to its closest gene (character distance in DNA string sequence) Rationale Methodology Results Discussion Future Work 10

A SNP to Gene Mapping Can we analyze SNP data to create diagnostic models? SNP is a single nucleotide polymorphism: two character DNA mutation We can determine specific SNP profiles that imply disease Yes: consider SNP sets to be analogous to gene sets and apply previous algorithm implementation To create such sets, we can develop a 1:1 map from a SNP to a particular gene The mapping pairs each SNP to its closest gene (character distance in DNA string sequence) Possible biological meaning of SNP to gene mapping: the value of the SNP may affect the function and expression of the gene closest to it. Relevance of being able to analyze SNP data: Only data for a disease may be SNP data Insight on the biological significance of SNPs Rationale Methodology Results Discussion Future Work 11

Gene Set and Training Data + Parallelization We used three sets the source of the tested gene-sets from the KEGG, GO, and the curated set used by GSEA s creators (as a comparison) Alcoholism training data: COGA - Collaborative Study on Genetics of Alcoholism COGEND - Collaborative Genetic Study of Nicotine Dependence Lung cancer training data: Boston study - National Medicine Labs Michigan study - National Medicine Labs Rationale Methodology Results Discussion Future Work 12

Gene Set and Training Data + Parallelization We used three sets the source of the tested gene-sets from the KEGG, GO, and the curated set used by GSEA s creators (as a comparison) Alcoholism training data: COGA - Collaborative Study on Genetics of Alcoholism COGEND - Collaborative Genetic Study of Nicotine Dependence Lung cancer training data: Boston study - National Medicine Labs Michigan study - National Medicine Labs In COGA and COGEND: Each patient had values for each of one million SNPs Total of 3,600 patients To avoid having to reduce data (time and computing limitations), we parallelized the creation of the SNP-to-gene mapping and partially parallelized segments of the prediction based algorithm implementation Rationale Methodology Results Discussion Future Work 13

True positive rate Results: Alcohol Dependency Example NaiveBayesian network for one identified significant SNP set: BIOGENIC_AMINE_METABOLIC_PROCESS Area Under Receive-Operating-Curve ~ Accuracy ~ 70.2% False positive rate Rational Methodology Results Discussion Future Work 14

Results: Alcohol Dependence A total of 15 significant shared gene/snp sets were found (significant using both the COGA and COGEND training sets) Gene Set Name Mean AUROC Mean Significance Value BRAIN_DEVELOPMENT 0.651562 0.011 IMMUNE_EFFECTOR_PROCESS 0.654478 0.022 INTERFERON_GAMMA_BIOSYNTHETIC_PROCESS 0.666999 0.033 DEFENSE_RESPONSE_TO_VIRUS 0.664685 0.016 INTERFERON_GAMMA_PRODUCTION 0.667297 0.024 AMINO_ACID_DERIVATIVE_METABOLIC_PROCESS 0.661525 0.011 BIOGENIC_AMINE_METABOLIC_PROCESS 0.702950 0.008 REGULATION_OF_INTERFERON_GAMMA_BIOSYNTHETIC_PR OCESS 0.666999 0.033 ALCOHOL_METABOLIC_PROCESS 0.657533 0.001 CENTRAL_NERVOUS_SYSTEM_DEVELOPMENT 0.651999 0.016 INORGANIC_ANION_TRANSPORT 0.658397 0.005 POSITIVE_REGULATION_OF_CYTOKINE_BIOSYNTHETIC_PRO CESS 0.666273 0.019 PEPTIDE_METABOLIC_PROCESS 0.660818 0.007 KEGG_DILATED_CARDIOMYOPATHY 0.667531 0.014 KEGG_ARGININE_AND_PROLINE_METABOLISM 0.659727 0.001 KEGG_VIRAL_MYOCARDITIS 0.718438 0.001 KEGG_PROXIMAL_TUBULE_BICARBONATE_RECLAMATION 0.654189 0.029 GO Gene Set Repository KEGG Rationale Methodology Results Discussion Future Work 15

Results: Alcohol Dependence Some are biologically interesting Gene Set Name BRAIN_DEVELOPMENT Note Association confirmed from in-vivo by Maier et al. for fetal[6] IMMUNE_EFFECTOR_PROCESS Kronfol et al.[7] INTERFERON_GAMMA_BIOSYNTHETIC_PROCESS Jeong et al. [8] KEGG_VIRAL_MYOCARDITIS Wilke et al. [9] KEGG_DILATED_CARDIOMYOPATHY KEGG_ARGININE_AND_PROLINE_METABOLISM KEGG_PROXIMAL_TUBULE_BICARBONATE_RECLAMATION ALCOHOL_METABOLIC_PROCESS PEPTIDE_METABOLIC_PROCESS INORGANIC_ANION_TRANSPORT Dilated cardiomyopathy defined to be caused by alcoholism New association? New association? Associated by generality New association? Related to PROXIMAL_TUBULE Median AUROC of all 15 sets 0.6615 Median COGA AUROC... 0.6854 Median COGEND AUROC. 0.6588 Number of significant sets: From COGA: 28 From COGEND: 35 Rationale Methodology Results Discussion Future Work 16

Results: Lung Cancer Some of these 15 significant common* gene/snp sets are biologically interesting. Gene Set Name M. AUROC M. Significance Value ACTIN_FILAMENT_BASED_MOVEMENT 0.653367 0.03 G1_PHASE 0.662136 0.0265 INTRACELLULAR_SIGNALING_CASCADE 0.657008 0.033 DEVELOPMENTAL_MATURATION 0.640078 0.0445 ACTIN_FILAMENT_ORGANIZATION 0.695239 0.013 KEGG_CIRCADIAN_RHYTHM_MAMMAL 0.658049 0.03 KEGG_GLYCOLYSIS_GLUCONEOGENESIS 0.702448 0.011 MAP00010_Glycolysis_Gluconeogenesis 0.683253 0.0175 P53_DOWN 0.649584 0.041 P53_UP 0.675317 0.0215 GO KEGG GSEA* Median AUROC of all 10 sets 0.6609 *Uses the set of gene sets used by GSEA (standard for comparison) [1] Number significant sets: Boston: 95 Michigan: 74 Rational Methodology Results Discussion Future Work 17

True positive rate Results: Lung Cancer If we create a network encompassing all found lung pathways False positive rate Rational Methodology Results Discussion Future Work 18

Conclusions Moderately high AUROCs (0.6 to 0.75) and significance of sets good accuracy of diagnostic models individual and combined [general AUROC metric] Rational Methodology Results Discussion Future Work 19

Conclusions Moderately high AUROCs (0.6 to 0.75) and significance of sets good accuracy of diagnostic models individual and combined [general AUROC metric] Identified associations confirmed. e.g. VIRAL_MYOCARDITIS [9] New ones found. e.g. ARGININE_AND_PROLINE_METABOLISM Alchoholism: identified 15 significant, robust (data-independent) pathways Lung Cancer: identified 10 significant, robust pathways The gene/snp sets cellular pathways; significance in disease prediction role in disease s biological mechanism Rational Methodology Results Discussion Future Work 20

Conclusions Moderately high AUROCs (0.6 to 0.75) and significance of sets good accuracy of diagnostic models individual and combined [general AUROC metric] Identified associations confirmed. e.g. VIRAL_MYOCARDITIS [9] New ones found. e.g. ARGININE_AND_PROLINE_METABOLISM Alchoholism: identified 15 significant, robust (data-independent) pathways Lung Cancer: identified 10 significant, robust pathways The gene/snp sets cellular pathways; significance in disease prediction role in disease s biological mechanism Good robustness for select data of predictive based analysis (higher number of significant pathways shared by COGA and COGEND) 40, 41 significant pathways vs. the <8, 11 pathways identified by GSEA [1] Robustness (num. common pathways) similar in value Rational Methodology Results Discussion Future Work 21

Future Work In vivo testing of biologically significant pathways Develop new gene-protein interaction and other bio. hypothesis Rational Methodology Results Discussion Future Work 22

Future Work In vivo testing of biologically significant pathways Develop new gene-protein interaction and other bio. hypothesis Consider the results from other training data sets; further confirm and statistically quantify method robustness Model other diseases Rational Methodology Results Discussion Future Work 23

Future Work In vivo testing of biologically significant pathways Develop new gene-protein interaction and other bio. hypothesis Consider the results from other training data sets; further confirm and statistically quantify method robustness Model other diseases Optimize the mechanics and implementation of prediction based algorithm Factor in the importance of cliques within the networks Establish gene-interdependency - networks such as Augmented NaïveBayes Rational Methodology Results Discussion Future Work 24

Acknowledgements Thank you: Dr. Gil Alterovitz for his insightful guidance on the project s direction and challenges Dr. Amin Zollanvari for his great mentorship - guiding me to discover answers to my many questions and helping me when I had difficulties Dr. Tanya Khovanova for her useful project guidance and advice My parents for always being supportive Rational Procedure Results Discussion Future Work 25

References [1] Gene set enrichment analysis: A knowledge based approach for interpreting genome-wide expression profiles. Subramanian et al. 2005. [2] Prediction-based Bayesian Network Analysis of Gene Sets for Genome-wide Association and Expression Studies. Zollanvari and Alterovitz. 2012. [3] Letter from the WHO European Ministerial Conference on Young People and Alcohol. Brundtland and World Health Organization. 2001. [4] Lung cancer (small cell) overview. American Cancer Society. 2012. [5] Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Benjamini and Hochberg. 1995. [6] Fetal Alcohol Exposure and Temporal Vulnerability: Regional Differences in Cell Loss as a Function of the Timing of Binge-Like Alcohol Exposure During Brain Development. Maier et al. 1999. [7] Immune Function in Alcoholism: A Controlled Study. Kronfol et al. 1993. [8] Abrogation of the antifibrotic effects of natural killer cells/interferongamma contributes to alcohol acceleration of liver fibrosis. Jeong et al. 2008. [9] [Alcohol and myocarditis] Wilke et al. 1996. Thank you for your attention! Rational Procedure Results Discussion Future Work 26