Supervised Learner for the Prediction of Hi-C Interaction Counts and Determination of Influential Features. Tyler Derr, Yue Lab

Supervised Learner for the Prediction of Hi-C Interaction Counts and Determination of Influential Features
Tyler Derr @ Yue Lab, tsd5037@psu.edu

Background
Hi-C is a technology based on chromosome conformation capture (3C) that outputs the number of interactions between loci at genome-wide scale. [3]

Background
Recent 3D prediction software such as BACH [1] and PASTIS [5] can use Hi-C data to produce 3D genome structures.
BACH: It utilizes a Poisson model that better fits the count data generated from Hi-C experiments than the Gaussian model used in MCMC5C [4]... [1]
PASTIS: The authors of [5] present 4 methods, 2 of which are based on a Poisson model.
Thanks to the recent efforts of the ENCODE and Roadmap Epigenomics projects, we have access to the following data per region (40kb resolution): GC content, mappability, number of HindIII cut sites, Pol II, and 6 histone modifications such as H3K36me3.

Basis of our research
Current software such as BACH [1] and PASTIS [5], which predicts 3D genome structures from Hi-C data, has trouble dealing with the bias introduced by the techniques used to gather the data.
Hi-C data collection is time-consuming and expensive, and it has known biases.
Dr. Ming Hu (the creator of BACH [1]) appears to have attempted to address the biases by taking into account enzyme cutting frequency, GC content, and sequence uniqueness when making 3D predictions.
However, Dr. Hu has recently stated that, in light of a recent Nature paper, the Poisson distribution assumption (which is crucial to BACH) is not appropriate for Hi-C data, which would invalidate any approach that relies on that assumption. [2]
Can we use machine learning techniques not only to alleviate the bias, but perhaps also to predict the Hi-C data?

Predicting Hi-C
We present two methods:
Method 1: Use the entire Hi-C matrix as training data for a single Random Forest (RF) and for a single Artificial Neural Network (ANN).
Method 2: Create a separate RF for each diagonal of the matrix, i.e. any given RF is trained only on region pairs at a fixed distance (e.g. RF_2 is trained on all region pairs that are 2 regions, or 80kb, apart).
The reasoning behind Method 2 is that it gives insight into which features are more meaningful for prediction at different distances.

Predicting Hi-C
We use the mESC mm9 chromosomes to train and validate our models.
Data features used to learn Hi-C: 10 for each 40kb region: GC content, number of HindIII cut sites, mappability, H3K4me1, H3K4me3, H3K27ac, H3K27me3, H3K36me3, Pol II, and CTCF.
Method 1: Using RF and ANN
The training input for predicting the interaction of two regions ri and rj consists of the 10 features for both regions plus one additional feature, the distance between the regions:
[ri.gc, ri.hindiii, ..., ri.ctcf, rj.gc, rj.hindiii, ..., rj.ctcf, distance]
We use these features to predict the Hi-C interaction value between ri and rj for all pairs of regions in the chromosome.
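The Method 1 input layout can be illustrated with a minimal sketch. The feature keys, the function names, the choice to measure distance in 40kb bins, and the restriction to pairs with i < j are all assumptions made for illustration; the slides do not specify them. The resulting rows could then be passed to any regressor, such as a random forest.

```python
from itertools import combinations

# Hypothetical keys for the 10 per-region features listed on the slide.
FEATURES = ["gc", "hindiii", "map", "h3k4me1", "h3k4me3",
            "h3k27ac", "h3k27me3", "h3k36me3", "pol2", "ctcf"]


def pair_row(regions, i, j):
    """Return [ri features..., rj features..., distance] for one region pair.

    Distance is expressed in bins here (an assumption; kb would work equally well).
    """
    ri = [regions[i][name] for name in FEATURES]
    rj = [regions[j][name] for name in FEATURES]
    return ri + rj + [abs(j - i)]


def build_training_set(regions, hic):
    """Build (X, y) over all region pairs i < j of one chromosome."""
    X, y = [], []
    for i, j in combinations(range(len(regions)), 2):
        X.append(pair_row(regions, i, j))
        y.append(hic[i][j])  # observed Hi-C interaction value for the pair
    return X, y


# Toy data: 3 regions with made-up feature values and a made-up contact matrix.
regions = [{name: float(r + 1) for name in FEATURES} for r in range(3)]
hic = [[0, 5, 2],
       [5, 0, 7],
       [2, 7, 0]]
X, y = build_training_set(regions, hic)
print(len(X), "pairs,", len(X[0]), "features per pair")
```

Each row has 21 entries: 10 per region plus the distance.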

Predicting Hi-C
We use the mESC mm9 chromosomes to train and validate our models.
Data features used to learn Hi-C: 10 for each 40kb region: GC content, number of HindIII cut sites, mappability, H3K4me1, H3K4me3, H3K27ac, H3K27me3, H3K36me3, Pol II, and CTCF.
Method 2: Using RF
We use these features to predict the Hi-C interaction value between two regions ri and rj for all pairs of regions at a specific distance in the chromosome.
E.g. to train model RF_2 we use all pairs of regions that are 80kb apart.
The input for predicting the interaction of ri and rj consists of the 10 features for both regions (the distance is not used).
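Selecting the training pairs for one per-distance model in Method 2 amounts to walking a single off-diagonal of the contact matrix. The sketch below is a minimal illustration under that reading; the function names and the toy data are hypothetical, and the 40kb bin size is taken from the slides.

```python
BIN_SIZE_KB = 40  # region size used in the slides


def diagonal_pairs(n_regions, k):
    """All (i, j) index pairs that lie exactly k bins apart."""
    return [(i, i + k) for i in range(n_regions - k)]


def build_diagonal_set(features, hic, distance_kb):
    """Build (X, y) for one fixed distance; distance itself is not a feature."""
    k = distance_kb // BIN_SIZE_KB
    X, y = [], []
    for i, j in diagonal_pairs(len(features), k):
        X.append(list(features[i]) + list(features[j]))  # 10 + 10 features
        y.append(hic[i][j])
    return X, y


# Toy data: 5 regions with 10 made-up features each and a made-up matrix.
features = [[float(r)] * 10 for r in range(5)]
hic = [[abs(i - j) for j in range(5)] for i in range(5)]

# Training data for the model the slides call RF_2 (pairs 80kb apart).
X2, y2 = build_diagonal_set(features, hic, 80)
print(len(X2), "pairs,", len(X2[0]), "features per pair")
```

One such (X, y) set would be built per distance, with a separate model fitted to each.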

What we have so far...
Method 1: Training on all pairs of regions from chr1 and testing the model on all pairs from chr2 gives RMSE = 2.309 and R-squared = 0.869.
What we have planned for the near future:
Performing leave-one-out cross-validation using all the mESC mm9 chromosomes.
Using higher-resolution 1kb region sizes.
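For reference, the two reported metrics can be computed as in the sketch below. It assumes R-squared means the coefficient of determination (1 minus residual over total sum of squares); the slides do not say whether that definition or the squared Pearson correlation was used. The toy values are made up.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy example with made-up interaction values.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
print("RMSE:", rmse(observed, predicted))
print("R-squared:", r_squared(observed, predicted))
```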

[Figure: Scatter plot of real vs. predicted Hi-C data. Axes: real interaction values, predicted interaction values.]

3D Structure of mESC mm9 chr2
[Figure panels: using predicted Hi-C; using raw Hi-C.]
3D models were generated using the PASTIS (MDS) 3D prediction software. [5]
Coloration corresponds to the distance from the starting point of the chromosome (blue, cyan, green, yellow, orange, red). [2]

[Figure: Hi-C heatmaps of mm9 chr2, entire chromosome. Panels: predicted data, real data.]

[Figure: Hi-C heatmaps of mm9 chr2, 0-40Mbp. Panels: predicted data, real data.]

Feature Importances
Another part of our project attempts to determine which of the 10 features are more meaningful in determining the interaction between loci.
Question: Do the features that matter most for the Hi-C values of paired regions differ between close and far-away interactions?

Feature Importances
Using Method 2: feature importances (in sorted order) for predicting the interaction between regions that are 40kb vs. 2Mbp apart.

Rank   40kb                        2Mbp
1      H3k36me3_norm = 0.3571      HindIII       = 0.238
2      HindIII       = 0.2871      Map           = 0.1686
3      Map           = 0.1062      H3k27me3_norm = 0.0944
4      H3k27ac_norm  = 0.0505      POL2_norm     = 0.0862
5      POL2_norm     = 0.0453      GC            = 0.0794
6      H3k4me1_norm  = 0.0359      H3k36me3_norm = 0.0721
7      H3k27me3_norm = 0.0358      CTCF_norm     = 0.0711
8      GC            = 0.0295      H3k4me3_norm  = 0.0642
9      CTCF_norm     = 0.0269      H3k4me1_norm  = 0.0632
10     H3k4me3_norm  = 0.0258      H3k27ac_norm  = 0.0606

Note: These values are obtained by analyzing the decision trees in a Random Forest model. The feature importances are calculated by randomly permuting the values of a single feature among the training instances. The larger the change in prediction accuracy between using the correct feature values and using the permuted values, the more meaningful/important the feature is for the prediction.
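The permutation idea described in the note can be sketched in a few lines. This is a generic illustration, not the authors' implementation: the model is a hypothetical stand-in function rather than a trained random forest, the error measure (mean squared error) and the number of repeats are assumptions, and no normalization of the scores is applied.

```python
import random


def mse(model, X, y):
    """Mean squared error of a model (a callable taking one feature row)."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Average increase in MSE when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Copy of X with only this column replaced by its shuffled values.
            X_perm = [row[:col] + [v] + row[col + 1:]
                      for row, v in zip(X, shuffled)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy setup: the stand-in model depends only on feature 0 and ignores feature 1.
def toy_model(row):
    return 3.0 * row[0]


X_toy = [[float(i), float((i * 7) % 5)] for i in range(20)]
y_toy = [toy_model(row) for row in X_toy]
scores = permutation_importance(toy_model, X_toy, y_toy)
print(scores)
```

In the toy run, shuffling the ignored feature leaves the error unchanged, while shuffling the feature the model relies on increases it.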

Feature Importances

Feature Importances
Future work idea: Use data mining techniques to learn more about how features (and pairs of features) correlate with the Hi-C interaction values.

Thank you

References
[1] Hu, Ming, et al. "Bayesian inference of spatial organizations of chromosomes." PLoS Computational Biology 9.1 (2013): e1002893.
[2] Kuang, Simon. 2014 Google Science Fair Poster.
[3] Lieberman-Aiden, Erez, et al. "Comprehensive mapping of long-range interactions reveals folding principles of the human genome." Science 326.5950 (2009): 289-293.
[4] Rousseau, Mathieu, et al. "Three-dimensional modeling of chromatin structure from interaction frequency data using Markov chain Monte Carlo sampling." BMC Bioinformatics 12.1 (2011): 414.
[5] Varoquaux, Nelle, et al. "A statistical approach for inferring the 3D structure of the genome." Bioinformatics 30.12 (2014): i26-i33.