Supervised Learner for the Prediction of Hi-C Interaction Counts and Determination of Influential Features
Tyler Derr @ Yue Lab
tsd5037@psu.edu

Background
Hi-C is a chromosome conformation capture (3C) based technology that outputs the number of interactions between loci at genome-wide scale. [3]

Background
Recent 3D prediction software such as BACH [1] and PASTIS [5] can use Hi-C data to produce 3D genome structures.
BACH: It utilizes a Poisson model that better fits the count data generated from Hi-C experiments than the Gaussian model used in MCMC5C [4]... [1]
PASTIS: In [5], the authors present 4 methods, 2 of which are based on a Poisson model.
Thanks to the recent efforts of the ENCODE and Roadmap Epigenomics projects, we have access to the following data per region (40kb resolution): GC content, mappability, number of HindIII cut sites, Pol II, and 6 histone modifications such as H3K36me3.

Basis of our research
Current software such as BACH [1] and PASTIS [5] that predicts 3D genome structures from Hi-C data has trouble dealing with the bias induced by the techniques used to gather the data.
Hi-C data collection is time consuming and expensive, and has known biases.
Dr. Ming Hu (the creator of BACH [1]) appears to have attempted to address the biases by taking into account enzyme cutting frequency, GC content, and sequence uniqueness when making his 3D predictions.
However, Dr. Hu has recently stated that, according to a recent Nature paper, the Poisson distribution assumption (which is crucial to BACH) is not appropriate for Hi-C data, which would invalidate any approach that relies on that assumption. [2]
Can we use machine learning techniques not only to alleviate the bias, but also perhaps to predict the Hi-C data?

Predicting Hi-C
We present two methods:
Method 1: Use the entire Hi-C matrix as training data for a single Random Forest (RF) and also a single Artificial Neural Network (ANN).
Method 2: Create a separate RF for each diagonal of the matrix, i.e. any given RF is trained only on region pairs at a fixed distance (e.g. RF_2 is trained on all region pairs that are 2 regions, or 80kb, apart).
The reasoning behind Method 2 is that it shows which features are more meaningful for prediction at different distances.

Predicting Hi-C
We use mESC mm9 chromosomes to train and validate our models.
Data features used to learn Hi-C: 10 for each 40kb region: GC content, number of HindIII cut sites, mappability, H3K4me1, H3K4me3, H3K27ac, H3K27me3, H3K36me3, Pol II, and CTCF.
Method 1: Using RF and ANN
The training input for predicting the interaction of two regions ri and rj consists of the 10 features for both regions plus one additional feature, the distance between the regions:
[ri.gc, ri.hindiii, ..., ri.ctcf, rj.gc, rj.hindiii, ..., rj.ctcf, distance]
We attempt to use these features to predict the Hi-C interaction value between ri and rj for all pairs of regions in the chromosome.
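The 21-value Method 1 input could be assembled roughly as in the sketch below. The function names, the feature ordering, the use of kb as the distance unit, and the choice to enumerate only pairs with i < j are all assumptions made for illustration; the slides do not specify them.

```python
from itertools import combinations

# Per-region feature names, in the order listed on the slide (ordering assumed).
FEATURES = ["gc", "hindiii", "map", "h3k4me1", "h3k4me3",
            "h3k27ac", "h3k27me3", "h3k36me3", "pol2", "ctcf"]

def pair_features(region_features, i, j, bin_size_kb=40, with_distance=True):
    """Concatenate the feature rows of regions i and j, optionally adding distance."""
    row = list(region_features[i]) + list(region_features[j])
    if with_distance:
        # Genomic distance between bins, in kb (unit is an assumption).
        row.append(abs(j - i) * bin_size_kb)
    return row

def build_training_set(region_features, hic):
    """Build (X, y) over all region pairs i < j of one chromosome.

    region_features: list of per-region feature lists (len(FEATURES) values each).
    hic: square matrix (list of lists) of interaction values.
    """
    X, y = [], []
    for i, j in combinations(range(len(region_features)), 2):
        X.append(pair_features(region_features, i, j))
        y.append(hic[i][j])
    return X, y
```

Each row of `X` then holds 10 + 10 + 1 values, and `y` holds the matching Hi-C interaction value for that pair.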

Predicting Hi-C
We use mESC mm9 chromosomes to train and validate our models, with the same 10 features for each 40kb region.
Method 2: Using RF
We attempt to use these features to predict the Hi-C interaction value between two regions ri and rj for all pairs of regions at a specific distance in the chromosome (e.g. to train model RF_2 we use all pairs of regions that are 80kb apart).
The input for predicting the interaction of ri and rj consists of the 10 features for both regions; the distance is not used.
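One way to organize the Method 2 training data is to group region pairs by their offset along the matrix diagonal, as sketched below. The function name `pairs_by_offset` and the `max_offset` parameter are hypothetical, introduced only for this example.

```python
def pairs_by_offset(region_features, hic, max_offset):
    """Group region pairs by diagonal offset d = j - i.

    Returns a dict mapping each offset d (in bins) to a pair (X, y), where X
    holds the concatenated features of both regions (no distance feature)
    and y holds the matching Hi-C interaction values.
    """
    n = len(region_features)
    datasets = {}
    for d in range(1, max_offset + 1):
        X = [list(region_features[i]) + list(region_features[i + d])
             for i in range(n - d)]
        y = [hic[i][i + d] for i in range(n - d)]
        datasets[d] = (X, y)
    return datasets
```

With 40kb bins, `datasets[2]` would hold the pairs used for RF_2 (80kb apart). Each `(X, y)` could then be passed to any regressor, for instance scikit-learn's `RandomForestRegressor`; the slides do not say which implementation was actually used.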

What we have so far...
Method 1: Training on all pairs of regions from chr1 and testing our model on all pairs from chr2: RMSE = 2.309, R-squared = 0.869
What we have planned for the near future:
Performing leave-one-out cross validation using all the mESC mm9 chromosomes
Using a higher resolution with 1kb region sizes
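For reference, the two reported metrics and the planned cross validation could be computed along the following lines. This is a sketch under two assumptions: that R-squared means the coefficient of determination (not the squared Pearson correlation), and that "leave-one-out" means holding out one chromosome at a time.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between real and predicted values."""
    sq_err = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(sq_err / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def leave_one_out_splits(chromosomes):
    """Yield (training chromosomes, held-out chromosome) for each chromosome."""
    for held_out in chromosomes:
        yield [c for c in chromosomes if c != held_out], held_out
```

In such a scheme, a model would be trained on the pairs from every training chromosome and scored with `rmse` and `r_squared` on the held-out one.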

[Figure: Scatter Plot of Real vs Predicted Hi-C Data. Axes: Real Interaction Values, Predicted Interaction Values.]

3D Structure of mESC mm9 chr2
[Figure: 3D models generated using the PASTIS (MDS) 3D prediction software [5]. Panels: Using Predicted Hi-C; Using raw Hi-C.]
Coloration corresponds to the distance from the starting point of the chromosome (blue, cyan, green, yellow, orange, red). [2]

Hi-C Heatmaps of mm9 Chr2 - (Entire Chr)
[Figure panels: Predicted Data; Real Data]

Hi-C Heatmaps of mm9 Chr2 - (0-40Mbp)
[Figure panels: Predicted Data; Real Data]

Feature Importances
Another part of our project is to determine which of the 10 features are more meaningful in determining the interaction between loci.
Question: Do the features that matter most for the Hi-C values of paired regions differ between close and far-apart regions?

Feature Importances
Using Method 2: Feature importances (in sorted order) for predicting the interaction between regions that are 40kb vs 2Mbp apart

40kb                        2Mbp
H3k36me3_norm = 0.3571      HindIII       = 0.238
HindIII       = 0.2871      Map           = 0.1686
Map           = 0.1062      H3k27me3_norm = 0.0944
H3k27ac_norm  = 0.0505      POL2_norm     = 0.0862
POL2_norm     = 0.0453      GC            = 0.0794
H3k4me1_norm  = 0.0359      H3k36me3_norm = 0.0721
H3k27me3_norm = 0.0358      CTCF_norm     = 0.0711
GC            = 0.0295      H3k4me3_norm  = 0.0642
CTCF_norm     = 0.0269      H3k4me1_norm  = 0.0632
H3k4me3_norm  = 0.0258      H3k27ac_norm  = 0.0606

Note: These values are obtained by analysis of the decision trees in a Random Forest model. The feature importances are calculated by randomly permuting the values of a single feature among the training instances. The larger the change in prediction accuracy between the correct feature values and the permuted values, the more meaningful/important the feature is for the prediction.
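The permutation idea in the note can be illustrated with a small model-agnostic sketch. It is not necessarily how the values above were computed: the use of mean squared error as the accuracy measure, the `n_repeats` averaging, and the representation of the model as a plain callable are assumptions made for the example.

```python
import random

def mse(model, X, y):
    """Mean squared error of model (a callable taking one feature row) on (X, y)."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Average increase in MSE when one feature column is shuffled.

    X is a list of feature rows (lists). A larger returned value means the
    model's predictions depend more on that feature.
    """
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle only this column, leaving all other features intact.
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            X_perm = [row[:col] + [v] + row[col + 1:]
                      for row, v in zip(X, shuffled)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances
```

A feature the model never uses gets an importance of zero here, because shuffling it leaves every prediction unchanged.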

Feature Importances
Future work idea: Use data mining techniques to learn more about how features (and pairs of features) correlate with the Hi-C interaction values.

Thank you

References
[1] Hu, Ming, et al. "Bayesian inference of spatial organizations of chromosomes." PLoS Computational Biology 9.1 (2013): e1002893.
[2] Kuang, Simon. 2014 Google Science Fair Poster.
[3] Lieberman-Aiden, Erez, et al. "Comprehensive mapping of long-range interactions reveals folding principles of the human genome." Science 326.5950 (2009): 289-293.
[4] Rousseau, Mathieu, et al. "Three-dimensional modeling of chromatin structure from interaction frequency data using Markov chain Monte Carlo sampling." BMC Bioinformatics 12.1 (2011): 414.
[5] Varoquaux, Nelle, et al. "A statistical approach for inferring the 3D structure of the genome." Bioinformatics 30.12 (2014): i26-i33.