Processing, integrating and analysing chromatin immunoprecipitation followed by sequencing (ChIP-seq) data
Bioinformatics methods, models and applications to disease
Alex Essebier
ChIP-seq experiment
  To determine protein binding sites in the genome
    Snapshot of in vivo sites occupied by protein
    Improve understanding of regulation in genome
    Improve understanding of epigenetics
  Transcription factors (TFs)
  Histone modifications (HMs)
    To tails of histone proteins forming nucleosomes
ChIP-seq data processing
ChIP-seq principles
  Wet lab
    Extract DNA bound by protein of interest
Was ChIP-seq successful?
  Sequence depth
    Depends on size of genome and type of protein
    Mammalian TF: 20 million reads
Was ChIP-seq successful?
  Sequence quality control
    High quality
    FastQC to analyse
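As a rough illustration of the number behind a sequence quality check, the sketch below decodes FASTQ quality strings into Phred scores and reports how many reads pass a mean-quality cut-off. It is not how FastQC itself is implemented; the function names, the Phred+33 encoding and the cut-off of 30 are assumptions chosen for the example.

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of one FASTQ quality line (assumes Phred+33 encoding)."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def fraction_high_quality(quality_strings, min_mean=30):
    """Fraction of reads whose mean Phred score is at least min_mean.

    min_mean=30 is an illustrative threshold, not a value from the slides.
    """
    passing = sum(1 for q in quality_strings if mean_phred(q) >= min_mean)
    return passing / len(quality_strings)
```

For example, the quality character `I` corresponds to a Phred score of 40 under Phred+33, so a read with quality line `IIII` has a mean score of 40.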
Was ChIP-seq successful?
  Alignment quality control
    Uniquely aligned reads
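One way to put a number on alignment quality is the fraction of reads that align confidently to a single location. The sketch below reads SAM-formatted lines and treats a mapped primary alignment with MAPQ at or above a threshold as "uniquely aligned". That use of MAPQ is an assumption (its meaning depends on the aligner), as are the function name and the default threshold of 30.

```python
def fraction_uniquely_aligned(sam_lines, min_mapq=30):
    """Fraction of reads that are mapped with MAPQ >= min_mapq.

    MAPQ is used here as a proxy for unique alignment (an assumption;
    aligners differ in how they assign MAPQ).
    """
    total = 0
    unique = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800) records
        total += 1
        if not flag & 0x4 and mapq >= min_mapq:  # 0x4 marks an unmapped read
            unique += 1
    return unique / total if total else 0.0
```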
Was ChIP-seq successful?
  ChIP-seq creates bimodal pattern of reads at peak
  Strand cross-correlation analysis (SCCA)
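The bimodal pattern can be measured by correlating forward-strand and reverse-strand read starts at increasing shifts; in a successful experiment the correlation is expected to peak near the fragment length. The following is a minimal pure-Python sketch on a single short region, assuming read 5' positions are already available as integer lists. The function names are illustrative, and this is not the implementation used by any particular SCCA tool.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def strand_cross_correlation(forward_starts, reverse_starts, length, max_shift):
    """Correlation between strand profiles for each shift from 0 to max_shift."""
    fwd = [0] * length
    rev = [0] * length
    for pos in forward_starts:
        fwd[pos] += 1
    for pos in reverse_starts:
        rev[pos] += 1
    profile = {}
    for shift in range(max_shift + 1):
        # Compare forward position i with reverse position i + shift.
        profile[shift] = pearson(fwd[: length - shift], rev[shift:])
    return profile
```

The shift with the highest correlation, `max(profile, key=profile.get)`, gives an estimate of the fragment length for that region.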
Basic principles of peak calling
  Sample: exposed to antibody
  Input: no antibody exposure
  Peak: sample compared to input to generate a peak with statistical significance
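To make "sample compared to input with statistical significance" concrete, the sketch below scores one genomic window with a Poisson test: the input read count, scaled to the sample's sequencing depth, is used as the expected count, and the p-value is the chance of seeing at least the observed sample count. This is a toy model written for illustration, not the statistical model of any specific peak caller; the function names and the floor of one input read are assumptions.

```python
from math import exp


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1).

    Very small p-values lose precision with this simple subtraction.
    """
    if k <= 0:
        return 1.0
    term = exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i  # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def window_p_value(sample_count, input_count, sample_total, input_total):
    """P-value for enrichment of the sample over depth-scaled input in one window."""
    scale = sample_total / input_total
    # Floor the input at one read so an empty input window does not give
    # an expected count of zero (an assumption made for this sketch).
    expected = max(input_count, 1) * scale
    return poisson_sf(sample_count, expected)
```

With equal sequencing depth, a window holding 30 sample reads against 5 input reads gets a very small p-value, while 5 against 5 does not.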
The problem with peak calling
  Choice of peak caller depends on problem
    E.g. stalled or transient
  Based on statistical or probabilistic models
  OMICtools reports 51 ChIP-seq tools
  In-house tools
Comparing peak callers - TFs
  HOMER and SPP: fixed size peak
    262 bp and 470 bp respectively
  MACS2: variable size peak
    Avg. 328 bp, mode 140-180 bp

  Peak caller   Total    % Unique
  MACS2         42,536   12%
  HOMER         45,044   19%
  SPP           19,474   0.7%
Peak quality control
  Number of peaks
    How active is the protein?
  Read coverage
    Are peak locations enriched for reads?
    Fraction of reads in peaks (FRiP) > 1%
    Generally observe > 10%
    E.g. below: 6/50 reads in peak -> 12%
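FRiP is a simple ratio: reads falling inside called peaks divided by all reads. The sketch below computes it for small in-memory lists, assuming each read is reduced to a single (chromosome, position) pair and each peak is a half-open (chromosome, start, end) interval. The function name and data layout are illustrative, and the linear scan is only suitable for small examples.

```python
def frip(read_positions, peaks):
    """Fraction of reads in peaks.

    read_positions: list of (chrom, pos)
    peaks: list of (chrom, start, end), half-open intervals
    """
    in_peaks = 0
    for chrom, pos in read_positions:
        if any(c == chrom and start <= pos < end for c, start, end in peaks):
            in_peaks += 1
    return in_peaks / len(read_positions)
```

With 6 of 50 reads inside a peak, as in the slide example, the function returns 0.12, which passes the 1% minimum and matches the generally observed level of more than 10%.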
Replicate datasets
  Biological replicates can vary significantly
  Call peaks for replicates individually
  Compare/overlap to achieve gold standard
  Comparisons are dominated by poor replicate
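Comparing replicates comes down to asking which peaks in one replicate overlap a peak in the other. A minimal sketch, assuming peaks are half-open (chromosome, start, end) tuples and that any overlap of at least one base counts; the function names are illustrative, and dedicated interval tools would be used on real data.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def count_overlapping(peaks_a, peaks_b):
    """Number of peaks in peaks_a that overlap at least one peak in peaks_b."""
    return sum(1 for a in peaks_a if any(overlaps(a, b) for b in peaks_b))
```

The count is not symmetric in general: one broad peak in a replicate can overlap several narrow peaks in the other, so it is worth reporting the overlap from both directions.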
PEAK ANALYSIS
Exploring the peaks generated from ChIP-seq
Transcription Factors
  Confirm in vitro and in silico results
    Overlapping peaks with motifs
  Identify consensus motif
    For TFs which do not have an existing/known motif
    To identify variations in motif
  Differential peak binding
    To identify differences in binding patterns
    Compare cell types or time points
Histone Modifications
  Epigenetic analysis
    Generate epigenetic profiles
  Identify chromatin states genome wide
    E.g. ChromHMM
  Identify regulatory modules
    E.g. promoters or enhancers
  Differential peak binding
    Identify differences in epigenetic patterns
INTEGRATING DATA
Combining data sets to improve outcomes
Data integration
  Experiments capture dependent regulatory events
    ChIP-seq: regulatory elements
    DNase I hypersensitivity (DHS): chromatin accessibility
    RNA-seq: expression patterns
  Consider multiple datasets to:
    Improve confidence and understanding
    Support hypotheses
Supporting HMs
  Explore chromatin environment
    Layered HMs
    DHS: chromatin accessibility
ChIP-seq complications
  Possible to observe multiple states at one location
  False negatives
    Can't detect small sub-populations
  False positives
    General non-specific chromatin being pulled down
    Bias not removed by input
Supporting TFs
  Assumption: TFs bind open/active chromatin
  Preferentially bind regulatory regions
    E.g. promoters or enhancers
ChIP-seq complications
  ChIP-seq generates peaks for all of these events
TF target genes using RNA-seq
  RNA-seq on knock-out of TF
  Identify genes with changes in expression
    Gene 1 is down-regulated
    Direct target of TF
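The logic of this slide can be sketched as a simple intersection: a gene is a candidate direct target when it is both bound by the TF and significantly changed in the knock-out. The function name, the input layout (gene mapped to a log2 fold change and an adjusted p-value, knock-out versus wild type) and both thresholds are assumptions made for illustration.

```python
def candidate_direct_targets(bound_genes, expression_changes,
                             min_abs_log2fc=1.0, max_padj=0.05):
    """Genes that are bound by the TF and differentially expressed in the knock-out.

    bound_genes: set of gene names with a TF peak assigned to them
    expression_changes: dict gene -> (log2 fold change, adjusted p-value)
    Thresholds are illustrative defaults, not values from the slides.
    """
    targets = {}
    for gene, (log2fc, padj) in expression_changes.items():
        changed = padj <= max_padj and abs(log2fc) >= min_abs_log2fc
        if gene in bound_genes and changed:
            targets[gene] = "down" if log2fc < 0 else "up"
    return targets
```

Genes that change without being bound are left out; they are more likely to reflect indirect effects of the knock-out.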
PRACTICAL EXAMPLE
The role of Math1 in differentiation of cerebellum
Role of Math1 in differentiation
  Aim: to identify genes targeted by Math1
  Approach: integrate available data

  Dataset    Data type                  Called peaks
  Math1      ChIP-seq                   8,804
  H3K4me1    ChIP-seq                   11,270
  H3K4me3    ChIP-seq                   15,894
  DHS        DNase I hypersensitivity   73,682
  Math1_KO   RNA-seq                    NA
Combining replicates
  Two replicates for H3K4me1
  Two peak callers: MACS2, HOMER

  Data set     Peaks    Overlap (rep1 vs rep2)
  MACS2_rep1   8,183
  MACS2_rep2   9,789    5,269
  HOMER_rep1   71,534
  HOMER_rep2   70,469   48,661

  [Figure tracks: H3K4me1 rep1, H3K4me1 rep2, IgG Control, MACS2_rep1, HOMER_rep1, MACS2_rep2, HOMER_rep2]
Combining replicates
  Two replicates
  Two peak callers: MACS2, HOMER
  Generate high quality merged output
    Requires called peak in 3 of 4 data sets
    11,270 peaks in total

  [Figure tracks: H3K4me1 rep1, H3K4me1 rep2, IgG Control, MACS2_rep1, HOMER_rep1, MACS2_rep2, HOMER_rep2, Merged_out]
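A "3 of 4" rule can be sketched by pooling the peaks from all four call sets, merging overlapping intervals into regions, and keeping regions supported by at least three distinct call sets. This is one plausible reading of the rule, not necessarily the procedure used to produce the 11,270 merged peaks; the function name and data layout are assumptions.

```python
def merge_with_support(peak_sets, min_support=3):
    """Merge overlapping peaks across call sets and keep well-supported regions.

    peak_sets: dict of call-set name -> list of (chrom, start, end), half-open
    Returns merged (chrom, start, end) regions seen in >= min_support call sets.
    """
    tagged = sorted(
        (chrom, start, end, name)
        for name, peaks in peak_sets.items()
        for chrom, start, end in peaks
    )
    merged = []
    current = None  # [chrom, start, end, set of supporting call-set names]
    for chrom, start, end, name in tagged:
        if current and chrom == current[0] and start < current[2]:
            # Overlaps the region being built: extend it and record the source.
            current[2] = max(current[2], end)
            current[3].add(name)
        else:
            if current:
                merged.append(current)
            current = [chrom, start, end, {name}]
    if current:
        merged.append(current)
    return [(c, s, e) for c, s, e, names in merged if len(names) >= min_support]
```

Because merging is transitive, a chain of partially overlapping peaks becomes one region, so merged regions can be wider than any single input peak.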
Identify regulatory regions
  Three outputs from epigenetic data:
    H3K4me1_DHS sites: putative enhancers
    H3K4me3_DHS sites: putative promoters
    H3K4me1_H3K4me3_DHS sites: other

  Comparison            Sites    Overlap
  H3K4me1_DHS           9,011    80% of H3K4me1
  H3K4me3_DHS           15,098   95% of H3K4me3
  H3K4me1_H3K4me3_DHS   919
Bound Math1
  Identify regulatory regions bound by Math1
  Math1 binds preferentially to putative enhancer
  >50% Math1 binding sites do not overlap a defined regulatory region

  [Figure categories: Putative Enhancer, Putative Promoter, No Overlap]
Distance profiles
  Binding by Math1 selects for distal regulatory regions (>2,000 bp from TSS)

  [Figure: two bar charts of % of total (0 to 100) for proximal vs distal sites. Left panel: H3K4me1 DHS and H3K4me1 DHS Math1. Right panel: H3K4me3 DHS and H3K4me3 DHS Math1.]
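The proximal/distal split can be sketched as a nearest-TSS lookup with the 2,000 bp cut-off from the slide. The function name, the use of the site midpoint, and the choice to label sites on chromosomes without any annotated TSS as distal are assumptions made for this example.

```python
from bisect import bisect_left


def classify_by_tss_distance(sites, tss_positions, threshold=2000):
    """Label each site 'proximal' or 'distal' by distance to the nearest TSS.

    sites: list of (chrom, start, end)
    tss_positions: dict chrom -> sorted list of TSS coordinates
    A site is distal when its midpoint is more than threshold bp from any TSS.
    """
    labels = []
    for chrom, start, end in sites:
        midpoint = (start + end) // 2
        tss = tss_positions.get(chrom, [])
        if not tss:
            labels.append("distal")  # no annotated TSS on this chromosome (assumption)
            continue
        i = bisect_left(tss, midpoint)
        # The nearest TSS is either just before or just after the insertion point.
        neighbours = tss[max(i - 1, 0): i + 1]
        distance = min(abs(midpoint - t) for t in neighbours)
        labels.append("distal" if distance > threshold else "proximal")
    return labels
```

Counting the labels for each site set and dividing by the set size gives the "% of total" values that a distance profile compares.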
Long distance regulation
  How to identify genes regulated by an enhancer?
RNA-seq
  Proximal: putative promoter bound by Math1
    [Figure: genes split into up regulated, down regulated and no significant change; values shown: 81, 8, 18]
  Distal: putative enhancer bound by Math1
    CisMapper for long distance interactions
    [Figure: genes split into up regulated, down regulated and no significant change; values shown: 176, 312, 1562]
System complexity
  A small number of differentially expressed genes are bound by Math1
    System redundancy
    Indirect changes in expression

  [Figure: Venn diagrams for up regulated genes and down regulated genes, each comparing Full RNA, Math1 H3K4me1 and Math1 H3K4me3; values shown: 326, 182, 63, 2170, 172, 2693]
Take home messages
  Understand your data and how best to use it
  Quality control
  Peak calling
    Use multiple where possible
    Keep up to date with advances
  Data integration
    Use all available data to gain a more complete picture
Data Resources
  Klisch, T. J., Xi, Y., Flora, A., Wang, L., Li, W., & Zoghbi, H. Y. (2011). In vivo Atoh1 targetome reveals how a proneural transcription factor regulates cerebellar development. Proceedings of the National Academy of Sciences, 108(8), 3288-3293.
  Frank, C. L., Liu, F., Wijayatunge, R., Song, L., Biegler, M. T., Yang, M. G., ... & West, A. E. (2015). Regulation of chromatin accessibility and Zic binding at enhancers in the developing cerebellum. Nature Neuroscience, 18(5), 647-656.
Useful papers
  Bailey, T., Krajewski, P., Ladunga, I., Lefebvre, C., Li, Q., Liu, T., ... & Zhang, J. (2013). Practical guidelines for the comprehensive analysis of ChIP-seq data. PLoS Comput Biol, 9(11), e1003326.
  Landt, S. G., Marinov, G. K., Kundaje, A., Kheradpour, P., Pauli, F., Batzoglou, S., ... & Chen, Y. (2012). ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Research, 22(9), 1813-1831.
  Farnham, P. J. (2009). Insights from genomic profiling of transcription factors. Nature Reviews Genetics, 10(9), 605-616.
  Zhang, Y., Liu, T., Meyer, C. A., Eeckhoute, J., Johnson, D. S., Bernstein, B. E., ... & Liu, X. S. (2008). Model-based analysis of ChIP-Seq (MACS). Genome Biology, 9(9), 1.
  Heinz, S., Benner, C., Spann, N., Bertolino, E., Lin, Y. C., Laslo, P., ... & Glass, C. K. (2010). Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Molecular Cell, 38(4), 576-589.