Heintzman, ND, Stuart, RK, Hon, G, Fu, Y, Ching, CW, Hawkins, RD, Barrera, LO, Van Calcar, S, Qu, C, Ching, KA, Wang, W, Weng, Z, Green, RD, Crawford, GE, Ren, B (2007) Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat Genet, 39:311-318.
The problem
Different cells have different patterns of gene expression. Cells respond to their environment by changing the expression of genes. Enhancers are the DNA sequences that drive these specific patterns of expression. The aim is to use a genomics approach to identify the regulatory circuitry that controls patterns of gene expression. It was once thought that knowing the DNA sequence of a single enhancer that specified a pattern of expression would be enough to identify other examples.
The problem
The DNA sequence of an enhancer is a poor predictor of function. It identifies only a tiny subset of enhancers.
An early demonstration of this fact: Impey, S, McCorkle, SR, Cha-Molstad, H, Dwyer, JM, Yochum, GS, Boss, JM, McWeeney, S, Dunn, JJ, Mandel, G, Goodman, RH (2004) Defining the CREB regulon: a genome-wide analysis of transcription factor regulatory regions. Cell, 119:1041-1054.
Canonical palindromic binding site: TGACGTCA
The approximate frequency of this sequence is 1 per 65 kb (one in 4^8). The genome is 3 x 10^9 bp, so in theory there are about 46,000 sites; the actual count is just over 32,000.
ChIP:SACO = a PCR/sequencing technique. Genome wide it found 6302 sites, many of which are NOT canonical.
Sequence analysis identifies too many sites, and many real ones don't match the motif.
Novel = new, not predicted.
(Figure shown for chromosome 10.)
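The "one in 4^8" estimate above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes equal base frequencies and independent positions, which real genomes do not obey; the function names are illustrative only.

```python
# Expected number of exact matches to an 8-bp motif in a random genome.
# Assumption: all four bases are equally likely and positions are independent.
MOTIF = "TGACGTCA"            # canonical CREB site from the slide
GENOME_SIZE = 3_000_000_000   # approximate genome size used on the slide

def reverse_complement(seq: str) -> str:
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq))

def expected_sites(motif: str, genome_size: int) -> float:
    """Expected exact matches on one strand of a uniform random sequence."""
    return genome_size / 4 ** len(motif)

is_palindrome = MOTIF == reverse_complement(MOTIF)
expected = expected_sites(MOTIF, GENOME_SIZE)
print(f"palindromic: {is_palindrome}")
print(f"one site per {4 ** len(MOTIF):,} bp, about {expected:,.0f} expected sites")
```

Because the site is its own reverse complement, scanning the opposite strand finds the same positions again, so the estimate is not doubled.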
The problems Enhancers may not be near the promoters they regulate and so how do you identify their targets? Some experiments indicate that only ~20% of enhancers are near the promoter (~1kb). In addition, the presence of the DNA sequence for an enhancer does not mean that the enhancer is being used.
The problem The situation is a bit better for promoters but not much.
A possible solution?
Epigenetic modifications may provide patterns that identify enhancers and promoters. They may also provide patterns that say NOT enhancer or NOT promoter.
Now they do the entire genome, but they began with data from a consortium of scientists.
30 megabases (1% of the genome)
44 genomic regions
Known genes that are ON.
Known genes that are OFF.
Known NOT genes.
The over-arching hypothesis is that the transcription machinery uses histone modifications as a first identifier of where promoters and enhancers lie AND that cells modify these histone marks as a way to regulate gene expression. The goal of this paper is to see if histone marks can be used by the investigator to identify active genes. To figure out if a new technique can work it is nice to know what the answer is.
First we start with a circular process: Take KNOWN promoters and KNOWN enhancers and see if they are covered by recognizable patterns of histone modifications. Then they ask if these patterns alone can be used to identify these promoters and enhancers in the genome. Wait! - we must have had a way to identify KNOWN promoters and enhancers because they start with them. So why bother? Because these KNOWN promoters were identified because of substantial labor and effort by a large number of people over a long period of time. We need an easier way.
Known promoters that are on vs. known promoters that are off vs. non-promoters
1. What histone marks do they have?
2. Are these marks a reliable way to identify them?
Eventually ask if the new method can discover NEW promoters!
Known enhancers that are on vs. known enhancers that are off vs. non-enhancers
1. What histone marks do they have?
2. Are these marks a reliable way to identify them?
Eventually ask if the new method can discover NEW enhancers!
Performed chromatin immunoprecipitation.
Used antibodies that recognize five different histone modifications: H4ac, H3ac, H3K4me1, H3K4me2 and H3K4me3, plus histone H3 itself.
Analyzed by ChIP-chip using an ENCODE array.
Figure: example ChIP-chip tracks (logR) for H4ac, H3ac, H3K4me1, H3K4me2, H3K4me3 and RFX5.
Now they do the entire genome, but they began with data from a consortium of scientists.
30 megabases (1% of the genome)
44 genomic regions
Known genes that are ON.
Known genes that are OFF.
Known enhancers.
Known NOT genes and known NOT enhancers.
What material was used?
DNA chip: ENCODE fragments, a 30 Mb database of transcriptionally active sequences in humans, at 38 bp resolution.
Chromatin is prepared from HeLa cells.
10 kb blocks around each landmark. Landmarks are known promoters and known enhancers.
Start with 104 TSSs (core promoters) defined as active promoters in HeLa cells and 104 TSSs defined as inactive in HeLa cells.
TSS (transcription start site) is used here as a synonym for the core promoter.
Conceptual explanation of simple clustering and heat maps
1. Align the genes so that the transcription start sites are centered. Each gene (gene 1 to gene 8) is one row spanning -5 kb to +5 kb.
2. Use a genomic tiling array to determine the position of each mark (X) along each row.
3. Sort the genes into self-similar groups, so that rows that are similar to one another sit together.
4. Compact the rows so that many genes can be viewed in a small space.
The result is called a heat map, here with Group I, Group II, Group III and Group IV.
black = average
red = greater than average
green = less than average
Each line is very thin; there are about 200 lines here. Each line is like a track in the Human Epigenome Browser.
The color scheme can vary. In papers you should double check what the colors mean.
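The sort-then-compact idea can be sketched in a few lines. This is a toy illustration only: the numbers are invented, the greedy nearest-neighbour ordering is a stand-in for whatever clustering a real analysis uses, and the letters R, K and G stand for the red, black and green cells of the slide.

```python
# Toy heat map: rows are genes, columns are positions from -5 kb to +5 kb.
# All values are invented for illustration.
profiles = {
    "gene 1": [0.1, 0.2, 2.5, 0.3, 0.1],
    "gene 2": [2.0, 0.1, 0.2, 0.1, 1.9],
    "gene 3": [0.0, 0.3, 2.8, 0.2, 0.0],
    "gene 4": [1.8, 0.2, 0.1, 0.0, 2.2],
}

def distance(a, b):
    """Euclidean distance between two equally long profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def order_by_similarity(rows):
    """Greedy ordering: start with the first gene, then keep appending
    the unplaced gene whose profile is closest to the last one placed."""
    names = list(rows)
    ordered = [names.pop(0)]
    while names:
        last = rows[ordered[-1]]
        nearest = min(names, key=lambda n: distance(rows[n], last))
        names.remove(nearest)
        ordered.append(nearest)
    return ordered

def colour(value, mean, tolerance=0.5):
    """Map a value to the slide's colour scheme relative to the mean."""
    if value > mean + tolerance:
        return "R"   # red: greater than average
    if value < mean - tolerance:
        return "G"   # green: less than average
    return "K"       # black: about average

all_values = [v for row in profiles.values() for v in row]
mean = sum(all_values) / len(all_values)
ordered = order_by_similarity(profiles)
heat_map = {name: "".join(colour(v, mean) for v in profiles[name]) for name in ordered}
for name, cells in heat_map.items():
    print(name, cells)
```

After sorting, the two centre-peaked genes sit next to each other and the two edge-peaked genes sit next to each other, which is what makes groups visible once the rows are compacted.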
Figure 1 Features of human transcriptional promoters and enhancers.
Panel a: 208 promoters (RefSeq TSSs), centered on known TSSs. Columns: H4ac, H3ac, H3K4me1, H3K4me2, H3K4me3, H3, RNAPII, TAF1, p300. Each column spans -5 kb to +5 kb; the color scale is logR.
% active by promoter class: P1 18/102, P2 24/41, P3 21/23, P4 41/42.
Panel c: 74 enhancers (distal p300 binding sites), centered on p300 binding sites. Same columns; classes E1, E2, E3.
Panels b, d: average profiles (logR) for each class.
Caption: ChIP-chip was performed against six histone markers and three general transcription factors in the ENCODE regions, and the data were clustered to reveal patterns at annotated promoters (a) and distal p300 binding sites (c). Promoter clustering was performed with 10-kb windows centered on RefSeq TSSs; enhancer windows were centered on promoter-distal p300 binding peaks. Average profiles for each marker within each class are shown below the clusters (b,d); each class is represented by a different color. The proportion of expressed genes ("% active") in each promoter class is presented to the right of the cluster, illustrating the relationship between histone modification patterns and gene expression. Comparison of the clusters shows that active promoters and enhancers are similarly marked by nucleosome depletion (column H3) but distinctly marked by mono- and trimethylation of histone H3K4 (columns H3K4me1 and H3K4me3). Note the depletion of H3K4me1 and the peak of H3K4me3 at the TSS in promoters, compared with the enrichment of H3K4me1 and lack of H3K4me3 at enhancers. The presence of histone methylation and acetylation upstream and downstream of the TSS at promoters is distinct from the primarily downstream localization of these markers observed at yeast promoters. The same procedure was performed on data from treated HeLa cells, yielding similar results (Supplementary Fig. 2). H4ac, acetylated H4; H3ac, acetylated H3.
Source: 2007 Nature Publishing Group, http://www.nature.com/naturegenetics
Notes on Figure 1:
The promoters are organized into classes. From P1 to P4, activity with respect to transcription increases.
TAF1 is TAF II p250.
How do we align enhancers? Enhancer windows are centered on the promoter-distal p300 binding peaks.
Figure 1, promoter classes P1 to P4 (% active: 18/102, 24/41, 21/23, 41/42):
1. Don't use P1 promoters because there is no positive signal. These are thought to be promoters that are off. Other non-promoter regions might look like this. An off-promoter mark can probably be used eventually.
2. The shape of the peak and its intensity are what we are looking for.
Figure 1, promoters: weak H3K4me1 and strong H3K4me3 is the combination most strongly correlated with an active core promoter. The shape is also important. Notice the nucleosome-free region at +1.
Figure 1, enhancers: the opposite pattern is most tightly correlated with enhancers. Strong H3K4me1 and weak H3K4me3 is the combination most strongly correlated with an active enhancer. The shape is also important. Notice that the shape is different from the promoter mark shape.
44 genomic regions - 208 well-described genes
ENCODE fragments are discrete regions. These were used to generate the previous heat maps. Here they are represented as a line: 30 megabases of genomic DNA.
We want to use this information to directly identify promoters and enhancers. Really we are trying to categorize sequence as
1. promoter
2. enhancer
3. not 1 or 2.
How do we figure out what to do?
TRAINING SET: calculate the average profile for a mark, e.g. P4 H3ac could be one training set.
TEST SET: all overlapping 10 kb windows from the same 30 megabases of genomic DNA. There are more windows here than in the training set.
Could we identify promoters with this average profile? We need something to examine, but the answer must be known.
For each member of the TEST SET, determine how closely the histone modification profile matches the shape of one of the six TRAINING SET histone modification profiles. Do this by determining the Euclidean distance.
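The two ingredients named above, an average profile from the training set and overlapping windows for the test set, can be sketched as follows. The numbers are invented and the windows are counted in probes rather than base pairs; this is an illustration of the idea, not the authors' code.

```python
def average_profile(profiles):
    """Column-wise mean of equally long profiles (the training-set average)."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]

def overlapping_windows(signal, width, step=1):
    """Yield (start index, window) for every overlapping window of the signal."""
    for start in range(0, len(signal) - width + 1, step):
        yield start, signal[start:start + width]

# Invented values: three training profiles and a short stretch of array signal.
training = [[0.0, 2.0, 0.0], [0.2, 2.4, 0.1], [0.1, 1.6, 0.2]]
average = average_profile(training)
signal = [0.1, 0.0, 0.2, 2.1, 0.1, 0.0]
windows = list(overlapping_windows(signal, width=len(average)))
print(average)
print(len(windows), "windows from", len(signal), "probes")
```

Because the windows overlap, even a short stretch of signal produces many test windows, which is why the test set is much larger than the training set.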
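Euclidean distance
Plot each peak on a coordinate system. Treat the test profile and the average profile as lists of values at the same positions; the Euclidean distance is the square root of the sum of the squared differences between them. The smaller the distance, the less the test plot has to change to become the average.
The figure compares two test plots (Test 1, Test 2) with the Average.
Done for all six histone marks.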
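As a concrete illustration of the distance calculation, here is a minimal sketch with invented profiles; `test_1` and `test_2` are hypothetical examples, not the plots from the slide.

```python
import math

def euclidean_distance(profile, average):
    """Square root of the summed squared differences, position by position."""
    if len(profile) != len(average):
        raise ValueError("profiles must cover the same positions")
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(profile, average)))

# Invented values for illustration.
average = [0.0, 1.0, 3.0, 1.0, 0.0]
test_1 = [0.0, 1.0, 2.0, 1.0, 0.0]   # same shape, slightly lower peak
test_2 = [3.0, 1.0, 0.0, 1.0, 3.0]   # opposite shape

d1 = euclidean_distance(test_1, average)
d2 = euclidean_distance(test_2, average)
print(d1, d2)   # the closer match has the smaller distance
```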
Discrimination filter
The identified member must have a correlation of at least 0.4 with one training set. It must also match either the enhancer or the promoter profile more closely, not both. If either test fails, the window is discarded.
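One way to read the discrimination filter is sketched below. It assumes the correlation is a Pearson correlation and reduces each class to a single average profile; the real analysis combines several marks, so treat the function `classify` and the example profiles as hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long profiles (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def classify(window, promoter_avg, enhancer_avg, min_corr=0.4):
    """Return 'promoter', 'enhancer', or None if the window is discarded."""
    r_prom = pearson(window, promoter_avg)
    r_enh = pearson(window, enhancer_avg)
    if max(r_prom, r_enh) < min_corr:
        return None                  # fails the 0.4 correlation test
    if r_prom == r_enh:
        return None                  # matches both equally: ambiguous
    return "promoter" if r_prom > r_enh else "enhancer"

# Invented single-mark profiles for illustration.
promoter_avg = [0.0, 1.0, 4.0, 1.0, 0.0]
enhancer_avg = [2.0, 1.0, 0.0, 1.0, 2.0]

print(classify([0.1, 0.8, 3.5, 1.2, 0.0], promoter_avg, enhancer_avg))
print(classify([1.8, 1.1, 0.2, 0.9, 2.1], promoter_avg, enhancer_avg))
print(classify([1.0, 1.0, 1.0, 1.0, 1.0], promoter_avg, enhancer_avg))
```

A flat window correlates with neither average, so it falls into the third category from the earlier slide: not a promoter and not an enhancer.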
So what! This only shows that a training set can identify its own members. Big deal. How can we tell whether it can identify members that ARE NOT in the training set? This is what they want.
Demonstrate this by making training sets that are missing some members.
TRAINING SET MINUS RANDOM 10%: a training set that is missing a random 10% of the promoters.
TEST SET: all overlapping 10 kb windows from the same 30 megabase region.
Search the test set and see how well it does in retrieving the missing 10%. Repeat this process with a different randomly selected 10%.
H3K4me3 is good for promoters and H3K4me1 is good for enhancers, but together they work better than either one separately. Conclusion: use the output from two histone modifications for the best predictive power.
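The leave-out-10% procedure can be sketched as a loop. The promoter IDs and the placeholder prediction list are hypothetical; a real run would rebuild the average profile from each reduced training set and scan the test windows to obtain the predictions.

```python
import random

def holdout_rounds(members, fraction=0.1, rounds=10, seed=0):
    """Yield (training, held_out) pairs, each omitting a random fraction."""
    rng = random.Random(seed)   # fixed seed so the sketch is repeatable
    n_held = max(1, round(len(members) * fraction))
    for _ in range(rounds):
        held_out = set(rng.sample(members, n_held))
        training = [m for m in members if m not in held_out]
        yield training, held_out

def recovery_rate(held_out, predicted):
    """Fraction of the omitted members that the search found again."""
    return len(held_out & set(predicted)) / len(held_out)

promoters = [f"promoter_{i}" for i in range(1, 21)]   # hypothetical IDs
splits = list(holdout_rounds(promoters, rounds=3))
for training, held_out in splits:
    # Placeholder predictions: stands in for scanning the test windows
    # with an average profile built from `training` only.
    predicted = promoters[:15]
    print(len(training), sorted(held_out), recovery_rate(held_out, predicted))
```

The point of the design is that the held-out promoters never contribute to the average profile, so recovering them shows the method can find members it was not trained on.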
How well does it work to identify the 10% omitted from the TRAINING SET?
Figure: number of correct and incorrect predictions, plotted against the number of predictions needed to achieve these results. Each group differs by 10% of its members. The heights differ because the outliers are different. This is just for the P3 promoter class.
But how can we tell if it really works? Functionally test a predicted enhancer that was located 6 kb upstream of SLC22A5 - little is known about its transcriptional regulation.
SLC22A5 encodes a carnitine transporter.
Figure a: cloned regions of the SLC22A5 locus (chr5): predicted enhancer (E), promoter (PS), entire locus (L), locus minus E (L-E).
Test for enhancer properties in transfected HeLa cells.
Figure b: luciferase reporter activity in untreated HeLa cells, relative intensity (arbitrary units). Constructs: pGL3-b (plasmid by itself), L (entire region), L-E (entire region with the predicted enhancer deleted), PS (promoter without the predicted enhancer), PS + E (promoter plus predicted enhancer).
This approach successfully identified KNOWN promoters and enhancers AND identified NEW enhancers. These worked as enhancers when physically tested.
Heintzman, et al (2009) Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature, 459:108-112.
Take home: Chromatin modifications at enhancers are globally related to cell-type-specific gene expression.
Marks examined: H3K4me1, H3K4me3, H3K27ac
Surprise! Chromatin marks at promoters are similar in all cell types.
CTCF occupancy is the same across all cell types. CTCF is an insulator-binding protein.
The action is at the enhancers: H3K4me1 and H3K27ac. This is where variability between cell types is greatest.
5 human immortalized cell lines:
cervical carcinoma HeLa cells
immortalized lymphoblast GM06690 (GM)
leukaemia K562
embryonic stem cells (ES)
BMP4-induced ES cells (dES)
55,454 enhancers examined
At core promoters the marks are the same in most tissue types. This was a big surprise. We are not shown it, but expression of these genes differs.
But enhancer marks are diagnostic for the tissue type.
Figure a (example TSSs) and figure b (enhancers): H3K4me1, H3K4me3 and H3K27ac in HeLa, GM, K562, ES and dES cells, -5 kb to +5 kb, logR scale -3 to +3.
Chromatin modifications at enhancers are globally related to cell-type-specific gene expression.
Example: 55,454 enhancers examined in 5 cell lines.
Figure: H3K4me1 and H3K4me3 at enhancers in HeLa and K562 cells, -5 kb to +5 kb, logR scale -3 to +3.
An enhancer mark can indicate increased responsiveness.
Interferon activates STAT1 binding.

Cell type | TF | % at HeLa enhancer
MCF7 | ER | 32.6% (P < 1 x 10^-300)
HCT116 | p53 | 21.4% (P = 2.3 x 10^-30)
ME180 | p63 | 25.8% (P < 1 x 10^-300)
HeLa-IFN-γ | STAT1 | 25.3% (P < 4.5 x 10^-142)

Figure c: genes upregulated (%), genes in domains with STAT1 versus random genes. STAT1 group I: P = 5.8 x 10^-8. STAT1 group II: P = 5.4 x 10^-1.
Figure: STAT1, H3K4me1 and H3K4me3 with and without IFN-γ for STAT1 group I and STAT1 group II, -5 kb to +5 kb, logR scale -3 to +3.
When you add interferon, the ones with the enhancer mark (H3K4me1) turn on, even though both groups show STAT1 binding.
Of course the work continues.

Table 3. Distinctive Chromatin Features of Genomic Elements

Functional Annotation | Histone Marks | References
Promoters | H3K4me3 | Bernstein et al., 2005; Kim et al., 2005; Pokholok et al., 2005
Bivalent/Poised Promoter | H3K4me3/H3K27me3 | Bernstein et al., 2006
Transcribed Gene Body | H3K36me3 | Barski et al., 2007
Enhancer (both active and poised) | H3K4me1 | Heintzman et al., 2007
Poised Developmental Enhancer | H3K4me1/H3K27me3 | Creyghton et al., 2010; Rada-Iglesias et al., 2011
Active Enhancer | H3K4me1/H3K27ac | Creyghton et al., 2010; Heintzman et al., 2009; Rada-Iglesias et al., 2011
Polycomb Repressed Regions | H3K27me3 | Bernstein et al., 2006; Lee et al., 2006
Heterochromatin | H3K9me3 | Mikkelsen et al., 2007

Rivera, CM, Ren, B (2013) Mapping human epigenomes. Cell, 155:39-55.
Zhu, Y, et al (2013) Predicting enhancer transcription and activity from chromatin modifications. Nucleic Acids Res, 41:10032-10043.
Zentner, GE, Tesar, PJ, Scacheri, PC (2011) Epigenetic signatures distinguish multiple classes of enhancers with distinct cellular functions. Genome Res, 21:1273-1283.
Rada-Iglesias, A, Bajpai, R, Swigut, T, Brugmann, SA, Flynn, RA, Wysocka, J (2011) A unique chromatin signature uncovers early developmental enhancers in humans. Nature, 470:279-283.
Bonn, S, Zinzen, RP, Girardot, C, Gustafson, EH, Perez-Gonzalez, A, Delhomme, N, Ghavi-Helm, Y, Wilczynski, B, Riddell, A, Furlong, EE (2012) Tissue-specific analysis of chromatin state identifies temporal signatures of enhancer activity during embryonic development. Nat Genet, 44:148-156.