The Epigenome Tools 2: ChIP-Seq and Data Analysis Chongzhi Zang zang@virginia.edu http://zanglab.com PHS5705: Public Health Genomics March 20, 2017 1
Outline Epigenome: basics review ChIP-seq overview ChIP-seq data analysis 2
Epigenome histone nucleosome The epigenome is a multitude of chemical compounds that can tell the genome what to do. The epigenome is made up of chemical compounds and proteins that can attach to DNA and direct such actions as turning genes on or off, controlling the production of proteins in particular cells. -- from genome.gov Original figure from ENCODE, Darryl Leja (NHGRI), Ian Dunham (EBI) 3
Epigenomic marks DNA methylation Histone marks Covalent modifications Histone variants Chromatin regulators Histone modifying enzymes Chromatin remodeling complexes * Transcription factors 4
Histone modifications Nucleosome Core Particles Core Histones: H2A, H2B, H3, H4 Covalent modifications on histone tails include: methylation (me), acetylation (ac), phosphorylation Histone variants Histone modifications are implicated in influencing gene expression. Notation: H3K4me3 Allis C. et al. Epigenetics. 2006 5
Histone modifications associate with regulation of gene expression Differential expression log2 (fold-change) 0.35 Fractions of enhancers 0.30 0.25 0.20 0.15 0.10 0.05 0 H2AK5ac H2AK9ac H2A.Z H2BK5ac H2BK5me1 H2BK12ac H2BK20ac H2BK120ac H3K4ac H3K4me1 H3K4me2 H3K4me3 H3K9ac H3K9me1 H3K9me2 H3K9me3 H3K14ac H3K18ac H3K23ac H3K27ac H3K27me1 e H3K27me2 H3K27me3 H3K36ac H3K36me1 H3K36me3 H3K79me1 H3K79me2 H3K79me3 H3R2me1 H3R2me2 H4K5ac H4K8ac H4K12ac H4K16ac H4K20me1 H4K20me3 H4K91ac Wang, Zang et al. Nat Genet 2008 6
Functions of histone marks Table 3. Distinctive Chromatin Features of Genomic Elements Functional Annotation Histone Marks Promoters H3K4me3 Bivalent/Poised Promoter Transcribed Gene Body Enhancer (both active and poised) Poised Developmental Enhancer Active Enhancer H3K4me3/H3K27me3 H3K36me3 H3K4me1 H3K4me1/H3K27me3 H3K4me1/H3K27ac Polycomb Repressed Regions Heterochromatin H3K27me3 H3K9me3 Rivera & Ren Cell 2013 7
H3K4me3/H3K27me3 Bivalent Domain H3K4me3 Repressed H3K27me3 Remained Poised Induced From: https://pubs.niaaa.nih.gov/publications/arcr351/77-85.htm 8
ChIP-seq: Profiling epigenomes with sequencing histone nucleosome ATAC-seq Original figure from ENCODE, Darryl Leja (NHGRI), Ian Dunham (EBI) 9
Published ChIP-seq datasets are skyrocketing We are entering the Big Data era Number of ChIP-seq datasets on GEO 3000 2500 2000 1 1500 1000 500 0! Mei et al. Nucleic Acids Research 2016 10
Chromatin ImmunoPrecipitation (ChIP) 11
Protein-DNA crosslinking in vivo (for TF) 12
Chop the chromatin using sonication (TF) or micrococal nuclease (MNase) digestion (histone) 13
Specific factor-targeting antibody 14
Immunoprecipitation 15
DNA purification 16
PCR amplification and sequencing 17
ChIP-seq data analysis overview Scale chr19: 500 bases hg19 15,308,000 15,308,100 15,308,200 15,308,300 15,308,400 15,308,500 15,308,600 15,308,700 15,308,800 15,308,900 15,309,000 15,309,100 15,309,200 User Supplied Track @ILLUMINA-8879DC:231:KK:3:1:1070:945 1:Y:0: NNNAATACAGTCAGAAACATATCATATTGGAGAATA #################################### @ILLUMINA-8879DC:231:KK:3:1:1153:945 1:Y:0: NNNAAGCACACAGAAGATAACTAAACAATCAAGTAG #################################### @ILLUMINA-8879DC:231:KK:3:1:1222:945 1:Y:0: NNNAAGGGTCTTGAGAAGAAATCATTCTGGATGGCA #################################### @ILLUMINA-8879DC:231:KK:3:1:1304:939 1:Y:0: NNNCCAGGCTCCCGCGATTCTCCTGCCTCAGCTTCT #################################### @ILLUMINA-8879DC:231:KK:3:1:1354:945 1:Y:0: NNNCTCTTCCTTAGCTAAACTTTCAACTAAGCCAAA #################################### @ILLUMINA-8879DC:231:KK:3:1:1411:932 1:Y:0: NNNGTAGGACCATTGGCGTTGCGACACAAAAAATTT #################################### @ILLUMINA-8879DC:231:KK:3:1:1496:937 1:Y:0: NNNTTCATCGGGTTGAGAGTCCCCTTGTTGCATGCA #################################### @ILLUMINA-8879DC:231:KK:3:1:1533:939 1:Y:0: NNNATTTTCCCGTTCCAGGTCGCAATTTCCGCCGTT #################################### @ILLUMINA-8879DC:231:KK:3:1:1573:940 1:Y:0: NNNGGGGTGCGCCTTTAGTCCCAGCTACTCAGGAAC #################################### 18
ChIP-seq data analysis overview Where in the genome do these sequence reads come from? - Sequence alignment and quality control What does the enrichment of sequences mean? - Peak calling What can we learn from these data? Downstream analysis and integration 19
ChIP-seq data analysis: basic processing alignment of each sequence read: bowtie or BWA cannot map to the reference genome can map to multiple loci in the genome can map to a unique location in the genome redundancy control: Langmead et al. 2009, Zang et al. 2009 20
ChIP-seq data analysis: Peak calling DNA fragment size estimation pile-up profiling peak model cross-correlation d Percentage 0.05 0.10 0.15 0.20 0.25 0.30 0.35 forward tags reverse tags 600 400 200 0 200 400 600 s 0.055 0.05 0.045 0.04 0.035 0.03 0.025 0.02 0.015 0.01 0.005 0 50 100 150 200 250 300 350 400 Peak/signal detection Distance to the middle 21
ChIP-seq data analysis: Peak calling Sharp peaks transcription factor binding, DNase, ATAC-seq MACS (Zhang, 2008) dynamic background Poisson model Broad peaks Histone modifications, super-enhancers Diffuse SICER (Zang, 2009) Spatial clustering of localized weak signal and integrative Poisson model NOTCH1 H3K27ac 22 Wang, Zang et al. 2014
MACS Model-based Analysis for ChIP-Seq Tag distribution along the genome ~ Poisson distribution (λ BG = total tag / genome size) ChIP-seq show local biases in the genome Chromatin and sequencing bias 200-300bp control windows have to few tags ChIP But can look further Dynamic λ local = max(λ BG, [λ ctrl, λ 1k,] λ 5k, λ 10k ) Control 300bp 1kb 5kb 10kb http://liulab.dfci.harvard.edu/macs/ Zhang et al, Genome Bio, 2008
SICER Spatial-clustering Identification of ChIP-Enriched Regions 5kb 10kb omictools.com Zang et al. Bioinformatics 2009 24
ChIP-seq peak calling: Parameters Parameter Genome Effective genome rate DNA fragment size Window size Gap size P-value cut-off False discovery rate (FDR) cut-off Remarks Species and reference genome version, e.g. hg38, hg18, mm10, mm9 Fraction of the mappable genome, vary in species, read length, etc. Estimated by default; can specify otherwise Data resolution, usually nucleosome periodicity length, i.e. 200bp (for SICER only) Allowable gaps between eligible windows, usually 2 or 3 windows Threshold for peak calling, from model Threshold for peak calling, BH correction from p-value. 25
ChIP-seq data analysis: Review 1. Read mapping (sequence alignment) 2. Peak calling: MACS or SICER 1. QC 2. DNA fragment size estimation (for Single-end) 3. Pile-up profile generation 4. Peak/signal detection 3. Downstream analysis/integration 26
Data formats fastq: raw sequences BED: chr11 10344210 10344260 255 0 - chr4 76649430 76649480 255 0 + chr3 77858754 77858804 255 0 + chr16 62688333 62688383 255 0 + chr22 33031123 33031173 255 0 - SAM/BAM: aligned sequencing reads bedgraph, Wig, bigwig: pile-up profiles for browser visualization 27
Data flow Raw sequence reads fastq Bowtie/BWA Reference genome Aligned reads MACS/SICER BAM/BED Profile; Peaks bedgraph/wig/bigwig BED 28
Galaxy: web-interface analysis platform https://usegalaxy.org/ 29
Run MACS on Cistrome, a Galaxy-based platform http://cistrome.org/ap/ 30
Run SICER on Galaxy-based platforms http://services.cbib.u-bordeaux.fr/galaxy/ 31
ChIP-seq: Downstream analysis Data visualization UCSC genome browser: http://genome.ucsc.edu/ WashU epigenome browser: http://epigenomegateway.wustl.edu/ IGV: http://software.broadinstitute.org/software/igv/ Meta analysis CEAS: http://liulab.dfci.harvard.edu/ceas/ Integration with gene expression BETA: http://cistrome.org/beta/ MARGE: http://cistrome.org/marge/ Integration with other epigenomic data GREAT: http://great.stanford.edu ENCODE SCREEN: http://screen.umassmed.edu/ MANCIE: https://cran.r-project.org/package=mancie Cistrome DB: http://cistrome.org/db/ 32
BETA: Binding Expression Target Analysis Regulatory Potential TSS P (g i )= j S(i) exp ( ) ij λ j i 33
MARGE: A big data driven, integrative regression and semisupervised approach for predicting functional enhancers samples sample samples enhancer samples selection prediction Wang, Zang et al. Genome Res 2016 34
https://www.encodeproject.org/ ENCODE 35
Cistrome Data Browser http://cistrome.org/db/ 36
ChIP-seq data analysis: Review 1. Read mapping (sequence alignment) 2. Peak calling: MACS or SICER 1. QC 2. DNA fragment size estimation (for Single-end) 3. Pile-up profile generation 4. Peak/signal detection 3. Downstream analysis/integration 37
Summary ChIP-seq is used to profile epigenomes ChIP-seq data analysis MACS for narrow peaks SICER for broad peaks Online tools and resources 38
Further Reading The cancer epigenome: Concepts, challenges, and therapeutic opportunities Science 17 Mar 2017: Vol. 355, Issue 6330, pp.1147-1152 http://science.sciencemag.org/content/355/6330/1147 Heterochromatin Euchromatin Heterochromatin Writer Reader Eraser Nucleosome DNA Histones Nucleus DNMT EZH2 MMSET DOT1L SETD7 MLL1 PRMT1 PRMT3 PRMT5 SMYD2 G9a/GLP CREBBP EP300 MOZ BRD2/3/4 SMARCA2 SMARCA4 BAZ2A PB1 BRPF1 ATAD2 L3MBTL3 WDR5 BRD9 BRD7 CBX7 HDAC JMJD3/UTX LSD1 IDH1* IDH2* Therapies targeting: DNA modifications Histone modifications DNA and histone modifications 39
Thank you very much! zang@virginia.edu http://zanglab.com 40