ChIP-seq data analysis Harri Lähdesmäki Department of Computer Science Aalto University November 24, 2017
Contents Background ChIP-seq protocol ChIP-seq data analysis
Transcriptional regulation Transcriptional regulation is largely controlled by protein-DNA interactions [Figure labels: chromatin, distal TFBS, co-activator complex, transcription initiation complex, transcription initiation, CRM, proximal TFBS] Figure from (Wasserman & Sandelin, 2004)
Protein-DNA binding A transcription factor is a protein that binds to specific DNA sequences, recruits other co-factors and controls gene transcription Can function alone or with other proteins Can activate or repress gene expression Transcription factors contain DNA-binding domain(s) (DBDs)
Protein-DNA binding Proteins' DNA-binding specificities are encoded in their DNA-binding domains, which are specialized to recognize and bind specific types of binding sites Figure from (Kissinger et al., 1990)
Modeling transcriptional regulation The goal: an accurate method to quantify protein-DNA interactions Challenges: The human genome contains about 3.2 billion ($3.2 \times 10^{9}$) nucleotides; the human genome is physically about 2 meters long, packed in a cell with an average diameter of 10 µm Protein-DNA binding can be studied using e.g. Probabilistic models for biological sequences; Biological experiments + statistical analysis: ChIP-seq, protein binding microarray, high-throughput SELEX, chromatin accessibility; Biophysics: all-atom-level modeling
Contents Background ChIP-seq protocol ChIP-seq data analysis
ChIP-seq So, for any given condition, how do we find the genomic locations where DNA binding proteins bind? Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is the current state-of-the-art method The basic principle: Use a specific antibody to detect a protein of interest (or a post-translationally modified histone protein in nucleosomes) ChIP-seq procedure enriches DNA fragments that are bound to these proteins or nucleosomes
ChIP-seq protocol ChIP-seq steps: Crosslink DNA-binding proteins with DNA in vivo Shear the chromatin into small fragments (200-600 bp) amenable for sequencing (sonication) Immunoprecipitate the DNA-protein complex with a specific antibody Reverse the crosslinks Assay enriched DNA to determine the sequences bound by the protein of interest Figure from (Visel et al., 2009)
ChIP-seq protocol again Figure 1. Details of a ChIP-seq experiment. A DNA-binding protein is cross-linked to its in vivo genomic DNA targets, and the chromatin (a complex of DNA and protein) is isolated (1). The DNA with the bound proteins is extracted from the cells, [...] Figure from (Zhang et al., 2011), Biometrics, March 2011, p. 152
A data view of protein-DNA binding [Figure labels: protein or nucleosome of interest; positive strand (5' to 3') and negative strand (3' to 5'); the 5' ends of fragments are sequenced; short reads are aligned to the reference genome; the distribution of tags is computed; a profile is generated from the combined tags (for example, each mapped location is extended with a fragment of estimated size and the fragments are added); peak identification can be performed on either profile] Figure 5. Strand-specific profiles at enriched sites. DNA fragments from a chromatin immunoprecipitation experiment are sequenced from the 5' end. Therefore, the alignment of these tags to the genome results in two peaks (one on [...] Figure from (Park, 2009)
Poisson model (text box in the figure): A probability distribution that is often used to model the number of random events in a fixed interval. Given an average number of events in the interval, the probability of a given number of occurrences can be calculated.
Text visible around the figure (Park, 2009), on alignment: [...] alignment, as all subsequent results are based on the aligned reads. Owing to the large number of reads, the use of conventional alignment algorithms can take hundreds or thousands of processor hours; therefore, a new generation of aligners has been developed [57], and more are expected soon. Every aligner is a balance between accuracy, speed, memory and flexibility, and no aligner can be best suited for all applications. Alignment for ChIP-seq should allow for a small number of mismatches due to sequencing errors, SNPs and indels or the difference between the genome of interest and the reference genome. This is simpler than in RNA-seq, for example, in which large gaps corresponding to introns must be considered. Popular aligners include: Eland, an efficient and fast aligner for short reads that was developed by Illumina and is the default aligner on that platform; Mapping and Assembly with Qualities (MAQ) [58], a widely used aligner with a more exhaustive algorithm and excellent capabilities for detecting SNPs; and Bowtie [59], an extremely fast mapper that is based on an algorithm that was originally developed for file compression. These methods use the quality score that accompanies each base call to indicate its reliability. For the SOLiD di-base sequencing technology, in which two consecutive bases are read at a time, modified aligners have been developed [60,61]. Many current analysis pipelines discard non-unique tags, but studies involving the repetitive regions of the genome [27,62-64] require careful handling of these non-unique tags.
Text visible around the figure (Park, 2009), on identification of enriched regions: After sequenced reads are aligned to the genome, the next step is to identify regions that are enriched in the ChIP sample relative to the control with statistical significance. Several peak callers that scan along the genome to identify the enriched regions are currently available [24,26,38,48,65-70]. In early algorithms, regions were scored by the number of tags in a window of a given size and then assessed by a set of criteria based on factors such as enrichment over the control and minimum tag density. Subsequent algorithms take advantage of the directionality of the reads [71]. As shown in FIG. 5, the fragments are sequenced at the 5' end, and the locations of mapped reads should form two distributions, one on the positive strand and the other on the negative strand, with a consistent distance between the peaks of the distributions. In these methods, a smoothed profile of each strand is constructed [65,72] and the combined profile is calculated either by shifting each distribution towards the centre or by extending each mapped position into an appropriately oriented fragment and then adding the fragments together. The latter approach should result in a more accurate profile with respect to the width of the binding, but it requires an estimate of the fragment size as well as the assumption that fragment size is uniform. Given a combined profile, peaks can be scored in several ways. A simple fold ratio of the signal for the ChIP sample relative to that of the control sample around the peak (FIG. 3B) provides important information, but it is not adequate. A fold ratio of 5 estimated from 50 and 10 tags (from the ChIP and control experiments, respectively) has a different statistical significance to the same ratio estimated from, for example, 500 and 100 tags. A Poisson model for the tag distribution is an effective [...]
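The point about equal fold ratios carrying different statistical weight can be checked numerically. The following is a minimal sketch, assuming the control tag count is used directly as the Poisson rate (that is, both samples have the same sequencing depth); the function names are illustrative and not taken from any peak caller.

```python
import math

def log_poisson_pmf(k, lam):
    """Natural log of the Poisson probability P(X = k) with rate lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def poisson_upper_tail(x, lam, max_terms=100_000):
    """P(X >= x) for X ~ Poisson(lam), summed term by term from k = x upward.

    Summing the tail directly (instead of computing 1 - CDF) keeps very
    small p-values from being rounded to zero.
    """
    if x <= 0:
        return 1.0
    total = 0.0
    for k in range(x, x + max_terms):
        term = math.exp(log_poisson_pmf(k, lam))
        total += term
        # Past the mode the terms shrink monotonically; stop once negligible.
        if k > lam and term <= 1e-18 * total:
            break
    return min(total, 1.0)

# Same fold ratio (5), very different evidence.
# Assumption: the control count is used as the expected rate.
p_small = poisson_upper_tail(50, 10.0)    # 50 ChIP tags vs 10 control tags
p_large = poisson_upper_tail(500, 100.0)  # 500 ChIP tags vs 100 control tags
print(f"50 vs 10:   p = {p_small:.3g}")
print(f"500 vs 100: p = {p_large:.3g}")
```

Both p-values are small, but the one for 500 versus 100 tags is smaller by many orders of magnitude, which is why a fold ratio alone is not an adequate score.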
Contents Background ChIP-seq protocol ChIP-seq data analysis
Identification of binding sites from ChIP-seq data First steps in ChIP-seq data analysis: quality control, short read alignment Given a read density (or read densities on both strands) along the genome, the actual data analysis task is to identify the protein binding sites Given the above information, we should expect to see two signal peaks on opposite DNA strands within a proper distance This analysis is often called peak detection But how much signal (how many reads) is enough to call a protein-DNA interaction site? What affects the signal strength? Protein binding in the first place; sequencing efficiency and depth; chromatin accessibility; fragmentation efficiency; mappability (i.e., uniqueness) of a local genomic region
ChIP-seq controls The best way to assess significance of a signal at putative binding sites is to use a control for ChIP-seq Input-DNA: sequencing data of the genomic DNA. More/Less accessible regions are sequenced more/less ChIP-seq experiment with an unspecific antibody which does not detect any specific protein ChIP-seq controls can be used to account for many of the biases which affect the signal strength Input-DNA is currently considered the best control
Detecting binding sites from ChIP-seq data Early methods used a single cut-off for signal strength or a log-fold enrichment (for a given putative binding region/window): $\text{score} = \log \frac{\#\,\text{ChIP-seq reads in a window}}{\#\,\text{Input DNA reads in a window}}$ Current state-of-the-art methods are probabilistic
[Figure 3, panel A: fraction of peaks recovered versus fraction of reads sampled from the data; curves: considering all statistically significant peaks, and considering only peaks with fold enrichment above a threshold. Panel Ba (not statistically significant): enrichment ratio 1.5 (ChIP 15, control 10). Panel Bb (statistically significant): enrichment ratio 4 (ChIP 20, control 5) and enrichment ratio 1.5 (ChIP 150, control 100).] Figure from (Park, 2009)
Figure 3. Depth of sequencing. A: To determine whether enough tags have been sequenced, a simulation can be carried out to characterize the fraction of the peaks that would be recovered if a smaller number of tags had been sequenced. In many cases, new statistically significant peaks are discovered at a steady rate with an increasing number of tags (solid curve), that is, there is no saturation of binding sites. However, when a minimum threshold is imposed for the enrichment ratio between chromatin immunoprecipitation (ChIP) and input DNA peaks, the rate at which new peaks are discovered slows down (dashed curve), that is, saturation of detected binding sites can occur when only sufficiently prominent binding positions are considered. For a given data set, multiple curves corresponding to different thresholds can be examined to identify the threshold at which the curve becomes sufficiently flat to meet the [...]
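The early log-fold enrichment score can be sketched in a few lines. This is an illustrative sketch only: the slide does not state the base of the logarithm, so base 2 is an assumption, and the `pseudocount` parameter is a hypothetical addition that avoids division by zero when a window has no input reads.

```python
import math

def log_fold_enrichment(chip_reads, input_reads, pseudocount=1.0):
    """log2 of (ChIP reads / input reads) in one window.

    `pseudocount` is an assumption added to both counts so that empty
    windows do not cause a division by zero; with pseudocount=0 the
    caller must make sure both counts are positive.
    """
    return math.log2((chip_reads + pseudocount) / (input_reads + pseudocount))

def passes_cutoff(chip_reads, input_reads, cutoff=1.0, pseudocount=1.0):
    """Single cut-off rule used by early methods (cutoff value is illustrative)."""
    return log_fold_enrichment(chip_reads, input_reads, pseudocount) >= cutoff

# The two 1.5-fold windows from the figure get the same score without a
# pseudocount, although only one of them is statistically significant.
score_15_10 = log_fold_enrichment(15, 10, pseudocount=0.0)
score_150_100 = log_fold_enrichment(150, 100, pseudocount=0.0)
print(score_15_10, score_150_100)
```

Because the score ignores how many reads support the ratio, a single cut-off cannot separate the two cases, which motivates the probabilistic methods.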
Model-based Analysis of ChIP-Seq (MACS) A commonly used method for detecting TF binding sites from ChIP-seq data: MACS (Zhang et al., 2008) Workflow: Figure S6. Workflow chart of MACS. Figure from (Zhang et al., 2008)
Model-based Analysis of ChIP-Seq (MACS) Find model peaks: mfold_low and mfold_high are the high-confidence fold-enrichment parameters MACS slides a 2 × bandwidth window across the genome to find regions with tags (= reads) that are more than mfold_low-fold (but less than mfold_high-fold, to avoid artefacts) enriched relative to a random genome-wide tag distribution bandwidth = assumed sonicated fragment size
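The window scan for model peaks can be sketched as follows. This is not the MACS implementation: it assumes a single chromosome, a uniform random tag distribution as the reference, and a step of one bandwidth between windows; the function name, the `step` parameter and the default values are illustrative assumptions.

```python
from bisect import bisect_left

def candidate_model_windows(tag_positions, genome_size, bandwidth=300,
                            mfold_low=10, mfold_high=30, step=None):
    """Return (start, end, count) for windows of width 2 * bandwidth whose
    tag count is between mfold_low and mfold_high times the count expected
    under a uniform genome-wide tag distribution."""
    tags = sorted(tag_positions)
    if not tags:
        return []
    window = 2 * bandwidth
    step = step or bandwidth  # assumption: half-overlapping windows
    expected = len(tags) * window / genome_size
    hits = []
    for start in range(0, max(genome_size - window, 0) + 1, step):
        # Count tags with start <= position < start + window.
        count = bisect_left(tags, start + window) - bisect_left(tags, start)
        fold = count / expected
        if mfold_low < fold < mfold_high:
            hits.append((start, start + window, count))
    return hits

# Toy data: one background tag every 1 kb plus a cluster of 12 tags.
background = list(range(0, 100_000, 1000))
cluster = list(range(50_001, 50_013))
hits = candidate_model_windows(background + cluster, genome_size=100_000)
print(hits)
```

The upper bound mfold_high plays the role described on the slide: windows that are too strongly enriched are skipped because they are more likely to be artefacts.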
Model-based Analysis of ChIP-Seq (MACS) Model the shift size of ChIP-seq tags: Take 1000 high-quality peaks randomly; Separate Watson and Crick tags; Align the tags by the midpoint; Find d: the distance between the modes of the Watson and Crick peaks in the alignment [Figure panels: tag percentage (%) versus location with respect to the center of Watson and Crick peaks (bp), for Watson tags and Crick tags; one panel annotated d = 126 bp, another d = 122 bp] Figure from (Zhang et al., 2008), Genome Biology 2008, Volume 9, http://genomebiology.com/2008/9/9/r137
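Estimating d from the aligned model peaks can be sketched as below. The sketch assumes that tag positions have already been converted to offsets relative to each peak's midpoint and pooled over all model peaks; the histogram binning (`bin_size`) and the function names are illustrative assumptions, not details taken from MACS.

```python
from collections import Counter

def mode_of_offsets(offsets, bin_size=10):
    """Centre of the most populated histogram bin of the given offsets."""
    counts = Counter(offset // bin_size for offset in offsets)
    best_bin = max(counts, key=counts.get)
    return best_bin * bin_size + bin_size / 2

def estimate_d(watson_offsets, crick_offsets, bin_size=10):
    """Distance between the modes of the Crick and Watson tag distributions."""
    return (mode_of_offsets(crick_offsets, bin_size)
            - mode_of_offsets(watson_offsets, bin_size))

# Toy offsets (bp) relative to the peak midpoint; Watson tags pile up
# upstream of the centre and Crick tags downstream.
watson = [-65, -63, -61, -64, -100, 20]
crick = [61, 63, 65, 66, 120, -10]
d = estimate_d(watson, crick)
print(d)
```

With real data a smoothed density would give a more stable mode than a coarse histogram; the binning here only keeps the example short.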
Model-based Analysis of ChIP-Seq (MACS) Shift all the tags by d/2 toward the 3' ends, to the most likely protein-DNA interaction sites Remove redundant tags: Sometimes the same tag can be sequenced repeatedly, more often than expected from a random genome-wide tag distribution Such tags might arise from biases during ChIP-DNA amplification and sequencing library preparation (PCR duplicates) These are likely to add noise to the final peak calls MACS removes duplicate tags in excess of what is warranted by the sequencing depth (binomial distribution p-value < $10^{-5}$) For example, for the 3.9 million ChIP-seq tags, MACS allows each genomic position to contain no more than one tag and removes all the redundancies
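Tag shifting and duplicate removal can be sketched as follows. Tags are represented here as hypothetical `(position, strand)` pairs, where the position is the 5' end of the read. As a simplification, the cap on tags per position is passed in as a fixed number (`max_per_position`) instead of being derived from the binomial test mentioned above.

```python
from collections import Counter

def remove_redundant_tags(tags, max_per_position=1):
    """Keep at most `max_per_position` tags with the same (position, strand)."""
    seen = Counter()
    kept = []
    for tag in tags:
        if seen[tag] < max_per_position:
            seen[tag] += 1
            kept.append(tag)
    return kept

def shift_tags(tags, d):
    """Shift each tag by d/2 toward its 3' end.

    Watson ('+') tags move to higher coordinates and Crick ('-') tags to
    lower coordinates, so both strands converge on the binding site.
    """
    half = d // 2
    return [pos + half if strand == "+" else pos - half
            for pos, strand in tags]

tags = [(100, "+"), (100, "+"), (226, "-"), (100, "+")]
unique_tags = remove_redundant_tags(tags)
shifted = shift_tags(unique_tags, d=126)
print(unique_tags, shifted)
```

In this toy example the Watson tag at 100 and the Crick tag at 226 are exactly d = 126 bp apart, so after shifting both point at the same position.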
Model-based Analysis of ChIP-Seq (MACS) Identifying the most likely binding sites Counting process: if reads were sampled independently from a population with given, fixed fractions of genomic locations, the read counts x in each genomic location/window would follow a multinomial distribution (binomial for a single location), which can be approximated by the Poisson distribution
Multinomial and binomial distributions n independent trials Each trial leads to a success for exactly one of k categories Category j has a fixed success probability $p_j$, with $\sum_{j=1}^{k} p_j = 1$ The multinomial distribution gives the probability of any particular combination of numbers of successes for the k categories The binomial distribution is the special case of 2 categories; it can be approximated by a Poisson distribution with rate $\lambda = np$ when n is large and p is small
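The quality of the Poisson approximation can be checked numerically. The numbers below are illustrative assumptions (one million reads, each landing in a given window with probability 3e-6), not values from the lecture.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

n, p = 1_000_000, 3e-6   # many reads, small per-window probability
lam = n * p              # matching Poisson rate

# Largest pointwise difference between the two distributions for k = 0..14.
max_gap = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
              for k in range(15))
print(f"lambda = {lam}, max |binomial - Poisson| = {max_gap:.2e}")
```

The approximation is useful in practice because the Poisson distribution has a single parameter, the expected count per window.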
Model-based Analysis of ChIP-Seq (MACS) Let $x$ denote the number of sequencing reads in a window (genomic region) Analysing each genomic window independently, if all genomic regions were equally likely, then $x \sim \mathrm{Poisson}(\lambda_{BG})$, i.e. $P(x \mid \lambda_{BG}) = \frac{\lambda_{BG}^{x}}{x!} e^{-\lambda_{BG}}$, $x = 0, 1, 2, \ldots$, where $\lambda_{BG}$ is the rate of observing reads in the control sample For experiments with a control, MACS linearly scales the total control tag count to be the same as the total ChIP tag count Because ChIP-seq data has several bias sources which vary across the genome, it is better to model the data using a local/dynamic Poisson rate: $\lambda_{local} = \max(\lambda_{BG}, [\lambda_{1K}], \lambda_{5K}, \lambda_{10K})$, where $\lambda_{XK}$ is estimated from the $X$ K window of the control sample (e.g. input DNA) centered at a specific location
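The local rate can be sketched as a maximum over windows of different sizes. This is a simplified sketch, not the MACS code: it assumes a single chromosome, control tag positions given as integers, and a `scale` factor that stands in for the linear scaling of the control to the ChIP depth. The `use_1k` flag is a hypothetical parameter reflecting the brackets around $\lambda_{1K}$, which mark that term as optional.

```python
from bisect import bisect_left

def count_in(sorted_tags, start, end):
    """Number of tags with start <= position < end."""
    return bisect_left(sorted_tags, end) - bisect_left(sorted_tags, start)

def lambda_local(control_tags, genome_size, center, window,
                 scale=1.0, use_1k=True):
    """Expected tag count for a window of length `window` at `center`.

    Each rate is expressed per `window` bases so that the genome-wide
    and the local estimates are directly comparable.
    """
    tags = sorted(control_tags)
    rates = [len(tags) * window / genome_size]          # lambda_BG
    sizes = ([1000] if use_1k else []) + [5000, 10000]  # 1K, 5K, 10K
    for size in sizes:
        n = count_in(tags, center - size // 2, center + size // 2)
        rates.append(n * window / size)
    return scale * max(rates)

# Toy control: one tag every 1 kb plus a local excess of 50 tags
# around position 500,000 (e.g. a highly accessible region).
control = list(range(0, 1_000_000, 1000))
control += [499_600 + 16 * i for i in range(50)]
lam_hot = lambda_local(control, 1_000_000, center=500_000, window=200)
lam_flat = lambda_local(control, 1_000_000, center=100_000, window=200)
print(lam_hot, lam_flat)
```

Taking the maximum makes the test conservative: a window is only called enriched if it exceeds the background at every scale, including the local bias captured by the control.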
Model-based Analysis of ChIP-Seq (MACS) Assessing statistical significance of $x$ reads (in a genomic region $i$) using hypothesis testing $H_0$: the $i$th location is not a binding site $H_1$: the $i$th location is a binding site The p-value is the probability of observing $x$ or more reads, assuming the null hypothesis is true: $p\text{-value} = \sum_{k=x}^{\infty} \mathrm{Poisson}(k \mid \lambda_{local})$ The location with the highest fragment pileup (summit) is predicted as the precise binding location The ratio between the ChIP-seq tag count and $\lambda_{local}$ is reported as the fold enrichment
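The three quantities reported for a candidate region (p-value, summit, fold enrichment) can be sketched together. The sketch assumes that the per-base fragment pileup, the tag count and the local rate for the region are already available; `summarize_peak` and its arguments are hypothetical names used for illustration.

```python
import math

def poisson_tail(x, lam, max_terms=100_000):
    """P(X >= x) for X ~ Poisson(lam), summed from k = x upward."""
    if x <= 0:
        return 1.0
    total = 0.0
    for k in range(x, x + max_terms):
        term = math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))
        total += term
        # Terms decrease past the mode; stop when they no longer matter.
        if k > lam and term <= 1e-18 * total:
            break
    return min(total, 1.0)

def summarize_peak(pileup, region_start, tag_count, lam_local):
    """Summit position, Poisson p-value and fold enrichment for one region.

    `pileup` holds the fragment pileup per base, starting at `region_start`.
    """
    height = max(pileup)
    return {
        "summit": region_start + pileup.index(height),  # first highest base
        "p_value": poisson_tail(tag_count, lam_local),
        "fold_enrichment": tag_count / lam_local,
    }

peak = summarize_peak(pileup=[1, 2, 5, 9, 5, 2], region_start=1000,
                      tag_count=20, lam_local=4.0)
print(peak)
```

If several bases share the maximum pileup, this sketch reports the first one; choosing the middle of the plateau would be an equally reasonable convention.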
Model-based Analysis of ChIP-Seq (MACS) The read count in ChIP-seq versus control in 10 kb windows across the genome. Each dot represents a 10 kb window; red dots are windows containing ChIP peaks and black dots are windows containing control peaks used for FDR calculation. [Figure panels: axes "FoxA1 ChIP-Seq tag number / 10 kb" and "Control tag number (normalized) / 10 kb"; further panels labelled "Average tag number in control / kb", "Motif occurrence in peak centers for FoxA1 ChIP-Seq" and "Spatial resol[...]", comparing MACS, "Without local lambda" and "Without tag shifting"] Figure from (Zhang et al., 2008)
ChIP-seq peak: Illustration An illustration of a strong TF binding site Figure from http://www.nature.com/nmeth/journal/v6/n4/images/nmeth.f.247-f2.jpg
Multiple testing correction in MACS For a ChIP-seq experiment with controls, MACS empirically estimates the false discovery rate (FDR) At each p-value, MACS uses the same parameters to find ChIP-seq peaks over control, and control peaks over ChIP-seq (i.e., a sample swap) The empirical FDR is defined as $\text{empirical FDR} = \frac{\#\,\text{control peaks}}{\#\,\text{ChIP peaks}}$
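The sample-swap FDR can be sketched as a ratio of peak counts at a given p-value cut-off. The sketch assumes that peak calling has already been run in both directions and that each call is summarized by its p-value; the function name and the toy p-values are illustrative.

```python
def empirical_fdr(chip_pvalues, control_pvalues, cutoff):
    """Empirical FDR at a p-value cut-off.

    chip_pvalues:    p-values of peaks called in ChIP over control
    control_pvalues: p-values of peaks called in control over ChIP (swap)
    Returns NaN when no ChIP peak passes the cut-off.
    """
    n_chip = sum(p <= cutoff for p in chip_pvalues)
    n_control = sum(p <= cutoff for p in control_pvalues)
    return n_control / n_chip if n_chip else float("nan")

# Toy p-values for the two directions of the comparison.
chip = [1e-12, 1e-9, 1e-7, 1e-6]
control = [1e-6, 1e-4]
for cutoff in (1e-8, 1e-6):
    print(cutoff, empirical_fdr(chip, control, cutoff))
```

The estimate relies on the control behaving like a ChIP sample without true binding, so peaks found in the swapped direction approximate the number of false positives.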
Differential binding MACS can also be applied to differential binding between two conditions by treating one of the samples as the control Differential binding analysis in MACS will only work with two samples, i.e., one biological replicate per condition Empirical FDR control will not work in such a scenario
Summary ChIP-seq is a powerful way to detect TF binding sites (or e.g. enriched regions of histone modifications; next lecture) ChIP-seq approaches are limited in that: Only a subset of all TFs have a ChIP-grade antibody; None of the antibodies are perfect; A single experiment profiles a single protein
References
Metzker ML (2010) Sequencing technologies - the next generation. Nat Rev Genet 11(1):31-46.
Park PJ (2009) ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet 10(10):669-80.
Visel A, Rubin EM, Pennacchio LA (2009) Genomic views of distant-acting enhancers. Nature 461:199-205.
Wasserman WW, Sandelin A (2004) Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet 5:276-287.
Zhang X et al. (2011) PICS: probabilistic inference for ChIP-seq. Biometrics 67(1):151-163.
Zhang Y et al. (2008) Model-based analysis of ChIP-Seq (MACS). Genome Biol 9(9):R137.