ChIP-seq data analysis

Similar documents
Peak-calling for ChIP-seq and ATAC-seq

ChIP-seq analysis. J. van Helden, M. Defrance, C. Herrmann, D. Puthier, N. Servant, M. Thomas-Chollier, O.Sand

The Epigenome Tools 2: ChIP-Seq and Data Analysis

Accessing and Using ENCODE Data Dr. Peggy J. Farnham

Processing, integrating and analysing chromatin immunoprecipitation followed by sequencing (ChIP-seq) data

Computational aspects of ChIP-seq. John Marioni Research Group Leader European Bioinformatics Institute European Molecular Biology Laboratory

Part-II: Statistical analysis of ChIP-seq data

Analysis of Massively Parallel Sequencing Data Application of Illumina Sequencing to the Genetics of Human Cancers

Computational Analysis of UHT Sequences Histone modifications, CAGE, RNA-Seq

Breast cancer. Risk factors you cannot change include: Treatment Plan Selection. Inferring Transcriptional Module from Breast Cancer Profile Data

DNA Sequence Bioinformatics Analysis with the Galaxy Platform

High-Throughput Sequencing Course

ChIP-seq hands-on. Iros Barozzi, Campus IFOM-IEO (Milan) Saverio Minucci, Gioacchino Natoli Labs

Statistical analysis of ChIP-seq data

7SK ChIRP-seq is specifically RNA dependent and conserved between mice and humans.

Session 6: Integration of epigenetic data. Peter J Park Department of Biomedical Informatics Harvard Medical School July 18-19, 2016

High Throughput Sequence (HTS) data analysis. Lei Zhou

Raymond Auerbach PhD Candidate, Yale University Gerstein and Snyder Labs August 30, 2012

Patterns of Histone Methylation and Chromatin Organization in Grapevine Leaf. Rachel Schwope EPIGEN May 24-27, 2016

Chip Seq Peak Calling in Galaxy

Genome-wide Association Studies (GWAS) Pasieka, Science Photo Library

Introduction to Systems Biology of Cancer Lecture 2

Not IN Our Genes - A Different Kind of Inheritance.! Christopher Phiel, Ph.D. University of Colorado Denver Mini-STEM School February 4, 2014

Nature Genetics: doi: /ng Supplementary Figure 1

Colorspace & Matching

DNA-seq Bioinformatics Analysis: Copy Number Variation

AVENIO ctdna Analysis Kits The complete NGS liquid biopsy solution EMPOWER YOUR LAB

AVENIO family of NGS oncology assays ctdna and Tumor Tissue Analysis Kits

Supplementary Figure 1 IL-27 IL

Supplementary Figures

Relationship between genomic features and distributions of RS1 and RS3 rearrangements in breast cancer genomes.

Supplemental Figure S1. Expression of Cirbp mrna in mouse tissues and NIH3T3 cells.

RNA-seq Introduction

Mutation Detection and CNV Analysis for Illumina Sequencing data from HaloPlex Target Enrichment Panels using NextGENe Software for Clinical Research

genomics for systems biology / ISB2020 RNA sequencing (RNA-seq)

Nature Structural & Molecular Biology: doi: /nsmb.2419

Genome-wide relationship between histone H3 lysine 4 mono- and tri-methylation and transcription factor binding

Transcript reconstruction

Advance Your Genomic Research Using Targeted Resequencing with SeqCap EZ Library

Mechanisms of alternative splicing regulation

Gene Selection for Tumor Classification Using Microarray Gene Expression Data

Abstract. Optimization strategy of Copy Number Variant calling using Multiplicom solutions APPLICATION NOTE. Introduction

Recent advances in ChIP-seq analysis: from quality management to whole-genome annotation

Lecture 8 Understanding Transcription RNA-seq analysis. Foundations of Computational Systems Biology David K. Gifford

cis-regulatory enrichment analysis in human, mouse and fly

Genome-Wide Localization of Protein-DNA Binding and Histone Modification by a Bayesian Change-Point Method with ChIP-seq Data

ChipSeq. Technique and science. The genome wide dynamics of the binding of ldb1 complexes during erythroid differentiation

User Guide. Association analysis. Input

Package NarrowPeaks. August 3, Version Date Type Package

SUPPLEMENTARY INFORMATION

Yingying Wei George Wu Hongkai Ji

MATS: a Bayesian framework for flexible detection of differential alternative splicing from RNA-Seq data

RASA: Robust Alternative Splicing Analysis for Human Transcriptome Arrays

MBios 478: Systems Biology and Bayesian Networks, 27 [Dr. Wyrick] Slide #1. Lecture 27: Systems Biology and Bayesian Networks

Exploring chromatin regulation by ChIP-Sequencing

Discovery of Novel Human Gene Regulatory Modules from Gene Co-expression and

P. Tang ( 鄧致剛 ); PJ Huang ( 黄栢榕 ) g( ); g ( ) Bioinformatics Center, Chang Gung University.

ChIPSeq. Technique and science. The genome wide dynamics of the binding of ldb1 complexes during erythroid differentiation

Chapter 1. Introduction

MIR retrotransposon sequences provide insulators to the human genome

Golden Helix s End-to-End Solution for Clinical Labs

Obstacles and challenges in the analysis of microrna sequencing data

Transcript-indexed ATAC-seq for immune profiling

Simple, rapid, and reliable RNA sequencing

Broad H3K4me3 is associated with increased transcription elongation and enhancer activity at tumor suppressor genes

Phylogenomics. Antonis Rokas Department of Biological Sciences Vanderbilt University.

Tutorial. ChIP Sequencing. Sample to Insight. September 15, 2016

cn.mops - Mixture of Poissons for CNV detection in NGS data Günter Klambauer Institute of Bioinformatics, Johannes Kepler University Linz

STAT1 regulates microrna transcription in interferon γ stimulated HeLa cells

EPIGENOMICS PROFILING SERVICES

cn.mops - Mixture of Poissons for CNV detection in NGS data Günter Klambauer Institute of Bioinformatics, Johannes Kepler University Linz

A Practical Guide to Integrative Genomics by RNA-seq and ChIP-seq Analysis

High-resolution characterization of sequence signatures due to non-random cleavage of cell-free DNA

fl/+ KRas;Atg5 fl/+ KRas;Atg5 fl/fl KRas;Atg5 fl/fl KRas;Atg5 Supplementary Figure 1. Gene set enrichment analyses. (a) (b)

Transcriptional control in Eukaryotes: (chapter 13 pp276) Chromatin structure affects gene expression. Chromatin Array of nuc

SUPPLEMENTAL INFORMATION

Sequence and chromatin determinants of cell-type specific transcription factor binding

Genomic structural variation

Analysis of the peroxisome proliferator-activated receptor-β/δ (PPARβ/δ) cistrome reveals novel co-regulatory role of ATF4

Supplemental Figure S1. Tertiles of FKBP5 promoter methylation and internal regulatory region

Hands-On Ten The BRCA1 Gene and Protein

Transcriptome Analysis

Eukaryotic small RNA Small RNAseq data analysis for mirna identification

Lecture #4: Overabundance Analysis and Class Discovery

EXPression ANalyzer and DisplayER

Statistical analysis of RIM data (retroviral insertional mutagenesis) Bioinformatics and Statistics The Netherlands Cancer Institute Amsterdam

Measuring DNA Methylation with the MinION. Winston Timp Department of Biomedical Engineering Johns Hopkins University 12/1/16

CTCF-Mediated Functional Chromatin Interactome in Pluripotent Cells

MODULE 4: SPLICING. Removal of introns from messenger RNA by splicing

Neuron, Volume 63 Spatial attention decorrelates intrinsic activity fluctuations in Macaque area V4.

A complete next-generation sequencing workfl ow for circulating cell-free DNA isolation and analysis

Variant Classification. Author: Mike Thiesen, Golden Helix, Inc.

Below, we included the point-to-point response to the comments of both reviewers.

The Loss of Heterozygosity (LOH) Algorithm in Genotyping Console 2.0

Selective depletion of abundant RNAs to enable transcriptome analysis of lowinput and highly-degraded RNA from FFPE breast cancer samples

The Importance of Coverage Uniformity Over On-Target Rate for Efficient Targeted NGS

Hao D. H., Ma W. G., Sheng Y. L., Zhang J. B., Jin Y. F., Yang H. Q., Li Z. G., Wang S. S., GONG Ming*

Generating Spontaneous Copy Number Variants (CNVs) Jennifer Freeman Assistant Professor of Toxicology School of Health Sciences Purdue University

Supervised Learner for the Prediction of Hi-C Interaction Counts and Determination of Influential Features. Tyler Yue Lab

the reaction was stopped by adding glycine to final concentration 0.2M for 10 minutes at

Transcription:

ChIP-seq data analysis Harri Lähdesmäki Department of Computer Science Aalto University November 24, 2017

Contents Background ChIP-seq protocol ChIP-seq data analysis

Transcriptional regulation Transcriptional regulation is largely controlled by protein-dna interactions Chromatin Distal TFBS Co-activator complex Transcription initiation complex Transcription initiation CRM Proximal TFBS Figure from (Wasserman & Sandelin, 2004)

Transcriptional regulation Transcriptional regulation is largely controlled by protein-dna interactions Chromatin Distal TFBS Co-activator complex Transcription initiation complex Transcription initiation CRM Proximal TFBS Figure from (Wasserman & Sandelin, 2004)

Protein-DNA binding A transcription factor is a protein that binds to specific DNA sequences, recruits other co-factors and controls gene transcription Can function alone or with other proteins Can activate or repress gene expression Transcription factors contain DNA-binding domain(s) (DBDs)

Protein-DNA binding Proteins DNA-binding specificities are encoded in their DNA-binding domains that specialize them to recognize and bind specific types of binding sites Figure from (Kissinger et al., 1990)

Modeling transcriptional regulation The goal An accurate method to quantify protein-dna interactions Challenges Human genome contains about 3.2 billion (3.2 10 9!) nucleotides Human genome is physically about 2 meters long, packed in a cell of average diameter of 10µm

Modeling transcriptional regulation The goal An accurate method to quantify protein-dna interactions Challenges Human genome contains about 3.2 billion (3.2 10 9!) nucleotides Human genome is physically about 2 meters long, packed in a cell of average diameter of 10µm Protein-DNA binding can be studied using e.g. Probabilistic models for biological sequences Biological experiments + statistical analysis: ChIP-seq, protein binding microarray, high-throughput SELEX, chromatin accessibility Biophysics: all atom-level modeling

Contents Background ChIP-seq protocol ChIP-seq data analysis

ChIP-seq So, for any given condition, how do we find the genomic locations where DNA binding proteins bind? Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is the current state-of-the-art method The basic principle: Use a specific antibody to detect a protein of interest (or a post-translationally modified histone protein in nucleosomes) ChIP-seq procedure enriches DNA fragments that are bound to these proteins or nucleosomes

ChIP-seq protocol ChIP-seq steps: Crosslink DNA-binding proteins with DNA in vivo Shear the chromatin into small fragments (200-600 bp) amenable for sequencing (sonication) Immunoprecipitate the DNA-protein complex with a specific antibody Reverse the crosslinks Assay enriched DNA to determine the sequences bound by the protein of interest Figure from (Visel et al., 2009)

ChIP-seq protocol again 152 Biometrics, March 2011 Figure 1. Details of a ChIPseq experiment. Figure A DNA-binding from (Zhang proteinet isal., cross-linked 2011) to its in vivo genomic DNA targets, and the chromatin (a complex of DNA and protein) is isolated (1). The DNA with the bound proteins is extracted from the cells,

Poisson model A probability distribution that aligned reads. Owing to the large number of reads, the is often used to model the use of conventional alignment algorithms can take hundreds or thousands of processor visualization hours; therefore, a Strand specificity andnumber of read random events in a density new fixed interval. Given an average number of events in the interval, the probability of a alignment, as all subsequent results are based on the generation of aligners has been developed 57, and more are expected soon. Every aligner is a balance between accuracy, speed, memory and flexibility, and no aligner can be best suited for all applications. Alignment for given number of occurrences can be calculated. A data view of protein-dna binding Protein or nucleosome of interest 5 Positive strand 3 3 Negative strand 5 5 ends of fragments are sequenced ChIP seq should allow for a small number of mismatches due to sequencing errors, SNPs and indels or the difference between the genome of interest and the reference genome. This is simpler than in RNA seq, for example, in which large gaps corresponding to introns must be considered. Popular aligners include: Eland, an efficient and fast aligner for short reads that was developed by Illumina and is the default aligner on that platform; Mapping and Assembly with Qualities (MAQ) 58, a widely used aligner with a more exhaustive algorithm and excellent capabilities for detecting SNPs; and Bowtie 59, an extremely fast mapper that is based on an algorithm that was originally developed for file compression. These methods use the quality score that accompanies each base call to indicate its reliability. For the SOLiD di-base sequencing technology, in which two consecutive bases are read at a time, modified aligners have been developed 60,61. Many current analysis pipelines discard non-unique tags, but studies involving the repetitive regions of the genome 27,62 64 require careful handling of these non-unique tags. Short reads are aligned Distribution of tags Reference genome is computed Profile is generated from combined tags For example, each mapped Peak identification location is extended can be performed with a fragment of on either profile estimated size Fragments are added Figure 5 Strand-specific profiles at enriched sites. DNA fragments from a chromatin immunoprecipitation Figure experiment from (Park, are sequenced 2009) from the 5 end. Therefore, the alignment of these tags to the genome results in two peaks (one on Identification of enriched regions. After sequenced reads are aligned to the genome, the next step is to identify regions that are enriched in the ChIP sample relative to the control with statistical significance. Several peak callers that scan along the genome to identify the enriched regions are currently available 24,26,38,48,65 70. In early algorithms, regions were scored by the number of tags in a window of a given size and then assessed by a set of criteria based on factors such as enrichment over the control and minimum tag density. Subsequent algorithms take advantage of the directionality of the reads 71. As shown in FIG. 5, the fragments are sequenced at the 5 end, and the locations of mapped reads should form two distributions, one on the positive strand and the other on the negative strand, with a consistent distance between the peaks of the distributions. In these methods, a smoothed profile of each strand is constructed 65,72 and the combined profile is calculated either by shifting each distribution towards the centre or by extending each mapped position into an appropriately oriented fragment and then adding the fragments together. The latter approach should result in a more accurate profile with respect to the width of the binding, but it requires an estimate of the fragment size as well as the assumption that fragment size is uniform. Given a combined profile, peaks can be scored in several ways. A simple fold ratio of the signal for the ChIP sample relative to that of the control sample around the peak (FIG. 3B) provides important information, but it is not adequate. A fold ratio of 5 estimated from 50 and 10 tags (from the ChIP and control experiments, respectively) has a different statistical significance to the same ratio estimated from, for example, 500 and 100 tags. A Poisson model for the tag distribution is an effective

Contents Background ChIP-seq protocol ChIP-seq data analysis

Identification of binding sites from ChIP-seq data First steps in ChIP-seq data analysis: quality control, short read alignment Given a read density (or read densities on both strands) along genome, the actual data analysis task involves identification of the protein binding sites Given the above information, we should expect to see two signal peaks on opposite DNA strands within a proper distance This analysis is often called peak detection

Identification of binding sites from ChIP-seq data First steps in ChIP-seq data analysis: quality control, short read alignment Given a read density (or read densities on both strands) along genome, the actual data analysis task involves identification of the protein binding sites Given the above information, we should expect to see two signal peaks on opposite DNA strands within a proper distance This analysis is often called peak detection But how much signal (how many reads) are considered enough to call a protein-dna interaction site? What affects the signal strength? Protein binding in the first place Sequencing efficiency and depth Chromatin accessibility Fragmentation efficiency Mappability (i.e., uniqueness) of a local genomic region

ChIP-seq controls The best way to assess significance of a signal at putative binding sites is to use a control for ChIP-seq Input-DNA: sequencing data of the genomic DNA. More/Less accessible regions are sequenced more/less ChIP-seq experiment with an unspecific antibody which does not detect any specific protein ChIP-seq controls can be used to account for many of the biases which affect the signal strength Input-DNA is currently considered the best control

Detecting binding sites from ChIP-seq data S A Fraction of peaks recovered Early methods used a single cut-off for signal strength or a Ba Not statistically significant log-fold enrichment (for a given putative binding Enrichment ratio: 1.5 region/window) Considering all statistically significant peaks 1 with fold enrichment above a # threshold ChIP-seqControl reads 10in a window score = log # Input DNA reads in a window Ba Considering only peaks Not statistically significant Enrichment ratio: 1.5 ChIP 15 Considering only peaks with fold enrichment 5 above a threshold 10 Control 100 0 Control 0 Figure1 from (Park, 2009) Fraction of reads sampled from the data Bb Statistically significant Figure Current 3 Depth of sequencing. state-of-the-art A To determine methods whether enough are tags have probabilistic been sequenced, a simulation can be carried out to characterize the fraction of the peaks that would be recovered if a smaller number of tags had been Enrichment Enrichment sequenced. In many cases, new statistically significant ratio: peaks 4 are discovered at ratio: a steady 1.5 rate with an increasing number of tags (solid curve) that is, there is no saturation of binding sites. However, when a minimum threshold is imposed for the enrichment ratio between chromatin ChIP immunoprecipitation 20 (ChIP) 150 and input DNA peaks, the rate at which new peaks are discovered slows down (dashed curve) that is, saturation of detected binding sites can occur when only sufficiently prominent binding positions are considered. For a given data set, multiple curves corresponding to different thresholds can be examined to identify 5 the threshold at which the curve becomes sufficiently flat to meet the Control 100 Bb ChIP ChIP 15 Statistically significant 20 Enrichment ratio: 4 150 Enrichment ratio: 1.5

Model-based Analysis of ChIP-Seq (MACS) A commonlyfigure used S6. method Workflow chart for of MACS. detecting TF binding sites from ChIP-seq data: MACS (Zhang et al, 2008) Workflow: Figure from (Zhang et al., 2008)

Model-based Analysis of ChIP-Seq (MACS) Find model peaks: mfoldlow and mfold high : high confidence fold-enrichment parameters MACS slides 2 bandwidth window across the genome to find regions with tags (=reads) more than mfold low (but smaller than mfold high to avoid artefacts) enriched relative to a random tag genome distribution bandwidth = assumed sonicated fragment size

Model-based Analysis of ChIP-Seq (MACS) Model the shift size of ChIP-seq tags Take 1000 high-quality peaks randomly Separate Watson and Crick tags Aligns the tags by the mid point Find d: distance between the modes of the Watson and Crick peaks in the alignment http://genomebiology.com/2008/9/9/r137 Genome Biology 2008, Volume 9, Iss (a) (b) Tag percentage (%) 300 200 100 0 100 200 300 alized) / 10 kb 0 500 600 ontrol / kb 2.7 2.8 0.0 0.1 0.2 0.3 0.4 0.5 d = 126 bp Watson tags Crick tags Tag percentage (%) 0.1 0.2 0.3 0.4 0.5 0.6 0.0 d = 122 bp 300 200 100 0 (c) Location with respect to the center of Watson and Crick peaks (bp) Figure from (Zhang et al., 2008) (d) Location with respect to F

Model-based Analysis of ChIP-Seq (MACS) Shift all the tags by d/2 toward the 3 ends to the most likely protein-dna interaction sites Remove redundant tags: Sometimes the same tag can be sequenced repeatedly, more than expected from a random genome-wide tag distribution Such tags might arise from biases during ChIP-DNA amplification and sequencing library preparation (PCR duplicates) These are likely to add noise to the final peak calls MACS removes duplicate tags in excess of what is warranted by the sequencing depth (binomial distribution p-value < 10 5 ) For example, for the 3.9 million ChIP-seq tags, MACS allows each genomic position to contain no more than one tag and removes all the redundancies

Model-based Analysis of ChIP-Seq (MACS) Identifying the most likely binding sites Counting process: if reads were sampled independently from a population with given, fixed fractions of genomic locations, the read counts x in each genomic location/window would follow a multinomial distribution (binomial for a single location), which can be approximated by the Poisson distribution

Multinomial and binomial distributions n independent trials Each trial leads to a success for exactly one of k categories Each category has a fixed success probability p k Multinomial gives probability of any particular combination of numbers of successes for the k categories Binomial is a special case of 2 categories; can be approximated with a Poisson for larger n

Model-based Analysis of ChIP-Seq (MACS) Let x denote the number of sequencing reads in a window (genomic region) Analysing each genomic window independently, if all genomic regions would be equally likely, then x Poisson( λ BG ) = λx BG x! e λ BG, x = 0, 1, 2,... where λ BG is the rate of observing reads in the control sample For experiments with a control, MACS linearly scales the total control tag count to be the same as the total ChIP tag count

Model-based Analysis of ChIP-Seq (MACS) Let x denote the number of sequencing reads in a window (genomic region) Analysing each genomic window independently, if all genomic regions would be equally likely, then x Poisson( λ BG ) = λx BG x! e λ BG, x = 0, 1, 2,... where λ BG is the rate of observing reads in the control sample For experiments with a control, MACS linearly scales the total control tag count to be the same as the total ChIP tag count Because ChIP-seq data has several bias sources which vary across the genome, it is better to model the data using a local/dynamic Poisson λ local = max(λ BG, [λ 1K ], λ 5K, λ 10K ) λ X K is estimated from the X K window of the control sample (e.g. input-dna) centered at a specific location

Model-based Analysis of ChIP-Seq (MACS) Assessing statistical significance of x reads (in a genomic region i) using hypothesis testing H 0 : the ith location is not a binding site H 1 : the ith location is a binding site The p-value is the probability of observing x many reads or more, assuming the null hypothesis is true: p value = Poisson(x λ local ) k=x

Model-based Analysis of ChIP-Seq (MACS) Assessing statistical significance of x reads (in a genomic region i) using hypothesis testing H 0 : the ith location is not a binding site H 1 : the ith location is a binding site The p-value is the probability of observing x many reads or more, assuming the null hypothesis is true: p value = Poisson(x λ local ) k=x The location with the highest fragment pileup (summit) is predicted as the precise binding location The ratio between the ChIP-seq tag count and λ local is reported as the fold enrichment

Tag percentage (%) 0.0 0.1 0.2 0.3 0 Model-based Analysis of ChIP-Seq (MACS) The read count in ChIP-seq versus control in 10 kb windows across the genome. Each dot represents a 10 kb window; red dots are windows 300 200 100 0 100 200 300 Location with respect to the center of Watson and Crick peaks (bp) containing ChIP peaks and black dots are windows containing (c) (d) control peaks used for FDR calculation. Control tag number (normalized) / 10 kb 0 100 200 300 400 500 600 Average tag number in control / kb Tag percentage (%) 2.3 2.4 2.5 2.6 2.7 2.8 0.1 0.2 0.3 0.4 0.0 300 200 100 Location wit 0 200 400 600 800 1,000 20 10 (e) FoxA1 ChIP Seq tag number / 10 kb Figure from (Zhang et al., 2008) Motif occurrence in peak centers for FoxA1 ChIP-Seq (f) Distance Spatial resol 70 MACS Without local lambda Without tag shifting 5 MACS Without local lambda Without tag shifting

ChIP-seq peak: Illustration An illustration of a strong TF binding site Figure from http://www.nature.com/nmeth/journal/v6/n4/images/nmeth.f.247-f2.jpg

Multiple correction in MACS For a ChIP-seq experiment with controls, MACS empirically estimates the false discovery rate (FDR) At each p-value, MACS uses the same parameters to find ChIP-seq peaks over control, and Control peaks over ChIP-seq (i.e., a sample swap) The empirical FDR is defined as empirical FDR = #control peaks #ChIP peaks

Differential binding MACS can also be applied to differential binding between two conditions by treating one of the samples as the control Differential binding analysis in MACS will only work with two samples, i.e., one biological replicate per condition Empirical FDR control will not work in such a scenario

Summary ChIP-seq is a powerful way to detect TF binding sites (or e.g. enriched regions of histone modifications; next lecture) ChIP-seq approaches are limited in that Only a subset of all TFs have a chip-grade antibody None of the antibodies are perfect A single experiment will profile a single protein

References Metzker ML (2010) Sequencing technologies - the next generation, Nat Rev Genet. 11(1):31-46. Park PJ (2009) ChIP-seq: advantages and challenges of a maturing technology, Nat Rev Genet. 10(10):669-80. Axel Visel, Edward M. Rubin & Len A. Pennacchio (2009) Genomic views of distant-acting enhancers, Nature 461, 199-205. Zhang Y et al. (2008), Model-based analysis of ChIP-Seq (MACS), Genome Biol. 9(9):R137. Zhang X et al. (2011) PICS: probabilistic inference for ChIP-seq, Biometrics, 67(1): 151-163. Wyeth W. Wasserman & Albin Sandelin, Applied bioinformatics for the identification of regulatory elements, Nature Reviews Genetics 5, 276-287, 2004.