Peak-calling for ChIP-seq and ATAC-seq

Similar documents
ChIP-seq analysis. J. van Helden, M. Defrance, C. Herrmann, D. Puthier, N. Servant, M. Thomas-Chollier, O.Sand

ChIP-seq data analysis

Processing, integrating and analysing chromatin immunoprecipitation followed by sequencing (ChIP-seq) data

The Epigenome Tools 2: ChIP-Seq and Data Analysis

Computational aspects of ChIP-seq. John Marioni Research Group Leader European Bioinformatics Institute European Molecular Biology Laboratory

Computational Analysis of UHT Sequences Histone modifications, CAGE, RNA-Seq

Accessing and Using ENCODE Data Dr. Peggy J. Farnham

ChIP-seq hands-on. Iros Barozzi, Campus IFOM-IEO (Milan) Saverio Minucci, Gioacchino Natoli Labs

Recent advances in ChIP-seq analysis: from quality management to whole-genome annotation

EPIGENOMICS PROFILING SERVICES

Session 6: Integration of epigenetic data. Peter J Park Department of Biomedical Informatics Harvard Medical School July 18-19, 2016

Raymond Auerbach PhD Candidate, Yale University Gerstein and Snyder Labs August 30, 2012

Part-II: Statistical analysis of ChIP-seq data

Nature Structural & Molecular Biology: doi: /nsmb.2419

Exploring chromatin regulation by ChIP-Sequencing

High Throughput Sequence (HTS) data analysis. Lei Zhou

Comparison of open chromatin regions between dentate granule cells and other tissues and neural cell types.

RNA-seq Introduction

Breast cancer. Risk factors you cannot change include: Treatment Plan Selection. Inferring Transcriptional Module from Breast Cancer Profile Data

7SK ChIRP-seq is specifically RNA dependent and conserved between mice and humans.

Package NarrowPeaks. August 3, Version Date Type Package

Broad H3K4me3 is associated with increased transcription elongation and enhancer activity at tumor suppressor genes

An epigenetic approach to understanding (and predicting?) environmental effects on gene expression

Discovery of Novel Human Gene Regulatory Modules from Gene Co-expression and

Genome-Wide Localization of Protein-DNA Binding and Histone Modification by a Bayesian Change-Point Method with ChIP-seq Data

Heintzman, ND, Stuart, RK, Hon, G, Fu, Y, Ching, CW, Hawkins, RD, Barrera, LO, Van Calcar, S, Qu, C, Ching, KA, Wang, W, Weng, Z, Green, RD,

Transcript-indexed ATAC-seq for immune profiling

Yingying Wei George Wu Hongkai Ji

A Practical Guide to Integrative Genomics by RNA-seq and ChIP-seq Analysis

Supplemental Figure S1. Tertiles of FKBP5 promoter methylation and internal regulatory region

Patterns of Histone Methylation and Chromatin Organization in Grapevine Leaf. Rachel Schwope EPIGEN May 24-27, 2016

Genome-wide Association Studies (GWAS) Pasieka, Science Photo Library

Supplementary Figure S1. Gene expression analysis of epidermal marker genes and TP63.

Introduction to Systems Biology of Cancer Lecture 2

Differential peak calling of ChIP-seq signals with replicates with THOR

Nature Genetics: doi: /ng Supplementary Figure 1. Assessment of sample purity and quality.

Nature Genetics: doi: /ng Supplementary Figure 1

Tutorial. ChIP Sequencing. Sample to Insight. September 15, 2016

CTCF-Mediated Functional Chromatin Interactome in Pluripotent Cells

ChromHMM Tutorial. Jason Ernst Assistant Professor University of California, Los Angeles

Supplemental Figure 1: Asymmetric chromatin maturation leads to epigenetic asymmetries on sister chromatids.

Supervised Learner for the Prediction of Hi-C Interaction Counts and Determination of Influential Features. Tyler Yue Lab

Supplementary Figures

Statistical analysis of ChIP-seq data

The Insulator Binding Protein CTCF Positions 20 Nucleosomes around Its Binding Sites across the Human Genome

A novel ATAC-seq approach reveals lineage-specific reinforcement of the open chromatin landscape via cooperation between BAF and p63

Functional annotation of farm animal genomes: ChIP-seq.

Nature Immunology: doi: /ni Supplementary Figure 1. Transcriptional program of the TE and MP CD8 + T cell subsets.

Table S1. Total and mapped reads produced for each ChIP-seq sample

MIR retrotransposon sequences provide insulators to the human genome

Analysis of Massively Parallel Sequencing Data Application of Illumina Sequencing to the Genetics of Human Cancers

Not IN Our Genes - A Different Kind of Inheritance.! Christopher Phiel, Ph.D. University of Colorado Denver Mini-STEM School February 4, 2014

Single-strand DNA library preparation improves sequencing of formalin-fixed and paraffin-embedded (FFPE) cancer DNA

MBios 478: Systems Biology and Bayesian Networks, 27 [Dr. Wyrick] Slide #1. Lecture 27: Systems Biology and Bayesian Networks

Sudin Bhattacharya Institute for Integrative Toxicology

SUPPLEMENTARY INFORMATION

cis-regulatory enrichment analysis in human, mouse and fly

Transcriptional control in Eukaryotes: (chapter 13 pp276) Chromatin structure affects gene expression. Chromatin Array of nuc

Yue Wei 1, Rui Chen 2, Carlos E. Bueso-Ramos 3, Hui Yang 1, and Guillermo Garcia-Manero 1

Discovery of two identities of neuroblastoma cells via the analysis of super-enhancer landscapes

genomics for systems biology / ISB2020 RNA sequencing (RNA-seq)

Epigenetics. Jenny van Dongen Vrije Universiteit (VU) Amsterdam Boulder, Friday march 10, 2017

DNA-seq Bioinformatics Analysis: Copy Number Variation

Supplemental Materials

A fully Bayesian approach for the analysis of Whole-Genome Bisulfite Sequencing Data

RNA- seq Introduc1on. Promises and pi7alls

Inferring Biological Meaning from Cap Analysis Gene Expression Data

BST227: Introduction to Statistical Genetics

Assignment 5: Integrative epigenomics analysis

STAT1 regulates microrna transcription in interferon γ stimulated HeLa cells

High-resolution characterization of sequence signatures due to non-random cleavage of cell-free DNA

DNA Sequence Bioinformatics Analysis with the Galaxy Platform

Chapter 1. Introduction

Supplementary Information

Allelic reprogramming of the histone modification H3K4me3 in early mammalian development

Supplementary Figure 1 IL-27 IL

Histone Modifications Are Associated with Transcript Isoform Diversity in Normal and Cancer Cells

Sequence and chromatin determinants of cell-type specific transcription factor binding

TrainMALTA Summer School on Epigenomics

AVENIO ctdna Analysis Kits The complete NGS liquid biopsy solution EMPOWER YOUR LAB

PRC2 inhibition counteracts the culture-associated loss of engraftment potential of human cord blood-derived hematopoietic stem and progenitor cells

Lecture 8 Understanding Transcription RNA-seq analysis. Foundations of Computational Systems Biology David K. Gifford

Nature Immunology: doi: /ni Supplementary Figure 1. Characteristics of SEs in T reg and T conv cells.

Stem Cell Epigenetics

SUPPLEMENTARY FIGURE 1: f-i

Histones modifications and variants

Mechanisms of alternative splicing regulation

Genome-wide relationship between histone H3 lysine 4 mono- and tri-methylation and transcription factor binding

Chromatin marks identify critical cell-types for fine-mapping complex trait variants

Modeling gene expression using five histone modifications

Multiplex target enrichment using DNA indexing for ultra-high throughput variant detection

ChipSeq. Technique and science. The genome wide dynamics of the binding of ldb1 complexes during erythroid differentiation

QIAsymphony DSP Circulating DNA Kit

The corrected Figure S1J is shown below. The text changes are as follows, with additions in bold and deletions in bracketed italics:

Use Case 9: Coordinated Changes of Epigenomic Marks Across Tissue Types. Epigenome Informatics Workshop Bioinformatics Research Laboratory

Transcription and chromatin. General Transcription Factors + Promoter-specific factors + Co-activators

Integrated analysis of sequencing data

Genome-Wide Analysis of Histone Modifications in Human Endometrial Stromal Cells

Lecture 10. Eukaryotic gene regulation: chromatin remodelling

Supplementary Figure 1. Efficiency of Mll4 deletion and its effect on T cell populations in the periphery. Nature Immunology: doi: /ni.

Transcription:

Peak-calling for ChIP-seq and ATAC-seq Shamith Samarajiwa CRUK Autumn School in Bioinformatics 2017 University of Cambridge

Overview Peak-calling: identify enriched (signal) regions in ChIP-seq or ATAC-seq data Software packages Practical and Statistical aspects (Normalization, IDR, QC measures) MACS peak calling Overview of transcription factor, DNA binding protein, histone mark and nucleosome free region peaks Narrow and Broad peaks A brief look at the MACS2 settings and methodology ATAC-seq signal detection

Signal to Noise modified from Carl Herrmann

Strand dependent bimodality Wilbanks et al. 2010 PLOS One

Peak Calling Software Comprehensive list is at: https://omictools.com/peak-calling-category MACS2 (MACS1.4) Most widely used peak caller. Can detect narrow and broad peaks. Epic (SICER) Specialised for broad peaks BayesPeak R/Bioconductor Jmosaics Detects enriched regions jointly from replicates T-PIC Shape based EDD Detects megabase domain enrichment GEM Peak calling and motif discovery for ChIP-seq and ChIP-exo SPP Fragment length computation and saturation analysis to determine if read depth is adequate.

Quality Measures Fraction of reads in peaks (FRiP) is dependant on data type. FRiP can be calculated with deeptools2 PCR Bottleneck Coefficient (PBC) is a measure of library complexity N1= Non redundant, uniquely mapping reads N2= Uniquely mapping reads Preseq and preseqr for determining library complexity Daley et al., 2013, Nat. Methods

Quality Measures Relative strand cross-correlation The RSC is the ratio of the fragment-length cross-correlation value minus the background cross-correlation value, divided by the phantom-peak cross-correlation value minus the background cross-correlation value. The minimum possible value is 0 (no signal), highly enriched experiments have values greater than 1, and values much less than 1 may indicate low quality.

Quality Measures Normalised strand cross-correlation NSC The ratio of the maximal cross-correlation value (which occurs at strand shift equal to fragment length) divided by the background cross-correlation (minimum cross-correlation value over all possible strand shifts). Higher values indicate more enrichment, values > 1.1 are relatively low NSC scores, and the minimum possible value is 1 (no enrichment). This score is sensitive to technical effects; for example, high-quality antibodies such as H3K4me3 and CTCF score well for all cell types. This score is also sensitive to biological effects; narrow marks score higher than broad marks (H3K4me3 vs H3K36me3, H3K27me3) for all cell types and ENCODE production groups, and features present in some individual cells, but not others, in a population are expected to have lower scores. A measure of enrichment derived without dependence on prior determination of enriched regions.

IDR: Irreproducible Discovery Rate If two replicates measure the same underlying biology, the most significant peaks which are likely to be genuine signals, are expected to have high consistency between replicates. Peaks with low significance, which are more likely to be noise, are expected to have low consistency. IDR measures consistency between replicates in high-throughput experiments. The IDR method compare a pair of ranked lists of identifications (such as ChIP-seq peaks). These ranked lists should not be pre-thresholded, i.e they should provide identifications across the entire spectrum of high confidence/enrichment (signal) and low confidence/enrichment (noise). The method uses reproducibility in score rankings between peaks in each replicate to determine an optimal cutoff for significance. The IDR method then fits the bivariate rank distributions over the replicates in order to separate signal from noise based on a defined confidence of rank consistency and reproducibility of identifications. software: https://github.com/nboley/idr

ENCODE

Encode Quality Metrics

MACS2 Most widely used peak caller. identifies genome-wide locations of TF binding, histone modification or NFRs from ChIP-seq or ATAC-seq data. Can be used without a control (Input -samples of sonicated chromatin OR IgG - nonspecific antibody) - Not Recommended for ChIP-seq! Controls eliminate bias due to GC content, mappability, DNA repeats or CNVs. Can call narrow and broad peaks. Many settings for optimizing results.

MACS2

MACS1.4

Peak calling with MACS2 Step 1: Estimate fragment length d and adjust read position Slide a window of length 2 x bw bandwidth (half of estimated sonication size) across genome. Retain windows with > MFOLD ( fold-enrichment of treatment / background) Compute the average +/- strand specific read-densities for these bins.

MACS2 Step 2: Identify local noise slide a window of size 2*d across treatment and input estimate local parameter of Poisson distribution modified from Carl Herrmann

MACS2 Step 3: Identify enriched (peak) regions determine regions with p-value < PVALUE determine summit position within enriched regions as max density modified from Carl Herrmann

MACS2 Step 4: Estimate FDR positive peaks (P-values) swap treatment and input, call negative peaks (P-value) modified from Carl Herrmann

Reads to Peaks +ive and -ive strand reads do not represent true binding sites Fragment length d can be detected experimentally or estimated from strand asymmetry in data Reads from both strands can be extended to the length of d OR Reads can be shifted towards 3 by d/2

Narrow, Broad and Mixed Peaks Different data types have different peak shapes. Use appropriate peak callers or domain detectors. Same TF may have different peak shapes reflecting differences in biological conditions. Replicates should have similar binding patterns. Most TF peaks are narrow, with particularly sharp peaks from ChIP-exo data. ChIP-seq peaks from epigenomic data can be narrow, broad or gapped. Histone marks such as H3K9me3 or H3K27me3 are broad while others such as H3K4me3 and proteins such as CTCF are narrow. Other DNA binding proteins such as HP1, Lamins (Lamin A or B), HMGA etc. form broad peaks or domains. PolII peaks can be narrow or broad depending on whether its detecting transcription initiation at the TSS or propagation along the gene body. ATAC-seq data representing nucleosome free regions (NFRs) can be narrow or broad depending on the properties of regulatory regions underlying them.

Peak Shapes Sims et al., 2014 Nat Rev Genet.

Broad peak and Domain callers MACS2: macs2 callpeak --broad Epic: Useful for finding medium or diffusely enriched domains in chip-seq data. Epic is an improvement over the original SICER, by being faster, more memory efficient, multi core, and significantly easier to install and use. Others: Enriched Domain Detector (EDD), RSEG, BroadPeak, PeakRanger (CCAT)

ATAC-seq settings If using paired end reads use --format BAMPE to let MACS2 pileup the whole fragments in general. If you want to focus on looking for where the 'cutting sites' are, then --nomodel --shift -100 --extsize 200 should work. Since the DNA wrapped on a nucleosome is about 147bp, for single nucleosome detection use --nomodel --shift -37 --extsize 73. BASED ON EPIGENETICS CHROMATIN, 7:33, 2014. Scientist, Volume 30 Issue 1 January 2016

References Landt et al., ChIP-seq guidelines and practices of the ENCODE and modencode consortia. Genome Res. 2012, 22:1813-1831. PMID: 22955991. Wilbanks et al., Evaluation of algorithm performance in ChIP-seq peak detection. PLoS One. 2010, Jul 8;5(7):e11471. PMID: 20628599 Nakato et al., Recent advances in ChIP-seq analysis: from quality management to whole-genome annotation. Brief Bioinform. 2017 Mar 1;18(2):279-290. PMID: 26979602 Buenrostro et al., ATAC-seq: A Method for Assaying Chromatin Accessibility Genome-Wide. Curr Protoc Mol Biol. 2015 Jan 5;109:21.29.1-9. PMID: 25559105 Sims