Accessing and Using ENCODE Data Dr. Peggy J. Farnham


William M Keck Professor of Biochemistry, Keck School of Medicine, University of Southern California

How many human genes are encoded in our 3 x 10^9 bp?
  C. elegans (worm): 959 cells, 1 x 10^8 bp, 20,000 genes
  2001: first guess for human genes, 100,000-150,000
  2001 genome draft: up to 40,000 genes
  2012: ~20,000 genes

The ENCODE project: Encyclopedia of DNA Elements
  Goal: delineate all functional elements in the human genome
  http://www.genome.gov/10005107

Accessing and using ENCODE data
  What are ENCODE assays
  What has been learned so far
  How to access ENCODE data
  How to use ENCODE data

Advantages of studying the genome in a large consortium
  1) Publicly available datasets
  2) Integrative analysis of many assays from different groups using the same cells
  3) Development of standards and guidelines

http://www.genome.ucsc.edu/encode/

ENCODE assays
  Open chromatin
  Gene expression
  3D looping
  TF binding

ENCODE assay: RNA-seq
  Fragment RNA, convert to cDNA, add sequencing adapters
  Unbiased analysis of known and novel RNAs
  Both coding and non-coding RNAs can be studied
  Differential exon usage can be identified
  Strand-specific analyses can distinguish transcripts

ENCODE assay: ChIP-seq
  1400 site-specific DNA binding proteins; very few have been well-characterized (Nature Reviews Genetics 10, 252-263, 2009)
  Fewer false positives than bioinformatic methods
  Fewer false negatives than motif searching
  De novo motif discovery
  Can distinguish members of transcription factor families

Evolution of the ChIP assay
  Crosslink proteins to DNA
  Sonicate chromatin
  Immunoprecipitate
  Reverse crosslinks
  Purify DNA
  Readout: ChIP-PCR, ChIP-chip, or ChIP-seq
  1998: PCR
  2004: ENCODE ChIP-chip
  2007: ENCODE ChIP-seq

ENCODE assays: regulatory chromatin
  ChIP-seq
  Open chromatin: DNase-seq, FAIRE-seq

ChIP-seq of modified histones
  Antibodies to active chromatin
    Promoters: H3K4me3, H3K9Ac
    Enhancers: H3K4me1, H3K27Ac
    Elongation: H3K36me3, H3K79me2
  Antibodies to silent chromatin
    H3K27me3, H3K9me3

Comparison of DNase-seq, FAIRE-seq, modified histones, and TF binding
  [Figure tracks: DNase-seq, FAIRE-seq, TF ChIP-seq, H3K4me3, H3K9Ac, H3K27Ac]

ENCODE assays: 3D looping
  3C: examines a small number of regions; TF involved is not known
  4C: can be used to find everything that interacts with a specific region
  5C: higher throughput, sequencing-based version of 3C
  ChIA-PET: identifies loops mediated by a specific TF; genome-wide
  HiC: can map all possible interactions; TF involved is not known

Accessing and using ENCODE data
  What is ENCODE
  What has been learned so far
  How to access ENCODE data
  How to use ENCODE data

An integrated encyclopedia of DNA elements in the human genome (Nature 489: 57-74, 2012)
  Jan 2011 data freeze
  182 cell lines/tissues
  3,010 experiments
  164 assays (114 different ChIP)
  5 terabases; 1716x of the human genome
  Cell lines: GM12878, K562, H1-hESC, HeLa-S3, HepG2, HUVEC

Insights from ENCODE
  Information density
  Genes and transcripts
  Chromatin
  Transcription factors
  Relationship between regulatory elements and genetic variation

Density of information encoded in the human genome
  A functional element is a discrete genomic region that encodes a defined product or displays a reproducible biochemical signature
  80% is in use; 62% is covered by RNA molecules >200 nts in length
  95% lies within 8 kb of a DNA-protein interaction
  RNA transcripts > histone marks > DHS > TF ChIP-seq peaks > TF-bound motifs > exons

Protein coding transcripts
  There are 20,687 coding genes
  Average of 6.3 transcripts/gene
  Average of 3.9 different proteins/gene
  Longest transcript: 100,272 nt
  Most spliced isoforms: 65

Comparison of coding and non-coding RNAs
  Protein-coding RNAs
    20,687 different coding genes
    Longest coding RNA: 100,272 nt
    Most spliced isoforms: 65
  Non-coding RNAs
    8801 small RNAs
    9640 long RNAs
    Longest non-coding RNA: 29,517 nt
    Most spliced isoforms: 40

Open chromatin
  2.89 million unique DHS
  ~200,000 DHS/cell line
  ~1% of the genome in a cell type
  ~4% of the genome in aggregate
  ~98% of ChIP-seq TF peaks are in a DHS

Chromatin modifications that influence gene expression
  [Figure: relative contribution to R^2 (regression), axis 0.00 to 0.25, for H3K79me2; H3K9Ac, H3K4me3; H3K27me3, H3K9me3]

Classification of transcription factors (Nat Rev Genet 10: 605-16, 2009)
  [Figure labels: enhancer element, proximal promoter, core promoter; common sites, proximal sites, distal sites; cell type-specific]

Binding preferences of TFs (percentage within 1 kb of a TSS)
  GATA2: mostly distal sites
  AP2 gamma: mix of proximal and distal sites
  E2F4: very TSS-centric
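The "percentage within 1 kb of a TSS" measure used to compare TF binding preferences can be illustrated with a minimal sketch. The function name, the input layout (dictionaries keyed by chromosome) and the toy coordinates below are all assumptions made for illustration; they are not taken from the slides or from any ENCODE tool.

```python
from bisect import bisect_left


def fraction_near_tss(peak_centers, tss_positions, window=1000):
    """Fraction of peak centers lying within `window` bp of any TSS.

    Both inputs map a chromosome name to a list of integer positions
    (a hypothetical layout chosen for this sketch).
    """
    total = 0
    near = 0
    for chrom, centers in peak_centers.items():
        tss = sorted(tss_positions.get(chrom, []))
        for center in centers:
            total += 1
            i = bisect_left(tss, center)
            # Only the closest TSS on each side of the peak can qualify.
            neighbours = tss[max(i - 1, 0):i + 1]
            if any(abs(center - t) <= window for t in neighbours):
                near += 1
    return near / total if total else 0.0


# Toy data, invented for illustration only.
toy_peaks = {"chr1": [1500, 50000, 120400], "chr2": [700]}
toy_tss = {"chr1": [1000, 120000], "chr2": [5000]}
toy_fraction = fraction_near_tss(toy_peaks, toy_tss)
```

A TSS-centric factor such as E2F4 would be expected to give a value close to 1 with real peak and TSS lists, and a mostly distal factor such as GATA2 a much lower one.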

Chromosomal loops
  5C analysis of 1% of the genome in 4 cell types
  Average number of distal elements interacting with a TSS = 3.9: a gene can be regulated by multiple enhancers
  Average number of TSSs interacting with a distal element = 2.5: an enhancer can regulate multiple genes

ENCODE elements are linked to genetic variation (Nature 489: 57-74, 2012)
  4492 GWAS SNPs: 34% overlap DNase HS
  http://www.genome.gov/gwastudies/
  http://www.nature.com/encode/
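A statistic such as "34% of GWAS SNPs overlap DNase HS" comes down to counting point positions that fall inside a set of intervals. The following is a minimal sketch of that count, assuming 0-based, half-open, BED-style coordinates; the function names and toy coordinates are hypothetical and do not reproduce the ENCODE analysis itself.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) pairs on one chromosome."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def percent_overlapping(snps, sites):
    """Percentage of SNPs, given as (chrom, pos), inside any site (chrom, start, end)."""
    by_chrom = {}
    for chrom, start, end in sites:
        by_chrom.setdefault(chrom, []).append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        merged = merge_intervals(intervals)
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    hits = 0
    for chrom, pos in snps:
        starts, ends = index.get(chrom, ([], []))
        # Last merged interval starting at or before the SNP position.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < ends[i]:
            hits += 1
    return 100.0 * hits / len(snps) if snps else 0.0


# Toy data, invented for illustration only.
toy_sites = [("chr8", 100, 200), ("chr8", 150, 300), ("chr1", 10, 20)]
toy_snps = [("chr8", 120), ("chr8", 299), ("chr8", 300), ("chr1", 5), ("chr2", 50)]
toy_percent = percent_overlapping(toy_snps, toy_sites)
```

Merging the intervals first means each SNP needs only one binary search, which matters when the site list is as large as the aggregate DHS set.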

Accessing and using ENCODE data
  What is ENCODE
  What has been learned so far
  How to access ENCODE data
  How to use ENCODE data

Visualizing ENCODE data
  http://www.genome.ucsc.edu/encode/
  http://www.genome.ucsc.edu/

Snapshot of the ENCODE tracks on the UCSC browser

ENCODE tracks

How to access tables listing ENCODE data
  http://www.genome.ucsc.edu/encode/

Experiment summary

Downloading ENCODE data
  http://www.genome.ucsc.edu/encode/

ENCODE user's guide
  PLOS Biol 9:e1001046, 2011

Using ENCODE data
  What is ENCODE
  What has been learned so far
  How to access ENCODE data
  How to use ENCODE data

Scenario #1: understanding ChIP-seq results
  How can ENCODE data provide information about the function of my favorite transcription factor?

ENCODE standards for ChIP-seq
  You've just performed a ChIP-seq experiment for your favorite TF
  Is the ChIP-seq data of high quality?
  Choosing a peak cut-off

Reproducibility and peak cut-off
  http://www.genome.ucsc.edu/encode/softwaretools.html
  Epigenetics Chromatin 6:13, 2013
  *12,543 peaks called on the merged dataset*

Comparing a ChIP-seq peak file to other ENCODE datasets
  1) Create a ChIP-seq peak file with genomic coordinates
     ENCODE default: SPP and PeakSeq
     http://www.genome.ucsc.edu/encode/encodetools.html
  2) Download BAM files from ENCODE
     ChIP-seq datasets for other TFs
     ChIP-seq datasets for modified histones
     http://www.genome.ucsc.edu/encode/downloads.html
  3) Create tag directories of each file using the makeTagDirectory script in HOMER
     http://biowhat.ucsd.edu/homer/ngs/index.html
  4) Map the density of sequenced tags from each dataset relative to the center of your called peaks file using annotatePeaks.pl in HOMER

Tag density plots of ChIP-seq data
  [Figure panels: Pol2, H3K9ac, H3K27me3]
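Step 4 above, mapping tag density relative to peak centers, can be sketched in plain Python to show what such a tag density plot summarizes. This is not HOMER's implementation; the function name, the single-chromosome simplification and the toy positions are assumptions made for illustration.

```python
from bisect import bisect_left


def tag_density_profile(peak_centers, tag_positions, half_width=1000, bin_size=100):
    """Average tag count per peak in fixed-width bins around peak centers.

    Simplification: all positions are assumed to lie on one chromosome.
    Bin 0 covers [center - half_width, center - half_width + bin_size).
    """
    n_bins = (2 * half_width + bin_size - 1) // bin_size
    counts = [0] * n_bins
    tags = sorted(tag_positions)
    for center in peak_centers:
        left = center - half_width
        lo = bisect_left(tags, left)
        hi = bisect_left(tags, center + half_width)
        # Only tags inside the window [left, center + half_width) are visited.
        for tag in tags[lo:hi]:
            counts[(tag - left) // bin_size] += 1
    n_peaks = len(peak_centers)
    return [c / n_peaks for c in counts] if n_peaks else [0.0] * n_bins


# Toy data, invented for illustration only.
toy_centers = [1000, 5000]
toy_tags = [950, 990, 1010, 1200, 4995, 5005, 9000]
toy_profile = tag_density_profile(toy_centers, toy_tags, half_width=200, bin_size=100)
```

Running the same function on tags from several datasets (for example a polymerase and two histone marks) against one peak list gives the per-dataset curves that a tag density plot overlays.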

Analysis of target genes
  Classification of a TF
  Gene ontology analysis
  Expression levels of nearest genes
  http://bejerano.stanford.edu/great/public/html/
  http://david.abcc.ncifcrf.gov/
  http://biowhat.ucsd.edu/homer/ngs/annotation.html
  http://genome.ucsc.edu/encode/datamatrix/encodedatasummaryhuman.html
  http://www.roadmapepigenomics.org/data
  http://www.ncbi.nlm.nih.gov/geo/

Motif analysis: Factorbook
  http://www.factorbook.org/
  Sequence features and chromatin states around ENCODE TF peak sets

An example from FactorBook

ZBTB33 sequence logos
  [Figure panels: Motif1; M1, M2, M3, M4, M5; Flanking-L, Flanking-R]
  http://www.factorbook.org/mediawiki/index.php/zbtb33

Scenario #2: understanding GWAS results
  You've identified a genomic region of interest using GWAS
  How can I get an overview of the biochemical activity in this region?
  How can I find the important regulatory elements?

Overview of the genomic landscape around a risk SNP: rs6983267
  http://genome.ucsc.edu/cgi-bin/hgtrackui?hgsid=345204705&c=chr8&g=wgEncodeReg

Tag SNPs & functional SNPs
  Tag SNP: SNPs statistically associated with the phenotype
  fSNP: SNP associated with the phenotype AND in a regulatory region
  [Figure labels: GWAS tag SNP, high LD region, functional element]
  HaploReg: http://www.broadinstitute.org/mammals/haploreg/
  RegulomeDB: http://regulome.stanford.edu/
  FunciSNP: http://bioconductor.org/packages/2.12/bioc/html/funcisnp.html

Example of FunciSNP workflow: defining cancer risk enhancers
  1) Cancer tag SNPs
  2) Genomic window (+/-200kb) around tag SNPs
  3) Extract all known SNPs from 1000 Genomes database
  4) Merge all SNPs with genomic features: find overlaps
  5) Select correlated SNPs: measure of LD (r^2 > 0.5)
  6) Functional SNPs
  Genomic features: ChIP-seq data, peak calling, then H3K27Ac, H3K4me1 and H3K4me3 peak files
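The filtering logic of the workflow above (window around each tag SNP, overlap with peak files, LD cut-off) can be sketched in a few lines of Python. This is not the FunciSNP package or its R interface; the function name, the data layout, the SNP identifiers and the precomputed r^2 table are all hypothetical, and real LD values would have to come from 1000 Genomes genotypes.

```python
def find_functional_snps(tag_snps, known_snps, peaks, ld_r2,
                         window=200_000, r2_cutoff=0.5):
    """Return (tag_id, snp_id, feature, r2) for candidate functional SNPs.

    tag_snps, known_snps: dict of SNP id -> (chrom, pos)
    peaks: list of (chrom, start, end, feature), half-open intervals
    ld_r2: dict of (tag_id, snp_id) -> r^2, assumed to be precomputed
    """
    results = []
    for tag_id, (tag_chrom, tag_pos) in tag_snps.items():
        for snp_id, (chrom, pos) in known_snps.items():
            # Genomic window of +/- `window` bp around the tag SNP.
            if chrom != tag_chrom or abs(pos - tag_pos) > window:
                continue
            # Overlap with genomic features (ChIP-seq peak files).
            features = [feature for p_chrom, start, end, feature in peaks
                        if p_chrom == chrom and start <= pos < end]
            if not features:
                continue
            # Keep only SNPs correlated with the tag SNP (LD filter).
            r2 = ld_r2.get((tag_id, snp_id), 0.0)
            if r2 > r2_cutoff:
                results.extend((tag_id, snp_id, f, r2) for f in features)
    return results


# Toy data with made-up identifiers, invented for illustration only.
toy_tags = {"rsTAG": ("chr8", 1_000_000)}
toy_known = {
    "rsA": ("chr8", 1_050_000),   # in window, in a peak, high LD
    "rsB": ("chr8", 1_150_000),   # in window, in a peak, low LD
    "rsC": ("chr8", 1_500_000),   # outside the window
    "rsD": ("chr8", 990_000),     # in window, not in any peak
}
toy_peaks = [("chr8", 1_049_000, 1_051_000, "H3K27Ac"),
             ("chr8", 1_149_000, 1_151_000, "H3K4me1")]
toy_ld = {("rsTAG", "rsA"): 0.8, ("rsTAG", "rsB"): 0.3,
          ("rsTAG", "rsC"): 0.9, ("rsTAG", "rsD"): 0.95}
toy_hits = find_functional_snps(toy_tags, toy_known, toy_peaks, toy_ld)
```

The nested loops keep the sketch readable; with thousands of tag SNPs and millions of known SNPs an interval index per chromosome would be the more practical choice.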

Identification of functional SNPs
  [Figure labels: tag SNP, enhancer SNPs, H3K27Ac ChIP-seq]

Summary: accessing and using ENCODE data
  What is ENCODE: www.genome.gov/10005107
  What has been learned so far: www.nature.com/encode/
  How to access ENCODE data: www.genome.ucsc.edu/encode/
  How to use ENCODE data: www.genome.ucsc.edu/encode/usageresources.html