William M Keck Professor of Biochemistry
Keck School of Medicine
University of Southern California

How many human genes are encoded in our 3x10^9 bp?
  C. elegans (worm): 959 cells and 1x10^8 bp; 20,000 genes
  2001 First guess for human genes 100,000-150,000
  2001 genome draft: up to 40,000 genes
  2012: ~20,000 genes

The ENCODE project
  Encyclopedia of DNA Elements
  Goal: delineate all functional elements in the human genome
  http://www.genome.gov/10005107

Accessing and using ENCODE data
  What are ENCODE assays
  What has been learned so far
  How to access ENCODE data
  How to use ENCODE data

Advantages of studying the genome in a large consortium
  1) Publicly available datasets
  2) Integrative analysis of many assays from different groups using the same cells
  3) Development of standards and guidelines

http://www.genome.ucsc.edu/encode/

ENCODE assays
  Open chromatin
  Gene expression
  3D looping
  TF binding

ENCODE assay: RNA-seq
  Fragment RNA
  Convert to cDNA
  Add sequencing adapters
  Unbiased analysis of known and novel RNAs
  Both coding and non-coding RNAs can be studied
  Differential exon usage can be identified
  Strand-specific analyses can distinguish transcripts

ENCODE assay: ChIP-seq
  1400 site-specific DNA binding proteins
  Very few have been well-characterized
  Nature Reviews Genetics 10, 252-263, 2009
  Fewer false positives than bioinformatic methods
  Fewer false negatives than motif searching
  De novo motif discovery
  Can distinguish members of transcription factor families

Evolution of the ChIP assay
  Crosslink proteins to DNA
  Sonicate chromatin
  Immunoprecipitate
  Reverse crosslinks
  Purify DNA
  ChIP-PCR, ChIP-chip, ChIP-seq
  1998: PCR
  actcatgcatgaaacctgacgcagg ccgtatcgatgaggaqtctctcagga..
  2004: ENCODE ChIP-chip
  2007: ENCODE ChIP-seq

ENCODE assays: regulatory chromatin
  ChIP-seq
  DNase-seq
  FAIRE-seq
  Open chromatin

ChIP-seq of modified histones
  Antibodies to active chromatin
    Promoters: H3K4me3, H3K9Ac
    Enhancers: H3K4me1, H3K27Ac
    Elongation: H3K36me3, H3K79me2
  Antibodies to silent chromatin
    H3K27me3, H3K9me3

Comparison of DNase-seq, FAIRE-seq, modified histones, and TF binding
  Panels: DNase-seq, FAIRE-seq, TF ChIP-seq, H3K4me3, H3K9Ac, H3K27Ac

ENCODE assays: 3D looping
  3C: examines a small number of regions; TF involved is not known
  4C: can be used to find everything that interacts with a specific region
  5C: higher throughput, sequencing-based version of 3C
  ChIA-PET: identifies loops mediated by a specific TF; genome-wide
  HiC: can map all possible interactions; TF involved is not known

Accessing and using ENCODE data
  What is ENCODE
  What has been learned so far
  How to access ENCODE data
  How to use ENCODE data

An integrated encyclopedia of DNA elements in the human genome
  Nature 489: 57-74, 2012
  182 cell lines/tissues
  Jan 2011 data freeze
  3,010 experiments
  5 terabases
  1716x of the human genome
  GM12878, K562, H1-hESC, HeLa-S3, HepG2, HUVEC
  164 assays (114 different ChIP)

Insights from ENCODE
  Information density
  Genes and transcripts
  Chromatin
  Transcription factors
  Relationship between regulatory elements and genetic variation

Density of information encoded in the human genome
  A functional element is a discrete genomic region that encodes a defined product or displays a reproducible biochemical signature
  80% is in use; 62% is covered by RNA molecules >200 nts in length
  95% lies within 8 kb of a DNA-protein interaction
  RNA transcripts > histone marks > DHS > TF ChIP-seq peaks > TF bound motifs > exons

Protein coding transcripts
  There are 20,687 coding genes
  Average of 6.3 transcripts/gene
  Average of 3.9 different proteins/gene
  Longest transcript: 100,272 nt
  Most spliced isoforms: 65

Comparison of coding and non-coding RNAs
  Protein-coding RNAs
    20,687 different coding genes
    Longest coding RNA: 100,272 nt
    Most spliced isoforms: 65
  Non-coding RNAs
    8801 small RNAs
    9640 long RNAs
    Longest non-coding RNA: 29,517 nt
    Most spliced isoforms: 40

Open chromatin
  2.89 million unique DHS
  ~200,000 DHS/cell line
  ~1% of the genome in a cell type
  ~4% of the genome in aggregate
  ~98% of ChIP-seq TF peaks are in a DHS
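
A statistic such as "~98% of ChIP-seq TF peaks are in a DHS" comes down to an interval-overlap count. The sketch below shows one way such a fraction could be computed for your own peak file. It is an illustration only, not the consortium's pipeline: the function names, the in-memory dictionaries, and the toy coordinates are all assumptions, and intervals are treated as half-open (start, end) pairs as in BED files.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def fraction_in_dhs(peaks, dhs):
    """Fraction of peaks that overlap at least one DHS.

    peaks, dhs: dict mapping chromosome name -> list of (start, end).
    """
    total = 0
    hits = 0
    for chrom, chrom_peaks in peaks.items():
        merged = merge_intervals(dhs.get(chrom, []))
        starts = [s for s, _ in merged]
        for p_start, p_end in chrom_peaks:
            total += 1
            # Last merged DHS that starts before the peak ends.
            i = bisect_right(starts, p_end - 1) - 1
            if i >= 0 and merged[i][1] > p_start:
                hits += 1
    return hits / total if total else 0.0


# Toy data (hypothetical coordinates, for illustration only).
example_peaks = {"chr1": [(100, 200), (500, 600), (900, 1000)], "chr2": [(10, 20)]}
example_dhs = {"chr1": [(150, 300), (250, 400), (950, 1200)]}
print(fraction_in_dhs(example_peaks, example_dhs))
```

With real data the dictionaries would be filled from a peak file and a DHS file downloaded from ENCODE; merging the DHS intervals first keeps each lookup to a single binary search.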

Chromatin modifications that influence gene expression
  Figure: relative contribution to R^2 (regression), axis 0.00 to 0.25
  Marks labeled: H3K79me2; H3K9Ac, H3K4me3; H3K27me3, H3K9me3

Classification of transcription factors
  Nat Rev Genet 10: 605-16, 2009
  Figure labels: enhancer element, core promoter, proximal promoter; common sites, proximal sites, distal sites; cell type-specific

Binding preferences of TFs
  GATA2: mostly distal sites
  AP2 gamma: mix of proximal and distal sites
  E2F4: very TSS-centric
  Percentage within 1 kb of a TSS
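
The "percentage within 1 kb of a TSS" used to contrast GATA2, AP2 gamma, and E2F4 can be reproduced for any peak set with a nearest-TSS lookup. This is a minimal sketch under stated assumptions: peak centers and TSS positions are already loaded as integer coordinates per chromosome, and the function name and toy numbers are hypothetical.

```python
from bisect import bisect_left


def percent_near_tss(peak_centers, tss_positions, window=1000):
    """Percentage of peak centers lying within `window` bp of any TSS.

    peak_centers, tss_positions: dict mapping chromosome -> list of int positions.
    """
    total = 0
    near = 0
    for chrom, centers in peak_centers.items():
        tss = sorted(tss_positions.get(chrom, []))
        for center in centers:
            total += 1
            i = bisect_left(tss, center)
            # The nearest TSS is either the one at index i or the one before it.
            candidates = tss[max(i - 1, 0):i + 1]
            if candidates and min(abs(center - t) for t in candidates) <= window:
                near += 1
    return 100.0 * near / total if total else 0.0


# Toy data (hypothetical): two of the four peak centers are within 1 kb of a TSS.
example_tss = {"chr1": [1000, 50000]}
example_centers = {"chr1": [1500, 3000, 49200, 80000]}
print(percent_near_tss(example_centers, example_tss))
```

A TSS-centric factor would give a value near 100, while a factor that binds mostly distal sites would give a low value.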

Chromosomal loops
  5C analysis of 1% of the genome in 4 cell types
  Average number of distal elements interacting with a TSS = 3.9
    A gene can be regulated by multiple enhancers
  Average number of TSSs interacting with a distal element = 2.5
    An enhancer can regulate multiple genes

ENCODE elements are linked to genetic variation
  4492 GWAS SNPs: 34% overlap DNase HS
  Nature 489: 57-74, 2012
  http://www.genome.gov/gwastudies/

http://www.nature.com/encode/
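
The two averages on the chromosomal loops slide (distal elements per TSS and TSSs per distal element) are per-node means over a list of looping interactions. The sketch below shows how such means could be derived from a table of (TSS, distal element) pairs. The identifiers and the pair list are invented for illustration, and the averages are taken only over nodes that have at least one interaction, which is an assumption about how the slide's numbers were defined.

```python
from collections import defaultdict


def mean_partners(interactions):
    """Return (mean distal elements per TSS, mean TSSs per distal element).

    interactions: iterable of (tss_id, distal_id) pairs; duplicates are ignored.
    """
    distal_by_tss = defaultdict(set)
    tss_by_distal = defaultdict(set)
    for tss_id, distal_id in interactions:
        distal_by_tss[tss_id].add(distal_id)
        tss_by_distal[distal_id].add(tss_id)
    if not distal_by_tss:
        return 0.0, 0.0
    mean_distal = sum(len(v) for v in distal_by_tss.values()) / len(distal_by_tss)
    mean_tss = sum(len(v) for v in tss_by_distal.values()) / len(tss_by_distal)
    return mean_distal, mean_tss


# Toy interaction list (hypothetical gene and element names).
example_pairs = [
    ("geneA", "e1"), ("geneA", "e2"), ("geneA", "e3"),
    ("geneB", "e1"),
    ("geneC", "e4"), ("geneC", "e1"),
]
print(mean_partners(example_pairs))
```

In the toy list, geneA contacts three elements and element e1 contacts three genes, the same many-to-many pattern the slide summarizes.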

Accessing and using ENCODE data
  What is ENCODE
  What has been learned so far
  How to access ENCODE data
  How to use ENCODE data

Visualizing ENCODE data

http://www.genome.ucsc.edu/encode/

http://www.genome.ucsc.edu/

Snapshot of the ENCODE tracks on the UCSC browser

ENCODE tracks

How to access tables listing ENCODE data

http://www.genome.ucsc.edu/encode/

Experiment summary

Downloading ENCODE data

http://www.genome.ucsc.edu/encode/

ENCODE user's guide
  PLOS Biol 9:e1001046, 2011

Using ENCODE data
  What is ENCODE
  What has been learned so far
  How to access ENCODE data
  How to use ENCODE data

Scenario #1: understanding ChIP-seq results
  How can ENCODE data provide information about the function of my favorite transcription factor?

ENCODE standards for ChIP-seq
  You've just performed a ChIP-seq experiment for your favorite TF
  Is the ChIP-seq data of high quality?
  Choosing a peak cut-off
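
One simple way to approach the quality and cut-off questions above is to ask how well the strongest peaks from one replicate are recovered in the other. The sketch below computes that overlap for the top N peaks. It is a simplified illustration and not the ENCODE procedure or tool set: the function names, the tuple layout (chrom, start, end, score), and the toy peaks are assumptions, and the all-against-all comparison is only suitable for small examples.

```python
def top_n(peaks, n):
    """Return the n highest-scoring peaks; each peak is (chrom, start, end, score)."""
    return sorted(peaks, key=lambda p: p[3], reverse=True)[:n]


def overlaps(a, b):
    """True if two peaks share a chromosome and at least one base (half-open intervals)."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def replicate_overlap(rep1, rep2, n):
    """Fraction of the top-n peaks of rep1 that overlap a top-n peak of rep2."""
    top1 = top_n(rep1, n)
    top2 = top_n(rep2, n)
    if not top1:
        return 0.0
    shared = sum(1 for p in top1 if any(overlaps(p, q) for q in top2))
    return shared / len(top1)


# Toy replicates (hypothetical): the strong peaks agree, the weak ones do not.
rep1 = [("chr1", 100, 200, 50), ("chr1", 500, 600, 40),
        ("chr2", 100, 200, 30), ("chr1", 900, 1000, 5)]
rep2 = [("chr1", 120, 220, 45), ("chr1", 520, 580, 35),
        ("chr2", 300, 400, 25), ("chr1", 2000, 2100, 4)]
for n in (2, 4):
    print(n, replicate_overlap(rep1, rep2, n))
```

Scanning N from small to large and noting where the overlap starts to fall gives a rough, data-driven guide to where a peak cut-off might sit.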

Reproducibility and peak cut-off
  http://www.genome.ucsc.edu/encode/softwaretools.html
  Epigenetics Chromatin 6:13, 2013
  *12,543 peaks called on the merged dataset*

Comparing a ChIP-seq peak file to other ENCODE datasets
  1) Create a ChIP-seq peak file with genomic coordinates
     ENCODE default: SPP and PeakSeq
     http://www.genome.ucsc.edu/encode/encodetools.html
  2) Download BAM files from ENCODE
     ChIP-seq datasets for other TFs
     ChIP-seq datasets for modified histones
     http://www.genome.ucsc.edu/encode/downloads.html
  3) Create tag directories of each file using the makeTagDirectory script in HOMER:
     http://biowhat.ucsd.edu/homer/ngs/index.html
  4) Map the density of sequenced tags from each dataset relative to the center of your called peaks file using annotatePeaks.pl in HOMER

Tag density plots of ChIP-seq data
  Panels: Pol2, H3K9ac, H3K27me3
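
Step 4 above, mapping tag density relative to peak centers, is what produces the tag density plots. To make the idea concrete, the sketch below does the same kind of binning in plain Python. It is a stand-in for illustration and not HOMER's implementation: the function name, the window and bin sizes, and the toy positions are assumptions, and tags are reduced to single integer positions per chromosome.

```python
from bisect import bisect_left


def tag_density(peak_centers, tag_positions, half_window=1000, bin_size=100):
    """Average number of tags per peak in bins around the peak centers.

    peak_centers, tag_positions: dict mapping chromosome -> list of int positions.
    Returns a list of (bin midpoint offset from center, mean tags per peak).
    """
    if (2 * half_window) % bin_size != 0:
        raise ValueError("window width must be a multiple of bin_size")
    n_bins = (2 * half_window) // bin_size
    counts = [0] * n_bins
    n_peaks = 0
    for chrom, centers in peak_centers.items():
        tags = sorted(tag_positions.get(chrom, []))
        for center in centers:
            n_peaks += 1
            left = center - half_window
            lo = bisect_left(tags, left)
            hi = bisect_left(tags, center + half_window)
            # Tags in [center - half_window, center + half_window) fall into a bin.
            for pos in tags[lo:hi]:
                counts[(pos - left) // bin_size] += 1
    scale = n_peaks if n_peaks else 1
    return [(-half_window + i * bin_size + bin_size // 2, c / scale)
            for i, c in enumerate(counts)]


# Toy data (hypothetical): three tags fall near the single peak center.
example_centers = {"chr1": [1000]}
example_tags = {"chr1": [100, 950, 990, 1010, 1500]}
print(tag_density(example_centers, example_tags, half_window=200, bin_size=100))
```

Running the same function on tags from a Pol2, H3K9ac, or H3K27me3 dataset against one peak file would give one profile per dataset, which can then be plotted side by side.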

Analysis of target genes
  Classification of a TF
  Gene ontology analysis
  Expression levels of nearest genes
  http://bejerano.stanford.edu/great/public/html/
  http://david.abcc.ncifcrf.gov/
  http://biowhat.ucsd.edu/homer/ngs/annotation.html
  http://genome.ucsc.edu/encode/datamatrix/encodedatasummaryhuman.html
  http://www.roadmapepigenomics.org/data
  http://www.ncbi.nlm.nih.gov/geo/

Motif analysis: Factorbook
  http://www.factorbook.org/
  Sequence features and chromatin states around ENCODE TF peak sets

An example from FactorBook

ZBTB33 sequence logos
  Figure labels: Motif1; M1, M2, M3, M4, M5; Flanking-L, Flanking-R
  http://www.factorbook.org/mediawiki/index.php/zbtb33

Scenario #2: understanding GWAS results
  You've identified a genomic region of interest using GWAS
  How can I get an overview of the biochemical activity in this region?
  How can I find the important regulatory elements?

Overview of the genomic landscape around a risk SNP
  rs6983267
  http://genome.ucsc.edu/cgi-bin/hgtrackui?hgsid=345204705&c=chr8&g=wgEncodeReg

Tag SNPs & functional SNPs
  GWAS tag SNP: SNPs statistically associated with the phenotype
  fSNP: SNP associated with the phenotype AND in a regulatory region
  Figure labels: tag SNP, high LD region, functional element
  HaploReg http://www.broadinstitute.org/mammals/haploreg/
  RegulomeDB http://regulome.stanford.edu/
  FunciSNP http://bioconductor.org/packages/2.12/bioc/html/funcisnp.html

FunciSNP

Example of FunciSNP workflow: Defining cancer risk enhancers
  Cancer tag SNPs
  Genomic window (+/-200kb) around tag SNPs
  Extract all known SNPs from 1000 Genomes database
  Merge all SNPs with genomic feature: find overlaps
  Select correlated SNPs: measure of LD (r^2 > 0.5)
  Functional SNPs
  Genomic features: ChIP-seq data, peak calling (H3K4me1, H3K27Ac, H3K4me3)
    H3K27Ac peak file
    H3K4me1 peak file
    H3K4me3 peak file
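
The workflow above (window around the tag SNP, overlap with genomic features, LD filter at r^2 > 0.5) can be sketched in a few lines. The code below is not the FunciSNP package, which is an R/Bioconductor tool; it is a hedged Python illustration of the same filtering logic. All names, SNP identifiers, coordinates, and genotype vectors are hypothetical, and r^2 is approximated as the squared Pearson correlation of genotype dosages (0, 1, 2), which is an assumption rather than the package's exact LD calculation.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length dosage vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def functional_snps(tag, snps, features, genotypes, window=200_000, min_r2=0.5):
    """Candidate functional SNPs for one tag SNP.

    tag: (snp_id, chrom, pos); snps: list of (snp_id, chrom, pos);
    features: list of (chrom, start, end) peak intervals (half-open);
    genotypes: dict snp_id -> list of dosages, same sample order for all SNPs.
    """
    tag_id, tag_chrom, tag_pos = tag
    selected = []
    for snp_id, chrom, pos in snps:
        # Step 1: keep SNPs inside the +/- window around the tag SNP.
        if chrom != tag_chrom or abs(pos - tag_pos) > window:
            continue
        # Step 2: keep SNPs that fall inside a genomic feature (e.g. an H3K27Ac peak).
        if not any(c == chrom and s <= pos < e for c, s, e in features):
            continue
        # Step 3: keep SNPs in LD with the tag SNP.
        r2 = r_squared(genotypes[tag_id], genotypes[snp_id])
        if r2 > min_r2:
            selected.append((snp_id, round(r2, 3)))
    return selected


# Toy example (all identifiers and values are made up for illustration).
tag = ("tag1", "chr8", 1_000_000)
snps = [("s1", "chr8", 1_050_000), ("s2", "chr8", 1_060_000),
        ("s3", "chr8", 1_500_000), ("s4", "chr8", 900_000)]
features = [("chr8", 1_040_000, 1_070_000)]
genotypes = {
    "tag1": [0, 1, 2, 0, 1, 2],
    "s1": [0, 1, 2, 0, 1, 2],   # in a feature and in perfect LD with the tag
    "s2": [1, 1, 1, 0, 2, 1],   # in a feature but weakly correlated
    "s3": [0, 1, 2, 0, 1, 2],   # outside the 200 kb window
    "s4": [0, 1, 2, 0, 1, 2],   # in LD but not in any feature
}
print(functional_snps(tag, snps, features, genotypes))
```

Only s1 passes all three filters in the toy data, which mirrors the slide's point that a functional SNP must be both correlated with the tag SNP and located in a regulatory feature.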

Identification of functional SNPs
  Figure labels: tag SNP, enhancer SNPs, H3K27Ac ChIP-seq

Summary: accessing and using ENCODE data
  What is ENCODE: www.genome.gov/10005107
  What has been learned so far: www.nature.com/encode/
  How to access ENCODE data: www.genome.ucsc.edu/encode/
  How to use ENCODE data: www.genome.ucsc.edu/encode/usageresources.html