Elucidating Transcriptional Regulation at Multiple Scales Using High-Throughput Sequencing, Data Integration, and Computational Methods Raymond Auerbach PhD Candidate, Yale University Gerstein and Snyder Labs August 30, 2012 1
Outline
Background: Transcriptional Regulation, ENCODE, ChIP-Seq
Selected Projects from my PhD work
  Understanding the technical aspects of ChIP-Seq scoring and how choice of reference sample matters
  Using high-throughput sequencing to gain a genome-wide view of chromatin remodeling (SWI/SNF complex)
  Understanding the effects of long-range interactions and genome folding on transcription
  CAPE: a tool to classify features by RNAPII binding and gene expression
2
Transcriptional Regulation: A Cartoon View [Figure labels: TF combinations; DNA folding; site-specific binding; histone modifications; chromatin remodeling] Holstege and Young, PNAS, 1999. Credit: Adam Steinberg 3
ENCODE Data Description The ENCODE Project Consortium, PLoS Biology, 2011 4
ChIP-Seq 5
Early ChIP-Seq Questions How should peaks be identified? Which peaks are significant? The ChIP peaks seem obvious, so why not score against randomized background? Are controls or references needed? What biases are present? Does the peak calling need to be tuned for different factors and/or organisms? 6
Highlights from Key Papers Nature Biotechnology, 2009 PNAS, 2009 7
First surprise: Input DNA has structure
The input DNA profile itself shows peaks and is not flat.
The Pol2 antibody is exceptional. For ChIP with a typical antibody, the input DNA peaks could affect the ability to call significant peaks.
8
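To illustrate why a structured reference matters for significance calls, here is a minimal sketch that scores one candidate window against a control sample with an exact binomial test. It is not the scoring method used in the work presented here; the function names, the pseudocount, and the choice of a binomial null are all assumptions made for illustration.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def score_window(chip_reads, control_reads, chip_total, control_total):
    """Return (fold_enrichment, p_value) for one window (hypothetical helper).

    Null hypothesis: a read falling in this window comes from the ChIP sample
    with probability equal to the ChIP share of all sequenced reads.
    """
    p_null = chip_total / (chip_total + control_total)
    n = chip_reads + control_reads
    p_value = binomial_upper_tail(chip_reads, n, p_null)
    # Depth-normalized enrichment; the +1 pseudocount (an assumption) avoids
    # division by zero when the control has no reads in the window.
    fold = (chip_reads / chip_total) / ((control_reads + 1) / control_total)
    return fold, p_value


# A window where the control is nearly flat versus one where the control
# itself has a peak: the same kind of ChIP signal is no longer significant.
strong = score_window(40, 10, 1_000_000, 1_000_000)
weak = score_window(12, 10, 1_000_000, 1_000_000)
```

Scoring against a randomized or uniform background would treat both windows alike; comparing against a measured reference is what separates them.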
Origin of Input DNA from Nuclear Lysate 1. Reverse cross-links 2. Phenol-chloroform extract 3. Purify DNA 4. Size select DNA 5. Ligate Illumina adapters What happens if we change some of these variables? 9
Hypothesis and Strategy
Initial hypothesis: Input DNA peaks will be highest in regions of open chromatin
Input DNA peaks are also seen in other genomes
Strategy: Using ChIP-Seq experiments in HeLa S3 and yeast, score all tracks and aggregate signal over interesting features
10
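The "aggregate signal over interesting features" step can be sketched roughly as below. This is an illustrative sketch, not the analysis code behind the slides: it assumes a per-base signal list for a single chromosome and a list of anchor positions (for example, TSSs) with strand, and the function name is hypothetical.

```python
def aggregate_signal(signal, anchors, flank):
    """Average per-base signal in a +/- flank window around each anchor.

    signal  : list of floats, one value per base of one chromosome
    anchors : list of (position, strand) tuples, strand is '+' or '-'
    flank   : number of bases to include on each side of the anchor
    """
    width = 2 * flank + 1
    totals = [0.0] * width
    used = 0
    for pos, strand in anchors:
        start, end = pos - flank, pos + flank + 1
        if start < 0 or end > len(signal):
            continue  # skip anchors whose window runs off the chromosome
        window = signal[start:end]
        if strand == "-":
            window = window[::-1]  # orient every window 5' to 3'
        for i, value in enumerate(window):
            totals[i] += value
        used += 1
    if used == 0:
        raise ValueError("no anchor has a complete window")
    return [t / used for t in totals]


toy_signal = [0, 0, 1, 5, 1, 0, 0, 0, 2, 6, 1, 0]
profile = aggregate_signal(toy_signal, [(3, "+"), (9, "-")], flank=1)
```

Running the same aggregation on each reference track (input DNA, IgG, naked DNA, and so on) over the same anchors gives profiles that can be compared directly.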
Reference Types We Examined
Input DNA: ChIP DNA that is not IP'ed with an antibody
MNase-digested DNA: Use MNase to cleave DNA instead of sonication
IgG (non-specific antibody): ChIP DNA IP'ed with a non-specific antibody
Naked DNA: Sonicated DNA. Not crosslinked or IP'ed. Proteins removed.
11
Both Size Selection and Crosslinking are Necessary Auerbach and Euskirchen et al., PNAS, 2009 12
Aggregation Plot: Expressed Genes (TSS) [Plot tracks: Pol II; Input DNA 100-350 bp; Input DNA 350-500 bp; Naked DNA; IgG; MNase; Mappability] Input DNA enriched 4x over background! 13
Regions Associated with Active Transcription Input DNA 100-350 bp Input DNA 350-500 bp Auerbach and Euskirchen, et al. PNAS, 2009. 14
Regions Associated with Transcriptional Inactivity Input DNA 100-350 bp Input DNA 350-500 bp Auerbach and Euskirchen, et al. PNAS, 2009. 15
What are the peaks? 16
Summary and Bioinformatics Contributions First comprehensive analysis of ChIP-Seq reference DNAs on peak scoring Led to the choice of a preferred reference by our lab for ENCODE Consortium work (IgG) Integration of various data sets with reference DNA types to gain a greater understanding of scoring biases Useful for detecting accessible chromatin regions, particularly as a first pass 17
Generalized Peak Caller 18
Considerations with Early Peak Callers Usually designed around ChIP with an ideal antibody Also usually targeted toward one organism Default parameters typically arise from choices of the experimental collaborator How do peak callers work with more typical antibodies? How about with members of a protein complex? 19
ChIP-Seq of a Large Chromatin Remodeling Complex (SWI/SNF) Paper: Euskirchen and Auerbach, et al., PLoS Genetics, 2011 20
Chromatin Remodeling: Why You Should Care Can change whether a region is accessible to TFs and other proteins Quick way to regulate regions that are actively transcribed Zofall et al., Nature Structural & Molecular Biology, 2006 21
Chromatin Remodelers and Epigenetics de la Serna et al., Nature Reviews Genetics, 2006 22
Role in Cancer
SWI/SNF subunit | Cancer | Mutation Type | Reference
Ini1 | malignant rhabdoid tumors | truncating mutations | (1998) Nature 394: 203; (2006) Mod. Pathol. 19: 717
BAF250A/ARID1A | ovarian clear cell carcinomas | somatically acquired, inactivating mutations | (2010) Science 330: 228; (2010) N. Engl. J. Med. 363: 1532
BAF250A/ARID1A | transitional cell carcinoma of the bladder | somatic, non-silent mutations | (2011) Nat. Genet. 43: 875
BAF200 | hepatitis C virus-associated hepatocellular carcinomas | somatic, inactivating mutations | (2011) Nat. Genet. 43: 828
BAF180 | clear cell renal carcinomas | somatic, inactivating mutations | (2011) Nature 469: 539
Brg1 & Brm | non-small cell lung carcinomas | unknown; based on negative staining of tissue | (2003) Cancer Res. 63: 560
Brg1 | lung cancer cell lines, esp. non-small cell lung cancers | inactivating mutations | (2008) Hum. Mutat. 29: 617
BAF250A/ARID1, Brg1 & BAF180 | pancreatic cancers | various (nonsense, missense, indel, frameshift, rearrangement, splice site) | (2012) PNAS 109: E252
Brd7 | breast cancer | multi-gene deletion | (2010) Nature Cell Biol. 12, 380-389
23
SWI/SNF Has 288 Subunit Combinations! [Figure label: ARID (1a or 1b or 2)] 24
Project Overview Analysis Questions Where does SWI/SNF bind and in what configurations? What other elements are associated with SWI/SNF binding sites? Functional implications (pathway analysis, etc.) Experimental Procedure ChIP-Seq against Brg1, BAF155, BAF170, and Ini1 in HeLa S3 cells Mass spectrometry to inventory co-immunoprecipitating proteins 25
Features We Integrated
Feature | Platform | Source
Ini1 | Sequencing | Euskirchen and Auerbach et al., 2011
Brg1 | Sequencing | Euskirchen and Auerbach et al., 2011
BAF155 | Sequencing | Euskirchen and Auerbach et al., 2011
BAF170 | Sequencing | Euskirchen and Auerbach et al., 2011
RNA Polymerase II | Sequencing | Rozowsky et al., 2009
IgG Control | Sequencing | Auerbach and Euskirchen et al., 2009
Lamin A/C | Array | Euskirchen and Auerbach et al., 2011
Lamin B | Array | Euskirchen and Auerbach et al., 2011
H3K27me3 | Sequencing | Cuddapah et al., 2009
CTCF | Sequencing | Cuddapah et al., 2009
Predicted enhancers | Array | Heintzman et al., 2009
RNA Polymerase III | Sequencing | Oler et al., 2010; Barski et al., 2010
RNA-Seq | Sequencing | Morin et al., 2008
Non-canonical small RNAs | Sequencing | Affymetrix and CSHL ENCODE Transcription Project, 2009
DNA replication origins | Array | Cadoret et al., 2008
26
How to Combine Data? 27
Subunit Breakdown from ChIP-Seq
Subunit | Number in 49,555 union regions
Ini1 | 24,478 (49%)
BAF155 | 37,921 (77%)
BAF170 | 25,433 (51%)
Brg1 | 12,317 (25%)

SWI/SNF Subunit Combinations | Total Observed
SWI/SNF high-confidence union set | 49,555
Two or more subunits | 30,310
Three or more subunits | 15,535
Core set: Ini1, BAF155, and BAF170 (may include Brg1) | 9,760
Ini1, BAF155, BAF170, and Brg1 | 4,750
28
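The kind of tally shown in these tables can be sketched as follows, assuming each union region has already been annotated with the set of subunits whose peaks overlap it. The function name, the input layout, and the toy data are hypothetical; this is not the pipeline used for the paper.

```python
from collections import Counter

CORE = frozenset({"Ini1", "BAF155", "BAF170"})


def summarize_combinations(regions, core=CORE):
    """Tally subunit occupancy over union regions.

    regions : iterable of sets of subunit names, one set per union region
    core    : subunits that must all be present for a region to count as core
    """
    regions = [frozenset(r) for r in regions]
    return {
        "per_subunit": Counter(s for r in regions for s in r),
        "per_combination": Counter(regions),
        "two_or_more": sum(len(r) >= 2 for r in regions),
        "three_or_more": sum(len(r) >= 3 for r in regions),
        # Subset test, so a core region may also include Brg1.
        "core": sum(core <= r for r in regions),
    }


toy_regions = [
    {"BAF155"},
    {"BAF155", "Ini1"},
    {"Ini1", "BAF155", "BAF170"},
    {"Ini1", "BAF155", "BAF170", "Brg1"},
    {"Brg1"},
]
summary = summarize_combinations(toy_regions)
```

Using a subset test for the core set mirrors the "may include Brg1" wording: the four-subunit regions are counted inside the core set rather than alongside it.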
SWI/SNF Co-occurrences
Feature | SWI/SNF Union Set (49,555 regions) | SWI/SNF Core Set (9,760 regions)
CTCF, Pol II, enhancers, 5′ ends (any combination) | 44,755 (90%) | 8,968 (92%)
Unclassified | 4,800 (10%) | 792 (8%)
RNA Pol II sites | 19,669 (40%) | 6,562 (67%)
Putative enhancers | 21,228 (43%) | 3,431 (35%)
CTCF sites | 8,542 (17%) | 1,692 (17%)
5′ ends of Ensembl protein-coding genes (within 2.5 kb) | 14,291 (29%) | 4,089 (42%)
29
Association of Subunit Combinations with Transcription Levels Euskirchen and Auerbach, et al. PLoS Genetics, 2011. 30
Pathway Analysis Euskirchen and Auerbach, et al. PLoS Genetics, 2011. 31
Overrepresented GO Categories (Mass Spectrometry) Euskirchen and Auerbach, et al. PLoS Genetics, 2011. 32
Summary and Bioinformatics Contributions
Different peak scoring criteria for ubiquitous factors
Inferring information about a complex given ChIP-Seq from subunits
Overall, SWI/SNF binds very generally, but is enriched at 5′ ends and at genes associated with cell cycle, DNA repair, and cancer.
33
SWI/SNF and DNA Looping Euskirchen and Auerbach, et al. PLoS Genetics, 2011. CIITA locus (~150 kb) 34
Exploring Transcription, DNA Folding, and Nuclear Organization in a Multidimensional Context Paper: Li, Ruan, Auerbach, and Sandhu, et al., Cell, 2012 35
ChIA-PET: Chromatin Interaction Analysis by Paired-End ditag Sequencing
Collaboration with Stanford and Genome Institute of Singapore
In addition to the ChIA-PET method, FISH, qPCR, enhancer assays, and other methods were used for validation.
Question: How does transcriptional regulation work in 3-D space on an intrachromosomal level?
36
So How Does ChIA-PET Work? (Cliffs Notes Version) [Diagram comparing ChIP-Seq and ChIA-PET; labels: DNA 1, Linker, DNA 2] 37
The Textbook Version 38
The Textbook Version 39
First Goal - Transcription Factories Sutherland and Bickmore. Nature Reviews Genetics, 2009. 40
Second Goal - Formation of Protein Complexes PJ Farnham, Nature Reviews Genetics. 2009. 41
Models of Transcription Li, Ruan, Auerbach, and Sandhu, et al. Cell, 2012. 42
Gene Expression Characteristics Li, Ruan, Auerbach, and Sandhu, et al. Cell, 2012. 43
Binding of Different TFs Across Models Li, Ruan, Auerbach, and Sandhu, et al. Cell, 2012. 44
IRS1 and T2D: Long Range Interactions and Disease Li, Ruan, Auerbach, and Sandhu, et al. Cell, 2012. 45
ChIA-PET Conclusions
Active regions are connected to other active regions
Some factors are present at promoters while others are brought in by long-range interactions (LRI)
Most interactions follow the basal promoter model, but most genes are involved in multigene complexes
Long-range interactions play a role in disease
46
Bioinformatics Contributions
Integration of LRI data with ChIP-Seq, RNA-Seq, etc., to look at transcription as a system
Basis for future studies of how various protein complexes are formed in vivo
47
CAPE - Coupled Analysis of Polymerase and Expression 48
Combining RNAPII ChIP-Seq with RNA-Seq
A natural experiment for transcription analysis
Simple to generate, yet yields a lot of information
Many paired datasets available in public repositories
Can identify transcripts with unexpected relationships between binding and expression
Can compare to other organisms/samples/conditions
49
CAPE Summary
Publicly available tool designed to categorize features based on expression and RNAPII binding
Open-source and multiplatform (Java)
Designed to work on diverse sets of genomes out of the box, but also allows for parameter customization
Useful for comparative genomics (e.g. modENCODE)
Two modules: CAPE-analyze and CAPE-compare
50
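The general idea of categorizing features by binding and expression can be illustrated with a simple two-threshold scheme. This sketch is not CAPE's algorithm (CAPE itself is written in Java, and its method is not described on these slides); the cutoffs, category labels, and function names are assumptions chosen only to show the concept.

```python
from collections import Counter


def classify_feature(binding, expression, binding_cutoff, expression_cutoff):
    """Place one feature into one of four categories (hypothetical scheme)."""
    bound = binding >= binding_cutoff
    expressed = expression >= expression_cutoff
    if bound and expressed:
        return "bound/expressed"
    if bound:
        return "bound/not expressed"
    if expressed:
        return "not bound/expressed"
    return "not bound/not expressed"


def classify_all(features, binding_cutoff, expression_cutoff):
    """features: dict mapping feature name -> (binding, expression)."""
    return {
        name: classify_feature(b, e, binding_cutoff, expression_cutoff)
        for name, (b, e) in features.items()
    }


# Toy values in arbitrary units; the gene names are placeholders.
toy = {
    "geneA": (12.0, 30.0),
    "geneB": (15.0, 0.2),
    "geneC": (0.5, 25.0),
    "geneD": (0.1, 0.0),
}
labels = classify_all(toy, binding_cutoff=5.0, expression_cutoff=1.0)
counts = Counter(labels.values())
```

The off-diagonal categories (bound but not expressed, or expressed with little binding) are the ones with an unexpected relationship between binding and expression, and comparing label assignments across two samples is the kind of question a compare step would address.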
CAPE: Coupled Analysis of Polymerase Binding and Expression Auerbach et al. In revision. 51
Sample CAPE-analyze Output 52
Sample CAPE-compare Output (raw) 53
Sample CAPE-compare Output (HTML) 54
Sample CAPE-compare Output (Venn) 55
Overall Summary
Technical implications of scoring ChIP-Seq data (PNAS)
Considerations when analyzing data from ChIP-Seq experiments targeted to non-standard transcription factors and protein complexes (PLoS Genetics)
How DNA folding affects how we view transcription and ChIP-Seq data (Cell)
New, robust tool to quickly classify transcripts/genes based on mRNA abundance and RNAPII binding levels
56
Other Work While at Yale Co-author of 14 peer-reviewed papers while at Yale (4 as primary or starred) 12 published 2 in press One manuscript being revised for resubmission 57
Acknowledgements 58
Questions? 59