High Throughput Sequence (HTS) data analysis
Lei Zhou (leizhou@ufl.edu)

1. Representation of HTS data.
2. Visualization of HTS data.
3. Discovering genomic patterns from HTS data.
4. Integrated data analysis and hypothesis formulation.
Recording sequence information: sequence file formats

FASTA format: suitable for a single gene or genomic region (pre-genomic era).

>Gene_name or accession (other info)
ACTGGGTTTATGACGTGTCATGCATGCA
ATGTAGCTAGATGCTAGCTAGATGCTAG
CTAGATGCTA

A defined format is necessary for computers to identify and process the information.

(Figure: example of a FASTA format file from NCBI.)
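To illustrate why a defined format matters, here is a minimal sketch of how a script could read the FASTA record shown above. The function name `parse_fasta` and the example header text are made up for illustration; real files may need extra handling (for example, very large genomes or duplicate headers).

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its full sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A '>' line starts a new record; the rest of the line is the header.
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical record, modeled on the slide's example.
example = """>Gene_name (other info)
ACTGGGTTTATGACGTGTCATGCATGCA
ATGTAGCTAGATGCTAGCTAGATGCTAG
CTAGATGCTA
"""

sequences = parse_fasta(example)
for name, seq in sequences.items():
    print(name, len(seq))
```

The header line and the line breaks inside the sequence carry no biological meaning; the parser relies only on the `>` convention to find record boundaries.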
Representation of HTS data

BED (Browser Extensible Data) file

Chrom.  Start     End       Name  Score  Strand
chr2    10000192  10000217  U0    0      +
chr2    10000227  10000252  U1    0      -
chr2    10000310  10000335  U2    0      +
chr3    10000496  10000521  U1    0      -
chr2    10000556  10000581  U2    0      +

With the completion of the genome, there is no need to record the base pair identity (if it is the same as the reference genome).

Detailed description of genomic data formats: http://genome.ucsc.edu/faq/faqformat.html

Representation of HTS data: the importance of a reference genome

All coordinates are only meaningful for a given genome assembly. One assembly may have multiple releases (annotations). You need to know which reference genome was used to generate the BED file.
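The BED rows above can be read with a few lines of code. The sketch below assumes whitespace-separated six-column BED lines; the names `BedRecord` and `parse_bed` are hypothetical. In the UCSC BED convention the start is 0-based and the end is exclusive, so the tag length is simply end minus start.

```python
from collections import namedtuple

# Hypothetical container for the six BED columns shown on the slide.
BedRecord = namedtuple("BedRecord", "chrom start end name score strand")


def parse_bed(text):
    """Parse six-column BED lines; skip comments, track/browser lines, short lines."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6 or line.startswith("#"):
            continue
        if fields[0] in ("track", "browser"):
            continue
        records.append(
            BedRecord(fields[0], int(fields[1]), int(fields[2]),
                      fields[3], int(fields[4]), fields[5])
        )
    return records


bed_text = """chr2 10000192 10000217 U0 0 +
chr2 10000227 10000252 U1 0 -
chr2 10000310 10000335 U2 0 +
chr3 10000496 10000521 U1 0 -
chr2 10000556 10000581 U2 0 +
"""

records = parse_bed(bed_text)
# BED intervals are half-open, so length = end - start.
tag_lengths = [r.end - r.start for r in records]
print(tag_lengths)
```

Nothing in the file itself says which assembly the coordinates refer to, which is why the reference genome has to be tracked alongside the BED file.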
How to gain knowledge from HTS data

Visualization of HTS data.
Discovering genomic patterns.
Identifying novel mechanisms (hypothesis generation).

Visualization of HTS data

Simple visualization: distribution of tags (or normalized values). Barski et al. (2007) Cell

BedGraph file (Wig)

Chr.  ChrStart  ChrEnd  Value
chr4  0         200     0
chr4  200       400     2
chr4  400       600     13
chr4  600       800     35
chr4  800       1000    27
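One way to arrive at a table like the BedGraph example above is to count tags in fixed-size windows. The sketch below assumes 200 bp windows (matching the example rows) and one position per tag; the function name `tags_to_bedgraph` and the tag positions are invented for illustration, and no normalization is applied.

```python
from collections import Counter


def tags_to_bedgraph(chrom, tag_positions, window=200):
    """Count tags per fixed-size window and return (chrom, start, end, count) rows."""
    counts = Counter(pos // window for pos in tag_positions)
    if not counts:
        return []
    rows = []
    # Emit every window up to the last occupied one, including empty windows.
    for index in range(max(counts) + 1):
        start = index * window
        rows.append((chrom, start, start + window, counts.get(index, 0)))
    return rows


# Hypothetical tag positions on chr4 (not taken from the slide's data).
tag_positions = [250, 399, 410, 450, 599]
bedgraph_rows = tags_to_bedgraph("chr4", tag_positions)
for row in bedgraph_rows:
    print("\t".join(str(value) for value in row))
```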
Visualization of HTS data

Shifting sequence tag positions may be necessary to reflect nucleosome positions. In this example, the mapping positions were shifted +73 bp for the forward strand and -73 bp for the reverse strand to reflect the midpoint of the nucleosome. Jiang & Pugh, Nat. Rev. Genet., 2009

Advanced visualization depends on the purpose of the comparison. Example: a Circos plot depicting genomic location and chromosomal copy number (red, copy gain; blue, copy loss), together with the interchromosomal translocations (purple) and intrachromosomal rearrangements (green) observed in primary prostate cancers. Berger et al. (2011) Nature
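The strand-dependent shift described above can be sketched as follows. This assumes BED-style half-open coordinates and takes the "mapping position" to be the 5' end of the tag, that is, `start` on the forward strand and `end - 1` on the reverse strand; that interpretation and the function name `nucleosome_midpoint` are assumptions, not details given in the source.

```python
def nucleosome_midpoint(start, end, strand, shift=73):
    """Estimate the nucleosome midpoint from one tag (BED-style coordinates)."""
    if strand == "+":
        # Forward strand: the 5' end is the interval start; move downstream.
        return start + shift
    if strand == "-":
        # Reverse strand: the 5' end is the last base (end - 1); move toward
        # lower coordinates.
        return end - 1 - shift
    raise ValueError("strand must be '+' or '-'")


# Two tags from the BED example earlier in the slides.
forward_mid = nucleosome_midpoint(10000192, 10000217, "+")
reverse_mid = nucleosome_midpoint(10000227, 10000252, "-")
print(forward_mid, reverse_mid)
```

The shifted positions, rather than the raw tag positions, would then be binned for display.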
Discovering genomic patterns

Barski et al. (2007) Cell

Usually requires some programming (scripting). As a biologist, you need to clearly define your question and the logic for obtaining the data summary.

Q: Is H3K4me3 associated with the TSS? Is such an association related to gene expression status?

Logic:
1. Group genes based on expression levels obtained from a microarray study (Su et al., 2004).
2. For each gene, obtain the normalized H3K4me3 ChIP-Seq counts within [-2 kb, +2 kb] of the TSS.
3. For each expression group, plot the average value along the [-2 kb, +2 kb] interval.
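The three-step logic above could be scripted roughly as follows. This is a sketch under stated assumptions, not the method used by Barski et al.: coverage is assumed to be stored as normalized counts per fixed-size bin, keyed by `(chrom, bin_start)`; genes are assumed to already carry an expression group label; and all names and values below are hypothetical. The example uses a small flank so the arithmetic is easy to follow; the defaults correspond to the [-2 kb, +2 kb] interval.

```python
from collections import defaultdict


def average_tss_profiles(genes, coverage, flank=2000, step=200):
    """Average binned coverage around the TSS for each expression group."""
    offsets = list(range(-flank, flank + 1, step))
    sums = defaultdict(lambda: [0.0] * len(offsets))
    gene_counts = defaultdict(int)
    for gene in genes:
        # On the reverse strand, "upstream" means higher coordinates.
        direction = 1 if gene["strand"] == "+" else -1
        for i, offset in enumerate(offsets):
            position = gene["tss"] + direction * offset
            bin_start = (position // step) * step
            sums[gene["group"]][i] += coverage.get((gene["chrom"], bin_start), 0.0)
        gene_counts[gene["group"]] += 1
    profiles = {
        group: [total / gene_counts[group] for total in totals]
        for group, totals in sums.items()
    }
    return offsets, profiles


# Hypothetical genes and normalized counts, for illustration only.
genes = [
    {"chrom": "chr1", "tss": 1000, "strand": "+", "group": "high"},
    {"chrom": "chr1", "tss": 3000, "strand": "-", "group": "high"},
    {"chrom": "chr1", "tss": 5000, "strand": "+", "group": "silent"},
]
coverage = {
    ("chr1", 1000): 10.0, ("chr1", 1200): 4.0,
    ("chr1", 3000): 6.0, ("chr1", 2800): 2.0,
}

offsets, profiles = average_tss_profiles(genes, coverage, flank=400, step=200)
for group, profile in profiles.items():
    print(group, profile)
```

Each list in `profiles` is one curve to plot against `offsets`, one curve per expression group.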
Integrated data analysis and hypothesis formulation

An example: chromatin barriers. Gaszner & Felsenfeld, Nat. Rev. Genet., 2006

Chromatin barriers demarcate the boundaries between heterochromatin and euchromatin regions. They are usually bound by insulator proteins such as CTCF, Su(Hw), etc.

An example of genomic analysis and hypothesis generation

The problem: binding of insulator proteins does not always correlate with a chromatin boundary.

(Figure tracks: H3K27me3 ChIP-Seq; insulator protein ChIP-Seq.)

Question: what distinguishes insulator binding sites at the boundary from those that are in heterochromatic (H3K27me3-enriched) regions?
1. Genome-wide identification of H3K27me3 boundaries using ChIP-Seq data. Verify the predictions using RNA-Seq data for the same cell type.
2. Compare the binding levels of CTCF and co-factors. Figures reflect the average binding intensity for hundreds of sites at the boundary (solid line) or within heterochromatic regions (dotted line).
3. Is the difference significant? U (rank sum) test. (Figure panels: P > 0.01; P < 0.001.)
4. What might be the cause of this difference? Compare the underlying DNA sequences. The 400 bp regions surrounding the CTCF binding sites at the boundary were compared against those in H3K27me3-enriched regions to identify discriminative motifs (http://meme.sdsc.edu/meme/intro.html).
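For the significance question in step 3, a U (rank sum) test can be sketched with the standard library alone. The version below uses average ranks for ties and a normal approximation without continuity correction, which is a simplification that is reasonable for sample sizes in the hundreds but not for very small samples. The function name and the binding intensities are hypothetical, and in practice an established statistics package would normally be used instead.

```python
import math


def rank_sum_test(x, y):
    """Mann-Whitney U test: return (U for x, two-sided p, normal approximation)."""
    combined = sorted([(value, 0) for value in x] + [(value, 1) for value in y])
    n = len(combined)
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        # Tied values share the average of ranks i+1 .. j.
        average_rank = (i + 1 + j) / 2
        for k in range(i, j):
            ranks[k] = average_rank
        tie_size = j - i
        tie_term += tie_size ** 3 - tie_size
        i = j
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, label) in zip(ranks, combined) if label == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    # Variance of U with the usual correction for ties.
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_value


# Hypothetical binding intensities for two groups of sites.
boundary_sites = [6, 7, 8, 9, 10]
heterochromatin_sites = [1, 2, 3, 4, 5]
u_stat, p_value = rank_sum_test(heterochromatin_sites, boundary_sites)
print(u_stat, p_value)
```

Because the test works on ranks, it does not assume that binding intensities are normally distributed.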
Compared to CTCF binding sites in H3K27me3-enriched regions, CTCF binding sites at the boundary are associated with a multi-dA motif, have lower nucleosome density, and have a higher level of co-factor binding.

Hypothesis: the presence of the multi-dA motif facilitates the binding of co-factors by destabilizing nucleosome formation.

Useful resources for HTS analysis

Galaxy tools (http://galaxy.hpc.ufl.edu/; http://galaxy.psu.edu/)
UCSC Genome Browser (http://genome.ucsc.edu/)
Sequence motif identification (MEME: http://meme.sdsc.edu/meme/intro.html)