High Throughput Sequence (HTS) data analysis. Lei Zhou


Lei Zhou (leizhou@ufl.edu)

High Throughput Sequence (HTS) data analysis
1. Representation of HTS data.
2. Visualization of HTS data.
3. Discovering genomic patterns from HTS data.
4. Integrated data analysis and hypothesis formulation.

Recording sequence information: sequence file formats

FASTA format: suitable for a single gene or genomic region; it dates from the pre-genomic era.

    >Gene_name or accession, (other info)
    ACTGGGTTTATGACGTGTCATGCATGCA
    ATGTAGCTAGATGCTAGCTAGATGCTAG
    CTAGATGCTA

A defined format is necessary for computers to identify and process the information.

[Figure: example of a FASTA format file from NCBI]
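The FASTA layout above (one header line starting with ">", followed by sequence lines) can be read with a few lines of Python. This is a minimal sketch, not a replacement for an established parser; the function name parse_fasta and the choice to key records by the first word of the header are assumptions made for illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record name: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: use the first word after '>' as the record name
            # (an illustrative choice; real headers vary by database).
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            # Sequence lines are concatenated until the next header.
            records[name] += line.upper()
    return records


example = """>Gene_name other info
ACTGGGTTTATGACGTGTCATGCATGCA
ATGTAGCTAGATGCTAGCTAGATGCTAG
CTAGATGCTA
"""

records = parse_fasta(example.splitlines())
print(len(records["Gene_name"]))  # 66 bases across the three sequence lines
```

Holding every sequence in a dictionary is fine for single genes or small regions; for whole genomes a streaming reader would be the more usual choice.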

Representation of HTS data

BED (Browser Extensible Data) file

    Chrom  Start     End       Name  Score  Strand
    chr2   10000192  10000217  U0    0      +
    chr2   10000227  10000252  U1    0      -
    chr2   10000310  10000335  U2    0      +
    chr3   10000496  10000521  U1    0      -
    chr2   10000556  10000581  U2    0      +

With the completion of the genome, there is no need to record the base pair identity of a read if it is the same as the reference genome; the coordinates are enough.

Detailed description of genomic data formats: http://genome.ucsc.edu/faq/faqformat.html

Representation of HTS data

The importance of a reference genome: all coordinates are only meaningful for a given genome assembly. One assembly may have multiple releases (annotations). You need to know which reference genome was used to generate the BED file.
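To make the six BED columns concrete, the following sketch loads the example rows into typed records. The names BedRecord and parse_bed are hypothetical; the only format facts relied on are that BED start coordinates are 0-based and end coordinates are exclusive, as described in the UCSC format FAQ linked above. The sketch assumes the input has no column-title row.

```python
from collections import Counter
from typing import List, NamedTuple


class BedRecord(NamedTuple):
    chrom: str
    start: int   # 0-based position of the first base
    end: int     # exclusive end, so length = end - start
    name: str
    score: int
    strand: str


def parse_bed(text: str) -> List[BedRecord]:
    """Parse six-column BED text; rows that do not fit are skipped."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, comments, track/browser lines and short rows.
        if len(fields) < 6 or line.startswith("#") or fields[0] in ("track", "browser"):
            continue
        chrom, start, end, name, score, strand = fields[:6]
        records.append(BedRecord(chrom, int(start), int(end), name, int(score), strand))
    return records


bed_text = """chr2 10000192 10000217 U0 0 +
chr2 10000227 10000252 U1 0 -
chr2 10000310 10000335 U2 0 +
chr3 10000496 10000521 U1 0 -
chr2 10000556 10000581 U2 0 +
"""

bed_records = parse_bed(bed_text)
per_chrom = Counter(r.chrom for r in bed_records)
tag_lengths = {r.end - r.start for r in bed_records}
print(per_chrom["chr2"], per_chrom["chr3"], tag_lengths)  # 4 1 {25}
```

Note that nothing in the records says which assembly the coordinates refer to; that information has to travel with the file separately.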

How to gain knowledge from HTS data
1. Visualization of HTS data.
2. Discovering genomic patterns.
3. Identifying novel mechanisms (hypothesis generation).

Visualization of HTS data

Simple visualization: distribution of tags (or normalized values). Barski et al. (2007) Cell

BedGraph file (Wig)

    Chr   ChrStart  ChrEnd  Value
    chr4  0         200     0
    chr4  200       400     2
    chr4  400       600     13
    chr4  600       800     35
    chr4  800       1000    27
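A tag distribution like the BedGraph table above can be produced by counting tag positions in fixed-width windows. The sketch below assumes 200 bp windows (matching the table) and one representative position per tag; the function name tags_to_bedgraph and the toy positions are illustrative, not taken from the original analysis.

```python
from collections import Counter
from typing import Iterable, List


def tags_to_bedgraph(chrom: str, positions: Iterable[int], bin_size: int = 200) -> List[str]:
    """Count tag positions per fixed-width bin and format bedGraph-style lines."""
    counts = Counter(pos // bin_size for pos in positions)
    if not counts:
        return []
    lines = []
    # Emit every bin up to the last occupied one, including empty bins.
    for b in range(max(counts) + 1):
        start, end = b * bin_size, (b + 1) * bin_size
        lines.append(f"{chrom}\t{start}\t{end}\t{counts.get(b, 0)}")
    return lines


toy_positions = [250, 399, 410, 450]  # hypothetical tag positions on chr4
bedgraph_lines = tags_to_bedgraph("chr4", toy_positions)
for line in bedgraph_lines:
    print(line)
```

Raw counts depend on sequencing depth, so values are usually normalized (for example per million mapped tags) before tracks from different samples are compared.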

Visualization of HTS data

Shifting the sequence tag positions may be necessary to reflect nucleosome positions. In this example the mapping positions were shifted +73 bp for the forward strand and -73 bp for the reverse strand to reflect the midpoint of the nucleosome. Jiang & Pugh, Nat. Rev. Genet., 2009

Visualization of HTS data

Advanced visualization depends on the purpose of the comparison. Example: a Circos plot depicting genomic location and chromosomal copy number (red, copy gain; blue, copy loss), together with interchromosomal translocations (purple) and intrachromosomal rearrangements (green) observed in primary prostate cancers. Berger et al. (2011) Nature
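The strand-dependent 73 bp shift described above for nucleosome mapping can be written as a small function. This is a sketch under two assumptions that the slides do not state: tags are given in BED coordinates (0-based start, exclusive end), and the shift is applied to the 5' end of each tag. The name nucleosome_midpoint is hypothetical.

```python
def nucleosome_midpoint(start: int, end: int, strand: str, shift: int = 73) -> int:
    """Estimate the nucleosome midpoint from one mapped tag.

    Assumes BED-style coordinates and that the shift applies to the tag's
    5' end: 'start' on the forward strand, 'end - 1' on the reverse strand.
    """
    if strand == "+":
        return start + shift
    if strand == "-":
        return (end - 1) - shift
    raise ValueError(f"unknown strand: {strand!r}")


# Two tags from the BED example, one per strand.
forward_mid = nucleosome_midpoint(10000192, 10000217, "+")
reverse_mid = nucleosome_midpoint(10000227, 10000252, "-")
print(forward_mid, reverse_mid)  # 10000265 10000178
```

After shifting, forward and reverse tags from the same nucleosome pile up at a common position instead of forming two peaks on either side of it.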

Discovering genomic patterns

Barski et al. (2007) Cell

Discovering patterns usually requires some programming (scripting). As a biologist, you need to clearly define your question and the logic for obtaining the data summary.

Discovering genomic patterns

Q: Is H3K4me3 associated with the TSS? Is such an association related to gene expression status?

Logic:
1. Group genes based on expression levels obtained with a microarray study (Su et al., 2004).
2. For each gene, obtain the normalized H3K4me3 ChIP-Seq counts within [-2k, +2k] of the TSS.
3. For each expression group, plot the average value along the [-2k, +2k] interval.
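The three-step logic above can be sketched in Python as follows. Everything about the data layout is an assumption for illustration: the ChIP-Seq signal is taken to be pre-binned as {chrom: {bin index: value}}, genes are (chrom, TSS, strand, expression group) tuples, and the toy example uses a 400 bp flank instead of 2 kb to keep it readable. Plotting is left out; the function returns the averaged values that would be plotted.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Gene = Tuple[str, int, str, str]  # (chrom, tss, strand, expression group)


def average_profiles(signal: Dict[str, Dict[int, float]], genes: List[Gene],
                     flank: int = 2000, bin_size: int = 200) -> Dict[str, List[float]]:
    """Average binned signal around the TSS for each expression group."""
    offsets = list(range(-flank, flank, bin_size))
    sums = defaultdict(lambda: [0.0] * len(offsets))
    n_genes = defaultdict(int)
    for chrom, tss, strand, group in genes:
        # Orient the profile by strand so that negative offsets are upstream.
        sign = 1 if strand == "+" else -1
        n_genes[group] += 1
        for i, offset in enumerate(offsets):
            center = tss + sign * (offset + bin_size // 2)
            if center < 0:
                continue  # window falls off the start of the chromosome
            sums[group][i] += signal.get(chrom, {}).get(center // bin_size, 0.0)
    return {g: [s / n_genes[g] for s in sums[g]] for g in sums}


# Toy signal: bins 400-600, 600-800 and 800-1000 on chr4.
toy_signal = {"chr4": {2: 13.0, 3: 35.0, 4: 27.0}}
toy_genes = [("chr4", 800, "+", "high"), ("chr4", 800, "-", "low")]
profiles = average_profiles(toy_signal, toy_genes, flank=400, bin_size=200)
print(profiles["high"])  # [13.0, 35.0, 27.0, 0.0]
print(profiles["low"])   # [0.0, 27.0, 35.0, 13.0]
```

The two toy genes share a TSS but lie on opposite strands, which shows why strand orientation matters: the same signal gives mirror-image profiles.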

Integrated data analysis and hypothesis formulation

An example: chromatin barrier. Gaszner & Felsenfeld, Nat. Rev. Genet., 2006

Chromatin barriers demarcate the boundaries between heterochromatin and euchromatin regions. They are usually bound by insulator proteins such as CTCF, Su(Hw), etc.

An example of genomic analysis and hypothesis generation

The problem: binding of insulator proteins does not always correlate with a chromatin boundary.

[Figure: H3K27me3 ChIP-Seq and insulator protein ChIP-Seq tracks]

Question: what distinguishes insulator binding sites at a boundary from those that are in heterochromatic (H3K27me3-enriched) regions?

An example of genomic analysis and hypothesis generation

Question: what distinguishes insulator binding sites at a boundary from those that are in heterochromatic (H3K27me3-enriched) regions?

1. Genome-wide identification of H3K27me3 boundaries using ChIP-Seq data. The predictions are verified using RNA-Seq data for the same cell type.
2. Compare the binding levels of CTCF and co-factors. The figures reflect the average binding intensity for hundreds of sites at the boundary (solid line) or within heterochromatic regions (dotted line).

An example of genomic analysis and hypothesis generation (continued)

3. Is the difference significant? U (rank sum) test. [Figure: panels labeled P>0.01 and P<0.001]
4. What might be the cause of this difference? Compare the underlying DNA sequences. The 400 bp regions surrounding the CTCF binding sites at the boundary were compared against those in H3K27me3-enriched regions to identify discriminative motifs (http://meme.sdsc.edu/meme/intro.html).
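The U (rank sum) test used for the significance question can be illustrated with the standard library alone. This is a simplified sketch: it uses average ranks for ties and a two-sided normal approximation without tie or continuity correction, so it is only reasonable for moderately large samples. The function names and the toy binding values are hypothetical; for real analyses an established implementation such as scipy.stats.mannwhitneyu is the safer choice.

```python
import math
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U, two-sided p) using the normal approximation."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    r1 = sum(ranks[:n1])
    u1 = r1 - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u, p


# Hypothetical co-factor binding levels at two classes of CTCF sites.
boundary_sites = [6.0, 7.0, 8.0, 9.0, 10.0]
heterochromatin_sites = [1.0, 2.0, 3.0, 4.0, 5.0]
u_stat, p_value = rank_sum_test(boundary_sites, heterochromatin_sites)
print(u_stat, round(p_value, 4))
```

A rank-based test suits this comparison because ChIP-Seq binding intensities are rarely normally distributed, and ranks are insensitive to a few very strong sites.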

An example of genomic analysis and hypothesis generation

Compared to CTCF binding sites in H3K27me3-enriched regions, CTCF binding sites at the boundary are associated with a multi-dA motif, have lower nucleosome density, and have a higher level of co-factor binding.

Hypothesis: the presence of the multi-dA motif facilitates the binding of co-factors by destabilizing nucleosome formation.

Useful resources for HTS analysis
Galaxy tools (http://galaxy.hpc.ufl.edu/; http://galaxy.psu.edu/)
UCSC Genome Browser (http://genome.ucsc.edu/)
Sequence motif identification (MEME: http://meme.sdsc.edu/meme/intro.html)