Nature Methods: doi: /nmeth.3115

Supplementary Figure 1 Analysis of DNA methylation in a cancer cohort based on Infinium 450K data.

RnBeads was used to rediscover a clinically distinct subgroup of glioblastoma patients characterized by increased DNA methylation levels (termed G-CIMP+), and to predict the G-CIMP status for a total of 124 patients using Infinium 450k data obtained from the TCGA project (http://cancergenome.nih.gov).

(a) Detection of genetic duplicates among the patient samples (columns) using a clustered heatmap of intensity values for the genotyping probes that are present on the Infinium microarray (rows). The inset shows that two samples exhibit a high level of genetic identity, and they are indeed derived from tumors of the same patient.

(b) Quality control plot summarizing the outcome of the data filtering. The bar plots on the top left show that the majority of CpG sites (top) and samples (bottom) are of good quality and can be retained. The relatively straight line in the quantile-quantile plot indicates that the probe filtering does not have a major impact on the distribution of DNA methylation in the dataset.

(c) Identification of a small but clearly distinguished cluster of G-CIMP+ glioblastoma samples with elevated DNA methylation levels especially in CpG-rich genomic regions (light blue in the leftmost column). In the heatmap, blue colors denote high levels of DNA methylation, red indicates low levels and grey represents intermediate levels. For visualization purposes, only the 100 gene promoters (rows) with the highest levels of inter-sample variation in DNA methylation are shown (columns), but the hierarchical clustering is based on the full set of promoters.

(d) Global assessment of the similarity between the DNA methylation profiles, plotting all glioblastoma samples according to their second and third principal components. The samples exhibit strong separation according to the G-CIMP status (denoted by point shape) and IDH1 mutation status (denoted by point color).

(e) Analysis of significant associations between all user-provided sample annotations. Significant p-values (<0.05) are highlighted in the left triangle, and the corresponding statistical tests are annotated in the right triangle (orange: Pearson correlation followed by permutation-based estimation of the p-value; green: Fisher's exact test; blue: Wilcoxon rank sum test; violet: Kruskal-Wallis one-way analysis of variance).

(f) Genome-scale comparison between the DNA methylation levels of G-CIMP positive (y-axis) and G-CIMP negative (x-axis) tumor samples, focusing on CpG islands (left scatterplot) and on 5-kilobase tiling regions with a CpG content in the bottom quartile (right scatterplot), respectively. Genomic regions that are differentially methylated with an FDR below 0.05 are presented as red points. All other regions are displayed in blue, and color brightness denotes point density.

Supplementary Figure 2 RnBeads-based Methylome Resource of reference epigenome data sets. Screenshot of the Methylome Resource (http://rnbeads.mpi-inf.mpg.de/methylomes.php), which makes large DNA methylation datasets more readily available for follow-up research. On the one hand, it provides detailed analysis reports for publicly available methylome datasets that can be explored interactively. On the other hand, the Methylome Resource website lets RnBeads users download all data and configurations that are needed to re-run all or part of the DNA methylation analyses in their local or cloud-based computing environment. These re-runnable analysis configurations make it straightforward for RnBeads users to analyze their own DNA methylation data in the context of publicly available reference epigenome maps.

Comprehensive analysis of DNA methylation data with RnBeads

Yassen Assenov, Fabian Müller, Pavlo Lutsik, Jörn Walter, Thomas Lengauer & Christoph Bock

Supplementary Note

As an example of RnBeads-based analysis of Infinium 450k data, we performed a reanalysis of a publicly available glioblastoma dataset generated by The Cancer Genome Atlas (TCGA) project (Weisenberger, 2014). Glioblastoma multiforme is an aggressive type of brain cancer with a median survival time of little more than a year and substantial variation between patients (Wen and Kesari, 2008). In an attempt to stratify patients according to the molecular characteristics of the tumors, recent research has identified a subtype that is characterized by elevated levels of DNA methylation, prolonged survival and high frequency of mutations in the IDH1 gene (Noushmehr et al., 2010). The discovery of this glioblastoma CpG island methylator phenotype positive (G-CIMP+) subtype was based on Illumina's Infinium 27k assay, prompting us to validate this observation using RnBeads and an extended dataset of Infinium 450k profiles for 124 glioblastoma patients.

We downloaded the raw microarray signal intensity files in IDAT format from the TCGA website (http://tcgadata.nci.nih.gov), created a sample annotation file that contains the available patient data including IDH1 mutation status and then launched RnBeads. The software identifies the data directory and input file format from the annotation file and normalizes the raw intensity data using SWAN (Maksimovic et al., 2012) (other normalization algorithms are supported as well, as described in the Online Methods). CpG-specific DNA methylation levels are obtained from the normalized data and collected in an RnBSet object that is the basis for all subsequent analyses.
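To make the last step above concrete, the following minimal Python sketch shows how a CpG-specific methylation level (beta value) is conventionally derived from a pair of methylated and unmethylated signal intensities. It is an illustration only and not RnBeads code: the function name, the probe names, the intensity values and the offset of 100 are assumptions chosen for the example.

```python
def beta_value(methylated, unmethylated, offset=100.0):
    """Return the beta value M / (M + U + offset) for one CpG probe.

    The offset of 100 is a commonly used default and an assumption here;
    it stabilizes the ratio when both intensities are low.
    """
    if methylated < 0 or unmethylated < 0:
        raise ValueError("signal intensities must be non-negative")
    denominator = methylated + unmethylated + offset
    if denominator == 0:
        return None  # no signal and no offset: the level is undefined
    return methylated / denominator


# Hypothetical normalized intensities (methylated, unmethylated) per probe.
probes = {
    "probe_A": (4500.0, 400.0),   # mostly methylated
    "probe_B": (300.0, 5200.0),   # mostly unmethylated
    "probe_C": (2500.0, 2400.0),  # intermediate
}

for name, (m, u) in probes.items():
    print(name, round(beta_value(m, u), 3))
```

Values close to 1 indicate high methylation and values close to 0 indicate low methylation, which matches the scale used in the heatmaps and scatterplots of Supplementary Figure 1.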
During quality control, RnBeads performs clustering of all samples based on genotype fingerprinting probes included on the Infinium microarray (Supplementary Figure 1a), which is an effective method for identifying sample mix-ups and duplications. Here, we identified two samples with identical SNP patterns, in concordance with their TCGA annotation as primary and recurrent tumors from the same patient. All other samples were taken from genetically unrelated patients. RnBeads provides flexible features for data filtering as part of the preprocessing module (Supplementary Figure 1b), which are useful for excluding measurements that could bias the analysis (e.g., due to low signal quality, overlap with SNPs, or X-chromosome association in case of different sex ratios between cases and controls).

Based on the filtered and quality-controlled dataset, RnBeads performs hierarchical clustering to facilitate data exploration and outlier detection. In the clustered heatmap, we observe a small and distinct group of samples with increased promoter hypermethylation suggestive of the G-CIMP+ subtype (Supplementary Figure 1c). These putative G-CIMP+ samples indeed exhibit the characteristic enrichment of IDH1 mutations and a clear separation with respect to their global DNA methylation levels. These patterns are particularly evident in a low-dimensional projection of the entire dataset that has been annotated with IDH1 mutation status and G-CIMP subtype information (Supplementary Figure 1d). The significance of this association is also confirmed by pairwise statistical tests for associations that RnBeads performs between all sample annotations (Supplementary Figure 1e). Furthermore, RnBeads calculates groupwise comparisons between the mean DNA methylation levels in the G-CIMP positive versus negative samples for CpG islands and for genome-wide tiling regions (Supplementary Figure 1f).
The resulting scatterplots show that the gain of DNA methylation among the G-CIMP+ samples is more pronounced in CpG islands than in genomic regions exhibiting low CpG content. These automated, exploratory analyses provide a starting point for dissecting the patterns and mechanisms of epigenetic deregulation that may affect DNA methylation in G-CIMP+ tumors. Follow-up analyses can be performed directly in R, most conveniently by using the precalculated RnBSet data object that RnBeads prepares as part of the initial analysis. Furthermore, RnBeads makes it easy to export the data and results in a variety of formats and to hand them over to stand-alone or web-based bioinformatic tools for further analysis.
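As an example of such a follow-up step, the sketch below flags regions as differentially methylated at an FDR below 0.05, the threshold used in Supplementary Figure 1f. It is a minimal Python illustration under stated assumptions: the Benjamini-Hochberg procedure is used as one common way to control the FDR and is not claimed to be the procedure RnBeads applies, and the region names, group means and p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical regions: (name, mean beta G-CIMP+, mean beta G-CIMP-, p-value).
regions = [
    ("island_1", 0.82, 0.21, 0.0004),
    ("island_2", 0.64, 0.30, 0.0090),
    ("tiling_1", 0.55, 0.52, 0.4100),
    ("tiling_2", 0.71, 0.66, 0.0400),
]

adjusted = benjamini_hochberg([p for _, _, _, p in regions])

for (name, pos, neg, _), q in zip(regions, adjusted):
    status = "differential" if q < 0.05 else "not significant"
    print(f"{name}: mean difference {pos - neg:+.2f}, FDR {q:.4f}, {status}")
```

In this toy input the two CpG islands pass the threshold while the two tiling regions do not, which mirrors the qualitative pattern described above without reproducing any actual result.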

Supplementary Tables

Supplementary Table 1: Comparison between software tools for DNA methylation analysis
<Large table available as a separate file>

Supplementary Table 2: Performance benchmark for large DNA methylation analyses with RnBeads

Data type (1)   No. of samples (2)   No. of CpGs (3)   No. of annotations (4)   No. of comparisons (5)   Runtime (node) (6)   Runtime (cluster) (7)
Infinium 450k   100                  482,421           2                        2                        2h 12min             1h 9min
Infinium 450k   500                  482,421           6                        6                        15h 2min             7h 29min
Infinium 450k   1000                 482,421           10                       10                       1d 13h 51min         20h 15min
Infinium 450k   4034*                482,421           5                        18                       9d 7h 21min          6d 18h 40min
RRBS            10                   1,804,103         2                        2                        1h 56min             49min
RRBS            50                   2,169,859         6                        6                        5h 32min             1h 54min
RRBS            100                  2,221,920         10                       10                       10h 13min            2h 57min
RRBS            216*                 2,295,083         7                        11                       1d 8h 50min          14h 27min
WGBS            5                    28,132,494        2                        2                        20h 43min            8h 23min
WGBS            10                   28,150,344        6                        6                        2d 10h 23min         20h 5min
WGBS            20                   28,154,125        10                       10                       4d 12h 21min         1d 15h 34min
WGBS            41*                  28,158,385        5                        6                        3d 4h 54min          1d 9h 27min

(1) Data from the following sources were included in the analysis: TCGA (Infinium 450k), ENCODE (RRBS), Ziller et al. (WGBS)
(2) Subsets of the full datasets were randomly generated in order to assess the effect of sample size on runtime
(3) Number of CpG sites present in at least one sample. For RRBS/WGBS, low coverage sites are removed prior to counting
(4) Adding more columns to the sample annotation table increases the complexity and runtime of the analysis
(5) Including more pairwise comparisons in the analysis strongly increases runtime but can be parallelized effectively
(6) Serial runtime measured on a scientific computing cluster (16 nodes), summing up the runtime of all contributing nodes
(7) Parallel runtime / time to completion on a scientific computing cluster (16 nodes) with optimal use of job parallelization
* The analysis results for the full datasets are available as part of the Methylome Resource on the RnBeads website
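The two runtime columns of Supplementary Table 2 can be compared directly by converting them to minutes and taking their ratio. The Python sketch below does this for three rows copied from the table. The ratio is a derived quantity that the table itself does not report, and the helper names are hypothetical; it is meant only as a rough indication of how much the time to completion shrinks relative to the summed serial runtime.

```python
import re

# Minutes per unit as used in the table's runtime notation.
UNIT_MINUTES = {"d": 24 * 60, "h": 60, "min": 1}


def parse_runtime(text):
    """Convert a runtime such as '1d 13h 51min' into minutes."""
    total = 0
    for value, unit in re.findall(r"(\d+)\s*(d|h|min)", text):
        total += int(value) * UNIT_MINUTES[unit]
    return total


def runtime_ratio(node_runtime, cluster_runtime):
    """Ratio of summed serial runtime to parallel time to completion."""
    return parse_runtime(node_runtime) / parse_runtime(cluster_runtime)


# (data type, no. of samples, runtime (node), runtime (cluster)) from the table.
rows = [
    ("Infinium 450k", 100, "2h 12min", "1h 9min"),
    ("RRBS", 100, "10h 13min", "2h 57min"),
    ("WGBS", 20, "4d 12h 21min", "1d 15h 34min"),
]

for data_type, samples, node, cluster in rows:
    ratio = runtime_ratio(node, cluster)
    print(f"{data_type}, {samples} samples: ratio {ratio:.2f}")
```

Because the serial figure sums the runtime of all contributing nodes, this ratio reflects how well the individual jobs could be distributed rather than a per-node efficiency.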