Nature Genetics: doi: /ng Supplementary Figure 1

Similar documents
7SK ChIRP-seq is specifically RNA dependent and conserved between mice and humans.

Nature Structural & Molecular Biology: doi: /nsmb.2419

Accessing and Using ENCODE Data Dr. Peggy J. Farnham

Chromatin marks identify critical cell-types for fine-mapping complex trait variants

Comparison of open chromatin regions between dentate granule cells and other tissues and neural cell types.

Supplementary Figure 1. Efficiency of Mll4 deletion and its effect on T cell populations in the periphery. Nature Immunology: doi: /ni.

Supplementary Figures

Broad H3K4me3 is associated with increased transcription elongation and enhancer activity at tumor suppressor genes

Computational Analysis of UHT Sequences Histone modifications, CAGE, RNA-Seq

Relationship between genomic features and distributions of RS1 and RS3 rearrangements in breast cancer genomes.

Supplementary Figure S1. Gene expression analysis of epidermal marker genes and TP63.

MIR retrotransposon sequences provide insulators to the human genome

Supervised Learner for the Prediction of Hi-C Interaction Counts and Determination of Influential Features. Tyler Yue Lab

Nature Genetics: doi: /ng Supplementary Figure 1. Assessment of sample purity and quality.

Reconstruction of enhancer-target networks in 935 samples of human primary cells, tissues and cell lines

of TERT, MLL4, CCNE1, SENP5, and ROCK1 on tumor development were discussed.

Supplementary Figures

Expanded View Figures

Supplementary Figure 1: High-throughput profiling of survival after exposure to - radiation. (a) Cells were plated in at least 7 wells in a 384-well

ChIP-seq analysis. J. van Helden, M. Defrance, C. Herrmann, D. Puthier, N. Servant, M. Thomas-Chollier, O.Sand

Peak-calling for ChIP-seq and ATAC-seq

ChIP-seq data analysis

Session 6: Integration of epigenetic data. Peter J Park Department of Biomedical Informatics Harvard Medical School July 18-19, 2016

Inferring Biological Meaning from Cap Analysis Gene Expression Data

Supplementary Figure 1: Attenuation of association signals after conditioning for the lead SNP. a) attenuation of association signal at the 9p22.

Yingying Wei George Wu Hongkai Ji

CTCF-Mediated Functional Chromatin Interactome in Pluripotent Cells

Supplementary information for: Human micrornas co-silence in well-separated groups and have different essentialities

Breast cancer. Risk factors you cannot change include: Treatment Plan Selection. Inferring Transcriptional Module from Breast Cancer Profile Data

Nature Immunology: doi: /ni Supplementary Figure 1. Characteristics of SEs in T reg and T conv cells.

Gene-microRNA network module analysis for ovarian cancer

COMPUTATIONAL OPTIMISATION OF TARGETED DNA SEQUENCING FOR CANCER DETECTION

cis-regulatory enrichment analysis in human, mouse and fly

Plasma-Seq conducted with blood from male individuals without cancer.

Lentiviral Delivery of Combinatorial mirna Expression Constructs Provides Efficient Target Gene Repression.

Computational aspects of ChIP-seq. John Marioni Research Group Leader European Bioinformatics Institute European Molecular Biology Laboratory

Processing, integrating and analysing chromatin immunoprecipitation followed by sequencing (ChIP-seq) data

Histone Modifications Are Associated with Transcript Isoform Diversity in Normal and Cancer Cells

Genetic alterations of histone lysine methyltransferases and their significance in breast cancer

ChromHMM Tutorial. Jason Ernst Assistant Professor University of California, Los Angeles

Supplemental Information. Genomic Characterization of Murine. Monocytes Reveals C/EBPb Transcription. Factor Dependence of Ly6C Cells

Nature Genetics: doi: /ng Supplementary Figure 1. SEER data for male and female cancer incidence from

Nature Neuroscience: doi: /nn Supplementary Figure 1. Behavioral training.

Heintzman, ND, Stuart, RK, Hon, G, Fu, Y, Ching, CW, Hawkins, RD, Barrera, LO, Van Calcar, S, Qu, C, Ching, KA, Wang, W, Weng, Z, Green, RD,

Supplementary Information. Supplementary Figures

Integrated network analysis reveals distinct regulatory roles of transcription factors and micrornas

Nature Genetics: doi: /ng Supplementary Figure 1. Workflow of CDR3 sequence assembly from RNA-seq data.

Whole Genome and Transcriptome Analysis of Anaplastic Meningioma. Patrick Tarpey Cancer Genome Project Wellcome Trust Sanger Institute

Nature Immunology: doi: /ni Supplementary Figure 1. Transcriptional program of the TE and MP CD8 + T cell subsets.

Case Studies on High Throughput Gene Expression Data Kun Huang, PhD Raghu Machiraju, PhD

OncoPPi Portal A Cancer Protein Interaction Network to Inform Therapeutic Strategies

EPIGENOMICS PROFILING SERVICES

Supplementary Tables. Supplementary Figures

Nature Structural & Molecular Biology: doi: /nsmb Supplementary Figure 1

Supplemental Information. A Highly Sensitive and Robust Method. for Genome-wide 5hmC Profiling. of Rare Cell Populations

Supplemental Figure 1. Genes showing ectopic H3K9 dimethylation in this study are DNA hypermethylated in Lister et al. study.

a) List of KMTs targeted in the shrna screen. The official symbol, KMT designation,

The search for cis-regulatory driver mutations in cancer genomes

Supplementary Figure 1. Quantile-quantile (Q-Q) plot of the log 10 p-value association results from logistic regression models for prostate cancer

Lung Met 1 Lung Met 2 Lung Met Lung Met H3K4me1. Lung Met H3K27ac Primary H3K4me1

An epigenetic approach to understanding (and predicting?) environmental effects on gene expression

Supplementary Materials for

SUPPLEMENTARY INFORMATION

Nature Genetics: doi: /ng Supplementary Figure 1. Mutational signatures in BCC compared to melanoma.

Raymond Auerbach PhD Candidate, Yale University Gerstein and Snyder Labs August 30, 2012

Statistical analysis of RIM data (retroviral insertional mutagenesis) Bioinformatics and Statistics The Netherlands Cancer Institute Amsterdam

Supplementary Information. Preferential associations between co-regulated genes reveal a. transcriptional interactome in erythroid cells

Figure S2. Distribution of acgh probes on all ten chromosomes of the RIL M0022

Supplementary Materials for

The Epigenome Tools 2: ChIP-Seq and Data Analysis

underlying metastasis and recurrence in HNSCC, we analyzed two groups of patients. The

Sudin Bhattacharya Institute for Integrative Toxicology

The Insulator Binding Protein CTCF Positions 20 Nucleosomes around Its Binding Sites across the Human Genome

Nature Biotechnology: doi: /nbt.1904

Analysis of the peroxisome proliferator-activated receptor-β/δ (PPARβ/δ) cistrome reveals novel co-regulatory role of ATF4

SSM signature genes are highly expressed in residual scar tissues after preoperative radiotherapy of rectal cancer.

Theta sequences are essential for internally generated hippocampal firing fields.

Discovery of Novel Human Gene Regulatory Modules from Gene Co-expression and

Supplemental Figure S1. Expression of Cirbp mrna in mouse tissues and NIH3T3 cells.

VL Network Analysis ( ) SS2016 Week 3

Nature Genetics: doi: /ng Supplementary Figure 1

p.r623c p.p976l p.d2847fs p.t2671 p.d2847fs p.r2922w p.r2370h p.c1201y p.a868v p.s952* RING_C BP PHD Cbp HAT_KAT11

Patterns of Histone Methylation and Chromatin Organization in Grapevine Leaf. Rachel Schwope EPIGEN May 24-27, 2016

Nature Genetics: doi: /ng Supplementary Figure 1. Phenotypic characterization of MES- and ADRN-type cells.

Supplementary Figure 1: Features of IGLL5 Mutations in CLL: a) Representative IGV screenshot of first

MODULE 4: SPLICING. Removal of introns from messenger RNA by splicing

Supplementary information

The Loss of Heterozygosity (LOH) Algorithm in Genotyping Console 2.0

High Throughput Sequence (HTS) data analysis. Lei Zhou

Journal: Nature Methods

Supplementary Figures

Nature Genetics: doi: /ng Supplementary Figure 1. HOX fusions enhance self-renewal capacity.

The common colorectal cancer predisposition SNP rs at chromosome 8q24 confers potential to enhanced Wnt signaling

Nature Genetics: doi: /ng Supplementary Figure 1

The corrected Figure S1J is shown below. The text changes are as follows, with additions in bold and deletions in bracketed italics:

Nature Genetics: doi: /ng Supplementary Figure 1

A prostate cancer susceptibility allele at 6q22 increases RFX6 expression by modulating HOXB13 chromatin binding

Supplementary Figure 1. Spitzoid Melanoma with PPFIBP1-MET fusion. (a) Histopathology (4x) shows a domed papule with melanocytes extending into the

Hands-On Ten The BRCA1 Gene and Protein

Nature Medicine: doi: /nm.4439

Transcription:

Supplementary Figure 1 Expression deviation of the genes mapped to gene-wise recurrent mutations in the TCGA breast cancer cohort (top) and the TCGA lung cancer cohort (bottom). For each gene (each pair of red and blue bars), the average degree of expression deviation is compared between the mutated samples (red bar) and non-mutated samples (blue bar). For 72% (top) and 73% (bottom) of the cases, the mutated samples exhibited a higher degree of expression deviation. Genes were sorted according to the degree of the difference between the mutated and non-mutated groups. All genes are collapsed in the box plots on the right.

Supplementary Figure 2 Effects of copy number alterations versus mutations. (a) The effects of copy number alterations for genes ordered as in Figure 1c. For each gene (each pair of red and blue bars), the percentage of samples with copy number alterations is compared between the mutated samples (red bar) and non-mutated samples (blue bar). All genes are collapsed in the box plots on the right. (b) Outcome of multiple linear regression quantitatively comparing the contribution of mutation recurrence and copy number alteration to expression perturbation. Copy number alteration was processed through the deviation metric used for expression perturbation. The contribution of mutation burden remained significant after adjusting for copy number alteration.

Supplementary Figure 3 Permutation tests for expression perturbation by mutations. Gene expression profiles based on the RNA sequencing of 92 breast and 90 lung cancer samples were obtained from the TCGA and matched with the whole-genome sequences of the TCGA cohort samples. Array-based gene expression profiles of breast cancer samples were obtained from the ArrayExpress archive (E-MTAB-1088) and matched with the whole-genome sequences of the Sanger cohort samples. We obtained expression difference for each gene between the mutated and non-mutated groups as the t statistic. To estimate expected expression differences, the sample grouping (mutated versus non-mutated) was randomized while maintaining the sample size of each group. This permutation was repeated 1,000 times, and the t value was recalculated each time.

Supplementary Figure 4 Directionality of expression perturbation by mutations. The x axis has genes ordered as in Figure 1c. The y axis indicates the fraction of cases (samples) that go in the same direction (either positive or negative expression deviation).

Supplementary Figure 5 Recurrence rate (the percentage of mutated samples in the given cohort) was obtained for each mutation (for site-specific recurrence) or each gene (for gene-level recurrence). The site-specific recurrence rate was mapped to the target gene via the same chromatin interactome used for the gene-level recurrence analysis. P value indicates the differences in the recurrence rate between the known cancer genes and other genes.

Supplementary Figure 6 Permutation tests for biological validity of the chromatin interaction data. We shuffled chromatin interaction links between enhancers and promoters. This procedure was repeated 1,000 times. High recurrence rates were observed for known cancer genes (Fig. 2c), cancer gene interacting partners (Fig. 2d), and genes with relevant GO terms (Fig. 2e). We tested whether these patterns disappear as target genes are randomly mapped to mutations. To this end, the average recurrence rate of the above three groups of genes was obtained and divided by that of all genes. These relative recurrence rates were significantly higher with the real chromatin interactome map (black bars) than with the 1,000 randomized maps (gray density plots).

Supplementary Figure 7 Results of site-specific LOOCV The proportion of positive votes by 1,000 decision trees in our random forest for site-specific LOOCV compared between truly recurrent mutations (green) and false, non-recurrent, mutations (red) depending on tumor and epigenome types.

Supplementary Table 1. Whole genome mutation data Cohort Cancer type Number of samples Number of point mutations Breast 119 647,695 Sanger Lung 24 1,446,336 GIS Lung 30 1,763,572 Breast 92 756,434 TCGA Lung 90 3,594,840

Supplementary Table 2. Long-range enhancer-promoter data Cancer type State Cell line Data source Number of interaction pairs MCF7 ChIA-PET 24,501 Cancer MCF7 IM-PET 25,746 MCF7, T47D DHS tag correlation 66,222 Breast MCF7, T47D CAGE expression correlation 18,362 HMEC IM-PET 20,254 Cell of origin HMEC DHS tag correlation 75,167 HMEC CAGE expression correlation 26,001 A549 IM-PET* 27,435 Cancer A549 DHS tag correlation 50,135 Lung A549 CAGE expression correlation 18,509 NHLF, IMR90 IM-PET 40,189 Cell of origin AGO4550, HPF, NHLF, WI38 DHS tag correlation 70,785 AGO4550, HPF, NHLF, WI38 CAGE expression correlation 25,400 DHS tag correlation: Correlation of tag density between promoter DHSs and non-promoter DHSs within ± 500kb of the tss and within the same topologically associated domain CAGE expression correlation: TSS-enhancer association based on CAGE expression levels of erna and mrna in the FANTOM5 panel *This data was generated as described in the Methods

Supplementary Table 4. Random forest input features Target gene features Binding TF features TFBS features Genomic features Cancer epigenome features Cell-of-origin epigenome features Feature ID Target_Cancer_Gene Target_HumanNet.CGS_L1 Target_HumanNet.CGS_L2 Target_Interactome.CGS_L1 Target_Interactome.CGS_L2 Target_GO_cell.death Target_GO_cell.differentiation Target_GO_cell.proliferation Target_GO_mitototic.cell.cycle Target_DEG_score Target_Expression Target_CNA TF_Cancer_Gene TF_HumanNet.CGS_L1 TF_HumanNet.CGS_L2 TF_Interactome.CGS_L1 TF_Interactome.CGS_L2 TF_GO_cell.death TF_GO_cell.differentiation TF_GO_cell.proliferation TF_GO_mitototic.cell.cycle TF_DEG_score diff_pval avg_pval TF_sign Dist_GWAS_2D Early.to.late_Rate PhastCons Dnase H3k20me1 H3k27ac H3k27me3 H3k36me3 H3k4me1 H3k4me2 H3k4me3 H3k79me2 H3k9ac H3k9me3 Orig_Dnase Orig_H3k20me1 Orig_H3k27ac Orig_H3k27me3 Orig_H3k36me3 Orig_H3k4me1 Orig_H3k4me2 Orig_H3k4me3 Orig_H3k79me2 Orig_H3k9ac Orig_H3k9me3 Description Whether the target gene is a known cancer gene Number of one-hop interactions of the target gene with known cancer genes in HumanNet Number of two-hop interactions of the target gene with known cancer genes in HumanNet Number of one-hop interactions of the target gene with known cancer genes in Interactome Number of two-hop interactions of the target gene with known cancer genes in Interactome Whether the target gene belongs to a Gene Ontology term related to cell death Whether the target gene belongs to a Gene Ontology term related to cell differentation Whether the target gene belongs to a Gene Ontology term related to cell proliferation Whether the target gene belongs to a Gene Ontology term related to mitotic cell cycle Differential expression of the target gene between cancer and normal tissues* Expression level of the target gene in cancer Copy number alteration of the target region Whether the regulator is a known cancer gene Number of one-hop interactions of the regulator with known cancer genes in HumanNet Number of two-hop interactions of the regulator with known cancer genes in HumanNet Number of one-hop interactions of the regulator with known cancer genes in Interactome Number of two-hop interactions of the regulator with known cancer genes in Interactome Whether the regulator belongs to a Gene Ontology term related to cell death Whether the regulator belongs to a Gene Ontology term related to cell differentation Whether the regulator belongs to a Gene Ontology term related to cell proliferation Whether the regulator belongs to a Gene Ontology term related to mitotic cell cycle Differential expression of the regulator gene between cancer and normal tissues* Change in the P value of the motif score by the mutation Average of the P values of the motif scores with and without the mutation Gain or loss of the TF binding site by the mutation Chromosomal distance from the mutation to the closest cancer GWAS locus Replication timing represented as the early-to-late ratio, (G1B+S1) /(S4+G2)** Evolutionary conservation of the underlying sequences as meaured by PhastCons DNase peak score in the cancer cell line H3K20me1 modification peak score in the cancer cell line H3K27ac modification peak score in the cancer cell line H3K27me3 modification peak score in the cancer cell line H3K36me3 modification peak score in the cancer cell line H3K4me1 modification peak score in the cancer cell line H3K4me2 modification peak score in the cancer cell line H3K4me3 modification peak score in the cancer cell line H3K79me2 modification peak score in the cancer cell line H3K9ac modification peak score in the cancer cell line H3K9me3 modification peak score in the cancer cell line DNase peak score in the cells of origin H3K20me1 modification peak score in the cells of origin H3K27ac modification peak score in the cells of origin H3K27me3 modification peak score in the cells of origin H3K36me3 modification peak score in the cells of origin H3K4me1 modification peak score in the cells of origin H3K4me2 modification peak score in the cells of origin H3K4me3 modification peak score in the cells of origin H3K79me2 modification peak score in the cells of origin H3K9ac modification peak score in the cells of origin H3K9me3 modification peak score in the cells of origin * Sigmoid normalization of the P value of the T test as (log P)/(1+log P) ** Ref: http://www.nature.com/nature/journal/v488/n7412/abs/nature11273.html

Supplementary Table 7. Correlation of gene-level mutation burden and expression level Survival gene Survival P value (expressionsurvival correlation) Correlation between mutation burden and expression level Correlation P value SNORA48 0.027 0.418 5.59E-05 SNORA14B 0.014 0.386 2.19E-04 LAMP1 0.005 0.34 1.28E-03 MAPK12 0.048 0.332 1.70E-03 PRSS54 0.044 0.329 1.83E-03 PITPNM2 0.033 0.328 1.92E-03 LOC100129935 0.044 0.328 1.61E-03 APCS 0.035 0.314 3.06E-03 VCX3B 0.029 0.311 2.86E-03 CARS2 0.011 0.304 4.16E-03 RPS6KA1 0.007 0.304 4.20E-03

Supplementary Table 9. Gained motifs for ETS transcription factors by distal tert mutations Mutation ID Motif start Motif end P value ( < 1E-03) ETS family tert-6 1280320 1280333 4.03E-04 SPIC tert-6 1280320 1280333 5.23E-04 SPIC tert-6 1280322 1280331 7.69E-04 ELK1 tert-6 1280323 1280331 8.03E-04 GABPA tert-9 1293556 1293565 8.59E-05 FLI1 tert-9 1293554 1293563 1.57E-04 ELK1 tert-9 1293554 1293567 3.85E-04 GABPA tert-9 1293556 1293566 4.11E-04 ELK4 tert-9 1293556 1293566 4.21E-04 ERG tert-9 1293556 1293564 4.45E-04 ELK1 tert-9 1293556 1293564 5.78E-04 ELF1 tert-9 1293554 1293566 6.16E-04 GABPA tert-9 1293557 1293566 6.39E-04 ELK4 tert-9 1293555 1293565 6.95E-04 GABPA tert-9 1293555 1293565 7.24E-04 GABPA tert-9 1293556 1293565 7.56E-04 ETV3 tert-9 1293553 1293565 7.71E-04 GABPA tert-9 1293553 1293566 7.88E-04 ELK1 tert-9 1293555 1293565 8.16E-04 ELF1 tert-9 1293555 1293565 8.21E-04 ELK4 tert-9 1293552 1293566 8.37E-04 GABPA tert-9 1293556 1293565 9.49E-04 ETS1 tert-9 1293556 1293565 9.57E-04 ELK1 tert-9 1293556 1293565 9.92E-04 ELK4

Supplementary Note Background mutation model We first chose the mutations that appear in two or more samples either gene-wise or in a site-specific manner, and then measured the statistical significance of mutation occurrence in a target region by considering local background mutation rates. As for site recurrence, we defined the target region as 5, 10, or 20 bp windows centred on the mutation of interest. As for gene-level recurrence, the mutation-containing regulatory region that is defined in the associated chromatin interactome data was used as the target region. We estimated the expected mutation frequencies per site based on background mutation density in windows of varying size (1 kb, 5 kb, 10 kb, or 100 kb). Consequently, the probability that a given region is mutated can be given as Prob target region R is mutated = 1 1 Prob site s is mutated. Given a very low per-site mutation frequency, the expected regional probability of mutation occurrence, p, was approximated as follows:!! p = Prob target region R is mutated!! Prob site s is mutated. Given p and the total number of samples n, the number of samples in which the given region is mutated follows the binomial distribution, B(n, p). We used Poisson approximation with parameter np for the binomial distribution. Therefore, the statistical significance of mutation occurrence x in a given target region was represented as the P value,

!!! e P X x = 1!!" np! i!!!!, where x = 1,2,, n and X ~ Poisson np. Only the target regions with P < 0.05 were retained. For site-specific significance, we used the pan-cancer data that were described in the Whole genome cancer sequences section. Additionally, the significance of gene-level occurrence was tested. We mapped the pan-cancer data including irrelevant tumour types to the cell-type-specific chromatin interactome to estimate the expected gene-level probability of mutation occurrence as q = Prob target gene G is mutated E/N, where E is the number of the pan-cancer samples with the given gene mutated, and N is the total number of the pan-cancer samples. Then, the number of cohort samples with the given gene mutated was assumed to follow the binomial distribution, B(n, q). Under the null hypothesis that cell-type specific recurrence was not different from the pan-cancer recurrence, the P value can be calculated as follows: P Y y = 1!!!!!! n i q! (1 q)!!!, where y = 1,2,, n and Y ~ B n, q. Finally, the mutations with the gene recurrence P value < 0.01 or < 0.05 were included in the true set.

Cohort extension analysis When the Sanger cohort was used as the initial (reference) cohort, the TCGA cohort was employed as the extension (rescue) cohort, and vice versa. We performed the LOOCV as described above for the initial cohort. The threshold fraction of votes between positive and negative calls was determined by applying the Jenks natural breaks classification method 1 for the distribution of votes for the two classes (true or false) of mutations. In other words, we sought a threshold that can minimize each class s average deviation from the class s mean vote while maximizing each class s deviation from the mean votes of the other classes. We then looked for the mutations that were positively predicted as recurrent although not actually being recurrent in the initial reference cohort but turned out to be recurrent when the gene-level recurrence rate was obtained after combining the initial and extension cohorts. These mutations were referred to as rescued mutations. The rescue frequency was calculated as the fraction of the rescued mutations over all mutations, that is, Rescue frequency % = True!"#$%&'( True!"#$%&'( + False!"#$%&'( 100, where True!"#$%&'( indicates the number of true (recurrent) mutations in the combined cohort and False!"#$%&'( is the number of false (non-recurrent) mutations in the combined cohort. This frequency was obtained for the initial positive calls and negative calls, separately. Next, we wanted to estimate the expected number of samples required to achieve the rescue frequency of 100% for positive calls. First, by a simple linear extrapolation, this number could be obtained as N 100 Rescue frequency or N Positive True!"#$%&'(, where N is the number of samples in the extension (rescue) cohort and Positive indicates the number of all positive calls. However, not all positive calls can be true and expected to be rescued.

Therefore, we estimated the number of true positives by assessing the precision of our random forest classifier in the LOOCV for the given initial cohort and the Jenks threshold. Because the precision is the ratio of true positives over all positives, the adjusted number of required samples should be N Positive Precision True!"#$%&'(.However, false negatives also should be taken into account. Sensitivity or recall is the ratio of true positives over all trues (true positives and false negatives). We assessed the sensitivity of our random forest classifier by estimating the number of false negatives as the initial negative calls that showed recurrence upon cohort extension. Hence, the final adjusted number of required samples can be given as N Positive Precision Recall True!"#$%&'(. We repeated all these calculations for each initial-extension cohort combination and also separately for the predictions based on cancer epigenomes and cell-of-origin epigenomes. 1. Jenks, G. F. & Caspall, F. C. Error on Choropleth Maps: Definition, Measurement, Reduction. Ann. Assoc. Am. Geogr. 61, 217 244 (1971).