Supplementary Figure 1 Expression deviation of the genes mapped to gene-wise recurrent mutations in the TCGA breast cancer cohort (top) and the TCGA lung cancer cohort (bottom). For each gene (each pair of red and blue bars), the average degree of expression deviation is compared between the mutated samples (red bar) and non-mutated samples (blue bar). For 72% (top) and 73% (bottom) of the cases, the mutated samples exhibited a higher degree of expression deviation. Genes were sorted according to the degree of the difference between the mutated and non-mutated groups. All genes are collapsed in the box plots on the right.
Supplementary Figure 2 Effects of copy number alterations versus mutations. (a) The effects of copy number alterations for genes ordered as in Figure 1c. For each gene (each pair of red and blue bars), the percentage of samples with copy number alterations is compared between the mutated samples (red bar) and non-mutated samples (blue bar). All genes are collapsed in the box plots on the right. (b) Outcome of multiple linear regression quantitatively comparing the contribution of mutation recurrence and copy number alteration to expression perturbation. Copy number alteration was processed through the deviation metric used for expression perturbation. The contribution of mutation burden remained significant after adjusting for copy number alteration.
Supplementary Figure 3 Permutation tests for expression perturbation by mutations. Gene expression profiles based on the RNA sequencing of 92 breast and 90 lung cancer samples were obtained from the TCGA and matched with the whole-genome sequences of the TCGA cohort samples. Array-based gene expression profiles of breast cancer samples were obtained from the ArrayExpress archive (E-MTAB-1088) and matched with the whole-genome sequences of the Sanger cohort samples. We obtained expression difference for each gene between the mutated and non-mutated groups as the t statistic. To estimate expected expression differences, the sample grouping (mutated versus non-mutated) was randomized while maintaining the sample size of each group. This permutation was repeated 1,000 times, and the t value was recalculated each time.
Supplementary Figure 4 Directionality of expression perturbation by mutations. The x axis has genes ordered as in Figure 1c. The y axis indicates the fraction of cases (samples) that go in the same direction (either positive or negative expression deviation).
Supplementary Figure 5 Recurrence rate (the percentage of mutated samples in the given cohort) was obtained for each mutation (for site-specific recurrence) or each gene (for gene-level recurrence). The site-specific recurrence rate was mapped to the target gene via the same chromatin interactome used for the gene-level recurrence analysis. P value indicates the differences in the recurrence rate between the known cancer genes and other genes.
Supplementary Figure 6 Permutation tests for biological validity of the chromatin interaction data. We shuffled chromatin interaction links between enhancers and promoters. This procedure was repeated 1,000 times. High recurrence rates were observed for known cancer genes (Fig. 2c), cancer gene interacting partners (Fig. 2d), and genes with relevant GO terms (Fig. 2e). We tested whether these patterns disappear as target genes are randomly mapped to mutations. To this end, the average recurrence rate of the above three groups of genes was obtained and divided by that of all genes. These relative recurrence rates were significantly higher with the real chromatin interactome map (black bars) than with the 1,000 randomized maps (gray density plots).
Supplementary Figure 7 Results of site-specific LOOCV The proportion of positive votes by 1,000 decision trees in our random forest for site-specific LOOCV compared between truly recurrent mutations (green) and false, non-recurrent, mutations (red) depending on tumor and epigenome types.
Supplementary Table 1. Whole genome mutation data Cohort Cancer type Number of samples Number of point mutations Breast 119 647,695 Sanger Lung 24 1,446,336 GIS Lung 30 1,763,572 Breast 92 756,434 TCGA Lung 90 3,594,840
Supplementary Table 2. Long-range enhancer-promoter data Cancer type State Cell line Data source Number of interaction pairs MCF7 ChIA-PET 24,501 Cancer MCF7 IM-PET 25,746 MCF7, T47D DHS tag correlation 66,222 Breast MCF7, T47D CAGE expression correlation 18,362 HMEC IM-PET 20,254 Cell of origin HMEC DHS tag correlation 75,167 HMEC CAGE expression correlation 26,001 A549 IM-PET* 27,435 Cancer A549 DHS tag correlation 50,135 Lung A549 CAGE expression correlation 18,509 NHLF, IMR90 IM-PET 40,189 Cell of origin AGO4550, HPF, NHLF, WI38 DHS tag correlation 70,785 AGO4550, HPF, NHLF, WI38 CAGE expression correlation 25,400 DHS tag correlation: Correlation of tag density between promoter DHSs and non-promoter DHSs within ± 500kb of the tss and within the same topologically associated domain CAGE expression correlation: TSS-enhancer association based on CAGE expression levels of erna and mrna in the FANTOM5 panel *This data was generated as described in the Methods
Supplementary Table 4. Random forest input features Target gene features Binding TF features TFBS features Genomic features Cancer epigenome features Cell-of-origin epigenome features Feature ID Target_Cancer_Gene Target_HumanNet.CGS_L1 Target_HumanNet.CGS_L2 Target_Interactome.CGS_L1 Target_Interactome.CGS_L2 Target_GO_cell.death Target_GO_cell.differentiation Target_GO_cell.proliferation Target_GO_mitototic.cell.cycle Target_DEG_score Target_Expression Target_CNA TF_Cancer_Gene TF_HumanNet.CGS_L1 TF_HumanNet.CGS_L2 TF_Interactome.CGS_L1 TF_Interactome.CGS_L2 TF_GO_cell.death TF_GO_cell.differentiation TF_GO_cell.proliferation TF_GO_mitototic.cell.cycle TF_DEG_score diff_pval avg_pval TF_sign Dist_GWAS_2D Early.to.late_Rate PhastCons Dnase H3k20me1 H3k27ac H3k27me3 H3k36me3 H3k4me1 H3k4me2 H3k4me3 H3k79me2 H3k9ac H3k9me3 Orig_Dnase Orig_H3k20me1 Orig_H3k27ac Orig_H3k27me3 Orig_H3k36me3 Orig_H3k4me1 Orig_H3k4me2 Orig_H3k4me3 Orig_H3k79me2 Orig_H3k9ac Orig_H3k9me3 Description Whether the target gene is a known cancer gene Number of one-hop interactions of the target gene with known cancer genes in HumanNet Number of two-hop interactions of the target gene with known cancer genes in HumanNet Number of one-hop interactions of the target gene with known cancer genes in Interactome Number of two-hop interactions of the target gene with known cancer genes in Interactome Whether the target gene belongs to a Gene Ontology term related to cell death Whether the target gene belongs to a Gene Ontology term related to cell differentation Whether the target gene belongs to a Gene Ontology term related to cell proliferation Whether the target gene belongs to a Gene Ontology term related to mitotic cell cycle Differential expression of the target gene between cancer and normal tissues* Expression level of the target gene in cancer Copy number alteration of the target region Whether the regulator is a known cancer gene Number of one-hop interactions of the regulator with known cancer genes in HumanNet Number of two-hop interactions of the regulator with known cancer genes in HumanNet Number of one-hop interactions of the regulator with known cancer genes in Interactome Number of two-hop interactions of the regulator with known cancer genes in Interactome Whether the regulator belongs to a Gene Ontology term related to cell death Whether the regulator belongs to a Gene Ontology term related to cell differentation Whether the regulator belongs to a Gene Ontology term related to cell proliferation Whether the regulator belongs to a Gene Ontology term related to mitotic cell cycle Differential expression of the regulator gene between cancer and normal tissues* Change in the P value of the motif score by the mutation Average of the P values of the motif scores with and without the mutation Gain or loss of the TF binding site by the mutation Chromosomal distance from the mutation to the closest cancer GWAS locus Replication timing represented as the early-to-late ratio, (G1B+S1) /(S4+G2)** Evolutionary conservation of the underlying sequences as meaured by PhastCons DNase peak score in the cancer cell line H3K20me1 modification peak score in the cancer cell line H3K27ac modification peak score in the cancer cell line H3K27me3 modification peak score in the cancer cell line H3K36me3 modification peak score in the cancer cell line H3K4me1 modification peak score in the cancer cell line H3K4me2 modification peak score in the cancer cell line H3K4me3 modification peak score in the cancer cell line H3K79me2 modification peak score in the cancer cell line H3K9ac modification peak score in the cancer cell line H3K9me3 modification peak score in the cancer cell line DNase peak score in the cells of origin H3K20me1 modification peak score in the cells of origin H3K27ac modification peak score in the cells of origin H3K27me3 modification peak score in the cells of origin H3K36me3 modification peak score in the cells of origin H3K4me1 modification peak score in the cells of origin H3K4me2 modification peak score in the cells of origin H3K4me3 modification peak score in the cells of origin H3K79me2 modification peak score in the cells of origin H3K9ac modification peak score in the cells of origin H3K9me3 modification peak score in the cells of origin * Sigmoid normalization of the P value of the T test as (log P)/(1+log P) ** Ref: http://www.nature.com/nature/journal/v488/n7412/abs/nature11273.html
Supplementary Table 7. Correlation of gene-level mutation burden and expression level Survival gene Survival P value (expressionsurvival correlation) Correlation between mutation burden and expression level Correlation P value SNORA48 0.027 0.418 5.59E-05 SNORA14B 0.014 0.386 2.19E-04 LAMP1 0.005 0.34 1.28E-03 MAPK12 0.048 0.332 1.70E-03 PRSS54 0.044 0.329 1.83E-03 PITPNM2 0.033 0.328 1.92E-03 LOC100129935 0.044 0.328 1.61E-03 APCS 0.035 0.314 3.06E-03 VCX3B 0.029 0.311 2.86E-03 CARS2 0.011 0.304 4.16E-03 RPS6KA1 0.007 0.304 4.20E-03
Supplementary Table 9. Gained motifs for ETS transcription factors by distal tert mutations Mutation ID Motif start Motif end P value ( < 1E-03) ETS family tert-6 1280320 1280333 4.03E-04 SPIC tert-6 1280320 1280333 5.23E-04 SPIC tert-6 1280322 1280331 7.69E-04 ELK1 tert-6 1280323 1280331 8.03E-04 GABPA tert-9 1293556 1293565 8.59E-05 FLI1 tert-9 1293554 1293563 1.57E-04 ELK1 tert-9 1293554 1293567 3.85E-04 GABPA tert-9 1293556 1293566 4.11E-04 ELK4 tert-9 1293556 1293566 4.21E-04 ERG tert-9 1293556 1293564 4.45E-04 ELK1 tert-9 1293556 1293564 5.78E-04 ELF1 tert-9 1293554 1293566 6.16E-04 GABPA tert-9 1293557 1293566 6.39E-04 ELK4 tert-9 1293555 1293565 6.95E-04 GABPA tert-9 1293555 1293565 7.24E-04 GABPA tert-9 1293556 1293565 7.56E-04 ETV3 tert-9 1293553 1293565 7.71E-04 GABPA tert-9 1293553 1293566 7.88E-04 ELK1 tert-9 1293555 1293565 8.16E-04 ELF1 tert-9 1293555 1293565 8.21E-04 ELK4 tert-9 1293552 1293566 8.37E-04 GABPA tert-9 1293556 1293565 9.49E-04 ETS1 tert-9 1293556 1293565 9.57E-04 ELK1 tert-9 1293556 1293565 9.92E-04 ELK4
Supplementary Note Background mutation model We first chose the mutations that appear in two or more samples either gene-wise or in a site-specific manner, and then measured the statistical significance of mutation occurrence in a target region by considering local background mutation rates. As for site recurrence, we defined the target region as 5, 10, or 20 bp windows centred on the mutation of interest. As for gene-level recurrence, the mutation-containing regulatory region that is defined in the associated chromatin interactome data was used as the target region. We estimated the expected mutation frequencies per site based on background mutation density in windows of varying size (1 kb, 5 kb, 10 kb, or 100 kb). Consequently, the probability that a given region is mutated can be given as Prob target region R is mutated = 1 1 Prob site s is mutated. Given a very low per-site mutation frequency, the expected regional probability of mutation occurrence, p, was approximated as follows:!! p = Prob target region R is mutated!! Prob site s is mutated. Given p and the total number of samples n, the number of samples in which the given region is mutated follows the binomial distribution, B(n, p). We used Poisson approximation with parameter np for the binomial distribution. Therefore, the statistical significance of mutation occurrence x in a given target region was represented as the P value,
!!! e P X x = 1!!" np! i!!!!, where x = 1,2,, n and X ~ Poisson np. Only the target regions with P < 0.05 were retained. For site-specific significance, we used the pan-cancer data that were described in the Whole genome cancer sequences section. Additionally, the significance of gene-level occurrence was tested. We mapped the pan-cancer data including irrelevant tumour types to the cell-type-specific chromatin interactome to estimate the expected gene-level probability of mutation occurrence as q = Prob target gene G is mutated E/N, where E is the number of the pan-cancer samples with the given gene mutated, and N is the total number of the pan-cancer samples. Then, the number of cohort samples with the given gene mutated was assumed to follow the binomial distribution, B(n, q). Under the null hypothesis that cell-type specific recurrence was not different from the pan-cancer recurrence, the P value can be calculated as follows: P Y y = 1!!!!!! n i q! (1 q)!!!, where y = 1,2,, n and Y ~ B n, q. Finally, the mutations with the gene recurrence P value < 0.01 or < 0.05 were included in the true set.
Cohort extension analysis When the Sanger cohort was used as the initial (reference) cohort, the TCGA cohort was employed as the extension (rescue) cohort, and vice versa. We performed the LOOCV as described above for the initial cohort. The threshold fraction of votes between positive and negative calls was determined by applying the Jenks natural breaks classification method 1 for the distribution of votes for the two classes (true or false) of mutations. In other words, we sought a threshold that can minimize each class s average deviation from the class s mean vote while maximizing each class s deviation from the mean votes of the other classes. We then looked for the mutations that were positively predicted as recurrent although not actually being recurrent in the initial reference cohort but turned out to be recurrent when the gene-level recurrence rate was obtained after combining the initial and extension cohorts. These mutations were referred to as rescued mutations. The rescue frequency was calculated as the fraction of the rescued mutations over all mutations, that is, Rescue frequency % = True!"#$%&'( True!"#$%&'( + False!"#$%&'( 100, where True!"#$%&'( indicates the number of true (recurrent) mutations in the combined cohort and False!"#$%&'( is the number of false (non-recurrent) mutations in the combined cohort. This frequency was obtained for the initial positive calls and negative calls, separately. Next, we wanted to estimate the expected number of samples required to achieve the rescue frequency of 100% for positive calls. First, by a simple linear extrapolation, this number could be obtained as N 100 Rescue frequency or N Positive True!"#$%&'(, where N is the number of samples in the extension (rescue) cohort and Positive indicates the number of all positive calls. However, not all positive calls can be true and expected to be rescued.
Therefore, we estimated the number of true positives by assessing the precision of our random forest classifier in the LOOCV for the given initial cohort and the Jenks threshold. Because the precision is the ratio of true positives over all positives, the adjusted number of required samples should be N Positive Precision True!"#$%&'(.However, false negatives also should be taken into account. Sensitivity or recall is the ratio of true positives over all trues (true positives and false negatives). We assessed the sensitivity of our random forest classifier by estimating the number of false negatives as the initial negative calls that showed recurrence upon cohort extension. Hence, the final adjusted number of required samples can be given as N Positive Precision Recall True!"#$%&'(. We repeated all these calculations for each initial-extension cohort combination and also separately for the predictions based on cancer epigenomes and cell-of-origin epigenomes. 1. Jenks, G. F. & Caspall, F. C. Error on Choropleth Maps: Definition, Measurement, Reduction. Ann. Assoc. Am. Geogr. 61, 217 244 (1971).