Supplementary Figure 1 Summary of exome sequencing data. (a) Exome tumor-normal sample sizes for bladder cancer (BLCA), breast cancer (BRCA), carcinoid (CARC), chronic lymphocytic leukemia (CLLX), colorectal cancer (COLR), diffuse large B-cell lymphoma (DLBC), esophageal adenocarcinoma (ESOP), glioblastoma multiforme (GLBM), head and neck cancer (HNSC), kidney clear cell carcinoma (KIRC), acute myeloid leukemia (LAML), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), medulloblastoma (MEDU), melanoma (MELA), multiple myeloma (MUMY), neuroblastoma (NEUB), ovarian cancer (OVAR), prostate cancer (PRAD), rhabdoid tumor (RHAB) and uterine corpus endometrial carcinoma (UCEC). (b) Reference coordinates for mutation impact annotation 29 (SnpEff). CDS, coding sequence.
Supplementary Figure 2 Background mutation models capture variance in somatic mutation rates and are well correlated. (a) Genome-wide transition/transversion mutation probabilities per tumor type. (b) Absolute difference in the log probabilities of complementary mutations (C>T and G>A) per gene in melanoma for the Bayesian and 'Exonic' mutation probability models. The percentage of genes where complementary mutation probabilities are within one order of magnitude is indicated. (c) The median of Spearman correlations between the average Bayesian and 'Matched' mutation probabilities in distinct tumor types is shown for the sets of tumor types with minimum numbers of samples (x axis). (d) Correlation between observed WGS intronic mutation probability (pan-cancer) and those of the Bayesian (blue) or 'Matched' (gray) models.
Supplementary Figure 3 Density scores are highly correlated and enriched for known cancer driver genes. (a) Right, the pan-cancer relationship between gene-specific and global binomial probabilities is shown. Left, correlation (Spearman ρ) is plotted as a function of density score in the low-to-mid density range. (b) Somatically altered, SNV-driven cancer gene (SCG) fold enrichment (red) and significance of enrichment (blue) of region-associated genes as a function of region density score. (c) Fraction of SCGs that are region associated (blue) and fraction of region-associated genes that are SCGs (red) as a function of region density score.
Supplementary Figure 4 Most mutation cluster density scores fit the null distribution and lie on the diagonal in a quantile-quantile plot, indicating that simulations accurately capture the significance of mutation densities. Quantile-quantile plots of the observed (y axis) and simulated (x axis) density scores (−log10(P_Density)). (a–d) Representative examples from bladder cancer (BLCA) (a), breast cancer (BRCA) (b), colorectal cancer (COLR) (c) and diffuse large B-cell lymphoma (DLBC) (d) are shown. The solid line represents the threshold for density score (−log10(P_Density)) that guarantees FDR ≤ 5% in each cancer type. The dashed line indicates the line corresponding to y = x. (e) Violin plots of density scores in an expanded set of 90 additional colorectal cancer simulations. (f) The distributions of density scores in the original (10×; blue) and expanded (90×; yellow) sets of simulations are highly concordant and yield tightly correlated FDR estimates for the observed density scores (inset, r² = 0.99985). Dashed lines indicate thresholds of FDR ≤ 5%. (g) 99.2% (128/129) of SMRs thresholded by FDR (≤5%) are shared by the FDR 10× and FDR 90× thresholded sets.
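As an illustration of how a density-score threshold with a target FDR could be derived from null simulations, the following is a minimal sketch. It assumes a simple estimator, the mean number of simulated scores passing the threshold per simulation divided by the number of observed scores passing it; the function names and toy scores are hypothetical, and this is not necessarily the procedure used to produce the thresholds shown in the figure.

```python
def empirical_fdr(observed, simulated_runs, threshold):
    """Estimate the FDR at a score threshold from null simulations."""
    n_obs = sum(1 for s in observed if s >= threshold)
    if n_obs == 0:
        return 0.0
    # Expected false positives: mean count of null scores passing, per simulation.
    n_null = sum(
        sum(1 for s in run if s >= threshold) for run in simulated_runs
    ) / len(simulated_runs)
    return min(1.0, n_null / n_obs)


def min_threshold_for_fdr(observed, simulated_runs, max_fdr=0.05):
    """Smallest observed score whose empirical FDR is at most max_fdr."""
    for t in sorted(set(observed)):
        if empirical_fdr(observed, simulated_runs, t) <= max_fdr:
            return t
    return None


# Toy data (hypothetical): observed scores and two null simulations.
observed_scores = [1, 2, 3, 8, 9, 10]
null_runs = [[1, 2, 3], [1, 2, 4]]
threshold = min_threshold_for_fdr(observed_scores, null_runs)
```

Adding more simulations only changes the averaging in `n_null`, which is consistent with the caption's observation that the 10 and 90 simulation sets give nearly identical FDR estimates.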
Supplementary Figure 5 Robust SMRs capture ~95% of high-confidence SMRs from ten cancer types. Robust SMRs are 58.8-fold enriched for somatic, SNV-driven Cancer Gene Census (CGC) genes (P = 2.4 × 10^−34). (a) Overlap (blue) of robust SMRs (cyan) and high-confidence SMRs (gray). (b, c) Fraction of SMRs per cancer type classified as robust. Analyses in a and b are limited to high-confidence SMRs from the ten cancer types (green) with sufficient intronic mutation clusters for intron-based FDR estimation, as shown in b.
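For readers who want to see how a fold enrichment and its significance can be computed for a gene set, here is a minimal sketch. It assumes a one-sided hypergeometric test, which is a common choice but is not confirmed by the caption as the test behind the reported P value; the function names and the gene counts in the usage lines are hypothetical.

```python
from math import comb


def fold_enrichment(k, n, K, N):
    """Observed over expected fraction: (k / n) / (K / N).

    k: selected genes that are in the reference set
    n: selected genes
    K: reference-set genes in the background
    N: background genes
    """
    return (k / n) / (K / N)


def hypergeom_sf(k, n, K, N):
    """P(X >= k) when drawing n genes without replacement from N, K of them in the set."""
    total = comb(N, n)
    upper = min(n, K)
    # math.comb returns 0 when the lower index exceeds the upper, so impossible terms vanish.
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


# Hypothetical counts: 5 of 10 selected genes fall in a 50-gene set out of 1,000.
enrichment = fold_enrichment(5, 10, 50, 1000)
p_value = hypergeom_sf(5, 10, 50, 1000)
```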
Supplementary Figure 6 Contribution of trinucleotide and APOBEC mutation heterogeneity in SMR identification. (a) The fraction (ƒ) of mutated sites in endometrial cancer (UCEC) is plotted for each trinucleotide. Trinucleotides are oriented by transcription strand. Trinucleotides associated with APOBEC mutation signatures at high and low rates are labeled orange and pink, respectively. Notably, ƒ_TCT > ƒ_TCA and ƒ_AGA > ƒ_AGT. As shown in the inset (i), SMR mutation sites show a generally reduced fraction of APOBEC-associated trinucleotides as compared to the global set of somatic mutation sites in endometrial cancer. (b) As shown for endometrial cancer (i), the deviation in the observed over the (single-nucleotide) expected trinucleotide representation was compared with the fold change in the trinucleotide representation in SMR mutation sites for cancers with ≥250 SMR mutation sites (positions). These cancer types encompass 79% of all SMRs. On average, trinucleotide mutation heterogeneity not captured by single-nucleotide transition/transversion probabilities contributes to only 7.9% of the change in trinucleotide representation in SMRs. (a, b) Analyses performed with high- and medium-confidence SMRs. (c) Histogram of the fraction of mutations that are APOBEC associated per SMR. (d) Fraction of SMRs in which APOBEC-associated mutations are statistically increased (P < 0.05, Holm-Bonferroni) per cancer. As shown in the inset (i), 4.0% of identified SMRs (n = 872) are driven by APOBEC-associated mutations. Raw (uncorrected) P values would indicate that 12% of SMRs have higher than expected APOBEC mutation signatures.
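The gap between the corrected (4.0%) and raw (12%) fractions reflects multiple-testing correction. The sketch below implements the standard Holm-Bonferroni step-down procedure on a toy list of P values; the function name and the example values are hypothetical and are not taken from the study.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Step-down rule: compare the rank-th smallest P value to alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # every larger P value is retained as well
    return reject


# Toy P values (hypothetical): all four pass a raw 0.05 cutoff, only two survive correction.
raw_p = [0.01, 0.04, 0.03, 0.005]
significant = holm_bonferroni(raw_p)
```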
Supplementary Figure 7 Histogram of the fraction of somatic mutations within each coding-region SMR that are predicted to alter protein sequence or RNA splicing.
Supplementary Figure 8 Histogram of Gini coefficients of dispersion for nonsynonymous mutations per gene. Gini coefficients were calculated on the basis of the number of nonsynonymous mutations contained per residue mutated in each cancer for CGC genes. For each CGC gene (n = 522), the maximum coefficient across cancers is plotted 31,32. A set of outliers with extreme Gini coefficients is labeled. 81% of CGC genes with unassociated SMRs have Gini coefficients <0.1.
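To make the dispersion measure concrete, the following minimal sketch computes a Gini coefficient from per-residue mutation counts using the standard sorted-rank formula. The function name and the example counts are hypothetical, and the exact variant of the coefficient used for the figure is not specified in the caption.

```python
def gini(counts):
    """Gini coefficient of non-negative counts (0 = evenly spread, near 1 = concentrated)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Sorted-rank formula: G = sum((2i - n - 1) * x_i) / (n * sum(x)), with i starting at 1.
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(values, start=1))
    return weighted / (n * total)


# Hypothetical per-residue nonsynonymous mutation counts for two genes.
evenly_mutated = gini([5, 5, 5, 5])
hotspot_dominated = gini([1, 1, 1, 97])
```

A gene whose mutations are spread evenly across its mutated residues scores near 0, whereas a gene dominated by a single hotspot residue scores much higher.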
Supplementary Figure 9 Molecular dynamics analysis of wild-type and mutant PIK3CA in complex with PIK3R1. (a) Wild-type (WT) PIK3CA in complex with PIK3R1. (b) The K111E mutant of PIK3CA in complex with PIK3R1. (c) The G118D mutant of PIK3CA in complex with PIK3R1. The interaction enthalpy across the full PIK3CA-PIK3R1 binding interface follows a bimodal distribution (as shown in Fig. 3d). Binding Mode 1 (blue) is preferred by WT PIK3CA and corresponds to binding interactions that are on average 1.8 kcal/mol tighter than those in Binding Mode 2 (orange), which predominates in the K111E mutant of PIK3CA. The difference between the two binding modes becomes apparent in the salt bridge pattern of R79. In Binding Mode 1, R79 is a key component of the binding interface (with E1215 and E1222 of PIK3R1; shown in gray helices). In Binding Mode 2, a salt bridge between R79 and E81 is in direct competition to this binding interaction (orange panel of a). In WT PIK3CA, this competition is attenuated by the interaction of K111 with E81 (shown in the blue panel of a) and to a similar degree by the interaction of R108 with E81 (data not shown). In the K111E mutant of PIK3CA, a similar attenuation can only occur through the simultaneous recruitment of R108 (blue panel of b). Taken together, the data suggest that K111E causes an inversion of the bimodal binding distribution and effectively weakens the interactions between PIK3CA and PIK3R1 as compared to WT PIK3CA. (c) Molecular dynamics simulations of the G118D mutant of PIK3CA show a similar weakening of the binding interactions with R79 at their core, albeit through the reshaping of a more extensive network of salt bridges that involves D118. Data are from 20 independent 0.1 μs molecular dynamics simulations. The individual distributions in Figure 3d correspond to distinct conformational states at the binding interface. Their cumulative populations were normalized and are reported as percentages.
Supplementary Figure 10 Enrichment of CGC genes among SMR-based protein-coding drivers and SMR-identified binding interfaces. (a) Fraction of SMR- and OncodriveCLUST-identified protein-coding genes in the Cancer Gene Census (CGC). OncodriveCLUST results were obtained from Tamborero et al. 11. Driver analyses in endometrial (UCEC), ovarian (OVAR) and lung squamous cell carcinoma (LUSC) were performed with the same exome data sets. Breast cancer (BRCA) results were obtained with distinct exome data sets and are therefore not directly comparable. (b) The fraction of SMR-identified and previously reported 51 protein and DNA interaction interfaces with recurrent cancer somatic mutations. For direct comparison, we consider only interactions with nucleic acids and proteins. All CGC genes with previously reported 51 somatically altered nucleic acid or protein interfaces are captured by SMRs (inset).
Supplementary Figure 11 Molecular structure and spatial mapping of an SMR on histone H2B. An SMR on histone H2B (HIST1H2BK.1; orange) is highlighted within the structure of the human nucleosome core particle (PDB, 2CV5). Histone H2B (blue), histone H2A (teal) and histone H4 (green) components are highlighted.
Supplementary Figure 12 NFE2L2 SMRs alter KEAP1 binding interfaces. The structures of SMR NFE2L2.1 (orange, shown here) and NFE2L2.2 (Fig. 4g) were mapped to NFE2L2 structures (PDB, 2FLU and 3WN7). A sector of recurrent lung adenoma alterations on KEAP1 (teal) with density score FDR ≤ 5% did not meet the 2% mutation frequency cutoff. The structure of NFE2L2.2 mapped to the mouse NFE2L2-KEAP1 co-crystal structure (PDB, 3WN7) is shown in Figure 4g.