SUPPLEMENTARY INFORMATION doi:10.1038/nature12912

Figure S1. Distribution of mutation rates and spectra across 4,742 tumor-normal (TN) pairs from 21 tumor types, as in Figure 1 from Lawrence et al. Nature (2013). Each dot corresponds to a tumor-normal pair, with vertical position indicating the total frequency of somatic mutations in the exome. Tumor types are ordered by their median somatic mutation frequency, with the lowest frequencies (left) found in hematological and pediatric tumors, and the highest (right) in tumors induced by carcinogens such as tobacco smoke and UV light. Mutation frequencies vary more than 1,000-fold between the lowest and highest mutation rates across cancer and also within several tumor types. The lower panel shows the relative proportions of the six different possible base-pair substitutions, as indicated in the legend on the left.
Figure S2. Radial plot representing mutational processes across 4,742 tumor-normal (TN) pairs from 21 tumor types, as in Figure 2 from Lawrence et al. Nature (2013). The angular space is compartmentalized into the fifteen different factors discovered by NMF. The distance from the center represents the total mutation frequency. Different tumor types segregate into different compartments based on their mutation spectra. Notable examples are: lung adenocarcinoma and lung squamous carcinoma (red; ~1 o'clock position); melanoma (black; ~4 o'clock position); esophageal (light green), colorectal (dark green), and endometrial cancer (purple) (~8-9 o'clock position); samples harboring mutations of the HPV or APOBEC signature (bladder, and some lung and head and neck samples, marked in yellow, red, and blue respectively; ~7 o'clock position); and AML samples showing a Tp*A->T signature (~11 o'clock position).
[Figure S3, panel a labels: MutSigCV, burden of protein-altering mutations (nonsilent vs. silent, compared to background); MutSigCL, clustering of mutations in hotspots; MutSigFN, functional impact of the mutations (low, medium, or high evolutionary conservation).]

Figure S3. a. Three statistical tests were used to detect cancer genes: mutation burden (calculated by MutSigCV); mutation clustering (calculated by MutSigCL); and enrichment in functional sites (calculated by MutSigFN). b. Q-Q plot of the p-values obtained after combining the three tests, when applied to the combined set of 4,742 TN pairs.
Figure S4. Mutation patterns for six known and novel cancer genes. Similar diagrams for all genes are available at http://www.tumorportal.org. a. EGFR shows distinctive tumor-type-specific concentrations of mutations at different regions of the gene. b. APC shows mutations throughout the gene in many tumor types, but colorectal cancer shows a distinctive concentration of truncating mutations in the N-terminal half of the protein. c. RHEB, which encodes a small GTPase in the RAS superfamily, shows a mutational hotspot in the effector domain. d. SMARCB1, which is significant based on its mutations in rhabdoid tumor and esophageal adenocarcinoma, also has truncating mutations in several other tumor types. e. RHOA, analogously to RHEB, encodes a small GTPase and shows a mutational hotspot in the effector domain. f. EZH1, which encodes a histone methyltransferase, does not achieve significance in any one tumor type, but does so when the cohort is analyzed as a whole.
[Figure S5: one panel per tumor type (Acute myeloid leukemia, Bladder, Breast, Carcinoid, Chronic lymphocytic leukemia, Colorectal, Diffuse large B cell lymphoma, Endometrial, Esophageal adenocarcinoma, Glioblastoma multiforme, Head and neck, Kidney clear cell, Lung adenocarcinoma, Lung squamous cell carcinoma, Medulloblastoma, Melanoma, Multiple myeloma, Neuroblastoma, Ovarian, Prostate, Rhabdoid tumor). Each panel arrays gene symbols along a gene p-value axis; a legend gives the mutation frequency bins for the circle colors.]

Figure S5. Cancer genes in 21 tumor types. For each tumor type, genes are arrayed on the horizontal line according to p-value (combined value for the three tests in MutSig). The yellow region contains genes that achieve FDR q ≤ 0.1. The orange interval contains p-values for the next 20 genes. The color of the gene's name indicates whether the gene is a known cancer gene (blue), a novel gene with a clear connection to cancer (red; discussed in text), or an additional novel gene (black). The color of the circle associated with each gene name indicates the frequency (percent of patients carrying non-silent somatic mutations) in the corresponding tumor type. The total number of samples analyzed (n) is indicated beneath the name of each tumor type.
Figure S6. a. Heatmap showing the similarity between each pair of tumor types. The similarity score for each pair of tumor types is defined as the number of genes significantly mutated in both tumor types, divided by the number of genes significantly mutated in either tumor type. The color scale runs from blue (similarity = 0) to red (similarity = 1). b. Dendrogram based on pairwise distances between tumor types, where each distance equals 1 minus the similarity score.
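The similarity score in Figure S6 (shared significant genes divided by genes significant in either type) can be sketched in a few lines. This is an illustrative sketch only: the function names, the toy gene lists, and the handling of two empty lists are assumptions, not part of the original analysis.

```python
from itertools import combinations

def similarity(genes_a, genes_b):
    """Genes significant in both types divided by genes significant in either."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty gene lists are scored as dissimilar
    return len(a & b) / len(union)

def pairwise_distances(gene_sets):
    """Return {(type_1, type_2): 1 - similarity} for every pair of tumor types."""
    return {
        (t1, t2): 1.0 - similarity(gene_sets[t1], gene_sets[t2])
        for t1, t2 in combinations(sorted(gene_sets), 2)
    }

# Hypothetical toy gene lists, not results from the paper.
toy = {
    "type_A": {"GENE1", "GENE2", "GENE3"},
    "type_B": {"GENE2", "GENE3", "GENE4"},
    "type_C": {"GENE5"},
}
distances = pairwise_distances(toy)
```

The resulting distances could be passed to any hierarchical clustering routine to draw a dendrogram; the caption does not specify which linkage was used.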
[Figure S7 panels: BLCA, ESO, LUAD, BRCA, GBM, LUSC, CLL, HNSC, MEL, CRC, KIRC, MM, DLBCL, LAML, UCEC.]

Figure S7. Down-sampling analysis shows that the number of candidate cancer genes discovered is still increasing within each tumor type. Each point represents a random subsampling of the patients. The X-axis indicates the number of patients in the sample; the Y-axis indicates the number of significant genes in the sample. The blue line is a smoothed fit.
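The down-sampling procedure in Figure S7 can be sketched as repeated random subsampling of patients followed by a significance count. In the sketch below, `toy_count_significant` is a deliberately simple stand-in for the MutSig analysis (which is not reproduced here), and all names, parameters, and data are hypothetical.

```python
import random

def downsample_curve(patient_genes, sizes, count_significant, n_repeats=5, seed=0):
    """Return (sample_size, n_significant_genes) points from random patient subsets."""
    rng = random.Random(seed)
    patients = sorted(patient_genes)
    points = []
    for size in sizes:
        for _ in range(n_repeats):
            subset = rng.sample(patients, size)  # sampling without replacement
            genes = {p: patient_genes[p] for p in subset}
            points.append((size, count_significant(genes)))
    return points

def toy_count_significant(subset, min_patients=2):
    """Toy stand-in for MutSig: a gene counts if mutated in >= min_patients patients."""
    counts = {}
    for genes in subset.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    return sum(1 for c in counts.values() if c >= min_patients)

# Hypothetical toy cohort: patient ID -> set of mutated genes.
toy_patients = {
    "P1": {"GENE1", "GENE2"},
    "P2": {"GENE1"},
    "P3": {"GENE2", "GENE3"},
    "P4": {"GENE3"},
}
points = downsample_curve(toy_patients, sizes=[2, 4],
                          count_significant=toy_count_significant)
```

Plotting `points` with a smoothed fit would give a curve of the kind shown in each panel.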
[Figure S8 plot; label recovered from the figure: "# tumor types".]

Figure S8. Down-sampling of the combined cohort, with FDR correction applied by subtracting the expected number of false positives at each down-sampling point.
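The caption for Figure S8 does not say how the expected number of false positives is computed. One common reading, used in the sketch below purely as an assumption, is that a gene list called at FDR threshold q is expected to contain about q times its length in false positives. The function name and the default threshold are illustrative.

```python
def fdr_corrected_count(n_significant, q_threshold=0.1):
    """Subtract the expected false positives (assumed to be q * n) from a gene count."""
    expected_false_positives = q_threshold * n_significant
    return n_significant - expected_false_positives

# Hypothetical down-sampling points: (number of patients, number of significant genes).
raw_points = [(1000, 50), (2000, 100), (4000, 200)]
corrected_points = [(n, fdr_corrected_count(k)) for n, k in raw_points]
```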
[Figure S9 axes: Y, number of TN pairs needed for power of 90% in 50% and 90% of genes; X, somatic mutation frequency (/Mb). Tumor types plotted: Rhabdoid tumor, Medulloblastoma, Acute myeloid leukemia, Carcinoid, Neuroblastoma, Chronic lymphocytic leukemia, Prostate, Breast, Multiple myeloma, Ovarian, Kidney clear cell, Glioblastoma multiforme, Endometrial, Colorectal, Diffuse large B cell lymphoma, Head and neck, Esophageal adenocarcinoma, Bladder, Lung adenocarcinoma, Lung squamous cell carcinoma, Melanoma.]

Figure S9. Number of tumor-normal pairs required to achieve 90% power for 90% of genes (solid) and 90% power for the median gene (dashed). Other details of the figure are exactly as described in Figure 5 of the main text.
Supplementary Tables

Table S1. List of source datasets analyzed in this work, and references to the corresponding publications.

Table S2. The 260 significantly mutated cancer genes found by analysis with the MutSig suite. The first set of columns (A-B) lists the gene symbol and its full name. The second set of columns (D-I) indicates the membership of the genes in various sets. Column D contains a "1" if the gene is listed in the Cancer Gene Census (v65) as being recurrently somatically mutated. Column E contains a "1" if the gene was not previously reported to be somatically mutated in cancer at a significant rate, either in the CGC or in a recent publication (Supplementary Table 3). Columns F and G report whether the gene is a member of the Cancer5000 and Cancer5000-S sets, respectively. Column H contains a "1" for those thirty genes that were identified as significantly mutated only when analyzing the combined cohort of 4,742 samples. Column I contains a "1" for the 33 novel genes with clear and consistent connections to cancer hallmarks that were discussed in the text. The remaining sets of columns differ among the five tabs of the spreadsheet. The first tab ("Individual q-values") reports the q-values from analyzing each tumor type separately and correcting for only the 18,388 genes; this corresponds to the Cancer5000 set. The second tab ("q-values when testing 400K hyp.") reports the q-values after the more stringent correction for 18,388 genes x 22 analyses; this corresponds to the Cancer5000-S set. The third tab ("q-value from RHT") reports the q-values obtained when performing the Restricted Hypothesis Testing (RHT) as described in the text. The fourth tab ("Frac. of non-silent mutations") lists the number (and percentage) of patients of each tumor type carrying a non-silent mutation in the gene. The fifth and last tab ("P-values") lists the raw p-values obtained from each statistical test (MutSigCV, MutSigCL, MutSigFN), as well as from each pairwise combination, and the final overall (all three tests) combined p-value, for each gene.

Table S3. List of the 21 tumor types studied, and the significantly mutated genes found by the MutSig suite in each tumor type. Column D lists the genes that were found to be significant when directly analyzing the full list of 18,388 genes. Column E lists the additional genes found to be significant when performing the Restricted Hypothesis Testing (RHT) as described in the text. In parentheses after each gene name are two numbers: the number and percentage of patients carrying a non-silent mutation in the gene.

Table S4. List of references reporting the identification of candidate cancer genes. This table lists the genes that are not yet listed in the Cancer Gene Census (CGC) but have been reported in previous publications.
Table S5. List of references to biological literature supporting the 33 novel candidate cancer genes with clear and compelling connections to cancer biology.

Table S6. Summary of the analysis comparing the performance of each of the three MutSig metrics separately, in pairwise combinations, and all three combined as in the main analysis. The table lists the number of candidate cancer genes in the Cancer5000 list and the Cancer5000-S list, as well as the number of genes identified from the list of CGC genes that contain somatic mutations in the tumor types we studied, and the number of genes identified from the list of 33 novel candidate cancer genes with compelling connections to cancer biology.
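The two correction levels described for Table S2 differ only in how many hypotheses are corrected for: 18,388 genes for a single analysis, versus 18,388 genes x 22 analyses for the stringent set. The sketch below works through that arithmetic and shows a generic q-value computation. It assumes the Benjamini-Hochberg procedure, which the table descriptions do not name, and the function and constant names are illustrative.

```python
N_GENES = 18_388      # genes tested per analysis, from the Table S2 description
N_ANALYSES = 22       # analyses in the stringent correction, from the Table S2 description
N_HYPOTHESES_STRINGENT = N_GENES * N_ANALYSES  # 404,536 hypotheses

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the same order as p_values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values

# Hypothetical p-values, for illustration only.
example_q = benjamini_hochberg([0.01, 0.02, 0.03, 0.04])
```

Under this assumption, the stringent set would be obtained by pooling the p-values from all 22 analyses into one list before computing q-values, rather than correcting each analysis separately.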