Nature Genetics: doi: /ng Supplementary Figure 1. Mutational signatures in BCC compared to melanoma.



Supplementary Figure 1 Mutational signatures in BCC compared to melanoma. (a) The effect of transcription-coupled repair as a function of gene expression in BCC. Tumor type-specific gene expression levels were used for BCC (this study) and cutaneous melanoma (ref. 21). Genes were grouped into five equally sized bins by their level of expression. 95% confidence intervals for the ratio between the two binomial distributions were calculated using the R function riskscoreci(). (b) Fraction of each nucleotide 5′ to C>A mutations.

Supplementary Figure 2 Distribution of mutations in BCC driver genes. (a-n) Distribution of mutations in the full sample set (293 BCC samples) along the FBXW7 (a), PPP6C (b), STK19 (c), CASP8 (d), RB1 (e), KNSTRN (f), ERBB2 (g), NOTCH2 (h), NOTCH1 (i), ARID1A (j), SMO (k), TP53 (l), PTCH1 (m) and SUFU (n) protein diagrams. The orange lollipops represent truncating mutations, the purple lollipops represent missense mutations and the blue lollipops represent both truncating and missense events affecting the same amino acid. Protein functional domains are represented by colored boxes. The most recurrent events for each protein are labeled with the amino acid change. Protein diagrams were generated with cBioPortal tools.

Supplementary Figure 3 SCNAs in BCC exomes. Top, overall profile of SCNAs in BCC. LOHs and cnLOHs are on the top (dark blue); amplified regions are on the bottom (red). The positions of relevant genes are marked. Bottom, per-tumor SCNA profile. Each line represents a sample, and each column represents a chromosome. Loss of an allele (LOH) is depicted in blue, cnLOH is indicated in black and copy number gain is indicated in yellow. Chromosomes X and Y have been excluded.

Supplementary Figure 4 LATS2 mutations in BCC. (a) Distribution of mutations (in the 136 sequenced exomes only) along the LATS2 protein schema. The purple lollipops represent missense mutations, and the orange lollipops represent truncating mutations. Protein functional domains are represented by green boxes. Events occurring two or more times are labeled with the amino acid change. (b) Kinase domain structure of LATS2 highlighting the position of residue Pro1004. The protein diagram was generated with cBioPortal tools.

Supplementary Figure 5 Fraction of tumors with driver mutations per category. The bars represent the fraction of samples with driver mutations per BCC category indicated in the header (only non-clonal samples were used in this analysis). Vismo, vismodegib. Genes harboring driver mutations are labeled on the x axis. MYCN-p.44 is a subcategory containing only tumors with MYCN p.44 mutations; PTPN14-tr corresponds to PTPN14 truncating mutations. The fraction of tumors represented by the bars can be found at the left.

Supplementary Note

Anomaly search for oncogenes in sequencing data using TumOnc

The goal of the algorithm is to find genes with a specific mutation pattern, corresponding to a high mutation rate in a nucleotide at a fixed position. This is the principal difference between the algorithm and MutSigCV 1.4 of Lawrence et al. (ref. 1), where the overall mutation rate over a gene (taking into account mutation contexts) is analyzed. Since the mutation signal in a single nucleotide is smaller than over a full gene, TumOnc is less specific, but significantly more sensitive in identifying oncogenes. The current version of the algorithm is written in Python and is available upon request.

The TumOnc algorithm consists of two major steps:

1. Creation of the background probability model for the mutations.
2. Search for anomalously mutated genes.

Here we describe these in order.

Background model

The background model is based on the observation by Lawrence et al. (ref. 1) that the mutation rate in the genes is significantly modulated by the values of the covariates of the genes, such as DNA replication time, expression level, and chromatin compartment. While in Lawrence et al. (ref. 1) the genes were categorized in groups ("bagels") of adaptive size with similar covariates, we followed a different approach. We factorize the probability into factors dependent on the mutation context and a smooth function of the covariate values for the gene (which is chosen to be exponential). To be precise, the mutation rate $p_{cgp}$ (or, the probability that the given nucleotide is mutated in the given patient) is expressed as

$$p_{cgp} = f_p \, f_c \, p_{\mathrm{avg}} \exp\Big(\sum_i a_i V_{i,g}\Big) \tag{1}$$

where $c$ is the context of the nucleotide in question, $g$ is the gene, and $p$ is the patient. $f_c$ and $f_p$ are the context- and patient-specific factors (defined to have an average equal to one), $p_{\mathrm{avg}}$ is the average overall mutation rate, $V_{i,g}$ are the values of the covariates for the given gene (with $i = 1, 2, 3$ corresponding to the logarithm of expression, replication time and chromatin compartment, respectively), and $a_i$ are the response coefficients.

The context $c$ in the calculations was chosen as the three nucleotides around the nucleotide in question (always defined along the strand with C or T in the reference genome). It is possible to use a different context with minimal modification of the procedure, either by distinguishing some subclasses of the 3-nucleotide context (leading to larger, but rougher available statistics), or by taking into account the type of mutation (the latter requires just a slight modification to account for multiple possible mutation outcomes, instead of the binary mutation/no-mutation logic).

Strictly speaking, the exponential form is not a unique choice, but it is a reasonable first guess based on the overall correlation of mutation probability and the values of the covariates for the gene.
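As an illustration of equation (1), the following minimal sketch evaluates the factorized background rate for every (context, gene, patient) triple with NumPy broadcasting. The function name `background_rate`, the array layout, and all numerical values are assumptions made for this example; they are not taken from the TumOnc code, which is not shown in this note.

```python
import numpy as np

def background_rate(f_c, f_p, p_avg, a, V):
    """Evaluate equation (1) on a full (context, gene, patient) grid.

    f_c   : (n_contexts,) context factors, defined to average to one
    f_p   : (n_patients,) patient factors, defined to average to one
    p_avg : float, average overall mutation rate
    a     : (n_cov,) response coefficients a_i
    V     : (n_cov, n_genes) covariate values V_{i,g}
    Returns an array of shape (n_contexts, n_genes, n_patients).
    """
    gene_factor = np.exp(a @ V)  # exp(sum_i a_i V_{i,g}), shape (n_genes,)
    return (p_avg
            * f_c[:, None, None]
            * gene_factor[None, :, None]
            * f_p[None, None, :])

# Hypothetical toy inputs: 2 contexts, 4 genes, 3 patients, 3 covariates.
f_c = np.array([0.5, 1.5])
f_p = np.array([0.8, 1.0, 1.2])
a = np.array([-0.3, 0.2, 0.1])
V = np.array([[0.0, 1.0, 2.0, -1.0],   # log expression
              [0.0, 0.5, -0.5, 1.0],   # replication time
              [0.0, 1.0, 0.0, 1.0]])   # chromatin compartment
p = background_rate(f_c, f_p, 1e-6, a, V)
print(p.shape)
```

Gene 0 in the toy data has all covariates equal to zero, so its rate reduces to the product of the context factor, the patient factor, and the average rate.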

Thus, the probability model (1) is parametrized by the parameters $f_p$, $f_c$, $p_{\mathrm{avg}}$, and $a_i$. The patient- and context-specific rates are defined to have an average value of one, $\langle f_p \rangle = \langle f_c \rangle = 1$. The observed mutations are given by the mutation counts $n_{cgp}$, the number of possible mutation sites for each gene-context pair $N_{cg}$, and the number of patients $C$ in the dataset. For a given mutation set one can find the optimal values of $f_p$, $f_c$, $p_{\mathrm{avg}}$, and $a_i$ by the maximum likelihood method, i.e. by extremizing the log-likelihood function

$$\ln P = \sum_{cgp} \Big[ n_{cgp} \log\Big( f_p f_c \, p_{\mathrm{avg}} \exp\Big(\sum_i a_i V_{i,g}\Big) \Big) + (C N_{cg} - n_{cgp}) \log\Big( 1 - f_p f_c \, p_{\mathrm{avg}} \exp\Big(\sum_i a_i V_{i,g}\Big) \Big) \Big] \tag{2}$$

It is possible to find the generic exact maximum of this function, but the problem can be simplified by approximating the second logarithm as $\log(1 - p) \approx -p$, which is an excellent approximation given the smallness of the overall mutation rate, $p \ll 1$. Another simplifying approximation used was to define the patient-specific mutation rates as

$$f_p = C \, \frac{\sum_{cg} n_{cgp}}{\sum_{cgp'} n_{cgp'}} \tag{3}$$

This is an excellent approximation for low mutation rates. After these approximations, the values of $f_c \, p_{\mathrm{avg}}$ and $a_i$ are found by an exact analytic solution for the maximum of (2) and further numerical maximization of $\ln P$ over the parameters $a_i$ using the Newton-Raphson method.

Note also that the whole set (synonymous and nonsynonymous) of single-nucleotide mutations should be used to generate the background model. Using only a subset (only missense mutations, for example) would lead to underestimation of the mutation probability.

The resulting background model (1) can be used as the predicted mutation rate for each nucleotide in the target region. Owing to statistical requirements, it is possible to obtain the background model on one set of data (e.g. whole-exome sequencing of a smaller set of patients) and use it for a constrained set of data (e.g. Cancer Panel) from a separate set of patients.
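The two simplifications above can be sketched in a few lines. `patient_factors` implements equation (3) directly. `context_rates` is an assumption of this example rather than the authors' published solution: it is the closed form obtained by setting the derivative of the approximated log-likelihood (with $\log(1-p) \approx -p$) with respect to $f_c \, p_{\mathrm{avg}}$ to zero at fixed $a_i$, taking $N_{cg}$ as the number of sites per patient. All function names and toy numbers are hypothetical.

```python
import numpy as np

def patient_factors(n):
    """Equation (3): f_p = C * sum_{cg} n_cgp / sum_{cgp} n_cgp.

    n : (n_contexts, n_genes, n_patients) array of mutation counts.
    """
    per_patient = n.sum(axis=(0, 1))
    return n.shape[2] * per_patient / per_patient.sum()

def context_rates(n, N, f_p, a, V):
    """Closed-form f_c * p_avg under the small-rate approximation.

    Assumed derivation (not from the source): with log(1 - p) ~ -p the
    stationary point in q_c = f_c * p_avg at fixed a is
        q_c = sum_{gp} n_cgp / (sum_g N_cg e_g * sum_p f_p),
    where e_g = exp(sum_i a_i V_{i,g}) and N_cg counts sites per patient.
    """
    gene_factor = np.exp(a @ V)                         # shape (n_genes,)
    exposure = (N * gene_factor[None, :]).sum(axis=1) * f_p.sum()
    return n.sum(axis=(1, 2)) / exposure

# Hypothetical toy data: 2 contexts, 2 genes, 3 patients.
n = np.arange(12).reshape(2, 2, 3)      # mutation counts n_cgp
N = np.full((2, 2), 1000.0)             # sites per gene-context pair
f_p = patient_factors(n)
q_c = context_rates(n, N, f_p, np.zeros(3), np.zeros((3, 2)))
print(f_p, q_c)
```

In a full fit, a step like this would be repeated inside the Newton-Raphson iteration over the coefficients $a_i$, which is not shown here.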
This approach maximizes the use of passenger mutations for the background model determination, as compared to using only mutations in the Cancer Panel genes. In the analysis presented in the manuscript, the background model was generated from the whole exome panel mutation rate, while the search for anomalously mutated genes (described in the next subsection) was performed over the Cancer Panel genes only, for performance reasons.

Search for anomalous genes

This step significantly differs from the MutSigCV 1.4 algorithm. TumOnc searches for individual significantly mutated nucleotides in the sample. This is done by a variation of the 2-dimensional reduction described in Lawrence et al. (ref. 1) for the search of significantly mutated genes. That is, for each nucleotide with mutations in several (more than 3 or 4 in the analysis) patients we test the hypothesis that this mutation could have been produced by the background probability. The critical statistic for such an event is constructed as the log-probability of the event with the expected mutation rates $p_{cgp}$ for the given nucleotide with context $c$ and gene $g$ (which still depends on the patient).

To reduce the number of possible outcomes to analyze, we use a $d$-dimensional reduction: all the patients are ordered by increasing probability, so that $p_1 < p_2 < \dots < p_C$. Each possible mutation event can then be labeled by the $d$ numbers of the patients with this mutation in this order. For example, for $d = 2$, the event $(0, 0)$ would mean no mutations, the event $(0, n_1)$ would mean only one mutation, for the patient $n_1$, while the event $(n_1, n_2)$ would mean two mutations with probabilities $p_{n_1} < p_{n_2}$, and possibly more mutations with larger probabilities. The probability of such an event is

$$P(n_1, n_2) = \begin{cases} \prod_{i=1}^{C} (1 - p_i), & n_1 = n_2 = 0 \\ \prod_{i=1}^{n_2 - 1} (1 - p_i) \, p_{n_2}, & n_1 = 0,\ n_2 > 0 \\ \prod_{i=1}^{n_1 - 1} (1 - p_i) \, p_{n_1} \prod_{i=n_1+1}^{n_2 - 1} (1 - p_i) \, p_{n_2}, & n_2 > n_1 > 0 \end{cases} \tag{4}$$

It is easy to generalize this for $d > 2$, i.e. for events described by a set $(n_1, n_2, n_3)$ or longer. This probability was used to generate the critical statistic, so the p-value for the hypothesis $H(n_1, n_2)$ that a given event $(n_1, n_2)$ is a background event is obtained by summation of the probabilities for all possible mutation events with smaller probability (one-sided test):

$$P_{H(n_1, n_2)} = \sum_{(m_1, m_2):\; P(m_1, m_2) < P(n_1, n_2)} P(m_1, m_2) \tag{5}$$

In practice, we used a 4-dimensional reduction for a sample set of about 60 patients. Unfortunately, the method quickly becomes time consuming with a large number of patients (roughly as $C^d$), and a simplified approximate method should be used for larger datasets.

P-values and q-values

Finally, the p-values obtained were converted to q-values using the Benjamini-Hochberg procedure, with the total number of hypotheses taken as the whole number of nucleotides in the exome.

SCNA detection with the HMM

The regions of SCNAs in the tumor were probabilistically inferred using a twenty-state hidden Markov model (HMM) approach fed by normalized coverage ratios for ordered exons and by germline heterozygous SNPs.
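Returning briefly to the q-value step of the anomaly search: the conversion can be sketched as a standard Benjamini-Hochberg adjustment in which the number of hypotheses is supplied explicitly, so that it can be set to the number of nucleotides in the exome even though only a few nucleotides are actually tested. The function name `bh_qvalues`, the parameter `m_total`, and the numbers below are assumptions of this example.

```python
import numpy as np

def bh_qvalues(pvals, m_total):
    """Benjamini-Hochberg q-values with an explicit hypothesis count.

    pvals   : p-values of the tested nucleotides
    m_total : total number of hypotheses (here, assumed to be the number
              of nucleotides in the exome); untested hypotheses are
              treated as if they ranked after every tested one.
    """
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    ranks = np.arange(1, len(p) + 1)
    scaled = p[order] * m_total / ranks
    # Enforce monotonicity: q_(k) = min over j >= k of p_(j) * m / j.
    q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty_like(p)
    q[order] = np.minimum(q_sorted, 1.0)
    return q

# Hypothetical p-values for three recurrently mutated nucleotides.
q = bh_qvalues([1e-9, 1e-4, 1e-7], m_total=1000)
print(q)
```

Using the full exome size as `m_total` makes the correction conservative, since every nucleotide is counted as a hypothesis whether or not it carried enough mutations to be tested.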
Briefly, each state is represented by the pair {C, P}. [Equation (1) of this section, which relates the log2 tumor-to-normal (T/N) coverage ratio s and the absolute allelic deviation to the state {C, P}, is garbled in the source.] The quantities entering that equation are the normalized coverage of the exon, the percentage of reads covering the major allele in the tumor and in the normal, and a factor equal to 1 − contamination due to normal cells. According to (1), the non-redundant admissible states are given by the outer product of the vector C of possible coverages normalized with respect to a diploid genome, [0, 0.5, 1, 1.5, 2], and the coverage fraction P of the major allele, [0, 16.6, 25, 50]. We modeled the observed values of s and p as a pair of independently Gaussian-distributed random variables. Accordingly, we set the transition probabilities using an expression involving two complementary error functions (erfc), one for the coverage and one for the allele fraction. [The explicit expression is garbled in the source.]

Distribution of mutations in tumor suppressor genes

The search for anomalously mutated tumor suppressor genes was performed with a one-sided test that calculates the probability of two or more LOF mutations appearing in the given gene in the observed sample, assuming an overall uniform distribution of LOF mutations. The P value for observing, for a given gene, $k_1$ tumors with a single LOF mutation and $k_{2+}$ tumors with two or more LOF mutations is given by

$$P(k_1, k_{2+}) = \frac{\binom{M}{k_1 + k_{2+}} \binom{k_1 + k_{2+}}{k_1} \binom{n - k_1 - k_{2+} - 1}{n - k_1 - 2k_{2+}}}{\binom{M + n - 1}{n}},$$

where $M$ is the total number of tumors and $n$ is the total number of LOF mutations in the exome in all the tumors. $\binom{a}{b}$ represents the binomial coefficient.

1. Lawrence, M.S. et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499, (2013).
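The tumor suppressor formula above can be checked numerically with exact integer binomials. The sketch below assumes the combinatorial reading in which $n$ indistinguishable LOF mutations are distributed uniformly over $M$ tumors, so that the denominator counts all such distributions; under that reading the probabilities of all $(k_1, k_{2+})$ patterns sum to one. The function name and the small values of $M$ and $n$ are hypothetical.

```python
from math import comb

def lof_pattern_probability(k1, k2p, M, n):
    """Probability of k1 tumors with one LOF mutation and k2p tumors
    with two or more, for n mutations spread over M tumors."""
    remaining = n - k1 - 2 * k2p   # mutations beyond the required minimum
    if remaining < 0 or k1 + k2p > M:
        return 0.0
    if k2p == 0:
        # No tumor can absorb extra mutations, so every mutation is single.
        return comb(M, k1) / comb(M + n - 1, n) if remaining == 0 else 0.0
    ways = (comb(M, k1 + k2p)
            * comb(k1 + k2p, k1)
            * comb(n - k1 - k2p - 1, remaining))
    return ways / comb(M + n - 1, n)

# Hypothetical small case: 3 LOF mutations over 4 tumors.
M, n = 4, 3
total = sum(lof_pattern_probability(k1, k2p, M, n)
            for k1 in range(n + 1) for k2p in range(n + 1))
print(total)
```

The `k2p == 0` branch is handled separately because the third binomial would otherwise have a negative upper argument, which `math.comb` rejects.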
