SUPPLEMENTARY METHODS

Data Contents and Sources

OASIS captured sample-level annotations and three omics data types: somatic mutation, copy number variation (CNV) and gene expression based on microarray or RNA-Seq data (Supplementary Table 1). Alterations were reported for individual genes and samples or summarized for individual genes and cancer types. Gene-level copy number was calculated and reported for each sample based on a weighted average of copy number values for all overlapping CNV segments. Gene-level expression was shown as log2 intensity for microarray data and as TPM (transcripts per million) for RNA-Seq data, calculated by the RSEM package 1. All mutations were consistently annotated using a custom pipeline based on the Variant Effect Predictor (VEP) 2, with additional information derived from comparisons with the dbSNP 3 and COSMIC 4 databases. Disease descriptions of tumor samples were manually curated to enforce a single controlled vocabulary across datasets (Supplementary Table 2). Differential and outlier gene expression analyses were performed on primary tumor datasets with expression data available for 20 normal samples. Most datasets consisted of samples from a single cancer origin, except for CCLE and GTEx, which are multi-cancer datasets.

OASIS 1.0 contained data on 12,108 tumor, 13,007 normal and 1,054 cell line samples across 55 cancer types from TCGA (The Cancer Genome Atlas), CCLE (Cancer Cell Line Encyclopedia) 5, GTEx (Genotype-Tissue Expression) 6 and four Pfizer-funded genomics studies of liver, gastric and breast cancers 7-10 (Supplementary Table 1). Analysis result files including CNV, Mutation and RNA-Seq Expression data for TCGA and GTEx were provided by OmicSoft Corporation's OncoLand data service. CCLE RNA-Seq data were downloaded from the UCSC CGHub 11.
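As a rough illustration of the gene-level copy number summary mentioned above, the following Python sketch computes an exon-overlap weighted average of segment copy numbers and assigns one of the five CNV categories listed under Copy Number Analysis. The function names, the input layout and the normalization by total overlap are assumptions made for illustration, and the inclusive boundaries (>= 3.7, >= 2.5, <= 1.2) are inferred from the way the intervals are laid out. It is not the OASIS pipeline itself.

```python
from typing import List, Tuple


def gene_copy_number(segments: List[Tuple[float, float]]) -> float:
    """Weighted average of segment copy numbers for one gene in one sample.

    Each item is (copy_number, exon_fraction), where exon_fraction is the
    share of the gene's exon sequence covered by that CNV segment.
    Dividing by the total fraction is an assumption that keeps the result
    an average even when the fractions do not sum to exactly 1.
    """
    total_fraction = sum(fraction for _, fraction in segments)
    if total_fraction <= 0:
        raise ValueError("gene has no exon sequence overlapping any segment")
    weighted_sum = sum(cn * fraction for cn, fraction in segments)
    return weighted_sum / total_fraction


def classify_cn(cn: float) -> str:
    """Map a copy number value to one of the five categories."""
    if cn >= 3.7:
        return "Amplification"
    if cn >= 2.5:
        return "Copy gain"
    if cn > 1.5:
        return "Neutral"
    if cn > 1.2:
        return "Copy loss"
    return "Deletion"


if __name__ == "__main__":
    # Hypothetical gene covered half by a CN=4 segment and half by a CN=2 segment.
    cn = gene_copy_number([(4.0, 0.5), (2.0, 0.5)])
    print(cn, classify_cn(cn))
```

Keeping the averaging and the classification in separate functions makes it easy to swap in different thresholds without touching the segment arithmetic.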
Three gene lists that represent potential drug targets (surfaceome, secretome and immunome) were obtained from publications and The Human Protein Atlas.

RNA-Seq Analysis

RNA-Seq data for the CCLE dataset were analyzed using the RSEM algorithm 1 to quantify gene expression. FASTQ files for each sample were extracted from BAM files using Picard's SamToFastq function. RSEM (v ) was applied to the paired-end FASTQ read data using human RefSeq transcripts and the GRCh37 genome assembly as references. Gene expression levels in TPM were reported by OASIS. For the TCGA dataset, TPM values were derived by applying the formula TPM = scaled estimate × 10^6 to the scaled estimate data from the Firehose pipeline output.

Copy Number Analysis

CNV segments with lengths <1 Mb were defined as focal and CNV segments ≥1 Mb in length were defined as broad. CNV calls were further classified into five categories based on the copy number (CN) value: Amplification (CN ≥ 3.7), Copy gain (3.7 > CN ≥ 2.5), Neutral (2.5 > CN > 1.5), Copy loss (1.5 ≥ CN > 1.2) and Deletion (CN ≤ 1.2). Gene-level CN for each sample was calculated as a weighted average of CN values for all overlapping CN segments: Gene CN = Σ (C × E), where C represents the copy number of the CNV segment and E represents the percentage of exon sequence overlapping the CNV segment. Genomic regions that harbor recurrent and high-level CNV in a cohort of samples were identified using the GISTIC v.2 algorithm 17.

Mutation Annotation

Mutations were annotated using the EnsEMBL gene set release and the Variant Effect Predictor (VEP) 2. For each mutation, transcript-level annotations from VEP were filtered to select the most deleterious annotation at the gene level, based on the mutation functional effect and the functional consequence predicted by SIFT 19. Among transcripts with the same predicted functional state, the longer coding transcript was selected. COSMIC (v.66) identifiers and sample counts for entries with confirmed somatic status were provided for mutations whose coordinates and mutant allele matched a COSMIC entry. The dbSNP (v.137) 3 identifier and allele frequency were also annotated for mutations with a coordinate match to a dbSNP entry. Pfam 20 protein domains overlapping with mutations were also annotated. For each mutation, the numbers of samples affected in each cancer type and across all cell lines were reported as tumor and model recurrences, respectively.

Differential Expression and Outlier Analysis

Microarray data were quantile normalized and differential expression was analyzed using the limma package in Bioconductor 21. Differential expression analysis for RNA-Seq data was performed using the edgeR package in Bioconductor 22. The false discovery rate (FDR) was calculated using the Benjamini-Hochberg method 23. Outlier analysis was performed on gene expression data using the likelihood ratio method 24, which identifies a change point in the sorted standardized expression levels of tumor samples while using normal samples as a reference. Samples above the change point were defined as up-regulated outliers. Down-regulated outliers were defined similarly. The p-value for the outlier statistic was calculated based on a simulated null distribution from the standard Gaussian, and the Benjamini-Hochberg FDR was then calculated from the p-values. Outliers were defined only for genes with FDR<0.05 in the outlier analysis.
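For readers unfamiliar with the Benjamini-Hochberg adjustment used above, a minimal Python sketch follows. It converts a list of p-values into FDR-adjusted values and applies the FDR<0.05 cutoff. The function names are illustrative, and the analyses described here relied on Bioconductor packages rather than on this code.

```python
from typing import List


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values are
    # monotone: each one is capped by the adjusted value ranked above it.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


def passes_fdr(pvalues: List[float], alpha: float = 0.05) -> List[bool]:
    """Flag entries whose adjusted p-value falls below the FDR cutoff."""
    return [q < alpha for q in benjamini_hochberg(pvalues)]


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]
    print(benjamini_hochberg(raw))
    print(passes_fdr(raw))
```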
Druggability Score

The small molecule druggability score for a gene indicates the potential for developing a small molecule drug that functionally modulates the gene product. It was developed using a semi-supervised approach to protein domain classification. The method started by examining the molecular targets of all marketed and experimental drugs, as well as a large collection of drug-like tool compounds. Protein domains from these targets known to bind small molecule compounds with high affinity were then learned in a semi-automated fashion. By mapping these druggable protein domains genome-wide, the method classified all human proteins into six categories of druggability 25: 1 - unknown druggability, 2 - has catalytic activity, 3 - has possibly druggable protein domain(s), 4 - has druggable protein domain(s), 5 - has high affinity drug-like compound(s) and 6 - target of launched drug(s).

SUPPLEMENTARY NOTE

Overview of Web Portal

The OASIS web portal was developed based on a custom version of the BioMart framework designed for oncogenomics data analysis 26,27. All functionalities can be accessed through the menu bar at the top of the Home page (Supplementary Figure 1). Users can click on Data Summary to obtain an overview of all data organized hierarchically by disease, dataset and data type (Figure 1A). Gene Search allows users to enter a gene name, such as a HUGO gene symbol, and retrieve a Gene report that summarizes the frequency with which the gene is affected by genomic alterations across different datasets and cancers (Supplementary Figure 2, Supplementary Table 3). For a specific alteration type such as somatic mutation or copy number gain, the alteration frequency is calculated as the proportion of tumors in a cohort that harbor at least one alteration affecting the gene of interest.

Database Search provides users with an easy-to-use interface to build custom queries against the entire database (Figure 1B, Supplementary Figure 3A). Users can specify various query criteria based on sample annotations, cancer alterations or results of pre-computed analyses such as differential expression in tumor relative to normal tissues. The query result, called an Alteration report, is returned in a pre-defined or user-specified tabular format and can be exported in a Microsoft Excel-compatible format (Supplementary Figure 3B, Supplementary Table 4). Programmatic access is also available through REST and SOAP services.

The Analysis section provides users with two analytical tools, Pan-cancer report and OASIS-print, to explore complex patterns of alterations affecting multiple genes. Plots including Bar, Box, Scatter and Volcano plots facilitate exploratory analysis of the data at the gene and sample level (Figures 1D-G). All plots share a common layout (Supplementary Figure 4) and functionalities such as zooming, in-plot search that can identify a gene or sample by name, and multi-select that enables ad hoc selection of genes or samples by drawing a rectangle around plot elements. Users can mouse over plot elements such as samples to obtain a pop-up window containing detailed information and links to a Sample report (a summary of all the alterations identified for that sample), among other visualizations. Further, samples may be multi-selected to obtain sample names, alteration details and estimated prevalence as a percentage of the cohort.
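The alteration frequency defined above for the Gene report (the proportion of tumors in a cohort with at least one alteration affecting the gene) can be sketched in a few lines of Python. The data layout and the sample identifiers below are hypothetical and serve only to make the definition concrete.

```python
from typing import Dict, Set


def alteration_frequency(altered_genes_by_sample: Dict[str, Set[str]], gene: str) -> float:
    """Fraction of samples in a cohort with at least one alteration in `gene`.

    `altered_genes_by_sample` maps each sample in the cohort to the set of
    genes carrying the alteration type of interest (for example copy number
    gain). Samples with no alterations must still be present, mapped to an
    empty set, so that they count toward the denominator.
    """
    if not altered_genes_by_sample:
        raise ValueError("cohort is empty")
    altered = sum(1 for genes in altered_genes_by_sample.values() if gene in genes)
    return altered / len(altered_genes_by_sample)


if __name__ == "__main__":
    # Hypothetical four-sample cohort.
    cohort = {
        "SAMPLE_1": {"MET", "EGFR"},
        "SAMPLE_2": set(),
        "SAMPLE_3": {"KRAS"},
        "SAMPLE_4": set(),
    }
    print(alteration_frequency(cohort, "MET"))
```

A sample is counted once however many alterations it carries in the gene, which matches the "at least one alteration" wording.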
While several web portals have been developed to facilitate analyses of publicly available cancer genomic datasets, OASIS provides a unique resource that combines the scale of its data collection with unique datasets, analysis results such as differential expression and druggability score, and novel analytical tools such as Pan-cancer report and Volcano plot (Supplementary Tables 5-6). OASIS enables analyses on one of the largest collections of multi-omics datasets on the web, with genomic-scale profiles on 26,169 samples (12,108 primary tumors, 13,007 normal samples and 1,054 cell lines), including RNA-Seq profiles on 12,056 samples. As a point of reference, the well-established cBioPortal 31 had compiled data on 21,401 samples as of August 2015, including RNA-Seq data on 10,459 samples. To our knowledge, OASIS is also the first cancer genomics portal that hosts RNA-Seq data from CCLE 5 and GTEx 33. A major strength of OASIS is the integrative analysis of RNA-Seq derived gene expression data with genomic alterations and across multiple studies. Features such as the Box plot were designed to integrate RNA-Seq expression data from multiple sources for comparing primary tumors with cell lines or tumors with normal tissue. OASIS also focused on solving problems that arise in the drug discovery process. For example, starting from the open source code of cBioPortal, OASIS-print was significantly modified to enable selection of cell line or xenograft models based on multi-gene expression and genomic characteristics, a common use case in oncology drug discovery.

Pan-cancer Report

The Pan-cancer report provides a graphical summary of alteration patterns affecting a list of genes and identifies genes frequently altered in one or multiple cancers (Figure 1C, Supplementary Figure 5). This visualization resembles a heatmap where rows represent individual genes and columns consist of two sections.
The Gene Info section provides gene-level annotation including gene symbol, druggability score (Supplementary Methods), oncogene/tumor suppressor status as defined in the Cancer Gene Census 34, and protein classification (secretome, immunome, surfaceome). All other columns represent alteration frequencies grouped by primary tissues and datasets. The Summary columns report frequencies for individual alteration types summarized across all samples. Columns are color-coded to represent different alteration types. Color gradients within columns represent alteration frequencies, with higher intensities corresponding to higher frequencies. Users can toggle between the default color-only heatmap view and the detailed view that displays the numerical values. Any heatmap cell can be clicked to retrieve the Alteration report. This tool can identify drug targets that are frequently altered in a disease or altered at low prevalence but across multiple cancers. For instance, we used the Pan-cancer report to reveal that MET, a gene from the c-MET signaling pathway implicated in a variety of human malignancies 35, harbored high-level copy number alterations in ~2% of gastric cancer cases (Supplementary Figure 5).

Copy Number Analyses using Bar Plot and Scatter Plot

MET amplification is known to confer sensitivity to tyrosine kinase inhibitors in subsets of cancer patients 36. A Bar plot analysis of TCGA datasets confirmed that high levels of MET amplification occur in multiple cancers (Supplementary Figure 6). A Scatter plot analysis showed that MET amplification induced over-expression in a subset of gastric cancer and lung adenocarcinoma cases, with Pearson correlation coefficients of 0.83 and 0.65, respectively (Supplementary Figure 7A-C). Scatter plot analysis of the CCLE dataset suggested that gastric cancer cell lines such as SNU5 harbor both MET amplification and over-expression, and therefore could be used to experimentally test the effects of tyrosine kinase inhibition (Supplementary Figure 7D).

Expression Analyses with Box Plot

Box plots and Bar plots can be used to visualize gene expression values calculated from either microarray or RNA-Seq data. Gene expression values derived from RNA-Seq are comparable across multiple datasets, as normalization of RNA-Seq data is largely dataset independent.
By default, Box plots render a side-by-side comparison of gene expression values in tumor vs. normal samples. For tumor datasets without normal samples, such as CCLE, users can integrate normal gene expression data into the analysis by selecting the GTEx dataset during the plot configuration step. The Box plot enabled us to estimate that MET is over-expressed in ~1% of TCGA gastric cancer samples (Supplementary Figure 8A). To identify cell lines that over-express MET, we integrated CCLE and GTEx expression data in the same plot and sorted samples by tissue type to obtain side-by-side comparisons of tumor vs. normal samples. Further zooming in showed that MET was expressed at significantly higher levels in gastric cancer cell lines than in normal gastric tissues (Supplementary Figure 8B-C). Cancer cell lines may harbor multiple driver mutations and therefore complicate experimental design and interpretation. To enable model selection based on multi-gene profiles, we developed an analytical tool called OASIS-print based on the open source version of Oncoprint in cBioPortal 31. By combining graphical and tabular visualizations, OASIS-print allowed us to visually identify four gastric cancer cell lines with MET amplification and wild-type EGFR and KRAS (Supplementary Figure 9).

Differential Expression Analysis with Volcano Plot

The Volcano plot provides an interactive visualization of differential expression between tumor and normal samples for a given dataset (Supplementary Methods). The query interface allows users to examine all genes in a dataset or only genes from user-specified oncogenic pathways or gene signatures defined by MSigDB 37. We can search for a particular gene by name and multi-select genes of interest within the Volcano plot. To learn more about a gene of interest, we can use the pop-up menu to retrieve the Gene report or to perform additional analyses such as visualizing the gene expression across individual samples using the Box plot or Bar plot. As shown in the Volcano plot, MET was over-expressed in tumor vs. normal tissues in the TCGA Thyroid cancer cohort (Supplementary Figure 10A).

ADC Target Analysis

Antibody drug conjugates (ADCs), such as the anti-HER2 Kadcyla, are a powerful class of anticancer drugs that target cell surface proteins over-expressed in tumors vs. normal tissues. To enable ADC target gene analysis, OASIS has aggregated RNA-Seq data for 7,347 primary tumors (TCGA), 781 cancer cell lines (CCLE) and 3,231 normal tissues (GTEx). The Volcano plot can be used to identify all transmembrane genes over-expressed in TCGA breast cancer vs. normal tissues, including ERBB2 (Supplementary Figure 10B). Box plot analysis showed that ERBB2 was over-expressed in tumor vs. normal in breast, gastric and non-small cell lung cancers (Supplementary Figure 11A). Integrative analysis of TCGA and GTEx expression data confirmed that ERBB2 expression was higher in tumors than in all the normal tissues. Moreover, we confirmed that ERBB2 expression was higher in TCGA ovarian tumors than in normal tissues even though ovarian normal tissue data were not available in TCGA (Supplementary Figure 11B). Scatter plot analysis indicated that ERBB2 over-expression was induced by copy number amplifications (Supplementary Figure 11C). Finally, we used OASIS-print to identify cancer cell lines that are ERBB2 amplified but ER-negative with wild-type PIK3CA (Supplementary Figure 11D).

SUPPLEMENTARY REFERENCES

1. Li, B. & Dewey, C.N. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12, 323 (2011).
2. McLaren, W. et al. Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics 26, (2010).
3. Sherry, S.T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29, (2001).
4. Forbes, S.A. et al. COSMIC: exploring the world's knowledge of somatic mutations in human cancer. Nucleic Acids Res (2014).
5. Barretina, J. et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, (2012).
6. Consortium, G.T. The Genotype-Tissue Expression (GTEx) project. Nat Genet 45, (2013).
7. Curtis, C. et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 486, (2012).
8. Kan, Z. et al. Whole-genome sequencing identifies recurrent mutations in hepatocellular carcinoma. Genome Res 23, (2013).
9. Wang, K. et al. Genomic landscape of copy number aberrations enables the identification of oncogenic drivers in hepatocellular carcinoma. Hepatology 58, (2013).
10. Wang, K. et al. Whole-genome sequencing and comprehensive molecular profiling identify new driver mutations in gastric cancer. Nat Genet 46, (2014).
11. Wilks, C. et al. The Cancer Genomics Hub (CGHub): overcoming cancer through the power of torrential data. Database (Oxford) 2014 (2014).
12. da Cunha, J.P. et al. Bioinformatics construction of the human cell surfaceome. Proc Natl Acad Sci U S A 106, (2009).
13. Brown, K.J. et al. The human secretome atlas initiative: implications in health and disease conditions. Biochim Biophys Acta 1834, (2013).

14. Almen, M.S., Nordstrom, K.J., Fredriksson, R. & Schioth, H.B. Mapping the human membrane proteome: a majority of the human membrane proteins can be classified according to function and evolutionary origin. BMC Biol 7, 50 (2009).
15. Ortutay, C., Siermala, M. & Vihinen, M. Molecular characterization of the immune system: emergence of proteins, processes, and domains. Immunogenetics 59, (2007).
16. Uhlen, M. et al. Towards a knowledge-based Human Protein Atlas. Nat Biotechnol 28, (2010).
17. Mermel, C.H. et al. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biol 12, R41 (2011).
18. Flicek, P. et al. Ensembl Nucleic Acids Res 42, D (2014).
19. Kumar, P., Henikoff, S. & Ng, P.C. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc 4, (2009).
20. Finn, R.D. et al. Pfam: the protein families database. Nucleic Acids Res 42, D (2014).
21. Smyth, G.K. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 3, Article3 (2004).
22. Robinson, M.D., McCarthy, D.J. & Smyth, G.K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, (2010).
23. Benjamini, Y. & Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B (1995).
24. Hu, J. Cancer outlier detection based on likelihood ratio test. Bioinformatics 24, (2008).
25. Hopkins, A.L. & Groom, C.R. The druggable genome. Nat Rev Drug Discov 1, (2002).
26. Kasprzyk, A. BioMart: driving a paradigm change in biological data management. Database (Oxford) 2011, bar049 (2011).
27. Zhang, J. et al. International Cancer Genome Consortium Data Portal--a one-stop shop for cancer genomics data. Database (Oxford) 2011, bar026 (2011).
28. Zhu, J. et al. The UCSC Cancer Genomics Browser. Nat Methods 6, (2009).
29. Gundem, G. et al. IntOGen: integration and data mining of multidimensional oncogenomic data. Nat Methods 7, 92-3 (2010).
30. Ching, K.A. et al. Cell Index Database (CELLX): a web tool for cancer precision medicine. Pac Symp Biocomput, 10-9 (2015).
31. Cerami, E. et al. The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov 2, (2012).
32. Leiserson, M.D. et al. MAGI: visualization and collaborative annotation of genomic aberrations. Nat Methods 12, (2015).
33. Consortium, G.T. Human genomics. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348, (2015).
34. Futreal, P.A. et al. A census of human cancer genes. Nat Rev Cancer 4, (2004).
35. Organ, S.L. & Tsao, M.S. An overview of the c-MET signaling pathway. Ther Adv Med Oncol 3, S7-S19 (2011).
36. Smolen, G.A. et al. Amplification of MET may identify a subset of cancers with extreme sensitivity to the selective tyrosine kinase inhibitor PHA Proc Natl Acad Sci U S A 103, (2006).

37. Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102, (2005).

Supplementary Table 1: Datasets in OASIS 1.0.

Alteration type columns: Mutation; Copy Number Alterations (Copy Number, Copy Number Peak); Expression (MicroArray Expression, RNA-Seq Expression, Differential Expression, Outlier Expression).

SOURCE | PROJECT NAME | TISSUE TYPE
PFIZER-HKU | Gastric cancer | Stomach
PFIZER-METABRIC | Breast cancer | Breast
PFIZER-ACRG | Hepatocellular carcinoma | Liver
PFIZER-SAMSUNG | Hepatocellular carcinoma | Liver
TCGA | Bladder urothelial carcinoma | Bladder
TCGA | Breast invasive carcinoma | Breast
TCGA | Cervical squamous cell carcinoma and endocervical adenocarcinoma | Cervical
TCGA | Colon adenocarcinoma | Colon
TCGA | Lymphoid neoplasm diffuse large B-cell lymphoma | Diffuse large B-cell lymphoma
TCGA | Glioblastoma multiforme | Glioblastoma multiforme
TCGA | Head and Neck squamous cell carcinoma | Head and Neck
TCGA | Kidney chromophobe | Kidney
TCGA | Kidney renal clear cell carcinoma | Kidney
TCGA | Kidney renal papillary cell carcinoma | Kidney
TCGA | Acute myeloid leukemia | Acute myeloid leukemia
TCGA | Brain lower grade glioma | Brain lower grade glioma
TCGA | Liver hepatocellular carcinoma | Liver
TCGA | Lung adenocarcinoma | Lung
TCGA | Lung squamous cell carcinoma | Lung
TCGA | Ovarian serous cysta | Ovary
TCGA | Pancreatic adenocarcinoma | Pancreas
TCGA | Prostate adenocarcinoma | Prostate
TCGA | Rectum adenocarcinoma | Rectum
TCGA | Sarcoma | Sarcoma
TCGA | Skin cutaneous melanoma | Skin
TCGA | Stomach adenocarcinoma | Stomach
TCGA | Thyroid carcinoma | Thyroid
TCGA | Uterine corpus endometroid carcinoma | Uterine
BROAD | Cancer Cell Line Encyclopedia (CCLE) * | Cell lines
BROAD | Genotype Tissue Expression (GTEx) ** | Normal tissue

* Mutation calls in CCLE are only made for a targeted list of genes.
** GTEx data correspond to normal samples only.


SUPPLEMENTARY PROTOCOL

This protocol contains step-by-step instructions for the following use cases: (1) Analysis of CDK4 as a potential drug target; (2) Antibody drug conjugate (ADC) target analysis; (3) Selecting cell lines based on multi-gene omics data; (4) Downloading OASIS-Genomics source code; (5) Programmatic Access API.

Use Case 1: Analysis of CDK4 as a potential drug target

(1) Pan-cancer report

This is a step-by-step guide on how to use the OASIS web portal to evaluate CDK4 as a drug target gene based on integrative analysis of cancer omics data. In the first step, we use the Pan-cancer report to retrieve the prevalence of alterations affecting a list of genes involved in cell cycle regulation. To access the Pan-cancer report feature, click on the Analysis option on the navigation menu, and then click on the Pan Cancer Report option in the drop-down menu to open the query interface. Three input options are available from the interface. The Cancer Types menu provides the list of datasets available for selection, grouped by tissue of origin. If no cancer type is selected, all of them will be used in the query. To select multiple cancer types, use Ctrl+click on each cancer type. The Genes text box allows users to enter a comma-separated gene list. The Oncogenic Pathways menu provides a list of pre-defined pathways implicated in cancer. A user can specify a custom list of genes, choose an oncogenic pathway, or combine both. Select Breast, Colon, Gastric, Glioblastoma, Head and Neck, Liver, Lung, Melanoma, Ovarian, Prostate, Rectal, Renal, Uterine, Models and Cell lines from the Cancer Types menu (A). In the Genes text box (B), type CCND1, CDK4, CDK6, CDKN2A, E2F1, RB1 and click on Search (C).


The Pan-cancer report returns a table where each row represents a gene and each column represents a single cancer type (top row) and dataset (second row). Alteration types are color coded, with green representing mutations (substitutions and indels), red representing copy number gains and blue representing copy losses. Rows can be sorted by the alteration frequency value in each table cell. Click on the red column header for Glioblastoma (TCGA) to sort the rows by the frequency of copy number alterations for the selected dataset (A). Mouse over a cell in the row corresponding to CDK4 to obtain information on the number and percentage of samples amplified in that cohort (B): 13.34% of the TCGA Glioblastoma cohort. From the Gene Info section, we can also see that CDK4 is a known oncogene and may be targeted by small molecule drugs.


(2) Copy number Bar plot

The Bar plot can display the copy number values (log2 ratio) for one gene across multiple samples from different datasets. In this step, we use the Bar plot to visually assess copy number amplifications affecting CDK4 in the TCGA Glioblastoma cohort. To access the Bar plot, click on Plots in the navigation menu and then click on Bar plot (Copy number across samples). In the datasets list, select Glioblastoma (TCGA) (A). Once a dataset is selected, type CDK4 into the gene symbol text box (B) in the Restrict Search section. Click on GO (C) to create the plot.

In the plot, each bar corresponds to a single sample colored by copy number status. Mouse over an individual bar to bring up a pop-up box containing information about the selected sample. To select multiple samples, click on Mode at the top left to switch from the default Zoom mode to the Select mode.

With Mode: Select enabled, highlight a group of samples by clicking on the plot and dragging the mouse pointer to create a rectangle enclosing the samples of interest. In this case we want to select all the samples that harbor CDK4 amplifications (dark red bars). A pop-up box will appear with information about the selected samples, including sample name, cancer and tissue type, and copy number log2 ratio values. This box will also display the percentage of samples selected. Here it confirms that ~13% of the Glioblastoma (TCGA) dataset harbors amplifications of CDK4. Selected data can be exported as an Excel file by clicking on the Export to Excel button. Alternatively, the sample names can be copied onto the clipboard by using the Copy sample names button.
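The same selection can be approximated offline on exported data. The sketch below picks samples whose log2 ratio exceeds an amplification cutoff and reports the selected fraction of the cohort. It assumes that the plotted log2 ratio equals log2(CN / 2) for a diploid baseline, so that the CN 3.7 amplification boundary maps to about 0.89. That conversion, the function name and the sample identifiers are illustrative assumptions, not documented behavior of the portal.

```python
import math
from typing import Dict, List, Tuple

# Assumed conversion of the CN = 3.7 amplification boundary to a log2 ratio
# relative to a diploid baseline (about 0.8875).
AMPLIFICATION_LOG2_RATIO = math.log2(3.7 / 2)


def select_amplified(
    log2_ratios: Dict[str, float],
    threshold: float = AMPLIFICATION_LOG2_RATIO,
) -> Tuple[List[str], float]:
    """Return the amplified sample names and their percentage of the cohort."""
    if not log2_ratios:
        raise ValueError("cohort is empty")
    selected = sorted(name for name, value in log2_ratios.items() if value >= threshold)
    percent = 100.0 * len(selected) / len(log2_ratios)
    return selected, percent


if __name__ == "__main__":
    # Hypothetical log2 ratios for one gene across four samples.
    cohort = {"S1": 2.1, "S2": 0.1, "S3": -0.3, "S4": 1.0}
    names, percent = select_amplified(cohort)
    print(names, f"{percent:.1f}% of cohort")
```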

(3) Scatter plot

The Scatter plot may be used to assess whether CDK4 amplification correlates with higher gene expression. To access the Scatter plot, click on Plots in the navigation menu and then click on Scatter plot (Copy number vs. gene expression).

From the Select Datasets section, select Glioblastoma (TCGA) (A) and type CDK4 into the gene symbol text box (B) in the Restrict Search section. From the Expression data type drop-down menu, select RNASeq (RSEM) (C). Only RNA-Seq (RSEM) data should be used to combine gene expression data across multiple datasets. Click on GO (D) to create the plot.

The Scatter plot shows the copy number values (log2 ratio) on the X-axis and the expression values on the Y-axis. For RNA-Seq (RSEM) data, gene expression is given in units of TPM (transcripts per million). In the plot, each data point represents a single sample and the straight line represents the linear fit. As shown in the legend, the Pearson correlation coefficient between copy number and expression of CDK4 is 0.78 in the Glioblastoma (TCGA) dataset, suggesting that amplification induces CDK4 over-expression. Mouse over a data point to display a pop-up box containing information about the highlighted sample. Links to the Sample report and other visualizations can be accessed from the pop-up box.
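For reference, the two quantities shown in the Scatter plot (the Pearson correlation coefficient and the least-squares line) can be computed as in the following Python sketch. The function names and the toy values are assumptions for illustration. The portal's own implementation is not described in this protocol.

```python
import math
from typing import Sequence, Tuple


def _centered_sums(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float, float, float]:
    """Return mean_x, mean_y and the centered sums Sxy, Sxx, Syy."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return mean_x, mean_y, sxy, sxx, syy


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between copy number and expression values."""
    _, _, sxy, sxx, syy = _centered_sums(x, y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def linear_fit(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept of y regressed on x."""
    mean_x, mean_y, sxy, sxx, _ = _centered_sums(x, y)
    if sxx == 0:
        raise ValueError("slope is undefined when x is constant")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Toy data: copy number log2 ratios and expression values for four samples.
    copy_number = [0.0, 1.0, 2.0, 3.0]
    expression = [1.0, 3.0, 5.0, 7.0]
    print(pearson_r(copy_number, expression), linear_fit(copy_number, expression))
```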

(4) Gene expression Box plot

The Box plot in OASIS can be used to visualize gene-level expression based on either Microarray or RNA-Seq experiments for individual samples grouped by cancer type and status (tumor/normal). This visualization tool allows us to compare CDK4 expression between tumor and normal samples and across cancer types. To access the Box plot, click on Plots and select Box plot (RNA-Seq Expression across cancer types).

Select Glioblastoma (TCGA) from the Select Datasets section (A). As the Glioblastoma (TCGA) dataset does not have normal samples available, you can integrate a normal reference dataset from the GTEx project. Use Ctrl+click on Normal tissue Expression (GTEx) (B) to include this dataset in the analysis. Add CDK4 in the gene symbol text box (C) and click the GO button (D).

A new Box plot will appear with the combined data for the selected datasets. Click on Sort by name (A) in the options at the top of the plot. This action will sort the X-axis categories alphabetically and place TCGA Glioblastoma samples side-by-side with the GTEx normal brain samples. Click on the Switch to Log(2) Scale (B) option to toggle the Y-axis values from linear to log2 scale.

With the default Mode: Zoom setting shown in the menu options, draw a rectangle around the brain samples.

The plot will zoom in to show only samples for the cancer types selected. Simply click on Reset zoom at the top right of the plot to return to the default view. This analysis demonstrates that CDK4 is indeed over-expressed in Glioblastoma tumors compared to normal brain tissues.
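To reproduce a comparable tumor vs. normal summary outside the portal, one could compute the box plot statistics per group on a log2 scale, as in the sketch below. The pseudocount of 1 added before taking log2 and the expression values are assumptions for illustration. The protocol does not state how the portal handles zero TPM values on the log2 axis.

```python
import math
import statistics
from typing import Dict, List, Sequence


def log2_scale(values: Sequence[float], pseudocount: float = 1.0) -> List[float]:
    """Convert expression values to log2 scale, avoiding log2(0)."""
    return [math.log2(value + pseudocount) for value in values]


def box_stats(values: Sequence[float]) -> Dict[str, float]:
    """Five-number summary of the kind drawn by a box plot."""
    # method="inclusive" treats the data as the full population of interest.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}


if __name__ == "__main__":
    # Hypothetical TPM values for one gene in two groups.
    groups = {
        "Glioblastoma (TCGA)": [30.0, 62.0, 126.0, 254.0, 510.0],
        "Brain (GTEx)": [6.0, 14.0, 14.0, 30.0, 30.0],
    }
    for name, tpm in groups.items():
        print(name, box_stats(log2_scale(tpm)))
```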


(5) Volcano plot

The Volcano plot provides a way to visualize differential expression of all genes within a dataset and identify those genes that are up-regulated (red) or down-regulated (blue) in tumor vs. normal tissues. To go to the Volcano plot, click on Plots and select Volcano plot (Differential Expression across genes in a project). In the Select Datasets section, select Lung Squamous Cell Carcinoma (TCGA) (A). The Volcano plot only supports the analysis of a single dataset at a time. In the Restrict Search section, one can select a subset of genes that belong to a specific pathway or gene signature, genes that are known cancer genes (as defined by the Cancer Gene Census) or genes that belong to the Surfaceome, Secretome or Immunome categories. Select genes that are known Oncogenes or Tumor suppressors from the Cancer gene status filter (B) and click the Go button (C).

In the Volcano plot, one can search for a gene of interest using the Gene Search function from the options menu at the top of the plot.

Click on Gene Search. In the pop-up search box type CDK4 and click on Search. If the gene exists in the selected gene set, it will be highlighted in the plot with a pop-up box showing the gene name and a change of color in the corresponding data point.


The Volcano plot analysis indicates that CDK4 is differentially expressed in Lung Squamous Cell Carcinoma compared to adjacent normal tissues.

(6) OASIS-print

OASIS-print enables users to visualize multiple types of alterations affecting a list of genes across all the samples in a single dataset. Here OASIS-print can be used to identify cell line models with a specific profile - CDK4 amplifications and wild-type RB1. To access OASIS-print, click on Analysis in the navigation menu and select OASIS-print.

In the Select Datasets section click on Cell Lines (CCLE) (A). In the Entries with following IDs text box type CDK4, CDK6, CCND1, RB1 (B) and click on the Go button (C). The OASIS-print results page is divided into two sections. The upper section provides a graphical view of the data where each row represents a gene and each column of dots represents a single sample. Alterations are color-coded and color gradients are used to represent the alteration values, with darker colors corresponding to higher values and lighter colors corresponding to lower values. The lower section provides the query results in a tabular view where each row represents a sample and each column represents a gene and alteration type. Values in each table cell are also color-coded to highlight altered samples. Both the graphic view and the table view are fully interactive.


Within the table view, samples can be filtered by using the search box (A) to identify a particular sample or to show only those samples that harbor a particular mutation. Samples in the table view can also be sorted by clicking on the small arrow to the left of each column name. The sorting of the table is also applied to the graphical view above. The sample names column (B) also provides links to the Sample report, where more information about the sample of interest can be obtained. This information is also accessible from the graphical view by hovering the mouse over a sample and following the link in the pop-up box. Hovering over a sample name in the table also highlights the same sample in the graphic view.

Using the OASIS-print feature it was possible to identify the Lung adenocarcinoma cell line RERFLCAD2, which harbors a CDK4 amplification, for target validation and pharmacological studies. OASIS-print has a unique feature allowing users to rearrange the graphic view by sorting the molecular data in the table view. By default OASIS-print shows cell lines sorted by the copy number value of the first gene in the gene list. To sort cell lines by the expression value of CDK4, click on the arrows at the right of the RNAseq expr header for CDK4 (A) in the table view. Cell line HCC827 is identified as the Lung adenocarcinoma cell line with the highest CDK4 expression.



Use case 2: Antibody drug conjugate (ADC) target analysis

(1) Volcano plot

This is a step-by-step guide on how to use the OASIS web portal to identify and evaluate ERBB2 as a target gene for the ADC modality. First, we use the Volcano plot to identify significantly up-regulated transmembrane genes in breast cancer tumors compared to normal tissues. Click on Plots to access the Volcano plot. In the drop-down menu click on the Volcano plot (Differential Expression across genes in a project) option to open the query interface to the Volcano plot. Click on Breast Invasive Carcinoma (TCGA) (A). In the Restrict Search section one can choose to show only genes that belong to a specific pathway or gene signature, genes that are known cancer genes (as defined by the Cancer Gene Census) or genes that belong to the Surfaceome, Secretome or Immunome categories. To show only transmembrane genes select YES from the Is Surfaceome filter (B) and click the Go button (C).


In the Volcano plot one can search for a gene of interest by using the Gene Search function from the options menu at the top of the plot. Click on Gene Search (A). In the pop-up search box type ERBB2 (B) and click on Search (C). ERBB2 will be highlighted in the plot with a pop-up box showing the name of the gene and a change of color in the corresponding data point (D).

You can mouse over the data point representing ERBB2 to obtain a pop-up box with more information. From the pop-up box one can access the Gene report (A), a summary of alterations affecting the gene across all datasets. Links to the copy number Bar plot and to the expression Box plots and Bar plots are also available (B). These analyses will provide information at the sample level. A link to the differential expression result data in tabular format is also available (C).


(2) Gene expression Box plot

Follow the Expression (RNASeq) across samples (boxplot) link from the Volcano plot, or click on Plots and then Box plot (RNA-Seq Expression across cancer types) in the navigation menu, to access the Box plot. The Box plot can be used to visualize gene-level expression based on either Microarray or RNA-Seq experiments for individual samples grouped by cancer type and status (tumor/normal). This tool allows us to compare ERBB2 expression between tumor and normal samples and across cancer types.

Select Breast Invasive Carcinoma (TCGA) from the Select Datasets section (A). Add additional datasets by Ctrl+clicking on Gastric Adenocarcinoma (TCGA), Lung Adenocarcinoma (TCGA) and Lung Squamous Cell Carcinoma (TCGA). Type ERBB2 into the gene symbol text box (C) and click the Go button (D). A new Box plot will appear with the data for the selected datasets split into tumor (red) and normal (green) samples. Click on the Sort by Name (A) option in the menu at the top of the plot to sort the datasets in alphabetical order. Click on the Switch to Log(2) Scale (B) option in the same menu to change the values on the Y-axis from linear to log2 scale.

The Box plot analysis suggests that ERBB2 is significantly up-regulated in tumor vs. normal in Breast Invasive Carcinoma, Gastric Adenocarcinoma and Lung Adenocarcinoma. Click the blue Back button (A) on the top right of the plot to go back to the Box plot menu and add new datasets.

Ctrl+click on Ovarian Serous Cystadenocarcinoma (TCGA) (A). As there are no normal samples available in this dataset, you can integrate the tumor data with the normal reference data from GTEx. Ctrl+click on Normal Tissue expression (GTEx) (A). Type ERBB2 into the gene symbol text box (B) and click the Go button (C).

Click on the Sort by Name option in the menu at the top of the plot to sort the datasets in alphabetical order. The Ovarian tumor samples will be displayed side-by-side with the Ovary tissue samples from GTEx. With the Mode: Zoom option selected (A), draw a rectangle around the ovary samples to zoom into this part of the plot.

Click on the Switch to Log(2) Scale (A) option in the menu at the top of the plot to change the values on the Y-axis from linear to log2 scale. The integrative analysis of Ovarian Serous Cystadenocarcinoma (TCGA) and normal Ovary (GTEx) data demonstrates that ERBB2 is up-regulated in Ovarian tumors compared to normal tissue.

(3) Scatter plot

The Scatter plot enables us to visualize the correlation between ERBB2 copy number and expression at the sample level. Using the Scatter plot one can identify a subset of cases where ERBB2 over-expression is driven by amplification. To go to the Scatter plots click on Plots in the navigation menu and then click on Scatter plot (Copy number vs. gene expression). From the Select Datasets section click on Breast Invasive Carcinoma (TCGA) (A). Add additional datasets by Ctrl+clicking on Gastric Adenocarcinoma (TCGA), Lung Adenocarcinoma (TCGA), Lung Squamous Cell Carcinoma (TCGA) and Ovarian Serous Cystadenocarcinoma (TCGA). Once the datasets have been selected type ERBB2 into the gene symbol text box (B) and select RNASeq (RSEM) from the Expression data type drop-down menu (C). Click on Go (D) to produce the plot.

The Scatter plot shows the copy number values (log2 ratio) on the X-axis and the expression values on the Y-axis. For RNA-Seq (RSEM) data, gene expression is reported in TPM (transcripts per million). In the plot each data point represents a single sample and the straight line represents the linear fit. As shown in the legend, the Pearson correlation coefficient between copy number and expression of ERBB2 is 0.86 in the Breast Invasive Carcinoma (TCGA) dataset, suggesting that amplification induces ERBB2 over-expression.
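For readers who download the underlying data, the Pearson correlation coefficient reported in the legend can be recomputed with a few lines of Python. The sketch below uses the standard formula on made-up values; the numbers are not OASIS data, and whether OASIS applies any transformation to the expression values before computing the coefficient is not stated here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-sample values: copy number log2 ratio and expression in TPM
copy_number_log2_ratio = [-0.1, 0.0, 0.2, 1.1, 2.4, 3.0]
expression_tpm = [40.0, 55.0, 60.0, 180.0, 900.0, 1500.0]
print(round(pearson_r(copy_number_log2_ratio, expression_tpm), 3))
```

A coefficient close to 1 indicates that samples with higher copy number also tend to have higher expression.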


(4) OASIS-print

OASIS-print enables users to visualize multiple types of alterations affecting a list of genes across all the samples in a single dataset. Here OASIS-print can be used to identify breast cancer cell lines that are HER2 positive and ER negative with wild-type PIK3CA. To access OASIS-print, click on Analysis in the navigation menu and select OASIS-print. In the Select Datasets section click on Cell Lines (CCLE) (A). In the Entries with following IDs text box type ERBB2, ESR1, PIK3CA (B). Filter the cell lines by cancer type by selecting breast in the Cancer type filter (C) and click on the Go button (D).

The OASIS-print results page is divided into two sections. The upper section provides a graphical summary of the data where each column of dots represents a single sample and each row represents a gene. Alterations are color-coded and color gradients are used to represent the alteration values, with darker colors corresponding to higher values and lighter colors corresponding to lower values. The lower section provides the query results in a tabular format where each row represents a sample and each column represents a gene and alteration type. Values in each table cell are also color-coded to highlight altered samples. Both the graphic display and the results table are fully interactive.

From this analysis we can identify breast cancer cell lines that are HER2 positive, ER negative and wild-type for PIK3CA, such as AU565, HCC1569 and SKBR3.
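The same selection can be expressed as a filter over a downloaded results table. The following Python sketch is only an illustration: the record fields, the cell line names and the ESR1 expression cutoff are hypothetical, and the copy number cutoff of 3.7 is taken from the amplification category in the Copy Number Analysis methods, with the "greater than or equal" comparison being an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CellLine:
    name: str
    erbb2_copy_number: float        # gene-level copy number
    esr1_tpm: float                 # RNA-Seq expression in TPM
    pik3ca_mutation: Optional[str]  # None means no mutation reported (wild-type)

def select_candidates(cell_lines: List[CellLine],
                      amp_cutoff: float = 3.7,
                      esr1_max_tpm: float = 1.0) -> List[str]:
    """Names of lines with ERBB2 amplification, low ESR1 and wild-type PIK3CA.

    esr1_max_tpm is an illustrative assumption, not a value used by OASIS.
    """
    return [
        line.name
        for line in cell_lines
        if line.erbb2_copy_number >= amp_cutoff
        and line.esr1_tpm < esr1_max_tpm
        and line.pik3ca_mutation is None
    ]

# Made-up records for illustration only
panel = [
    CellLine("LINE_A", 8.2, 0.3, None),
    CellLine("LINE_B", 9.5, 0.1, "C420R"),
    CellLine("LINE_C", 2.0, 0.2, None),
    CellLine("LINE_D", 6.1, 45.0, None),
]
print(select_candidates(panel))
```

Only the first record satisfies all three conditions in this made-up panel.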

Hover the mouse over a single sample to obtain a pop-up box showing the cell line name, copy number and expression values. Click on the cell line name to open a Sample report containing more information on the cell line of interest.

Click on the Mut. column for PIK3CA to sort cell lines in the table and graphical views by the amino acid mutation value. We can identify two cell lines (EFM192A and JIMT1) harboring the PIK3CA C420R mutation and also ERBB2 amplification and over-expression.



Use Case 3: Selecting cell lines based on multi-gene omics data

This analysis vignette illustrates how to use the OASIS Treemap feature to access annotated mutations in Lung Adenocarcinoma cell lines from CCLE. The Data summary consists of two treemaps, the upper one showing primary tumor data classified by cancer types and the lower one showing cell line data classified by projects as the first level of the hierarchy. First, click on the Cancer Cell Lines (CCLE) dataset in the lower Treemap (A) to enter the next level - cancer type. Here CCLE cell line data is classified and color-coded by different cancer types. Click on Lung LUAD (A) to obtain information about available data types in lung adenocarcinoma cell lines.


You can see there are 65 unique Lung adenocarcinoma cell lines available in CCLE, with mutation data reported for 55 cell lines, expression (microarray) data for 62 cell lines, CNV data for 58 cell lines and RNA-Seq expression data for 53 cell lines. To access the mutation alteration report, click on the box that says Mutation: 55 (A). The mutation alteration report provides a table where each row represents a single mutation event with information about sample name, gene name, amino acid change, cancer gene status classification (from the Cancer Gene Census), mutation type and mutation consequence. Click on the Cancer gene status field (A) to re-order rows.

Mutations affecting oncogenes are now shown at the top of the list. Due to the large data size and performance considerations, OASIS caps the size of the Alteration report table at 1,000 rows. To obtain the full list of mutations, click on the Download data button (A).


As illustrated by the analysis vignettes, the following analyses and functionalities are only available in OASIS:

- Use case 1, Step (1) Pan-cancer report: In a single interface, obtain a panoramic view of multiple genetic alterations affecting multiple genes, then drill into detailed alteration data for one of the genes and learn about important gene characteristics such as small molecule druggability and oncogene status.
- Use case 1, Step (2) Copy number Bar plot: After the first analysis creates the plot, users can flexibly select a subset of data points including samples or genes to obtain detailed data and summary statistics. With another click, users can launch additional analyses on any of the selected data points.
- Use case 1, Step (5) Volcano plot: Identifying CDK4 as one of the differentially expressed cancer genes in tumor vs. normal tissues in Lung Squamous Cell Carcinoma (TCGA).
- Use case 1, Step (6) OASIS-print: Identifying cell lines with the highest levels of CDK4 amplification by sorting the cell lines by copy number values while reviewing the genetic alteration statuses of other genes such as RB1. The graphic view is rearranged based on sorting of molecular data such as mutation status or expression value in the table view.
- Use case 2, Step (2) Gene expression Box plot: Comparing gene expression (e.g. CDK4, ERBB2) derived from RNA-Seq data in tumors vs. adjacent normal samples across multiple cancer types using TCGA data. Integrate expression data from TCGA tumors and GTEx normal tissues when TCGA does not have normal samples for a cancer type such as Glioblastoma or Ovarian cancer.
- Use case 2, Step (3) Scatter plot: In the same plot, users can select one or multiple cancer types and visualize correlation patterns by clicking on corresponding categories in the legend.
- Use case 3, Treemap: Access and browse any omics data or sample annotation in OASIS using three or fewer clicks from the Home page.

Use Case 4: Downloading OASIS-Genomics source code

The source code for the OASIS-Genomics web application is available as a git repository from Sourceforge.net. The code is distributed in 3 different folders:

BioMart: Contains the code corresponding to the web server and the main components of the front end. This is the customized code from BioMart (instructions on how to configure and run the server can be found at the BioMart website).
Database: SQL code and schema design necessary to implement the OASIS-genomics database/warehouse back end. The current version of the code is designed to run on Oracle 11.
Oasiswidgets: Code for some of the custom-built functionalities in the OASIS-genomics web portal, including code for OASIS-print, Pan-cancer report, and mutation summary.

To access the code repository click on the following link or copy and paste it into your web browser: From the main page, click on the Browse Code button:


Once on the Git repository page, select the type of download/protocol you want to use from the top left options (e.g. HTTP). Once the protocol has been selected, copy the text in the access text box on the top right. On Windows, you can paste the copied text into your preferred git tool. The example shown uses TortoiseGit.

From Linux/Unix you can paste the text into the terminal and download the code into the selected folder:

Use Case 5: Programmatic Access API

OASIS provides direct programmatic access to all the data stored within the database through the built-in BioMart API. The BioMart API consists of four parts: REST API, SOAP API, SPARQL API and Java API. All four APIs have access to the same methods, so you can choose the one you are most comfortable with. More information on the BioMart API and how to use it can be found in the BioMart documentation.

(1) Use of Web service to retrieve the data used to build the following visualization:

ene_ensembl_oasis18brcatcga%2chsapiens_gene_ensembl_oasis20coadtcga%2chsapiens_gene_ensembl_oasis40stadtcga%2chsapiens_gene_ensembl_oasis30lihctcga%2chsapiens_gene_ensembl_oasis31luadtcga%2chsapiens_gene_ensembl_oasis32lusctcga%2chsapiens_gene_ensembl_oasis33ovtcga&hsapiens_gene_ensembl int_cnv_exp dm gene_symbol=met&hsapiens_gene_ensembl int_cnv_exp dm expression_type=rsem&preview=true

(2) REST Access example: Copy and paste the following code into a text file:

<!DOCTYPE Query>
<Query client="webbrowser" processor="tsv" limit="-1" header="1">
<dataset name="hsapiens_gene_ensembl_oasis18brcatcga,hsapiens_gene_ensembl_oasis20coadtcga,hsapiens_gene_ensembl_oasis40stadtcga,hsapiens_gene_ensembl_oasis30lihctcga,hsapiens_gene_ensembl_oasis31luadtcga,hsapiens_gene_ensembl_oasis32lusctcga,hsapiens_gene_ensembl_oasis33ovtcga" config="gene_ensembl_config_3_1_1">
<filter name="hsapiens_gene_ensembl int_cnv_exp dm gene_symbol" value="met"/>
<filter name="hsapiens_gene_ensembl int_cnv_exp dm expression_type" value="rsem"/>
<attribute name="hsapiens_gene_ensembl int_cnv_exp dm gene_symbol"/>
<attribute name="hsapiens_gene_ensembl int_cnv_exp dm log_ratio"/>
<attribute name="hsapiens_gene_ensembl int_cnv_exp dm normalized_expression_level"/>
<attribute name="hsapiens_gene_ensembl int_cnv_exp dm specimen_id"/>
<attribute name="hsapiens_gene_ensembl int_cnv_exp dm cancer_type"/>
<attribute name="hsapiens_gene_ensembl int_cnv_exp dm specimen_origin"/>
<attribute name="hsapiens_gene_ensembl int_cnv_exp dm exp_outlier_status"/>
<attribute name="cancertype"/>
<attribute name="hsapiens_gene_ensembl int_cnv_exp dm expression_type"/>
<attribute name="hsapiens_gene_ensembl int_cnv_exp dm copy_number"/>
<attribute name="sampledataset"/>
</dataset>
</Query>

In a terminal window run the following command, where example_file.xml is the name of the file in which the code above was saved.

% curl --data-urlencode query@example_file.xml
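A query document of this shape can also be assembled programmatically rather than edited by hand. The Python sketch below builds the same element structure (Query, dataset, filter, attribute) with the standard library. The filter and attribute names passed in the example are shortened placeholders, not the full names used by OASIS, and the function name build_query is hypothetical.

```python
import xml.etree.ElementTree as ET

def build_query(datasets, config, filters, attributes,
                processor="tsv", limit=-1, header=True):
    """Return a BioMart-style query document as a string.

    datasets:   list of dataset names, joined with commas as in the example above
    filters:    mapping of filter name -> value
    attributes: list of attribute names to return
    """
    root = ET.Element("Query", {
        "client": "webbrowser",
        "processor": processor,
        "limit": str(limit),
        "header": "1" if header else "0",
    })
    dataset = ET.SubElement(root, "dataset", {
        "name": ",".join(datasets),
        "config": config,
    })
    for name, value in filters.items():
        ET.SubElement(dataset, "filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(dataset, "attribute", {"name": name})
    return "<!DOCTYPE Query>" + ET.tostring(root, encoding="unicode")

# Placeholder filter/attribute names; substitute the full names from the query above
query_xml = build_query(
    datasets=["hsapiens_gene_ensembl_oasis18brcatcga",
              "hsapiens_gene_ensembl_oasis33ovtcga"],
    config="gene_ensembl_config_3_1_1",
    filters={"gene_symbol": "met", "expression_type": "rsem"},
    attributes=["gene_symbol", "log_ratio", "normalized_expression_level"],
)
print(query_xml)
```

The resulting string can be written to example_file.xml and submitted with the curl command shown above.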


Running the curl command prints the results to the terminal. To save the output data to a file, run the command below:

% curl --data-urlencode query@example_file.xml > output_file.txt
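Because the query requests the tsv processor with a header row, the saved output can be read as tab-separated text. The Python sketch below parses such text into a list of records. The column names and values in the sample are made up for illustration and may differ from the headers OASIS actually returns.

```python
import csv
import io

def parse_tsv(text, numeric_columns=()):
    """Parse tab-separated text with a header row into a list of dicts.

    Columns listed in numeric_columns are converted to float;
    empty or missing values become None.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for row in reader:
        for column in numeric_columns:
            value = row.get(column)
            row[column] = float(value) if value not in (None, "") else None
        rows.append(row)
    return rows

# Made-up sample in the shape of a TSV response with a header line
sample_output = (
    "gene_symbol\tspecimen_id\tlog_ratio\tnormalized_expression_level\n"
    "MET\tSAMPLE_1\t0.12\t35.4\n"
    "MET\tSAMPLE_2\t1.80\t210.0\n"
    "MET\tSAMPLE_3\t\t12.5\n"
)
rows = parse_tsv(sample_output,
                 numeric_columns=("log_ratio", "normalized_expression_level"))
for row in rows:
    print(row["specimen_id"], row["log_ratio"], row["normalized_expression_level"])
```

To read a saved file instead, pass the contents of output_file.txt to parse_tsv.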
