Predicting clinical outcomes in neuroblastoma with genomic data integration

Similar documents
Multi-omics integration for neuroblastoma clinical endpoint prediction

R2: web-based genomics analysis and visualization platform

Augmented Medical Decisions

Predicting Breast Cancer Survival Using Treatment and Patient Factors

Colon cancer subtypes from gene expression data

Models of cell signalling uncover molecular mechanisms of high-risk neuroblastoma and predict outcome

Author's response to reviews

Predicting Kidney Cancer Survival from Genomic Data

Chromosome 1p and 11q Deletions and Outcome in Neuroblastoma

Nature Medicine: doi: /nm.3967

Deep Learning Analytics for Predicting Prognosis of Acute Myeloid Leukemia with Cytogenetics, Age, and Mutations

Predicting Diabetes and Heart Disease Using Features Resulting from KMeans and GMM Clustering

Case Studies on High Throughput Gene Expression Data Kun Huang, PhD Raghu Machiraju, PhD

Bayesian Prediction Tree Models

Evaluation of public cancer datasets and signatures identifies TP53 mutant signatures with robust prognostic and predictive value

Gene Selection for Tumor Classification Using Microarray Gene Expression Data

The 16th KJC Bioinformatics Symposium Integrative analysis identifies potential DNA methylation biomarkers for pan-cancer diagnosis and prognosis

Automated Estimation of mts Score in Hand Joint X-Ray Image Using Machine Learning

Integrative analysis of survival-associated gene sets in breast cancer

Integration of multiple types of genetic markers for neuroblastoma may contribute to improved prediction of the overall survival

SUPPLEMENTARY INFORMATION

Surgical Outcomes of Patients with Neuroblastoma in a Tertiary Centre in Hong Kong: A 12-year Experience

TITLE: A Data-Driven Approach to Patient Risk Stratification for Acute Respiratory Distress Syndrome (ARDS)

SUPPLEMENTARY APPENDIX

Application of Artificial Neural Network-Based Survival Analysis on Two Breast Cancer Datasets

Identifying Thyroid Carcinoma Subtypes and Outcomes through Gene Expression Data Kun-Hsing Yu, Wei Wang, Chung-Yu Wang

SNPrints: Defining SNP signatures for prediction of onset in complex diseases

RNA-seq: filtering, quality control and visualisation. COMBINE RNA-seq Workshop

L. Ziaei MS*, A. R. Mehri PhD**, M. Salehi PhD***

Multi-Stage Stratified Sampling for the Design of Large Scale Biometric Systems

An Improved Algorithm To Predict Recurrence Of Breast Cancer

High-fidelity Imaging Response Detection in Multiple Sclerosis

Predicting Non-Small Cell Lung Cancer Diagnosis and Prognosis by Fully Automated Microscopic Pathology Image Features

An Efficient Diseases Classifier based on Microarray Datasets using Clustering ANOVA Extreme Learning Machine (CAELM)

Supplemental Table 1 Molecular Profile of the SCLC Cell Line Panel

Feature selection methods for early predictive biomarker discovery using untargeted metabolomic data

CLASSIFICATION OF BREAST CANCER INTO BENIGN AND MALIGNANT USING SUPPORT VECTOR MACHINES

Expanded View Figures

Automatic Lung Cancer Detection Using Volumetric CT Imaging Features

Assessing Functional Neural Connectivity as an Indicator of Cognitive Performance *

38 Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'16

Nature Neuroscience: doi: /nn Supplementary Figure 1. Behavioral training.

CNV PCA Search Tutorial

Yingying Wei George Wu Hongkai Ji

SSM signature genes are highly expressed in residual scar tissues after preoperative radiotherapy of rectal cancer.

Integrated Analysis of Copy Number and Gene Expression

Introduction to Discrimination in Microarray Data Analysis

MR-Radiomics in Neuro-Oncology

Sound Texture Classification Using Statistics from an Auditory Model

Supplemental Information For: The genetics of splicing in neuroblastoma

CoINcIDE: A framework for discovery of patient subtypes across multiple datasets

Cancer Informatics Lecture

The Good News. More storage capacity allows information to be saved Economic and social forces creating more aggregation of data

Feature Vector Denoising with Prior Network Structures. (with Y. Fan, L. Raphael) NESS 2015, University of Connecticut

Utilizing Posterior Probability for Race-composite Age Estimation

Supplementary. properties of. network types. randomly sampled. subsets (75%

Journal: Nature Methods

Applying One-vs-One and One-vs-All Classifiers in k-nearest Neighbour Method and Support Vector Machines to an Otoneurological Multi-Class Problem

Selection of Patient Samples and Genes for Outcome Prediction

Nature Structural & Molecular Biology: doi: /nsmb.2419

Chapter 4 Cellular Oncogenes ~ 4.6 -

ARTICLE RESEARCH. Macmillan Publishers Limited. All rights reserved

Good Old clinical markers have similar power in breast cancer prognosis as microarray gene expression profilers q

International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 1, Jan Feb 2017

Invasive breast cancer: stratification of histological grade by gene-based assays: a still relevant example from an older data set

EMOTION CLASSIFICATION: HOW DOES AN AUTOMATED SYSTEM COMPARE TO NAÏVE HUMAN CODERS?

Genetic alterations of histone lysine methyltransferases and their significance in breast cancer

Methods for Predicting Type 2 Diabetes

ncounter Assay Automated Process Immobilize and align reporter for image collecting and barcode counting ncounter Prep Station

ECG Beat Recognition using Principal Components Analysis and Artificial Neural Network

Supplementary Online Content

Biomarker adaptive designs in clinical trials

J Clin Oncol 29: by American Society of Clinical Oncology INTRODUCTION

EEG Features in Mental Tasks Recognition and Neurofeedback

EXPression ANalyzer and DisplayER

Nature Genetics: doi: /ng Supplementary Figure 1. HOX fusions enhance self-renewal capacity.

Lecture Outline Biost 517 Applied Biostatistics I

Statistics 202: Data Mining. c Jonathan Taylor. Final review Based in part on slides from textbook, slides of Susan Holmes.

Case-based reasoning using electronic health records efficiently identifies eligible patients for clinical trials

The Tail Rank Test. Kevin R. Coombes. July 20, Performing the Tail Rank Test Which genes are significant?... 3

Evidence Based Medicine

Published Ahead of Print on October 3, 2011 as /JCO J Clin Oncol by American Society of Clinical Oncology INTRODUCTION

Data Mining in Bioinformatics Day 4: Text Mining

Big Image-Omics Data Analytics for Clinical Outcome Prediction

Predictive Assays in Radiation Therapy

Comparison of Triple Negative Breast Cancer between Asian and Western Data Sets

A Quick-Start Guide for rseqdiff

I SPY 2 TRIAL. How it Works April 9, 2013

Research Article Breast Cancer Prognosis Risk Estimation Using Integrated Gene Expression and Clinical Data

EECS 433 Statistical Pattern Recognition

Risk-prediction modelling in cancer with multiple genomic data sets: a Bayesian variable selection approach

Nature Genetics: doi: /ng Supplementary Figure 1. SEER data for male and female cancer incidence from

R2 Training Courses. Release The R2 support team

Survival Prediction Models for Estimating the Benefit of Post-Operative Radiation Therapy for Gallbladder Cancer and Lung Cancer

Clustering with Genotype and Phenotype Data to Identify Subtypes of Autism

The Statistical Analysis of Failure Time Data

Supplement to SCnorm: robust normalization of single-cell RNA-seq data

Supplementary Table S1. List of PTPRK-RSPO3 gene fusions in TCGA's colon cancer cohort. Chr. # of Gene 2. Chr. # of Gene 1

RASA: Robust Alternative Splicing Analysis for Human Transcriptome Arrays

A COMBINATORY ALGORITHM OF UNIVARIATE AND MULTIVARIATE GENE SELECTION

Transcription:

Predicting clinical outcomes in neuroblastoma with genomic data integration Ilyes Baali, 1 Alp Emre Acar 1, Tunde Aderinwale 2, Saber HafezQorani 3, Hilal Kazan 4 1 Department of Electric-Electronics Engineering, Antalya International University, Antalya, Turkey 2 Institute of Applied Sciences, Antalya International University, Antalya, Turkey 3 Graduate School of Informatics, Department of Health Informatics, Middle East Technical University, Ankara, Turkey 4 Department of Computer Engineering, Antalya International University, Antalya, Turkey These authors have contributed equally to this work. To whom correspondence should be addressed; E-mail: hilal.kazan@antalya.edu.tr Neuroblastoma is a heterogeneous disease with diverse clinical outcomes. Recently collected genome-wide datasets provide opportunities to infer neuroblastoma subtypes more accurately than existing classification of risk groups. To this end, we used machine learning techniques to predict overall survival and event-free survival profiles of patients. Using the model that we trained on SEQC cohort, we can predict patient survival in an independent cohort with high accuracy (AUROC: 0.96) indicating the applicability of the model to different datasets. Additionally, we used unsupervised learning techniques that can effectively integrate multiple high-dimensional datasets to identify subgroups of patients with distinct survival profiles after stratification based on MYCN expression. These subgroups can improve treatment stratification of neuroblastoma patients. Introduction Neuroblastoma is the second most common solid tumor in childhood. The disease can have a large variety of clinical outcomes ranging from spontaneous regression to relentless progression despite extensive therapies. Chromosomal amplification of the MYCN locus occurs in 25 of all neuroblastomas and is associated with poor prognosis (1). Apart from MYCN amplification, a limited set of additional variables such as age at diagnosis, stage of disease etc. are used to stratify patients into distinct risk groups. Recent progress on high-throughput technologies enables the collection of genome-wide measurements across large set of patients in cohorts. We utilized the diverse data types provided by the SEQC cohort (i.e., neuroblastoma challenge in CAMDA 2017) to develop statistical 1

models that can predict clinical outcomes in neuroblastoma. Also, we employed an unsupervised learning strategy to identify subgroups that have significantly diverse survival profiles. Results Validation of the SEQC model on an independent cohort We first performed supervised learning using support vector machines (SVM) within the SEQC dataset. The mean cross-validation accuracy of this model is close to the best accuracy reported for the same dataset (2). We then used our model trained on SEQC data to predict overall survival (i.e., occurrence of death from disease) and event-free survival (i.e., occurrence of progression, relapse or death) profiles of patients in an independent cohort that is called Versteeg dataset hereafter. This dataset includes the geneexpression measurements and clinical data for 88 patients. Figure 1a and 1b shows the ROC curve for predicting overall survival and event-free survival profiles, respectively. Figure 1: a) ROC curve for predicting overall survival profiles of patients in Versteeg dataset. b) ROC curve for predicting event-free survival profiles of patients in Versteeg dataset. Clustering patients into subgroups with multi-view kernel k-means MYCN expression is one of the best predictors of survival in neuroblastoma. However, there is still some variability in survival profiles after stratification based on MYCN expression. We aimed to explain this remaining variability with RNA-seq data. As such, we further clustered the patients that have low and high MYCN expression (N= 90 and N=408 respectively). The threshold to split the patients based on MYCN expression was chosen to minimize the log-rank test p-value from Kaplan-Meier analysis. As expected, there is a high overlap between the patients with high MYCN expression and patients with MYCN amplification. We used multi-view kernel k-means (MKKM) to integrate two versions of the same RNA-seq dataset that are processed in different ways (See Methods). 2

The number of clusters was selected as 2 based on mean silhouette score. The weights of the views for the two versions of the RNA-seq data were 0.43 and 0.57 (i.e., MAV and RPM processing of RNA-seq data) for patients with low MYCN expression. Similarly, the view weights were 0.46 and 0.54 when clustering the patients with high MYCN expression. These weights give better silhouette scores over uniform weights indicating the advantage of using multi-view kernel k-means. Also, we used all the available features rather than going through an initial feature selection procedure. Figure 2a and 2b plot the survival curves of identified clusters from patients with low and high MYCN expression, respectively. Log-rank test gives p-values of 0.06 and 0, respectively. We also tried including the microarray data in addition to RNA-seq datasets; however this did not improve the silhouette score and the log-rank test p-value. We re- Figure 2: a) Survival curves of MKKM-inferred subgroups identified within the patients with high MYCN expression. Log-rank test p-value is 0.06. b) Survival curves of MKKM-inferred subgroups identified within the patients with low MYCN expression. Log-rank test p-value is 0. peated the same analysis with the subset of the patients (145 patients) for which acgh data is available. We again split the patients into two groups based on MYCN expression. Because the group of patients with high MYCN expression is small (N=24), we only cluster the patients with low MYCN expression (N = 121). Figure 3 shows the survival curves of identified clusters for patients with low MYCN expression. The difference between Figure3a and 3b is due to the inclusion of acgh data. Namely, including acgh data in addition to RNA-seq datasets results in a lower p-value (0.07 vs 0.0008). Methods Datasets We downloaded the RNA-seq, microarray and acgh datasets for the SEQC cohort from CAMDA website. We used two versions of the RNA-seq data: SEQC NB MAV G log2.txt 3

Figure 3: a) Survival curves of MKKM-inferred subgroups identified within the patients with low MYCN expression. RNA-seq and acgh dataseta are combined. Log-rank test p-value is 0.0008. View weights are 0.45 (MAV) and 0.55 (RPM). b) Survival curves of MKKM-inferred subgroups identified within the patients with low MYCN expression. Only RNA-seq datasets are used. Log-rank test p-value is 0.07. View weights are 0.28 (MAV), 0.34 (RPM) and 0.37 (acgh). downloaded from CAMDA website and GSE62564 SEQC NB RNASeq log2rpm.txt downloaded from GEO website for entry GSE62564. We downloaded the Versteeg dataset from R2 database (http:// r2.amc.nl). The GEO accession number for Versteeg microarray data is GSE16476 (3). We used the survival package in R to perform the Kaplan-Meier analysis (4). Machine Learning Methods We used the SVC function available in Python s scikit-learn library for prediction on Versteeg dataset. Linear and RNF kernels were used to train the OS and EFS models respectively. The best gamma parameter for the RBF kernel and C parameter was set with cross-validation. For multi-view kernel k-means we used the code provided in the paper by Gonen et al (5). Discussion The availability of genome-wide datasets for cancer patients have increased rapidly in recent years. Methods that can effectively integrate these datasets can improve our understanding of cancer development and progression. To this end, we used supervised and unsupervised learning strategies to predict patient survival in neuroblastoma. Our supervised model can accurately predict overall survival and event-free survival profiles of neuroblastoma patients in an independent cohort. We also inferred subgroups that have distinct survival rates by combining multiple multi-dimensional datasets. Including 4

microarray data in addition to RNA-seq datasets provided no improvement in silhouettescore or log-rank test p-value indicating redundancy between the two datasets. However, including acgh data in addition to RNA-seq datasets have improved the silhouette-score and resulted in a lower log-rank test p-value. Overall, our results suggest that integration of multi-modal datasets can improve subtype definition in cancer. References 1. RC Seeger, GM Brodeur, H Sather, A Dalton, SE Siegel, KY Wong, and D Hammond. Association of multiple copies of the n-myc oncogene with rapid progression of neuroblastomas. N. Engl. J. Med., 483:589, 2012. 2. F Hertwig J Thierry-Mieg W Zhang et. al. W Zhang, Y Yu. Comparison of rna-seq and microarray-based models for clinical endpoint prediction. Genome Biology, 16:133, 2015. 3. JJ Molenaar, J Kosterand, DA Zwijnenburg, and P van Sluis et. al. Sequencing of neuroblastoma identifies chromothripsis and defects in neuritogenesis. Nature, 483:589, 2012. 4. TM Therneau. A Package for Survival Analysis in S, 2015. version 2.38. 5. M Gonen and AA Margolin. Localized data fusion for kernel k-means clustering with application to cancer biology. Advances in Neural Information Processing Systems, 2014. 5