REINVENTING THE BIOMARKER PANEL DISCOVERY EXPERIENCE


Biomarker discovery has opened new realms in the medical industry, from patient diagnosis and treatment to drug development and testing. However, through these advances the capacity to discover biomarker panels has often been constrained by the methodologies employed.

Current approaches to biomarker panel discovery

A number of different machine learning, clustering and statistical approaches can be used for biomarker selection, including traditional methods such as top scoring pair (TSP), decision trees (DT), naïve Bayes (NB), prediction analysis of microarrays (PAM), support vector machines (SVM) and others. But these traditional methods can be difficult to interpret, use many biomarkers, and yield low accuracy, sensitivity and specificity. For the medical industry, from diagnostics companies to pharmaceutical developers to labs, this translates into a costly process and a harder path through regulatory approval.

One weakness of traditional biomarker discovery techniques is the univariate approach. Testing individual biomarkers one at a time is not only cumbersome and costly; it also neglects the complex, interrelated nature of those markers. Capturing the relationships between multiple biomarkers allows a more nuanced and precise evaluation, one that takes into account the interactions between potential biomarkers in determining patient outcomes.

Another weakness of traditional biomarker discovery is the constraints of the statistical techniques typically employed. These methods carry numerous assumptions, which can limit how much of the information embedded in the data is recovered and cloud the results.
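The point about one-at-a-time testing can be illustrated with a small sketch. The data below are entirely made up (two binary markers and an outcome that depends on their interaction), and the rules are hypothetical; the example only shows why a marker that looks uninformative on its own can still matter in a panel.

```python
# Hypothetical toy data: two binary markers (0 = low, 1 = high) and an outcome
# that depends on their interaction (disease when exactly one marker is high).
samples = [
    # (marker_a, marker_b, disease)
    (0, 0, 0),
    (0, 1, 1),
    (1, 0, 1),
    (1, 1, 0),
] * 5  # repeat to mimic a small cohort of 20 patients


def accuracy(predict, data):
    """Fraction of samples for which predict(a, b) equals the true label."""
    hits = sum(1 for a, b, label in data if predict(a, b) == label)
    return hits / len(data)


# Univariate rules: each looks at a single marker in isolation.
acc_a = accuracy(lambda a, b: a, samples)
acc_b = accuracy(lambda a, b: b, samples)

# Multivariate rule: uses the relationship between the two markers.
acc_joint = accuracy(lambda a, b: int(a != b), samples)

print(f"marker A alone:  {acc_a:.2f}")
print(f"marker B alone:  {acc_b:.2f}")
print(f"A and B jointly: {acc_joint:.2f}")
```

In this contrived case each marker alone is no better than chance, while the pair classifies every sample correctly. Real expression data are rarely this clean, but the same effect is what a multivariate search is meant to exploit.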
The SimplicityBio Biomarker Optimization Software System

A new multivariate approach to biomarker discovery has emerged to resolve these weaknesses. Using SimplicityBio's proprietary Biomarker Optimization Software System (BOSS), we are able to balance accuracy against the number of biomarkers used. At its core is the co-evolutionary fuzzy modeling method Fuzzy CoCo [1]. Around this method, several steps are performed to select the best combination of biomarkers. BOSS performs two phases:

1st. Exploratory modeling: Potential signatures are created by testing billions of panels of biomarkers with Fuzzy CoCo. Fuzzy CoCo uses an artificial evolution approach, which allows populations of signatures to evolve, mate, and migrate, with only the most robust signatures surviving at the end.

2nd. Signature selection: A reduced number of signatures are selected. This family of signatures spans several characteristics, so some signatures are more specific, more sensitive, or use fewer variables than others. The final selection is made taking into account the needs of the client.
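The signature-selection phase can be pictured roughly as follows. This is a minimal sketch, not the BOSS implementation: the `Signature` class, the dominance filter, the client limits (`max_variables`, `min_sensitivity`) and all numbers are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    name: str
    sensitivity: float
    specificity: float
    n_variables: int


def dominates(a, b):
    """True if a is at least as good as b on every criterion and better on one."""
    no_worse = (a.sensitivity >= b.sensitivity
                and a.specificity >= b.specificity
                and a.n_variables <= b.n_variables)
    better = (a.sensitivity > b.sensitivity
              or a.specificity > b.specificity
              or a.n_variables < b.n_variables)
    return no_worse and better


def pareto_family(candidates):
    """Keep only signatures that no other candidate dominates."""
    return [s for s in candidates
            if not any(dominates(o, s) for o in candidates)]


def pick_for_client(family, max_variables, min_sensitivity):
    """Among signatures meeting the client's limits, prefer the highest
    specificity, breaking ties by fewer variables. None if nothing qualifies."""
    eligible = [s for s in family
                if s.n_variables <= max_variables
                and s.sensitivity >= min_sensitivity]
    if not eligible:
        return None
    return max(eligible, key=lambda s: (s.specificity, -s.n_variables))


# Made-up candidates, for illustration only.
candidates = [
    Signature("S1", 0.98, 0.90, 12),
    Signature("S2", 0.92, 0.97, 9),
    Signature("S3", 0.90, 0.90, 3),
    Signature("S4", 0.90, 0.88, 5),  # dominated by S3
]

family = pareto_family(candidates)
choice = pick_for_client(family, max_variables=10, min_sensitivity=0.90)
print([s.name for s in family], choice.name)
```

The family keeps every signature that is best on some trade-off, and the final choice depends only on which limits the client sets; a different client could pick S1 or S3 from the same family.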

The success of BOSS lies in its ability to minimize the number of rules and variables used in multivariate signatures while maintaining high accuracy, sensitivity and specificity. The method yields a family of models, from which one can be chosen to meet the specific needs of the client. Reducing the number of rules and variables in each of the family's signatures reduces testing costs, both on the development end and on the consumer end. A cleaner, more concise model can also aid developers in navigating the regulatory approval process.

Testing SimplicityBio's Biomarker Optimization Software

To test its efficacy, BOSS was compared with other biomarker discovery methods and platforms such as TSP, k-TSP, DT, NB, K-NN, PAM, SVM, MOE, Bagging C4.5, AdaBoost C4.5, KEM Biomarker from Ariana Pharma, AHC, Single C4.5, fSVM, and Fuzzy Logic on six seminal, published datasets.

In these comparisons, BOSS generally uses as few or fewer variables than the other methods while matching or exceeding their accuracy. Across the six datasets, BOSS achieved an accuracy of 94.14% or higher, which met or exceeded the accuracy of every other method it was compared to. But the key to BOSS's superiority is not just its accuracy; it is its ability to constrain the number of variables in each model.

LEUKEMIA (Golub et al. [2])

Includes 38 observations, each of which is described by the gene expression levels of 7,129 genes and a class attribute with two distinct labels: acute myeloid leukemia and acute lymphoblastic leukemia.

Method        Accuracy   Variables   Source
BOSS          100.00%    2           SimplicityBio [1]
NB            100.00%    *           Tan et al. [8]
SVM           100.00%    8           Guyon et al. [9]
PAM           97.22%     2296        Tan et al. [8]
k-TSP         95.83%     18          Tan et al. [8]
K-NN          84.82%     *           Tan et al. [8]
Fuzzy logic   79.00%     2           Ohno-Machado et al. [10]
DT            73.81%     2           Tan et al. [8]
In the comparison using the leukemia dataset, BOSS achieved an accuracy of 100% using 2 variables. SVM, another method that achieved this level of accuracy, used 8 variables. The other methods that used only 2 variables, Fuzzy logic and DT, achieved accuracies of only 79.00% and 73.81% respectively.
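To give a sense of what a signature with only 2 variables and a single rule can look like, here is a minimal fuzzy-rule sketch. It is not the leukemia signature found by BOSS and not the Fuzzy CoCo implementation: the variable names `gene_x` and `gene_y`, the cut points, the ramp-shaped membership functions and the class labels are all hypothetical.

```python
def ramp_up(x, lo, hi):
    """Membership in 'high': 0 below lo, 1 above hi, linear in between."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


def ramp_down(x, lo, hi):
    """Membership in 'low': the complement of ramp_up."""
    return 1.0 - ramp_up(x, lo, hi)


def classify(gene_x, gene_y):
    """Single fuzzy rule: IF gene_x is high AND gene_y is low THEN class A.

    AND is taken as the minimum of the two memberships. If the rule fires
    with strength above 0.5 the sample is labelled 'A'; otherwise the
    default class 'B' applies.
    """
    x_high = ramp_up(gene_x, lo=200.0, hi=600.0)   # hypothetical cut points
    y_low = ramp_down(gene_y, lo=100.0, hi=500.0)  # hypothetical cut points
    strength = min(x_high, y_low)
    return ("A" if strength > 0.5 else "B"), strength


print(classify(gene_x=700.0, gene_y=150.0))
print(classify(gene_x=300.0, gene_y=150.0))
```

A model of this shape can be read aloud as one sentence, which is the interpretability argument for keeping the number of rules and variables small.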

COLON CANCER (Alon et al. [3])

Includes 62 observations made up of 40 tumor samples and 22 normal samples. There are approximately 6,000 genes represented in each sample in the dataset.

Method        Accuracy   Variables   Source
BOSS          94.14%     27          SimplicityBio
TSP           91.10%     2           Tan et al. [8]
k-TSP         90.30%     2           Tan et al. [8]
Fuzzy logic   90.00%     17          Huerta et al. [11]
PAM           85.48%     15          Tan et al. [8]
SVM           82.26%     *           Tan et al. [8]
DT            80.65%     3           Tan et al. [8]
K-NN          74.19%     *           Tan et al. [8]
NB            58.06%     *           Tan et al. [8]

Despite using more variables, BOSS outperforms the other methods in terms of accuracy.

PROSTATE CANCER (Singh et al. [4])

Includes 52 prostate tumor samples and 50 non-tumor prostate samples with a total of 12,600 genes.

Method        Accuracy   Variables   Source
BOSS          97.29%     2           SimplicityBio
TSP           95.00%     2           Tan et al. [8]
MC-SVM        92.00%     *           Statnikov et al. [12]
k-TSP         91.00%     2           Tan et al. [8]
PAM           91.00%     47          Tan et al. [8]
SVM           91.00%     *           Tan et al. [8]
NN            91.00%     *           Statnikov et al. [12]
DT            87.00%     4           Tan et al. [8]
K-NN          85.00%     *           Statnikov et al. [12]
NB            62.00%     *           Tan et al. [8]

BOSS achieves an accuracy of 97.29% for the prostate cancer dataset using 2 variables. The other methods that use only 2 variables, TSP and k-TSP, reach lower accuracies of 95.00% and 91.00% respectively.

LUNG CANCER (Gordon et al. [5])

Includes 181 tissue samples, 31 malignant pleural mesothelioma and 150 lung adenocarcinoma, with a total of 12,533 genes.

Method        Accuracy   Variables   Source
BOSS          100.00%    2           SimplicityBio
PAM           99.45%     15          Tan et al. [8]
SVM           99.45%     *           Tan et al. [8]
k-TSP         98.90%     2           Tan et al. [8]
K-NN          98.34%     *           Tan et al. [8]
TSP           98.30%     2           Tan et al. [8]
NB            97.79%     *           Tan et al. [8]
DT            96.13%     3           Tan et al. [8]
MOE           91.00%     2           Wang & Palade [13]

BOSS has the only 100% accuracy result among the methods tested on the lung cancer dataset, and it uses 2 variables. Three methods use the same number of variables (k-TSP, TSP, and MOE), but they have lower accuracies of 98.90%, 98.30% and 91.00% respectively. DT uses 3 variables and reaches 96.13%.

BREAST CANCER (van de Vijver et al. [6])

Includes 295 samples, 151 with lymph-node-negative disease and 144 with lymph-node-positive disease, with a total of 70 genes.

Method          Accuracy   Variables   Source
BOSS            95.83%     31          SimplicityBio
Bagging C4.5    89.47%     *           Tan & Gilbert [14]
AdaBoost C4.5   89.47%     *           Tan & Gilbert [14]
BOSS            87.50%     9           SimplicityBio
KEM Biomarker   85.89%     13          Guergova-Kuras et al. [15]
AHC             83.33%     70          van de Vijver et al. [6]
Single C4.5     63.16%     *           Tan & Gilbert [14]

Two signatures discovered by BOSS are presented here. The first has the highest accuracy (95.83%) but not the lowest number of variables (31). The second uses fewer variables (9) and still has a higher accuracy (87.50%) than KEM Biomarker (85.89%), which has the lowest number of variables (13) among the other methods.

OVARIAN CANCER (Zhou et al. [7])

Includes 94 samples, 44 from women diagnosed with serous papillary ovarian cancer and 50 from healthy women, with a total of 3,017 mass spectrometry signatures.

Method          Accuracy   Variables   Source
BOSS            100.00%    10          SimplicityBio [1]
fSVM            100.00%    3017        Zhou et al. [7]
KEM Biomarker   92.97%     13          Guergova-Kuras et al. [15]

The ovarian cancer dataset exemplifies the importance of reducing the number of variables used in modeling. While fSVM matches the 100% accuracy of BOSS, it uses roughly 300 times as many variables (3,017 versus 10).

As exemplified by these six datasets, BOSS consistently matches or exceeds the accuracy of every method tested, in most cases with lower or comparable numbers of variables. Where BOSS uses more variables than the most compact competing method, as in the colon cancer dataset (27 versus 2), the additional variables come with higher accuracy. When minimizing the number of variables is the goal, BOSS can still produce high accuracy, as the 9-variable breast cancer signature shows.

Summary

BOSS is the next stage in the evolution of biomarker discovery technology. The co-evolutionary engine behind BOSS continually drives discovery models toward more elegant, simple, and powerful solutions to better meet the needs of clients.

About SimplicityBio

SimplicityBio is a Swiss biomarker panel discovery company. SimplicityBio's Biomarker Optimization Software System (BOSS) allows you to take full advantage of multiple data types and unbalanced data sets, while answering your production, regulatory and IP requirements. To do so, BOSS discovers robust, highly specific and sensitive biomarker panels, leaving you to choose the one that answers your needs. BOSS brings a unique and powerful combination of machine learning, evolutionary algorithms and fuzzy logic to the biological world, and is thus able to discover new robust multi-biomarker panels and improve existing ones. Our clients and partners range from research institutions to diagnostic, companion diagnostic, prognostic and pharmaceutical companies.

Contact us:
Route de l'ile-aux-bois 1A
1870 Monthey
Switzerland
info@simplicitybio.com
Visit: www.simplicitybio.com

References:

[1] Barreto-Sanz, M. A., Bujard, A., & Pena-Reyes, C. A. (2012, November). Evolving very-compact fuzzy models for gene expression data analysis. In Bioinformatics & Bioengineering (BIBE), 2012 IEEE 12th International Conference on (pp. 356-361). IEEE.
[2] Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., ... & Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439), 531-537.
[3] Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., & Levine, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12), 6745-6750.
[4] Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., ... & Sellers, W. R. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2), 203-209.
[5] Gordon, G. J., Jensen, R. V., Hsiao, L. L., Gullans, S. R., Blumenstock, J. E., Ramaswamy, S., ... & Bueno, R. (2002). Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Research, 62(17), 4963-4967.
[6] Van De Vijver, M. J., He, Y. D., van't Veer, L. J., Dai, H., Hart, A. A., Voskuil, D. W., ... & Bernards, R. (2002). A gene-expression signature as a predictor of survival in breast cancer. New England Journal of Medicine, 347(25), 1999-2009.
[7] Zhou, M., Guan, W., Walker, L. D., Mezencev, R., Benigno, B. B., Gray, A., ... & McDonald, J. F. (2010). Rapid mass spectrometric metabolic profiling of blood sera detects ovarian cancer with high accuracy. Cancer Epidemiology Biomarkers & Prevention, 19(9), 2262-2271.
[8] Tan, A. C., Naiman, D. Q., Xu, L., Winslow, R. L., & Geman, D. (2005). Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics, 21(20), 3896-3904.
[9] Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3), 389-422.
[10] Ohno-Machado, L., Vinterbo, S., & Weber, G. (2002). Classification of gene expression data using fuzzy logic. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology, 12(1), 19-24.
[11] Huerta, E., Duval, B., & Hao, J. K. (2008). Fuzzy logic for elimination of redundant information of microarray data. Genomics, Proteomics & Bioinformatics, 6(2), 61-73.
[12] Statnikov, A., Aliferis, C. F., Tsamardinos, I., Hardin, D., & Levy, S. (2005). A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 21(5), 631-643.
[13] Wang, Z., & Palade, V. (2010, December). Multi-objective evolutionary algorithms based interpretable fuzzy models for microarray gene expression data analysis. In Bioinformatics and Biomedicine (BIBM), 2010 IEEE International Conference on (pp. 308-313). IEEE.
[14] Tan, A. C., & Gilbert, D. (2003). Ensemble machine learning on gene expression data for cancer classification.
[15] Guergova-Kuras, M., Schneider, M. P., Jullian, N., & Afshar, M. (2014). 667: Shorter multimarker signatures: a new tool to facilitate cancer diagnosis. European Journal of Cancer, (50), S160.

APPENDIX A

Acronym                            Technique or Platform
TSP                                Top scoring pair
k-TSP                              k-Top scoring pair
DT                                 C4.5 decision trees
NB                                 Naïve Bayes
K-NN                               K-nearest neighbor
PAM                                Prediction analysis of microarrays
SVM                                Support Vector Machines
MOE                                Multi-objective Evolutionary Algorithms and Fuzzy Logic
Bagging C4.5
AdaBoost C4.5
KEM Biomarker from Ariana Pharma   Knowledge Extraction and Management
AHC                                Agglomerative hierarchical clustering algorithm
Single C4.5
fSVM                               Functional Support Vector Machine
Fuzzy Logic
MC-SVM                             Multiclass support vector machine
BOSS                               Biomarker Optimization Software System