REINVENTING THE BIOMARKER PANEL DISCOVERY EXPERIENCE
Biomarker discovery has opened new realms in the medical industry, from patient diagnosis and treatment to drug development and testing. Throughout these advances, however, the capacity to discover biomarker panels has often been constrained by the methodologies employed.

Current approaches to biomarker panel discovery

A number of machine learning, clustering and statistical approaches can be used for biomarker selection, including traditional methods such as top scoring pair (TSP), decision trees (DT), naïve Bayes (NB), prediction analysis of microarrays (PAM), support vector machines (SVM) and others. These traditional methods can be difficult to interpret, use many biomarkers, and yield low accuracy, sensitivity and specificity. For the medical industry, from diagnostics companies to pharmaceutical developers to labs, this translates into a costly process and a harder path through regulatory approval.

One weakness of traditional biomarker discovery techniques is the univariate approach. Testing individual biomarkers one at a time is not only cumbersome and costly; it also neglects the complex, interrelated nature of those markers. Capturing the relationships between multiple biomarkers allows a more nuanced and precise evaluation, one that takes into account how potential biomarkers interact in determining patient outcomes.

Another weakness of traditional biomarker discovery lies in the statistical techniques typically employed. These methods carry numerous inherent assumptions, which can limit how much of the information embedded in the data is used and cloud the results.
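The point about interrelated markers can be illustrated with a small sketch. The data below is invented for illustration only (two hypothetical markers, `marker_a` and `marker_b`, each recorded as elevated or not) and is not taken from any of the datasets discussed in this paper. Each marker on its own predicts the outcome no better than chance, while a rule that looks at both together classifies every sample correctly.

```python
# Hypothetical toy data: (marker_a, marker_b, has_disease).
# Disease is present only when exactly one of the two markers is elevated.
samples = [
    (0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0),
    (0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0),
]

def accuracy(predict):
    """Fraction of samples for which predict(a, b) matches the label."""
    hits = sum(1 for a, b, label in samples if predict(a, b) == label)
    return hits / len(samples)

# Univariate rules: each one looks at a single marker.
acc_a = accuracy(lambda a, b: a)
acc_b = accuracy(lambda a, b: b)

# Multivariate rule: looks at the relationship between the two markers.
acc_ab = accuracy(lambda a, b: int(a != b))

print(acc_a, acc_b, acc_ab)  # → 0.5 0.5 1.0
```

Real biomarker interactions are rarely this clean, but the sketch shows why a marker that looks uninformative in a one-at-a-time screen can still belong in a useful panel.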
The SimplicityBio Biomarker Optimization Software System

A new multivariate approach to biomarker discovery has emerged to resolve these weaknesses. Using SimplicityBio's proprietary Biomarker Optimization Software System (BOSS), we are able to find the best balance between accuracy and the number of biomarkers. The core of BOSS is the co-evolutionary fuzzy modeling method Fuzzy CoCo [1]. Around this method, several steps are performed to select the best combination of biomarkers. BOSS performs two phases:

1st Exploratory-modeling: Potential signatures are created by testing billions of panels of biomarkers with Fuzzy CoCo. Fuzzy CoCo uses an artificial evolution approach, which allows populations of signatures to evolve, mate, and migrate, with only the most robust signatures surviving at the end.

2nd Signature-selection: A reduced number of signatures are selected. This family of signatures covers a range of characteristics, so some signatures may be more specific, more sensitive, or use fewer variables than others. The final selection is made taking into account the needs of the client.
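The general idea of populations that evolve, mate and migrate can be sketched with a generic island-model genetic algorithm. This is not Fuzzy CoCo and does not build fuzzy rules; it is a minimal illustration under stated assumptions. The marker count, the set of "informative" markers and the `fitness` function are all hypothetical stand-ins: in a real setting the fitness would come from evaluating a model built on each candidate panel.

```python
import random

N_MARKERS = 12
INFORMATIVE = {2, 5}        # hypothetical: the markers that truly matter
rng = random.Random(0)      # fixed seed so the run is repeatable

def fitness(panel):
    """Stand-in score: reward informative markers, penalise panel size."""
    found = len(INFORMATIVE & panel)
    return found / len(INFORMATIVE) - 0.02 * len(panel)

def random_panel():
    """Random candidate panel; each marker is included with probability 0.3."""
    return frozenset(i for i in range(N_MARKERS) if rng.random() < 0.3)

def mate(a, b):
    """Uniform crossover: each marker is inherited from one parent."""
    return frozenset(i for i in range(N_MARKERS)
                     if i in (a if rng.random() < 0.5 else b))

def mutate(panel):
    """Flip the membership of one randomly chosen marker."""
    return panel ^ {rng.randrange(N_MARKERS)}

def evolve(islands, generations=40, migrate_every=5):
    for gen in range(1, generations + 1):
        for k, pop in enumerate(islands):
            pop.sort(key=fitness, reverse=True)
            survivors = pop[: len(pop) // 2]            # selection
            children = [mutate(mate(*rng.sample(survivors, 2)))
                        for _ in range(len(pop) - len(survivors))]
            islands[k] = survivors + children
        if gen % migrate_every == 0:                    # migration
            for k, pop in enumerate(islands):
                best = max(pop, key=fitness)
                # The island's best panel replaces one child on the next island.
                islands[(k + 1) % len(islands)][-1] = best
    return max((p for pop in islands for p in pop), key=fitness)

# Three islands of ten candidate panels each.
islands = [[random_panel() for _ in range(10)] for _ in range(3)]
best = evolve(islands)
print(sorted(best), round(fitness(best), 2))
```

The size penalty in `fitness` is what pushes the search toward compact panels: a panel that adds an uninformative marker always scores lower than the same panel without it.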
The success of BOSS lies in its ability to minimize the number of rules and variables used in multivariate signatures while maintaining exceptional accuracy, sensitivity and specificity. The method yields a family of models, from which one can be selected to meet the specific needs of the client. By reducing the number of rules and variables in each signature of the family, testing costs are reduced at both the development end and the consumer end. A cleaner, more concise model can also help developers navigate the regulatory approval process.

Testing SimplicityBio's Biomarker Optimization Software

To test its efficacy, BOSS was compared with other biomarker discovery methods such as TSP, k-tsp, DT, NB, K-NN, PAM, SVM, MOE, Bagging C4.5, AdaBoost C4.5, KEM Biomarker from Ariana Pharma, AHC, Single C4.5, fsvm, and Fuzzy Logic on six seminal, published datasets.

In these comparisons, BOSS consistently yields lower numbers of variables while matching or exceeding the accuracy of the other methods. Across the six datasets, the most accurate BOSS signature achieved an accuracy of 94.14% or higher, which met or exceeded the accuracy of every other method it was compared to. But the key to BOSS's superiority is not just its exceptional accuracy; it is its ability to constrain the number of variables in each model.

LEUKEMIA (Golub et al. [2])

Includes 38 observations, each described by the expression levels of 7,129 genes and a class attribute with two distinct labels: acute myeloid leukemia and lymphoblastic leukemia.

Acute myeloid and lymphoblastic leukemia (Golub et al.)

Method        Accuracy   Variables   Source
BOSS          100.00%    2           SimplicityBio [1]
NB            100.00%    *           Tan et al. [8]
SVM           100.00%    8           Guyon et al. [9]
PAM           97.22%     2296        Tan et al.
k-tsp         95.83%     18          Tan et al.
K-NN          84.82%     *           Tan et al.
Fuzzy logic   79.00%     2           Ohno-Machado et al. [10]
DT            73.81%     2           Tan et al.
In the comparison using the leukemia dataset, BOSS achieved an accuracy of 100% using 2 variables. SVM, another method that achieved this level of accuracy, used 8 variables. The other methods that used only 2 variables, fuzzy logic and DT, achieved accuracies of only 79.00% and 73.81%, respectively.
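The two-criteria reading of the leukemia table (higher accuracy, fewer variables) can be made explicit with a small dominance check. This is only a sketch of how such a comparison might be automated, not a description of how BOSS selects signatures. The figures are those of the leukemia table; the `dominates` helper is an illustrative name introduced here.

```python
# (method, accuracy in %, number of variables) from the leukemia table.
# Rows whose variable count is given as "*" (NB, K-NN) are left out,
# because they cannot be compared on panel size.
results = [
    ("BOSS", 100.00, 2),
    ("SVM", 100.00, 8),
    ("PAM", 97.22, 2296),
    ("k-tsp", 95.83, 18),
    ("Fuzzy logic", 79.00, 2),
    ("DT", 73.81, 2),
]

def dominates(a, b):
    """a dominates b if it is no worse on both criteria and better on one."""
    _, acc_a, n_a = a
    _, acc_b, n_b = b
    return acc_a >= acc_b and n_a <= n_b and (acc_a > acc_b or n_a < n_b)

# Keep only the methods that no other method dominates.
pareto = [r[0] for r in results
          if not any(dominates(other, r) for other in results)]
print(pareto)  # → ['BOSS']
```

On this dataset a single method remains. When several methods trade accuracy against panel size, the same check returns all of them, and the choice between them depends on the application.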
COLON CANCER (Alon et al. [3])

Includes 62 observations made up of 40 tumor samples and 22 normal samples. Approximately 6,000 genes are represented in each sample in the dataset.

Colon Cancer (Alon et al.)

Method        Accuracy   Variables   Source
BOSS          94.14%     27          SimplicityBio
TSP           91.10%     2           Tan et al.
k-tsp         90.30%     2           Tan et al.
Fuzzy logic   90.00%     17          Huerta et al. [11]
PAM           85.48%     15          Tan et al.
SVM           82.26%     *           Tan et al.
DT            80.65%     3           Tan et al.
K-NN          74.19%     *           Tan et al.
NB            58.06%     *           Tan et al.

Despite using more variables, BOSS outperforms the other methods in terms of accuracy.

PROSTATE CANCER (Singh et al. [4])

Includes 52 prostate tumor samples and 50 non-tumor prostate samples with a total of 12,600 genes.

Prostate Tumor (Singh et al.)

Method        Accuracy   Variables   Source
BOSS          97.29%     2           SimplicityBio
TSP           95.00%     2           Tan et al.
MC-SVM        92.00%     *           Statnikov et al. [12]
k-tsp         91.00%     2           Tan et al.
PAM           91.00%     47          Tan et al.
SVM           91.00%     *           Tan et al.
NN            91.00%     *           Statnikov et al.
DT            87.00%     4           Tan et al.
KNN           85.00%     *           Statnikov et al.
NB            62.00%     *           Tan et al.

BOSS achieves an accuracy of 97.29% on the prostate cancer dataset using 2 variables. The only other methods that use as few variables, TSP and k-tsp, do so at lower accuracy (95.00% and 91.00%).
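The tables report accuracy, while the text also stresses sensitivity and specificity. The sketch below shows how the three relate, using the class counts quoted for the colon and prostate datasets. The classifier is a deliberately naive, hypothetical one that labels every sample as tumor; it does not correspond to any method in the tables.

```python
def metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)   # fraction of tumor samples detected
    specificity = tn / (tn + fp)   # fraction of non-tumor samples cleared
    return accuracy, sensitivity, specificity

# Hypothetical classifier that calls every sample "tumor".
colon = metrics(tp=40, fn=0, tn=0, fp=22)      # 40 tumor, 22 normal
prostate = metrics(tp=52, fn=0, tn=0, fp=50)   # 52 tumor, 50 non-tumor

for name, (acc, sens, spec) in [("colon", colon), ("prostate", prostate)]:
    print(f"{name}: accuracy={acc:.2%} sensitivity={sens:.0%} specificity={spec:.0%}")
```

On the colon counts this trivial rule already reaches 64.52% accuracy with 0% specificity, and 50.98% on the more balanced prostate counts. This is why accuracy figures are best read alongside sensitivity and specificity, especially for unbalanced datasets.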
LUNG CANCER (Gordon et al. [5])

Includes 181 samples made up of 31 malignant pleural mesothelioma samples and 150 lung adenocarcinoma samples, with a total of 12,533 genes.

Lung cancer (Gordon et al.)

Method        Accuracy   Variables   Source
BOSS          100.00%    2           SimplicityBio
PAM           99.45%     15          Tan et al.
SVM           99.45%     *           Tan et al.
k-tsp         98.90%     2           Tan et al.
K-NN          98.34%     *           Tan et al.
TSP           98.30%     2           Tan et al.
NB            97.79%     *           Tan et al.
DT            96.13%     3           Tan et al.
MOE           91.00%     2           Wang & Palade [13]

With the only 100% accuracy among the methods tested on the lung cancer dataset, BOSS uses 2 variables. Of the other compact methods, k-tsp, TSP and MOE use 2 variables and DT uses 3, but all have lower accuracies: 98.90%, 98.30%, 91.00% and 96.13%, respectively.

BREAST CANCER (van de Vijver et al. [6])

Includes 295 samples, 151 from patients with lymph-node-negative disease and 144 from patients with lymph-node-positive disease, with a total of 70 genes.

Breast cancer (van de Vijver et al.)

Method          Accuracy   Variables   Source
BOSS            95.83%     31          SimplicityBio
Bagging C4.5    89.47%     *           Tan & Gilbert [14]
AdaBoost C4.5   89.47%     *           Tan & Gilbert
BOSS            87.50%     9           SimplicityBio
KEM Biomarker   85.89%     13          Guergova-Kuras et al. [15]
AHC             83.33%     70          van de Vijver et al.
Single C4.5     63.16%     *           Tan & Gilbert

Two signatures discovered by BOSS are presented here. The first has the highest accuracy (95.83%) but not the lowest number of variables (31). The second uses fewer variables (9) and still achieves a higher accuracy (87.50%) than KEM Biomarker (85.89% with 13 variables), which has the lowest reported number of variables among the other methods.
OVARIAN CANCER (Zhou et al. [7])

Includes 94 samples, 44 from women diagnosed with serous papillary ovarian cancer and 50 from healthy women, with a total of 3,017 mass spectrometry signatures.

Ovarian Cancer (Zhou et al.)

Method          Accuracy   Variables   Source
BOSS            100.00%    10          SimplicityBio [1]
fsvm            100.00%    3017        Zhou et al.
KEM Biomarker   92.97%     13          Guergova-Kuras et al.

The ovarian cancer dataset exemplifies the importance of reducing the number of variables used in modeling. Although fsvm matches BOSS with an accuracy of 100%, it uses roughly 300 times as many variables (3,017 versus 10).

As these six datasets show, BOSS consistently matches or exceeds the accuracy of every other method tested, in most cases with a lower or comparable number of variables. Where BOSS uses more variables than the most compact alternative, as in the colon cancer dataset, the additional variables are the tradeoff for higher accuracy. When minimizing the number of variables is the goal, BOSS can still produce exceptional accuracy results.

Summary

BOSS is the next stage in the evolution of biomarker discovery technology. The co-evolutionary engine behind BOSS continually drives discovery models toward more elegant, simple, and powerful solutions to better meet the needs of clients.
About SimplicityBio

SimplicityBio is a Swiss biomarker panel discovery company. SimplicityBio's Biomarker Optimization Software System (BOSS) allows you to take full advantage of multiple data types and unbalanced data sets, while answering your production, regulatory and IP requirements. To do so, BOSS discovers robust, highly specific and sensitive biomarker panels, leaving you to choose the one that answers your needs. BOSS brings a unique and powerful combination of machine learning, evolutionary algorithms and fuzzy logic to the biological world, and is thus able to discover new robust multi-biomarker panels and improve existing ones. Our clients and partners range from research institutions to diagnostic, companion diagnostic, prognostic and pharmaceutical companies.

Contact us:
Route de l'ile-aux-bois 1A
1870 Monthey
Switzerland
info@simplicitybio.com
visit: www.simplicitybio.com
References:

[1] Barreto-Sanz, M. A., Bujard, A., & Pena-Reyes, C. A. (2012, November). Evolving very-compact fuzzy models for gene expression data analysis. In Bioinformatics & Bioengineering (BIBE), 2012 IEEE 12th International Conference on (pp. 356-361). IEEE.
[2] Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., ... & Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439), 531-537.
[3] Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., & Levine, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12), 6745-6750.
[4] Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., ... & Sellers, W. R. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2), 203-209.
[5] Gordon, G. J., Jensen, R. V., Hsiao, L. L., Gullans, S. R., Blumenstock, J. E., Ramaswamy, S., ... & Bueno, R. (2002). Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Research, 62(17), 4963-4967.
[6] Van De Vijver, M. J., He, Y. D., van't Veer, L. J., Dai, H., Hart, A. A., Voskuil, D. W., ... & Bernards, R. (2002). A gene-expression signature as a predictor of survival in breast cancer. New England Journal of Medicine, 347(25), 1999-2009.
[7] Zhou, M., Guan, W., Walker, L. D., Mezencev, R., Benigno, B. B., Gray, A., ... & McDonald, J. F. (2010). Rapid mass spectrometric metabolic profiling of blood sera detects ovarian cancer with high accuracy. Cancer Epidemiology Biomarkers & Prevention, 19(9), 2262-2271.
[8] Tan, A. C., Naiman, D. Q., Xu, L., Winslow, R. L., & Geman, D. (2005). Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics, 21(20), 3896-3904.
[9] Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3), 389-422.
[10] Ohno-Machado, L., Vinterbo, S., & Weber, G. (2002). Classification of gene expression data using fuzzy logic. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology, 12(1), 19-24.
[11] Huerta, E., Duval, B., & Hao, J. K. (2008). Fuzzy logic for elimination of redundant information of microarray data. Genomics, Proteomics & Bioinformatics, 6(2), 61-73.
[12] Statnikov, A., Aliferis, C. F., Tsamardinos, I., Hardin, D., & Levy, S. (2005). A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 21(5), 631-643.
[13] Wang, Z., & Palade, V. (2010, December). Multi-objective evolutionary algorithms based interpretable fuzzy models for microarray gene expression data analysis. In Bioinformatics and Biomedicine (BIBM), 2010 IEEE International Conference on (pp. 308-313). IEEE.
[14] Tan, A. C., & Gilbert, D. (2003). Ensemble machine learning on gene expression data for cancer classification.
[15] Guergova-Kuras, M., Schneider, M. P., Jullian, N., & Afshar, M. (2014). 667: Shorter multimarker signatures: a new tool to facilitate cancer diagnosis. European Journal of Cancer, (50), S160.
APPENDIX A

Acronym                            Technique or Platform
TSP                                Top scoring pair
k-tsp                              k-Top scoring pair
DT                                 C4.5 decision trees
NB                                 Naïve Bayes
K-NN                               K-nearest neighbor
PAM                                Prediction analysis of microarrays
SVM                                Support Vector Machines
MOE                                Multi-objective Evolutionary Algorithms and Fuzzy Logic
Bagging C4.5
AdaBoost C4.5
KEM Biomarker from Ariana Pharma   Knowledge Extraction and Management
AHC                                Agglomerative hierarchical clustering algorithm
Single C4.5
fsvm                               Functional Support Vector Machine
Fuzzy Logic
MC-SVM                             Multiclass support vector machine
BOSS                               Biomarker Optimization Software System