International Journal of Pure and Applied Mathematics


Volume 119 No. 12 2018, 12505-12513
ISSN: 1314-3395 (on-line version)
url: http://www.ijpam.eu

Analysis of Cancer Classification of Gene Expression Data: A Scientometric Review

1 Joseph M. De Guia, 2 Madhavi Devaraj, PhD
1, 2 School of IT, Mapua University, Muralla St., Intramuros, Manila, Philippines
1 jmdeguia@mapua.edu.ph, 2 mdevaraj@mapua.edu.ph

Abstract

Discovering disease at the molecular level is the main interest of this study and a major challenge for researchers in bioinformatics and cancer classification, as is understanding the genes that contribute to cancer. This paper analyzes the published papers on cancer classification using a scientometric approach. Scopus was searched for published papers with "cancer classification" as the target keyword of the online database query. The search produced a Scopus dataset, which was analyzed together with a CRExplorer visualization to identify the important attributes of cancer classification research, its impact, and its global contribution. The cancer classifier models are presented and their evaluation is discussed. A conceptual model and a system model are proposed from the discussion and results of the papers extracted in the scientometric analysis.

Keywords: scientometric, cancer classification, machine learning, microarray, gene expression, genomics.

1. Introduction

Cancer is considered one of the deadliest genetic maladies and has been a research interest in medicine for decades. The World Health Organization (WHO) reported that cancer (14 million new cases in 2012) is a major cause of morbidity and mortality and the second leading cause of death worldwide, resulting in 8.8 million deaths in 2015 [15]. The World Cancer Report described cancer as a global problem because it affects the population as a whole.
It was projected by the same report that cancer incidence will increase to 20 million new cases by 2025 [16]. Several published works on cancer classification use new approaches at the molecular level. An important goal in cancer research is to identify the specific genes that contribute to the disease. New technology makes molecular-level cancer classification possible: thousands of genes can be analyzed at a time on a single chip called a microarray. Microarrays are microscopic slides that contain an ordered series of samples of DNA (deoxyribonucleic acid), RNA (ribonucleic acid), protein, tissue, and others [6]. A single microarray chip can measure the expression of 30,000 genes, which represents most of the human genome [4].

The surveyed literature identified in this paper used gene expression data to classify cancer genes with statistical and machine learning techniques. The discussions and experiments describe computationally intensive data analysis; pattern recognition and supervised and unsupervised learning are the most commonly implemented algorithms. Several approaches were successful in classifying a wide variety of cancer types, and others provide biological interpretation based on the prediction outcomes.

For an in-depth analysis of the literature on cancer classification using gene expression data, the scientometric method was used. Scientometrics is a branch of science that clarifies the input and output of a certain structure which is organized in order [9]. It is related to bibliometrics and informatics, which deal with the study and flow of the production of literature. Scientometric analysis provides both synchronous and diachronic analyses.
The result of the analysis is based on the statistical evolution of the published papers from different sources and journals, using several parameters to gain insight into its progression. The challenge of cancer classification using microarrays is applying model-based selection and prediction algorithms that classify cancer genes at the molecular level using gene expression data. The computation time, classification accuracy, and biological relevance of cancer classification are still in question. This paper presents answers through a survey and analysis of published papers on cancer classification, gene expression data, and machine learning.

The main goal of this study is to explore and analyze the published papers on cancer classification using gene expression data. The method used was descriptive, following the scientometric process. The published literature in Elsevier Scopus was collected using the search keywords "cancer classification gene expression data". The scope of this paper is cancer classification using gene expression data. The search queries on cancer classification, gene expression data, and algorithms used in this paper were based on selected cancer classification models. The coverage of the publications used in the search and analysis was from

2. Related Works

There are volumes of published research and articles about cancer that are associated with the term "cancer genome" when searched on the Internet. A Google search with the keyword "cancer genome" returned 1.2 million items (as of March 8, 2018). In Google Scholar it is 3.2 million, where most of the cited items concern cancer genome, proteomics, microarray, machine learning algorithms, and others.

2.1 Cancer genome studies

In medical terms, cancer is an abnormal state of a cell or group of cells that mutates and destroys other tissues in the human body. Cancer is not one disease but many. There are more than 100 different types of cancer. The main categories include: carcinoma, cancer of the skin and lining of the internal organs; sarcoma, cancer of the bone, muscle, blood vessels, and connective tissue; leukemia, cancer of the blood; lymphoma and myeloma, cancers of the immune system; central nervous system cancers, which begin in the brain and spinal cord; and other tumors [17].
Genome-wide association studies (GWAS) contributed to the rapid discovery of genetic disease [18]. Cancer research accelerated the reporting of GWAS, which resulted in further investigation through genetic analysis. In GWAS publications from 2007, about 40 distinct hereditary loci were convincingly identified for more than two dozen different cancers [18,19]. The International Cancer Genome Consortium has a catalogue of cancer mutations of more than 25,000 tumors of 25 cancer types [20].

2.2 Microarray, Gene Expression Data, and Knowledge Discovery

Microarrays are microscopic slides that contain an ordered series of samples such as DNA, RNA, protein, and tissue [6]. The type of sample placed on the slide determines the type of microarray (DNA microarray, RNA microarray, and others). The most commonly used is the DNA microarray. The DNA is spotted on the slides as chemically synthesized long oligonucleotides or enzymes. It is held in place by chemically reactive aldehydes or primary amines, or is synthesized by a photolithographic process.

Cancer gene expression analysis is made possible by the cancer genomic data available on the Internet [21]. Most of the available data sets are for breast and lung cancer, and others have fewer than 100 samples. Microarray profiling is the technology most widely used to study gene expression in cancer. This is a critical advancement in genomic marker research used in clinical practice. In microarray processing for class comparison, class discovery, and class prediction of cancer gene expression, both supervised and unsupervised methods of analysis are used. Knowledge discovery uses statistical and machine learning models to predict and classify the gene expression data. Other implementations of cancer classification include clustering and visualization with annotation for biological interpretation.

3. Proposed work

This paper mainly explores and analyzes published papers on cancer classification using gene expression data.
The survey method used was scientometric. Citations to the published papers on cancer classification provide insight into their impact on the research domain. The h-index represents the achievement of an author whose work is cited by other researchers: it measures the impact of the researcher or author through the citations their papers receive.
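As a minimal sketch of how the h-index can be computed, assuming the standard definition (the largest number h such that the author has h papers with at least h citations each), the function below ranks the citation counts and scans them once. The function name and the citation counts are hypothetical illustrations and are not taken from the Scopus dataset.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    # Rank papers from most cited to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the first `rank` papers all have >= `rank` citations
        else:
            break
    return h


# Hypothetical citation counts for one author's papers.
example_counts = [25, 8, 5, 3, 3, 1, 0]
print(h_index(example_counts))  # prints 3
```

Sorting first keeps the scan simple: once a paper's citation count falls below its rank, no later paper can raise the index.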


To illustrate the conceptual framework, the following activities are discussed in the methodology. The first process is data gathering, which takes place in the Scopus database; the output of the search results is based on the topic of interest. The second process is data pre-processing, where the dataset is extracted, organized, and cleaned. The third process is data interpretation, which uses the Scopus search results analyzer to compare the pre-processed data. The last process is post-processing, where the resulting data sets are examined using analysis and visualization tools to reveal insights related to citations and content relationships and to determine the cancer classifier outcome. A conceptual and system model using supervised learning and a gene selection method, based on the evaluation method, is proposed.

4. Results and Discussion

4.1 Scopus search results

The initial search query in Scopus returned 31,028 documents published from

The first Scopus dataset was examined to determine the frequency of the keywords used in the search query, which ran against the TITLE-ABS-KEY field (title, abstract, author keywords, and index keywords). The next task was to analyze the sources. Sources were compared using the Scopus compare sources feature, which calculates and compares at least the top ten sources according to the set parameters. These include CiteScore, SJR, SNIP, Cites, Documents, Percentage cited, and Percentage review. CiteScore is the average number of citations received in a year by the documents published in a journal in the preceding three years. Figure 2 shows the Scopus search analysis of documents per year by source; the topic of interest still has traction each year. Figure 3 shows the CiteScore of publications by year. Bioinformatics is the leading source of research papers and has the most papers related to the topic.

Figure 2.
Documents per year by source (Scopus, 2018)

Figure 3. CiteScore publication by year (Scopus, 2018)

In Table 3, the top ten sources and citations in the Scopus search analysis were Bioinformatics, BMC Bioinformatics, PLoS Computational Biology, Neurocomputing, Journal of Biomedical Informatics, Journal of Computational Biology, Annual International Conference of the IEEE Engineering in Medicine and Biology Proceedings, Artificial Intelligence in Medicine, Computers in Biology and Medicine, Artificial Intelligence in Medicine. The top source title, Bioinformatics, has 117 documents, a CiteScore of 6.73, and 96,864 citations.

Figure 4. Document by affiliation (Scopus, 2018)
Figure 5. Documents by country/territory (Scopus, 2018)

Other search analyses available are by document affiliation, country and territory, document type, and subject area. In the analysis of documents by affiliation (Figure 4), the Chinese Academy of Sciences published 400 papers in the period beginning in 2000, whereas the United States (Figure 5) produced the highest number of papers (7,894) related to the topic, compared with China (4,225) in the same period. The document type with the highest output was article, 51.8% (15,630), followed by conference paper, 42.3% (12,769).

4.2 CRExplorer

The Cited References Explorer (CRExplorer) [8] is another tool that can analyze the historical roots of fields, topics, or researchers using Reference Publication Year Spectroscopy (RPYS). RPYS analyzes how frequently references are cited in the publications, grouped by the publication year of the cited references. The reference dataset used in this analysis was extracted from Scopus. Figure 6 shows the RPYS analysis of the topic. There were 53,658 cited references over a 51-year period and 18 different citing publication years, with a total of 1,447 documents (see Figure 8). Research on cancer classification using gene expression started gaining traction in 1995, with the influential work of Golub et al. in 1999 followed by Guyon, Alon, Nguyen, Dudoit, Statnikov, et al. Research peaked in 2005 with the use of more microarray datasets and evaluation using classification methods. The work of Statnikov et al. has been a constantly cited reference in the citing years thereafter. These papers used comprehensive evaluations and classification methods for microarray gene expression cancer diagnosis. The evaluation and classification methods are discussed in Section 5.

Figure 7. Citation of cancer classification gene expression CRExplorer (CRExplorer, 2018)
Figure 8. Cited References dataset cancer classification gene expression data (CRExplorer, 2018)

5. Survey of the cancer classification results

The following summarizes the discussion and results from the papers sourced from the scientometric analysis.

5.1 Cancer classification methods evaluation

The weighted voting gene selection works well for classifying binary data [10].
This method is a correlation-based classifier in which the class label is assigned by weighted voting of the informative genes. It works well with some data, such as leukemia. Its disadvantage is that it is not effective on data sets with more than two classes.

The similarity-based classifiers, KNN and CAST, are not affected by noise and bias in the data. KNN tolerates noise because it uses several training samples to determine the test class label. CAST is a clustering method based on separable groups containing normal and tumor samples. KNN uses less computing time than CAST because of the similarity score evaluation performed on every test and training sample. Neither method is scalable or practical for cancer classification because both use too much computation time.

The max-margin classifiers, SVM and Boosting, are well suited to high-dimensional, noisy, and sparse data and to avoiding overfitting. SVM has the advantage that the learning algorithm selects a small number of support vectors from a large training set. However, SVM is limited to binary class problems. Boosting improves classification accuracy through a number of rounds of training, but the repeated classification of the weighted training sets consumes much time and effort.

Bayesian networks (BN) and neural nets (NN) can be applied as multiclass classifiers. Their disadvantage is that the process is a black box and cannot reveal any biological information in the data.

Decision trees (DT) can be interpreted and do not require parameters. Trees can be generated right away as the data size increases. DT algorithms are good classifiers in terms of scalability.
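To make the weighted voting scheme described at the start of this section concrete, the following is a minimal sketch under stated assumptions. It follows the commonly cited formulation associated with [10], in which each gene receives a signal-to-noise weight (difference of class means over the sum of class standard deviations) and votes according to how far a new sample lies from the midpoint of the two class means. The function names and the tiny expression matrix are hypothetical, and no gene pre-selection is performed here.

```python
import numpy as np


def fit_weighted_voting(X, y):
    """Learn per-gene weights and decision midpoints for two classes (0 and 1).

    X is a (samples, genes) expression matrix; y holds 0/1 class labels.
    Assumes at least two samples per class so the standard deviation exists.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    class0, class1 = X[y == 0], X[y == 1]
    mu0, mu1 = class0.mean(axis=0), class1.mean(axis=0)
    sd0, sd1 = class0.std(axis=0, ddof=1), class1.std(axis=0, ddof=1)
    eps = 1e-12  # guards against division by zero for constant genes
    weights = (mu0 - mu1) / (sd0 + sd1 + eps)  # signal-to-noise per gene
    midpoints = (mu0 + mu1) / 2.0
    return weights, midpoints


def predict_weighted_voting(x, weights, midpoints):
    """Return (predicted label, prediction strength in [0, 1]) for one sample."""
    votes = weights * (np.asarray(x, dtype=float) - midpoints)
    votes_for_0 = votes[votes > 0].sum()   # positive votes favour class 0
    votes_for_1 = -votes[votes < 0].sum()  # negative votes favour class 1
    total = votes_for_0 + votes_for_1
    strength = abs(votes_for_0 - votes_for_1) / total if total > 0 else 0.0
    label = 0 if votes_for_0 >= votes_for_1 else 1
    return label, strength


# Hypothetical data: 6 samples, 3 genes; gene 0 is high in class 0,
# gene 1 is high in class 1, gene 2 carries no class information.
X = [[5.0, 1.0, 3.0], [6.0, 1.5, 3.1], [5.5, 0.5, 2.9],
     [1.0, 4.0, 3.0], [1.5, 5.0, 3.2], [0.5, 4.5, 2.8]]
y = [0, 0, 0, 1, 1, 1]
weights, midpoints = fit_weighted_voting(X, y)
print(predict_weighted_voting([5.2, 1.2, 3.0], weights, midpoints))
print(predict_weighted_voting([0.8, 4.8, 3.0], weights, midpoints))
```

Because every vote is signed toward one of exactly two classes, the sketch also shows why the scheme does not extend directly to data sets with more than two classes.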

In terms of classifier accuracy, in the experiments of [11] SVM has the highest accuracy on the leukemia and ovarian data sets, while the CAST method is better on the colon data set. Boosting outperforms NN on the leukemia, ovarian, and colon data sets. Similarly, NB outperforms the GS approach on the leukemia and ovarian data sets, while the opposite holds for colon, where GS did better than NB [12]. Table 4 presents the summary of the cancer classifier survey results [3]. It shows that no cancer classifier is superior to all the other models. This can be a research topic: exploring classifier accuracy and biological meaning, pointing to a new classification algorithm or a modification of an existing algorithm that gives bio-relevant answers. The limited number of cancer databases and data sets varies for each type of cancer gene and each source.

Table 4. Summary of the cancer classification survey

Classification method   Multiclass   Strategy/Evaluation     Biological meaning   Scalability
SVM                     No           Max-Margin              No                   Good
Boosting                Yes          Max-Margin              Yes                  Class dependent
DT                      Yes          Entropy function        Yes                  Good
KNN                     Yes          Similarity              No                   Not scalable
CAST                    Yes          Similarity              No                   Not scalable
GS                      No           Weighted voting         Yes                  Fair
FLDA                    Yes          Discriminant Analysis   No                   Fair
NN                      Yes          Perceptron              No                   Fair
NB                      Yes          Distribution modeling   No                   Fair

5.2 Gene selection

Feature selection is an important process in cancer classification; it helps eliminate data set noise and overfitting of the classifier. In addition, it reveals bio-relevant information, making it possible to use DT to see the actual view of gene movement and value. Gene selection reduces the large attribute space, which helps the classifier improve accuracy [2] [10] [13] [14]. The gene feature ranking approach measures the correlation between class labels and attribute values.
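The gene feature ranking idea above can be sketched as follows, assuming the Pearson correlation between each gene's expression values and a 0/1 class label as the ranking score. This is one possible score among several, not necessarily the one used in the surveyed papers; the function name, the `top_k` parameter, and the small expression matrix are hypothetical.

```python
import numpy as np


def rank_genes_by_correlation(X, y, top_k):
    """Rank genes by |Pearson correlation| with the class label.

    X is a (samples, genes) expression matrix; y holds numeric class labels.
    Returns the indices of the top_k genes and the correlation of every gene.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)  # center each gene
    yc = y - y.mean()        # center the labels
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant genes have zero variance; score them 0 instead of NaN.
    r = np.divide(Xc.T @ yc, denom, out=np.zeros(X.shape[1]), where=denom > 0)
    order = np.argsort(-np.abs(r), kind="stable")
    return order[:top_k], r


# Hypothetical data: gene 0 tracks the class, gene 1 is constant,
# gene 2 alternates independently of the class.
X = [[1.0, 5.0, 2.0],
     [2.0, 5.0, 1.0],
     [8.0, 5.0, 2.0],
     [9.0, 5.0, 1.0]]
y = [0, 0, 1, 1]
selected, correlations = rank_genes_by_correlation(X, y, top_k=1)
print(selected)  # prints [0]
```

Each gene is scored on its own, so a ranking of this kind cannot detect genes that are only informative in combination.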
Using the GS selection method [10] with correlation is simple to implement, but it has the disadvantage that it can mistakenly select cancer gene values for normal and tumor types. When linear discriminant function weights are used for training classifiers, the weights of the features are directly proportional to the importance of the class labels identified. Comparing the genes selected using the NB and GS selection methods [15], NB classifier accuracy is better and NB selects a greater variety of genes than GS. Another approach is gene subset ranking (GSR), in which groups of genes are clustered together to obtain the best classifier. Lastly, recursive feature elimination (RFE) carries out the elimination process so as to retain the best classification power. It is also used in SVM classification as a cost function on the subset ranking. RFE and GSR work well in cancer classification compared to IGR.

6. Proposed model

This section presents the proposed conceptual model for the cancer classifier using gene feature selection, and the system model for the design and process flow. A general model for supervised classification and prediction is shown in Figure 8. It starts by loading the gene expression dataset, which is known and normalized, from a microarray gene expression platform. The dataset is divided into training and test subsets if enough samples are available. If not, cross-validation is used: some samples are held out, the predictor is trained on the remaining samples, the held-out samples are classified by the predictor, and the process is repeated iteratively. Once a proper training set has been defined, a selection method is used. This step facilitates the training of most classification algorithms, although some can deal with thousands of variables effectively. Subsequent validation of the feature selection confirms that the selected genes are informative for the classification, and classifiers are built based on these datasets. If there is a need to fine-tune the vectors or parameters when training the predictor, several models are built against the marked genes, and the final model chosen is the one that minimizes the total error in cross-validation.

Figure 8. Logical design for supervised learning
Figure 9. Conceptual Model - System Flow

The graphical representation of the classifier shows the structure of the system design and processes. It was used to design the system, to understand how the system will work, and to develop the pseudocode needed to properly program each module. Figure 9 shows the conceptual system model. Supervised learning on gene expression data starts from the microarray results, from which the gene expression data come. Because microarrays hold high-dimensional gene expression data, it is important to select and extract the right genes for the classification system. The process involves selecting the discriminative genes related to classification from the gene expression data, training the classifier, and classifying new data using the learned classifier.

Feature (gene) selection chooses an optimal subset of genes for further processing. It can be subdivided into algorithms that evaluate and compare the predictive power of individual genes, and algorithms that compare groups of genes. Gene ranking assumes independence among the gene expressions, and methods differ in the ranking parameter used. The second kind of algorithm uses forward selection and backward elimination. This approach identifies complex relationships among genes at the cost of significantly higher computational complexity.

Feature (gene) extraction uses an algorithm that builds a linear mapping transformation to reduce the dimensionality of the input gene space. The diagnostic test applies the transformation to the gene expression measurements to obtain an accurate subset of the mapped space.
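The workflow described in this section (hold out some samples, fit the linear mapping on the remaining samples only, train the classifier, classify the held-out samples, and repeat) can be sketched as below. The sketch makes several assumptions that are not from the paper: principal component analysis computed by singular value decomposition stands in for the linear mapping, a nearest-centroid rule stands in for the classifier, and the expression matrix is synthetic. All function names are hypothetical.

```python
import numpy as np


def fit_pca(X_train, n_components):
    """Learn a linear mapping (mean and principal directions) from training data only."""
    mean = X_train.mean(axis=0)
    # Rows of vt are the principal directions of the centered training data.
    _, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(X, mean, components):
    """Apply a previously learned mapping to any set of samples."""
    return (X - mean) @ components.T


def nearest_centroid_predict(Z_train, y_train, Z_test):
    """Assign each test sample the label of the closest class centroid."""
    labels = np.unique(y_train)
    centroids = np.stack([Z_train[y_train == c].mean(axis=0) for c in labels])
    dists = np.linalg.norm(Z_test[:, None, :] - centroids[None, :, :], axis=2)
    return labels[dists.argmin(axis=1)]


def cross_validated_error(X, y, n_folds=4, n_components=2, seed=0):
    """Estimate the error rate with the mapping re-fitted inside every fold."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = 0
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        # The held-out samples play no part in learning the mapping or the classifier.
        mean, components = fit_pca(X[train], n_components)
        Z_train = project(X[train], mean, components)
        Z_test = project(X[held_out], mean, components)
        predicted = nearest_centroid_predict(Z_train, y[train], Z_test)
        errors += int((predicted != y[held_out]).sum())
    return errors / len(y)


# Synthetic stand-in for a normalized expression matrix: 40 samples, 50 genes,
# with the first 5 genes shifted upward in class 1.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 50))
y = np.repeat([0, 1], 20)
X[y == 1, :5] += 5.0
print(cross_validated_error(X, y))
```

Fitting the mapping inside the loop is the design choice that matters here: computing it once on all samples before cross-validation would let the held-out samples influence the features and bias the error estimate.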
From the given list of genes, the classifier or predictor decides how to categorize the gene pattern in the prediction stage. If the data need to be reduced further to the required sample space, the process returns to dimensionality reduction. Evaluation of the predictor cross-validates a classifier on the training dataset and tests the classifier to overcome overfitting. The proposed system uses normalized, ready-to-load input data for the supervised classification system. A normalized data set means that no preprocessing or cleaning of the data is needed. The data files contain normalized expression values of genes from the microarray.

Figure 10. Proposed System Model
Figure 11. The learning machine that illustrates the reduction of input vectors and classifies the vectors in preparation for the prediction

The system model is made up of the dimensionality reduction module, which reduces the noise of the high-dimensional space of vectors given as input data, and the transformed vectors. Figure 9 shows dimensionality reduction as an independent process in the design of the classifier, which leads to overfitting. Independent feature subset selection is performed using the training data set. If the resulting features are then used to cross-validate a classifier, the same data will have been used for training and testing. This is overfitting and may produce heavily biased error estimates.

The learning machine builds the dimensionality reduction and the classifier together, so that the error rate is estimated over both stages. The cross-validation learning data set is first used to compute the dimensionality reduction parameters, and the computed transformation is then applied to the learning and validation datasets. The classifier is then built using the learning dataset and tested using the mapped validation dataset. Thus, the validation subset is never used for either dimensionality reduction or classifier learning, but exclusively for testing. This is a statistically rigorous approach to machine learning which avoids overfitting. Gene extraction refers to algorithms which build a linear mapping transformation for reducing the dimensionality of the input gene space. Note that a diagnostic test incorporating such a transformation utilizes all of the input gene expression measurements.

7. Conclusion

This paper presented a comprehensive scientometric analysis of cancer classification using gene expression data. The analysis was based on the search analysis features of Scopus and on CRExplorer. The Scopus dataset provided a rich collection of documents related to cancer classification. The results show that most of the papers related to cancer classification are published as articles and conference papers, with Bioinformatics and BMC Bioinformatics as the leading source titles.
The document analysis revealed that the Chinese Academy of Sciences was the affiliation with the most published papers. However, in terms of global contribution, the United States was ahead of China by 50% on this account. Cancer classification is still a topic of interest to most researchers in this field. However, according to the CRExplorer analysis, there was a decline after its peak. The survey of cancer classification model evaluations presented the advantages and disadvantages of each. Gene selection is an important phase in preprocessing and cancer classification.

The next step is to investigate the topic and related topics in cancer classification techniques specific to a cancer genome or type of cancer. A review of the results of the experiments would provide insight into the machine learning techniques and other factors, such as evaluation methods pertaining to efficiency of computation time, accuracy, and biological relevance. A comprehensive survey would also help in making a conclusive review of cancer classification. Finally, the proposed system model is a concept for cancer classification. It explored the issues presented in the cancer classifier models and proposes implementing the supervised classification model. It is recommended to verify the system model and test the accuracy of the proposed logical and system flow designs.

References

[1] Brown, M., Grundy, W., Lin, D., Cristianini, N. (1999). Support vector machine classification of microarray gene expression data. Technical report. University of California, Santa Cruz.
[2] Dudoit, S., Fridlyand, J., & Speed, T. (2000). Comparison of discrimination methods for the classification of tumors using gene expression data. Technical report no. 56. Department of Statistics, University of California, Berkeley, 43.
[3] Han, J., Lu, Y. (2003). Cancer classification using gene expression data. Information Systems, Vol. 28 (4),
[4] Ramaswamy, S., Tamayo, P., & Rifkin, R. (2001). Multiclass cancer diagnosis using tumor gene expression signatures. PNAS, Vol. 98 (26),
[5] Shapiro, G.P., Tamayo, P. (2005). Microarray data mining: Facing challenges, Vol. 5 (2), 1-3.
[6] Wong, G. (2005). Introduction. In Minna Laine, DNA Microarray data analysis (15-24). Helsinki: CSC - Scientific Computing Inc.
[7] Scopus (2018). Scopus. Elsevier. Accessed on Mar 8, 2018 from
[8] CRExplorer (2018). Accessed on Mar 20, 2018 from
[9] Bala, A., Gupta, B.M. Mapping of Indian neuroscience research: A scientometric analysis of research output during Neurol India 2010; 58:35-41
[10] Golub, R., Slonim, D., Tamayo, P. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, pages
[11] Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini (2000). Tissue classification with gene expression profiles. In Proc. of the Fourth Annual Int. Conf. on Computational Molecular Biology.
[12] Keller, A., Schummer, M., Hood, L., and Ruzzo, W. (2000). Bayesian classification of DNA array expression data. Technical report, University of Washington.
[13] Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. (2000). Gene selection for cancer classification using support vector machines. Machine Learning.
[14] Campbell, C., Li, Y., and Tipping, M. (2001). An efficient feature selection algorithm for classification of gene expression data. Machine Learning.
[15] World Health Organization (WHO) (2018). Cancer Fact Sheet, Feb. 2018. Media Center. Accessed on March 8, 2018 from
[16] Stewart, B. and Wild, C. (2014). World Cancer Report. International Agency for Research on Cancer (IARC), World Health Organization (WHO). WHO Press.
[17] National Cancer Institute (2018). What is cancer. Accessed on March 20, 2018 from
[18] Chung CC, Magalhaes WC, Gonzalez-Bosquet J, Chanock SJ (2010). Genome-wide association studies in cancer: current and future directions. Carcinogenesis, 31: bgp273 PM
[19] Hindorff LA, Gillanders EM, Manolio TA (2011). Genetic architecture of cancer and other complex diseases: lessons learned and future directions. Carcinogenesis, 32: carcin/bgr056 PMID:
[20] Hudson TJ, Anderson W, Artez A et al.; International Cancer Genome Consortium (2010). International network of cancer genome projects. Nature, 464: PMID:
[21] Chin L, Hahn WC, Getz G, Meyerson M (2011). Making sense of cancer genomic data. Genes Dev, 25: org/ /gad PMID:

Authors Biography

Joseph M. De Guia is a PhD student at Mapua University, Manila, Philippines. He is currently a faculty member of the School of Information Technology of the same university. Joseph finished his master's degree in Information Technology (MIST) at Carnegie Mellon University and his Master's in Computer Science (MSCS) at Mapua University. His research interests are data mining, artificial intelligence, information security, big data analytics, IT infrastructure, and digital innovation. His research papers in digital health and enterprise architecture have been presented at international and local conferences.

Dr. Madhavi Devaraj received her doctoral degree in computer science from Dr. A.P.J. Abdul Kalam Technical University, Lucknow, India. She also completed a Master's in Computer Applications and an MPhil in Computer Science at Madurai Kamaraj University, Madurai, India. She is currently a distinguished professor in the computer science department at Mapua University, Manila, Philippines. She was previously an assistant professor in the computer science departments at Invertis University, India, and Babu Banarasi Das University, India. Her research interests include Text Analytics, Scientometric Analysis, Opinion Mining, Sentiment Analysis, Information Extraction, Neural Networks, Artificial Intelligence, Machine Learning, and Big Data Analysis.

