Nearest Shrunken Centroid as Feature Selection of Microarray Data

Myungsook Klassen
Computer Science Department, California Lutheran University
60 West Olsen Rd, Thousand Oaks, CA 91360
mklassen@clunet.edu

Nyunsu Kim
Computer Science Department, California Lutheran University
60 West Olsen Rd, Thousand Oaks, CA 91360
nyunsuk@clunet.edu

Abstract

The nearest shrunken centroid classifier uses shrunken centroids as prototypes for each class, and a test sample is assigned to the class whose shrunken centroid is nearest to it. In our study, the nearest shrunken centroid classifier was used simply to select important genes prior to classification. Random Forest, a decision-tree-based classification algorithm, was chosen as the classifier for seven cancer microarray data sets for correct diagnosis. Classification was also performed using the nearest shrunken centroid classifier, and its results are compared to those from Random Forest. Our study demonstrates that the nearest shrunken centroid classifier is simple yet efficient in selecting important genes, but does not perform well as a classifier. We report that the performance of Random Forest as a classifier is far superior to that of the shrunken centroid classifier.

1. Introduction

The development of microarray technology made analysis of gene expressions possible. Analyzing gene expression data from microarray devices has many important applications in medicine and biology: the diagnosis of disease, accurate prognosis for particular patients, and understanding the response of a disease to drugs, to name a few. However, the amount of data in each microarray presents significant challenges to data mining. Microarray data typically has many attributes and few samples because of the difficulty of collecting and processing samples, especially for human data. The typical number of samples is small, less than 100. In contrast, the number of attributes, which represent genes, is very large, typically tens of thousands.
This causes several problems. With irrelevant and redundant attributes, the data set is unnecessarily large and classification takes too long. Having so many attributes and so few samples creates a high likelihood of finding false positives through overfitting. Reducing the number of attributes is therefore an important step before applying classification methods, and it should preserve as much discriminant information as possible to improve learning accuracy. Good attributes must have the same expression pattern for all samples of the same class, but different expression patterns for samples belonging to different classes. Tibshirani et al. [11] proposed the nearest shrunken centroid method for class prediction in DNA microarray studies. It uses shrunken centroids as prototypes for each class and identifies subsets of genes that best characterize each class. It "shrinks" each of the class centroids toward the overall centroid for all classes by a threshold. This shrinkage makes the classifier more accurate by eliminating the effect of noisy genes, and as a result it performs automatic gene selection. The gene expression profile of a new sample is compared to each of these class centroids, and the class whose centroid is closest, in squared distance, is the predicted class for that new sample. This part is the same as the usual nearest centroid rule. This type of classifier is sensitive to small disturbances, and its performance is inferior to contemporary classifiers such as neural networks, support vector machines, and decision trees, where more complex criteria are used for classification. Random Forest is a statistical method for classification first introduced by Leo Breiman in 2001 [2]. It is a decision-tree-based supervised learning algorithm. It is an ensemble classifier whose accuracy is among the best of current data mining algorithms.
It has been used for intrusion detection [14], probability estimation [15], and classification [13], but was not widely used in microarray data classification problems until recently. Reported results [12][13] of a Random Forest classifier with microarray data are very promising. In this paper, we attempt to improve microarray data classification rates. The shrunken centroid method is used to extract good attributes, and Random Forest is used as a classifier. Classification results from this new architecture are compared to those obtained by solely using the shrunken centroid method as a classifier. This paper is organized as follows. We present relevant previous work in Section 2. Then we discuss the concepts of the Random Forest classifier and the
shrunken centroid in Section 3. Seven microarray data sets used in this paper are described in Section 4. Numerical results are presented in Section 5, followed by discussion and conclusions in Section 6.

2. Related Works

Random Forest has been used in several problems. Zhang and Zulkernine used it for network intrusion detection [14]. Prediction of a chemical compound's quantitative or categorical biological activity was performed by Svetnik et al. [13]. Random Forest performance was investigated by Klassen [9] with different parameter values: the number of trees and the number of attributes used to split a node in a tree. Several research works have focused on reducing attributes: genetic algorithms [10], the wrapper approach [6], support vector machines [3], maximum likelihood [5], statistical methods [1][11], neural networks [7], and probability [15]. Hoffmann [4] used pediatric acute lymphoblastic leukemia (ALL) data to determine a small set of genes for optimal subgroup distinction and achieved very high overall ALL subgroup prediction accuracies of about 98%. Uriarte et al. [23] investigated the use of Random Forest for classification of nine microarray data sets including leukemia, colon, lymphoma, prostate, and SRBCT, and reported classification rates of 94.9%, 87.3%, 99.1%, 99.23%, and 99.79% for leukemia, colon, lymphoma, prostate, and SRBCT, respectively. Prostate cancer gene expression data was tested with a Multilayer Perceptron to distinguish grade levels by Moradi et al. [16], and an average classification rate of 85% was reported. In Khan's work [7], principal component analysis (PCA) of the SRBCT tumor data [25] reduced the dimensionality to 10 dominant PCA components. Using neural networks on these components, all 20 test samples were correctly classified. Klassen [8] reported a 100% classification rate with SRBCT data using Levenberg-Marquardt (LM) Back Propagation neural networks after attribute selection with the nearest shrunken centroid method.
Tibshirani et al. [11] reported a 100% classification rate with SRBCT data and 94% with leukemia data using the nearest shrunken centroid classifier. Feng Chu [24] used support vector machines with four effective feature reduction methods and reported 100%, 94.12%, and 100% for SRBCT, leukemia, and lymphoma, respectively.

3. Background

3.1 Random Forest

Random Forest is a meta-learner which consists of many individual trees. Each tree votes on an overall classification for the given set of data, and the Random Forest algorithm chooses the individual classification with the most votes. There are two different sources of randomness in Random Forest: a random training set (bootstrap sampling) and random selection of attributes. Using a random selection of attributes to split each node yields favorable error rates and is more robust with respect to noise. These attributes form nodes using standard tree building methods. Diversity is obtained by randomly choosing attributes at each node of a tree and using the attributes that provide the highest level of learning. Each tree is grown as fully as possible without pruning, until no more nodes can be created due to information loss. With fewer attributes used for a split, the correlation between any two trees decreases, but the strength of each tree also decreases; with more attributes, both increase. These two effects pull the error rate of Random Forest in opposite directions: lower correlation decreases the error rate, while lower strength increases it. The number of attributes should therefore be tuned to balance the two and obtain good results.

3.2 Shrunken Centroids

There are two factors to consider in selecting good genes: within-class distance and between-class distance. When the expression levels of a gene for all samples in the same class are fairly consistent with a small variance, but differ greatly among samples of different classes, the gene is considered a good candidate for classification.
The gene has discriminant information for different classes. In the nearest shrunken centroid method, the variance within a class is further taken into consideration to measure the goodness of a gene. The difference between a class centroid and the overall centroid for a gene is divided by the within-class variance, giving greater weight to a gene whose expressions are stable among samples in the same class. A threshold value is applied to the resulting normalized class centroid differences. If the difference is small for all classes, it is set to zero, meaning the gene is eliminated. This reduces the number of genes that are used in the final predictive model. The mathematics involved in the shrunken centroid algorithm can be found in Tibshirani's work [11]. Assume there are n samples, p genes, and K classes. For gene i, let x_ik denote the centroid of class k, x_i the overall centroid, and s_i the pooled within-class standard deviation:

    s_i^2 = (1 / (n - K)) * sum_{k=1..K} sum_{j in C_k} (x_ij - x_ik)^2,    m_k = sqrt(1/n_k + 1/n),

where C_k is the set of n_k samples in class k. The normalized centroid difference for gene i in class k is

    d_ik = (x_ik - x_i) / (m_k * (s_i + s_0)),

where s_0 is a small positive constant (the median of the s_i in [11]) to avoid a large d_ik value when a gene has a small standard deviation s_i. In the nearest shrunken centroid method, for each gene i, d_ik is evaluated in each class k to see if it is smaller
than a chosen threshold value Δ. If |d_ik| is too small for all classes, gene i is considered not very significant and is eliminated from the gene list. Otherwise each d_ik is soft-thresholded:

    d'_ik = sign(d_ik) * max(|d_ik| - Δ, 0).

Once the optimal threshold is found, all centroids are updated accordingly into shrunken centroids using

    x'_ik = x_i + m_k * (s_i + s_0) * d'_ik.

If a gene is shrunk to zero for all classes, it is considered not important and is eliminated. After the centroids are determined through the training stage, classification can take place: a test sample is assigned to the class whose shrunken centroid is nearest to it.

4. Data

SRBCT data [7]: small round blue cell tumor. It contains 83 samples from four types of cancer cells: 18 neuroblastoma (NB), 25 rhabdomyosarcoma (RMS), 29 Ewing family of tumors (EWS), and 11 Burkitt lymphoma (BL). There are 2287 genes in total.

Acute data [17]: The total number of genes is 7129, and the number of samples is 72, all acute leukemia patients: 47 acute lymphoblastic leukemia (ALL) and 25 acute myelogenous leukemia (AML).

Prostate data [21]: The total number of genes is 10510 and there are two classes: tumor and normal. There are 102 samples in total: 52 tumor and 50 normal.

Lymphoma data [18]: The total number of genes to be tested is 4026 and the number of samples is 62. There are three types of lymphomas: the first category, Chronic Lymphocytic Lymphoma (CLL), has 11 patients; the second type, Follicular Lymphoma (FL), has 9; and the third type, Diffuse Large B-cell Lymphoma (DLBCL), has 42.

Colon data [19]: This dataset contains 62 samples. Among them, 40 biopsies are from tumors (labeled as "negative") and 22 normal biopsies (labeled as "positive") are from healthy parts of the colons of the same patients. The total number of genes to be tested is 2000.
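The soft-thresholding selection rule of Section 3.2 can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the PAM implementation: the expression matrix X (samples by genes), labels y, and threshold delta are hypothetical inputs, and s_0 is taken as the median of the s_i.

```python
import numpy as np

def select_genes_nsc(X, y, delta):
    """Soft-threshold the standardized centroid differences d_ik and keep
    genes whose shrunken value is nonzero for at least one class."""
    classes = np.unique(y)
    n, p = X.shape
    K = len(classes)
    overall = X.mean(axis=0)                  # overall centroid per gene
    # pooled within-class standard deviation s_i
    ss = np.zeros(p)
    for c in classes:
        Xc = X[y == c]
        ss += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    s = np.sqrt(ss / (n - K))
    s0 = np.median(s)                         # guard constant, median of s_i
    d_shrunk = np.zeros((K, p))
    for k, c in enumerate(classes):
        Xc = X[y == c]
        nk = len(Xc)
        mk = np.sqrt(1.0 / nk + 1.0 / n)
        d = (Xc.mean(axis=0) - overall) / (mk * (s + s0))
        # soft thresholding: d' = sign(d) * max(|d| - delta, 0)
        d_shrunk[k] = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)
    return np.flatnonzero(np.any(d_shrunk != 0.0, axis=0))
```

A gene with stable within-class expression but well-separated class centroids survives even a large delta, while noisy genes are shrunk to zero and dropped.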
Lung cancer data [20]: There are 181 tissue samples, of which 31 belong to the MPM class and 150 to the ADCA class. Each sample is described by 12533 genes.

MLL_leukemia data [22]: contains three kinds of leukemia samples, in contrast to the binary-class leukemia dataset. The dataset contains 72 leukemia samples: 24 ALL, 20 MLL, and 28 AML. The number of genes is 12582.

5. Experiment and Results

5.1 Data Cleaning and Set up

All data sets except SRBCT, which the author [11] had already divided into a train set and a test set, were divided into a train set and a test set while preserving the ratios of sample classes. Roughly 60-80% of samples are used for training and the remainder for testing. The detailed breakdown for each data set is shown in Table 1. Data sets were preprocessed to remove redundant data and to handle missing values. PAM, the freely available shrunken centroid software used in our experiments, has a function to handle missing values. It uses the k-nearest neighbor algorithm to find the best value to fill in missing data. The default k-value of 10 was used in our experiments.

Data set     Train samples                       Test samples
SRBCT        63 (23 EWS, 8 BL, 12 NB, 20 RMS)    20 (6 EWS, 3 BL, 6 NB, 5 RMS)
Acute        38 (27 ALL, 11 AML)                 34 (20 ALL, 14 AML)
Prostate     80 (40 tumor, 40 normal)            22 (12 tumor, 10 normal)
Lymphoma     47 (32 DLBCL, 7 FL, 8 CLL)          15 (10 DLBCL, 2 FL, 3 CLL)
Colon        46 (30 tumor, 16 normal)            16 (10 tumor, 6 normal)
Lung         149 (134 ADCA, 15 MPM)              32 (16 ADCA, 16 MPM)
MLL          50 (16 ALL, 14 MLL, 20 AML)         22 (8 ALL, 6 MLL, 8 AML)

Table 1: Train samples and test samples with number of class members

5.2 Gene Selection with the shrunken centroid

Using the shrunken centroid algorithm, training errors are plotted against threshold values, as shown in Figure 1 for colon cancer. A range of threshold values from 2.5 to 3.8 shows the lowest error. The largest such threshold value was chosen in our study to obtain the smallest number of attributes to be used with the
Random Forest classifier. In Figure 1, the threshold value 3.809 was chosen for colon cancer. The same process was performed for all 7 data sets, and Table 2 shows the number of genes selected.

Figure 1: Colon cancer nearest shrunken centroid training errors

Data set     Threshold    Number of genes
SRBCT        4.34         43
Acute        5.4          21
Prostate     6.33         6
Lymphoma     6.84         25
Colon        3.809        16
Lung         13.35        5
MLL          6.68         52

Table 2: Threshold value and the number of genes selected with the shrunken centroid.

5.3 Classification with Random Forest and its results

In our study, the Random Forest implementation in the WEKA 3.5.6 software developed by the University of Waikato was used. There are two parameters which affect the performance of Random Forest. One is the number of attributes (called Mtry in WEKA) selected at random to split a node in a tree, and the other is the number of trees to be generated in a forest. The tree depth was set to 0 for unlimited depth. Forests with three different numbers of trees, 10, 15, and 20, were initially used, and twenty Mtry values from 1 to 20 were used for each tree value. Since the numbers of attributes in the 7 data sets are between 5 and 52 with an average of 24, Mtry values up to 20 are sufficient. When needed, the number of trees was changed to explore further. Statistics such as the average, minimum, maximum, and standard deviation of classification rates, along with the total number of times a 100% classification rate occurred, were gathered over the 20 Mtry values. A summary of results is shown in Table 3. Table 3 shows the highest classification rate obtained and the number of trees giving that rate. If several tree values gave a 100% classification rate, the smallest is shown in the table. It also shows the number of times 100% classification rates occurred among the 20 different Mtry values for that tree. With SRBCT and lymphoma, we were able to obtain a 100% classification rate with all tree values and with several Mtry values.
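The parameter sweep described above can be sketched with scikit-learn's RandomForestClassifier as a stand-in for WEKA's implementation (n_estimators plays the role of the number of trees and max_features the role of Mtry). The synthetic data below is a hypothetical placeholder for a post-selection gene matrix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data: 24 features, roughly the average size of the selected gene sets.
X, y = make_classification(n_samples=100, n_features=24, n_informative=10,
                           random_state=1)
# Stratified split, mirroring the class-ratio-preserving splits of Section 5.1.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=1)

best_acc, best_trees, best_mtry = 0.0, None, None
for n_trees in (10, 15, 20):               # forest sizes used in the study
    for mtry in range(1, 21):              # attributes tried at each split (Mtry)
        rf = RandomForestClassifier(n_estimators=n_trees, max_features=mtry,
                                    max_depth=None, random_state=1)
        rf.fit(X_tr, y_tr)
        acc = rf.score(X_te, y_te)
        if acc > best_acc:
            best_acc, best_trees, best_mtry = acc, n_trees, mtry
print(f"best accuracy {best_acc:.4f} at ntree={best_trees}, Mtry={best_mtry}")
```

On real microarray data the same loop would simply replace the synthetic X and y with the selected-gene train and test matrices.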
Mtry values play an important role, while the number of trees matters less. The difference between the highest classification rate and the lowest one is larger than those from other data sets. With acute leukemia, prostate, colon and MLL_leukemia, classification rates didn't reach 100%; Random Forest didn't perform well for any Mtry value or number of trees. Also, the difference between the highest classification rate and the lowest is small compared to that of SRBCT and lymphoma. The interesting case is lung cancer. A forest of 10 trees demonstrated an excellent result of a 100% classification rate with almost all Mtry values, but with tree values of 15 and 20, we were not able to get a 100% classification rate.

Data set     Highest rate    100% classification occurrence    Tree size
SRBCT        1.0             7 out of 20                       15
Acute        0.9411          0                                 10
Prostate     0.9091          0                                 20
Lymphoma     1.0             6 out of 20                       10
Colon        0.8125          0                                 9
Lung         1.0             16 out of 20                      10
MLL          1.0             1 out of 20                       15

Table 3: Test sample classification rates with Random Forest

5.4 Classification with the shrunken centroid classifier

We ran the shrunken centroid classifier with the same train and test sets we used with Random Forest. Threshold values were kept the same. The classification rates we obtained are shown in Table 4, along with those from Random Forest. Table 4 shows that the shrunken centroid classifier gave 86.6%, 93.7% and 95.4% classification rates with
three data sets, Lymphoma, Lung, and MLL_leukemia, respectively, while Random Forest produced 100% for all three. With Acute and Prostate, both gave the same classification rates of 94.11% and 90.91%, respectively. With colon cancer, the classification rate with the shrunken centroid was 75%, versus 81.25% with Random Forest.

Data set     Threshold    Shrunken Centroid rates    Random Forest classification rates
SRBCT        4.34         1.0                        1.0
Acute        5.4          0.9411                     0.9411
Prostate     6.33         0.9091                     0.9091
Lymphoma     6.846        0.866                      1.0
Colon        3.809        0.75                       0.8125
Lung         13.35        0.937                      1.0
MLL          6.68         1.0                        1.0

Table 4: Test sample classification rates with the shrunken centroid and Random Forest.

6. Evaluation and discussion

Classification rates we obtained with Random Forest are the same as or higher than those with the shrunken centroid for all data sets; only for the Acute and prostate data are they merely equal. The numbers of genes for these two data sets are relatively small: 21 for Acute and 6 for prostate. We further investigated how different numbers of genes affect classification rates.

6.1 A larger number of attributes for Random Forest

A range of threshold values gives the largest classification rate, as shown in Figure 1, so we chose a few different threshold values smaller than the largest value to generate more attributes. The idea is that more attributes may increase classification rates with the Random Forest classifier. Table 5 shows that prostate classification rates increased to 96.77% with smaller threshold values. With the largest value 6.33, 6 genes were selected and the classification rate was 90.91% (see Table 3). With the shrunken centroid classifier, the rate went up to 93.55% from 90.91% (see Table 3), but then went down to 83.87%. With acute leukemia, Random Forest was able to classify all test samples correctly, but the shrunken centroid classifier went down to 90% from 94.11% (see Table 3) and didn't change at all with different threshold values.
The shrunken centroid classifier assigns a test sample to the nearest shrunken centroid using a simple standardized squared distance. This works well when genes with nonzero components in each class are mutually exclusive, as is the case with the SRBCT data. Otherwise the simple distance is not a good measure for classification: a sample with average values for all genes and a sample with a few large and a few small gene values may end up in the same class. The second term used in the shrunken centroid, the frequency of a class in the population, may not help when the population has similar class ratios.

Threshold    No. of genes    Shrunken Centroid rate    Random Forest rate
4.29         37              83.87%                    96.77% (ntree=14, Mtry=2)
4.74         24              87.1%                     96.77% (ntree=9, Mtry=4)
5.65         7               93.55%                    96.77% (ntree=8, Mtry=5)
6.10         5               90.33%                    93.54% (ntree=8, Mtry=2)

Table 5: Prostate classification rates with different threshold values

Threshold    No. of genes    Shrunken Centroid rate    Random Forest rate
4.07         69              90%                       100% (ntree=5, Mtry=3)
4.38         52              90%                       95% (ntree=5, Mtry=3)
4.70         39              90%                       100% (ntree=6, Mtry=4)
5.64         17              90%                       95% (ntree=6, Mtry=2)

Table 6: Acute classification rates with different threshold values

Random Forest uses a decision-tree-based supervised learning algorithm and is a meta-learner that makes a classification decision collectively from each tree's vote. The problems mentioned above with the shrunken centroid do not arise: genes selected for
splitting a node in a tree properly route a sample to the right decision node.

6.2 k-nearest neighbor algorithm for handling missing values

Preprocessing data is an important part of data mining and can affect the training and testing processes. We explored the effectiveness of k-nearest neighbor imputation for filling in missing values. We chose 62 samples with no missing values from the Colon data. We chose one gene, T53360, and deleted the values 131.92 and 233.05 from two randomly selected samples, sample 5 (tumor) and sample 48 (normal), respectively. We then filled in the missing values with k-values from 1 to 10. The k-value giving the best match with the original deleted value was different for each class, so the average over all 10 k-values is used and is shown in Table 7. The table shows that k-nearest neighbor imputation computes much closer values to the originals than a simple average method does. A similar result was obtained with Lymphoma.

Sample                Original value deleted    k-nearest neighbor average    Simple average
Sample 5 (tumor)      131.92                    125.131                       216.35
Sample 48 (normal)    233.05                    252.406                       213.53

Table 7: Effectiveness of the k-nearest neighbor method in handling missing values with colon cancer.

7. References

[1] E. Blair and R. Tibshirani, Machine learning methods applied to DNA microarray data can improve the diagnosis of cancer, SIGKDD Explorations, Vol. 5, Issue 2, 2002, pp. 48-55.
[2] L. Breiman, Random Forests, Machine Learning, 45(1), 2001, pp. 5-32.
[3] I. Guyon, Gene selection for cancer classification using support vector machines, Machine Learning, Vol. 46, 2002, pp. 389-422.
[4] K. Hoffmann, Translating microarray data for diagnostic testing in childhood leukaemia, BMC Cancer, 2006, 6:229.
[5] T. Ideker, Testing for differentially-expressed genes by maximum likelihood analysis of microarray data, Journal of Computational Biology, Vol. 7, 2000, pp. 805-817.
[6] Inza, Gene selection by sequential wrapper approaches in microarray cancer class prediction, Journal of Intelligent and Fuzzy Systems, 2002, pp. 24-34.
[7] J. Khan, Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks, Nature Medicine, Vol. 7, No. 6, 2001, pp. 673-679.
[8] M. Klassen, Classification of cancer microarray data using neural networks, Proceedings of the IADIS International Conference on Applied Computing, 2007.
[9] M. Klassen, M. Cummings, and G. Seldana, Investigation of Random Forest performance with cancer microarray data, Proceedings of the ISCA 24th International Conference on Computers and Their Applications, 2008.
[10] L. Li, Gene assessment and sample classification for gene expression data using a genetic algorithm/k-nearest neighbor method, Combinatorial Chemistry and High Throughput Screening, 2001, pp. 727-739.
[11] R. Tibshirani, Diagnosis of multiple cancer types by shrunken centroids of gene expression, PNAS, Vol. 99, No. 10, 2002, pp. 6567-6572.
[12] T. Shi, Tumor classification by tissue microarray profiling: random forest clustering applied to renal cell carcinoma, Modern Pathology, Vol. 18, 2005, pp. 547-557.
[13] V. Svetnik, Random Forest: A Classification and Regression Tool for compound classification and QSAR modeling, J. Chem. Inf. Computer Science, 43, 2003, pp. 1947-1958.
[14] J. Zhang and M. Zulkernine, A Hybrid Network Intrusion Detection Technique Using Random Forests, Proceedings of the First International Conference on Availability, Reliability and Security (ARES'06), 2006, pp. 262-269.
[15] Ting-Fan Wu, Chih-Jen Lin, and Ruby C. Weng, Probability estimates for multi-class classification by pairwise coupling, The Journal of Machine Learning Research, Volume 5, 2004.
[16] M. Moradi, P. Mousavi, and P. Abolmaesoumi, Pathological distinction of prostate cancer tumors based on DNA microarray data, CSCBC 2006 conference proceedings, Ontario, Canada.
[17] Golub, et al.,
Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science, 15 October 1999, Vol. 286, No. 5439, pp. 531-537.
[18] Alizadeh, et al., Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature, 2000 Feb 3; 403(6769):503-11.
[19] Alon, et al., Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues, Proceedings of the National Academy of Sciences, 1999.
[20] Gordon, et al., Translation of Microarray Data into Clinically Relevant Cancer Diagnostic Tests Using Gene Expression Ratios in Lung Cancer and Mesothelioma, Cancer Research, 2002, AACR.
[21] Singh, et al., Gene expression correlates of clinical prostate cancer behavior, Cancer Cell, 2002 Mar; 1(2):203-9.
[22] Armstrong, et al., A gene expression profile analysis of acute lymphoblastic leukemia suggests a new subset of leukemia, The Scientist, 2001, 2(1):20011205-02.
[23] R. Uriarte and S. Andres, Gene selection and classification of microarray data using Random Forest, BMC Bioinformatics, 2006, Vol. 7, No. 3.
[24] Feng Chu and Lipo Wang, Applications of support vector machines to cancer classification with microarray data, International Journal of Neural Systems, 2005, Vol. 15, No. 6.