Gene Selection Based on Mutual Information for the Classification of Multi-class Cancer

Sheng-Bo Guo 1,2, Michael R. Lyu 3, and Tat-Ming Lok 4

1 Department of Automation, University of Science and Technology of China, Hefei, Anhui, 230026, China
sbguo@iim.ac.cn
2 Intelligent Computation Lab, Hefei Institute of Intelligent Machines, Chinese Academy of Sciences, P.O. Box 1130, Hefei, Anhui, 230031, China
3 Computer Science & Engineering Dept., The Chinese University of Hong Kong, Shatin, Hong Kong
4 Information Engineering Dept., The Chinese University of Hong Kong, Shatin, Hong Kong

Abstract. With the development of microarray technology, microarray data are widely used in the diagnosis of cancer subtypes. However, accurate diagnosis of cancer subtypes remains a complicated problem. Building classifiers on key genes selected from microarray data is a promising approach for the development of microarray technology; yet the selection of non-redundant but relevant genes is complicated. The set of selected genes should be small enough to allow diagnosis even in regular laboratories, and should ideally identify genes involved in cancer-specific regulatory pathways. In contrast to traditional gene selection methods, which are used for the classification of two categories of cancer, the present paper proposes a novel gene selection algorithm based on mutual information for the classification of multi-class cancer using microarray data; the selected key genes are fed into the classifier to classify the cancer subtypes. In our algorithm, mutual information is employed to select key genes related to the class distinction. The application to breast cancer data suggests that the algorithm can identify the key genes for the BRCA1 mutation/BRCA2 mutation/sporadic class distinction, since it classifies the three types of breast cancer effectively and efficiently. Two more microarray datasets, leukemia and ovarian cancer data, are also employed to validate the performance of our method.
The performance in these applications demonstrates the high quality of our method. Based on the present work, our method can be widely used to discriminate different cancer subtypes, which will contribute to the development of technology for the recovery from cancer.

D.-S. Huang, K. Li, and G.W. Irwin (Eds.): ICIC 2006, LNBI 4115, pp. 454-463, 2006. © Springer-Verlag Berlin Heidelberg 2006

1 Introduction

Microarray technology, a recent development in experimental molecular biology, gives biomedical researchers the ability to measure the expression levels of thousands of genes simultaneously. Such gene expression profiles are used to understand the molecular variations among disease-related cellular processes, and also to support the development of diagnostic tools and classification platforms in cancer research. With the development of microarray technology, the necessary processing and analysis methods grow increasingly critical. Finding appropriate approaches is increasingly urgent and challenging because microarray data comprise a large number of genes compared with the small number of samples in a specific experiment. For the data obtained in a typical experiment, only some of the genes are useful for differentiating samples among different classes; many other genes are irrelevant to the classification. Those irrelevant genes not only introduce unnecessary noise into gene expression data analysis, but also increase the dimensionality of the gene expression matrix, which increases the computational complexity of subsequent analyses such as classification and clustering. As a consequence, it is important to eliminate the irrelevant genes and identify the informative genes, which is a feature selection problem crucial in gene expression data analysis [1, 2].

In the present paper, we propose a novel gene selection method based on mutual information (GSMI) for multi-class cancer classification using microarray data. Our method first calculates the mutual information (MI) between the discretized gene expression profiles and the cancer class label vector for all the samples. Then, the genes are ranked according to the calculated MI. The selected genes with high ranks are fed into the nearest neighbor method.

The rest of the present paper is organized as follows. In Section 2, we first introduce the method to discretize the gene expression data, and then we formulate in detail the principle of GSMI. Section 3 describes the test statistics. Section 4 describes the experiment. In Section 5, GSMI is applied to analyze the breast cancer dataset. Section 6 contains the conclusions.
2 Methods

Among the many thousands of genes simultaneously measured in a specific microarray experiment, it is implausible that all of their expressions are related to a particular partition of the samples. In the analysis of a biological system, the following rules of thumb regarding gene functions are often assumed: 1) a gene can be in either the on or off state; 2) not all genes simultaneously respond to a single physiological event; 3) gene functions are highly redundant [3]. According to these assumptions, we consider the genes as random variables with two values, in which 1 denotes the on state and 0 denotes the off state. As a consequence, the gene expression data can be discretized into the two states 0 and 1. The discretization of the gene expression data is formulated later in Section 4.

Assume that a microarray dataset can be represented as a $G \times S$ matrix $A$ with generic element $a_{gs}$ representing the expression level of gene $g$ in sample $s$. All the samples are divided into $n$ categories, with the class label denoted by $C$, whose $s$-th element stands for the class of the $s$-th sample.

From the biological point of view, those genes having higher mutual agreement with the class label of the cancer microarray data contribute more significantly to the classification of the cancer subtypes. Consequently, these genes should be selected as the key genes and used for the subsequent classification and clustering. According to information theory, mutual information can be used to measure the mutual agreement between two object models. We therefore rank every gene according to the mutual information between the gene and the class label of the cancer microarray data.

Based on the information-theoretic principle of mutual information, the mutual information of two random variables $X$ and $Y$ with a joint probability mass function $p(x, y)$ and marginal probability mass functions $p(x)$ and $p(y)$ is defined as [4]:

$$I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}. \tag{1}$$

Let us suppose that the domain of $G_i$, $i \in \{1, \ldots, G\}$, is discretized into two intervals. After discretization, the domains of all the genes can be represented by $\mathrm{dom}(G_i) = \{v_k\}$, $k = 1, 2$, where $v_1 = 0$ and $v_2 = 1$. Let $\sigma$ denote the SELECT operation from relational algebra and $|S|$ denote the cardinality of set $S$ [8]. The probability of a gene in the microarray data having $G_i = v_k$, $i \in \{1, \ldots, G\}$, $k \in \{1, 2\}$, is then given by:

$$P(G_i = v_k) = \frac{\left|\sigma_{G_i = v_k}(A)\right|}{\left|\sigma_{G_i \neq \mathrm{NULL}}(A)\right|}. \tag{2}$$

The joint probability that the gene has $G_i = v_k$ and the class label has $C = c_j$, $j \in \{1, \ldots, n\}$, is calculated by:

$$P(G_i = v_k \wedge C = c_j) = \frac{\left|\sigma_{G_i = v_k \wedge C = c_j}(A)\right|}{\left|\sigma_{G_i \neq \mathrm{NULL}}(A)\right|}. \tag{3}$$

Definition 1. The interdependence measure $I$ between the gene and the class label, $G_i$ and $C$, $i \in \{1, \ldots, G\}$, is defined as:

$$I(G_i : C) = \sum_{k=1}^{2} \sum_{j=1}^{n} P(G_i = v_k \wedge C = c_j) \log \frac{P(G_i = v_k \wedge C = c_j)}{P(G_i = v_k)\, P(C = c_j)}. \tag{4}$$
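Definition 1 can be evaluated directly from counts, since each probability in Eqs. (2)-(4) is a ratio of cardinalities. The following is a minimal sketch, not the authors' implementation: the function names `mutual_information` and `rank_genes`, the toy data, and the choice of base-2 logarithms are assumptions made for illustration, and every entry is assumed to be non-NULL.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information, in bits, between two discrete sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts behind P(G_i = v_k and C = c_j)
    px = Counter(xs)              # counts behind P(G_i = v_k)
    py = Counter(ys)              # counts behind P(C = c_j)
    total = 0.0
    for (x, y), c in joint.items():
        # (c/n) * log2((c/n) / ((px/n) * (py/n))), simplified to raw counts
        total += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return total


def rank_genes(binary_matrix, labels):
    """Return gene indices sorted by decreasing I(G_i : C), plus the scores."""
    scores = [mutual_information(row, labels) for row in binary_matrix]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Hypothetical toy data: 3 discretized genes (rows) over 6 samples, 3 classes.
labels = ["A", "A", "B", "B", "C", "C"]
toy_genes = [
    [1, 1, 0, 0, 0, 0],  # on only in class A
    [1, 0, 1, 0, 1, 0],  # independent of the class label
    [1, 1, 1, 0, 0, 0],  # partly informative
]
order, scores = rank_genes(toy_genes, labels)
print(order)  # [0, 2, 1]
```

In this toy case the gene that is independent of the class label scores exactly zero, while the gene that is on only in class A scores highest, which is the ordering the ranking step relies on.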

$I(G_i : C)$ measures the average reduction in uncertainty about $G_i$ that results from learning the value of $C$ [9]. If $I(G_i : C) > I(G_j : C)$, $i, j \in \{1, \ldots, G\}$, $i \neq j$, the dependence of $G_i$ and the class label $C$ is greater than the dependence of $G_j$ and $C$.

Before ranking the genes according to the mutual information, the redundancy in the microarray data should be reduced. Owing to the way microarrays are designed, the gene expression matrix is highly redundant, since some genes are measured more than once.

Definition 2. The mutual information matrix of the microarray, named $M$ with its element $m_{ij}$, is given by:

$$m_{ij} = \sum_{k=1}^{2} \sum_{l=1}^{2} P(G_i = v_k \wedge G_j = v_l) \log \frac{P(G_i = v_k \wedge G_j = v_l)}{P(G_i = v_k)\, P(G_j = v_l)}. \tag{5}$$

For simplicity, the mutual information matrix $M$ is normalized to $M^*$, with its element $m^*_{ij}$ given as follows:

$$m^*_{ij} = \frac{m_{ij}}{H(G_i, G_j)}, \tag{6}$$

where $H(G_i, G_j)$ denotes the joint entropy of the genes $G_i$ and $G_j$, which is given by:

$$H(G_i, G_j) = -\sum_{k=1}^{2} \sum_{l=1}^{2} P(G_i = v_k \wedge G_j = v_l) \log P(G_i = v_k \wedge G_j = v_l). \tag{7}$$

The redundancy in the microarray data is reduced by the following method. In the matrix $M^*$, the elements on the diagonal all have the same value (namely 1, since the mutual information of a gene with itself equals its entropy). Ignoring the diagonal, the rows whose elements are all less than 0.95 are labeled in the normalized mutual information matrix, and the genes corresponding to these rows are selected. Because of the error in the process of computing the mutual information, the cutoff value is set to 0.95 so that the redundancy can be reduced as much as possible. If the cutoff value were instead set to 1, the redundancy could not be reduced to the expected extent.

After selecting the genes with little redundancy, if any, the selected genes (SGS) are ranked according to the interdependence measure $I$ between the gene expression profiles and the class label. Then the SGS are used to train the RBF neural network to classify the cancer subtypes in the designed experiment formulated in the next section.
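The redundancy filter built on Eqs. (5)-(7) can be sketched as follows. This is an illustrative reading of the rule stated above (keep a gene only if all of its off-diagonal normalized mutual information values are below the cutoff), not the authors' code. The helper names, the base-2 logarithm, the toy matrix, and the convention that two constant profiles count as identical are all assumptions.

```python
import math
from collections import Counter


def _probs(values):
    """Empirical probability of each distinct value in a sequence."""
    n = len(values)
    return {v: c / n for v, c in Counter(values).items()}


def joint_entropy(xs, ys):
    """H(G_i, G_j) in bits, as in Eq. (7)."""
    return -sum(p * math.log2(p) for p in _probs(list(zip(xs, ys))).values())


def mutual_information(xs, ys):
    """m_ij in bits, as in Eq. (5)."""
    pxy = _probs(list(zip(xs, ys)))
    px, py = _probs(xs), _probs(ys)
    return sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in pxy.items())


def normalized_mi(xs, ys):
    """m*_ij = m_ij / H(G_i, G_j), as in Eq. (6)."""
    h = joint_entropy(xs, ys)
    # Assumption: two constant profiles (zero joint entropy) count as identical.
    return 1.0 if h == 0.0 else mutual_information(xs, ys) / h


def low_redundancy_genes(binary_matrix, cutoff=0.95):
    """Indices of genes whose off-diagonal m*_ij values are all below cutoff."""
    g = len(binary_matrix)
    return [
        i for i in range(g)
        if all(normalized_mi(binary_matrix[i], binary_matrix[j]) < cutoff
               for j in range(g) if j != i)
    ]


# Hypothetical toy matrix: gene 1 is an exact duplicate of gene 0.
toy = [
    [1, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
    [0, 1, 1, 1, 0, 0],
]
kept = low_redundancy_genes(toy)
print(kept)  # [2, 3]
```

Read literally, the rule discards every member of a redundant group, so both copies of the duplicated profile are dropped here. Keeping one representative per group would be a different design choice that the text does not describe.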

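3 Test Statistics

A general statistical model for gene expression values is introduced first, followed by several test statistics. Assume that there are more than two kinds of distinct tumor tissue classes for the problem under consideration and that there are $p$ genes (variables) and $n$ tumor mRNA samples (observations). For a fixed gene, let $Y_{ij}$ denote the expression value of the $j$-th sample in class $i$, where $i = 1, \ldots, k$ and $j = 1, \ldots, n_i$, so that $n = \sum_{i=1}^{k} n_i$. After introducing the novel gene selection method, we now turn to some test statistics used for testing the equality of the class means for a fixed gene. The following three parametric test statistics will be considered [10].

3.1 ANOVA F Test Statistic

The definition of this test is given by:

$$F = \frac{(n - k) \sum_{i=1}^{k} n_i \left(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}\right)^2}{(k - 1) \sum_{i=1}^{k} (n_i - 1)\, s_i^2}, \tag{8}$$

where $\bar{Y}_{i\cdot} = \sum_{j=1}^{n_i} Y_{ij} / n_i$, $\bar{Y}_{\cdot\cdot} = \sum_{i=1}^{k} n_i \bar{Y}_{i\cdot} / n$, and $s_i^2 = \sum_{j=1}^{n_i} \left(Y_{ij} - \bar{Y}_{i\cdot}\right)^2 / (n_i - 1)$.

3.2 Brown-Forsythe Test Statistic

The Brown-Forsythe test statistic [11] is given by:

$$B = \frac{\sum_{i=1}^{k} n_i \left(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}\right)^2}{\sum_{i=1}^{k} (1 - n_i / n)\, s_i^2}. \tag{9}$$

3.3 Welch Test Statistic

The Welch test statistic [12] is defined as

$$W = \frac{\sum_{i=1}^{k} w_i \left(\bar{Y}_{i\cdot} - \sum_{i=1}^{k} h_i \bar{Y}_{i\cdot}\right)^2}{(k - 1) + 2 (k - 2)(k + 1)^{-1} \sum_{i=1}^{k} (n_i - 1)^{-1} (1 - h_i)^2}, \tag{10}$$

with $w_i = n_i / s_i^2$ and $h_i = w_i / \sum_{i=1}^{k} w_i$.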
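For a single gene, Eqs. (8)-(10) need only the per-class sizes, means, and variances. The sketch below is a direct transcription under stated assumptions: the function names and the toy values are illustrative, every class must contain at least two samples, and every class variance must be non-zero for the Welch statistic.

```python
def _group_stats(groups):
    """Per-class sizes, means, unbiased variances, total n, and grand mean."""
    sizes = [len(g) for g in groups]
    means = [sum(g) / len(g) for g in groups]
    variances = [
        sum((y - m) ** 2 for y in g) / (len(g) - 1)
        for g, m in zip(groups, means)
    ]
    n = sum(sizes)
    grand = sum(ni * mi for ni, mi in zip(sizes, means)) / n
    return sizes, means, variances, n, grand


def anova_f(groups):
    """ANOVA F statistic, Eq. (8)."""
    sizes, means, variances, n, grand = _group_stats(groups)
    k = len(groups)
    between = sum(ni * (mi - grand) ** 2 for ni, mi in zip(sizes, means))
    within = sum((ni - 1) * s2 for ni, s2 in zip(sizes, variances))
    return (n - k) * between / ((k - 1) * within)


def brown_forsythe(groups):
    """Brown-Forsythe statistic, Eq. (9)."""
    sizes, means, variances, n, grand = _group_stats(groups)
    between = sum(ni * (mi - grand) ** 2 for ni, mi in zip(sizes, means))
    denom = sum((1 - ni / n) * s2 for ni, s2 in zip(sizes, variances))
    return between / denom


def welch(groups):
    """Welch statistic, Eq. (10)."""
    sizes, means, variances, _, _ = _group_stats(groups)
    k = len(groups)
    w = [ni / s2 for ni, s2 in zip(sizes, variances)]
    w_sum = sum(w)
    h = [wi / w_sum for wi in w]
    centre = sum(hi * mi for hi, mi in zip(h, means))  # weighted mean of class means
    num = sum(wi * (mi - centre) ** 2 for wi, mi in zip(w, means))
    den = (k - 1) + 2 * (k - 2) / (k + 1) * sum(
        (1 - hi) ** 2 / (ni - 1) for hi, ni in zip(h, sizes)
    )
    return num / den


# Hypothetical expression values of one gene in three classes.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = anova_f(groups)
b_stat = brown_forsythe(groups)
w_stat = welch(groups)
```

With equal class sizes and equal class variances, as in this toy input, the F and Brown-Forsythe statistics coincide (both equal 13), while the Welch statistic equals 78/7. The three statistics diverge once the class variances differ.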

4 Introduction to the Experiment

To evaluate the performance of GSMI, we applied it to three well-known gene expression data sets: the breast cancer data [5], in which RNA from samples of primary breast tumors from 7 carriers of the BRCA1 mutation, 8 carriers of the BRCA2 mutation, and 7 patients with sporadic cases of breast cancer has been hybridized to a cDNA microarray containing 6512 complementary DNA clones of 5361 genes [6]; Leukemia72 with 6817 genes, 38 ALL-B cell, 9 ALL-T cell, and 25 AML samples; and Ovarian with 7129 genes, 27 epithelial ovarian cancer cases, 5 normal tissues, and 4 malignant epithelial ovarian cell lines [13].

Before calculating the mutual information, the microarray expression levels should first be preprocessed according to an alternative idea of the Optimal Class-Dependence Discretization Algorithm (OCDD) [14]. OCDD is a method that converts continuous variables into discrete variables for inductive machine learning, and it can thus be employed for pattern classification problems. The discretization process is formulated as an optimization problem: the normalized mutual information that measures the interdependence between the class labels and the variable to be discretized serves as the objective function, and iterative dynamic programming is applied to find its optimum [14]. For each continuous gene expression profile in the microarray expression matrix $A$, its domain is discretized into two intervals for gene selection, which are denoted by 0 and 1, respectively. We use the normalized mutual information measure that reflects the interdependence between the class label and the attribute to be discretized as the objective function to find a globally optimal solution separating the domain of the gene expression data.

We employ the nearest neighbor method to classify the cancer cases with different cancer subtypes. Leave-one-out cross-validation (LOOCV) is used to evaluate the accuracy of classification.

5 Experiment Results and Discussion

By using our method, genes are ranked by the mutual information between the genes and the class label.
Then, the nearest neighbor classifier is employed as the benchmark to classify the three cancer microarray expression datasets. In the classification performance evaluation process, we employed LOOCV, which is a widely used method for evaluating the performance of the classification of gene expression data [7].

The results of our method on the three datasets are given in Figure 1, Figure 2 and Figure 3, respectively. From Figure 1, the classification error rate is minimized to 0% when 8 genes are selected according to our method, whereas the genes selected by the test statistics are less effective: the classification accuracy peaks at 73% at 404 genes for the ANOVA test statistic, 78% at 9 genes for the Brown-Forsythe test statistic, and 73% at 6 genes for the Welch test statistic. From Figure 2, the classification error rate is minimized to 1.39% when 9 genes are selected according to our method, whereas the classification accuracy of the test statistics peaks at 85% at 70 genes for the Brown-Forsythe test statistic, 7% at 56 genes for the ANOVA test statistic, and 80.5% at 354 genes for the Welch test statistic. From Figure 3, the classification error rate is minimized to 0% when [...] genes are selected according to our method, whereas the classification accuracy of the test statistics peaks at 97.3% at 38 genes for the Welch test statistic, 97.3% at 370 genes for the ANOVA test statistic, and 94.45% at 05 genes for the Brown-Forsythe test statistic.

[Figure 1: "The result of the classification of breast cancer subtypes". Classification accuracy (y-axis) against the number of selected genes used for the classification of breast cancer subtypes (x-axis, 0 to 500). Curves: Mutual Information, ANOVA F test statistic, Brown-Forsythe test statistic, Welch test statistic.]

Fig. 1. Comparison on the breast cancer cases

As these results demonstrate, the present method based on mutual information clearly outperforms the three test statistics. The major reason for the superiority of our method is that the mutual information between the genes and the class labels reflects their potential relation and correlation, and thus indicates the discriminability of the genes. Moreover, our method makes no assumption about the probability distribution of the microarray data. The three test statistics rely on an assumed probability distribution, but without adequate samples of the cancer cases it is not yet clear whether microarray data follow that distribution.
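The evaluation protocol used above, a nearest neighbor classifier scored by LOOCV on a chosen subset of genes, can be sketched as follows. This is an illustrative sketch only: the paper does not state the distance measure, so Euclidean distance is an assumption, as are the function names and the toy data. Samples are stored here as rows (one expression vector per sample), which is the transpose of the $G \times S$ matrix $A$.

```python
import math


def nearest_neighbor_label(train_x, train_y, query):
    """Label of the training sample closest to query (Euclidean distance)."""
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return train_y[best]


def loocv_accuracy(samples, labels, gene_idx):
    """Leave-one-out accuracy of 1-NN restricted to the genes in gene_idx."""
    reduced = [[s[g] for g in gene_idx] for s in samples]
    correct = 0
    for i in range(len(reduced)):
        # Hold out sample i and classify it from all remaining samples.
        train_x = reduced[:i] + reduced[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_neighbor_label(train_x, train_y, reduced[i]) == labels[i]:
            correct += 1
    return correct / len(reduced)


# Hypothetical toy data: 6 samples, 3 genes; gene 0 separates the classes,
# gene 2 does not.
samples = [
    [0.1, 5.0, 9.0],
    [0.2, 1.0, 1.0],
    [1.0, 4.0, 8.5],
    [1.1, 0.5, 1.2],
    [2.0, 4.5, 9.1],
    [2.1, 0.8, 0.9],
]
labels = ["A", "A", "B", "B", "C", "C"]
acc_informative = loocv_accuracy(samples, labels, [0])
acc_noise = loocv_accuracy(samples, labels, [2])
print(acc_informative, acc_noise)  # 1.0 0.0
```

Sweeping `gene_idx` over the top 1, 2, 3, ... ranked genes and recording the accuracy at each step yields curves of the kind shown in the figures.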

[Figure 2: "The result of the classification of Leukemia cancer cases". Classification accuracy (y-axis) against the number of selected genes (x-axis, 0 to 500). Curves: Mutual Information, ANOVA F test statistic, Brown-Forsythe test statistic, Welch test statistic.]

Fig. 2. Comparison on the leukemia cancer cases

[Figure 3: "The result of the classification of ovarian cancer cases". Classification accuracy (y-axis) against the number of selected genes used for the classification of ovarian cancer cases (x-axis, 0 to 500). Curves: Mutual Information, ANOVA F test statistic, Brown-Forsythe test statistic, Welch test statistic.]

Fig. 3. Comparison on the ovarian cancer cases

6 Conclusions

In this paper, in contrast to other work that discriminates one cancer subtype from another, we present a novel method for the classification of multi-class cancer subtypes, which will contribute to the development of technology for the research and cure of cancer. In our work, an information-theoretic approach is proposed: a gene selection algorithm based on mutual information, which proves promising in the classification of multi-class cancer microarray datasets. The mutual information between the gene expression data and the class label is calculated, and the genes are selected according to the calculated mutual information after removing the redundancy. The successful applications to the breast, ovarian and leukemia cancer datasets indicate that our algorithm is effective, robust and appropriate for the classification of multi-class cancers, since it can discover the informative key genes.

Future work will concentrate on further research into methods for extracting the key features from gene expression data. We will also try to fully evaluate the classification performance by employing SVMs and neural networks, such as the RBF neural network. Furthermore, we will study the biological significance of the key genes found, and try to find their specific features and understand their functions. By doing so, microarray technology can be fully used.

Acknowledgement

The authors are grateful to Dechang Chen for sharing the microarray data sets of breast, ovarian and leukemia cancer with us.

References

1. Ben-Dor, A.: Tissue Classification with Gene Expression Profiles. Journal of Computational Biology, 7 (2000) 559-583
2. Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., Vapnik, V.: Feature Selection for SVMs. In Advances in Neural Information Processing Systems, MIT Press, 13 (2001)
3. Xing, E. P., Richard, M. K.: CLIFF: Clustering of High-Dimensional Microarray Data via Iterative Feature Filtering Using Normalized Cuts. Bioinformatics, 17 (1) (2001) 306-315
4. Cover, T., Thomas, J.: Elements of Information Theory. John Wiley and Sons, Inc (1991)
5. Hedenfalk, I., Duggan, D., Chen, Y., Radmacher, M., Bittner, M., Simon, R., Meltzer, P., Gusterson, B., Esteller, M., Raffeld, M.: Gene-Expression Profiles in Hereditary Breast Cancer. New Eng. J. Med. 344 (2001) 539-548
6. Nathalie, P., Frank, D. S., Johan, A. K. S., Bart, L. R. D. M.: Systematic Benchmarking of Microarray Data Classification: Assessing the Role of Non-linearity and Dimensionality Reduction. Bioinformatics, 20 (17) (2004) 3185-3195
7. Simon, R.: Supervised Analysis when The Number of Candidate Features Greatly Exceeds the Number of the Cases. SIGKDD Explorations, 5 (2) (2003) 31-36
8. Au, W. H., Keith, C.C. C., Andrew, K.C. W., Wang, Y.: Attribute Clustering for Grouping, Selection and Classification of Gene Expression Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, April-June, 2 (2) (2005) 83-101
9. MacKay, D. J. C.: Information Theory, Inference, and Learning Algorithms. Cambridge Univ. Press (2003)
10. Chen, D. C., Liu, Z. Q., Ma, X. B., Hua, D.: Selecting Genes by Test Statistics. Journal of Biomedicine and Biotechnology, 2 (2005) 132-138
11. Brown, M. B., Forsythe, A. B.: The Small Sample Behavior of Some Statistics which Test the Equality of Several Means. Technometrics (1974) 129-132
12. Welch, B. L.: On the Comparison of Several Mean Values: An Alternative Approach. Biometrika, 38 (1951) 330-336
13. Chen, D. C., Hua, D., Jaques, R., Cheng, X. Z.: Gene Selection for Multi-class Prediction of Microarray Data. Bioinformatics Conference, 2003, CSB 2003, Proceedings of the 2003 IEEE, (2003) 492-495
14. Liu, L., Andrew, K.C. W., Wang, Y.: A Global Optimal Algorithm for Class-Dependent Discretization of Continuous Data, 8 (2) (2004) 151-170