Gene Selection Based on Mutual Information for the Classification of Multi-class Cancer


Sheng-Bo Guo 1,2, Michael R. Lyu 3, and Tat-Ming Lok 4

1 Department of Automation, University of Science and Technology of China, Hefei, Anhui, 230026, China
sbguo@iim.ac.cn
2 Intelligent Computation Lab, Hefei Institute of Intelligent Machines, Chinese Academy of Sciences, P.O. Box 1130, Hefei, Anhui, 230031, China
3 Computer Science & Engineering Dept., The Chinese University of Hong Kong, Shatin, Hong Kong
4 Information Engineering Dept., The Chinese University of Hong Kong, Shatin, Hong Kong

Abstract. With the development of microarray technology, microarray data are widely used in the diagnosis of cancer subtypes. However, accurate diagnosis of cancer subtypes remains a complicated problem. Building classifiers based on key genes selected from microarray data is a promising approach, yet the selection of non-redundant but relevant genes is complicated. The set of selected genes should be small enough to allow diagnosis even in regular laboratories and should ideally identify genes involved in cancer-specific regulatory pathways. Whereas traditional gene selection methods target the classification of two categories of cancer, the present paper proposes a novel gene selection algorithm based on mutual information for the classification of multi-class cancer using microarray data; the selected key genes are fed into a classifier to classify the cancer subtypes. In our algorithm, mutual information is employed to select key genes related to the class distinction. The application to breast cancer data suggests that the algorithm can identify the key genes for the BRCA1 mutation/BRCA2 mutation/sporadic class distinction, since it classifies the three types of breast cancer effectively and efficiently. Two more microarray datasets, leukemia and ovarian cancer data, are also employed to validate the performance of our method. The performance in these applications demonstrates the high quality of our method. Based on the present work, our method can be widely used to discriminate different cancer subtypes, which will contribute to the development of technology for the treatment of cancer.

D.-S. Huang, K. Li, and G.W. Irwin (Eds.): ICIC 2006, LNBI 4115, pp. 454-463, 2006. © Springer-Verlag Berlin Heidelberg 2006

1 Introduction

Microarray technology, a recent development in experimental molecular biology, gives biomedical researchers the ability to measure the expression levels of thousands of genes simultaneously. Such gene expression profiles are used to understand the molecular variations among disease-related cellular processes, and also to support the development of diagnostic tools and classification platforms in cancer research. With the development of microarray technology, the necessary processing and analysis methods grow increasingly critical. Exploring appropriate approaches is becoming urgent and challenging because microarray data comprise a large number of genes compared with the small number of samples in a specific experiment.

For the data obtained in a typical experiment, only some of the genes are useful for differentiating samples among different classes, while many other genes are irrelevant to the classification. Those irrelevant genes not only introduce unnecessary noise into gene expression data analysis, but also increase the dimensionality of the gene expression matrix, which increases the computational complexity of subsequent analyses such as classification and clustering. As a consequence, it is important to eliminate the irrelevant genes and identify the informative genes, which is a feature selection problem crucial in gene expression data analysis [1, 2].

In the present paper, we propose a novel gene selection method based on mutual information (GSMI) for multi-class cancer classification using microarray data. Our method first calculates the mutual information (MI) between the discretized gene expression profiles and the cancer class label vector over all the samples. Then, the genes are ranked according to the calculated MI. The selected genes with high ranks are fed into the nearest neighbor method.

The rest of the paper is organized as follows. In Section 2, we outline the discretization of the gene expression data and then formulate the principle of GSMI in detail. Section 3 describes the test statistics, and Section 4 describes the experiment. In Section 5, GSMI is applied to analyze the breast cancer, leukemia and ovarian cancer datasets. Section 6 contains the conclusions.
2 Methods

Among the many thousands of genes measured simultaneously in a specific microarray experiment, not all expression levels can be related to a particular partition of the samples. In the analysis of a biological system, the following rules of thumb regarding gene functions are often assumed: 1) a gene can be in either the on or the off state; 2) not all genes simultaneously respond to a single physiological event; 3) gene functions are highly redundant [3]. According to these assumptions, we consider the genes as random variables with two values, in which 1 denotes the on state and 0 denotes the off state. As a consequence, the gene expression data can be discretized into the two states 0 and 1. The discretization of the gene expression data is formulated later in Section 4.

Assume that a microarray dataset can be represented as a $G \times S$ matrix $A$ with generic element $a_{gs}$ representing the expression level of gene $g$ in sample $s$. All the samples are divided into $n$ categories, and the class labels are collected in a vector $C$ whose $s$-th element gives the class of the $s$-th sample. From the biological point of view, genes that agree more closely with the class labels of the cancer microarray data contribute more to the classification of the cancer subtypes. Consequently, these genes should be selected as the key genes and used in the subsequent classification and clustering. According to information theory, mutual information can be used to measure the mutual agreement between two object models. We therefore rank every gene according to the mutual information between the gene and the class label of the cancer microarray data.

The mutual information of two random variables $X$ and $Y$ with a joint probability mass function $p(x, y)$ and marginal probability mass functions $p(x)$ and $p(y)$ is defined as [4]:

\[ I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}. \tag{1} \]

Let us suppose that the domain of $G_i$, $i \in \{1, \dots, G\}$, is discretized into two intervals. After discretization, the domains of all the genes can be represented by $\mathrm{dom}(G_i) = \{v_k\}$, $k = 1, 2$, where $v_1 = 0$ and $v_2 = 1$. Let $\sigma$ denote the SELECT operation from relational algebra and $|S|$ the cardinality of a set $S$ [8]. The probability of a gene in the microarray data having $G_i = v_k$, $i \in \{1, \dots, G\}$, $k \in \{1, 2\}$, is then given by:

\[ P(G_i = v_k) = \frac{|\sigma_{G_i = v_k}(A)|}{|\sigma_{G_i \neq \mathrm{NULL}}(A)|}. \tag{2} \]

The joint probability that a gene in the gene expression data has $G_i = v_k$ and the class label is $C = c_j$, $j \in \{1, \dots, n\}$, is calculated by:

\[ P(G_i = v_k \wedge C = c_j) = \frac{|\sigma_{G_i = v_k \wedge C = c_j}(A)|}{|\sigma_{G_i \neq \mathrm{NULL}}(A)|}. \tag{3} \]

Definition 1. The interdependence measure $I$ between the gene and the class label, $G_i$ and $C$, $i \in \{1, \dots, G\}$, is defined as:

\[ I(G_i : C) = \sum_{k=1}^{2} \sum_{j=1}^{n} P(G_i = v_k \wedge C = c_j) \log \frac{P(G_i = v_k \wedge C = c_j)}{P(G_i = v_k)\, P(C = c_j)}. \tag{4} \]
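As an illustration of Definition 1, the interdependence measure can be estimated from simple counts over the samples. The following is a minimal sketch, not the authors' implementation: the function names `mutual_information` and `rank_genes`, the toy data, and the choice of base-2 logarithms are all assumptions made for the example (the text does not state the logarithm base).

```python
from collections import Counter
from math import log2


def mutual_information(gene_states, class_labels):
    """Empirical I(G_i : C) in bits between one discretized gene and the class labels."""
    if len(gene_states) != len(class_labels):
        raise ValueError("one class label per sample is required")
    total = len(gene_states)
    joint = Counter(zip(gene_states, class_labels))  # counts of (v_k, c_j) pairs
    gene_counts = Counter(gene_states)               # counts of v_k
    class_counts = Counter(class_labels)             # counts of c_j
    mi = 0.0
    for (v, c), count in joint.items():
        p_joint = count / total
        p_gene = gene_counts[v] / total
        p_class = class_counts[c] / total
        mi += p_joint * log2(p_joint / (p_gene * p_class))
    return mi


def rank_genes(discretized, class_labels):
    """Return gene indices sorted by decreasing I(G_i : C)."""
    scores = [mutual_information(row, class_labels) for row in discretized]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy data: two discretized genes measured on six samples.
labels = ["BRCA1", "BRCA1", "BRCA2", "BRCA2", "sporadic", "sporadic"]
toy = [
    [1, 1, 0, 0, 0, 0],  # on only in the BRCA1 samples
    [1, 0, 1, 0, 1, 0],  # independent of the class label
]
print(rank_genes(toy, labels))
```

Only the (state, class) pairs that actually occur contribute to the sum, which matches the usual convention that terms with zero joint probability are taken as zero.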

$I(G_i : C)$ measures the average reduction in uncertainty about $G_i$ that results from learning the value of $C$ [9]. If $I(G_i : C) > I(G_j : C)$, $i, j \in \{1, \dots, G\}$, $i \neq j$, the dependence between $G_i$ and the class label $C$ is greater than the dependence between $G_j$ and $C$.

Before the genes are ranked according to the mutual information, the redundancy in the microarray data should be reduced. Owing to the way microarrays are constructed, the gene expression matrix is highly redundant, since some genes are measured more than once.

Definition 2. The mutual information matrix of the microarray, named $M$ with element $m_{ij}$, is given by:

\[ m_{ij} = \sum_{k} \sum_{l} P(G_i = v_k \wedge G_j = v_l) \log \frac{P(G_i = v_k \wedge G_j = v_l)}{P(G_i = v_k)\, P(G_j = v_l)}. \tag{5} \]

For simplicity, the mutual information matrix $M$ is normalized to $M^*$, with its element $m^*_{ij}$ given as follows:

\[ m^*_{ij} = \frac{m_{ij}}{H(G_i, G_j)}, \tag{6} \]

where the joint entropy of the genes $G_i$ and $G_j$ is denoted by $H(G_i, G_j)$, which is given by:

\[ H(G_i, G_j) = -\sum_{k} \sum_{l} P(G_i = v_k \wedge G_j = v_l) \log P(G_i = v_k \wedge G_j = v_l). \tag{7} \]

The redundancy in the microarray data is reduced by the following method. In the matrix $M^*$, the elements on the diagonal all have the same value, because the mutual information of a gene with itself equals its entropy. Ignoring the diagonal, an entry of $M^*$ at or above the cutoff value of 0.95 marks the corresponding pair of genes as redundant. The rows of the normalized mutual information matrix are labeled according to this cutoff, and the genes corresponding to the labeled rows are selected. Because of numerical error in computing the mutual information, the cutoff value is set to 0.95 so that the redundancy can be reduced as much as possible; if the cutoff value were set to 1, the redundancy could not be reduced to the expected extent.

After the genes with little redundancy, if any, have been selected, the selected genes (SGS) are ranked according to the interdependence measure $I$ between the gene expression profiles and the class label. The SGS are then used to train the classifier that distinguishes the cancer subtypes in the experiment described in Section 4.
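The redundancy reduction above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' procedure: the text does not say which gene of a redundant pair is retained, so the greedy "keep the first, drop later near-duplicates" policy in the hypothetical `drop_redundant` is an assumption, as are the base-2 logarithm and the convention that two constant genes count as fully redundant.

```python
from collections import Counter
from math import log2


def joint_entropy(a, b):
    """H(G_i, G_j) of Eq. (7), estimated from paired discretized profiles."""
    total = len(a)
    return -sum((c / total) * log2(c / total) for c in Counter(zip(a, b)).values())


def mutual_info(a, b):
    """m_ij of Eq. (5), estimated from paired discretized profiles."""
    total = len(a)
    pa, pb = Counter(a), Counter(b)
    return sum((c / total) * log2(c * total / (pa[x] * pb[y]))
               for (x, y), c in Counter(zip(a, b)).items())


def normalized_mi(a, b):
    """m*_ij of Eq. (6); constant pairs (zero joint entropy) are treated as redundant."""
    h = joint_entropy(a, b)
    return mutual_info(a, b) / h if h > 0 else 1.0


def drop_redundant(genes, cutoff=0.95):
    """Assumed greedy policy: keep a gene unless it is redundant with one already kept."""
    kept = []
    for i, row in enumerate(genes):
        if all(normalized_mi(row, genes[j]) < cutoff for j in kept):
            kept.append(i)
    return kept


# Hypothetical toy data: four discretized genes on six samples.
genes = [
    [0, 0, 1, 1, 0, 1],
    [0, 0, 1, 1, 0, 1],  # duplicate measurement of gene 0
    [1, 1, 0, 0, 1, 0],  # complement of gene 0, carries the same information
    [0, 1, 0, 1, 1, 0],
]
print(drop_redundant(genes))
```

Note that the normalized measure also flags the complemented profile as redundant, because mutual information depends only on how the states co-occur and not on which state is called "on".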

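3 Test Statistics

This section introduces the notation for the gene expression values, followed by several test statistics. Assume that there are more than two distinct tumor tissue classes for the problem under consideration, and that there are $p$ genes (variables) and $n$ tumor mRNA samples (observations). Having introduced the novel gene selection method, we now turn to some test statistics used for testing the equality of the class means for a fixed gene. For a fixed gene, let $Y_{ij}$ denote the expression value of the $j$-th sample in class $i$, where $i = 1, \dots, k$ and $j = 1, \dots, n_i$, so that $n = \sum_{i=1}^{k} n_i$. The following three parametric test statistics are considered [10].

3.1 ANOVA F Test Statistic

The definition of this test is given by:

\[ F = \frac{(n - k) \sum_{i} n_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2}{(k - 1) \sum_{i} (n_i - 1)\, s_i^2}, \tag{8} \]

where $\bar{Y}_{i\cdot} = \sum_{j} Y_{ij} / n_i$, $\bar{Y}_{\cdot\cdot} = \sum_{i} n_i \bar{Y}_{i\cdot} / n$, and $s_i^2 = \sum_{j} (Y_{ij} - \bar{Y}_{i\cdot})^2 / (n_i - 1)$.

3.2 Brown-Forsythe Test Statistic

The Brown-Forsythe test statistic [11] is given by:

\[ B = \frac{\sum_{i} n_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2}{\sum_{i} (1 - n_i / n)\, s_i^2}. \tag{9} \]

3.3 Welch Test Statistic

The Welch test statistic [12] is defined as

\[ W = \frac{\sum_{i} w_i \left( \bar{Y}_{i\cdot} - \sum_{i} h_i \bar{Y}_{i\cdot} \right)^2}{(k - 1) + 2 (k - 2)(k + 1)^{-1} \sum_{i} (n_i - 1)^{-1} (1 - h_i)^2}, \tag{10} \]

with $w_i = n_i / s_i^2$ and $h_i = w_i / \sum_{i} w_i$.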
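The three statistics of Eqs. (8) to (10) can be computed for a single gene as in the sketch below. It is offered as an illustration only: the function names and the toy values are hypothetical, each class is assumed to contain at least two samples, and the Welch statistic additionally assumes a non-zero variance in every class.

```python
def class_summaries(groups):
    """Return (n_i, class mean, s_i^2) for each class of a single gene."""
    stats = []
    for values in groups:
        n_i = len(values)
        mean_i = sum(values) / n_i
        var_i = sum((y - mean_i) ** 2 for y in values) / (n_i - 1)
        stats.append((n_i, mean_i, var_i))
    return stats


def between_class_sum(stats):
    """Sum over classes of n_i * (class mean - grand mean)^2, shared by Eqs. (8) and (9)."""
    n = sum(n_i for n_i, _, _ in stats)
    grand = sum(n_i * m for n_i, m, _ in stats) / n
    return sum(n_i * (m - grand) ** 2 for n_i, m, _ in stats)


def anova_f(groups):
    """ANOVA F statistic, Eq. (8)."""
    stats = class_summaries(groups)
    k = len(stats)
    n = sum(n_i for n_i, _, _ in stats)
    within = sum((n_i - 1) * v for n_i, _, v in stats)
    return (n - k) * between_class_sum(stats) / ((k - 1) * within)


def brown_forsythe(groups):
    """Brown-Forsythe statistic, Eq. (9)."""
    stats = class_summaries(groups)
    n = sum(n_i for n_i, _, _ in stats)
    return between_class_sum(stats) / sum((1 - n_i / n) * v for n_i, _, v in stats)


def welch(groups):
    """Welch statistic, Eq. (10)."""
    stats = class_summaries(groups)
    k = len(stats)
    w = [n_i / v for n_i, _, v in stats]
    h = [w_i / sum(w) for w_i in w]
    weighted_mean = sum(h_i * m for h_i, (_, m, _) in zip(h, stats))
    numerator = sum(w_i * (m - weighted_mean) ** 2 for w_i, (_, m, _) in zip(w, stats))
    correction = sum((1 - h_i) ** 2 / (n_i - 1) for h_i, (n_i, _, _) in zip(h, stats))
    return numerator / ((k - 1) + 2 * (k - 2) / (k + 1) * correction)


# Hypothetical expression values of one gene in three tumor classes.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
print(anova_f(groups), brown_forsythe(groups), welch(groups))
```

With equal class sizes and equal class variances, as in this toy example, the ANOVA F and Brown-Forsythe statistics coincide; they differ once the class variances or sizes differ.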

4 Introduction to the Experiment

To evaluate the performance of GSMI, we applied it to three well-known gene expression data sets: the breast cancer data [5], in which RNA from samples of primary breast tumors from 7 carriers of the BRCA1 mutation, 8 carriers of the BRCA2 mutation, and 7 patients with sporadic cases of breast cancer has been hybridized to a cDNA microarray containing 6512 complementary DNA clones of 5361 genes [6]; Leukemia72 with 6817 genes, 38 ALL-Bcell, 9 ALL-Tcell, and 25 AML; and Ovarian with 7129 genes, 27 epithelial ovarian cancer cases, 5 normal tissues, and 4 malignant epithelial ovarian cell lines [13].

Before the mutual information is calculated, the microarray expression levels are first preprocessed according to a variant of the Optimal Class-Dependence Discretization Algorithm (OCDD) [14]. OCDD is a method for converting continuous variables into discrete variables for inductive machine learning, and can thus be employed for pattern classification problems. The discretization process is formulated as an optimization problem: the normalized mutual information that measures the interdependence between the class labels and the variable to be discretized serves as the objective function, and iterative dynamic programming is applied to find its optimum [14]. For each continuous gene expression profile in the microarray expression matrix $A$, its domain is discretized into two intervals for gene selection, which are denoted by 0 and 1, respectively. We then use the normalized mutual information measure, which reflects the interdependence between the class label and the attribute to be discretized, as the objective function to find a globally optimal solution separating the domain of the gene expression data. We employ the nearest neighbor method to classify the cancer cases with different cancer subtypes. Leave-one-out cross-validation (LOOCV) is used to evaluate the accuracy of classification.

5 Experiment Results and Discussion

By using our method, genes are ranked by the mutual information between the genes and the class label.
Then, the nearest neighbor classifier is employed as the benchmark to classify the three cancer microarray expression datasets. In the classification performance evaluation process, we employed LOOCV, which is a widely used method for evaluating the performance of the classification of gene expression data [7]. The results of our method on the three datasets are given in Figure 1, Figure 2 and Figure 3, respectively.

From Figure 1, the classification error rate is minimized to 0% when 8 genes are selected according to our method, whereas the genes selected by the test statistics are not as effective: the classification accuracies peak at 73% at 404 genes for the ANOVA test statistic, 78% at 9 genes for the Brown-Forsythe test statistic, and 73% at 6 genes for the Welch test statistic. From Figure 2, the classification error rate is minimized to 1.39% when 9 genes are selected according to our method, whereas the classification accuracies peak at 85% at 70 genes for the Brown-Forsythe test statistic, 7% at 56 genes for the ANOVA test statistic, and 80.5% at 354 genes for the Welch test statistic. From Figure 3, the classification error rate is minimized to 0% when genes are selected according to our method, whereas the classification accuracies peak at 97.3% at 38 genes for the Welch test statistic, 97.3% at 370 genes for the ANOVA test statistic, and 94.45% at 05 genes for the Brown-Forsythe test statistic.

Fig. 1. Comparison on the breast cancer cases (classification accuracy versus the number of selected genes used for the classification of breast cancer subtypes; curves: Mutual Information, ANOVA F test statistic, Brown-Forsythe test statistic, Welch test statistic)

As these results demonstrate, the present method based on mutual information outperforms the three test statistics on all three datasets. The major reason for the superiority of our method is that the mutual information between the genes and the class labels reflects their potential relation and correlation, and thus indicates the discriminability of the genes. What is more, no assumption is made about the probability distribution of the microarray data. The three test statistics assume a particular probability distribution, but without adequate samples of the cancer cases it is not yet clear whether microarray data follow that distribution.
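The preprocessing and evaluation protocol of Sections 4 and 5 can be sketched as follows. This is a simplified illustration, not the OCDD algorithm and not the authors' code: the hypothetical `discretize` replaces the iterative dynamic programming of OCDD with an exhaustive search over a single cut point, using the normalized mutual information as the objective, and `loocv_1nn_accuracy` assumes squared Euclidean distance, which the text does not specify. The sketch evaluates a fixed set of features; the text does not state whether gene selection is repeated inside each leave-one-out fold.

```python
from collections import Counter
from math import log2


def normalized_mi(states, labels):
    """Mutual information between gene states and class labels, divided by their joint entropy."""
    total = len(states)
    joint = Counter(zip(states, labels))
    ps, pl = Counter(states), Counter(labels)
    mi = sum(c / total * log2(c * total / (ps[s] * pl[l])) for (s, l), c in joint.items())
    h = -sum(c / total * log2(c / total) for c in joint.values())
    return mi / h if h > 0 else 0.0


def discretize(values, labels):
    """Split one gene into states 0/1 at the single cut point with the highest objective."""
    ordered = sorted(set(values))
    cuts = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    if not cuts:  # constant profile: nothing to split
        return [0] * len(values)
    best = max(cuts, key=lambda cut: normalized_mi([int(v > cut) for v in values], labels))
    return [int(v > best) for v in values]


def loocv_1nn_accuracy(samples, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier."""
    correct = 0
    for i, held_out in enumerate(samples):
        best_label, best_dist = None, float("inf")
        for j, other in enumerate(samples):
            if j == i:
                continue  # the held-out sample may not vote for itself
            dist = sum((a - b) ** 2 for a, b in zip(held_out, other))
            if dist < best_dist:
                best_label, best_dist = labels[j], dist
        correct += best_label == labels[i]
    return correct / len(samples)


# Hypothetical toy data: one gene profile and six two-gene samples.
classes = ["A", "A", "A", "B", "B", "B"]
profile = [0.1, 0.3, 0.2, 2.1, 1.9, 2.4]
samples = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0], [1.2, 0.9]]
print(discretize(profile, classes))
print(loocv_1nn_accuracy(samples, classes))
```

In use, each row of the expression matrix would be discretized, the genes ranked by their mutual information with the class label, and the top-ranked genes passed as features to the leave-one-out evaluation.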

Fig. 2. Comparison on the leukemia cancer cases (classification accuracy versus the number of selected genes; curves: Mutual Information, ANOVA F test statistic, Brown-Forsythe test statistic, Welch test statistic)

Fig. 3. Comparison on the ovarian cancer cases (classification accuracy versus the number of selected genes used for the classification of ovarian cancer cases; curves: Mutual Information, ANOVA F test statistic, Brown-Forsythe test statistic, Welch test statistic)

6 Conclusions

In this paper, in contrast to other work that discriminates one cancer subtype from another, we present a novel method for the classification of multi-class cancer subtypes, which will contribute to the development of technology for the research and cure of cancer. In our work, an information-theoretic gene selection algorithm based on mutual information is proposed, which proves promising in the classification of multi-class cancer microarray datasets. The mutual information between the gene expression data and the class label is calculated, and the genes are selected according to the calculated mutual information after the redundancy has been removed. The successful applications to the breast, ovarian and leukemia cancer datasets indicate that our algorithm is effective, robust and appropriate for the classification of multi-class cancers, since it can discover the informative key genes.

Future work will concentrate on methods for extracting the key features from gene expression data. We will also evaluate the classification performance more fully by employing SVMs and neural networks, such as the RBF neural network. Furthermore, we will study the biological significance of the key genes found, and try to identify their specific features and understand their functions. By doing so, microarray technology can be used to its full potential.

Acknowledgement

The authors are grateful to Dechang Chen for sharing the microarray data sets of breast, ovarian and leukemia cancer with us.

References

1. Ben-Dor, A.: Tissue Classification with Gene Expression Profiles. Journal of Computational Biology, 7 (2000) 559-583
2. Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., Vapnik, V.: Feature Selection for SVMs. In Advances in Neural Information Processing Systems, MIT Press, 13 (2001)
3. Xing, E. P., Richard, M. K.: CLIFF: Clustering of High-Dimensional Microarray Data via Iterative Feature Filtering Using Normalized Cuts. Bioinformatics, 17 (1) (2001) 306-315
4. Cover, T., Thomas J.: Elements of Information Theory. John Wiley and Sons, Inc (1991)
5. Hedenfalk, I., Duggan, D., Chen, Y., Radmacher, M., Bittner, M., Simon, R., Meltzer, P., Gusterson, B., Esteller, M., Raffeld, M.: Gene-Expression Profiles in Hereditary Breast Cancer. New Eng. J. Med. 344 (2001) 539-548
6. Nathalie, P., Frank, D. S., Johan, A. K. S., Bart, L. R. D. M.: Systematic Benchmarking of Microarray Data Classification: Assessing the Role of Non-linearity and Dimensionality Reduction. Bioinformatics, 20 (17) (2004) 3185-3195
7. Simon, R.: Supervised Analysis when The Number of Candidate Features Greatly Exceeds the Number of the Cases. SIGKDD Explorations, 5 (2) (2003) 31-36
8. Au, W. H., Keith, C.C. C., Andrew, K.C. W., Wang, Y.: Attribute Clustering for Grouping, Selection and Classification of Gene Expression Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, April-June, 2 (2) (2005) 83-101
9. MacKay, D. J. C.: Information Theory, Inference, and Learning Algorithms. Cambridge Univ. Press (2003)
10. Chen, D. C., Liu, Z. Q., Ma, X. B., Hua, D.: Selecting Genes by Test Statistics. Journal of Biomedicine and Biotechnology, 2 (2005) 132-138
11. Brown, M. B., Forsythe, A. B.: The Small Sample Behavior of Some Statistics Which Test the Equality of Several Means. Technometrics (1974) 129-132
12. Welch, B. L.: On the Comparison of Several Mean Values: An Alternative Approach. Biometrika, 38 (1951) 330-336
13. Chen, D. C., Hua, D., Jaques, R., Cheng, X. Z.: Gene Selection for Multi-class Prediction of Microarray Data. Bioinformatics Conference, 2003, CSB 2003, Proceedings of the 2003 IEEE, (2003) 492-495
14. Liu, L., Andrew, K.C. W., Wang, Y.: A Global Optimal Algorithm for Class-Dependent Discretization of Continuous Data, 8 (2) (2004) 151-170