www.arpapress.com/volumes/vol8issue2/ijrras_8_2_02.pdf

AN ENHANCED GAGS BASED MTSVSL LEARNING TECHNIQUE FOR CANCER MOLECULAR PATTERN PREDICTION OF CANCER CLASSIFICATION

I. Julie 1 & E. Kirubakaran 2
1 Department of Computer Science, Arignar Anna Government Arts College, Musiri 621 201, India
2 Senior Deputy General Manager, BHEL, Trichy 620 014, India

ABSTRACT
Cancer classification is becoming the critical basis of patient therapy. Researchers are continuously developing and applying the most accurate classification algorithms based on the gene expression profiles of patients. Microarray technologies have had an enormous impact on cancer genome research. To predict the cancer class, two methods have been proposed, namely the Signal-to-Noise Ratio (SNR) based Genetic Algorithm on Gene Selection (GAGS) and the Multi-Task Support Vector Sample Learning Technique (MTSVSL). GAGS is a filter, which is used to select target genes in the diagnosis of cancer. The MTSVSL learning technique is a wrapper, which is based on a Back Propagation Neural Network (BPNN) and a Linear Support Vector Machine. That work yields good classification accuracy for Leukaemia cancer genes. From the literature survey, this research work found that the classification performance in terms of accuracy and error rate could be improved if a Counter Propagation Neural Network (CPNN) is combined with MTSVSL instead of BPNN. This is called the Enhanced MTSVSL (EMTSVSL) learning technique. From the experimental results, it is established that the proposed technique achieves higher classification performance in terms of accuracy and error rate as compared with the existing technique.

Keywords: Gene Prediction, Genetic Algorithm Gene Selection, Cancer Classification, Multi Task Learning, Support Vectors, Back Propagation Neural Network, Counter Propagation Neural Network.

1. INTRODUCTION
Microarray technologies, which measure the expression levels of thousands of genes simultaneously, have had a great impact on cancer genome research over the past few years.
The microarray gene selection [1,2,4,7] procedure is shown in Figure 1. Currently, microarray-based gene expression profiling is viewed as a promising approach for predicting cancer classes and prognosis outcomes. In most cases, cancer diagnosis depends on a complex combination of clinical and histopathological data. However, it is often difficult or impossible to recognize a tumor type in some atypical instances. Large-scale profiling of genetic expression and genomic alterations using DNA microarrays can reveal the differences between normal and malignant cells, the genetic and cellular changes at each stage of tumor progression and metastasis, and the differences among cancers of different origins.

Cancer classification is becoming the critical basis of patient therapy. Researchers are continuously developing and applying the most accurate classification algorithms based on the gene expression profiles of patients. Several data mining techniques [1,2,4,5,7,8,9], such as Support Vector Machines (SVM), K-Nearest Neighbors, the Ensemble Rough Hypercuboid Approach, the Multiple-Filter-Multiple-Wrapper Approach, Principal Component Analysis (PCA), Nonnegative Principal Component Analysis (NPCA), the Nonparallel Plane Proximal Classifier (NPPC), the Back Propagation Neural Network and Multiple Filter with Multiple Wrapper (MFMW), have been proposed and applied in cancer diagnosis and classification.

In a microarray chip, the number of genes available is far greater than the number of samples, a well-known problem called the curse of dimensionality [8]. However, most genes in a microarray contribute little to the sample classification problem. Therefore, prior to sample classification, it is important to perform gene selection, whereby more interpretable genes are identified as biomarkers, so that a more efficient, accurate, and reliable classification performance can be expected. These biomarkers may also be useful for assessing disease risk [6] and understanding the basic biology of a disorder [8]. There are, in general, two approaches to gene selection, namely filters and wrappers [8].
The filter approach selects genes according to their discriminative power with regard to the class labels of samples. Methods such as the Signal-to-Noise Ratio (SNR), t-statistics (TS), the threshold number of misclassifications (TNoM) score, and the F-test have been shown to be effective scores for measuring the discriminative power of genes. In all cases, genes are ranked according to their statistical scores, and a certain number of the highest-ranking genes are selected for the purpose of classification. However, these filters have failed to select more interpretable genes. To overcome this problem, this paper focuses on the Signal-to-Noise Ratio (SNR) based Genetic Algorithm on Gene Selection (GAGS), which improves the performance of the gene selection technique so that more interpretable genes can be identified and a more efficient, accurate, and reliable classification performance can be achieved.

In the wrapper approach, genes are selected sequentially, one by one, so as to optimize the training accuracy of a particular classifier [8]. That is, the classifier is first trained using one single gene, and this training is performed for every gene in the original gene set. The gene that gives the highest training accuracy is selected. Then, a second gene is added to the selected gene, and the gene that gives the highest training accuracy for the two-gene classifier is chosen. This process is continued until a sufficiently high accuracy is achieved with a certain gene subset. From the literature survey, it is observed that existing classifiers such as the Support Vector Machine (SVM) and k-Nearest Neighbor have their own limitations, such as false positive and false negative classifications.

Figure 1. Microarray Gene Selection Mechanism

To overcome this, Austin H. Chen et al. proposed the MTSVSL learning technique, which is based on a Back Propagation Neural Network and a Linear Support Vector Machine. That work yields good classification performance in terms of accuracy and error rate.

1.1 Objective of this Work
However, from our literature survey [1,2,8,9], it is identified that the performance of the Multi-Task Support Vector Sample Learning (MTSVSL) technique could be improved if a Counter Propagation Neural Network is introduced with Genetic Algorithm based Gene Selection (GAGS) in place of the Back Propagation Neural Network (BPNN). The result is named the Enhanced MTSVSL (EMTSVSL) learning technique. The aim is to find an optimal informative gene subset, thereby avoiding the over-fitting problem caused by attempting to apply a large number of genes to a small number of samples.

2. BACKGROUND
In this section, the features of the Signal-to-Noise (SNR) gene selection method, the Genetic Algorithm based Gene Selection (GAGS) method, the Support Vector Sampling Technique (SVS) and the Multi-Task Learning (MTL) method are discussed.

2.1 Signal-to-Noise (SNR) based Gene Selection Method
Gene selection is widely used to select target genes in the diagnosis of cancer. One of the prime goals of gene selection is to avoid the over-fitting problems caused by the high dimensionality and relatively small number of samples of microarray data. Theoretically, in cancer classification, only informative genes which are highly related to particular classes should be selected. The study of Austin H. Chen et al. used the Signal-to-Noise Ratio (SNR) as the gene selection method [1]. For each gene, that work normalized the gene expression data by subtracting the mean and then dividing by the standard deviation of the expression value. Every sample is labeled with {+1, -1} as either a normal or a cancer sample. The following formula is used to calculate each gene's F score.
$$F(g_i) = \left| \frac{\mu_{+1}(g_i) - \mu_{-1}(g_i)}{\sigma_{+1}(g_i) + \sigma_{-1}(g_i)} \right| \tag{1}$$

Here $\mu$ and $\sigma$ are the mean and standard deviation of the samples in each class (either +1 or -1) individually. The genes are ranked by their F score.

2.2 Genetic Algorithm based Gene Selection (GAGS) Technique
The genetic algorithm [1] is an effective algorithm for searching a complex high-dimensional space and finding the optimal solution. Austin H. Chen et al. proposed this Genetic Algorithm based Gene Selection method, which can find the most informative gene set. The genetic algorithm is a type of evolutionary computing method widely used in simulating the process of natural selection. The basic concept behind the genetic algorithm consists of four steps:
1. Population
2. Reproduction
3. Crossover
4. Mutation
Before beginning the genetic algorithm, that work randomly separated the gene expression data into three parts: the Testing Dataset, the Training Dataset and the Validation Dataset. The Testing Dataset is an independent dataset used purely for measuring the classification performance.

2.3 Population
Here, all the genes are randomly separated into m chromosomes, and each chromosome contains n genes. Each chromosome represents a possible gene subset. The values of m and n are set according to the requirement.

2.4 Reproduction
In the biological evolutionary process, only the organisms that adapt to the environment survive. Likewise, only chromosomes with high fitness scores replicate and are passed on to the next stage. The fitness function is defined as

$$\mathrm{Fitness} = \frac{1}{3}\,\mathrm{ATR} + \frac{2}{3}\,\mathrm{ATV} \tag{2}$$

where ATR is the predictive accuracy on the training dataset using the support vector machine and ATV is the predictive accuracy on the validation dataset. The reproduction rate may influence the variety of chromosomes. If the variety of chromosomes is low, the genetic algorithm may get caught in a local optimum instead of reaching the global optimum.

2.5 Crossover
After the reproduction phase, offspring are created by crossing over the parent chromosomes at the cross point.
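As an illustration of the F score in Equation (1) and the fitness function in Equation (2), the following minimal Python sketch ranks genes by SNR and scores a chromosome. It assumes that Equation (1) is the absolute difference of the class means divided by the sum of the class standard deviations, that Equation (2) weights ATR by 1/3 and ATV by 2/3, and that the population standard deviation is used. The function names, the zero-spread guard and the toy expression values are hypothetical and are not taken from the original tool.

```python
import math
from statistics import mean, pstdev


def snr_score(expression, labels):
    """F score of one gene (Equation 1): |mu(+1) - mu(-1)| / (sigma(+1) + sigma(-1))."""
    pos = [x for x, y in zip(expression, labels) if y == +1]
    neg = [x for x, y in zip(expression, labels) if y == -1]
    gap = abs(mean(pos) - mean(neg))
    spread = pstdev(pos) + pstdev(neg)  # assumption: population standard deviation
    if spread == 0:
        # Assumption: with no spread in either class, a non-zero gap separates perfectly.
        return math.inf if gap > 0 else 0.0
    return gap / spread


def top_genes(expression_by_gene, labels, k):
    """Rank genes by F score and keep the k highest-ranking ones."""
    scores = {gene: snr_score(values, labels) for gene, values in expression_by_gene.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


def fitness(atr, atv):
    """Chromosome fitness (Equation 2), read here as ATR/3 + 2*ATV/3."""
    return atr / 3.0 + 2.0 * atv / 3.0


# Toy data: four samples, two genes (hypothetical values).
labels = [+1, +1, -1, -1]
data = {
    "gene_a": [1.0, 3.0, 5.0, 7.0],  # class means 2 and 6, each class std 1
    "gene_b": [2.0, 4.0, 3.0, 5.0],  # class means 3 and 4, each class std 1
}
ranked = top_genes(data, labels, k=1)
```

In this toy case `gene_a` has the larger separation between classes and is ranked first. The heavier weight on ATV in `fitness` favours chromosomes that generalise to the validation dataset over those that only fit the training dataset.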
The single-point crossover approach was used. The crossover point is randomly generated, and two randomly selected chromosomes are crossed over at this point.

2.6 Mutation
To increase the possibility of finding the optimal solution, a mutation phase is applied. Let P and p be the mutation probabilities of each chromosome and each gene, respectively. Every chromosome generates a random number R, and if R > P the chromosome is added to the mutation pool. Every gene in these chromosomes also generates a random number r, and if r > p the gene is replaced with another randomly selected gene from the F-gene pool.

2.7 Multi-Task Support Vector Sample Learning (MTSVSL)
Multi-Task Support Vector Sample Learning (MTSVSL) has two methodologies [1], which are combined to improve the classification accuracy obtained from gene expression data. They are the Support Vector Sample (SVS) method and the Multi-Task Learning (MTL) method. By using this approach, a classifier can learn two tasks. The main task is "which kind of sample is this?" and the second task is "is this sample a support vector sample?". The samples are categorized into four classes, namely
1. The sample which belongs to class 1 and is a support vector sample
2. The sample which belongs to class 2 and is a support vector sample
3. The sample which belongs to class 1 but is not a support vector sample
4. The sample which belongs to class 2 but is not a support vector sample

2.8 Support Vector Sampling Technique (SVS)
A binary SVM [1,8,9] attempts to find a hyperplane which maximizes the margin between two classes (+1/-1). Let

$$\{(X_i, Y_i)\}, \quad i = 1, 2, \ldots, N, \quad X_i \in R^n, \quad Y_i \in \{-1, +1\} \tag{3}$$

be the gene expression data with positive and negative class labels, where $N$ denotes the number of samples. The SVM learning algorithm finds a maximum-margin separating hyperplane $W \cdot X + b = 0$, where $W$ is the n-dimensional normal vector perpendicular to the hyperplane and $b$ is the bias. The SVM decision function is shown in formula (4), where the $\alpha_i$ are non-negative real numbers and $\varphi$ is the mapping function:

$$W^T \varphi(X) + b = \sum_{i=1}^{N} \alpha_i Y_i \, \varphi(X_i)^T \varphi(X) + b \tag{4}$$

Only the $\varphi(X_i)$ with $\alpha_i > 0$ are used, and these points are the support vectors. The support vectors lie close to the separating hyperplane. Here $0 < \alpha_i < C$, where $C$ is the penalty parameter of the error term. If $\alpha_i$ is zero, the sample has no influence on the hyperplane.

2.9 Multi-Task Learning (MTL) method
The principal goal of multi-task learning [1] is to improve the performance of a classifier. The multi-task learning technique can be considered as an inductive transfer mechanism, where the inductive transfer leverages additional sources of information to improve learning performance within a current task. Variables which were not used as the initial inputs may contain some useful information. Instead of discarding these variables, MTL obtains the inductive transfer benefit from them by using them as an extra output.
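The relabelling of Section 2.7 and the decision function of formula (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the kernel is linear (the mapping function is the identity), the label +1 is treated as class 1 and -1 as class 2, and a sample counts as a support vector sample when its multiplier is greater than zero. The function names and the toy values are hypothetical.

```python
def linear_decision_value(x, support_vectors, alphas, ys, b):
    """Right-hand side of formula (4) with phi(X) = X, i.e. a linear kernel (assumption)."""
    total = b
    for sv, alpha, y in zip(support_vectors, alphas, ys):
        total += alpha * y * sum(s * xi for s, xi in zip(sv, x))
    return total


def mtsvsl_class(y, alpha):
    """Map (class label, multiplier) to the four MTSVSL classes of Section 2.7.

    Assumption: +1 is class 1 and -1 is class 2; alpha > 0 marks a support vector sample.
    """
    is_support_vector = alpha > 0
    if y == +1:
        return 1 if is_support_vector else 3
    return 2 if is_support_vector else 4


# Toy example (hypothetical values): two support vectors and two ordinary samples.
ys = [+1, -1, +1, -1]
alphas = [0.5, 0.5, 0.0, 0.0]
four_class = [mtsvsl_class(y, a) for y, a in zip(ys, alphas)]

value = linear_decision_value(
    [2.0, 0.0],
    support_vectors=[[1.0, 1.0], [-1.0, -1.0]],
    alphas=[0.5, 0.5],
    ys=[+1, -1],
    b=0.0,
)
```

The four-class label carries both tasks at once: the main task is recovered from whether the label is odd or even, and the second task from whether it is at most 2.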
The Back Propagation Neural Network (BPNN) is modeled as the MTL network and learns the tasks.

2.10 Identified Problems
From our literature survey, it is identified that the performance of the Multi-Task Support Vector Sample Learning (MTSVSL) technique is improved as compared with Back Propagation Neural Networks. However, the learning technique of MTSVSL fails to select more interpretable genes and hence is unable to improve the classification accuracy further. That is, the wrapper of this system leads to poor gene classification. This is the major drawback. To overcome this problem, this paper aims to improve the performance of the wrapper.

3. ENHANCED MTSVSL
As stated in the previous section, the Multi-Task Support Vector Sample Learning (MTSVSL) technique has two methodologies, namely the Support Vector Sample (SVS) technique and the Multi-Task Learning (MTL) technique. These are combined to improve the classification accuracy on the gene expression data. The main objective of this work is to improve the performance of MTSVSL. Specifically, this work improves the performance of MTL by using a Counter Propagation Neural Network.

3.1 The Principle of Counter Propagation Neural Networks
The counter-propagation network is a combination of a portion of the Kohonen self-organizing map [10] and the Grossberg outstar structure [10]. During learning, pairs of the input vector $X$ and output vector $Y$ are presented to the input and interpolation layers, respectively. These vectors propagate through the network in a counterflow manner to yield the competition weight vectors and interpolation weight vectors. Once these weight vectors become stable, the learning process is completed. The output vector $Y'$ of the network corresponding to the input vector $X$ is then computed. The vector $Y'$ is intended to be an approximation of the output vector $Y$, i.e. $Y' \approx Y = f(X)$. The equations of the network are described briefly as follows.
Let $U_j = [u_{ji}]$ be the arbitrary initial competition weight vector for the j-th neuron in the competition layer, where $u_{ji}$ is the weight connecting the j-th neuron in the competition layer to the i-th neuron in the input layer. The Euclidean distance between the input vector $X$ and the competition weight vector $U_j$ of the j-th neuron is calculated, that is

$$d_j = \|X - U_j\| = \sqrt{\sum_{i=1}^{m} (x_i - u_{ji})^2} \tag{5}$$

with the sum taken over the $m$ input neurons. Once the distance $d_j$ for each neuron has been calculated, the neuron with the shortest Euclidean distance to $X$ is selected as the winning neuron. As a result of the competition, the output of the winning neuron is set to unity and the outputs of the other neurons are set to zero. Thus, the output of the j-th neuron in the competition layer can be expressed as

$$Z_j = \begin{cases} 1.0 & \text{if } d_j < d_i \text{ for all } i \neq j \\ 0.0 & \text{otherwise} \end{cases} \tag{6}$$

The weight $u_{ji}$ connecting the j-th neuron in the competition layer to the i-th neuron in the input layer is adjusted based on the Kohonen learning rule, that is

$$u_{ji}(p+1) = u_{ji}(p) + \beta \, (x_i - u_{ji}(p)) \, Z_j \tag{7}$$

where $\beta$ is the learning coefficient and $p$ is the iteration number. After the competition weight vector $U_j$ stabilizes, the interpolation layer starts to learn the desired output vector $Y$ by adjusting the interpolation weight vector. Let $V_k = [v_{kj}]$ be the arbitrary initial interpolation weight vector for the k-th neuron in the interpolation layer, where $v_{kj}$ is the weight connecting the k-th neuron in the interpolation layer to the j-th neuron in the competition layer. The weight $v_{kj}$ is adjusted based on the Grossberg learning rule, that is

$$v_{kj}(p+1) = v_{kj}(p) + \gamma \, (y_k - v_{kj}(p)) \, Z_j \tag{8}$$

where $\gamma$ is the learning coefficient. This is repeated until the interpolation weight vector $V_k$ converges to a preset value. The output vector $Y'$ of the network corresponding to the input vector $X$ can be calculated using a weighted summation function. The k-th component $y'_k$ of the output vector $Y'$ can be expressed as

$$y'_k = \sum_{j} v_{kj} Z_j \tag{9}$$

In the foregoing discussion, the counter-propagation network functions as a look-up table.
The learning process associates the input vector with the corresponding output vector based on two well-known algorithms, namely the Kohonen self-organizing map for finding the most similar training vector and the Grossberg outstar map for projecting the corresponding output vector. Once the network is trained, the application of an input vector can quickly produce the corresponding output vector. This is the enhanced MTL.
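The two learning phases of Equations (5) to (9) can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: it assumes the competition weights are initialised from the first training vectors, that each phase runs for a fixed number of epochs instead of testing for convergence, and that the learning coefficients are 0.3. The names `winner`, `train_cpnn` and `predict`, and the toy data, are hypothetical.

```python
import math


def winner(x, U):
    """Index of the competition neuron closest to x (Equations 5 and 6)."""
    distances = [math.dist(x, u) for u in U]
    return min(range(len(U)), key=distances.__getitem__)


def train_cpnn(X, Y, n_hidden, beta=0.3, gamma=0.3, epochs=50):
    # Assumption: initialise competition weights from the first n_hidden samples.
    U = [list(x) for x in X[:n_hidden]]
    # Phase 1: Kohonen rule (Equation 7); only the winning neuron is updated.
    for _ in range(epochs):
        for x in X:
            j = winner(x, U)
            U[j] = [u + beta * (xi - u) for u, xi in zip(U[j], x)]
    # Phase 2: Grossberg rule (Equation 8); V[k][j] links output k to neuron j.
    V = [[0.0] * n_hidden for _ in range(len(Y[0]))]
    for _ in range(epochs):
        for x, y in zip(X, Y):
            j = winner(x, U)
            for k, yk in enumerate(y):
                V[k][j] += gamma * (yk - V[k][j])
    return U, V


def predict(x, U, V):
    """Equation (9): with one-hot Z, the output is the winning neuron's column of V."""
    j = winner(x, U)
    return [V[k][j] for k in range(len(V))]


# Toy data: two well-separated clusters with one-hot targets (hypothetical).
X = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
Y = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
U, V = train_cpnn(X, Y, n_hidden=2)
out_a = predict([0.2, 0.5], U, V)
out_b = predict([9.0, 10.0], U, V)
```

Because only the winning neuron contributes to the output, prediction reduces to a nearest-prototype look-up followed by reading the stored output vector, which is the look-up-table behaviour described above.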
4. EXPERIMENTAL RESULTS AND DISCUSSIONS
We have developed an MTSVSL tool with NetBeans, configured with BioWeka 0.6.1. As shown in Figure 2, it consists of two modules. The first module is a filtering module, where GAGS is implemented. In this module, the chromosome size can be fixed. The second module is the wrapper module, where MTSVSL and EMTSVSL have been implemented. As shown in Figure 3, SVM with BPNN classifies the cancer pattern from the dataset.

Figure 2. GAGS based MTSVSL Tool with BioWeka 0.6.1

Figure 3. MTSVSL SVM Classification

For the experimental study, Leukaemia cancer pattern datasets are considered, and the number of top genes is taken as 100 and 150. The confusion matrices and their accuracy and error rate are shown in Figures 4 to 7.

Figure 4. Confusion Matrix of MTSVSL for Leukaemia Cancer pattern (Top Genes: 100)

Figure 5. Confusion Matrix of EMTSVSL for Leukaemia Cancer pattern (Top Genes: 100)

Figure 6. Confusion Matrix of MTSVSL for Leukaemia Cancer pattern (Top Genes: 150)

From Figure 4, it is noted that the existing GAGS based MTSVSL obtained a classification accuracy of 93.8033% and an error rate of 0.06197 for the Leukaemia cancer pattern with 100 top genes. The proposed GAGS based EMTSVSL technique achieves a classification accuracy of 95.8443% and an error rate of 0.04156, as shown in Figure 5. The proposed work therefore performs better than the existing system. With 150 top genes, the same experiment is repeated, as shown in Figures 6 and 7, and the proposed work again outperforms GAGS based MTSVSL.
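For reference, accuracy and error rate follow directly from a confusion matrix: accuracy is the sum of the diagonal divided by the total count, and the error rate is one minus the accuracy. As a check, an accuracy of 93.8033% corresponds to an error rate of 1 - 0.938033 = 0.061967, which matches the 0.06197 reported for MTSVSL. The short Python sketch below uses a hypothetical 2x2 matrix, not the values in the figures.

```python
def accuracy_and_error(confusion):
    """Return (accuracy, error rate) for a square confusion matrix.

    Rows are true classes and columns are predicted classes.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    accuracy = correct / total
    return accuracy, 1.0 - accuracy


# Hypothetical counts for illustration only.
matrix = [
    [45, 5],   # true class 1: 45 correct, 5 misclassified
    [3, 47],   # true class 2: 3 misclassified, 47 correct
]
accuracy, error_rate = accuracy_and_error(matrix)
```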
Figure 7. Confusion Matrix of EMTSVSL for Leukaemia Cancer pattern (Top Genes: 150)

5. CONCLUSION
Microarray technologies have had an enormous impact on cancer genome research. To predict the cancer class, GAGS is used to select target genes in the diagnosis of cancer, and the MTSVSL learning technique based on a Back Propagation Neural Network and a Linear Support Vector Machine was implemented for classification. To improve its classification accuracy, this paper proposed an efficient Enhanced MTSVSL (EMTSVSL). From the experimental results, it is established that the proposed technique achieves higher classification accuracy with a lower error rate as compared with the existing MTSVSL technique. For the experimental study, the Leukaemia cancer pattern is used.

REFERENCES
[1]. Austin H. Chen and Jen-Chieh Hsu, Exploring novel algorithms for the prediction of cancer classification, International Conference on Software Engineering and Data Mining (SEDM), ISBN: 978-1-4244-7324-3, pp. 378-383, 2010.
[2]. Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S, A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis, Bioinformatics, 2005, vol. 21, pp. 631-643.
[3]. Ramaswamy S. et al., Multiclass cancer diagnosis using tumour gene expression signatures, Proc. Natl Acad. Sci. USA 98, 2001, pp. 15149-15154.
[4]. Greer BT, Khan J, Diagnostic classification of cancer using DNA microarrays and artificial intelligence, Ann N Y Acad Sci, 2004, vol. 1020, pp. 49-66.
[5]. Ramirez L, Durdle NG, Raso VJ, Hill DL, A support vector machines classifier to assess the severity of idiopathic scoliosis from surface topology, IEEE Trans. Inf. Technol. Biomed., 2006, 10, no. 1, pp. 84-91, Jan. 2005.
[6]. Y. Wang, I.V. Tetko, M.A. Hall, E. Frank, A. Facius, K.F.X. Mayer, and H.W. Mewes, Gene Selection from Microarray Data for Cancer Classification: A Machine Learning Approach, Computational Biology and Chemistry, vol. 29, no. 1, pp. 37-46, 2005.
[7]. Rhodes et al., Oncomine 3.0: Genes, Pathways, and Networks in a Collection of 18,000 Cancer Gene Expression Profiles, Neoplasia, vol. 9, no. 2, pp. 166-180, 2007.
[8]. Yukyee Leung and Yeungsam Hung, A Multiple Filter Multiple Wrapper to gene selection and microarray data classification, IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 1, January-March 2010.
[9]. Minghao Piao, Jong Bum Lee, Khalid E.K. Saeed, and Keun Ho Ryu, Discovery of significant classification rules from incrementally inducted decision tree ensemble for diagnosis of disease, 2009.
[10]. S.C. Juang, Y.S. Tarng, and H.R. Lii, A comparison between the back-propagation and counter-propagation networks in the modeling of the TIG welding process, Journal of Materials Processing Technology, pp. 54-63, 1998.