Cancer Gene Extraction Based on Stepwise Regression


Mathematical Computation, Volume 5, 2016, pp. 6-10

Jie Ni 1, Fan Wu 1, Meixiang Jin 1, Yixing Bai 1, Yunfei Guo 1
1. Mathematics Department, Yanbian University, China
guoyunfei0413@sina.com

Abstract

As gene expression profile databases expand, extracting genes while losing as little information as possible, or while retaining the most critical information, has become a main research direction. This paper first excludes 1561 irrelevant genes by means of a weighted distance, and then removes 252 redundant genes using Pearson's correlation coefficient. Finally, by comparing two methods, stepwise regression after clustering and direct stepwise regression, we obtain the best combination of 8 genes.

Keywords: stepwise regression, cluster analysis, gene extraction

1 INTRODUCTION

Golub studied two subtypes of leukemia, using the "signal-to-noise ratio" to evaluate how well each gene separates the classes. Fifty genes were selected as features and classified by weighted voting, with good results. Ramaswamy combined SVM with the RFE feature selection method to classify samples from 14 tumor types. Li Yingxin used the sequential floating search algorithm to search for a feature subset and obtained 29 feature genes. Academician Wang Shoujue of the Institute of Semiconductors, Chinese Academy of Sciences, observed that the first two feature gene extraction methods do not consider the correlation between the selected genes, which affects the classification results to some extent, and proposed the ideal gene model to correct this deficiency. This paper combines the score from the ideal gene model with the variance, which reflects the amount of information a gene carries, to define a weighted distance. It then applies the Pearson correlation coefficient and finally obtains 97 informative genes.
We apply the stepwise regression method for variable selection to these 97 informative genes in different settings, and finally obtain the ideal gene combination.

2 ELIMINATION OF IRRELEVANT GENES AND REDUNDANT GENES

2.1 Elimination of Irrelevant Genes

First, after deleting the repeated items in the gene data, we obtained 1910 mutually different genes. The remaining 1910 gene expression profiles can be expressed by a matrix $X = [m_{ij}]$, where $m_{ij}$ is the expression level of the $i$-th gene in the $j$-th sample. With $n$ experimental samples and $p$ genes that affect the cancer, the gene expression profile matrix $X$ is an $n \times p$ matrix. (The data are from Problem A of the 2010 National Graduate Mathematical Modeling Contest.)

To make the genes comparable, we first normalize the gene data as follows:

$$m'_{ij} = \frac{2 m_{ij} - m_{\max} - m_{\min}}{m_{\max} - m_{\min}}, \quad i = 1, \dots, 1910, \quad j = 1, \dots, 62$$

where $m_{\max}$ and $m_{\min}$ are the largest and the smallest elements of the expression profile matrix $X$, and $m'_{ij}$ is the normalized value of $m_{ij}$.

The data set contains 62 samples with 2000 genes in each sample. The first 22 samples are from normal people and the remaining 40 samples are from cancer patients. We put the two groups of samples together and denote them by $y_1, y_2, \dots, y_{62}$. The expression levels of gene $v$ across the samples are regarded as a $1 \times 62$ vector, denoted by $M_v = \{v_1, v_2, \dots, v_{62}\}$.

1) Defining an ideal gene:

$$s_1 = s_2 = \dots = s_{22} = 1, \quad s_{23} = s_{24} = \dots = s_{62} = -1$$

If a gene plays a role in cancer, it can either promote or inhibit it. So the more cancer information a gene carries, the smaller the (acute) angle between it and the ideal gene, and the closer the absolute value of the cosine is to 1.

2) Defining the angle between the gene $v$ and the ideal gene $s$:

$$\cos\theta = \frac{M_v \cdot M_s}{\|M_v\| \, \|M_s\|}, \quad \|M_v\| = \sqrt{\sum_{i=1}^{62} v_i^2}, \quad \|M_s\| = \sqrt{\sum_{i=1}^{62} s_i^2}$$

3) Defining the distance between the gene $v$ and the ideal gene $s$:

$$d = \begin{cases} \sqrt{\sum_{i=1}^{62} (v_i - e_i)^2}, & \cos\theta \ge 0 \\ \sqrt{\sum_{i=1}^{62} (v_i + e_i)^2}, & \cos\theta < 0 \end{cases}$$

4) Defining the similarity $S$ between the gene $v$ and the ideal gene $s$ in terms of $\cos\theta$ and $d$.

5) Calculating the standard deviation of each gene, $\mathrm{STD}_v$.

6) Defining the weighted distance:

$$D = 0.5\,S + 0.5\,\mathrm{STD}_v$$

With this formula we can calculate the weighted distance between each of the 1910 genes and the ideal gene. Based on the scores $S$ and $\mathrm{STD}_v$, we define a threshold on the weighted distance $D_v$ of each gene that separates the set of informative genes $D_I$ from the set of irrelevant genes $D_N$. According to this definition, and as the following table shows, with a threshold of 0.11 we have $\mathrm{card}(D_I) = 349$; that is, of the 1910 genes, 349 are informative and 1561 are irrelevant. The 349 informative genes have different degrees of correlation with the ideal gene, and they are the basis of the further analysis.

TABLE 1 DISTRIBUTION OF GENE AND IDEAL GENE
Columns: distance, count, proportion (%)

2.2 Elimination of Redundant Genes

In the previous section, we removed the irrelevant genes based on the relationship between genes and cancer.
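To make the scoring steps of Section 2.1 concrete, the following is a minimal Python sketch of the normalization, the cosine with the ideal gene, and the sign-dependent distance. It is an illustration under stated assumptions, not the authors' code: the function names (`normalize`, `ideal_gene`, `cosine`, `signed_distance`) and the toy data are hypothetical, and the sign convention (+1 for normal samples, -1 for cancer samples) and the square root in the distance are assumed.

```python
import math

def normalize(matrix):
    """Rescale every entry to [-1, 1] using the global minimum and maximum."""
    flat = [x for row in matrix for x in row]
    m_min, m_max = min(flat), max(flat)
    span = m_max - m_min
    if span == 0:
        raise ValueError("all expression values are identical")
    return [[(2 * x - m_max - m_min) / span for x in row] for row in matrix]

def ideal_gene(n_normal, n_cancer):
    """Assumed ideal gene: +1 on normal samples, -1 on cancer samples."""
    return [1.0] * n_normal + [-1.0] * n_cancer

def cosine(v, s):
    """Cosine of the angle between expression vector v and ideal gene s."""
    dot = sum(a * b for a, b in zip(v, s))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_s = math.sqrt(sum(b * b for b in s))
    return dot / (norm_v * norm_s)  # undefined for an all-zero vector

def signed_distance(v, s):
    """Euclidean distance to s if cos >= 0, otherwise to the mirrored gene -s."""
    sign = 1.0 if cosine(v, s) >= 0 else -1.0
    return math.sqrt(sum((a - sign * b) ** 2 for a, b in zip(v, s)))

# Toy data (hypothetical): 3 genes x 5 samples, 2 normal followed by 3 cancer.
raw = [
    [9, 8, 1, 2, 1],   # high in normal samples
    [1, 2, 9, 8, 9],   # high in cancer samples
    [5, 4, 5, 6, 5],   # nearly flat across samples
]
genes = normalize(raw)
s = ideal_gene(2, 3)
cosines = [cosine(v, s) for v in genes]
distances = [signed_distance(v, s) for v in genes]
```

In this toy data the first two genes are mirror images of each other, so their cosines have opposite signs while their distances coincide, which is the point of switching to $v_i + e_i$ when the cosine is negative: a gene that inhibits is scored like one that promotes.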

The expression levels of different genes are nevertheless correlated to some degree, so we next exclude redundant genes.

Defining the Pearson correlation coefficient:

$$\mathrm{corrcoef}(v_i, v_j) = \frac{\sum_{k=1}^{62} (m_{v_i k} - \bar{m}_{v_i})(m_{v_j k} - \bar{m}_{v_j})}{\sqrt{\sum_{k=1}^{62} (m_{v_i k} - \bar{m}_{v_i})^2} \, \sqrt{\sum_{k=1}^{62} (m_{v_j k} - \bar{m}_{v_j})^2}}$$

where $m_{v_i k}$ and $m_{v_j k}$ are the expression values of genes $v_i$ and $v_j$ in the $k$-th sample of matrix $X$, and $\bar{m}_{v_i}$ and $\bar{m}_{v_j}$ are their mean expression values over the samples. This paper uses a threshold of 0.7, which further reduces the number of informative genes to 97.

TABLE 2 THE NUMBER OF CLASSIFICATION FEATURE GENES UNDER DIFFERENT THRESHOLDS
Columns: threshold, number of genes

3 VARIABLE SELECTION BASED ON STEPWISE REGRESSION AFTER CLUSTERING

After the two steps of removing irrelevant genes and redundant genes, 97 informative genes remain, which is still a large number. However, most of these genes carry some cancer information and are only weakly correlated with one another, so we next use a stepwise regression method to select variables.

TABLE 3 GENES SELECTED FROM EACH CLUSTER
Cluster  Gene
1 M26383 M31516 R54097 X02492 X52228 T T47377 X02761 X M63391 U R none

FIG. 1 GROUP 1: 12 GENES (LEFT); GROUP 2: 5 GENES (RIGHT)
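The redundancy filter of Section 2.2 can be sketched as follows. The paper states only the Pearson coefficient and the 0.7 threshold, so the greedy rule used here is an assumption: genes are visited in priority order and a gene is dropped when its absolute correlation with any already kept gene exceeds the threshold. The function names and the toy profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)  # undefined for a constant profile

def drop_redundant(profiles, threshold=0.7):
    """Keep a gene only if |r| <= threshold against every gene kept so far.

    profiles maps gene name -> expression values, in priority order
    (assumed here: most informative gene first).
    """
    kept = []
    for name, values in profiles.items():
        if all(abs(pearson(values, profiles[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Toy profiles (hypothetical names and values).
profiles = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 6, 8, 11],   # almost a multiple of g1, so redundant
    "g3": [3, 1, 4, 1, 5],    # only weakly correlated with g1
}
kept = drop_redundant(profiles)
```

With these profiles `g2` is discarded because it nearly duplicates `g1`, while `g3` survives. The outcome of a greedy filter depends on the visiting order, so ordering the genes by their weighted distance first is one natural choice, though the paper does not say which order was used.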

First, we apply cluster analysis (with the number of clusters set to 5) and then assign the response values: the first 22 normal samples take the value 1 and the following 40 cancer samples take the value 0. After clustering, we run stepwise regression within each cluster separately (the significance levels for a variable to enter and to stay in the model are sle = 0.05 and sls, respectively) to screen out the variables that meet the requirements. The 12 selected genes form one group, and the genes with the lowest significance level in each cluster form a second group. The samples are divided into a training set (12+10) and a test set (10+20), and both groups of genes are evaluated by discriminant analysis; the posterior probabilities are shown in Figure 1.

Figure 1 shows that the first group has seven misjudged samples and the second group has six. So after clustering, choosing one gene from each cluster is adequate; selecting more genes from a cluster also increases the misjudgment rate.

4 VARIABLE SELECTION BASED ON DIRECT STEPWISE REGRESSION

FIG. 2 VARIABLE SELECTION RESULTS

TABLE 4 X CORRESPONDING GENE
Variable  Gene
x2        T47377
x22       M63391
x27       X12671
x36       R99907
x47       X02492
x51       H20503
x56       M59807

FIG. 3 POSTERIOR PROBABILITY
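As a rough illustration of the stepwise selection used in Sections 3 and 4, the sketch below alternates a forward step and a backward step on an ordinary least squares model. It is not the authors' procedure: the paper works with significance levels (sle, sls), whereas this sketch substitutes fixed partial F thresholds (`f_enter`, `f_remove`) so that it runs with the standard library alone. All names, threshold values, and data are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def rss(predictors, y):
    """Residual sum of squares of least squares on an intercept plus predictors."""
    rows = [[1.0] + [p[i] for p in predictors] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)  # fails on collinear predictors
    fitted = [sum(bj * rj for bj, rj in zip(beta, r)) for r in rows]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def partial_f(rss_small, rss_big, n, k_big):
    """Partial F for the one predictor separating two nested models."""
    return (rss_small - rss_big) / (rss_big / (n - k_big - 1))

def stepwise(X, y, f_enter=4.0, f_remove=3.9):
    """Return the names selected by alternating forward and backward steps."""
    n, selected = len(y), []
    for _ in range(2 * len(X)):  # hard cap on the number of rounds
        changed = False
        # Forward step: add the candidate with the largest partial F above f_enter.
        base = rss([X[s] for s in selected], y)
        best, best_f = None, f_enter
        for name in X:
            if name not in selected:
                big = rss([X[s] for s in selected] + [X[name]], y)
                f = partial_f(base, big, n, len(selected) + 1)
                if f > best_f:
                    best, best_f = name, f
        if best is not None:
            selected.append(best)
            changed = True
        # Backward step: drop the selected variable with the smallest partial F below f_remove.
        full = rss([X[s] for s in selected], y)
        worst, worst_f = None, f_remove
        for name in selected:
            rest = [X[s] for s in selected if s != name]
            f = partial_f(rss(rest, y), full, n, len(selected))
            if f < worst_f:
                worst, worst_f = name, f
        if worst is not None:
            selected.remove(worst)
            changed = True
        if not changed:
            break
    return selected

# Toy data (hypothetical): response 1 for "normal", 0 for "cancer".
y = [1, 1, 1, 1, 0, 0, 0, 0]
X = {
    "x1": [0.9, 1.1, 0.8, 1.2, 0.1, -0.1, 0.2, -0.2],  # tracks the response
    "x2": [1, -1, 1, -1, 1, -1, 1, -1],                # unrelated pattern
    "x3": [1, 1, -1, -1, 1, 1, -1, -1],                # unrelated pattern
}
selected = stepwise(X, y)
```

Here only `x1` enters the model, since the other two columns are orthogonal to both the response and `x1` and therefore contribute a partial F of essentially zero. Replacing the F thresholds with p-values from the F distribution would bring the sketch closer to the sle/sls formulation, at the cost of needing a statistics library.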


We assign the response values directly: the first 22 normal samples take the value 1 and the following 40 cancer samples take the value 0. We then run stepwise regression on the 97 variables (with sle = 0.05 and sls as the significance levels to enter and to stay in the model). The samples are divided into a training set (12+10) and a test set (10+20), the six variables selected above are used for discriminant analysis, and the following results are obtained. As shown in Fig. 3, 4 of the 30 test samples are misjudged, a misjudgment rate of 13.3%, which is acceptable, so we get the final gene combination: M63391 X12671 R99907 X02492 H20503 T47377 M

5 CONCLUSIONS

By applying the weighted distance and the correlation coefficient, we delete 1813 irrelevant and redundant genes, greatly reducing the number of genes and the dimension of the analysis. The remaining 97 genes have low correlation with each other, and we apply two different stepwise regression procedures to them, which give different results. From the analysis in Sections 3 and 4 we find that, in Section 3, selecting one gene from each cluster gives the better misjudgment rate, and that direct stepwise regression gives a lower misjudgment rate than stepwise regression after clustering. So we can draw a conclusion: running stepwise regression within separate clusters may seem more reasonable than running it directly, but it ignores the fact that clustering destroys the structure among the original variables, which leads to unsatisfactory results.

REFERENCES

[1] Wang Shou-jue, Zhou Ling-fei. Gene Selection for Gene Expression Data Analysis. Control & Automation, Vol. 24, 2008.
[2] Li Ying-xin, Ruan Xiao-gang. Cancer Subtype Recognition and Feature Selection with Gene Expression Profiles. Acta Electronica Sinica, Vol. 33, 2005.
[3] Golub, T.R., Slonim, D.K., Tamayo, P., et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring [J]. Science, 1999, 286.
[4] Ramaswamy, S., Tamayo, P., et al. Multiclass cancer diagnosis using tumour gene expression signatures [J]. PNAS, 2001, 98.
[5] Wang Shou-jue. Direction-Basis-Function neural networks. IJCNN 99, 1999.
[6] Kan Haijun, Tang Jun, Su Liangliang. A method for informative gene selection using neighborhood uncertainty and scoring criteria. Journal of Anhui University (Natural Science Edition), Vol. 38.
[7] Wang Jingqi, Xu Linli. Semi-supervised feature selection and clustering for multi-view data [J]. Journal of Data Acquisition and Processing, 2015, 30(1).
[8] Xu Jiucheng, Xu Tianhe, Sun Lin, et al. Feature selection for cancer classification based on neighborhood rough set and particle swarm optimization [J]. Journal of Chinese Computer Systems, 2014, 35(11).
[9] Xu Jiuchen, Li Tao, Sun Lin, Li Yuhui. Feature Gene Selection Based on SNR and Neighborhood Rough Set. Journal of Data Acquisition and Processing, Vol. 30, No. 5, Sep. 2015.
[10] Ramaswamy, S., Golub, T.R. DNA Microarrays in Clinical Oncology [J]. Journal of Clinical Oncology, 2002, 20(7).
[11] Guyon, I., Weston, J., Barnhill, S., et al. Gene selection for cancer classification using support vector machines. Machine Learning, 46(13), 2000.

AUTHORS

1 Jie Ni was born on October 21st, 1995 in Anhui province. She is a junior student at Yanbian University, majoring in statistics.
2 Yunfei Guo was born on April 13th, 1983 in Jilin province and received his M.S. degree from Yanbian University, China. He is a lecturer at Yanbian University. His research interests are reliability and statistical analysis.
