Simple Discriminant Functions Identify Small Sets of Genes that Distinguish Cancer Phenotype from Normal


Genome Informatics 16(1) (2005)

Gul S. Dalgin (1), sdalgin@bu.edu
Charles DeLisi (2,3), delisi@bu.edu

1 Molecular Biology, Cell Biology and Biochemistry Program, Boston University, Boston, MA 02215, USA
2 Department of Biomedical Engineering, Boston University, Boston, MA 02215, USA
3 Bioinformatics Graduate Program, Boston University, Boston, MA 02215, USA

Abstract

High-throughput gene expression profiling can identify sets of genes that are differentially expressed between different phenotypes. Discovering marker genes is particularly important in the diagnosis of a cancer phenotype. However, gene sets produced to date are too large to be economically viable diagnostics. We use a hybrid decision tree-discriminant analysis to identify small sets of genes, i.e. single genes and gene pairs, which separate normal samples from different stages of tumor samples. Half the samples are selected for training to form the probability distribution of expression values of each gene. The distributions for the tumor and normal phenotypes are then used to classify the test samples. The algorithm also identifies gene pairs by combining the probability distributions to construct a decision tree, which is used to determine the class of test samples. After a series of training and testing sessions, genes and gene pairs that classify all samples correctly are recorded. The method was applied to a breast cancer data set, and classifier genes that distinguish normal breast from different stages of breast tumor were identified. The genes were ranked according to the minimum Euclidean distance between their expression values in tumor and normal samples. The algorithm picked known cancer-related genes and also found genes that were not identified as differentially expressed by a t-test with a 2-fold cut-off. Overall, the method generates possible diagnostic genes and gene pairs for a specific disease phenotype, which can be pursued for further biological interpretation in cancer biology.

Keywords: discriminant analysis, gene expression, cancer, diagnostic genes

1 Introduction

High-throughput gene expression profiling using microarray technology has emerged as a promising technology for correlating gene expression with environmental conditions. Methods are available for allocating samples into pre-specified phenotypic groups based on differences in gene expression profiles, or for segregating samples into groups without prior specification [2, 9]. When groups are pre-specified, the aim is typically to identify differentially expressed diagnostic gene sets. Sets of over- or underexpressed genes that stratify closely related diseases have been successfully identified in ALL-AML classification [4], ovarian cancer and normal tissue [3], BRCA1, BRCA2, and sporadic breast tumor classification [5], and poor prognosis and good prognosis breast cancer samples [10]. The main problem is that the sets of differentially expressed genes produced so far are too large to be used as feasible diagnostics. In this paper, we present a hybrid decision tree-discriminant analysis to identify small sets of genes whose joint expression distribution separates two pre-defined classes. The method generates probability distributions from the fraction of samples in the two classes, and exploits them to select genes that classify all samples accurately after a series of training and test sessions.

Herein, we applied the methodology to breast cancer data generated by Ma and colleagues [6]. Single genes and gene pairs whose joint expression distribution separates tissue samples in different stages and grades of malignancy from normal tissue were identified as candidate diagnostic genes. Overall, the results suggest that this new discriminant analysis efficiently identifies small gene sets that distinguish phenotypes.

2 Method and Results

Data

The method was applied to the breast cancer gene expression data produced by Ma and colleagues [6]. The data are described in more detail elsewhere (Dalgin et al., manuscript in preparation). The samples include normal breast tissues from breast cancer patients and three stages of breast tumor: premalignant stage (ADH), in situ cancer (DCIS) and invasive cancer (IDC), with different grades (Grade I, slow growing tumor; Grade II, intermediate; Grade III, fast growing tumor). Overall, the current analysis used the publicly available data for 32 normal samples, 8 ADH, 9 DCIS Grade I, 11 DCIS Grade II, 10 DCIS Grade III, 5 IDC Grade I, 9 IDC Grade II and 9 IDC Grade III samples, and for the 1940 genes that were found to be differentially expressed between normal and the three stages (ADH, DCIS and IDC) by linear discriminant analysis [6]. The gene expression level (E) of each gene was reported as the log ratio of the expression level in the experimental sample to the expression level in the reference sample, $E = \log_2(\text{sample}/\text{reference sample})$. A human universal reference RNA from Stratagene was used as the reference sample [6].

Method

The method consists of three steps:
(1) dividing the samples into training and test sets;
(2) generating probability distributions for identifying single genes and for decision analysis when pairs are used;
(3) assigning test samples, and selecting the genes and gene pairs that perform well.
An overview of the method is given in Figure 1.

Figure 1: Overview of the method.

In the first step, the samples are divided into training and test samples in each partition. The method employs a cross-validation technique by which the samples are separated into training and test sets randomly (in the first partitioning) or semi-randomly (in the second and third partitionings) (see Supplementary Figure). This technique ensures that all samples are used at least once in training, but it still has usage bias. The first partitioning is always random irrespective of the other partitionings, whereas the second and third partitionings are semi-random to guarantee good coverage of the samples. Overall, 99 partitions are performed.

In the second step, the probability distributions of the expression values (E) of a gene in each of the two training classes are generated. These are used to classify the test samples. The distributions of the endothelin 3 gene expression levels for tumor (T) and normal (N) samples are shown in Figure 2 as an example.

Figure 2: Distribution of expression values (E) of the endothelin 3 gene in normal and tumor samples. Expression values are divided into intervals. The interval boundaries are shown near the y-axis. The order of the samples is arbitrary and the sample numbers have no special importance.

The expression values are divided into intervals to generate the probability distributions. The number of intervals is chosen such that the values are discretized into intervals that are neither very small nor very large. As an example, 32 normal and 8 ADH expression values were divided into 10 intervals. $P(E|N)$, the probability of an expression value in normal samples, and $P(E|T)$, the probability of an expression value in tumor samples, are calculated from the fraction of normal and tumor samples, respectively, in the interval $(E, E + dE)$. The probability distribution for the endothelin 3 gene is shown in Figure 3.
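As a rough illustration of this step, the sketch below bins one gene's training values into intervals and reports the fraction of each class that falls in each interval. The equal-width boundaries over the pooled range, the function names and the toy expression values are assumptions made for the example; the text does not state how the interval boundaries were chosen.

```python
from typing import List


def interval_edges(values: List[float], n_bins: int) -> List[float]:
    """Equal-width interval boundaries over the range of `values` (an assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all expression values are identical; cannot form intervals")
    width = (hi - lo) / n_bins
    return [lo + k * width for k in range(n_bins + 1)]


def bin_index(e: float, edges: List[float]) -> int:
    """Index of the interval containing e; values at or beyond the ends are clamped."""
    n_bins = len(edges) - 1
    width = (edges[-1] - edges[0]) / n_bins
    k = int((e - edges[0]) // width)
    return min(max(k, 0), n_bins - 1)


def class_distribution(values: List[float], edges: List[float]) -> List[float]:
    """Fraction of one class's training samples in each interval, i.e. P(E | class)."""
    counts = [0] * (len(edges) - 1)
    for e in values:
        counts[bin_index(e, edges)] += 1
    return [c / len(values) for c in counts]


# Toy log2 expression ratios for a single hypothetical gene.
normal = [2.0, 1.6, 1.2, 0.8]
tumor = [0.0, 0.2, 0.6, 0.9]

edges = interval_edges(normal + tumor, n_bins=4)
p_e_given_n = class_distribution(normal, edges)
p_e_given_t = class_distribution(tumor, edges)

print(edges)
print(p_e_given_n)
print(p_e_given_t)
```

In this toy case the two highest intervals contain only normal samples and the lowest contains only tumor samples, while the second interval is shared, which mirrors the incomplete separation described for endothelin 3.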

Figure 3: Probability distribution for the endothelin 3 gene. The lower (at the bottom) and upper (at the top) boundary values of each interval are shown on the x-axis. $P(E|N)$ and $P(E|T)$ are calculated from the fractions of normal and tumor samples, respectively, in an interval.

It is evident that, for this particular case, expression levels of endothelin 3 above 1.28 occur only in the normal group, and expression levels below 0.06 occur only in the tumor group. However, since separation is incomplete, this gene by itself is not a good candidate to use as a signature. We therefore ask whether a second gene can be found which, in combination with endothelin 3, gives perfect separation of the training set. To limit the search, we use as the first gene in the pair (endothelin 3 in this example) only genes that misclassify less than 10% of the total training samples. The pairs (or singlets, when the first gene separates perfectly) thus obtained are then evaluated on the test set. Samples in the test set are assigned to the normal category if $P(N|E) > P(T|E)$, where $E = (E_1, E_2)$, and to tumor otherwise, where the posteriors are given by Bayes' rule (Figure 1, step 3). The pairs that correctly classify all test samples are recorded as perfect pairs after each partition.

Table 1: Number of single genes and gene pairs identified for each normal and tumor stage comparison.

Comparison         Number of single classifier genes    Number of pairs (genes involved)
Normal-ADH                                              (336 genes)
Normal-DCIS I                                           (502 genes)
Normal-DCIS II                                          (455 genes)
Normal-DCIS III                                         (670 genes)
Normal-IDC I                                            (564 genes)
Normal-IDC II                                           (649 genes)
Normal-IDC III                                          (743 genes)

DCIS Grade I is abbreviated as DCIS I. The number of pairs counts pairs that appear in at least 10 partitions.
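The assignment rule for a gene pair can be sketched as follows. This is a minimal sketch under two assumptions that the text does not confirm: the joint likelihood of $E = (E_1, E_2)$ is taken as the product of the two genes' interval probabilities, and the class priors are the class fractions in the training set. The function name and the toy distributions are hypothetical.

```python
from typing import List, Sequence


def classify(bins: Sequence[int],
             p_given_n: List[List[float]],
             p_given_t: List[List[float]],
             prior_n: float) -> str:
    """Assign 'normal' if P(N|E) > P(T|E), and 'tumor' otherwise.

    bins[g] is the interval index of the sample's expression value for gene g.
    The posteriors share the denominator P(E), so comparing
    prior * likelihood for each class is enough.
    """
    score_n = prior_n
    score_t = 1.0 - prior_n
    for gene, k in enumerate(bins):
        score_n *= p_given_n[gene][k]  # assumed independence between genes
        score_t *= p_given_t[gene][k]
    return "normal" if score_n > score_t else "tumor"


# Toy interval distributions: gene 1 has four intervals, gene 2 has three.
p_given_n = [[0.0, 0.25, 0.25, 0.5], [0.75, 0.25, 0.0]]
p_given_t = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.25, 0.75]]
prior_n = 0.8  # e.g. 16 normal and 4 tumor samples in a training set

print(classify((1, 1), p_given_n, p_given_t, prior_n))
print(classify((1, 2), p_given_n, p_given_t, prior_n))
print(classify((3, 0), p_given_n, p_given_t, prior_n))
```

Gene 1 alone is ambiguous in its second interval, where both classes have non-zero probability, but the second gene's interval decides the sample marked `(1, 2)`, which is the role the second gene of a pair plays in the text.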

The classifier genes that distinguish normal from each of the 7 breast tumor stages were identified after performing 99 partitions for each case. Single genes and gene pairs that correctly separate the samples in at least 1 partition were recorded for each comparison. The results are summarized in Table 1.

In order to determine how well the genes distinguish the two groups, genes and gene pairs were ranked based on a distance measure, which uses the overall expression value distribution. For a single classifier gene, the Euclidean distance between the tumor and normal samples was calculated. The rank of gene i is determined by this distance ($d_i$):

\[ d_i = \sqrt{(E_{T,i} - E_{N,i})^2} \]

where $E_{T,i}$ and $E_{N,i}$ are the expression values of gene i in the tumor and normal sample, respectively. The rank of the gene is inversely proportional to this distance; the larger the distance, the better the gene as a classifier. In order to assess whether the Euclidean distance is a distinguishing feature of the classifier genes with respect to other genes, the distributions of Euclidean distances for single classifier genes and for other genes were compared. An example histogram is shown in Figure 4 for single classifier genes that distinguish normal samples from DCIS Grade III samples. In this case, it is clear that the distances of single classifier genes are higher than those of non-classifier genes; hence they have a better separation between their expression values in normal and tumor samples. This observation is valid for other classifier genes as well (data not shown). This also suggests that Euclidean distance can be used to distinguish and rank classifier genes. That is to say, when the genes are to be tested on an independent data set, the genes that are top ranked in terms of their distance are expected to perform better than the others.

Figure 4: Histogram of the Euclidean distance calculated for normal-DCIS Grade III single classifier genes and the rest of the genes. Euclidean distance is calculated between the expression values of genes in normal and tumor samples.
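A small sketch of the single-gene ranking is given below. It assumes, following the abstract's reference to a minimum Euclidean distance, that the distance of a gene is the smallest distance over all tumor-normal sample pairs; the text does not spell out how the per-sample distances are aggregated. The gene names and values are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def single_gene_distance(tumor: List[float], normal: List[float]) -> float:
    """Smallest sqrt((E_T - E_N)^2) over all tumor-normal sample pairs (assumed aggregation)."""
    return min(math.sqrt((t - n) ** 2) for t in tumor for n in normal)


def rank_genes(expr: Dict[str, Tuple[List[float], List[float]]]) -> List[str]:
    """Gene names ordered from largest to smallest distance (best classifier first)."""
    distances = {g: single_gene_distance(t, n) for g, (t, n) in expr.items()}
    return sorted(distances, key=distances.get, reverse=True)


# Hypothetical genes: (tumor values, normal values) as log2 ratios.
expr = {
    "geneA": ([0.0, 0.5], [2.0, 1.5]),
    "geneB": ([0.0, 1.0], [1.25, 2.0]),
}

ranking = rank_genes(expr)
print(ranking)
```

Using the minimum makes the score a worst-case margin: a gene ranks highly only if even its closest tumor and normal samples are well separated.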

Similarly, the gene pairs were ranked according to the minimum Euclidean distance between their expression values in tumor and normal samples. First, the Euclidean distance between the expression values of the pair (genes i and j) in a tumor sample and in each normal sample was calculated, and the minimum of this set was selected for that tumor sample:

\[ d\big((E_{T,i}, E_{T,j}), E_N\big) = \min_{N} \sqrt{(E_{T,i} - E_{N,i})^2 + (E_{T,j} - E_{N,j})^2 + (E_{T,i} - E_{N,j})^2 + (E_{T,j} - E_{N,i})^2} \]

After carrying out the procedure for all tumor samples, the minimum of this set of minima was selected as the minimum distance for the gene pair:

\[ d_{i,j} = \min_{T} \, d\big((E_{T,i}, E_{T,j}), E_N\big), \qquad T = 1, \ldots, N_T \]

where $N_T$ is the total number of tumor samples. The rank of the gene pair is inversely proportional to this minimum distance.

In order to compare our results with a conventional method, we performed t-tests on the same set of genes and the same classes defined in breast cancer, i.e., normal and 7 stages of breast tumor. The average fold change and the significance values were calculated for each gene for each normal-breast cancer stage comparison. The aim was (1) to see whether the method selects the same genes as the t-test or different ones, and (2) to evaluate the classifier genes in terms of quantitative measures such as average fold change. The percentages of single classifier genes that show a differential expression change with a p-value < 0.05 and an average fold ($E_{Tumor}/E_{Normal}$) > 2 are shown in Table 2.

Table 2: Average fold (tumor/normal) and p-values of single classifier genes obtained by t-test.
Comparison         Avg fold (T/N) > 2    p-value < 0.05
Normal-ADH         19.1 % (4/21)         71.4 % (15/21)
Normal-DCIS I      61.5 % (16/26)        84.6 % (22/26)
Normal-DCIS II     66.7 % (18/27)        100 % (27/27)
Normal-DCIS III    55.3 % (21/38)        94.7 % (36/38)
Normal-IDC I       75.0 % (42/56)        80.4 % (45/56)
Normal-IDC II      73.7 % (28/38)        89.5 % (34/38)
Normal-IDC III     55.6 % (25/45)        91.1 % (41/45)

The results show that a substantial portion of the classifier genes (from 26.3% for IDC Grade II to 80.9% for ADH) changed less than 2-fold in tumor; hence they would not be identified as significant by a t-test with a 2-fold cut-off. However, these genes have been identified as possible classifier genes by the current algorithm. The majority of the classifier genes have statistically significant p-values, which indicates that their expression change between the two classes is significant.

3 Discussion

3.1 Comparison of the Method with Related Methods

Several statistical methods have been successfully applied to find discriminatory genes between groups of samples in analyzing gene expression data. The method introduced in this paper is compared methodologically with two of the most frequently applied methods, the t-test and linear discriminant analysis.

Linear discriminant analysis (LDA) finds a linear subspace that maximizes class separability among the feature vector projections, where each gene is represented in the space by a vector of its expression values across the samples. A popular separability criterion is the ratio of between-class scatter to within-class scatter. LDA seeks directions that are efficient for discrimination, and it assumes that the class mean conveys most of the class information. Therefore, it cannot handle data sets that are only nonlinearly separable or classes with the same mean. Additionally, with a limited number of samples and a fairly large number of genes, the between-class and within-class separabilities can be quite unstable. The main difference between LDA and our hybrid discriminant analysis is that LDA finds the separation of the classes spatially, by representing the classes as vectors, whereas our algorithm separates the classes by a probabilistic approach. It takes into account the distribution of expression values in both classes and generates probability distributions from the fraction of the two classes in defined intervals. The probability distributions are then used to determine the class of an unknown sample in the case of single genes. The distributions of two genes are combined to construct a decision tree that assigns the class of an unknown sample by a gene pair. The algorithm selects the genes that correctly assign the class of all training and test samples after a good number of simulations; hence consistency across all samples is an emphasized criterion of the method.

The other method that has been applied to identify differentially expressed genes is the t-test, which tests the hypothesis that the means of two distributions of values are different. The main disadvantage of this approach in gene expression analysis is that it produces large gene sets, which are not viable as diagnostics. Moreover, in some cases, e.g. closely related diseases, changes in the expression of single genes are very modest or not significant at all [8].
Our method is advantageous in such cases since it takes into account the fraction of samples in the two classes no matter how similar or dissimilar the two class means or variances are. It selects not only single genes but also gene pairs which together partition the two classes even if the individual genes are not perfect classifiers alone.

The method is designed to select single genes and pairs to classify two groups; however, we considered extending it to identify triplets or larger groups of genes. The downsides of this are that (1) the execution time increases substantially, since the search space, i.e. the number of triplets, is much bigger than in the case of pairs, and (2) a high number of triplets are identified for each classification, which makes them hard to evaluate, rank and select for further testing on another data set.

The testing methodology used here differs from the standard jackknife technique, which constructs the training set by leaving out a normal and a tumor sample, and then tests the genes on that pair. The jackknife has the advantage of being unbiased, but it is computationally much more demanding than the procedure we have used. We are currently investigating the difference between the two methods.

3.2 Marker Genes for Breast Cancer

In particular, we identified single genes and gene pairs that partition normal breast samples from different breast tumor stages (Table 1). The overlap between the single gene classifiers (0%-10.34%) and between the gene pairs (0.11%-3.28%) is low, showing that the majority of these classifiers are specific to a certain tumor stage. Some of these genes are previously characterized cancer-related genes, such as Angiopoietin-like 4, which is known to be important in sustained angiogenesis; Matrix metalloproteinase 7, which was found to be up-regulated in colorectal carcinomas [7]; and Glutamine synthase, which is also up-regulated in tumor and important in tumor progression [1].
Grade-specific genes also agree with previous findings. As an example, the BIRC5 (survivin) gene, which is known to be overexpressed in common human cancers and was found to be correlated with Grade III tumors [6], was also identified only in Grade III tumors in this study.

In summary, we were able to distinguish normal from different stages of breast tumor using no more than two genes in each instance. Each of these single genes and gene pairs is a possible candidate to be used as a diagnostic for a specific type of breast tumor. The total sum of all such pairs includes a large number of genes (Table 1), and that provides an entrée into the search for correlated and/or co-regulated genes. Future work will focus on the identification of biological processes that are enriched with subsets of these genes and on further determining the regulatory mechanisms controlling these genes.

References

[1] Dang, C. V. and Semenza, G. L., Oncogenic alterations of metabolism, TIBS Reviews, 24(2):68-72.
[2] Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D., Cluster analysis and display of genome-wide expression patterns, Proc. Natl. Acad. Sci. USA, 95(25).
[3] Furey, T. S., Cristianini, N., Duffy, N., Bednarski, D. W., Schummer, M., and Haussler, D., Support vector machine classification and validation of cancer tissue samples using microarray expression data, Bioinformatics, 16(10).
[4] Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and Lander, E. S., Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring, Science, 286(5439).
[5] Hedenfalk, I., Duggan, D., Chen, Y., Radmacher, M., Bittner, M., Simon, R., Meltzer, P., Gusterson, B., Esteller, M., Kallioniemi, O. P., Wilfond, B., Borg, A., and Trent, J., Gene-expression profiles in hereditary breast cancer, N. Engl. J. Med., 344(8).
[6] Ma, X. J., Salunga, R., Tuggle, J. T., Gaudet, J., Enright, E., McQuary, P., Payette, T., Pistone, M., Stecker, K., Zhang, B. M., Zhou, Y. X., Varnholt, H., Smith, B., Gadd, M., Chatfield, E., Kessler, J., Baer, T. M., Erlander, M. G., and Sgroi, D. C., Gene expression profiles of human breast cancer progression, Proc. Natl. Acad. Sci. USA, 100(10).
[7] Masaki, T., Matsuoka, H., Sugiyama, M., Abe, N., Goto, A., Sakamoto, A., and Atomi, T., Matrilysin (MMP-7) as a significant determinant of malignant potential of early invasive colorectal carcinomas, Br. J. Cancer, 84(10).
[8] Mootha, V. K., Lindgren, C. M., Eriksson, K. F., Subramanian, A., Sihag, S., Lehar, J., Puigserver, P., Carlsson, E., Ridderstrale, M., Laurila, E., Houstis, N., Daly, M. J., Patterson, N., Mesirov, J. P., Golub, T. R., Tamayo, P., Spiegelman, B., Lander, E. S., Hirschhorn, J. N., Altschuler, D., and Groop, L. C., PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes, Nat. Genet., 34(3).
[9] Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareevan, S., Dmitrovsky, E., Lander, E. S., and Golub, T. R., Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation, Proc. Natl. Acad. Sci. USA, 96(6).
[10] van 't Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A. M., Mao, M., Peterse, H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T., Schreiber, G. J., Kerkhoven, R. M., Roberts, C., Linsley, P. S., Bernards, R., and Friend, S. H., Gene expression profiling predicts clinical outcome of breast cancer, Nature, 415(6871), 2002.

Supplementary Figure

Figure 5: A schematic overview of dividing the samples into training and test sets, shown for 32 normal samples (N) and 8 ADH samples (T). In the first partitioning, half of the samples of each class are selected randomly for training, and the others are left for testing. In the second partitioning, half of the training samples are chosen randomly from the training set of the first partitioning and the other half from those that have not been used in training (the test set of the first partitioning). In the third partitioning, all samples not previously used in training are selected for training, and the remainder is chosen randomly from the training set of the second partitioning.
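The three partitionings in the caption can be sketched for one class as below. This is an illustrative sketch only: the sample IDs are hypothetical, odd class sizes are rounded down (a choice the caption does not address), and the test set of each partitioning is taken to be every sample of the class that is not in its training set. Applying the function to each class and repeating it 33 times would give 99 partitions, although the text does not state how the 99 partitions are composed.

```python
import random
from typing import List


def three_partitions(samples: List[str], rng: random.Random) -> List[List[str]]:
    """Training sets for the 1st, 2nd and 3rd partitioning of one class."""
    n_train = len(samples) // 2

    # 1st partitioning: a random half of the class is used for training.
    pool = list(samples)
    rng.shuffle(pool)
    train1, test1 = pool[:n_train], pool[n_train:]

    # 2nd partitioning: half from the 1st training set, half from the 1st test set.
    half = n_train // 2
    train2 = rng.sample(train1, half) + rng.sample(test1, n_train - half)

    # 3rd partitioning: every sample not yet used in training,
    # topped up at random from the 2nd training set.
    used = set(train1) | set(train2)
    unused = [s for s in samples if s not in used]
    train3 = unused + rng.sample(train2, n_train - len(unused))

    return [train1, train2, train3]


rng = random.Random(0)
normals = [f"N{i}" for i in range(32)]   # hypothetical sample IDs
tumors = [f"T{i}" for i in range(8)]

normal_trains = three_partitions(normals, rng)
tumor_trains = three_partitions(tumors, rng)

for k, (n_tr, t_tr) in enumerate(zip(normal_trains, tumor_trains), start=1):
    print(f"partitioning {k}: {len(n_tr)} N + {len(t_tr)} T for training")
```

By construction every sample appears in at least one of the three training sets, which is the coverage property the method relies on; samples drawn more than once are the source of the usage bias the text mentions.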
