Tissue Classification Based on Gene Expression Data


Chapter 6: Tissue Classification Based on Gene Expression Data

Many diseases result from complex interactions involving numerous genes. Previously, such gene interactions were commonly studied separately. Since this approach is restricted to a relatively small number of genes, many important relationships have probably remained undiscovered. This, however, is likely to change with the invention of microarray technologies. Microarrays offer a totally new perspective on the study of complex diseases because of their capability for monitoring whole networks of genes. Cancer researchers especially hope to gain new insights into the development and characteristics of tumours (Liotta & Petricoin 2000). As the cause of cancer is generally linked to complex interactions of several genes, only the simultaneous monitoring of many genes may reveal the crucial underlying biological mechanisms. For this reason, microarrays have quickly become a major focus of research activities in the study of complex diseases. The medical applications of these 'arrays of hope' (Lander 1999) are various and include the identification of markers for classification, diagnosis, disease outcome prediction, therapeutic responsiveness and target identification. For medical applications, systems have to show high performance and the capacity to explain their decisions. We show in this chapter that both requirements are met by EFuNNs for the basic task of tissue classification based on gene expression data. EFuNNs facilitate the extraction of cluster-based rules that constitute gene expression profiles of a disease. This is a further step towards knowledge discovery from microarray data.

6.1 Introduction

Recent microarray studies of cancer have shown that gene expression data can supplement previous methods of phenotyping cancers, which rely on a limited set of histological and pathological features (Golub et al. 1999, Bittner et al. 2000, Khan et al. 2001). These traditional methods depend heavily on the experience of physicians to reach the correct decision. They are further limited since tumours with similar histopathological appearances can show very different responses to the same treatment. Any method that would facilitate the decision process for physicians is highly desirable. Different data analysis techniques have been used for the classification of cancer tissue samples. The application of unsupervised classification demonstrated the possibility of discovering new subclasses of cancer (Alizadeh et al. 2000, Bittner et al. 2000). The supervised classification of cancer samples was achieved using statistical as well as machine learning techniques (Golub et al. 1999, Khan et al. 2001, Shipp et al. 2002). Altogether, microarray techniques have been shown to be valuable and powerful candidates for enhancing decision-making systems in the medical field. A major problem, however, remains the extraction of knowledge from trained classification systems. Statistical methods are often highly model-dependent and include prior assumptions about the data. Machine learning techniques, such as artificial neural networks (ANNs), present a more flexible, model-free approach to classification and frequently yield good results. As black box methods, however, they lack the power to explain their decision processes, which is necessary for any widespread application in the medical field. Crucial decisions about disease treatment demand that the classification process is transparent. Physicians need to understand how a classifier reaches its judgement, since the final responsibility for any decision in the course of treatment remains with them.
A possible scenario is presented in figure 6.1. Both applied algorithms achieve optimal classification of the existing examples, but their classifications of new cases differ. In the case of black box classifiers, this is problematic for the physician, who has to decide which classifier to believe. Deciding for or against a given classifier's prediction remains difficult, since no explanation of the classification is provided by black box classifiers. In fact, any classification should be avoided in the presented example, since the new samples (C1 and C2 in the illustration) are fairly different from all the previous samples. For black box methods, however, it is difficult to decide whether the classification demands further validation. This example shows how important it is that a classifier can indicate which genes are crucial for its decision process. If, for new cases, the expression values of these genes lie outside the range of those of previous samples, an independent verification of the predictions should be considered. For any crucial medical decision, it is important to have insight into the classification process to determine the reliability and limits of the predictions. Furthermore, medical researchers could profit from classifiers that can comprehensibly present which combinations of genes are indicative of certain disease types. Even if the symptoms of the disease are the same, the original cause could differ, e.g. a

single pathway can be interrupted at different locations. These reasons motivated us to use knowledge-based neurocomputation (KBN) for microarray data analysis. KBN is an attractive paradigm for medical applications, since it tries to construct classification systems which are able to represent the acquired knowledge in a comprehensible way. KBN has been used for the analysis of DNA and RNA data (Towell & Shavlik 1994), but has not so far been employed for microarray data analysis. In particular, we have applied evolving connectionist systems (ECOS) as a version of KBN (Kasabov 2001). For the analysis of gene expression data we used evolving fuzzy neural networks (EFuNNs), which are an implementation of ECOS. Based on fuzzy theory, EFuNNs allow for fuzzy rule extraction and adaptation (Kasabov 2001). Using fuzzy rules may be advantageous, since microarray data contain a large noise component and crisp concepts remain difficult to define.

Figure 6.1: Problem of extrapolation in classification. (a) Global partitioning: two classifiers (I, II) are trained on the sample classes A and B, with their decision boundaries represented by the solid and dashed lines (in the gene expression space), i.e. the new samples (C1 and C2) are classified as samples of class B by classifier I (solid line) and as samples of class A by classifier II (dashed line). The classification of the new samples C1 and C2 (of unknown classes) is problematic, since they are different from previously seen examples. (b) Local partitioning: samples are classified only when they fall into clusters found in the training set. New samples (C1, C2) with a distinct expression profile lie outside known clusters and may constitute new clusters of unknown class. Further clinical tests would be advisable in this case. Local partitioning avoids the problem of extrapolation.
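The rejection behaviour of local partitioning (figure 6.1b) can be sketched in a few lines. The function, toy data and radius below are purely illustrative assumptions, not part of the EFuNN implementation discussed later:

```python
import numpy as np

def classify_local(x, centres, labels, radius):
    """Assign x the label of the nearest cluster centre, but refuse to
    classify (return None) if x lies outside every cluster's radius."""
    d = np.linalg.norm(centres - x, axis=1)
    i = np.argmin(d)
    return labels[i] if d[i] <= radius else None

# Two well-separated training clusters (classes A and B)
centres = np.array([[0.0, 0.0], [10.0, 10.0]])
labels = ["A", "B"]

print(classify_local(np.array([0.5, -0.5]), centres, labels, radius=2.0))  # A
print(classify_local(np.array([5.0, 5.0]), centres, labels, radius=2.0))   # None
```

A sample such as C1 or C2, far from all training clusters, falls outside every radius and is deliberately left unclassified instead of being extrapolated.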
While previous studies have focused mainly on the performance of particular classifiers, we present here a more holistic methodology for tissue classification based on microarray data. We see the classification algorithm as an integral part of a larger information system for the analysis of

microarray data. The main phases of information processing in such systems are the following:

1. Feature analysis and feature extraction: defining which features (genes) are most relevant and should therefore be used when creating a model for a particular problem (e.g. classification, prediction, decision making).

2. Modelling the problem: defining inputs, outputs and the type of model (e.g. probabilistic, rule-based, connectionist), training the model, etc.

3. Knowledge discovery: new knowledge is gained through analysis of the modelling results and of the model itself.

To achieve good performance, all these phases have to be integrated and optimised.

The structure of this chapter is as follows. First, related methods for tissue sample classification based on microarray data are reviewed. We discuss the major difficulties of this task and develop a methodology for microarray-based classification. This leads to a description of KBN, and in particular EFuNNs, in the context of the challenges posed by microarray data. Several novel modifications are introduced to adapt EFuNNs to these challenges. The proposed methodology for classification and knowledge discovery is illustrated in the case studies of leukaemia and colon cancer. Rules for classification are extracted from the trained EFuNN and visualised. In the last section, we discuss the performance and describe directions for future research.

6.2 Review of Related Methods and Systems

Golub et al. were among the first to show that it is possible to classify tumours based on their gene expression profiles (Golub et al. 1999). Comparing acute lymphoblastic leukaemia (ALL) and acute myeloid leukaemia (AML), they were able to find a set of 50 differentially expressed genes based on the signal-to-noise ratio (equation 7.1). These genes were used to classify new tissue samples with high accuracy. For the classification, Golub et al.
used a weighted voting approach that sums the separate predictions based on each selected gene. The prediction of each gene was weighted by its capacity to discriminate between ALL and AML. This approach offers the advantage of easy identification of the influence of each gene on the overall prediction. However, it is limited in discovering combinations of genes that may be more powerful for classification. Using a threshold for the overall prediction strength, some classifications can be rejected and thus the risk of misclassification can be reduced. Several alternative classification schemes have been applied to microarray data derived from tumour tissue. Khan et al. employed artificial neural networks to successfully classify four diagnostically distinct categories of small, round blue-cell tumours (Khan et al. 2001). Correct diagnosis of these tumours is crucial, given that distinct clinical decisions are required regarding the selection

of drugs, some of which can have dangerous side effects. The core of the applied classification model is fairly simple, using four parallel perceptrons (one for each tumour category). However, Khan et al. enhanced their model by several pre- and post-processing steps. First, they included only genes that were well expressed across all samples, reducing the number of selected genes. By applying PCA, they further reduced the dimensionality of the input for the neural networks. Using ten principal components and random splits of the original data set into training and validation sets, Khan et al. generated an ensemble of trained networks. To rank the genes, they applied the sensitivity measure for neural networks (i.e. the partial derivative of the network output with respect to a single input feature). Selecting only the top-ranked genes, they repeated the construction and training of the ensemble. By this procedure (backward feature selection), they were able to optimise the number of genes for input. Further, the generation of an ensemble allows the definition of a confidence interval for the prediction. Khan et al. used random splits to generate new data sets in their study; a similar, but statistically more developed, method that allows the definition of confidence intervals for outputs is the bootstrap method (Efron 1982). Using confidence intervals, Khan et al. were able to reject samples in an independent test set that differed from the four tumour categories of interest. Clustering approaches were used for classification in two studies. Ben-Dor et al. compared classification based on the CAST clustering algorithm with other supervised classification schemes (k-nearest neighbour, SVM, AdaBoost) (Ben-Dor et al. 2000). To use clustering for the classification of tissue samples, Ben-Dor et al. adjusted the clustering parameters so that the resulting cluster model was compatible with the tissue classes in the training set.
New samples were classified based on the constructed cluster model. The genes were selected by counting the minimal number of misclassifications that occurred for a given expression threshold. This so-called threshold number of misclassifications (TNOM) is optimal if the threshold can separate the two tissue classes considered. The performance of the four tested methods was strongly dependent on the data sets. No method appeared to be consistently superior for both data sets analysed. Moler et al. used the clustering of tissue samples to detect structures for gene selection (Moler et al. 2000). Instead of focusing on genes that discriminate between tissue classes, they selected genes which are differentially expressed in the detected clusters. For selection, they used a probabilistic approach termed naive Bayesian relevance. Based on the assumption that the distribution of expression values is Gaussian, the naive Bayesian relevance indicates how well the data fit the model. However, this assumption lacks justification. The final classification was performed using SVMs. Interestingly, they observed an improved overall accuracy for gene selection based on detected clusters compared to gene selection based on the original tissue classes. This observation corresponds to the study of Keller et al., in which the authors applied Bayesian classifiers (Keller et al. 2000). They achieved a higher accuracy if the tissue classes were stratified into subclasses. Further, they based their classifier on a likelihood model to select genes for multi-class

problems. Drawbacks of their approach are the explicit assumptions that genes are expressed independently of each other and that their expression follows a Gaussian distribution. While the first assumption is clearly biologically wrong, the second has not been verified. In many studies, gene selection is separated from the subsequent classification process. An alternative is to incorporate the selection into the classification process. For example, starting with only a small set of genes for discrimination, further genes are included only if they contribute to the classification. Zhang et al. used decision trees to select genes sequentially (Zhang et al. 2001). Starting with one gene and finding the optimal threshold of its expression value for separating the tissue classes, more genes are added to the model one at a time. This recursive partitioning has the advantage of selecting a combination of genes with little redundancy, since it favours the selection of mutually uncorrelated genes. However, sequential feature selection may easily become trapped in a local optimum. Xiong et al. applied a feature-wrapping approach exploiting a sequential forward search technique for gene selection (Xiong et al. 2001). Using linear discriminant analysis, logistic regression and SVMs, they reported that only a few genes are needed to achieve high classification accuracy, while including more genes can lead to a dramatically reduced performance. Similarly, Bo and Jonassen used wrapper approaches for classification (Bo & Jonassen 2002). To decrease the computational demands of this method, they searched for independently discriminative pairs of genes. The selected set of paired genes was used as a new feature set for classification. Bo and Jonassen observed that this selection method outperformed selection based on the ranking of single genes.
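As an illustration of the wrapper idea, a greedy sequential forward search can be sketched as follows. The nearest-centroid classifier and leave-one-out scoring used here are simplifying assumptions for the sketch, not the models used in the cited studies:

```python
import numpy as np

def nearest_centroid_acc(X, y, feats):
    """Leave-one-out accuracy of a nearest-centroid classifier on the
    given feature (gene) subset -- a deliberately cheap wrapped model."""
    Xs = X[:, feats]
    correct = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i          # hold out sample i
        c0 = Xs[mask & (y == 0)].mean(axis=0)  # class-0 centroid
        c1 = Xs[mask & (y == 1)].mean(axis=0)  # class-1 centroid
        pred = 0 if np.linalg.norm(Xs[i] - c0) < np.linalg.norm(Xs[i] - c1) else 1
        correct += (pred == y[i])
    return correct / len(y)

def forward_select(X, y, k):
    """Greedily add, one at a time, the gene that most improves
    the wrapper's leave-one-out accuracy."""
    selected = []
    for _ in range(k):
        rest = [j for j in range(X.shape[1]) if j not in selected]
        best = max(rest, key=lambda j: nearest_centroid_acc(X, y, selected + [j]))
        selected.append(best)
    return selected
```

The greedy step is exactly what makes the method prone to local optima: once a gene is added it is never reconsidered, even if a different combination would score better.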
Two recently introduced classification models aim at utilising global structures in the data as input for the classification system. Hastie et al. applied hierarchical clustering to reduce redundancy in the data and to increase the robustness of the system (Hastie et al. 2001). Instead of using the expression values of single genes for classification, they used the average expression values in detected gene clusters. New tissue samples were classified based on a regression model, which included quadratic terms to model possible interactions between the gene clusters. West et al. used singular value decomposition to separate breast tumour samples after selecting the 100 genes which correlated most strongly with the tissue classes (West et al. 2001). A Bayesian regression model was fitted to the resulting factors, allowing the definition of confidence intervals for the classifications. A different approach for the classification of microarray data was presented by Califano et al. (Califano et al. 2000). The basic principle is to find consensus patterns that are over-represented in one phenotype class while being under-represented in the other class. Patterns were defined by intervals of pre-defined width for gene expression across samples in a class. To detect these patterns, Califano et al. modified a pattern-finding algorithm previously developed for sequences. By comparing the occurrence of observed expression patterns with their occurrence in randomised data, they could assess the statistical significance of the patterns. Significant patterns were used for classification. On real

data, this method performed comparably to SVMs, whereas the method used by Califano et al. outperformed SVMs on artificially generated data. Several approaches used ensembles of classifiers for tissue sample classification (Ben-Dor et al. 2000, Dudoit et al. 2000a, Ryu & Cho 2002); we discuss them in further detail in a later section. The majority of the classification methods reviewed focus on the discrimination of two tissue classes. A computationally more challenging problem is multi-class diagnosis. Ramaswamy et al. applied several supervised learning methods for the classification of 14 distinct tumour types (Ramaswamy et al. 2001). The accuracy achieved far exceeded the accuracy expected from random classification. The classification of poorly differentiated cancer, however, resulted in a low accuracy. Notably, the performance of the SVMs (which achieved the highest accuracy) increased steadily with the number of genes (with maximum accuracy when all genes were included), whereas k-NN and weighted voting showed a decreased performance when more than 10 genes were included. Most systems presented here aim primarily at a high classification performance and the identification of marker genes, while the discovery of more complex biological knowledge is frequently neglected. One of the few computational approaches focussing on the direct extraction of knowledge from microarray data is the recently introduced rule-based system called GABRIEL (Genetic Analysis by Rule Incorporating Expert Logic) (Pan et al. 2002a). Expert knowledge can be implemented in GABRIEL in the form of rules. Meaningful patterns in microarray data are discovered by applying genetic algorithms, and translated into structural knowledge. The knowledge contained in the system can be extracted in the form of inference rules. Re-analysing publicly available microarray data, GABRIEL found patterns previously undiscovered by human experts.
This result demonstrated the benefits of knowledge-based systems in the analysis of gene expression data.

6.3 Challenges in Classifying Microarray Data

The classification of tissue samples based on microarray experiments faces several major challenges: (i) Microarray data contain a high level of noise due to experimental procedures and biological heterogeneity (Schuchhardt et al. 2000, Hughes et al. 2000). (ii) Microarray data are sparse: experiments usually include a large number of genes (features), but only a small number of samples (Golub et al. 1999, Alon et al. 1999). (iii) Many genes are highly correlated, which leads to redundancy in the data (Eisen et al. 1998). (iv) The tissue samples may be heterogeneous in composition, i.e. different types of tissue can be found in a single sample. This heterogeneity can cloud the separation of the tissue classes of interest, e.g. cancer and normal (Alon et al. 1999). (v) Finally, tissue samples themselves may be wrongly classified by physicians, since the results of standard tests can be contradictory or inconclusive (West et al. 2001). All these reasons make the classification of microarray data difficult. Here we discuss the first three of these challenges in microarray data analysis, followed by an outline of our methodology for overcoming them.

Microarray data inherit large experimental and biological variances

Experimental procedures such as tissue handling, RNA extraction and amplification or hybridisation introduce variances and biases in the measured expression levels. Some oligonucleotide and cDNA probes may not be specific to one gene, and several different genes may hybridise to the same probes ('cross-hybridisation'). Labelling cDNA and scanning the slides frequently show non-linear behaviour and introduce systematic errors (chapter four). With data quality and normalisation methods currently improving, many of the sources of noise may be reduced in the near future. However, as gene expression is highly variable even in different samples of the same tissue, classifiers for microarray data will have to be robust to a high degree of inherent variance.

Microarray data are sparse

A second major challenge is the high dimensionality of the input space. In present microarray experiments, several thousand genes are monitored, while the number of samples is often restricted to hundreds or fewer. It is well known in pattern recognition that classifiers frequently yield poor performance when the number of examples is small compared to the number of features. A large input space usually results in a large number of parameters in the model. Since the number of examples is small, the model parameters are often insufficiently determined.

Microarray data are highly redundant

Many genes are strongly correlated because of the high connectivity in the genetic network within the cell. Groups of genes often share the same expression pattern if they belong to a specific pathway. Adding co-expressed genes to the classification system does not increase the information available to the system. Furthermore, many genes on microarrays do not show any change in expression across the experiment.
This is because present microarrays are usually not selective in the genes used as probes. The selection of genes to be included in a microarray experiment is often based on their availability rather than on their functionality. Microarray data therefore tend to have a high degree of redundancy.

6.4 A Methodology for Gene Expression Profiling of Diseases

In this section we address the issue of feature selection and outline a methodology for profiling a disease based on gene expression data, which we have applied in the case studies of leukaemia and colon cancer (Reeve et al. 2002, Futschik et al. 2003).

Gene selection

Feature selection aims at improving classification by excluding irrelevant features and thereby reducing the size of the input space. Features are excluded if they do not contribute, or contribute only weakly, to the classification. In our case, many genes do not change their expression between

classes. Including these genes introduces noise and may yield a poorer classification performance. We wish to select only those genes that can serve as a good basis for discrimination between classes and, thus, for classification. Ideally, we would like to find the intrinsic dimensionality of the data, thereby minimising the redundancy in the data. A data set in a D-dimensional space is said to have intrinsic dimensionality d if the entire data set lies within a d-dimensional subspace (Bishop 1995). We can expect that different tissue classes have a distinct intrinsic dimensionality, i.e. a different number of characteristic genes. Each class of tissue may have a defined set of genes, the expression levels of which are specific for that class. These sets of genes need not overlap, i.e. a gene may be highly correlated with a single cancer subtype, but not correlated with other subtypes. Further, in selecting an optimal number of genes we have to balance the simplicity of the model against its robustness. Reducing the number of input features leads to a decreased number of model parameters. However, as genes show a high variability in their expression, the classification should not depend on too few genes. Minimising the number of genes based on a specific data set might produce a biased gene selection that is optimal only for the data studied, but not for the tissue type in general. The incorporation of feature selection into a classification system can be done in two different ways. First, feature selection can be treated separately from the classification model. Features are selected with respect to predefined criteria such as Pearson correlation. This approach often has the advantage of being computationally inexpensive and easy to apply. The selected genes, however, are frequently highly correlated with each other and, thus, likely to be redundant.
Alternatively, the selection of features is determined by the classification model itself, since an optimal set of features depends on the choice of classifier. This constitutes an integrated approach. Unfortunately, the large number of genes in a microarray experiment prohibits any exhaustive search for optimal sets of genes, especially if complex classifiers like ANNs are used. Sequential search techniques may still be practical, but frequently find only local optima. Additionally, they are rather sensitive to noise. In this study, we sought to combine the advantages of separate and integrated feature selection. First, we used filtering and correlation analysis to find a set of single classifying genes, i.e. genes which are independently good features for discrimination. This gene selection step is separated from the classification model. To select more complex features, such as groups of genes defining profiles of a disease, we then applied rule extraction to trained EFuNN models. Finding complex expression profiles may better accommodate the complexity of the underlying genetic networks and enable us to reveal the fine structures of disease classes. The extracted rules point to groups of genes that can be more indicative of a particular tissue type than single marker genes. The rules are further compacted using a novel method termed EFuNN shaving, which minimises the set of genes for the whole EFuNN model based on the classification performance. Thus, this step can be regarded as an integrated approach to gene selection.
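The separate (filter-style) selection step can be illustrated with a Pearson correlation ranking against a binary class label. This is a generic sketch under the assumption of a two-class problem; it is not the exact filter used in the case studies:

```python
import numpy as np

def rank_genes_by_correlation(X, y, top_k):
    """Rank genes by |Pearson correlation| between their expression
    (columns of X) and the binary class label y (0.0/1.0); return the
    indices of the top_k genes -- a simple filter-style selection."""
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    return np.argsort(-np.abs(r))[:top_k]

# Synthetic example: only gene 2 actually differs between the classes
rng = np.random.default_rng(1)
y = np.repeat([0.0, 1.0], 15)
X = rng.normal(size=(30, 6))
X[:, 2] += y * 3.0
print(rank_genes_by_correlation(X, y, 3))  # gene 2 should rank first
```

As noted above, a filter like this is cheap but blind to redundancy: two perfectly co-expressed genes would both rank highly even though the second adds no information.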

Methodology

The outline of the complete methodology that we introduce in this study is as follows:

1. Normalisation: scaling of the intensities enables the comparison of expression values between different microarrays within an experiment.

2. Pre-processing: filtering aims at eliminating constantly expressed genes across the tissue classes. Log-transformation helps to balance the range of the data.

3. Feature selection: a set of significantly differentially expressed genes across the classes is selected for clustering and classification.

4. Clustering or unsupervised classification: grouping of tissue samples with similar expression patterns reveals preliminary profiles of cancer samples.

5. Supervised classification: the tissue classes are modelled using classifiers. The model parameters are optimised and the resulting structure of the model analysed.

6. Knowledge discovery: the extraction of rules results in complex expression profiles. The rules represent the fine grades of the common expression levels for discriminative sets of genes. By using thresholds, smaller or larger groups can be investigated (Kasabov 2001).

The details of each step are described in section 6.8 and in references (Reeve et al. 2002, Futschik et al. 2003).

6.5 Knowledge Representation in KBN

Data-driven research can be performed by a variety of methods. One difficulty, however, is that many approaches are based on a number of assumptions about the analysed data. For example, the weighted voting method by Golub et al. assigned single genes values reflecting their importance within the classification, assuming genes are independent variables (Golub et al. 1999). It is well known, however, that genes are highly correlated and that most functions within cells are carried out by several interacting genes. This should be reflected in the structure of a classifier.
One possibility may be the construction of more complex statistical models, which would be rather difficult since relatively little is known about the genetic networks in cells. Another approach is the use of flexible machine learning methods like ANNs, which do not require assumptions about the data distribution. They can model highly non-linear tasks and have proven very applicable when prior knowledge is minimal. The disadvantage is the black box character of ANNs. To overcome this drawback without losing the computational power of ANNs, we applied KBN for the classification of cancer tissues using microarray data. An introduction to KBN can be found in an earlier section. KBN has been applied successfully to classify DNA and RNA sequences (Towell & Shavlik 1994). Towell and Shavlik introduced the KBANN

algorithm to encode and refine expert knowledge. While this is practical in sequence analysis, where knowledge has been accumulated over the past decades, this is not the case for gene regulation on the genomic scale. The study of the complex interactions and the various functions of genes is still in its infancy. The main goals of KBN in microarray data analysis are (1) to model non-linear associations between many genes and the diseases studied and (2) to extract knowledge, possibly in the form of rules.

6.6 Modelling and Knowledge Discovery by EFuNNs

For the purpose of tissue classification and knowledge discovery, we used EFuNNs to re-analyse the leukaemia and colon cancer microarray data sets. The original EFuNN algorithm is described in an earlier section. Here, we discuss its features in relation to the challenges posed by gene expression data and introduce several novel modifications to adjust the EFuNN algorithm to these challenges. We used fuzzy neural networks to study gene expression for the following reasons:

- Microarray data have a large noise component. Fuzzy classifiers have been shown to be robust for this kind of data (Kasabov 1996).

- The variability of the data is large. Binning gene expression values into crisp categories such as 'high' or 'low' may lead to oversimplification. The use of fuzzy membership functions is more appropriate.

- As little is known about gene expression on the genomic scale, defining crisp concepts is difficult. Fuzzy variables may be useful.

Fuzzy logic is less sensitive to imprecise data than classical Boolean logic, since in fuzzy logic it is not necessary to set hard thresholds between concepts such as 'gene i has medium expression' and 'gene i has high expression'. A given gene expression value can be partially labelled as both 'medium' and 'high'. This labelling is defined by a set of membership functions µ_i that represent a mapping of crisp values to fuzzy labels (e.g. 'low', 'medium', 'high'). Figure 6.2 illustrates this concept.
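The mapping of a crisp expression value onto overlapping triangular membership functions can be sketched as follows. The membership centres are illustrative assumptions; the example value 0.79 with centres at 0.0, 0.5 and 1.0 reproduces the 0.58/0.42 split shown in figure 6.2:

```python
import numpy as np

def triangular_memberships(g, centres):
    """Map a crisp expression value g onto fuzzy labels defined by
    triangular membership functions centred at `centres` (sorted).
    With adjacent, overlapping triangles the degrees sum to 1."""
    mu = np.zeros(len(centres))
    if g <= centres[0]:
        mu[0] = 1.0
    elif g >= centres[-1]:
        mu[-1] = 1.0
    else:
        i = np.searchsorted(centres, g) - 1   # interval containing g
        span = centres[i + 1] - centres[i]
        mu[i + 1] = (g - centres[i]) / span   # closeness to upper centre
        mu[i] = 1.0 - mu[i + 1]               # remainder to lower centre
    return mu

# 'low', 'medium', 'high' centred at 0.0, 0.5 and 1.0
mu = triangular_memberships(0.79, np.array([0.0, 0.5, 1.0]))
print(mu)  # degrees for ('low', 'medium', 'high'); they sum to 1
```

Because at most two adjacent triangles are non-zero and their degrees are complements, the sum-to-one property holds for any crisp input value.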
The rule nodes in EFuNNs form a local representation of the domain space. This feature contrasts with MLPs, which constitute a distributed representation of the input examples through the activation values in the hidden layer. In an MLP, hidden nodes with sigmoid activation functions span hyperplanes in the domain space. A new example is classified based only on its position with respect to these hyperplanes. This is problematic in the case of microarray data, as the maximum distance to previous examples cannot be controlled. To avoid crucial misclassifications, however, examples should not be classified at all if their expression profiles are distinct from those used for training. This can easily be achieved by a classifier with a local representation of the domain space by setting appropriate activation thresholds. New examples with distinct gene

12 Degree of membership Min Low Medium Gene expression g ** g * High Max Figure 6.2: By using triangular membership functions a crisp gene expression value g can be mapped into fuzzy labels. In the above example, g has a degree of membership of 0.58 for highly expressed and 0.42 for medium expressed. Classical set theory is based on crisp thresholds (dashed lines) for membership. These crisp thresholds can result that two similarly expressed genes are put in different categories (e.g. g would be labelled as highly expressed and g as medium expressed ) making classical logic more sensitive to noise. expression profiles from those of previous examples stay below the threshold and remain unclassified. A favourable quality of EFuNNs is the ability to map the trained network to a set of symbolic inference rules. In the case of tissue classification, the rules represent the assignment of regions in the gene space to tissue classes. The rule nodes are the cluster centres of tissue samples. Since the network structure can be directly mapped into a set of rules, no intermediate steps are necessary. This ensures that we do not sacrifice the performance of the network by its conversion into inference rules. Many alternative rule extraction procedures with intermediate steps frequently lead to a dramatic decrease in accuracy. We will see in section 6.9, that compact rules can be derived without loss of performance. Although different membership functions can be exploited within EFuNNs, the use of the triangular membership function guarantees the conservation of the property i µ i(g jk ) = 1, where µ i is the ith membership value for the gene expression value g jk of gene j in sample k. This results from the specific form of the supervised learning which updates the rule node vector r j according to r j i = ri j + α(xi k ri j ), (xi k, ri j : the membership degrees of the fuzzy variable i for the kth example and rule node j; α: learning rate). 
It follows that r j i = (rj i + α(xi k ri j )) = rj i + α x i k α rj i = 1 α + α = 1 i i i i i This property ensures that the rules extracted from an EFuNN are consistent in the anterior part. Contradictory rules of the form [IF (g i is high ( µ high i = 1) AND g i is low ( µ low i = 1) THEN sample k is cancer tissue] cannot be created. 110
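A short numeric check of this invariance (a toy sketch, not the EFuNN implementation; the vector size and learning rate are arbitrary):

```python
import numpy as np

# The rule-node update r_i <- r_i + alpha * (x_i - r_i) preserves
# sum_i r_i = 1 whenever both the rule node vector and the fuzzified
# example sum to 1, for any learning rate alpha.
rng = np.random.default_rng(0)

def update(r, x, alpha):
    return r + alpha * (x - r)

r = rng.random(3); r /= r.sum()          # membership degrees of a rule node
for _ in range(100):
    x = rng.random(3); x /= x.sum()      # fuzzified training example
    r = update(r, x, alpha=0.1)
    assert abs(r.sum() - 1.0) < 1e-12    # invariant holds after every step
print(r.sum())  # stays at 1.0 up to rounding
```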

6.7 Adaptations of EFuNNs to Off-line Learning

EFuNNs have been primarily designed for on-line learning (section 3.5.2). They can adapt to an input-output stream of data (whose structure may change) without suffering the catastrophic forgetting frequently observed in many conventional neural networks. Our case studies, however, pose off-line learning problems, since we analyse fixed data sets of limited size. While the dynamic learning algorithm of EFuNNs adapts very well to situations where large numbers of examples are present, several modifications are needed for improved classification in the off-line task of classifying gene expression data:

1. Multiple data passes: Several passes of the training set through the EFuNN structure are used instead of the single pass of the original EFuNN algorithm. Multi-pass training allows a refinement of the EFuNN structure learned in the first pass: the locations of rule nodes created in the first data pass are adjusted in subsequent iterations, resulting in a better representation of the detected clusters. In our experiments, the structure of EFuNNs in multi-pass training mode usually stabilised quickly; frequently, three passes gave an optimal or near-optimal result.

2. Model generation: The learned structure of EFuNNs depends, particularly for small data sets, on the order of objects in the training set. Different orders can lead to different locations and receptive fields of the rule nodes.[1] This apparent shortcoming can be turned into an advantage if we use the dependency for model generation. By randomising the order of objects in the training set, we can generate a set of different EFuNN structures. By comparing their performance on a validation set, these structures can be ranked and the best-performing structure selected.
Note that neural networks with batch learning algorithms, such as MLPs, do not depend on the order of the training data, since the batch learning algorithm averages over all examples during training. Their final structures, however, depend on the initial structure, for which random weights are frequently used. To achieve an optimal structure, several MLPs with different initial structures are usually trained and the best-performing network is selected.

3. EFuNN shaving: Every EFuNN node is activated by samples showing a certain gene expression pattern; the node represents a cluster of samples with similar gene expression levels. As we will see in the following analysis, only a small subset of genes is crucial for defining these patterns. The remaining genes can be regarded as noise in the classification process. EFuNN shaving tries to discover the characteristic set of genes by a gradual reduction of the number of included genes. The details of this modification of the EFuNN training are given in section 6.9. The expression values of the remaining genes constitute gene expression patterns that are specific to the rule nodes and less redundant. We use this compact set of genes to extract rules of minimal length for the nodes.

4. EFuNN tuning: The EFuNN learning algorithm is based on a local adjustment of the rule nodes in the input-output space without the adjustment of neighbouring rule nodes. (This is strictly the case only for the 1-of-N activation mode, which we use throughout these case studies, while M-of-N modes simultaneously adjust several rule nodes for each training example.) This local learning can result in an EFuNN structure containing localised conflicts between single rule nodes, e.g. overlapping receptive fields. In on-line learning, these conflicts are resolved by the permanent adjustment of the borders between the rule nodes using new training examples from a continuous data stream. In off-line learning this possibility does not exist, as the number of examples is limited. However, since we have a finite number of training examples, we can use the complete set of examples to refine the borders between the rule nodes, i.e. we adjust the EFuNN structure globally to the data structure. The details of EFuNN tuning are described in chapter seven.

[1] Biological neural networks show a similar behaviour. It is well known in neuroscience that the learning process of biological networks depends strongly on the order of the presented examples; positive as well as negative interference due to a specific order in the learning process can be observed. Good teaching of a physiological network involves not only defining the set of examples, but also the order in which the learning examples are presented (Kandel et al. 1995).

6.8 Modelling of Gene Expression Data: Case Studies on Leukaemia and Colon Cancer Tissues

In this section, we apply the described methodology to the leukaemia and colon cancer data sets (Golub et al. 1999, Alon et al. 1999). We first normalise and pre-process the data and examine the importance of these steps for classification.
Using EFuNNs, we classify the data sets and analyse the results and structures of the fuzzy neural networks. Finally, we use rule extraction to identify groups of genes that form profiles highly indicative of particular cancer types.

Normalisation, Pre-processing and Feature Selection

Normalisation: To compare different microarray experiments, the data have to be normalised (chapter four). For the purpose of this study, we used global normalisation, i.e. we scaled the arrays to have the same total fluorescence intensity.

Pre-processing: The next step in the analysis is the pre-processing of the data. Since microarray data are frequently strongly skewed, we used the log2-transformation to balance the data (see section 4.4). To exclude genes which do not show changes in expression, we filtered out all genes $k$ with

$\max(\log_2(g_k)) - \min(\log_2(g_k)) < 3,$

with the maximum and minimum taken over the samples. Although these procedures are common in the analysis of microarray data (Tamayo et al. 1999, Perou et al. 2000), they seem somewhat arbitrary. A major direction of our future research is therefore to find an optimal adjustment of the pre-processing and normalisation within the classification system (section 6.11).

Gene selection: For the selection of genes for classification, several methods such as the t-test or Fisher discriminant analysis have been proposed (Welsh et al. 2001, Dudoit et al. 2000b). Intuitively, we wish to select genes that are highly correlated with the tissue classes. Since we do not want to discriminate between correlated and anti-correlated genes, we used the squared Pearson correlation coefficient $c_P^2$, computed as follows (section 3.1):

$c_P^2(g_i, c) = \left( \frac{\sum_j (g_{ij} - \bar{g}_i)(c_j - \bar{c})}{\| g_i - \bar{g}_i \| \, \| c - \bar{c} \|} \right)^2,$

with the vector $g_i$ of the expression values of gene $i$, the class membership $c_j$ of sample $j$, and $\| \cdot \|$ the Euclidean norm. Although this is a simple approach to gene selection, it finds sets of genes that are well suited for discriminating tissue classes. This can be seen using PCA (section 3.7.1), which gives a first insight into the structure of high-dimensional data. After selecting 100 genes, the visualisation of the first two principal components of the leukaemia data set indicates that AML and ALL samples are generally well separated, with only a small overlap between the two classes (figure 6.3).

Figure 6.3: PCA of leukaemia samples (first two principal components; classes: ALL B cell, ALL T cell, AML), based on the 100 genes that have the largest squared Pearson correlation with the two classes ALL and AML. The first two principal components include 63.3% of the total variance of the data.

A PCA

projection of the colon data set can be found in figure 6.6; it shows that the colon cancer and normal tissue samples are less well separated and include several outliers.

Figure 6.4: Stability of gene selection by squared Pearson correlation for the leukaemia and colon data sets (number of genes against frequency of occurrence). The graphs show how frequently genes were selected under the bootstrap method described in the text.

Besides distinguishing between the tissue classes, the selected genes should not depend on a specific choice of the data samples. To examine the stability of the selection, we used a bootstrap method: leaving out one sample, the genes are selected based on the remaining (N-1) samples, and this is repeated for all samples. Genes that are well correlated across all samples should be selected frequently, while genes that are correlated only with a subset of the samples should be selected less often. Figure 6.4 shows that most of the genes are chosen repeatedly. This demonstrates that selection by squared Pearson correlation is highly stable, at least for the data sets analysed here.

We finally assessed the quality of normalisation and pre-processing by testing their influence on the classification performance using N-fold cross-validation (section 3.6). As can be seen in figure 6.5, normalisation and pre-processing can strongly influence the classification performance, as in the case of the colon tissue data: the performance was increased by normalisation and further by log2-transformation and filtering of the data. For the leukaemia case study, however, the improvement was minor. There are two possible explanations: either the leukaemia data contained less variation than the colon cancer data due to the experimental procedures, or our specific choice of normalisation and pre-processing steps (in conjunction with the gene selection) was better suited for colon cancer than for leukaemia.
This would suggest that normalisation, pre-processing and feature selection are not only important for the classification performance but are also data dependent. A future classification system may adjust these pre-classification steps to the data in order to improve its performance.
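The pre-processing, gene-selection and stability steps of this subsection can be sketched as follows. This is a minimal illustration on random stand-in data, not the code used in the thesis; `select_genes` and all numeric settings are assumptions:

```python
import numpy as np

# Sketch of the pipeline above: (1) global normalisation to equal total
# intensity, (2) log2 transform, (3) the max - min >= 3 variation filter,
# (4) gene ranking by squared Pearson correlation with the class labels,
# and (5) the leave-one-out check of how stably the same genes recur.
rng = np.random.default_rng(1)
raw = rng.gamma(shape=2.0, scale=200.0, size=(1000, 38))  # genes x samples
y = np.array([0] * 27 + [1] * 11)                         # e.g. ALL vs AML
raw[:30, y == 1] *= 8.0                                   # informative genes

X = raw * (raw.sum(axis=0).mean() / raw.sum(axis=0))  # (1) equal total intensity
X = np.log2(X)                                        # (2) reduce skew
X = X[(X.max(axis=1) - X.min(axis=1)) >= 3]           # (3) variation filter

def select_genes(X, y, n_top=100):
    """Indices of the n_top genes with largest squared Pearson correlation."""
    Xc = X - X.mean(axis=1, keepdims=True)
    yc = y - y.mean()
    c2 = (Xc @ yc) ** 2 / ((Xc ** 2).sum(axis=1) * (yc ** 2).sum())
    return set(np.argsort(c2)[::-1][:n_top])

# (5) leave-one-out stability of the selection
counts = np.zeros(X.shape[0], dtype=int)
for k in range(X.shape[1]):                 # leave sample k out
    mask = np.arange(X.shape[1]) != k
    for g in select_genes(X[:, mask], y[mask]):
        counts[g] += 1

stable = np.flatnonzero(counts == X.shape[1])  # selected in every round
print(len(stable), "genes are selected in all leave-one-out rounds")
```

On data with a genuine class signal, the informative genes are re-selected in every round, which is the behaviour figure 6.4 reports for the real data sets.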

Figure 6.5: Dependency of EFuNN performance (classification accuracy in %) on pre-processing and normalisation of the microarray data, for the colon cancer and leukaemia data sets (raw data; scaled data; scaled + logged data; scaled + logged + filtered data). The N-fold cross-validation testing was based on 100 selected genes (EFuNN parameters: error threshold = 0.9; maximum receptive field = 0.3; aggregation number = 20).

Table 6.1: Performance of EFuNNs in N-fold cross-validation testing (rows: Leukaemia - EFuNN with two parameter settings, Colon - EFuNN, Colon - EFuNN (shaved)). Errth - error threshold, maxrf - maximum receptive field, nagg - aggregation number, Rules - average number of evolved rule nodes, Acc-tr - average classification accuracy of training examples in %, Acc-te - average classification accuracy of testing examples in %.

Classification of Leukaemia and Colon Cancer Tissues

The performance of EFuNNs was tested by N-fold cross-validation. The cross-validation procedure included the gene selection step, to avoid the over-estimation of accuracy that can occur if the complete data set (including the test set) is used for gene selection. For modelling, the parameters of EFuNNs have to be adjusted. Note that we used three triangular membership functions, "low", "medium" and "high", throughout the analysis (figure 6.2). The chosen parameters determine not only the classification accuracy but also the internal structure of EFuNNs, by controlling the number of rule nodes. Frequent aggregation of rule nodes and a larger maximum receptive field, for example, yield a smaller number of rule nodes. An example of this behaviour is given for the leukaemia data in table 6.1: by reducing the maximum receptive field from 0.5 to 0.3, the average number of rule nodes produced during the training phase increased from 2.0 to 6.1.
While the classification accuracy rose for the training set, it decreased for the testing set. This is a consequence of the well-known bias-variance trade-off: classifiers with a large number of parameters show a high flexibility in modelling the data but tend to overfit it, resulting in poorer generalisation. They achieve a higher accuracy on the training data but show a reduced performance on the test data.

For the leukaemia data set, a maximum accuracy of 97.2% was obtained by EFuNNs on the testing set using N-fold cross-validation. This is a very high accuracy considering that the trained EFuNN models consisted on average of only two rules. An overview of classification results for other approaches can be found in table 6.2.

Data       Classifier                Study                 Accuracy (%)
Leukaemia  weighted voting           Golub et al. 1999     94.1
Leukaemia  feature wrapping          Bo & Jonassen 2002    98.6
Leukaemia  ensemble of classifiers   Ryu & Cho 2002        97.1
Colon      decision tree             Zhang et al. 2001     93.6
Colon      lin. discrim. analysis    Xiong et al. 2000     87.0
Colon      SVM                       Moler et al. 2000     88.7
Colon      clustering                Ben-Dor et al. 2000   88.7
Colon      feature wrapping          Bo & Jonassen 2002    88.7

Table 6.2: Comparison of classification results for the leukaemia and colon cancer data. The accuracies (in %) of several alternative approaches are shown as reported in the corresponding publications. A direct comparison is difficult, since the testing procedures differ (see text).

Golub et al. achieved a lower performance (94.1%) applying a weighted voting method (Golub et al. 1999). A direct comparison is restricted, however, since Golub et al. assessed the accuracy based on a single split of the total data into training and testing sets. Bo and Jonassen reported a maximum accuracy of 98.6%, including more than 30 genes in their model (Bo & Jonassen 2002). As we will see in the next section, EFuNNs need considerably fewer genes for classification. Using the same training and test data as Golub et al., the ensemble classifier constructed by Ryu and Cho yielded a maximum accuracy of 97.1% (with a further improvement reported recently (Ryu & Cho 2002)).

The classification performance is considerably lower for the colon cancer data, at 90.3% (table 6.1); we will improve this accuracy using EFuNN shaving in the next section. An accuracy of 93.6% was achieved by Zhang et al. using decision tree techniques (Zhang et al. 2001), but this result was based on a different testing method and is flawed by the lack of any data normalisation. Fisher's linear discriminant analysis conducted by Xiong et al.
achieved an accuracy of 87.0% (Xiong et al. 2000). Moler et al. also re-analysed the colon tumour data, applying an SVM approach that classified 88.7% of the samples correctly (Moler et al. 2000). The same accuracy was achieved by Ben-Dor et al. using clustering-based classification (Ben-Dor et al. 2000) and by Bo and Jonassen applying their proposed feature wrapping method (Bo & Jonassen 2002).

For the colon data, the PCA shows several samples as outliers (figure 6.6), which made the classification more difficult. As for the leukaemia data set, a large maximum receptive field ensured the simplicity of the model and a high accuracy on the test set. The misclassified samples are shown in a PCA projection in figure 6.6; most of them can be considered outliers. Based on the set of selected genes and the PCA projection, their gene expression profiles were very similar to the expression profiles of samples belonging to the other tissue class.
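The cross-validation protocol used in this section, with gene selection nested inside each fold, can be sketched as follows. This is a minimal illustration on random stand-in data; the nearest-centroid classifier is a simple stand-in for an EFuNN, and all names and settings are assumptions:

```python
import numpy as np

# N-fold (here leave-one-out) cross-validation with gene selection nested
# inside each fold: genes are re-selected from the training folds only, so
# the held-out sample never influences the selection.

def select_genes(X, y, n_top=50):
    """Indices of the n_top genes with largest squared Pearson correlation."""
    Xc = X - X.mean(axis=1, keepdims=True)
    yc = y - y.mean()
    c2 = (Xc @ yc) ** 2 / ((Xc ** 2).sum(axis=1) * (yc ** 2).sum())
    return np.argsort(c2)[::-1][:n_top]

def nearest_centroid_predict(Xtr, ytr, xte):
    """Assign xte to the class with the nearest training centroid."""
    cents = {c: Xtr[:, ytr == c].mean(axis=1) for c in np.unique(ytr)}
    return min(cents, key=lambda c: np.linalg.norm(xte - cents[c]))

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 40))            # genes x samples
y = np.array([0] * 20 + [1] * 20)
X[:30, y == 1] += 2.0                     # informative genes

correct = 0
for k in range(X.shape[1]):               # leave sample k out
    mask = np.arange(X.shape[1]) != k
    genes = select_genes(X[:, mask], y[mask])          # nested selection
    pred = nearest_centroid_predict(X[np.ix_(genes, mask)], y[mask],
                                    X[genes, k])
    correct += (pred == y[k])
print(correct / X.shape[1])
```

Selecting genes on the full data set before splitting would let each held-out sample influence the selection and inflate the estimated accuracy, which is exactly the over-estimation the nested protocol avoids.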


More information

Introduction to Machine Learning. Katherine Heller Deep Learning Summer School 2018

Introduction to Machine Learning. Katherine Heller Deep Learning Summer School 2018 Introduction to Machine Learning Katherine Heller Deep Learning Summer School 2018 Outline Kinds of machine learning Linear regression Regularization Bayesian methods Logistic Regression Why we do this

More information

Learning in neural networks

Learning in neural networks http://ccnl.psy.unipd.it Learning in neural networks Marco Zorzi University of Padova M. Zorzi - European Diploma in Cognitive and Brain Sciences, Cognitive modeling", HWK 19-24/3/2006 1 Connectionist

More information

Artificial Neural Networks (Ref: Negnevitsky, M. Artificial Intelligence, Chapter 6)

Artificial Neural Networks (Ref: Negnevitsky, M. Artificial Intelligence, Chapter 6) Artificial Neural Networks (Ref: Negnevitsky, M. Artificial Intelligence, Chapter 6) BPNN in Practice Week 3 Lecture Notes page 1 of 1 The Hopfield Network In this network, it was designed on analogy of

More information

Gender Based Emotion Recognition using Speech Signals: A Review

Gender Based Emotion Recognition using Speech Signals: A Review 50 Gender Based Emotion Recognition using Speech Signals: A Review Parvinder Kaur 1, Mandeep Kaur 2 1 Department of Electronics and Communication Engineering, Punjabi University, Patiala, India 2 Department

More information

EECS 433 Statistical Pattern Recognition

EECS 433 Statistical Pattern Recognition EECS 433 Statistical Pattern Recognition Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1 / 19 Outline What is Pattern

More information

FUZZY C-MEANS AND ENTROPY BASED GENE SELECTION BY PRINCIPAL COMPONENT ANALYSIS IN CANCER CLASSIFICATION

FUZZY C-MEANS AND ENTROPY BASED GENE SELECTION BY PRINCIPAL COMPONENT ANALYSIS IN CANCER CLASSIFICATION FUZZY C-MEANS AND ENTROPY BASED GENE SELECTION BY PRINCIPAL COMPONENT ANALYSIS IN CANCER CLASSIFICATION SOMAYEH ABBASI, HAMID MAHMOODIAN Department of Electrical Engineering, Najafabad branch, Islamic

More information

PMR5406 Redes Neurais e Lógica Fuzzy. Aula 5 Alguns Exemplos

PMR5406 Redes Neurais e Lógica Fuzzy. Aula 5 Alguns Exemplos PMR5406 Redes Neurais e Lógica Fuzzy Aula 5 Alguns Exemplos APPLICATIONS Two examples of real life applications of neural networks for pattern classification: RBF networks for face recognition FF networks

More information

Stepwise Knowledge Acquisition in a Fuzzy Knowledge Representation Framework

Stepwise Knowledge Acquisition in a Fuzzy Knowledge Representation Framework Stepwise Knowledge Acquisition in a Fuzzy Knowledge Representation Framework Thomas E. Rothenfluh 1, Karl Bögl 2, and Klaus-Peter Adlassnig 2 1 Department of Psychology University of Zurich, Zürichbergstraße

More information

Neurons and neural networks II. Hopfield network

Neurons and neural networks II. Hopfield network Neurons and neural networks II. Hopfield network 1 Perceptron recap key ingredient: adaptivity of the system unsupervised vs supervised learning architecture for discrimination: single neuron perceptron

More information

Error Detection based on neural signals

Error Detection based on neural signals Error Detection based on neural signals Nir Even- Chen and Igor Berman, Electrical Engineering, Stanford Introduction Brain computer interface (BCI) is a direct communication pathway between the brain

More information

NMF-Density: NMF-Based Breast Density Classifier

NMF-Density: NMF-Based Breast Density Classifier NMF-Density: NMF-Based Breast Density Classifier Lahouari Ghouti and Abdullah H. Owaidh King Fahd University of Petroleum and Minerals - Department of Information and Computer Science. KFUPM Box 1128.

More information

CSE Introduction to High-Perfomance Deep Learning ImageNet & VGG. Jihyung Kil

CSE Introduction to High-Perfomance Deep Learning ImageNet & VGG. Jihyung Kil CSE 5194.01 - Introduction to High-Perfomance Deep Learning ImageNet & VGG Jihyung Kil ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton,

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014 Exam policy: This exam allows two one-page, two-sided cheat sheets (i.e. 4 sides); No other materials. Time: 2 hours. Be sure to write

More information

Colon cancer subtypes from gene expression data

Colon cancer subtypes from gene expression data Colon cancer subtypes from gene expression data Nathan Cunningham Giuseppe Di Benedetto Sherman Ip Leon Law Module 6: Applied Statistics 26th February 2016 Aim Replicate findings of Felipe De Sousa et

More information

Learning Classifier Systems (LCS/XCSF)

Learning Classifier Systems (LCS/XCSF) Context-Dependent Predictions and Cognitive Arm Control with XCSF Learning Classifier Systems (LCS/XCSF) Laurentius Florentin Gruber Seminar aus Künstlicher Intelligenz WS 2015/16 Professor Johannes Fürnkranz

More information

EARLY STAGE DIAGNOSIS OF LUNG CANCER USING CT-SCAN IMAGES BASED ON CELLULAR LEARNING AUTOMATE

EARLY STAGE DIAGNOSIS OF LUNG CANCER USING CT-SCAN IMAGES BASED ON CELLULAR LEARNING AUTOMATE EARLY STAGE DIAGNOSIS OF LUNG CANCER USING CT-SCAN IMAGES BASED ON CELLULAR LEARNING AUTOMATE SAKTHI NEELA.P.K Department of M.E (Medical electronics) Sengunthar College of engineering Namakkal, Tamilnadu,

More information

Data analysis and binary regression for predictive discrimination. using DNA microarray data. (Breast cancer) discrimination. Expression array data

Data analysis and binary regression for predictive discrimination. using DNA microarray data. (Breast cancer) discrimination. Expression array data West Mike of Statistics & Decision Sciences Institute Duke University wwwstatdukeedu IPAM Functional Genomics Workshop November Two group problems: Binary outcomes ffl eg, ER+ versus ER ffl eg, lymph node

More information

AClass: A Simple, Online Probabilistic Classifier. Vikash K. Mansinghka Computational Cognitive Science Group MIT BCS/CSAIL

AClass: A Simple, Online Probabilistic Classifier. Vikash K. Mansinghka Computational Cognitive Science Group MIT BCS/CSAIL AClass: A Simple, Online Probabilistic Classifier Vikash K. Mansinghka Computational Cognitive Science Group MIT BCS/CSAIL AClass: A Simple, Online Probabilistic Classifier or How I learned to stop worrying

More information

An Efficient Diseases Classifier based on Microarray Datasets using Clustering ANOVA Extreme Learning Machine (CAELM)

An Efficient Diseases Classifier based on Microarray Datasets using Clustering ANOVA Extreme Learning Machine (CAELM) www.ijcsi.org 8 An Efficient Diseases Classifier based on Microarray Datasets using Clustering ANOVA Extreme Learning Machine (CAELM) Shamsan Aljamali 1, Zhang Zuping 2 and Long Jun 3 1 School of Information

More information

Good Old clinical markers have similar power in breast cancer prognosis as microarray gene expression profilers q

Good Old clinical markers have similar power in breast cancer prognosis as microarray gene expression profilers q European Journal of Cancer 40 (2004) 1837 1841 European Journal of Cancer www.ejconline.com Good Old clinical markers have similar power in breast cancer prognosis as microarray gene expression profilers

More information

Lecture #4: Overabundance Analysis and Class Discovery

Lecture #4: Overabundance Analysis and Class Discovery 236632 Topics in Microarray Data nalysis Winter 2004-5 November 15, 2004 Lecture #4: Overabundance nalysis and Class Discovery Lecturer: Doron Lipson Scribes: Itai Sharon & Tomer Shiran 1 Differentially

More information

Integrative Methods for Gene Data Analysis and Knowledge Discovery on the Case Study of KEDRI s Brain Gene Ontology. Yuepeng Wang

Integrative Methods for Gene Data Analysis and Knowledge Discovery on the Case Study of KEDRI s Brain Gene Ontology. Yuepeng Wang Integrative Methods for Gene Data Analysis and Knowledge Discovery on the Case Study of KEDRI s Brain Gene Ontology Yuepeng Wang A thesis submitted to Auckland University of Technology in partial fulfilment

More information

How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis?

How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis? How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis? Richards J. Heuer, Jr. Version 1.2, October 16, 2005 This document is from a collection of works by Richards J. Heuer, Jr.

More information

Introduction to Computational Neuroscience

Introduction to Computational Neuroscience Introduction to Computational Neuroscience Lecture 5: Data analysis II Lesson Title 1 Introduction 2 Structure and Function of the NS 3 Windows to the Brain 4 Data analysis 5 Data analysis II 6 Single

More information

A hierarchical two-phase framework for selecting genes in cancer datasets with a neuro-fuzzy system

A hierarchical two-phase framework for selecting genes in cancer datasets with a neuro-fuzzy system Technology and Health Care 24 (2016) S601 S605 DOI 10.3233/THC-161187 IOS Press S601 A hierarchical two-phase framework for selecting genes in cancer datasets with a neuro-fuzzy system Jongwoo Lim, Bohyun

More information

Classification of Synapses Using Spatial Protein Data

Classification of Synapses Using Spatial Protein Data Classification of Synapses Using Spatial Protein Data Jenny Chen and Micol Marchetti-Bowick CS229 Final Project December 11, 2009 1 MOTIVATION Many human neurological and cognitive disorders are caused

More information

Abstract. Optimization strategy of Copy Number Variant calling using Multiplicom solutions APPLICATION NOTE. Introduction

Abstract. Optimization strategy of Copy Number Variant calling using Multiplicom solutions APPLICATION NOTE. Introduction Optimization strategy of Copy Number Variant calling using Multiplicom solutions Michael Vyverman, PhD; Laura Standaert, PhD and Wouter Bossuyt, PhD Abstract Copy number variations (CNVs) represent a significant

More information

Contents. Just Classifier? Rules. Rules: example. Classification Rule Generation for Bioinformatics. Rule Extraction from a trained network

Contents. Just Classifier? Rules. Rules: example. Classification Rule Generation for Bioinformatics. Rule Extraction from a trained network Contents Classification Rule Generation for Bioinformatics Hyeoncheol Kim Rule Extraction from Neural Networks Algorithm Ex] Promoter Domain Hybrid Model of Knowledge and Learning Knowledge refinement

More information

How to Create Better Performing Bayesian Networks: A Heuristic Approach for Variable Selection

How to Create Better Performing Bayesian Networks: A Heuristic Approach for Variable Selection How to Create Better Performing Bayesian Networks: A Heuristic Approach for Variable Selection Esma Nur Cinicioglu * and Gülseren Büyükuğur Istanbul University, School of Business, Quantitative Methods

More information

Discrimination and Generalization in Pattern Categorization: A Case for Elemental Associative Learning

Discrimination and Generalization in Pattern Categorization: A Case for Elemental Associative Learning Discrimination and Generalization in Pattern Categorization: A Case for Elemental Associative Learning E. J. Livesey (el253@cam.ac.uk) P. J. C. Broadhurst (pjcb3@cam.ac.uk) I. P. L. McLaren (iplm2@cam.ac.uk)

More information

An Improved Algorithm To Predict Recurrence Of Breast Cancer

An Improved Algorithm To Predict Recurrence Of Breast Cancer An Improved Algorithm To Predict Recurrence Of Breast Cancer Umang Agrawal 1, Ass. Prof. Ishan K Rajani 2 1 M.E Computer Engineer, Silver Oak College of Engineering & Technology, Gujarat, India. 2 Assistant

More information

Data mining for Obstructive Sleep Apnea Detection. 18 October 2017 Konstantinos Nikolaidis

Data mining for Obstructive Sleep Apnea Detection. 18 October 2017 Konstantinos Nikolaidis Data mining for Obstructive Sleep Apnea Detection 18 October 2017 Konstantinos Nikolaidis Introduction: What is Obstructive Sleep Apnea? Obstructive Sleep Apnea (OSA) is a relatively common sleep disorder

More information

DIABETIC RISK PREDICTION FOR WOMEN USING BOOTSTRAP AGGREGATION ON BACK-PROPAGATION NEURAL NETWORKS

DIABETIC RISK PREDICTION FOR WOMEN USING BOOTSTRAP AGGREGATION ON BACK-PROPAGATION NEURAL NETWORKS International Journal of Computer Engineering & Technology (IJCET) Volume 9, Issue 4, July-Aug 2018, pp. 196-201, Article IJCET_09_04_021 Available online at http://www.iaeme.com/ijcet/issues.asp?jtype=ijcet&vtype=9&itype=4

More information

Classifica4on. CSCI1950 Z Computa4onal Methods for Biology Lecture 18. Ben Raphael April 8, hip://cs.brown.edu/courses/csci1950 z/

Classifica4on. CSCI1950 Z Computa4onal Methods for Biology Lecture 18. Ben Raphael April 8, hip://cs.brown.edu/courses/csci1950 z/ CSCI1950 Z Computa4onal Methods for Biology Lecture 18 Ben Raphael April 8, 2009 hip://cs.brown.edu/courses/csci1950 z/ Binary classifica,on Given a set of examples (x i, y i ), where y i = + 1, from unknown

More information

Wen et al. (1998) PNAS, 95:

Wen et al. (1998) PNAS, 95: Large-scale temporal gene expression mapping of central nervous system development Fluctuations in mrna expression of 2 genes during rat central nervous system development, focusing on the cervical spinal

More information

Neuroinformatics. Ilmari Kurki, Urs Köster, Jukka Perkiö, (Shohei Shimizu) Interdisciplinary and interdepartmental

Neuroinformatics. Ilmari Kurki, Urs Köster, Jukka Perkiö, (Shohei Shimizu) Interdisciplinary and interdepartmental Neuroinformatics Aapo Hyvärinen, still Academy Research Fellow for a while Post-docs: Patrik Hoyer and Jarmo Hurri + possibly international post-docs PhD students Ilmari Kurki, Urs Köster, Jukka Perkiö,

More information

Gene Expression Based Leukemia Sub Classification Using Committee Neural Networks

Gene Expression Based Leukemia Sub Classification Using Committee Neural Networks Bioinformatics and Biology Insights M e t h o d o l o g y Open Access Full open access to this and thousands of other papers at http://www.la-press.com. Gene Expression Based Leukemia Sub Classification

More information

Feature selection methods for early predictive biomarker discovery using untargeted metabolomic data

Feature selection methods for early predictive biomarker discovery using untargeted metabolomic data Feature selection methods for early predictive biomarker discovery using untargeted metabolomic data Dhouha Grissa, Mélanie Pétéra, Marion Brandolini, Amedeo Napoli, Blandine Comte and Estelle Pujos-Guillot

More information

Classification of Epileptic Seizure Predictors in EEG

Classification of Epileptic Seizure Predictors in EEG Classification of Epileptic Seizure Predictors in EEG Problem: Epileptic seizures are still not fully understood in medicine. This is because there is a wide range of potential causes of epilepsy which

More information

A Comparison of Collaborative Filtering Methods for Medication Reconciliation

A Comparison of Collaborative Filtering Methods for Medication Reconciliation A Comparison of Collaborative Filtering Methods for Medication Reconciliation Huanian Zheng, Rema Padman, Daniel B. Neill The H. John Heinz III College, Carnegie Mellon University, Pittsburgh, PA, 15213,

More information

Evolutionary Programming

Evolutionary Programming Evolutionary Programming Searching Problem Spaces William Power April 24, 2016 1 Evolutionary Programming Can we solve problems by mi:micing the evolutionary process? Evolutionary programming is a methodology

More information

Network Analysis of Toxic Chemicals and Symptoms: Implications for Designing First-Responder Systems

Network Analysis of Toxic Chemicals and Symptoms: Implications for Designing First-Responder Systems Network Analysis of Toxic Chemicals and Symptoms: Implications for Designing First-Responder Systems Suresh K. Bhavnani 1 PhD, Annie Abraham 1, Christopher Demeniuk 1, Messeret Gebrekristos 1 Abe Gong

More information

Mammogram Analysis: Tumor Classification

Mammogram Analysis: Tumor Classification Mammogram Analysis: Tumor Classification Literature Survey Report Geethapriya Raghavan geeragh@mail.utexas.edu EE 381K - Multidimensional Digital Signal Processing Spring 2005 Abstract Breast cancer is

More information

Reactive agents and perceptual ambiguity

Reactive agents and perceptual ambiguity Major theme: Robotic and computational models of interaction and cognition Reactive agents and perceptual ambiguity Michel van Dartel and Eric Postma IKAT, Universiteit Maastricht Abstract Situated and

More information

Motivation: Fraud Detection

Motivation: Fraud Detection Outlier Detection Motivation: Fraud Detection http://i.imgur.com/ckkoaop.gif Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1) 2 Techniques: Fraud Detection Features Dissimilarity Groups and

More information

Machine Learning to Inform Breast Cancer Post-Recovery Surveillance

Machine Learning to Inform Breast Cancer Post-Recovery Surveillance Machine Learning to Inform Breast Cancer Post-Recovery Surveillance Final Project Report CS 229 Autumn 2017 Category: Life Sciences Maxwell Allman (mallman) Lin Fan (linfan) Jamie Kang (kangjh) 1 Introduction

More information

Intelligent Control Systems

Intelligent Control Systems Lecture Notes in 4 th Class in the Control and Systems Engineering Department University of Technology CCE-CN432 Edited By: Dr. Mohammed Y. Hassan, Ph. D. Fourth Year. CCE-CN432 Syllabus Theoretical: 2

More information

Bayesian and Frequentist Approaches

Bayesian and Frequentist Approaches Bayesian and Frequentist Approaches G. Jogesh Babu Penn State University http://sites.stat.psu.edu/ babu http://astrostatistics.psu.edu All models are wrong But some are useful George E. P. Box (son-in-law

More information

OUTLIER DETECTION : A REVIEW

OUTLIER DETECTION : A REVIEW International Journal of Advances Outlier in Embedeed Detection System : A Review Research January-June 2011, Volume 1, Number 1, pp. 55 71 OUTLIER DETECTION : A REVIEW K. Subramanian 1, and E. Ramraj

More information

Application of Tree Structures of Fuzzy Classifier to Diabetes Disease Diagnosis

Application of Tree Structures of Fuzzy Classifier to Diabetes Disease Diagnosis , pp.143-147 http://dx.doi.org/10.14257/astl.2017.143.30 Application of Tree Structures of Fuzzy Classifier to Diabetes Disease Diagnosis Chang-Wook Han Department of Electrical Engineering, Dong-Eui University,

More information

Modeling Sentiment with Ridge Regression

Modeling Sentiment with Ridge Regression Modeling Sentiment with Ridge Regression Luke Segars 2/20/2012 The goal of this project was to generate a linear sentiment model for classifying Amazon book reviews according to their star rank. More generally,

More information

Data complexity measures for analyzing the effect of SMOTE over microarrays

Data complexity measures for analyzing the effect of SMOTE over microarrays ESANN 216 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges (Belgium), 27-29 April 216, i6doc.com publ., ISBN 978-2878727-8. Data complexity

More information

ANN predicts locoregional control using molecular marker profiles of. Head and Neck squamous cell carcinoma

ANN predicts locoregional control using molecular marker profiles of. Head and Neck squamous cell carcinoma ANN predicts locoregional control using molecular marker profiles of Head and Neck squamous cell carcinoma Final Project: 539 Dinesh Kumar Tewatia Introduction Radiotherapy alone or combined with chemotherapy,

More information

MOST: detecting cancer differential gene expression

MOST: detecting cancer differential gene expression Biostatistics (2008), 9, 3, pp. 411 418 doi:10.1093/biostatistics/kxm042 Advance Access publication on November 29, 2007 MOST: detecting cancer differential gene expression HENG LIAN Division of Mathematical

More information

Data Mining in Bioinformatics Day 7: Clustering in Bioinformatics

Data Mining in Bioinformatics Day 7: Clustering in Bioinformatics Data Mining in Bioinformatics Day 7: Clustering in Bioinformatics Karsten Borgwardt February 21 to March 4, 2011 Machine Learning & Computational Biology Research Group MPIs Tübingen Karsten Borgwardt:

More information

Case Studies on High Throughput Gene Expression Data Kun Huang, PhD Raghu Machiraju, PhD

Case Studies on High Throughput Gene Expression Data Kun Huang, PhD Raghu Machiraju, PhD Case Studies on High Throughput Gene Expression Data Kun Huang, PhD Raghu Machiraju, PhD Department of Biomedical Informatics Department of Computer Science and Engineering The Ohio State University Review

More information