Tissue Classification Based on Gene Expression Data


Chapter 6: Tissue Classification Based on Gene Expression Data

Many diseases result from complex interactions involving numerous genes. Previously, such gene interactions were commonly studied separately. Since this approach is restricted to a relatively small number of genes, many important relationships have probably remained undiscovered. This, however, is likely to change with the invention of microarray technologies. Microarrays offer a totally new perspective on the study of complex diseases because of their capability for monitoring whole networks of genes. Cancer researchers especially hope to gain new insights into the development and characteristics of tumours (Liotta & Petricoin 2000). As the cause of cancer is generally linked to complex interactions of several genes, only the simultaneous monitoring of many genes may reveal the crucial underlying biological mechanisms. For this reason, microarrays have quickly become a major focus of research activities in the study of complex diseases. The medical applications of these 'arrays of hope' (Lander 1999) are various and include the identification of markers for classification, diagnosis, disease outcome prediction, therapeutic responsiveness and target identification. For medical applications, systems have to show high performance and the capacity to explain their decisions. We show in this chapter that both requirements are met by EFuNNs for the basic task of tissue classification based on gene expression data. EFuNNs facilitate the extraction of cluster-based rules that constitute gene expression profiles of a disease. This is a further step towards knowledge discovery from microarray data.

6.1 Introduction

Recent microarray studies of cancer have shown that gene expression data can supplement previous methods of phenotyping cancers, which rely on a limited set of histological and pathological features (Golub et al. 1999, Bittner et al. 2000, Khan et al. 2001). These traditional methods depend heavily on the experience of physicians to reach the correct decision. They are further limited since tumours with similar histopathological appearances can show very different responses to the same treatment. Any method that would facilitate the decision process for physicians is highly desirable. Different data analysis techniques have been used for the classification of cancer tissue samples. The application of unsupervised classification demonstrated the possibility of discovering new subclasses of cancer (Alizadeh et al. 2000, Bittner et al. 2000). The supervised classification of cancer samples was achieved using statistical as well as machine learning techniques (Golub et al. 1999, Khan et al. 2001, Shipp et al. 2002). Altogether, microarray techniques have been shown to be valuable and powerful candidates for enhancing decision-making systems in the medical field. A major problem, however, remains the extraction of knowledge from trained classification systems. Statistical methods are often highly model-dependent and include prior assumptions about the data. Machine learning techniques, such as artificial neural networks (ANNs), present a more flexible, model-free approach to classification and frequently yield good results. As black box methods, however, they lack the power to explain their decision processes, which is necessary for any widespread application in the medical field. Crucial decisions about disease treatment demand that the classification process is transparent. Physicians need to understand how a classifier reaches its judgement, since the final responsibility for any decision in the course of treatment remains with them.
A possible scenario is presented in figure 6.1. Both applied algorithms achieve optimal classification of the existing examples, but their classifications of new cases differ. In the case of black box classifiers, this is problematic for the physician, who has to decide which classifier to believe. Deciding for or against a given classifier's prediction remains difficult, since no explanation of the classification is provided by black box classifiers. In fact, any classification should be avoided in the presented example, since the new samples (C1 and C2 in the illustration) are fairly different from all the previous samples. For black box methods, however, it is difficult to decide whether the classification demands further validation. This example shows how important it is that a classifier can indicate which genes are crucial for its decision process. If, for new cases, the expression values of these genes lie outside the range of those of previous samples, an independent verification of the predictions should be considered. For any crucial medical decision, it is important to have insight into the classification process to determine the reliability and limits of the predictions. Furthermore, medical researchers could profit from classifiers that can comprehensibly present which combinations of genes are indicative of certain disease types. Even if the symptoms of the disease are the same, the original cause could differ, e.g. a

single pathway can be interrupted at different locations. These reasons motivated us to use knowledge-based neurocomputation (KBN) for microarray data analysis. KBN is an attractive paradigm for medical applications, since it tries to construct classification systems which are able to represent the acquired knowledge in a comprehensible way. KBN has been used for the analysis of DNA and RNA data (Towell & Shavlik 1994), but has not so far been employed for microarray data analysis. In particular, we have applied evolving connectionist systems (ECOS) as a version of KBN (Kasabov 2001). For the analysis of gene expression data we used evolving fuzzy neural networks (EFuNNs), which are an implementation of ECOS. Based on fuzzy theory, EFuNNs allow for fuzzy rule extraction and adaptation (Kasabov 2001). Using fuzzy rules may be advantageous, since microarray data contain a large noise component and crisp concepts remain difficult to define.

Figure 6.1: Problem of extrapolation in classification. (a) Global partitioning: two classifiers (I, II) are trained on the sample classes A and B, with their decision boundaries represented by the solid and dashed lines (in the gene expression space), i.e. the new samples (C1 and C2) are classified as samples of class B by classifier I (solid line) and as samples of class A by classifier II (dashed line). The classification of the new samples C1 and C2 (of unknown classes) is problematic, since they are different from previously seen examples. (b) Local partitioning: samples are classified only when they fall into clusters found in the training set. New samples (C1, C2) with a distinct expression profile lie outside known clusters and may constitute new clusters of unknown class. Further clinical tests would be advisable in this case. Local partitioning avoids the problem of extrapolation.
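The rejection behaviour of local partitioning (figure 6.1b) can be sketched in a few lines. The function, toy data and radius below are purely illustrative assumptions, not part of the EFuNN implementation discussed later:

```python
import numpy as np

def classify_local(x, centres, labels, radius):
    """Assign x the label of the nearest cluster centre, but refuse to
    classify (return None) if x lies outside every cluster's radius."""
    d = np.linalg.norm(centres - x, axis=1)
    i = np.argmin(d)
    return labels[i] if d[i] <= radius else None

# Two well-separated training clusters (classes A and B)
centres = np.array([[0.0, 0.0], [10.0, 10.0]])
labels = ["A", "B"]

print(classify_local(np.array([0.5, -0.5]), centres, labels, radius=2.0))  # A
print(classify_local(np.array([5.0, 5.0]), centres, labels, radius=2.0))   # None
```

A sample such as C1 or C2, far from all training clusters, falls outside every radius and is deliberately left unclassified instead of being extrapolated.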
While previous studies have focused mainly on the performance of particular classifiers, we present here a more holistic methodology for tissue classification based on microarray data. We see the classification algorithm as an integral part of a larger information system for the analysis of

microarray data. The main phases of information processing in such systems are the following:

1. Feature analysis and feature extraction: defining which features (genes) are most relevant and should therefore be used when creating a model for a particular problem (e.g. classification, prediction, decision making).

2. Modelling the problem: defining inputs, outputs and the type of model (e.g. probabilistic, rule-based, connectionist), training the model, etc.

3. Knowledge discovery: new knowledge is gained through analysis of the modelling results and of the model itself.

To achieve good performance, all these phases have to be integrated and optimised.

The structure of this chapter is as follows. First, related methods for tissue sample classification based on microarray data are reviewed. We discuss the major difficulties of this task and develop a methodology for microarray-based classification. This leads to a description of KBN, and in particular EFuNNs, in the context of the challenges posed by microarray data. Several novel modifications are introduced to adapt EFuNNs to these challenges. The proposed methodology for classification and knowledge discovery is illustrated in the case studies of leukaemia and colon cancer. Rules for classification are extracted from the trained EFuNN and visualised. In the last section, we discuss the performance and describe directions for future research.

6.2 Review of Related Methods and Systems

Golub et al. were among the first to show that it is possible to classify tumours based on their gene expression profiles (Golub et al. 1999). Comparing acute lymphoblastic leukaemia (ALL) and acute myeloid leukaemia (AML), they were able to find a set of 50 differentially expressed genes based on the signal-to-noise ratio (equation 7.1). These genes were used to classify new tissue samples with high accuracy. For the classification, Golub et al.
used a weighted voting approach that sums the separate predictions based on each selected gene. The prediction of each gene was weighted by its capacity to discriminate between ALL and AML. This approach offers the advantage of easy identification of the influence of each gene on the overall prediction. However, it is limited in discovering combinations of genes that may be more powerful for classification. Using a threshold for the overall prediction strength, some classifications can be rejected and thus the risk of misclassification can be reduced. Several alternative classification schemes have been applied to microarray data derived from tumour tissue. Khan et al. employed artificial neural networks to successfully classify four diagnostically distinct categories of small, round blue-cell tumours (Khan et al. 2001). Correct diagnosis of these tumours is crucial, given that distinct clinical decisions are required regarding the selection

of drugs, some of which can have dangerous side effects. The core of the applied classification model is fairly simple, using four parallel perceptrons (one for each tumour category). However, Khan et al. enhanced their model by several pre- and post-processing steps. First, they included only genes that were well expressed across all samples, reducing the number of selected genes. By applying PCA, they further reduced the dimensionality of the input for the neural networks. Using ten principal components and random splits of the original data set into training and validation sets, Khan et al. generated an ensemble of trained networks. To rank the genes, they applied the sensitivity measure for neural networks (i.e. the partial derivative of the network output with respect to a single input feature). Selecting only the top-ranked genes, they repeated the construction and training of the ensemble. By this procedure (backward feature selection), they were able to optimise the number of genes for input. Further, the generation of an ensemble allows the definition of a confidence interval for the prediction. Khan et al. used random splits to generate new data sets in their study; a similar, but statistically more developed, method that allows the definition of confidence intervals for outputs is the bootstrap method (Efron 1982). Using confidence intervals, Khan et al. were able to reject samples in an independent test set that differed from the four tumour categories of interest. Clustering approaches were used for classification in two studies. Ben-Dor et al. compared classification based on the CAST clustering algorithm with other supervised classification schemes (k-nearest neighbour, SVM, AdaBoost) (Ben-Dor et al. 2000). To use clustering for the classification of tissue samples, Ben-Dor et al. adjusted the clustering parameters so that the resulting cluster model was compatible with the tissue classes in the training set.
New samples were classified based on the constructed cluster model. The genes were selected by counting the minimal number of misclassifications that occurred for a given expression threshold. This so-called threshold number of misclassifications (TNOM) is optimal if the threshold can separate the two tissue classes considered. The performance of the four tested methods was strongly dependent on the data sets. No method appeared to be consistently superior for both data sets analysed. Moler et al. used the clustering of tissue samples to detect structures for gene selection (Moler et al. 2000). Instead of focusing on genes that discriminate between tissue classes, they selected genes which are differentially expressed in the detected clusters. For selection, they used a probabilistic approach termed naive Bayesian relevance. Based on the assumption that the distribution of expression values is Gaussian, the naive Bayesian relevance indicates how well the data fit the model. However, this assumption lacks justification. The final classification was performed using SVMs. Interestingly, they observed an improved overall accuracy for gene selection based on detected clusters compared to gene selection based on the original tissue classes. This observation corresponds to the study of Keller et al., in which the authors applied Bayesian classifiers (Keller et al. 2000). They achieved a higher accuracy if the tissue classes were stratified into subclasses. Further, they based their classifier on a likelihood model to select genes for multi-class

problems. Drawbacks of their approach are the explicit assumptions that genes are expressed independently of each other and that their expression follows a Gaussian distribution. While the first assumption is clearly biologically wrong, the second has not been verified. In many studies, gene selection is separated from the subsequent classification process. An alternative is to incorporate the selection into the classification process. For example, starting with only a small set of genes for discrimination, further genes are included only if they contribute to the classification. Zhang et al. used decision trees to select genes sequentially (Zhang et al. 2001). Starting with one gene and finding the optimal threshold of its expression value for separating the tissue classes, more genes are added to the model one at a time. This recursive partitioning has the advantage of selecting a combination of genes with little redundancy, since it favours the selection of mutually uncorrelated genes. However, sequential feature selection may easily become trapped in a local optimum. Xiong et al. applied a feature-wrapping approach exploiting a sequential forward search technique for gene selection (Xiong et al. 2001). Using linear discriminant analysis, logistic regression and SVMs, they reported that only a few genes are needed to achieve high classification accuracy, while including more genes can lead to a dramatically reduced performance. Similarly, Bo and Jonassen used wrapper approaches for classification (Bo & Jonassen 2002). To decrease the computational demands of this method, they searched for independently discriminative pairs of genes. The selected set of paired genes was used as a new feature set for classification. Bo and Jonassen observed that this selection method outperformed selection based on the ranking of single genes.
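As an illustration of the wrapper idea, a greedy sequential forward search can be sketched as follows. The nearest-centroid classifier and leave-one-out scoring used here are simplifying assumptions for the sketch, not the models used in the cited studies:

```python
import numpy as np

def nearest_centroid_acc(X, y, feats):
    """Leave-one-out accuracy of a nearest-centroid classifier on the
    given feature (gene) subset -- a deliberately cheap wrapped model."""
    Xs = X[:, feats]
    correct = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i          # hold out sample i
        c0 = Xs[mask & (y == 0)].mean(axis=0)  # class-0 centroid
        c1 = Xs[mask & (y == 1)].mean(axis=0)  # class-1 centroid
        pred = 0 if np.linalg.norm(Xs[i] - c0) < np.linalg.norm(Xs[i] - c1) else 1
        correct += (pred == y[i])
    return correct / len(y)

def forward_select(X, y, k):
    """Greedily add, one at a time, the gene that most improves
    the wrapper's leave-one-out accuracy."""
    selected = []
    for _ in range(k):
        rest = [j for j in range(X.shape[1]) if j not in selected]
        best = max(rest, key=lambda j: nearest_centroid_acc(X, y, selected + [j]))
        selected.append(best)
    return selected
```

The greedy step is exactly what makes the method prone to local optima: once a gene is added it is never reconsidered, even if a different combination would score better.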
Two recently introduced classification models aim at utilising global structures in the data as input for the classification system. Hastie et al. applied hierarchical clustering to reduce redundancy in the data and to increase the robustness of the system (Hastie et al. 2001). Instead of using the expression values of single genes for classification, they used the average expression values in detected gene clusters. New tissue samples were classified based on a regression model, which included quadratic terms to model possible interactions between the gene clusters. West et al. used singular value decomposition to separate breast tumour samples after selecting the 100 genes which correlated most strongly with the tissue classes (West et al. 2001). A Bayesian regression model was fitted to the resulting factors, allowing the definition of confidence intervals for the classifications. A different approach for the classification of microarray data was presented by Califano et al. (Califano et al. 2000). The basic principle is to find consensus patterns that are over-represented in one phenotype class while being under-represented in the other class. Patterns were defined by intervals of pre-defined width for gene expression across samples in a class. To detect these patterns, Califano et al. modified a pattern-finding algorithm previously developed for sequences. By comparing the occurrence of observed expression patterns with their occurrence in randomised data, they could assess the statistical significance of the patterns. Significant patterns were used for classification. On real

data, this method performed comparably to SVMs, whereas the method used by Califano et al. outperformed SVMs on artificially generated data. Several approaches used ensembles of classifiers for tissue sample classification (Ben-Dor et al. 2000, Dudoit et al. 2000a, Ryu & Cho 2002); we discuss them in further detail in a later section. The majority of the classification methods reviewed focus on the discrimination of two tissue classes. A computationally more challenging problem is multi-class diagnosis. Ramaswamy et al. applied several supervised learning methods for the classification of 14 distinct tumour types (Ramaswamy et al. 2001). The accuracy achieved far exceeded the accuracy expected from random classification. The classification of poorly differentiated cancer, however, resulted in a low accuracy. Notably, the performance of the SVMs (which achieved the highest accuracy) increased steadily with the number of genes (with maximum accuracy when all genes were included), whereas k-NN and weighted voting showed a decreased performance when more than 10 genes were included. Most systems presented here aim primarily at a high classification performance and the identification of marker genes, while the discovery of more complex biological knowledge is frequently neglected. One of the few computational approaches focussing on the direct extraction of knowledge from microarray data is the recently introduced rule-based system called GABRIEL (Genetic Analysis by Rule Incorporating Expert Logic) (Pan et al. 2002a). Expert knowledge can be implemented in GABRIEL in the form of rules. Meaningful patterns in microarray data are discovered by applying genetic algorithms, and translated into structural knowledge. The knowledge contained in the system can be extracted in the form of inference rules. Re-analysing publicly available microarray data, GABRIEL found patterns previously undiscovered by human experts.
This result demonstrated the benefits of knowledge-based systems in the analysis of gene expression data.

6.3 Challenges in Classifying Microarray Data

The classification of tissue samples based on microarray experiments faces several major challenges: (i) Microarray data contain a high level of noise due to experimental procedures and biological heterogeneity (Schuchhardt et al. 2000, Hughes et al. 2000). (ii) Microarray data are sparse: experiments usually include a large number of genes (features), but only a small number of samples (Golub et al. 1999, Alon et al. 1999). (iii) Many genes are highly correlated, which leads to redundancy in the data (Eisen et al. 1998). (iv) The tissue samples may be heterogeneous in composition, i.e. different types of tissue can be found in a single sample. This heterogeneity can cloud the separation of the tissue classes of interest, e.g. cancer and normal (Alon et al. 1999). (v) Finally, tissue samples themselves may be wrongly classified by physicians, since the results of standard tests can be contradictory or inconclusive (West et al. 2001). All these reasons make the classification of microarray data difficult. Here we discuss the first three of these challenges in microarray data analysis, followed by an outline of our methodology for overcoming them.

Microarray data inherit large experimental and biological variances

Experimental procedures such as tissue handling, RNA extraction and amplification or hybridisation introduce variances and biases in the measured expression levels. Some oligonucleotide and cDNA probes may not be specific to one gene, and several different genes may hybridise to the same probes ('cross-hybridisation'). Labelling cDNA and scanning the slides frequently show non-linear behaviour and introduce systematic errors (chapter four). With data quality and normalisation methods currently improving, many of the sources of noise may be reduced in the near future. However, as gene expression is highly variable even in different samples of the same tissue, classifiers for microarray data will have to be robust to a high degree of inherent variance.

Microarray data are sparse

A second major challenge is the high dimensionality of the input space. In present microarray experiments, several thousand genes are monitored, while the number of samples is often restricted to hundreds or fewer. It is well known in pattern recognition that classifiers frequently yield poor performance when the number of examples is small compared to the number of features. A large input space usually results in a large number of parameters in the model. Since the number of examples is small, the model parameters are often insufficiently determined.

Microarray data are highly redundant

Many genes are strongly correlated because of the high connectivity in the genetic network within the cell. Groups of genes often share the same expression pattern if they belong to a specific pathway. Adding co-expressed genes to the classification system does not increase the information available to the system. Furthermore, many genes on microarrays do not show any change in expression across the experiment.
This is because present microarrays are usually not selective in the genes used as probes. The selection of genes to be included in a microarray experiment is often based on their availability rather than on their functionality. Microarray data therefore tend to have a high degree of redundancy.

6.4 A Methodology for Gene Expression Profiling of Diseases

In this section we address the issue of feature selection and outline a methodology for profiling a disease based on gene expression data, which we have applied in the case studies of leukaemia and colon cancer (Reeve et al. 2002, Futschik et al. 2003).

Gene selection

Feature selection aims at improving classification by excluding irrelevant features and thereby reducing the size of the input space. Features are excluded if they do not contribute, or contribute only weakly, to the classification. In our case, many genes do not change their expression between

classes. Including these genes introduces noise and may yield a poorer classification performance. We wish to select only those genes that can serve as a good basis for discrimination between classes and, thus, for classification. Ideally, we would like to find the intrinsic dimensionality of the data, thereby minimising the redundancy in the data. A data set in a D-dimensional space is said to have intrinsic dimensionality d if the entire data set lies within a d-dimensional subspace (Bishop 1995). We can expect that different tissue classes have a distinct intrinsic dimensionality, i.e. a different number of characteristic genes. Each class of tissue may have a defined set of genes, the expression levels of which are specific for that class. These sets of genes need not overlap, i.e. a gene may be highly correlated with a single cancer subtype, but not correlated with other subtypes. Further, in selecting an optimal number of genes we have to balance the simplicity of the model against its robustness. Reducing the number of input features leads to a decreased number of model parameters. However, as genes show a high variability in their expression, the classification should not depend on too few genes. Minimising the number of genes based on a specific data set might produce a biased gene selection that is optimal only for the data studied, but not for the tissue type in general. The incorporation of feature selection into a classification system can be done in two different ways. First, feature selection can be treated separately from the classification model. Features are selected with respect to predefined criteria such as Pearson correlation. This approach often has the advantage of being computationally inexpensive and easy to apply. The selected genes, however, are frequently highly correlated with each other and, thus, likely to be redundant.
Alternatively, the selection of features is determined by the classification model itself, since an optimal set of features depends on the choice of classifier. This constitutes an integrated approach. Unfortunately, the large number of genes in a microarray experiment prohibits any exhaustive search for optimal sets of genes, especially if complex classifiers like ANNs are used. Sequential search techniques may still be practical, but frequently find only local optima. Additionally, they are rather sensitive to noise. In this study, we sought to combine the advantages of separate and integrated feature selection. First, we used filtering and correlation analysis to find a set of single classifying genes, i.e. genes which are independently good features for discrimination. This gene selection step is separated from the classification model. To select more complex features, such as groups of genes defining profiles of a disease, we then applied rule extraction to trained EFuNN models. Finding complex expression profiles may better accommodate the complexity of the underlying genetic networks and enable us to reveal the fine structures of disease classes. The extracted rules point to groups of genes that can be more indicative of a particular tissue type than single marker genes. The rules are further compacted using a novel method termed EFuNN shaving, which minimises the set of genes for the whole EFuNN model based on the classification performance. Thus, this step can be regarded as an integrated approach to gene selection.
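The separate (filter-style) selection step can be illustrated with a Pearson correlation ranking against a binary class label. This is a generic sketch under the assumption of a two-class problem; it is not the exact filter used in the case studies:

```python
import numpy as np

def rank_genes_by_correlation(X, y, top_k):
    """Rank genes by |Pearson correlation| between their expression
    (columns of X) and the binary class label y (0.0/1.0); return the
    indices of the top_k genes -- a simple filter-style selection."""
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    return np.argsort(-np.abs(r))[:top_k]

# Synthetic example: only gene 2 actually differs between the classes
rng = np.random.default_rng(1)
y = np.repeat([0.0, 1.0], 15)
X = rng.normal(size=(30, 6))
X[:, 2] += y * 3.0
print(rank_genes_by_correlation(X, y, 3))  # gene 2 should rank first
```

As noted above, a filter like this is cheap but blind to redundancy: two perfectly co-expressed genes would both rank highly even though the second adds no information.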

Methodology

The outline of the complete methodology that we introduce in this study is as follows:

1. Normalisation: scaling of the intensities enables the comparison of expression values between different microarrays within an experiment.

2. Pre-processing: filtering aims at eliminating constantly expressed genes across the tissue classes. Log-transformation helps to balance the range of the data.

3. Feature selection: a set of significantly differentially expressed genes across the classes is selected for clustering and classification.

4. Clustering or unsupervised classification: grouping of tissue samples with similar expression patterns reveals preliminary profiles of cancer samples.

5. Supervised classification: the tissue classes are modelled using classifiers. The model parameters are optimised and the resulting structure of the model analysed.

6. Knowledge discovery: the extraction of rules results in complex expression profiles. The rules represent the fine grades of the common expression levels for discriminative sets of genes. By using thresholds, smaller or larger groups can be investigated (Kasabov 2001).

The details of each step are described in section 6.8 and in references (Reeve et al. 2002, Futschik et al. 2003).

6.5 Knowledge Representation in KBN

Data-driven research can be performed by a variety of methods. One difficulty, however, is that many approaches are based on a number of assumptions about the analysed data. For example, the weighted voting method by Golub et al. assigned single genes values reflecting their importance within the classification, assuming genes are independent variables (Golub et al. 1999). It is well known, however, that genes are highly correlated and that most functions within cells are carried out by several interacting genes. This should be reflected in the structure of a classifier.
One possibility may be the construction of more complex statistical models, which would be rather difficult since relatively little is known about the genetic networks in cells. Another approach is the use of flexible machine learning methods like ANNs, which do not require assumptions about the data distribution. They can model highly non-linear tasks and have proven very applicable when prior knowledge is minimal. The disadvantage is the black box character of ANNs. To overcome this drawback without losing the computational power of ANNs, we applied KBN for the classification of cancer tissues using microarray data. An introduction to KBN can be found in an earlier section. KBN has been applied successfully to classify DNA and RNA sequences (Towell & Shavlik 1994). Towell and Shavlik introduced the KBANN

algorithm to encode and refine expert knowledge. While this is practical in sequence analysis, where knowledge has been accumulated over the past decades, this is not the case for gene regulation on the genomic scale. The study of the complex interactions and the various functions of genes is still in its infancy. The main goals of KBN in microarray data analysis are (1) to model non-linear associations between many genes and the diseases studied and (2) to extract knowledge, possibly in the form of rules.

6.6 Modelling and Knowledge Discovery by EFuNNs

For the purpose of tissue classification and knowledge discovery, we used EFuNNs to re-analyse the leukaemia and colon cancer microarray data sets. The original EFuNN algorithm is described in an earlier section. Here, we discuss its features in relation to the challenges posed by gene expression data and introduce several novel modifications to adjust the EFuNN algorithm to these challenges. We used fuzzy neural networks to study gene expression for the following reasons:

- Microarray data have a large noise component. Fuzzy classifiers have been shown to be robust for this kind of data (Kasabov 1996).

- The variability of the data is large. Binning gene expression values into crisp categories such as 'high' or 'low' may lead to oversimplification. The use of fuzzy membership functions is more appropriate.

- As little is known about gene expression on the genomic scale, defining crisp concepts is difficult. Fuzzy variables may be useful.

Fuzzy logic is less sensitive to imprecise data than classical Boolean logic, since in fuzzy logic it is not necessary to set hard thresholds between concepts such as 'gene i has medium expression' and 'gene i has high expression'. A given gene expression value can be partially labelled as both 'medium' and 'high'. This labelling is defined by a set of membership functions µ_i that represent a mapping of crisp values to fuzzy labels (e.g. 'low', 'medium', 'high'). Figure 6.2 illustrates this concept.
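The mapping of a crisp expression value onto overlapping triangular membership functions can be sketched as follows. The membership centres are illustrative assumptions; the example value 0.79 with centres at 0.0, 0.5 and 1.0 reproduces the 0.58/0.42 split shown in figure 6.2:

```python
import numpy as np

def triangular_memberships(g, centres):
    """Map a crisp expression value g onto fuzzy labels defined by
    triangular membership functions centred at `centres` (sorted).
    With adjacent, overlapping triangles the degrees sum to 1."""
    mu = np.zeros(len(centres))
    if g <= centres[0]:
        mu[0] = 1.0
    elif g >= centres[-1]:
        mu[-1] = 1.0
    else:
        i = np.searchsorted(centres, g) - 1   # interval containing g
        span = centres[i + 1] - centres[i]
        mu[i + 1] = (g - centres[i]) / span   # closeness to upper centre
        mu[i] = 1.0 - mu[i + 1]               # remainder to lower centre
    return mu

# 'low', 'medium', 'high' centred at 0.0, 0.5 and 1.0
mu = triangular_memberships(0.79, np.array([0.0, 0.5, 1.0]))
print(mu)  # degrees for ('low', 'medium', 'high'); they sum to 1
```

Because at most two adjacent triangles are non-zero and their degrees are complements, the sum-to-one property holds for any crisp input value.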
The rule nodes in EFuNNs form a local representation of the domain space. This feature contrasts with MLPs, which constitute a distributed representation of the input examples through the activation values in the hidden layer. In an MLP, hidden nodes with sigmoid activation functions span hyperplanes in the domain space. A new example is classified based only on its position with respect to these hyperplanes. This is problematic in the case of microarray data, as the maximum distance to previous examples cannot be controlled. To avoid crucial misclassifications, however, examples should not be classified at all if their expression profiles are distinct from those used for training. This can easily be achieved by a classifier with a local representation of the domain space by setting appropriate activation thresholds. New examples with distinct gene

12 Degree of membership Min Low Medium Gene expression g ** g * High Max Figure 6.2: By using triangular membership functions a crisp gene expression value g can be mapped into fuzzy labels. In the above example, g has a degree of membership of 0.58 for highly expressed and 0.42 for medium expressed. Classical set theory is based on crisp thresholds (dashed lines) for membership. These crisp thresholds can result that two similarly expressed genes are put in different categories (e.g. g would be labelled as highly expressed and g as medium expressed ) making classical logic more sensitive to noise. expression profiles from those of previous examples stay below the threshold and remain unclassified. A favourable quality of EFuNNs is the ability to map the trained network to a set of symbolic inference rules. In the case of tissue classification, the rules represent the assignment of regions in the gene space to tissue classes. The rule nodes are the cluster centres of tissue samples. Since the network structure can be directly mapped into a set of rules, no intermediate steps are necessary. This ensures that we do not sacrifice the performance of the network by its conversion into inference rules. Many alternative rule extraction procedures with intermediate steps frequently lead to a dramatic decrease in accuracy. We will see in section 6.9, that compact rules can be derived without loss of performance. Although different membership functions can be exploited within EFuNNs, the use of the triangular membership function guarantees the conservation of the property i µ i(g jk ) = 1, where µ i is the ith membership value for the gene expression value g jk of gene j in sample k. This results from the specific form of the supervised learning which updates the rule node vector r j according to r j i = ri j + α(xi k ri j ), (xi k, ri j : the membership degrees of the fuzzy variable i for the kth example and rule node j; α: learning rate). 
It follows that r j i = (rj i + α(xi k ri j )) = rj i + α x i k α rj i = 1 α + α = 1 i i i i i This property ensures that the rules extracted from an EFuNN are consistent in the anterior part. Contradictory rules of the form [IF (g i is high ( µ high i = 1) AND g i is low ( µ low i = 1) THEN sample k is cancer tissue] cannot be created. 110
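A short numeric check of this invariance (a toy sketch, not the EFuNN implementation; the vector size and learning rate are arbitrary):

```python
import numpy as np

# The rule-node update r_i <- r_i + alpha * (x_i - r_i) preserves
# sum_i r_i = 1 whenever both the rule node vector and the fuzzified
# example sum to 1, for any learning rate alpha.
rng = np.random.default_rng(0)

def update(r, x, alpha):
    return r + alpha * (x - r)

r = rng.random(3); r /= r.sum()          # membership degrees of a rule node
for _ in range(100):
    x = rng.random(3); x /= x.sum()      # fuzzified training example
    r = update(r, x, alpha=0.1)
    assert abs(r.sum() - 1.0) < 1e-12    # invariant holds after every step
print(r.sum())  # stays at 1.0 up to rounding
```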

6.7 Adaptations of EFuNNs to Off-line Learning

EFuNNs have been primarily designed for on-line learning (section 3.5.2). They can adapt to an input-output stream of data (whose structure may change) without suffering the catastrophic forgetting frequently observed in many conventional neural networks. Our case studies, however, pose off-line learning problems, since we analyse fixed data sets of limited size. While the dynamic learning algorithm of EFuNNs adapts very well to situations where large numbers of examples are present, several modifications are needed for improved classification in the off-line task of classifying gene expression data:

1. Multiple data passes: Several passes of the training set through the EFuNN structure are used instead of the single pass of the original EFuNN algorithm. Multi-pass training allows a refinement of the EFuNN structure learned in the first pass: the locations of rule nodes created in the first data pass are adjusted in subsequent iterations, resulting in a better representation of the detected clusters. In our experiments, the structure of EFuNNs in multi-pass training mode usually stabilised quickly; frequently, three passes gave an optimal or near-optimal result.

2. Model generation: The learned structure of EFuNNs depends, particularly for small data sets, on the order of objects in the training set. Different orders can lead to different locations and receptive fields of the rule nodes.[1] This apparent shortcoming can be turned into an advantage if we use the dependency for model generation. By randomising the order of objects in the training set, we can generate a set of different EFuNN structures. By comparing their performance on a validation set, these structures can be ranked and the best-performing structure selected.
Note that neural networks with batch learning algorithms, such as MLPs, do not depend on the order of the training data, since the batch learning algorithm averages over all examples during training. Their final structures, however, depend on the initial structure, for which random weights are frequently used. To achieve an optimal structure, several MLPs with different initial structures are usually trained and the best-performing network is selected.

3. EFuNN shaving: Every EFuNN node is activated by samples showing a certain gene expression pattern; the node represents a cluster of samples with similar gene expression levels. As we will see in the following analysis, only a small subset of genes is crucial for defining these patterns. The remaining genes can be regarded as noise in the classification process. EFuNN shaving tries to discover the characteristic set of genes by a gradual reduction of the number of included genes. The details of this modification of the EFuNN training are given in section 6.9. The expression values of the remaining genes constitute gene expression patterns that are specific to the rule nodes and less redundant. We use this compact set of genes to extract rules of minimal length for the nodes.

4. EFuNN tuning: The EFuNN learning algorithm is based on a local adjustment of the rule nodes in the input-output space without the adjustment of neighbouring rule nodes. (This is strictly the case only for the 1-of-N activation mode, which we use throughout these case studies, while M-of-N modes simultaneously adjust several rule nodes for each training example.) This local learning can result in an EFuNN structure containing localised conflicts between single rule nodes, e.g. overlapping receptive fields. In on-line learning, these conflicts are resolved by the permanent adjustment of the borders between the rule nodes using new training examples from a continuous data stream. In off-line learning this possibility does not exist, as the number of examples is limited. However, since we have a finite number of training examples, we can use the complete set of examples to refine the borders between the rule nodes, i.e. we adjust the EFuNN structure globally to the data structure. The details of EFuNN tuning are described in chapter seven.

[1] Biological neural networks show a similar behaviour. It is well known in neuroscience that the learning process of biological networks depends strongly on the order of the presented examples; positive as well as negative interference due to a specific order in the learning process can be observed. Good teaching of a physiological network involves not only defining the set of examples, but also the order in which the learning examples are presented (Kandel et al. 1995).

6.8 Modelling of Gene Expression Data: Case Studies on Leukaemia and Colon Cancer Tissues

In this section, we apply the described methodology to the leukaemia and colon cancer data sets (Golub et al. 1999, Alon et al. 1999). We first normalise and pre-process the data and examine the importance of these steps for classification.
Using EFuNNs, we classify the data sets and analyse the results and structures of the fuzzy neural networks. Finally, we use rule extraction to identify groups of genes that form profiles highly indicative of particular cancer types.

Normalisation, Pre-processing and Feature Selection

Normalisation: To compare different microarray experiments, the data have to be normalised (chapter four). For the purpose of this study, we used global normalisation, i.e. we scaled the arrays to have the same total fluorescence intensity.

Pre-processing: The next step in the analysis is the pre-processing of the data. Since microarray data are frequently strongly skewed, we used the log2-transformation to balance the data (see section 4.4). To exclude genes which do not show changes in expression, we filtered out all genes $k$ with

$\max(\log_2(g_k)) - \min(\log_2(g_k)) < 3,$

with the maximum and minimum taken over the samples. Although these procedures are common in the analysis of microarray data (Tamayo et al. 1999, Perou et al. 2000), they seem somewhat arbitrary. A major direction of our future research is therefore to find an optimal adjustment of the pre-processing and normalisation within the classification system (section 6.11).

Gene selection: For the selection of genes for classification, several methods such as the t-test or Fisher discriminant analysis have been proposed (Welsh et al. 2001, Dudoit et al. 2000b). Intuitively, we wish to select genes that are highly correlated with the tissue classes. Since we do not want to discriminate between correlated and anti-correlated genes, we used the squared Pearson correlation coefficient $c_P^2$, computed as follows (section 3.1):

$c_P^2(g_i, c) = \left( \frac{\sum_j (g_{ij} - \bar{g}_i)(c_j - \bar{c})}{\| g_i - \bar{g}_i \| \, \| c - \bar{c} \|} \right)^2,$

with the vector $g_i$ of the expression values of gene $i$, the class membership $c_j$ of sample $j$, and $\| \cdot \|$ the Euclidean norm. Although this is a simple approach to gene selection, it finds sets of genes that are well suited for discriminating tissue classes. This can be seen using PCA (section 3.7.1), which gives a first insight into the structure of high-dimensional data. After selecting 100 genes, the visualisation of the first two principal components of the leukaemia data set indicates that AML and ALL samples are generally well separated, with only a small overlap between the two classes (figure 6.3).

Figure 6.3: PCA of leukaemia samples (first two principal components; classes: ALL B cell, ALL T cell, AML), based on the 100 genes that have the largest squared Pearson correlation with the two classes ALL and AML. The first two principal components include 63.3% of the total variance of the data.

A PCA

projection of the colon data set can be found in figure 6.6; it shows that the colon cancer and normal tissue samples are less well separated and include several outliers.

Figure 6.4: Stability of gene selection by squared Pearson correlation for the leukaemia and colon data sets (number of genes against frequency of occurrence). The graphs show how frequently genes were selected under the bootstrap method described in the text.

Besides distinguishing between the tissue classes, the selected genes should not depend on a specific choice of the data samples. To examine the stability of the selection, we used a bootstrap method: leaving out one sample, the genes are selected based on the remaining (N-1) samples, and this is repeated for all samples. Genes that are well correlated across all samples should be selected frequently, while genes that are correlated only with a subset of the samples should be selected less often. Figure 6.4 shows that most of the genes are chosen repeatedly. This demonstrates that selection by squared Pearson correlation is highly stable, at least for the data sets analysed here.

We finally assessed the quality of normalisation and pre-processing by testing their influence on the classification performance using N-fold cross-validation (section 3.6). As can be seen in figure 6.5, normalisation and pre-processing can strongly influence the classification performance, as in the case of the colon tissue data: the performance was increased by normalisation and further by log2-transformation and filtering of the data. For the leukaemia case study, however, the improvement was minor. There are two possible explanations: either the leukaemia data contained less variation than the colon cancer data due to the experimental procedures, or our specific choice of normalisation and pre-processing steps (in conjunction with the gene selection) was better suited for colon cancer than for leukaemia.
This would suggest that normalisation, pre-processing and feature selection are not only important for the classification performance but are also data dependent. A future classification system may adjust these pre-classification steps to the data in order to improve its performance.
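The pre-processing, gene-selection and stability steps of this subsection can be sketched as follows. This is a minimal illustration on random stand-in data, not the code used in the thesis; `select_genes` and all numeric settings are assumptions:

```python
import numpy as np

# Sketch of the pipeline above: (1) global normalisation to equal total
# intensity, (2) log2 transform, (3) the max - min >= 3 variation filter,
# (4) gene ranking by squared Pearson correlation with the class labels,
# and (5) the leave-one-out check of how stably the same genes recur.
rng = np.random.default_rng(1)
raw = rng.gamma(shape=2.0, scale=200.0, size=(1000, 38))  # genes x samples
y = np.array([0] * 27 + [1] * 11)                         # e.g. ALL vs AML
raw[:30, y == 1] *= 8.0                                   # informative genes

X = raw * (raw.sum(axis=0).mean() / raw.sum(axis=0))  # (1) equal total intensity
X = np.log2(X)                                        # (2) reduce skew
X = X[(X.max(axis=1) - X.min(axis=1)) >= 3]           # (3) variation filter

def select_genes(X, y, n_top=100):
    """Indices of the n_top genes with largest squared Pearson correlation."""
    Xc = X - X.mean(axis=1, keepdims=True)
    yc = y - y.mean()
    c2 = (Xc @ yc) ** 2 / ((Xc ** 2).sum(axis=1) * (yc ** 2).sum())
    return set(np.argsort(c2)[::-1][:n_top])

# (5) leave-one-out stability of the selection
counts = np.zeros(X.shape[0], dtype=int)
for k in range(X.shape[1]):                 # leave sample k out
    mask = np.arange(X.shape[1]) != k
    for g in select_genes(X[:, mask], y[mask]):
        counts[g] += 1

stable = np.flatnonzero(counts == X.shape[1])  # selected in every round
print(len(stable), "genes are selected in all leave-one-out rounds")
```

On data with a genuine class signal, the informative genes are re-selected in every round, which is the behaviour figure 6.4 reports for the real data sets.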

Figure 6.5: Dependency of EFuNN performance (classification accuracy in %) on pre-processing and normalisation of the microarray data, for the colon cancer and leukaemia data sets (raw data; scaled data; scaled + logged data; scaled + logged + filtered data). The N-fold cross-validation testing was based on 100 selected genes (EFuNN parameters: error threshold = 0.9; maximum receptive field = 0.3; aggregation number = 20).

Table 6.1: Performance of EFuNNs in N-fold cross-validation testing (rows: Leukaemia - EFuNN with two parameter settings, Colon - EFuNN, Colon - EFuNN (shaved)). Errth - error threshold, maxrf - maximum receptive field, nagg - aggregation number, Rules - average number of evolved rule nodes, Acc-tr - average classification accuracy of training examples in %, Acc-te - average classification accuracy of testing examples in %.

Classification of Leukaemia and Colon Cancer Tissues

The performance of EFuNNs was tested by N-fold cross-validation. The cross-validation procedure included the gene selection step, to avoid the over-estimation of accuracy that can occur if the complete data set (including the test set) is used for gene selection. For modelling, the parameters of EFuNNs have to be adjusted. Note that we used three triangular membership functions, "low", "medium" and "high", throughout the analysis (figure 6.2). The chosen parameters determine not only the classification accuracy but also the internal structure of EFuNNs, by controlling the number of rule nodes. Frequent aggregation of rule nodes and a larger maximum receptive field, for example, yield a smaller number of rule nodes. An example of this behaviour is given for the leukaemia data in table 6.1: by reducing the maximum receptive field from 0.5 to 0.3, the average number of rule nodes produced during the training phase increased from 2.0 to 6.1.
While the classification accuracy rose for the training set, it decreased for the testing set. This is a consequence of the well-known bias-variance trade-off: classifiers with a large number of parameters show a high flexibility in modelling the data but tend to overfit it, resulting in poorer generalisation. They achieve a higher accuracy on the training data but show a reduced performance on the test data.

For the leukaemia data set, a maximum accuracy of 97.2% was obtained by EFuNNs on the testing set using N-fold cross-validation. This is a very high accuracy considering that the trained EFuNN models consisted on average of only two rules. An overview of classification results for other approaches can be found in table 6.2.

Data       Classifier                Study                 Accuracy (%)
Leukaemia  weighted voting           Golub et al. 1999     94.1
Leukaemia  feature wrapping          Bo & Jonassen 2002    98.6
Leukaemia  ensemble of classifiers   Ryu & Cho 2002        97.1
Colon      decision tree             Zhang et al. 2001     93.6
Colon      lin. discrim. analysis    Xiong et al. 2000     87.0
Colon      SVM                       Moler et al. 2000     88.7
Colon      clustering                Ben-Dor et al. 2000   88.7
Colon      feature wrapping          Bo & Jonassen 2002    88.7

Table 6.2: Comparison of classification results for the leukaemia and colon cancer data. The accuracies (in %) of several alternative approaches are shown as reported in the corresponding publications. A direct comparison is difficult, since the testing procedures differ (see text).

Golub et al. achieved a lower performance (94.1%) applying a weighted voting method (Golub et al. 1999). A direct comparison is restricted, however, since Golub et al. assessed the accuracy based on a single split of the total data into training and testing sets. Bo and Jonassen reported a maximum accuracy of 98.6%, including more than 30 genes in their model (Bo & Jonassen 2002). As we will see in the next section, EFuNNs need considerably fewer genes for classification. Using the same training and test data as Golub et al., the ensemble classifier constructed by Ryu and Cho yielded a maximum accuracy of 97.1% (with a further improvement reported recently (Ryu & Cho 2002)).

The classification performance is considerably lower for the colon cancer data, at 90.3% (table 6.1); we will improve this accuracy using EFuNN shaving in the next section. An accuracy of 93.6% was achieved by Zhang et al. using decision tree techniques (Zhang et al. 2001), but this result was based on a different testing method and is flawed by the lack of any data normalisation. Fisher's linear discriminant analysis conducted by Xiong et al.
achieved an accuracy of 87.0% (Xiong et al. 2000). Moler et al. also re-analysed the colon tumour data, applying an SVM approach that classified 88.7% of the samples correctly (Moler et al. 2000). The same accuracy was achieved by Ben-Dor et al. using clustering-based classification (Ben-Dor et al. 2000) and by Bo and Jonassen applying their proposed feature wrapping method (Bo & Jonassen 2002).

For the colon data, the PCA shows several samples as outliers (figure 6.6), which made the classification more difficult. As for the leukaemia data set, a large maximum receptive field ensured the simplicity of the model and a high accuracy on the test set. The misclassified samples are shown in a PCA projection in figure 6.6; most of them can be considered outliers. Based on the set of selected genes and the PCA projection, their gene expression profiles were very similar to the expression profiles of samples belonging to the other tissue class.
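The cross-validation protocol used in this section, with gene selection nested inside each fold, can be sketched as follows. This is a minimal illustration on random stand-in data; the nearest-centroid classifier is a simple stand-in for an EFuNN, and all names and settings are assumptions:

```python
import numpy as np

# N-fold (here leave-one-out) cross-validation with gene selection nested
# inside each fold: genes are re-selected from the training folds only, so
# the held-out sample never influences the selection.

def select_genes(X, y, n_top=50):
    """Indices of the n_top genes with largest squared Pearson correlation."""
    Xc = X - X.mean(axis=1, keepdims=True)
    yc = y - y.mean()
    c2 = (Xc @ yc) ** 2 / ((Xc ** 2).sum(axis=1) * (yc ** 2).sum())
    return np.argsort(c2)[::-1][:n_top]

def nearest_centroid_predict(Xtr, ytr, xte):
    """Assign xte to the class with the nearest training centroid."""
    cents = {c: Xtr[:, ytr == c].mean(axis=1) for c in np.unique(ytr)}
    return min(cents, key=lambda c: np.linalg.norm(xte - cents[c]))

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 40))            # genes x samples
y = np.array([0] * 20 + [1] * 20)
X[:30, y == 1] += 2.0                     # informative genes

correct = 0
for k in range(X.shape[1]):               # leave sample k out
    mask = np.arange(X.shape[1]) != k
    genes = select_genes(X[:, mask], y[mask])          # nested selection
    pred = nearest_centroid_predict(X[np.ix_(genes, mask)], y[mask],
                                    X[genes, k])
    correct += (pred == y[k])
print(correct / X.shape[1])
```

Selecting genes on the full data set before splitting would let each held-out sample influence the selection and inflate the estimated accuracy, which is exactly the over-estimation the nested protocol avoids.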


More information

Introduction to Machine Learning. Katherine Heller Deep Learning Summer School 2018

Introduction to Machine Learning. Katherine Heller Deep Learning Summer School 2018 Introduction to Machine Learning Katherine Heller Deep Learning Summer School 2018 Outline Kinds of machine learning Linear regression Regularization Bayesian methods Logistic Regression Why we do this

More information

Learning in neural networks

Learning in neural networks http://ccnl.psy.unipd.it Learning in neural networks Marco Zorzi University of Padova M. Zorzi - European Diploma in Cognitive and Brain Sciences, Cognitive modeling", HWK 19-24/3/2006 1 Connectionist

More information

Artificial Neural Networks (Ref: Negnevitsky, M. Artificial Intelligence, Chapter 6)

Artificial Neural Networks (Ref: Negnevitsky, M. Artificial Intelligence, Chapter 6) Artificial Neural Networks (Ref: Negnevitsky, M. Artificial Intelligence, Chapter 6) BPNN in Practice Week 3 Lecture Notes page 1 of 1 The Hopfield Network In this network, it was designed on analogy of

More information

Gender Based Emotion Recognition using Speech Signals: A Review

Gender Based Emotion Recognition using Speech Signals: A Review 50 Gender Based Emotion Recognition using Speech Signals: A Review Parvinder Kaur 1, Mandeep Kaur 2 1 Department of Electronics and Communication Engineering, Punjabi University, Patiala, India 2 Department

More information

EECS 433 Statistical Pattern Recognition

EECS 433 Statistical Pattern Recognition EECS 433 Statistical Pattern Recognition Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1 / 19 Outline What is Pattern

More information

FUZZY C-MEANS AND ENTROPY BASED GENE SELECTION BY PRINCIPAL COMPONENT ANALYSIS IN CANCER CLASSIFICATION

FUZZY C-MEANS AND ENTROPY BASED GENE SELECTION BY PRINCIPAL COMPONENT ANALYSIS IN CANCER CLASSIFICATION FUZZY C-MEANS AND ENTROPY BASED GENE SELECTION BY PRINCIPAL COMPONENT ANALYSIS IN CANCER CLASSIFICATION SOMAYEH ABBASI, HAMID MAHMOODIAN Department of Electrical Engineering, Najafabad branch, Islamic

More information

PMR5406 Redes Neurais e Lógica Fuzzy. Aula 5 Alguns Exemplos

PMR5406 Redes Neurais e Lógica Fuzzy. Aula 5 Alguns Exemplos PMR5406 Redes Neurais e Lógica Fuzzy Aula 5 Alguns Exemplos APPLICATIONS Two examples of real life applications of neural networks for pattern classification: RBF networks for face recognition FF networks

More information

Stepwise Knowledge Acquisition in a Fuzzy Knowledge Representation Framework

Stepwise Knowledge Acquisition in a Fuzzy Knowledge Representation Framework Stepwise Knowledge Acquisition in a Fuzzy Knowledge Representation Framework Thomas E. Rothenfluh 1, Karl Bögl 2, and Klaus-Peter Adlassnig 2 1 Department of Psychology University of Zurich, Zürichbergstraße

More information

Neurons and neural networks II. Hopfield network

Neurons and neural networks II. Hopfield network Neurons and neural networks II. Hopfield network 1 Perceptron recap key ingredient: adaptivity of the system unsupervised vs supervised learning architecture for discrimination: single neuron perceptron

More information

Error Detection based on neural signals

Error Detection based on neural signals Error Detection based on neural signals Nir Even- Chen and Igor Berman, Electrical Engineering, Stanford Introduction Brain computer interface (BCI) is a direct communication pathway between the brain

More information

NMF-Density: NMF-Based Breast Density Classifier

NMF-Density: NMF-Based Breast Density Classifier NMF-Density: NMF-Based Breast Density Classifier Lahouari Ghouti and Abdullah H. Owaidh King Fahd University of Petroleum and Minerals - Department of Information and Computer Science. KFUPM Box 1128.

More information

CSE Introduction to High-Perfomance Deep Learning ImageNet & VGG. Jihyung Kil

CSE Introduction to High-Perfomance Deep Learning ImageNet & VGG. Jihyung Kil CSE 5194.01 - Introduction to High-Perfomance Deep Learning ImageNet & VGG Jihyung Kil ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton,

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014 Exam policy: This exam allows two one-page, two-sided cheat sheets (i.e. 4 sides); No other materials. Time: 2 hours. Be sure to write

More information

Colon cancer subtypes from gene expression data

Colon cancer subtypes from gene expression data Colon cancer subtypes from gene expression data Nathan Cunningham Giuseppe Di Benedetto Sherman Ip Leon Law Module 6: Applied Statistics 26th February 2016 Aim Replicate findings of Felipe De Sousa et

More information

Learning Classifier Systems (LCS/XCSF)

Learning Classifier Systems (LCS/XCSF) Context-Dependent Predictions and Cognitive Arm Control with XCSF Learning Classifier Systems (LCS/XCSF) Laurentius Florentin Gruber Seminar aus Künstlicher Intelligenz WS 2015/16 Professor Johannes Fürnkranz

More information

EARLY STAGE DIAGNOSIS OF LUNG CANCER USING CT-SCAN IMAGES BASED ON CELLULAR LEARNING AUTOMATE

EARLY STAGE DIAGNOSIS OF LUNG CANCER USING CT-SCAN IMAGES BASED ON CELLULAR LEARNING AUTOMATE EARLY STAGE DIAGNOSIS OF LUNG CANCER USING CT-SCAN IMAGES BASED ON CELLULAR LEARNING AUTOMATE SAKTHI NEELA.P.K Department of M.E (Medical electronics) Sengunthar College of engineering Namakkal, Tamilnadu,

More information

Data analysis and binary regression for predictive discrimination. using DNA microarray data. (Breast cancer) discrimination. Expression array data

Data analysis and binary regression for predictive discrimination. using DNA microarray data. (Breast cancer) discrimination. Expression array data West Mike of Statistics & Decision Sciences Institute Duke University wwwstatdukeedu IPAM Functional Genomics Workshop November Two group problems: Binary outcomes ffl eg, ER+ versus ER ffl eg, lymph node

More information

AClass: A Simple, Online Probabilistic Classifier. Vikash K. Mansinghka Computational Cognitive Science Group MIT BCS/CSAIL

AClass: A Simple, Online Probabilistic Classifier. Vikash K. Mansinghka Computational Cognitive Science Group MIT BCS/CSAIL AClass: A Simple, Online Probabilistic Classifier Vikash K. Mansinghka Computational Cognitive Science Group MIT BCS/CSAIL AClass: A Simple, Online Probabilistic Classifier or How I learned to stop worrying

More information

An Efficient Diseases Classifier based on Microarray Datasets using Clustering ANOVA Extreme Learning Machine (CAELM)

An Efficient Diseases Classifier based on Microarray Datasets using Clustering ANOVA Extreme Learning Machine (CAELM) www.ijcsi.org 8 An Efficient Diseases Classifier based on Microarray Datasets using Clustering ANOVA Extreme Learning Machine (CAELM) Shamsan Aljamali 1, Zhang Zuping 2 and Long Jun 3 1 School of Information

More information

Good Old clinical markers have similar power in breast cancer prognosis as microarray gene expression profilers q

Good Old clinical markers have similar power in breast cancer prognosis as microarray gene expression profilers q European Journal of Cancer 40 (2004) 1837 1841 European Journal of Cancer www.ejconline.com Good Old clinical markers have similar power in breast cancer prognosis as microarray gene expression profilers

More information

Lecture #4: Overabundance Analysis and Class Discovery

Lecture #4: Overabundance Analysis and Class Discovery 236632 Topics in Microarray Data nalysis Winter 2004-5 November 15, 2004 Lecture #4: Overabundance nalysis and Class Discovery Lecturer: Doron Lipson Scribes: Itai Sharon & Tomer Shiran 1 Differentially

More information

Integrative Methods for Gene Data Analysis and Knowledge Discovery on the Case Study of KEDRI s Brain Gene Ontology. Yuepeng Wang

Integrative Methods for Gene Data Analysis and Knowledge Discovery on the Case Study of KEDRI s Brain Gene Ontology. Yuepeng Wang Integrative Methods for Gene Data Analysis and Knowledge Discovery on the Case Study of KEDRI s Brain Gene Ontology Yuepeng Wang A thesis submitted to Auckland University of Technology in partial fulfilment

More information

How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis?

How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis? How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis? Richards J. Heuer, Jr. Version 1.2, October 16, 2005 This document is from a collection of works by Richards J. Heuer, Jr.

More information

Introduction to Computational Neuroscience

Introduction to Computational Neuroscience Introduction to Computational Neuroscience Lecture 5: Data analysis II Lesson Title 1 Introduction 2 Structure and Function of the NS 3 Windows to the Brain 4 Data analysis 5 Data analysis II 6 Single

More information

A hierarchical two-phase framework for selecting genes in cancer datasets with a neuro-fuzzy system

A hierarchical two-phase framework for selecting genes in cancer datasets with a neuro-fuzzy system Technology and Health Care 24 (2016) S601 S605 DOI 10.3233/THC-161187 IOS Press S601 A hierarchical two-phase framework for selecting genes in cancer datasets with a neuro-fuzzy system Jongwoo Lim, Bohyun

More information

Classification of Synapses Using Spatial Protein Data

Classification of Synapses Using Spatial Protein Data Classification of Synapses Using Spatial Protein Data Jenny Chen and Micol Marchetti-Bowick CS229 Final Project December 11, 2009 1 MOTIVATION Many human neurological and cognitive disorders are caused

More information

Abstract. Optimization strategy of Copy Number Variant calling using Multiplicom solutions APPLICATION NOTE. Introduction

Abstract. Optimization strategy of Copy Number Variant calling using Multiplicom solutions APPLICATION NOTE. Introduction Optimization strategy of Copy Number Variant calling using Multiplicom solutions Michael Vyverman, PhD; Laura Standaert, PhD and Wouter Bossuyt, PhD Abstract Copy number variations (CNVs) represent a significant

More information

Contents. Just Classifier? Rules. Rules: example. Classification Rule Generation for Bioinformatics. Rule Extraction from a trained network

Contents. Just Classifier? Rules. Rules: example. Classification Rule Generation for Bioinformatics. Rule Extraction from a trained network Contents Classification Rule Generation for Bioinformatics Hyeoncheol Kim Rule Extraction from Neural Networks Algorithm Ex] Promoter Domain Hybrid Model of Knowledge and Learning Knowledge refinement

More information

How to Create Better Performing Bayesian Networks: A Heuristic Approach for Variable Selection

How to Create Better Performing Bayesian Networks: A Heuristic Approach for Variable Selection How to Create Better Performing Bayesian Networks: A Heuristic Approach for Variable Selection Esma Nur Cinicioglu * and Gülseren Büyükuğur Istanbul University, School of Business, Quantitative Methods

More information

Discrimination and Generalization in Pattern Categorization: A Case for Elemental Associative Learning

Discrimination and Generalization in Pattern Categorization: A Case for Elemental Associative Learning Discrimination and Generalization in Pattern Categorization: A Case for Elemental Associative Learning E. J. Livesey (el253@cam.ac.uk) P. J. C. Broadhurst (pjcb3@cam.ac.uk) I. P. L. McLaren (iplm2@cam.ac.uk)

More information

An Improved Algorithm To Predict Recurrence Of Breast Cancer

An Improved Algorithm To Predict Recurrence Of Breast Cancer An Improved Algorithm To Predict Recurrence Of Breast Cancer Umang Agrawal 1, Ass. Prof. Ishan K Rajani 2 1 M.E Computer Engineer, Silver Oak College of Engineering & Technology, Gujarat, India. 2 Assistant

More information

Data mining for Obstructive Sleep Apnea Detection. 18 October 2017 Konstantinos Nikolaidis

Data mining for Obstructive Sleep Apnea Detection. 18 October 2017 Konstantinos Nikolaidis Data mining for Obstructive Sleep Apnea Detection 18 October 2017 Konstantinos Nikolaidis Introduction: What is Obstructive Sleep Apnea? Obstructive Sleep Apnea (OSA) is a relatively common sleep disorder

More information

DIABETIC RISK PREDICTION FOR WOMEN USING BOOTSTRAP AGGREGATION ON BACK-PROPAGATION NEURAL NETWORKS

DIABETIC RISK PREDICTION FOR WOMEN USING BOOTSTRAP AGGREGATION ON BACK-PROPAGATION NEURAL NETWORKS International Journal of Computer Engineering & Technology (IJCET) Volume 9, Issue 4, July-Aug 2018, pp. 196-201, Article IJCET_09_04_021 Available online at http://www.iaeme.com/ijcet/issues.asp?jtype=ijcet&vtype=9&itype=4

More information

Classifica4on. CSCI1950 Z Computa4onal Methods for Biology Lecture 18. Ben Raphael April 8, hip://cs.brown.edu/courses/csci1950 z/

Classifica4on. CSCI1950 Z Computa4onal Methods for Biology Lecture 18. Ben Raphael April 8, hip://cs.brown.edu/courses/csci1950 z/ CSCI1950 Z Computa4onal Methods for Biology Lecture 18 Ben Raphael April 8, 2009 hip://cs.brown.edu/courses/csci1950 z/ Binary classifica,on Given a set of examples (x i, y i ), where y i = + 1, from unknown

More information

Wen et al. (1998) PNAS, 95:

Wen et al. (1998) PNAS, 95: Large-scale temporal gene expression mapping of central nervous system development Fluctuations in mrna expression of 2 genes during rat central nervous system development, focusing on the cervical spinal

More information

Neuroinformatics. Ilmari Kurki, Urs Köster, Jukka Perkiö, (Shohei Shimizu) Interdisciplinary and interdepartmental

Neuroinformatics. Ilmari Kurki, Urs Köster, Jukka Perkiö, (Shohei Shimizu) Interdisciplinary and interdepartmental Neuroinformatics Aapo Hyvärinen, still Academy Research Fellow for a while Post-docs: Patrik Hoyer and Jarmo Hurri + possibly international post-docs PhD students Ilmari Kurki, Urs Köster, Jukka Perkiö,

More information

Gene Expression Based Leukemia Sub Classification Using Committee Neural Networks

Gene Expression Based Leukemia Sub Classification Using Committee Neural Networks Bioinformatics and Biology Insights M e t h o d o l o g y Open Access Full open access to this and thousands of other papers at http://www.la-press.com. Gene Expression Based Leukemia Sub Classification

More information

Feature selection methods for early predictive biomarker discovery using untargeted metabolomic data

Feature selection methods for early predictive biomarker discovery using untargeted metabolomic data Feature selection methods for early predictive biomarker discovery using untargeted metabolomic data Dhouha Grissa, Mélanie Pétéra, Marion Brandolini, Amedeo Napoli, Blandine Comte and Estelle Pujos-Guillot

More information

Classification of Epileptic Seizure Predictors in EEG

Classification of Epileptic Seizure Predictors in EEG Classification of Epileptic Seizure Predictors in EEG Problem: Epileptic seizures are still not fully understood in medicine. This is because there is a wide range of potential causes of epilepsy which

More information

A Comparison of Collaborative Filtering Methods for Medication Reconciliation

A Comparison of Collaborative Filtering Methods for Medication Reconciliation A Comparison of Collaborative Filtering Methods for Medication Reconciliation Huanian Zheng, Rema Padman, Daniel B. Neill The H. John Heinz III College, Carnegie Mellon University, Pittsburgh, PA, 15213,

More information

Evolutionary Programming

Evolutionary Programming Evolutionary Programming Searching Problem Spaces William Power April 24, 2016 1 Evolutionary Programming Can we solve problems by mi:micing the evolutionary process? Evolutionary programming is a methodology

More information

Network Analysis of Toxic Chemicals and Symptoms: Implications for Designing First-Responder Systems

Network Analysis of Toxic Chemicals and Symptoms: Implications for Designing First-Responder Systems Network Analysis of Toxic Chemicals and Symptoms: Implications for Designing First-Responder Systems Suresh K. Bhavnani 1 PhD, Annie Abraham 1, Christopher Demeniuk 1, Messeret Gebrekristos 1 Abe Gong

More information

Mammogram Analysis: Tumor Classification

Mammogram Analysis: Tumor Classification Mammogram Analysis: Tumor Classification Literature Survey Report Geethapriya Raghavan geeragh@mail.utexas.edu EE 381K - Multidimensional Digital Signal Processing Spring 2005 Abstract Breast cancer is

More information

Reactive agents and perceptual ambiguity

Reactive agents and perceptual ambiguity Major theme: Robotic and computational models of interaction and cognition Reactive agents and perceptual ambiguity Michel van Dartel and Eric Postma IKAT, Universiteit Maastricht Abstract Situated and

More information

Motivation: Fraud Detection

Motivation: Fraud Detection Outlier Detection Motivation: Fraud Detection http://i.imgur.com/ckkoaop.gif Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (1) 2 Techniques: Fraud Detection Features Dissimilarity Groups and

More information

Machine Learning to Inform Breast Cancer Post-Recovery Surveillance

Machine Learning to Inform Breast Cancer Post-Recovery Surveillance Machine Learning to Inform Breast Cancer Post-Recovery Surveillance Final Project Report CS 229 Autumn 2017 Category: Life Sciences Maxwell Allman (mallman) Lin Fan (linfan) Jamie Kang (kangjh) 1 Introduction

More information

Intelligent Control Systems

Intelligent Control Systems Lecture Notes in 4 th Class in the Control and Systems Engineering Department University of Technology CCE-CN432 Edited By: Dr. Mohammed Y. Hassan, Ph. D. Fourth Year. CCE-CN432 Syllabus Theoretical: 2

More information

Bayesian and Frequentist Approaches

Bayesian and Frequentist Approaches Bayesian and Frequentist Approaches G. Jogesh Babu Penn State University http://sites.stat.psu.edu/ babu http://astrostatistics.psu.edu All models are wrong But some are useful George E. P. Box (son-in-law

More information

OUTLIER DETECTION : A REVIEW

OUTLIER DETECTION : A REVIEW International Journal of Advances Outlier in Embedeed Detection System : A Review Research January-June 2011, Volume 1, Number 1, pp. 55 71 OUTLIER DETECTION : A REVIEW K. Subramanian 1, and E. Ramraj

More information

Application of Tree Structures of Fuzzy Classifier to Diabetes Disease Diagnosis

Application of Tree Structures of Fuzzy Classifier to Diabetes Disease Diagnosis , pp.143-147 http://dx.doi.org/10.14257/astl.2017.143.30 Application of Tree Structures of Fuzzy Classifier to Diabetes Disease Diagnosis Chang-Wook Han Department of Electrical Engineering, Dong-Eui University,

More information

Modeling Sentiment with Ridge Regression

Modeling Sentiment with Ridge Regression Modeling Sentiment with Ridge Regression Luke Segars 2/20/2012 The goal of this project was to generate a linear sentiment model for classifying Amazon book reviews according to their star rank. More generally,

More information

Data complexity measures for analyzing the effect of SMOTE over microarrays

Data complexity measures for analyzing the effect of SMOTE over microarrays ESANN 216 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Bruges (Belgium), 27-29 April 216, i6doc.com publ., ISBN 978-2878727-8. Data complexity

More information

ANN predicts locoregional control using molecular marker profiles of. Head and Neck squamous cell carcinoma

ANN predicts locoregional control using molecular marker profiles of. Head and Neck squamous cell carcinoma ANN predicts locoregional control using molecular marker profiles of Head and Neck squamous cell carcinoma Final Project: 539 Dinesh Kumar Tewatia Introduction Radiotherapy alone or combined with chemotherapy,

More information

MOST: detecting cancer differential gene expression

MOST: detecting cancer differential gene expression Biostatistics (2008), 9, 3, pp. 411 418 doi:10.1093/biostatistics/kxm042 Advance Access publication on November 29, 2007 MOST: detecting cancer differential gene expression HENG LIAN Division of Mathematical

More information

Data Mining in Bioinformatics Day 7: Clustering in Bioinformatics

Data Mining in Bioinformatics Day 7: Clustering in Bioinformatics Data Mining in Bioinformatics Day 7: Clustering in Bioinformatics Karsten Borgwardt February 21 to March 4, 2011 Machine Learning & Computational Biology Research Group MPIs Tübingen Karsten Borgwardt:

More information

Case Studies on High Throughput Gene Expression Data Kun Huang, PhD Raghu Machiraju, PhD

Case Studies on High Throughput Gene Expression Data Kun Huang, PhD Raghu Machiraju, PhD Case Studies on High Throughput Gene Expression Data Kun Huang, PhD Raghu Machiraju, PhD Department of Biomedical Informatics Department of Computer Science and Engineering The Ohio State University Review

More information