Evaluating Classifiers for Disease Gene Discovery


Kino Coursey, Lon Turnbull

Abstract

Identification of genes involved in human hereditary disease is an important bioinformatics task. The tool PROSPECTR estimates the likelihood that a gene is involved in human hereditary disease by looking at patterns in sequence-based features, and was developed using the alternating decision tree algorithm in the Weka machine learning tool. Here we examine the performance of other classifiers on the same data and on a subset of the most statistically relevant features, and we examine the hypothesis that sequence-based features can be used for disease prediction. We found a better classifier, but the results raise questions about the predictive value of the features originally selected.

Introduction and Problem

Determining the degree to which a gene is involved in a genetic disease is an important bioinformatics task. The tool PROSPECTR [1], developed by the University of Edinburgh, estimates the likelihood that a gene is involved in human hereditary disease by looking at patterns in DNA sequence features, and was developed using the alternating decision tree algorithm in the Weka machine learning tool.

In this project we extend the PROSPECTR project's work on classifiers that estimate the likelihood of a candidate region being involved in a genetic disease. While their work focused only on limited-depth decision trees, we examined the other classifiers in the Weka machine learning tool set and determined the quality and accuracy of each. We also examined classifier performance over a subset of the most statistically relevant features to test the predictive value of the sequence-based features. Improving the accuracy of this classification task would allow high-probability sites to be given priority when searching for disease genes.
Background and Related Work

The ability to estimate the probability that a gene is involved in human hereditary disease is very useful in bioinformatics. More and more information on human and other genomes is constantly being collected. An estimate of the probability that a given gene is involved in a disease phenotype would speed up this important biomedical task. Developing better classifiers is central to this approach.

Genetic diseases are due to genes that have been mutated so that the body, or some part of the body, no longer functions correctly. More than 100 known genetic disorders are the direct result of a mutation in one gene. It is much more difficult to find the basis of polygenic diseases, which have a complex pattern of inheritance in which more than one gene must be mutated before susceptibility to a disease is expressed [2].

It has been suggested that genes related to hereditary disease might share common variations in their DNA sequence structure. The University of Edinburgh group used the alternating decision tree algorithm from Weka to test this hypothesis. They used 63 distinct features to test about 18,000 genes not known to be involved in disease and the 1,084 genes listed in Online Mendelian Inheritance in Man [3]. On average, their automatic classifier, PROSPECTR, correctly identified 70% of the disease genes.

[2] Polygenic disorder: A genetic disorder resulting from the combined action of alleles of more than one gene (e.g., heart disease, diabetes, and some cancers).
[3] Online Mendelian Inheritance in Man (OMIM): National Center for Biotechnology Information database of genetic diseases with information on their clinical diagnosis and treatment, cell biology, biochemistry, and molecular medicine.

What are Disease Genes?

Central to the classification process is defining what it means for a gene to be involved in a disease phenotype. One definition is that a gene has mutated in such a way that the proteins created from it are dysfunctional. However, mutation can occur in any gene, and the causes of disease form a continuum of genetic activity interacting with non-genetic factors.

"Every individual is a deviant in terms of biochemical individuality, meaning that every person has an inherited predisposition to disease in a particular circumstance." (from The Metabolic and Molecular Bases of Inherited Disease)

Given this, how can one search for disease genes when the concept is so fuzzy?
The simple expedient taken by the PROSPECTR group was to collect genes identified as highly correlated with, or causally linked to, medically classified conditions from online gene data banks with disease-influence annotations.

PROSPECTR Results

The PROSPECTR group tested a number of DNA features in an attempt to find differences between disease genes and non-disease genes. Table 1 shows, for the 9 of the 24 features that had statistically significant differences, the ratio of the median in a disease set to the median in a control set. The larger the ratio, the greater the dependence between the feature and disease.

Feature                               Ratio
Gene encodes signal peptide           2.06
Gene length                           1.42
Protein length
' CpG islands                         1.33
Exon number                           1.25
cDNA length                           1.15
Distance to neighboring gene
' UTR length                          1.09
Protein identity with BRH in mouse    1.09

Table 1: Feature to Disease ratio

Given these relationships between disease genes and features, we became interested in evaluating the performance of classifiers on a reduced feature set. If the features are predictive, then eliminating the noisier features should improve performance.

Problem Formulation

Because better classifiers would be beneficial, our project evaluates the performance of a number of classifiers and compares them to PROSPECTR. In addition, after examining the PROSPECTR group's work, we became interested in classifier performance using only the most statistically significant features. Our goal was to determine:

1. Does a better classifier exist for the data set?
2. How valid are the features used in the dataset for disease prediction?

Methodology

We take the simple approach of:

1. Duplicating the PROSPECTR team's results by obtaining their dataset and machine learning tool set.
2. Extending the examination by using the same machine learning tool set to examine other relevant classifiers.
3. Producing datasets with only the most statistically relevant features.
4. Measuring the performance of the originally selected classifiers and those we examined on the reduced datasets.
5. Identifying those areas where the other classifiers exceed the performance of the PROSPECTR baseline.

The Datasets

The group at the University of Edinburgh provides the datasets used to create and test PROSPECTR [4]. Three datasets are available to duplicate their work:

omim_trainingset: Genes from Online Mendelian Inheritance in Man (OMIM) and a set from Ensembl not known to be involved in human disease.
hgmd_testset: Independent test set, with 675 genes in the Human Gene Mutation Database and 675 genes not known to be involved in disease.
pocus_testset: The POCUS list is made up of genes involved in oligogenic [5] disorders.

[4] Datasets are available from d.shtml

Weka Data Mining Tool Set

Weka is a collection of machine learning algorithms for data mining tasks developed at the University of Waikato in New Zealand (Witten, 2004) [6]. Weka provides several classifier systems including, but not limited to, decision trees, rule generators, statistical analysis, Bayes, SVM and neural networks. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It also has an active development and user community, which ensures that new machine learning methods that conform to the tabular dataset format become available.
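Step 3 of the methodology, producing a dataset restricted to the most relevant features, amounts to projecting each record of a tabular dataset onto a chosen subset of columns. The following is a minimal Python sketch under stated assumptions, not the procedure used in this study: the function name reduce_features, the feature names, the class attribute name and the record values are all hypothetical.

```python
def reduce_features(rows, keep, class_attr="class"):
    """Project each record onto the kept features plus the class label."""
    reduced = []
    for row in rows:
        # Fail loudly if a requested feature is absent from a record.
        missing = [name for name in keep if name not in row]
        if missing:
            raise KeyError(f"record lacks features: {missing}")
        projected = {name: row[name] for name in keep}
        projected[class_attr] = row[class_attr]
        reduced.append(projected)
    return reduced


# Toy records; feature names and values are hypothetical.
rows = [
    {"gene_length": 41000, "exon_number": 12, "gc_content": 0.41, "class": "disease"},
    {"gene_length": 9000, "exon_number": 4, "gc_content": 0.52, "class": "normal"},
]
reduced = reduce_features(rows, keep=["gene_length", "exon_number"])
```

The class label is always carried through, so the reduced records can be passed to the same classifiers as the full ones.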
In addition, the project has spawned several interesting branches (Grid Weka, Parallel Weka, BioWeka). The algorithms in Weka can either be applied directly to a dataset or called from your own Java code. It is also well suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.

[5] Oligogenic: A phenotypic trait produced by two or more genes working together.
[6] Weka is available at

The Classifiers Examined

We examined a cross section of the classifiers currently provided by Weka.

ADTree -B 15 -E -1
The alternating decision tree is optimized for two-class problems. Each prediction node has an associated positive or negative numeric value. To obtain a prediction for an instance, follow all applicable branches and sum the values of every prediction node encountered. The final classification is given by whether the sum is positive or negative.

J48 -C M 5
J48 is a variant of the C4.5 decision tree induction algorithm. Test nodes in the tree are selected based on their ability to produce a clearer separation between the classes. Leaf nodes contain either all one class or a mixture, in which case the majority class is used as the classification. The minimum node size is 5.

Logistic -R 1.0E-8 -M
Linear logistic regression, a form of regression for classification. It constructs a linear regression model of the logit transform of the probability:

log(p/(1-p)) = b0 + b1*x1 + b2*x2 + ... + bk*xk

where b0...bk are chosen by a regression method to make optimal class predictions.

SMO
Sequential Minimal Optimization algorithm for training a support vector machine (SVM) classifier. Training an SVM requires the solution of a very large quadratic programming (QP) optimization problem. SMO breaks this large QP problem into a series of the smallest possible QP problems. For classification, SVMs operate by finding a hypersurface in the space of possible inputs that attempts to split the positive examples from the negative examples.

Naïve Bayes
Standard probabilistic Naïve Bayes classifier. Using Bayes' theorem and class statistics, it selects the most probable classification.

IBk -K 5 -W 0
K-nearest neighbor classifier (k=5). The class is chosen as the majority class of the 5 nearest training instances under a distance metric.

PART -M 2 -C Q 1
Obtains rules from partial decision trees built using C4.5 heuristics.

Evaluation: What makes a Good Classifier?

One has to define how a good classifier is recognized. First, all of the datasets are balanced, with an equal number of disease and non-disease genes. Hence, ideally, the total percentage correct should be high, and the number correctly identified in each class should be balanced. Another approach is to select the classifier with the fewest misclassifications as the best choice. Details of the experiments can be found in Table 3.

Table 2: Percentage correct and change on reduced feature set. (Series: percent total correct; difference with best features. Classifiers: ADTree, J48, Logistic, SMO, Naïve Bayes, IBk, PART.)

Classifier Results

When examining the performance of the classifiers one notices:

1. J48 appears to be superior when all data is presented and is competitive on the reduced set.
2. Data reduction affects the original ADTree the least.
3. SMO with all data is competitive.
4. IBk with all data is competitive.
5. For ADTree and J48, data reduction helps in prediction on the POCUS dataset.
6. Naïve Bayes is not competitive with default settings for this task.
7. The Logistic classifier appears close to the ADTree in performance and may need additional tuning.

Feature Results

A reduced set of independent features should produce results similar to those of the full training set. If not, the analysis is suspect. If the analysis is valid, we would expect that classification using only the

successful subset of features found by the PROSPECTR application would give improved results. However, when we reduced the number of features to focus on the more prominent ones, a different set of classifiers unexpectedly predominated on the reduced data set, and performance improved in only 2 cases (see Table 4). While J48 has the best performance on the training set (88.7% correct), it shows the largest drop in performance on the reduced feature set.

Why? While it is possible that disease genes exhibit the statistically related features, as indicated by the disease-feature association table, it is also possible that the features do not predict disease. Logically, a disease implying a feature does not mean that the feature implies the disease. This might explain the lack of improvement on the statistically significant feature set.

Another possibility is that some of the features are not independent of each other and thus, while having different correlation profiles, are in fact dependent on each other. For instance, protein length, gene length and cDNA length would intuitively be related to each other. Longer proteins require longer DNA sequences to specify them, and so would be reflected in the other measures. However, genes involved in disease-causing phenotypes can also be short. So while diseases may be found associated with longer genes, the length measures alone would not be a good predictor.

Future Work and Conclusions

One of the reasons for the use of ADTree was the human-readable nature of the rules generated. While this is an important output, it may also be important to have higher absolute recall and precision when performing this task. The different classifiers showed different performance characteristics, and ADTree was found not to be the best along all dimensions of measure. This is important to consider, as new classifiers are constantly being generated by the machine learning and data mining community, and more information is constantly being generated by various genomic and bioinformatics projects. It can be expected that this will continue to be an area of research.

In conclusion, we have shown:

1. The J48 classifier performs better than the ADTree classifier, the one chosen by the PROSPECTR method.
2. The features that showed the largest differences in the PROSPECTR study were most likely a statistical anomaly.

It seems that using these machine learning methods to classify disease genes is not very productive. At best, they need to be combined with some other independent method and a more relevant feature set.

References

Euan Adie et al., Speeding disease gene discovery by sequence based candidate prioritization, BMC Bioinformatics 2005, 6:55.
Hammond MP, Birney E, Genome information resources - developments at Ensembl. Trends in Genetics 2004, 20:
Ian H. Witten and Eibe Frank. Data mining: Practical machine learning tools and techniques. Morgan Kaufmann, San Francisco, CA, USA,
i?rid=gnd

                Disease    Normal     Disease    Normal
Dataset         Recall     Recall     Precision  Precision  D-F      N-F      D as D  D as N  N as D  N as N

ADTree -B 15 -E -1
Self            --         68.33%     69.79%     71.79%     71.43%   70.02%   --      --      --      --
HGMD            --         61.99%     63.58%     64.81%     64.93%   63.37%   --      --      --      --
POCUS           --         68.00%     69.23%     70.83%     70.59%   69.39%   --      --      --      --
Self_Reduced    --         60.93%     65.52%     70.30%     69.62%   65.28%   --      --      --      --
HGMD_Reduced    --         56.66%     61.42%     64.64%     64.99%   60.39%   --      --      --      --
POCUS_Reduced   --         70.00%     71.15%     72.92%     72.55%   71.43%   --      --      --      --

J48 -C M 5
Self            --         85.09%     86.11%     91.81%     89.15%   88.32%   --      --      --      --
HGMD            --         62.41%     63.03%     63.48%     63.56%   62.94%   --      --      --      --
POCUS           --         62.00%     64.15%     65.96%     66.02%   63.92%   --      --      --      --
Self_Reduced    --         63.33%     69.61%     79.81%     76.12%   70.62%   --      --      --      --
HGMD_Reduced    --         52.17%     58.67%     61.90%     62.94%   56.62%   --      --      --      --
POCUS_Reduced   --         62.00%     67.80%     75.61%     73.39%   68.13%   --      --      --      --

Logistic -R 1.0E-8 -M
Self            --         66.39%     68.65%     71.56%     71.05%   68.88%   --      --      --      --
HGMD            --         61.29%     63.35%     64.93%     65.08%   63.06%   --      --      --      --
POCUS           --         66.00%     67.92%     70.21%     69.90%   68.04%   --      --      --      --
Self_Reduced    --         61.11%     63.98%     66.40%     66.43%   63.65%   --      --      --      --
HGMD_Reduced    --         60.59%     61.66%     62.34%     62.52%   61.45%   --      --      --      --
POCUS_Reduced   --         56.00%     62.71%     68.29%     67.89%   61.54%   --      --      --      --

SMO
Self            --         63.06%     68.13%     75.00%     73.16%   68.51%   --      --      --      --
HGMD            --         57.92%     63.10%     67.37%     67.23%   62.29%   --      --      --      --
POCUS           --         62.00%     68.33%     77.50%     74.55%   68.89%   --      --      --      --
Self_Reduced    --         65.65%     65.33%     65.05%     65.02%   65.35%   --      --      --      --
HGMD_Reduced    --         66.20%     64.45%     63.10%     62.83%   64.61%   --      --      --      --
POCUS_Reduced   --         62.00%     64.81%     67.39%     67.31%   64.58%   --      --      --      --

Naïve Bayes
Self            --         53.24%     62.37%     70.29%     69.12%   60.59%   --      --      --      --
HGMD            --         52.59%     58.48%     61.27%     62.34%   56.60%   --      --      --      --
POCUS           --         64.00%     69.49%     78.05%     75.23%   70.33%   --      --      --      --
Self_Reduced    --         80.93%     65.61%     55.99%     46.81%   66.19%   --      --      --      --
HGMD_Reduced    --         80.93%     60.47%     53.33%     39.36%   64.29%   --      --      --      --
POCUS_Reduced   --         76.00%     65.71%     58.46%     54.12%   66.09%   --      --      --      --

IBk -K 5 -W 0
Self            --         72.31%     73.93%     77.10%     76.16%   74.63%   --      --      --      --
HGMD            --         59.19%     62.50%     64.92%     65.14%   61.92%   --      --      --      --
POCUS           --         70.00%     71.70%     74.47%     73.79%   72.16%   --      --      --      --
Self_Reduced    --         71.11%     73.24%     77.26%     76.05%   74.06%   --      --      --      --
HGMD_Reduced    --         58.63%     61.49%     63.33%     63.69%   60.89%   --      --      --      --
POCUS_Reduced   --         64.00%     60.00%     58.18%     56.84%   60.95%   --      --      --      --

PART -M 2 -C Q 1
Self            --         63.15%     72.74%     97.43%     83.62%   76.63%   --      --      --      --
HGMD            --         45.16%     59.40%     69.55%     68.26%   54.76%   --      --      --      --
POCUS           --         54.00%     64.62%     77.14%     73.04%   63.53%   --      --      --      --
Self_Reduced    --         44.26%     63.25%     91.57%     76.23%   59.68%   --      --      --      --
HGMD_Reduced    --         36.04%     58.17%     76.49%     70.33%   49.00%   --      --      --      --
POCUS_Reduced   --         32.00%     58.54%     88.89%     72.73%   47.06%   --      --      --      --

Table 3: Classifier Performance Breakdown
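The recall and precision columns of Table 3 can be derived from the four confusion counts. The Python sketch below assumes that "D as N" means a disease gene classified as normal, and it treats D-F and N-F as the per-class F-measure (the harmonic mean of precision and recall). That reading is an assumption, although it is consistent with the percentages in the table. The function name class_metrics and the counts in the usage lines are made up for illustration.

```python
def class_metrics(d_as_d, d_as_n, n_as_d, n_as_n):
    """Per-class recall, precision and F-measure from confusion counts."""

    def ratio(num, den):
        # Guard against empty classes or empty predictions.
        return num / den if den else 0.0

    def f_measure(precision, recall):
        return ratio(2 * precision * recall, precision + recall)

    disease_recall = ratio(d_as_d, d_as_d + d_as_n)
    normal_recall = ratio(n_as_n, n_as_n + n_as_d)
    disease_precision = ratio(d_as_d, d_as_d + n_as_d)
    normal_precision = ratio(n_as_n, n_as_n + d_as_n)
    total = d_as_d + d_as_n + n_as_d + n_as_n
    return {
        "disease_recall": disease_recall,
        "normal_recall": normal_recall,
        "disease_precision": disease_precision,
        "normal_precision": normal_precision,
        "d_f": f_measure(disease_precision, disease_recall),
        "n_f": f_measure(normal_precision, normal_recall),
        "accuracy": ratio(d_as_d + n_as_n, total),
    }


# Made-up counts for a balanced set of 100 disease and 100 normal genes.
m = class_metrics(d_as_d=70, d_as_n=30, n_as_d=20, n_as_n=80)
```

On a balanced dataset such as this one, the overall fraction correct equals the mean of the two recalls, which is why a high total can still hide an imbalance between the classes.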

Table 4: Percentage of Instances correctly classified. Columns: omim_trainingset (63 factors), omim_trainingset_reduced (13 factors), Difference. Rows: self_prediction, hgmd_testset and pocus_testset for each of ADTree -B 15 -E -1, J48 -C M 5, Logistic -R 1.0E-8 -M, SMO, Naïve Bayes, IBk -K 5 -W 0 and PART -M 2 -C Q 1.
