Evaluating Classifiers for Disease Gene Discovery

Kino Coursey and Lon Turnbull
khc0021@unt.edu, lt0013@unt.edu

Abstract

Identification of genes involved in human hereditary disease is an important bioinformatics task. The tool PROSPECTR estimates the likelihood that a gene is involved in human hereditary disease by looking at patterns in sequence-based features, and was developed using the alternating decision tree algorithm in the Weka machine learning tool. Here we examine the performance of other classifiers on the same data and on a subset of the most statistically relevant features, and test the hypothesis that sequence-based features can predict disease involvement. We were able to find a better classifier, but our results raise questions about the predictive value of the originally selected features.

Introduction and Problem

Determining the degree to which a gene is involved in some genetic disease is an important bioinformatics task. The tool PROSPECTR (http://www.genetics.med.ed.ac.uk/prospectr/), developed at the University of Edinburgh, estimates the likelihood that a gene is involved in human hereditary disease by looking at patterns in DNA sequence features, and was developed using the alternating decision tree algorithm in the Weka machine learning tool.

In this project we extended the PROSPECTR project's work on defining classifiers that estimate the likelihood of a candidate region being involved in a genetic disease. While their work focused on limited-depth decision trees, we examined the other classifiers in the Weka machine learning tool set and measured their quality and accuracy. We also examined classifier performance over a subset of the most statistically relevant features, to test the predictive value of the sequence-based features. Any improvement in the accuracy of this classification task would allow high-probability sites to be given priority when searching for disease genes.

Background and Related Work

The ability to estimate the probability of a gene being involved in human hereditary disease is a very useful bioinformatics operation. More and more information on human and other genomes is constantly being collected, and an estimate of the probability that a given gene is involved in a disease phenotype would speed up this important biomedical task. Developing better classifiers is central to this approach.

Genetic diseases are caused by genes that have mutated so that the body, or some part of the body, no longer functions correctly. More than 100 known genetic disorders are the direct result of a mutation in a single gene. It is much more difficult to find the basis of polygenic diseases, which have a complex pattern of inheritance in which more than one gene must be mutated before susceptibility to a disease is expressed. (A polygenic disorder is a genetic disorder resulting from the combined action of alleles of more than one gene; examples include heart disease, diabetes, and some cancers.)

It has been suggested that genes related to hereditary disease might share common variations in their DNA sequence structure. The University of Edinburgh group used the alternating decision tree algorithm from Weka to test this hypothesis. They used 63 distinct features to compare about 18,000 genes not known to be involved in disease against the 1,084 genes listed in the Online Mendelian Inheritance in Man (OMIM), the National Center for Biotechnology Information database of genetic diseases with information on their clinical diagnosis and treatment, cell biology, biochemistry, and molecular medicine. On average, 70% of the disease genes were correctly identified by their automatic classifier, PROSPECTR.

What are Disease Genes?

Central to the classification process is defining what it means for a gene to be involved in a disease phenotype. One definition is any gene that has mutated in such a way that the proteins created from it are dysfunctional. However, mutation can occur in any gene, and the causes of disease form a continuum of genetic activity interacting with non-genetic factors. As The Metabolic and Molecular Bases of Inherited Disease puts it: "Every individual is a deviant in terms of biochemical individuality, meaning that every person has an inherited predisposition to disease in a particular circumstance."

Given this, how can one search for disease genes given such a fuzzy concept? The simple expedient taken by the PROSPECTR group was to take collections of genes identified as highly correlated with, or causally linked to, medically classified conditions from online gene data banks with disease-influence annotations.

PROSPECTR Results

The PROSPECTR group tested a number of DNA features in an attempt to find differences between disease genes and non-disease genes. Table 1 shows, for the 9 of the 24 features that had statistically significant differences, the ratio of the median in the disease set to the median in a control set. The larger the ratio, the greater the apparent dependence between the feature and disease status.

Feature                               Ratio
Gene encodes signal peptide           2.06
Gene length                           1.42
Protein length                        1.29
5' CpG islands                        1.33
Exon number                           1.25
cDNA length                           1.15
Distance to neighboring gene          1.13
3' UTR length                         1.09
Protein identity with BRH in mouse    1.09

Table 1: Ratio of disease-set median to control-set median per feature

Given these relationships between disease genes and features, we became interested in evaluating the performance of classifiers on a reduced feature set. If the features are predictive, then eliminating the noisier features should improve performance.
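The statistic behind Table 1 is a ratio of medians between the two gene sets. As a concrete illustration, the short Java sketch below computes such a ratio for one feature; the class name and the numeric values are hypothetical placeholders, not data from the PROSPECTR sets.

import java.util.Arrays;

public class MedianRatio {
    // Median of an array (sorts a copy so the caller's data is untouched).
    static double median(double[] v) {
        double[] s = v.clone();
        Arrays.sort(s);
        int n = s.length;
        return (n % 2 == 1) ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    public static void main(String[] args) {
        // Hypothetical gene-length values for a disease set and a control set.
        double[] diseaseGeneLength = {52000, 31000, 88000, 17000, 64000};
        double[] controlGeneLength = {28000, 19000, 45000, 12000, 33000};
        System.out.printf("gene-length median ratio: %.2f%n",
                median(diseaseGeneLength) / median(controlGeneLength));
    }
}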

Problem Formulation

Because of the benefit of finding better classifiers, our project was to evaluate the performance of a number of classifiers and compare them to PROSPECTR. In addition, based on our examination of the PROSPECTR group's work, we became interested in examining classifier performance using only the most statistically significant features. Our goal was to determine:

1. Does a better classifier exist for the data set?
2. How valid are the features used in the dataset for disease prediction?

Methodology

We take the simple approach of:

1. Duplicating the PROSPECTR team's results by obtaining their dataset and machine learning tool set.
2. Extending the examination by using the same machine learning tool set to evaluate other relevant classifiers.
3. Producing datasets with only the most statistically relevant features.
4. Measuring the performance of the originally selected classifiers and of those we examined on the reduced datasets.
5. Identifying the areas where the other classifiers exceed the performance of the PROSPECTR baseline.

The Datasets

The group at the University of Edinburgh provides the datasets used to create and test PROSPECTR (available from http://www.genetics.med.ed.ac.uk/prospectr/download.shtml). Three datasets are available to duplicate their work:

omim_trainingset: Genes from the Online Mendelian Inheritance in Man (OMIM) database, plus a set of genes from Ensembl that were not known to be involved in human disease.

hgmd_testset: An independent test set, with 675 genes in the Human Gene Mutation Database and 675 genes not known to be involved in disease.

pocus_testset: The POCUS list, made up of genes involved in oligogenic disorders (phenotypic traits produced by two or more genes working together).

Weka Data Mining Tool Set

Weka is a collection of machine learning algorithms for data mining tasks developed at the University of Waikato in New Zealand (Witten, 2004; available at http://www.cs.waikato.ac.nz/ml/weka/). Weka provides several classifier systems, including but not limited to decision trees, rule generators, statistical methods, Bayesian classifiers, SVMs, and neural networks. It contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It also has an active development and user community, which ensures that any new machine learning method conforming to the tabular dataset format becomes available, and the project has spawned several interesting branches (Grid Weka, Parallel Weka, BioWeka). The algorithms in Weka can either be applied directly to a dataset or called from your own Java code, and it is well suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License. A minimal example of driving Weka from Java appears below.
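The following sketch shows the train-and-evaluate pattern used throughout this study, here with the J48 classifier and the options reported below (-C 0.25 -M 5). It assumes the datasets are in ARFF format with the class attribute last; the file names are assumptions for illustration, as the actual PROSPECTR download may be packaged differently.

import java.io.BufferedReader;
import java.io.FileReader;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.Utils;

public class DiseaseGeneEval {
    public static void main(String[] args) throws Exception {
        // Load training and independent test data (file names are hypothetical).
        Instances train = new Instances(
                new BufferedReader(new FileReader("omim_trainingset.arff")));
        Instances test = new Instances(
                new BufferedReader(new FileReader("hgmd_testset.arff")));
        // The class attribute (disease vs. non-disease) is assumed to be last.
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        J48 tree = new J48();
        tree.setOptions(Utils.splitOptions("-C 0.25 -M 5")); // options used in this study
        tree.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);
        System.out.println(eval.toSummaryString()); // percent correct, error rates
        System.out.println(eval.toMatrixString());  // confusion matrix
    }
}

Substituting another classifier class (e.g., weka.classifiers.trees.ADTree or weka.classifiers.functions.SMO) in the same loop reproduces the comparisons summarized in Tables 3 and 4.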

The Classifiers Examined

We examined a cross-section of the classifiers currently provided by Weka.

ADTree -B 15 -E -1. The alternating decision tree is optimized for two-class problems. Each prediction node has an associated positive or negative numeric value. To obtain a prediction for an instance, filter it down all applicable branches and sum the values of every prediction node encountered; the final classification depends on whether the sum is positive or negative.

J48 -C 0.25 -M 5. J48 is a variant of the C4.5 decision tree induction algorithm. Test nodes in the tree are selected by their ability to produce a clearer separation between the classes. Leaf nodes contain either a single class, or a mixture in which the majority class is used as the classification. The minimum node size is 5.

Logistic -R 1.0E-8 -M. Linear logistic regression, a form of regression used for classification. It constructs a linear regression model of the logit transform of the class probability:

log(p / (1 - p)) = b0 + b1*x1 + b2*x2 + ... + bk*xk

where b0, ..., bk are chosen by a regression method to make optimal class predictions.

SMO. The Sequential Minimal Optimization algorithm for training a support vector machine (SVM). Training an SVM requires the solution of a very large quadratic programming (QP) optimization problem; SMO breaks this large QP problem into a series of the smallest possible QP problems. For classification, an SVM operates by finding a hypersurface in the space of possible inputs that attempts to split the positive examples from the negative examples.

Naïve Bayes. The standard probabilistic Naïve Bayes classifier. Using Bayes' theorem and per-class statistics, it selects the most probable classification.

IBk -K 5 -W 0. A k-nearest-neighbor classifier (k = 5). The class is chosen by majority vote of the 5 nearest training instances under a distance metric.

PART -M 2 -C 0.25 -Q 1. Obtains rules from partial decision trees built using C4.5 heuristics.

Evaluation: What Makes a Good Classifier?

One has to define how one would detect a good classifier. First, all of the datasets are balanced, with an equal number of disease and non-disease genes. Hence, ideally, the total correct should be high and the number correctly identified in each class should be balanced. Another approach is to select the classifier with the fewest misclassifications as the best choice. Details of the experiments can be found in Table 3.

Classifier     Percent total correct   Difference with best features
ADTree         75.5                    -3.1
J48            88.7                    -15.1
Logistic       70.0                    -4.9
SMO            72.3                    -6.0
Naïve Bayes    73.0                    -12.0
IBk (-K 5)     75.4                    -0.32
PART           80.7                    -10.6

Table 2: Percentage correct and change on the reduced feature set

Classifier Results

Examining the performance of the classifiers, one notices:

1. J48 appears superior when all data is presented and is competitive on the reduced set.
2. Data reduction affects the original ADTree the least.
3. SMO with all data is competitive.
4. IBk with all data is competitive.
5. For ADTree and J48, data reduction helps prediction on the POCUS dataset.
6. Naïve Bayes is not competitive with default settings for this task.
7. The Logistic classifier performs close to ADTree and may need additional tuning.

Feature Results

A reduced set of independent features should produce results similar to those on the full training set; if not, the analysis is suspect.
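For step 3 of the methodology, the reduced datasets can be produced with Weka's attribute filters. The sketch below keeps only a named subset of attribute indices plus the class attribute; the indices shown are placeholders for illustration, not the actual 13 features retained in our experiments.

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ReduceFeatures {
    // Keep only the listed attribute indices (1-based) plus the class
    // attribute ("last"); everything else is dropped. The indices here
    // are hypothetical, not the actual reduced feature set.
    public static Instances reduce(Instances data) throws Exception {
        Remove keep = new Remove();
        keep.setAttributeIndices("1,4,7,9,12,last");
        keep.setInvertSelection(true); // invert: keep these, remove the rest
        keep.setInputFormat(data);
        return Filter.useFilter(data, keep);
    }
}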

If the analysis is valid, we would expect classification using only the successful subset of features found by the PROSPECTR application to yield improved results. However, when we reduced the number of features to focus on the more prominent ones, a different set of classifiers unexpectedly predominated on the reduced data set, and in only 2 cases did performance improve (see Table 4). While J48 has the best performance on the training set (88.7% correct), it shows the largest drop in performance on the reduced feature set. Why?

While it is possible that disease genes imply the statistically related features, as indicated by the disease-feature association table, it is also possible that the features do not predict disease. Logically, a disease implying a feature does not mean that the feature implies the disease. This might explain the lack of improvement on the statistically significant feature set.

Another possibility is that some of the features are not independent of one another, and thus, while having different correlation profiles, are in fact mutually dependent. For instance, protein length, gene length, and cDNA length would intuitively be related: longer proteins require longer DNA sequences to specify them, which would be reflected in the other measures. However, genes involved in disease-causing phenotypes can also be short. So while diseases may be found associated with longer genes, the length measures alone would not be a good predictor.

Future Work and Conclusions

One of the reasons for the use of ADTree was the human-readable nature of the rules it generates. While that is an important output, it may also be important to have higher absolute recall and precision when performing this task. The different classifiers showed different performance characteristics, and ADTree was found not to be the best along all dimensions of measure. This is important to consider, as new classifiers are constantly being generated by the machine learning and data mining community, and more information is constantly being generated by various genomic and bioinformatics projects. It can be expected that this will continue to be an area of research.

In conclusion, we have shown:

1. The J48 classifier performs better than the ADTree classifier, the one chosen by the PROSPECTR method.
2. The features that showed the largest differences in the PROSPECTR study were most likely a statistical anomaly.
3. Using these machine learning methods to classify disease genes does not appear very productive on its own; at best it needs to be combined with some other independent method and a more relevant feature set.

References

Euan Adie et al., Speeding disease gene discovery by sequence based candidate prioritization, BMC Bioinformatics 2005, 6:55. http://www.biomedcentral.com/1471-2105/6/55

Hammond MP, Birney E, Genome information resources - developments at Ensembl. Trends in Genetics 2004, 20:268-272.

Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, San Francisco, CA, USA, 2004. http://www.cs.waikato.ac.nz/ml/weka/

http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=gnd

Dataset          D as D  D as N  N as D  N as N  D-Recall  N-Recall  D-Prec.  N-Prec.   D-F     N-F

ADTree -B 15 -E -1
Self              790     290     342     738    73.15%    68.33%    69.79%   71.79%   71.43%  70.02%
HGMD              473     240     271     442    66.34%    61.99%    63.58%   64.81%   64.93%  63.37%
POCUS              36      14      16      34    72.00%    68.00%    69.23%   70.83%   70.59%  69.39%
Self_Reduced      802     278     422     658    74.26%    60.93%    65.52%   70.30%   69.62%  65.28%
HGMD_Reduced      492     221     309     404    69.00%    56.66%    61.42%   64.64%   64.99%  60.39%
POCUS_Reduced      37      13      15      35    74.00%    70.00%    71.15%   72.92%   72.55%  71.43%

J48 -C 0.25 -M 5
Self              998      82     161     919    92.41%    85.09%    86.11%   91.81%   89.15%  88.32%
HGMD              457     256     268     445    64.10%    62.41%    63.03%   63.48%   63.56%  62.94%
POCUS              34      16      19      31    68.00%    62.00%    64.15%   65.96%   66.02%  63.92%
Self_Reduced      907     173     396     684    83.98%    63.33%    69.61%   79.81%   76.12%  70.62%
HGMD_Reduced      484     229     341     372    67.88%    52.17%    58.67%   61.90%   62.94%  56.62%
POCUS_Reduced      40      10      19      31    80.00%    62.00%    67.80%   75.61%   73.39%  68.13%

Logistic -R 1.0E-8 -M
Self              795     285     363     717    73.61%    66.39%    68.65%   71.56%   71.05%  68.88%
HGMD              477     236     276     437    66.90%    61.29%    63.35%   64.93%   65.08%  63.06%
POCUS              36      14      17      33    72.00%    66.00%    67.92%   70.21%   69.90%  68.04%
Self_Reduced      746     334     420     660    69.07%    61.11%    63.98%   66.40%   66.43%  63.65%
HGMD_Reduced      452     261     281     432    63.39%    60.59%    61.66%   62.34%   62.52%  61.45%
POCUS_Reduced      37      13      22      28    74.00%    56.00%    62.71%   68.29%   67.89%  61.54%

SMO
Self              853     227     399     681    78.98%    63.06%    68.13%   75.00%   73.16%  68.51%
HGMD              513     200     300     413    71.95%    57.92%    63.10%   67.37%   67.23%  62.29%
POCUS              41       9      19      31    82.00%    62.00%    68.33%   77.50%   74.55%  68.89%
Self_Reduced      699     381     371     709    64.72%    65.65%    65.33%   65.05%   65.02%  65.35%
HGMD_Reduced      437     276     241     472    61.29%    66.20%    64.45%   63.10%   62.83%  64.61%
POCUS_Reduced      35      15      19      31    70.00%    62.00%    64.81%   67.39%   67.31%  64.58%

Naïve Bayes
Self              837     243     505     575    77.50%    53.24%    62.37%   70.29%   69.12%  60.59%
HGMD              476     237     338     375    66.76%    52.59%    58.48%   61.27%   62.34%  56.60%
POCUS              41       9      18      32    82.00%    64.00%    69.49%   78.05%   75.23%  70.33%
Self_Reduced      393     687     206     874    36.39%    80.93%    65.61%   55.99%   46.81%  66.19%
HGMD_Reduced      208     505     136     577    29.17%    80.93%    60.47%   53.33%   39.36%  64.29%
POCUS_Reduced      23      27      12      38    46.00%    76.00%    65.71%   58.46%   54.12%  66.09%

IBk -K 5 -W 0
Self              848     232     299     781    78.52%    72.31%    73.93%   77.10%   76.16%  74.63%
HGMD              485     228     291     422    68.02%    59.19%    62.50%   64.92%   65.14%  61.92%
POCUS              38      12      15      35    76.00%    70.00%    71.70%   74.47%   73.79%  72.16%
Self_Reduced      854     226     312     768    79.07%    71.11%    73.24%   77.26%   76.05%  74.06%
HGMD_Reduced      471     242     295     418    66.06%    58.63%    61.49%   63.33%   63.69%  60.89%
POCUS_Reduced      27      23      18      32    54.00%    64.00%    60.00%   58.18%   56.84%  60.95%

PART -M 2 -C 0.25 -Q 1
Self             1062      18     398     682    98.33%    63.15%    72.74%   97.43%   83.62%  76.63%
HGMD              572     141     391     322    80.22%    45.16%    59.40%   69.55%   68.26%  54.76%
POCUS              42       8      23      27    84.00%    54.00%    64.62%   77.14%   73.04%  63.53%
Self_Reduced     1036      44     602     478    95.93%    44.26%    63.25%   91.57%   76.23%  59.68%
HGMD_Reduced      634      79     456     257    88.92%    36.04%    58.17%   76.49%   70.33%  49.00%
POCUS_Reduced      48       2      34      16    96.00%    32.00%    58.54%   88.89%   72.73%  47.06%

Table 3: Classifier performance breakdown. D = disease gene, N = normal (non-disease) gene; "D as N" is the count of disease genes classified as normal. D-Recall/N-Recall and D-Prec./N-Prec. are the per-class recall and precision; D-F and N-F are the corresponding per-class F-measures.

Classifier / dataset    All 63 factors   Reduced 13 factors   Difference

ADTree -B 15 -E -1
self_prediction            70.7407          67.5926            -3.1481
hgmd_testset               64.1655          62.8331            -1.3324
pocus_testset              70.0000          72.0000            +2.0000

J48 -C 0.25 -M 5
self_prediction            88.7500          73.6740           -15.0760
hgmd_testset               63.2539          60.0281            -3.2258
pocus_testset              65.0000          73.0000            +8.0000

Logistic -R 1.0E-8 -M
self_prediction            70.0000          65.0926            -4.9074
hgmd_testset               64.0954          61.9916            -2.1038
pocus_testset              69.0000          65.0000            -4.0000

SMO
self_prediction            71.0185          65.1852            -5.8333
hgmd_testset               64.9369          63.7447            -1.1922
pocus_testset              72.0000          66.0000            -6.0000

Naïve Bayes
self_prediction            65.3704          58.6574            -6.7130
hgmd_testset               59.6774          55.0491            -4.6283
pocus_testset              73.0000          61.0000           -12.0000

IBk -K 5 -W 0
self_prediction            75.4167          75.0926            -0.3241
hgmd_testset               63.6045          62.3422            -1.2623
pocus_testset              73.0000          59.0000           -14.0000

PART -M 2 -C 0.25 -Q 1
self_prediction            80.7407          70.0926           -10.6481
hgmd_testset               62.6928          62.4825            -0.2103
pocus_testset              69.0000          64.0000            -5.0000

Table 4: Percentage of instances correctly classified, for models built on the full omim_trainingset (63 factors) versus the reduced set (13 factors).