Random Forest for Gene Expression Based Cancer Classification: Overlooked Issues
Oleg Okun 1 and Helen Priisalu 2

1 University of Oulu, Oulu 90014, Finland
2 Tallinn University of Technology, Tallinn 19086, Estonia

Abstract. Random forest is a collection (ensemble) of decision trees and a popular ensemble technique in pattern recognition. In this article, we apply random forest to cancer classification based on gene expression and address two issues that have so far been overlooked in other works. First, we demonstrate on two different real-world datasets that the performance of random forest is strongly influenced by dataset complexity. When estimated before running random forest, this complexity can serve as a useful performance indicator and can explain differences in performance across datasets. Second, we show that the feature importance used to rank genes should be relied on with caution: two forests, generated with different numbers of features per node split, may have very similar classification errors on the same dataset, yet the respective lists of genes ranked by feature importance can be weakly correlated.

1 Introduction

Gene expression based cancer classification is a supervised classification problem. Unlike many other classification problems in machine learning, however, it is unusual in that the number of features (gene expressions) far exceeds the number of cases (samples taken from patients). This atypical characteristic makes the task much more challenging than problems where the number of available cases is much larger than the number of features. In gene expression based cancer classification, a subset of the original genes is relevant and related to cancer, but the genes constituting this subset are frequently unknown and need to be discovered and selected by means of machine learning methods.
As remarked in [1], classification algorithms providing measures of feature importance are of great interest for gene selection, especially if the classification algorithm itself ranks genes. One such algorithm is random forest. Random forest has not been frequently utilised in bioinformatics [1,2,3,4,5,6], yet it has several properties that make it attractive. The most important among them are that 1) it does not overfit when the number of features exceeds the number of cases, 2) it implicitly performs feature selection, 3) it incorporates interactions among features, and 4) it returns a measure of feature importance. In addition, it has been claimed [1,2] that its performance is not much influenced by parameter choices.

J. Martí et al. (Eds.): IbPRIA 2007, Part II, LNCS 4478, © Springer-Verlag Berlin Heidelberg 2007
The most significant parameter of random forest is mtry, the number of features used at each split of a decision tree. The authors of [2] claimed that the performance of random forest is often relatively insensitive to the choice of mtry as long as mtry is far from its minimum or maximum possible values (1 or m, respectively, where m is the total number of features). Another parameter is the number of trees, which should be quite large (say, 500 to several thousand).

In gene expression based cancer classification there are two goals: to achieve as high a classification rate as possible with as few genes as possible. Researchers often concentrate on high accuracy while overlooking the analysis of the selected genes. Based on tests with two gene expression datasets, we discovered that although the random forest performance in terms of error rate may be similar or the same for two different values of mtry, the gene rankings produced by the two forests applied to a given dataset can be weakly correlated. In other words, genes that are very important in one case can be almost irrelevant in another. This is the first overlooked issue, emphasising that the feature importance provided by random forest should be treated with caution.

Another overlooked issue concerns a less severe but nevertheless important problem. It is often said that random forests are competitive with other classifiers used in cancer research. We do not argue against this claim, but would like to emphasise that dataset complexity, computed before trying random forest on a certain dataset, can provide a useful performance estimate. We demonstrate, based on several complexity measures borrowed from [7], that the performance of random forest can be roughly predicted from these measures. Our goal was not to obtain precise numerical predictions but rather to attain an indication of the expected performance without classifying a dataset.
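As a concrete illustration of the mtry discussion above, the following sketch grows a 500-tree forest for three typical mtry values and reports the out-of-bag error. It assumes scikit-learn's RandomForestClassifier as a stand-in for the Salford Systems software used in our experiments, and synthetic data in place of the gene expression sets:

```python
# Sketch: checking the sensitivity of random forest to mtry via the
# out-of-bag (OOB) error. Synthetic data; scikit-learn stands in for the
# Salford Systems implementation used in the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Mimic the "few cases, many features" setting (74 cases, as in the SAGE data).
X, y = make_classification(n_samples=74, n_features=200,
                           n_informative=10, random_state=0)
m = X.shape[1]
sq = int(np.sqrt(m))
for mtry in (max(1, sq // 2), sq, 2 * sq):      # typical mtry choices
    rf = RandomForestClassifier(n_estimators=500, max_features=mtry,
                                oob_score=True, random_state=0)
    rf.fit(X, y)
    print(mtry, round(1.0 - rf.oob_score_, 3))  # OOB error rate
```

On such data the three OOB errors usually come out close to one another, in line with the insensitivity claim of [2].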
2 Random Forest

A random forest is a collection of fully grown CART-like (CART stands for Classification and Regression Tree) decision trees combined by averaging the predictions of the individual trees in the forest. For each tree, given that the total number of cases in a dataset is N, a training set is first generated by randomly choosing N times with replacement from all N cases (a bootstrap sample). It can be shown [8] that this bootstrap sample includes only about 2/3 of the original data. The remaining cases are used as a test (or out-of-bag) set in order to estimate the out-of-bag (OOB) classification error, which serves as a fair estimate of accuracy. If there are m features, a number mtry ≤ m is specified such that at each node, mtry out of the m features are randomly selected (thus, random forest uses two random mechanisms: bootstrap aggregation and random feature selection) and the best split on these mtry features is used to split the node. Various splitting criteria can be employed, such as the Gini index, information gain, or node impurity. The value of mtry is held constant while the forest is grown (typical values of mtry are approximately √m/2, √m, or 2√m). Unlike CART, each tree in the forest is fully grown without pruning. Each tree is a weak classifier, and because of this fact, averaging the
predictions of many weak classifiers results in a significant accuracy improvement compared to a single tree. In other words, since the unpruned trees are low-bias, high-variance models, averaging over an ensemble of trees reduces variance while keeping bias low. In addition to being a useful estimate of classification accuracy, the out-of-bag error is also used to get estimates of feature importance. However, based on the out-of-bag error alone, it is difficult to define a sharp division between important and unimportant features.

3 Datasets

Two datasets were chosen for the experiments. They differ in the technology used to produce them and in dataset complexity. Dataset complexity is discussed in detail below.

3.1 SAGE Dataset

SAGE stands for Serial Analysis of Gene Expression [9,10]. It is an alternative technology to microarrays (cDNAs and oligonucleotides). Though SAGE was originally conceived for use in cancer studies, there is not much research applying ensembles of classifiers to SAGE datasets (to the best of our knowledge, this is the first study of random forests on SAGE data). SAGE provides a statistical description of the mRNA population present in a cell without prior selection of the genes to be studied [11]. This is the main distinction of SAGE from microarray approaches (cDNA and oligonucleotide), which are limited to the genes represented on the chip. SAGE counts the number of transcripts, or tags, for each gene, where the tags substitute for the expression levels. As a result, counting sequence tags yields positive integer numbers, in contrast to microarray measurements. In the chosen dataset [12], there are expressions of 822 genes in 74 cases (24 cases are normal while 50 are cancerous) [13]. Unlike many other datasets with one or a few types of cancer, it contains 9 different types of cancer.
We decided to ignore the difference between cancer types and to treat all cancerous cases as belonging to a single class. No preprocessing was done.

3.2 Colon Dataset

This microarray (oligonucleotide) dataset [14], introduced in [15], contains expressions of 2,000 genes for 62 cases (22 normal and 40 colon tumour cases). Preprocessing includes a logarithmic transformation to base 10, followed by normalisation to zero mean and unit variance, as is usually done with this dataset.

4 Dataset Complexity

It is known that the performance of individual classifiers and their ensembles is strongly data-dependent. It is often impossible to give any theoretical bounds on
performance, or these bounds are limited to a few very specific cases and too weak to be useful in practice. To gain insight into a supervised classification problem such as gene expression based cancer classification, one can adopt the complexity measures introduced and studied in [7]. Knowing the dataset complexity can help to predict the behaviour of a certain classifier before it is applied to the dataset, though the prediction may not be absolute because of the finite dataset size. The complexity measures described below assume two-class problems and are classifier-independent, i.e., they do not rely on a certain classification model. Employing classifier-dependent measures would not provide an absolute scale for comparison. For example, it is well known that a nearest neighbour classifier can sometimes easily classify a highly nonlinear dataset. The following characteristics were adopted to estimate dataset complexity.

4.1 Fisher's Discriminant Ratio (F1)

Fisher's discriminant ratio for a single feature is defined as f = (μ1 − μ2)² / (σ1² + σ2²), where μ1, μ2 and σ1², σ2² are the means and variances of the two classes, respectively. The higher f, the easier the classification problem (in the limit, f → ∞ corresponds to two classes represented by two spatially separated points). Hence F1 = max{f_i}, i = 1, ..., m.

4.2 Volume of Overlap Region (F2)

A similar measure is the overlap of the tails of the two class-conditional distributions. Let min(g_i, c_j) and max(g_i, c_j) be the minimum and maximum values of feature g_i in class c_j. Then the overlap measure F2 is defined as the product over all m features of

[MIN(max(g_i, c_1), max(g_i, c_2)) − MAX(min(g_i, c_1), min(g_i, c_2))] / [MAX(max(g_i, c_1), max(g_i, c_2)) − MIN(min(g_i, c_1), min(g_i, c_2))],

i.e., the per-feature overlap width divided by the per-feature total range. If F2 = 0, there is at least one feature for which the value ranges of the two classes do not overlap. In other words, the smaller F2, the easier the dataset is to classify.
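A minimal numpy sketch of the F1 and F2 measures as defined above (the synthetic two-class data, and the convention of clipping negative overlap widths to zero, are our illustrative assumptions):

```python
# Sketch (numpy) of two complexity measures from Sect. 4: Fisher's
# discriminant ratio (F1) and the volume of the overlap region (F2).
# X is cases x features; y holds class labels in {0, 1}.
import numpy as np

def fisher_ratio_f1(X, y):
    X0, X1 = X[y == 0], X[y == 1]
    f = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2 \
        / (X0.var(axis=0) + X1.var(axis=0))
    return f.max()                                       # F1 = max over features

def overlap_volume_f2(X, y):
    X0, X1 = X[y == 0], X[y == 1]
    lo0, hi0 = X0.min(axis=0), X0.max(axis=0)
    lo1, hi1 = X1.min(axis=0), X1.max(axis=0)
    width = np.minimum(hi0, hi1) - np.maximum(lo0, lo1)  # overlap width
    full = np.maximum(hi0, hi1) - np.minimum(lo0, lo1)   # total range
    # A negative width means disjoint ranges; clipping to 0 then gives F2 = 0.
    return np.prod(np.clip(width, 0.0, None) / full)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 5)), rng.normal(3.0, 1.0, (30, 5))])
y = np.array([0] * 30 + [1] * 30)
print(fisher_ratio_f1(X, y), overlap_volume_f2(X, y))
```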
4.3 Feature Efficiency (F3)

This measure accounts for how much each feature individually contributes to the class separation. Each feature takes values in a certain interval. If the intervals of the two classes overlap, there is ambiguity of classification in the overlapping region. The larger the number of cases lying outside this region, the easier the class separation. For linearly separable classes, the overlapping region is empty and therefore all cases lie outside it. For highly overlapped classes, this region is large and the number of cases lying outside it is small. Thus, the efficiency of a feature is defined as the fraction of cases outside the overlapping region, and F3 is the maximum feature efficiency over all features.

5 Experimental Details

In all experiments below we used the Random Forest software from Salford Systems (San Diego, CA, USA), version 1.0. The number of trees in the forest was set to the default value, 500.
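The feature efficiency measure F3 defined in Sect. 4.3 can be sketched in the same style (synthetic data again assumed for illustration):

```python
# Sketch (numpy) of the maximum feature efficiency (F3): per feature, the
# fraction of cases lying outside the interval where the two class value
# ranges overlap; F3 is the maximum of these fractions over all features.
import numpy as np

def feature_efficiency_f3(X, y):
    X0, X1 = X[y == 0], X[y == 1]
    lo = np.maximum(X0.min(axis=0), X1.min(axis=0))  # overlap lower bound
    hi = np.minimum(X0.max(axis=0), X1.max(axis=0))  # overlap upper bound
    inside = (X >= lo) & (X <= hi)                   # cases in overlap region
    return (1.0 - inside.mean(axis=0)).max()

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (25, 4)), rng.normal(4.0, 1.0, (25, 4))])
y = np.array([0] * 25 + [1] * 25)
print(feature_efficiency_f3(X, y))
```

When the two ranges of a feature are disjoint, no case falls inside the (empty) overlap region and that feature's efficiency is 1, matching the linearly separable case described above.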
5.1 Complexity Measures

As can be seen from their definitions, the complexity measures are computed before classification. The values of all complexity measures are summarised in Table 1 for both datasets. The SAGE dataset appears to be more complex for classification than the Colon one. It is therefore natural to expect a worse performance of random forest on the SAGE data. The higher complexity of the SAGE data is not very surprising, since this dataset comprises nine different types of cancer treated as one class, while the Colon data includes only one cancer type. Table 2 confirms this idea as well as the results from Table 1. Hence, it can be useful to estimate the dataset complexity before applying random forest to a dataset in order to obtain a rough estimate of the achievable classification accuracy. Table 2 points to the dramatic performance degradation of random forest on the SAGE data compared to the Colon data. This in turn implies that random forest might not achieve acceptable performance in complex problems.

Table 1. Summary of dataset complexity measures (F1, F2, F3) for both datasets. Italicised values point to a more complex dataset according to each measure.

Table 2. OOB error rates. For each dataset, three typical values of mtry were tried.

5.2 Receiver Operating Characteristic

Besides the OOB error, we also utilised the Receiver Operating Characteristic (ROC) for performance evaluation. The ROC is a plot of the false positive rate (X-axis) versus the true positive rate (Y-axis) of a binary classifier. The true positive rate (TPR) is defined as the ratio of the number of correctly classified positive cases to the total number of positive cases. The false positive rate (FPR) is defined as the ratio of incorrectly classified negative cases to the total number of negative cases. Cancer (normal) cases are positives (negatives).
The TPR and FPR vary together as a threshold on a classifier's continuous output varies. The diagonal line y = x corresponds to a classifier which predicts class membership by random guessing. Hence, all useful classifiers must have ROC curves above this line. The best possible classifier would yield a point in the upper left corner of the ROC space, i.e., all true positives are found and no false positives are produced.
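The ROC construction described above can be sketched as follows (synthetic labels and scores assumed; the AUC helper applies the trapezoidal rule over consecutive FPR values):

```python
# Sketch (numpy): ROC points from a classifier's continuous scores and the
# AUC via the trapezoidal rule. Positives (label 1) are the cancer cases.
import numpy as np

def roc_points(scores, y):
    order = np.argsort(-scores)             # sweep the threshold downwards
    y = y[order]
    tpr = np.cumsum(y) / y.sum()            # true positive rate per cut-off
    fpr = np.cumsum(1 - y) / (1 - y).sum()  # false positive rate per cut-off
    return np.r_[0.0, fpr], np.r_[0.0, tpr]

def auc(fpr, tpr):
    # Area between consecutive FPR values, trapezoidal rule.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

y = np.array([1, 1, 0, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2])
fpr, tpr = roc_points(scores, y)
print(auc(fpr, tpr))  # 0.75 for this toy example
```

For this toy example, 12 of the 16 positive–negative pairs are ranked correctly, so the AUC of 0.75 also equals the pairwise ranking probability mentioned below.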
The ROC curve is a two-dimensional plot of classifier performance. To compare classifiers, one typically prefers to work with a single scalar value, called the Area Under Curve or AUC. It is calculated by adding the areas under the ROC curve between each pair of consecutive FPR values, using, for example, the trapezoidal rule. Because the AUC is a portion of the area of the unit square, its value always lies between 0 and 1. Because random guessing produces the diagonal line between (0,0) and (1,1), which has an area of 0.5, no realistic classifier should have an AUC less than 0.5 [16]. In fact, the better a classifier performs, the higher the AUC. The AUC has an important statistical property: the AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive case higher than a randomly chosen negative case [16]. AUC values for both datasets and typical choices of mtry are shown in Table 3.

Table 3. AUC values. For each dataset, three typical values of mtry were tried.

Looking at Tables 2 and 3, one can notice that the performance of random forest on each dataset remains almost the same as mtry varies. This is the expected result, confirming the conclusions of other researchers. We went, however, one step further and analysed the gene rankings produced according to the Gini index of feature importance. The Gini index is computed as follows. For every node split by a feature in every tree in the forest, we have a measure of how much the split improved the separation between the classes. Accumulating these improvements leads to scores that are then standardised. The most important gene always receives the highest standardised score and a rank of 1. The second most important gene gets a smaller score and a rank of 2, etc. We used these ranks to compute rank correlation coefficients.
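A sketch of deriving gene ranks from Gini-based feature importance, again substituting scikit-learn's RandomForestClassifier and synthetic data for the software and gene expression datasets actually used:

```python
# Sketch: gene ranks derived from Gini-based feature importance.
# scikit-learn's normalised importances stand in for the standardised
# scores produced by the Salford Systems software used in the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Mimic the Colon setting loosely: 62 cases, many features.
X, y = make_classification(n_samples=62, n_features=50,
                           n_informative=5, random_state=0)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

importance = rf.feature_importances_      # Gini importance, sums to 1
order = np.argsort(-importance)           # most important gene first
ranks = np.empty(importance.size, dtype=int)
ranks[order] = np.arange(1, importance.size + 1)  # rank 1 = top gene
print(order[:5], importance[order[:5]])
```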
We opted for rank correlation coefficients, namely Kendall's τ and Spearman's ρ, instead of the linear (Pearson) correlation coefficient, because they provide appropriate results even if the correlation between two variables is not linear. Both Kendall's τ and Spearman's ρ, with a correction for ties, were computed for all possible pairs of ranked gene lists (for details, see [17]). There were three pairs for each dataset because of the three values of mtry. Two statistical tests were performed: a two-tailed test that the correlation is not zero and a one-tailed test that the correlation is greater than zero. For SAGE, the positive correlation, where it existed at significance levels 0.05 and 0.01, was weak at best, while for Colon its value was even smaller. This means that the gene ranks turned out to be almost uncorrelated. Hence, given two similar OOB error rates, one should use the feature importance provided by random forest with caution in order to avoid spurious conclusions about the biological relevance of top-ranked genes.
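The rank correlation computation can be sketched with scipy.stats, whose implementations of both coefficients handle ties; two random permutations stand in here for the rankings produced by two forests:

```python
# Sketch (scipy): Kendall's tau and Spearman's rho between two gene
# rankings, e.g. from forests grown with different mtry. Two random
# permutations stand in for real rankings.
import numpy as np
from scipy.stats import kendalltau, spearmanr

rng = np.random.default_rng(0)
ranks_a = rng.permutation(np.arange(1, 101))  # ranking from forest 1
ranks_b = rng.permutation(np.arange(1, 101))  # ranking from forest 2

tau, p_tau = kendalltau(ranks_a, ranks_b)
rho, p_rho = spearmanr(ranks_a, ranks_b)
print(tau, rho)  # typically near zero for unrelated rankings
```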
The fact that different subsets of genes can be equally relevant when predicting cancer has already been highlighted in several works [18,19]. It was argued that one possible explanation for such multiplicity and non-uniqueness is a strong influence of the training set on gene selection. In other words, different groups of patients can lead to different gene importance rankings due to genuine differences between patients (cancer grade, stage, etc.). In random forest, the bootstrap naturally produces different training sets, and these sets have a significant overlap. Although there are many trees in a random forest, it seems that multiplicity and non-uniqueness still cannot be avoided. This observation implies that for random forest the rank in the list is not necessarily a reliable indicator of gene importance. Despite this pessimistic conclusion, random forest remains a good predictive method, which probably needs to be complemented by a more rigorous and careful analysis of the results.

6 Conclusion

We considered overlooked issues related to random forests for cancer classification based on gene expression. To facilitate biological interpretation, it is important to know which genes are relevant to cancer. It has been claimed that random forest can attach an importance measure to each gene, which may point to gene relevance. We showed that despite similar OOB errors for several typical choices of mtry, gene importance can vary significantly. One alternative could be to combine explicit feature selection and random forest (see, e.g., [1,4]), but this needs extra verification, since it was reported in [1] (see "Stability (uniqueness) of results" there) that this strategy does not always lead to very stable results. In addition, dataset complexity computed before running random forest can be a useful performance predictor. Based on it, users can decide whether to apply random forest or not.
References

1. Díaz-Uriarte, R., Alvarez de Andrés, S.: Gene Selection and Classification of Microarray Data Using Random Forest. BMC Bioinformatics, vol. 7 (2006)
2. Svetnik, V., Liaw, A., Tong, C., Wang, T.: Application of Breiman's Random Forest to Modeling Structure-Activity Relationships of Pharmaceutical Molecules. In: Roli, F., Kittler, J., Windeatt, T. (eds.) MCS 2004. LNCS, vol. 3077. Springer, Heidelberg (2004)
3. Wu, B., Abbot, T., Fishman, D., McMurray, W., Mor, G., Stone, K., Ward, D., Williams, K., Zhao, H.: Comparison of Statistical Methods for Classification of Ovarian Cancer Using Mass Spectrometry Data. Bioinformatics 19 (2003)
4. Geurts, P., Fillet, M., de Seny, D., Meuwis, M.-A., Malaise, M., Merville, M.-P., Wehenkel, L.: Proteomic Mass Spectra Classification Using Decision Tree Based Ensemble Methods. Bioinformatics 21 (2005)
5. Alvarez, S., Díaz-Uriarte, R., Osorio, A., Barroso, A., Melchor, L., Paz, M.F., Honrado, E., Rodríguez, R., Urioste, M., Valle, L., Díez, O., Cigudosa, J.C., Dopazo, J., Esteller, M., Benitez, J.: A Predictor Based on the Somatic Genomic Changes of the BRCA1/BRCA2 Breast Cancer Tumors Identifies the Non-BRCA1/BRCA2 Tumors with BRCA1 Promoter Hypermethylation. Clinical Cancer Research 11 (2005)
6. Gunther, E.C., Stone, D.J., Gerwein, R.W., Bento, P., Heyes, M.P.: Prediction of Clinical Drug Efficacy by Classification of Drug-Induced Genomic Expression Profiles in Vitro. Proc. Natl. Acad. Sci. 100 (2003)
7. Ho, T.K., Basu, M.: Complexity Measures of Supervised Classification Problems. IEEE Trans. Pattern Analysis and Machine Intelligence 24 (2002)
8. Breiman, L.: Random Forests. Machine Learning 45, 5–32 (2001)
10. Velculescu, V.E., Zhang, L., Vogelstein, B., Kinzler, K.W.: Serial Analysis of Gene Expression. Science 270 (1995)
11. Aldaz, M.C.: Serial Analysis of Gene Expression (SAGE) in Cancer Research. In: Ladanyi, M., Gerald, W.L. (eds.) Expression Profiling of Human Tumors: Diagnostic and Research Applications. Humana Press, Totowa, NJ (2003)
13. Gandrillon, O.: Guide to the Gene Expression Data. In: Berka, P., Crémilleux, B. (eds.) Proc. ECML/PKDD Discovery Challenge Workshop, Pisa, Italy (2004)
15. Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J.: Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays. Proc. Natl. Acad. Sci. 96 (1999)
16. Fawcett, T.: An Introduction to ROC Analysis. Pattern Recognition Letters 27 (2006)
17. Sheskin, D.J.: Handbook of Parametric and Nonparametric Statistical Procedures. Chapman & Hall/CRC, Boca Raton (2004)
18. Ein-Dor, L., Kela, I., Getz, G., Givol, D., Domany, E.: Outcome Signature Genes in Breast Cancer: Is There a Unique Set? Bioinformatics 21 (2005)
19. Michiels, S., Koscielny, S., Hill, C.: Prediction of Cancer Outcome with Microarrays: a Multiple Random Validation Strategy. Lancet 365 (2005)
Machine Learning! Robert Stengel! Robotics and Intelligent Systems MAE 345,! Princeton University, 2017 A.K.A. Artificial Intelligence Unsupervised learning! Cluster analysis Patterns, Clumps, and Joining
More informationData Mining in Bioinformatics Day 7: Clustering in Bioinformatics
Data Mining in Bioinformatics Day 7: Clustering in Bioinformatics Karsten Borgwardt February 21 to March 4, 2011 Machine Learning & Computational Biology Research Group MPIs Tübingen Karsten Borgwardt:
More informationBiomarker adaptive designs in clinical trials
Review Article Biomarker adaptive designs in clinical trials James J. Chen 1, Tzu-Pin Lu 1,2, Dung-Tsa Chen 3, Sue-Jane Wang 4 1 Division of Bioinformatics and Biostatistics, National Center for Toxicological
More informationMammogram Analysis: Tumor Classification
Mammogram Analysis: Tumor Classification Term Project Report Geethapriya Raghavan geeragh@mail.utexas.edu EE 381K - Multidimensional Digital Signal Processing Spring 2005 Abstract Breast cancer is the
More informationClassification. Methods Course: Gene Expression Data Analysis -Day Five. Rainer Spang
Classification Methods Course: Gene Expression Data Analysis -Day Five Rainer Spang Ms. Smith DNA Chip of Ms. Smith Expression profile of Ms. Smith Ms. Smith 30.000 properties of Ms. Smith The expression
More informationReview: Logistic regression, Gaussian naïve Bayes, linear regression, and their connections
Review: Logistic regression, Gaussian naïve Bayes, linear regression, and their connections New: Bias-variance decomposition, biasvariance tradeoff, overfitting, regularization, and feature selection Yi
More informationAn Empirical and Formal Analysis of Decision Trees for Ranking
An Empirical and Formal Analysis of Decision Trees for Ranking Eyke Hüllermeier Department of Mathematics and Computer Science Marburg University 35032 Marburg, Germany eyke@mathematik.uni-marburg.de Stijn
More information6. Unusual and Influential Data
Sociology 740 John ox Lecture Notes 6. Unusual and Influential Data Copyright 2014 by John ox Unusual and Influential Data 1 1. Introduction I Linear statistical models make strong assumptions about the
More informationStatistical Assessment of the Global Regulatory Role of Histone. Acetylation in Saccharomyces cerevisiae. (Support Information)
Statistical Assessment of the Global Regulatory Role of Histone Acetylation in Saccharomyces cerevisiae (Support Information) Authors: Guo-Cheng Yuan, Ping Ma, Wenxuan Zhong and Jun S. Liu Linear Relationship
More informationIntroduction to diagnostic accuracy meta-analysis. Yemisi Takwoingi October 2015
Introduction to diagnostic accuracy meta-analysis Yemisi Takwoingi October 2015 Learning objectives To appreciate the concept underlying DTA meta-analytic approaches To know the Moses-Littenberg SROC method
More information(CORRELATIONAL DESIGN AND COMPARATIVE DESIGN)
UNIT 4 OTHER DESIGNS (CORRELATIONAL DESIGN AND COMPARATIVE DESIGN) Quasi Experimental Design Structure 4.0 Introduction 4.1 Objectives 4.2 Definition of Correlational Research Design 4.3 Types of Correlational
More informationChapter 17 Sensitivity Analysis and Model Validation
Chapter 17 Sensitivity Analysis and Model Validation Justin D. Salciccioli, Yves Crutain, Matthieu Komorowski and Dominic C. Marshall Learning Objectives Appreciate that all models possess inherent limitations
More informationA scored AUC Metric for Classifier Evaluation and Selection
A scored AUC Metric for Classifier Evaluation and Selection Shaomin Wu SHAOMIN.WU@READING.AC.UK School of Construction Management and Engineering, The University of Reading, Reading RG6 6AW, UK Peter Flach
More informationThe Analysis of Proteomic Spectra from Serum Samples. Keith Baggerly Biostatistics & Applied Mathematics MD Anderson Cancer Center
The Analysis of Proteomic Spectra from Serum Samples Keith Baggerly Biostatistics & Applied Mathematics MD Anderson Cancer Center PROTEOMICS 1 What Are Proteomic Spectra? DNA makes RNA makes Protein Microarrays
More informationAn Improved Patient-Specific Mortality Risk Prediction in ICU in a Random Forest Classification Framework
An Improved Patient-Specific Mortality Risk Prediction in ICU in a Random Forest Classification Framework Soumya GHOSE, Jhimli MITRA 1, Sankalp KHANNA 1 and Jason DOWLING 1 1. The Australian e-health and
More informationColon cancer subtypes from gene expression data
Colon cancer subtypes from gene expression data Nathan Cunningham Giuseppe Di Benedetto Sherman Ip Leon Law Module 6: Applied Statistics 26th February 2016 Aim Replicate findings of Felipe De Sousa et
More informationStatistics 202: Data Mining. c Jonathan Taylor. Final review Based in part on slides from textbook, slides of Susan Holmes.
Final review Based in part on slides from textbook, slides of Susan Holmes December 5, 2012 1 / 1 Final review Overview Before Midterm General goals of data mining. Datatypes. Preprocessing & dimension
More informationComments on Significance of candidate cancer genes as assessed by the CaMP score by Parmigiani et al.
Comments on Significance of candidate cancer genes as assessed by the CaMP score by Parmigiani et al. Holger Höfling Gad Getz Robert Tibshirani June 26, 2007 1 Introduction Identifying genes that are involved
More information1 Introduction. st0020. The Stata Journal (2002) 2, Number 3, pp
The Stata Journal (22) 2, Number 3, pp. 28 289 Comparative assessment of three common algorithms for estimating the variance of the area under the nonparametric receiver operating characteristic curve
More informationIntroduction. We can make a prediction about Y i based on X i by setting a threshold value T, and predicting Y i = 1 when X i > T.
Diagnostic Tests 1 Introduction Suppose we have a quantitative measurement X i on experimental or observed units i = 1,..., n, and a characteristic Y i = 0 or Y i = 1 (e.g. case/control status). The measurement
More informationBootstrapped Integrative Hypothesis Test, COPD-Lung Cancer Differentiation, and Joint mirnas Biomarkers
Bootstrapped Integrative Hypothesis Test, COPD-Lung Cancer Differentiation, and Joint mirnas Biomarkers Kai-Ming Jiang 1,2, Bao-Liang Lu 1,2, and Lei Xu 1,2,3(&) 1 Department of Computer Science and Engineering,
More informationMOST: detecting cancer differential gene expression
Biostatistics (2008), 9, 3, pp. 411 418 doi:10.1093/biostatistics/kxm042 Advance Access publication on November 29, 2007 MOST: detecting cancer differential gene expression HENG LIAN Division of Mathematical
More informationBivariate variable selection for classification problem
Bivariate variable selection for classification problem Vivian W. Ng Leo Breiman Abstract In recent years, large amount of attention has been placed on variable or feature selection in various domains.
More informationRASA: Robust Alternative Splicing Analysis for Human Transcriptome Arrays
Supplementary Materials RASA: Robust Alternative Splicing Analysis for Human Transcriptome Arrays Junhee Seok 1*, Weihong Xu 2, Ronald W. Davis 2, Wenzhong Xiao 2,3* 1 School of Electrical Engineering,
More informationClustering mass spectrometry data using order statistics
Proteomics 2003, 3, 1687 1691 DOI 10.1002/pmic.200300517 1687 Douglas J. Slotta 1 Lenwood S. Heath 1 Naren Ramakrishnan 1 Rich Helm 2 Malcolm Potts 3 1 Department of Computer Science 2 Department of Wood
More informationStatistical Analysis of Single Nucleotide Polymorphism Microarrays in Cancer Studies
Statistical Analysis of Single Nucleotide Polymorphism Microarrays in Cancer Studies Stanford Biostatistics Workshop Pierre Neuvial with Henrik Bengtsson and Terry Speed Department of Statistics, UC Berkeley
More informationIEEE SIGNAL PROCESSING LETTERS, VOL. 13, NO. 3, MARCH A Self-Structured Adaptive Decision Feedback Equalizer
SIGNAL PROCESSING LETTERS, VOL 13, NO 3, MARCH 2006 1 A Self-Structured Adaptive Decision Feedback Equalizer Yu Gong and Colin F N Cowan, Senior Member, Abstract In a decision feedback equalizer (DFE),
More informationValidating the Visual Saliency Model
Validating the Visual Saliency Model Ali Alsam and Puneet Sharma Department of Informatics & e-learning (AITeL), Sør-Trøndelag University College (HiST), Trondheim, Norway er.puneetsharma@gmail.com Abstract.
More informationA Statistical Framework for Classification of Tumor Type from microrna Data
DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2016 A Statistical Framework for Classification of Tumor Type from microrna Data JOSEFINE RÖHSS KTH ROYAL INSTITUTE OF TECHNOLOGY
More informationDetection Theory: Sensitivity and Response Bias
Detection Theory: Sensitivity and Response Bias Lewis O. Harvey, Jr. Department of Psychology University of Colorado Boulder, Colorado The Brain (Observable) Stimulus System (Observable) Response System
More informationDepartment of Epidemiology, Rollins School of Public Health, Emory University, Atlanta GA, USA.
A More Intuitive Interpretation of the Area Under the ROC Curve A. Cecile J.W. Janssens, PhD Department of Epidemiology, Rollins School of Public Health, Emory University, Atlanta GA, USA. Corresponding
More informationBusiness Statistics Probability
Business Statistics The following was provided by Dr. Suzanne Delaney, and is a comprehensive review of Business Statistics. The workshop instructor will provide relevant examples during the Skills Assessment
More informationClassification of cancer profiles. ABDBM Ron Shamir
Classification of cancer profiles 1 Background: Cancer Classification Cancer classification is central to cancer treatment; Traditional cancer classification methods: location; morphology, cytogenesis;
More informationINTRODUCTION TO STATISTICS SORANA D. BOLBOACĂ
INTRODUCTION TO STATISTICS SORANA D. BOLBOACĂ OBJECTIVES Definitions Stages of Scientific Knowledge Quantification and Accuracy Types of Medical Data Population and sample Sampling methods DEFINITIONS
More informationPerformance Analysis of Different Classification Methods in Data Mining for Diabetes Dataset Using WEKA Tool
Performance Analysis of Different Classification Methods in Data Mining for Diabetes Dataset Using WEKA Tool Sujata Joshi Assistant Professor, Dept. of CSE Nitte Meenakshi Institute of Technology Bangalore,
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Midterm, 2016
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Midterm, 2016 Exam policy: This exam allows one one-page, two-sided cheat sheet; No other materials. Time: 80 minutes. Be sure to write your name and
More informationMETHODS FOR DETECTING CERVICAL CANCER
Chapter III METHODS FOR DETECTING CERVICAL CANCER 3.1 INTRODUCTION The successful detection of cervical cancer in a variety of tissues has been reported by many researchers and baseline figures for the
More informationApplying Machine Learning Methods in Medical Research Studies
Applying Machine Learning Methods in Medical Research Studies Daniel Stahl Department of Biostatistics and Health Informatics Psychiatry, Psychology & Neuroscience (IoPPN), King s College London daniel.r.stahl@kcl.ac.uk
More informationThe Loss of Heterozygosity (LOH) Algorithm in Genotyping Console 2.0
The Loss of Heterozygosity (LOH) Algorithm in Genotyping Console 2.0 Introduction Loss of erozygosity (LOH) represents the loss of allelic differences. The SNP markers on the SNP Array 6.0 can be used
More informationOn testing dependency for data in multidimensional contingency tables
On testing dependency for data in multidimensional contingency tables Dominika Polko 1 Abstract Multidimensional data analysis has a very important place in statistical research. The paper considers the
More informationBayesRandomForest: An R
BayesRandomForest: An R implementation of Bayesian Random Forest for Regression Analysis of High-dimensional Data Oyebayo Ridwan Olaniran (rid4stat@yahoo.com) Universiti Tun Hussein Onn Malaysia Mohd Asrul
More informationMammogram Analysis: Tumor Classification
Mammogram Analysis: Tumor Classification Literature Survey Report Geethapriya Raghavan geeragh@mail.utexas.edu EE 381K - Multidimensional Digital Signal Processing Spring 2005 Abstract Breast cancer is
More informationStage-Specific Predictive Models for Cancer Survivability
University of Wisconsin Milwaukee UWM Digital Commons Theses and Dissertations December 2016 Stage-Specific Predictive Models for Cancer Survivability Elham Sagheb Hossein Pour University of Wisconsin-Milwaukee
More informationPsychology, 2010, 1: doi: /psych Published Online August 2010 (
Psychology, 2010, 1: 194-198 doi:10.4236/psych.2010.13026 Published Online August 2010 (http://www.scirp.org/journal/psych) Using Generalizability Theory to Evaluate the Applicability of a Serial Bayes
More informationGene Ontology and Functional Enrichment. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein
Gene Ontology and Functional Enrichment Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein The parsimony principle: A quick review Find the tree that requires the fewest
More informationDetection and Classification of Diabetic Retinopathy using Retinal Images
Detection and Classification of Diabetic Retinopathy using Retinal Images Kanika Verma, Prakash Deep and A. G. Ramakrishnan, Senior Member, IEEE Medical Intelligence and Language Engineering Lab Department
More informationA Hybrid Approach for Mining Metabolomic Data
A Hybrid Approach for Mining Metabolomic Data Dhouha Grissa 1,3, Blandine Comte 1, Estelle Pujos-Guillot 2, and Amedeo Napoli 3 1 INRA, UMR1019, UNH-MAPPING, F-63000 Clermont-Ferrand, France, 2 INRA, UMR1019,
More informationChapter 9: Comparing two means
Chapter 9: Comparing two means Smart Alex s Solutions Task 1 Is arachnophobia (fear of spiders) specific to real spiders or will pictures of spiders evoke similar levels of anxiety? Twelve arachnophobes
More informationMODEL-BASED CLUSTERING IN GENE EXPRESSION MICROARRAYS: AN APPLICATION TO BREAST CANCER DATA
International Journal of Software Engineering and Knowledge Engineering Vol. 13, No. 6 (2003) 579 592 c World Scientific Publishing Company MODEL-BASED CLUSTERING IN GENE EXPRESSION MICROARRAYS: AN APPLICATION
More informationSheila Barron Statistics Outreach Center 2/8/2011
Sheila Barron Statistics Outreach Center 2/8/2011 What is Power? When conducting a research study using a statistical hypothesis test, power is the probability of getting statistical significance when
More informationSupplementary Materials
Supplementary Materials July 2, 2015 1 EEG-measures of consciousness Table 1 makes explicit the abbreviations of the EEG-measures. Their computation closely follows Sitt et al. (2014) (supplement). PE
More informationProbability-Based Protein Identification for Post-Translational Modifications and Amino Acid Variants Using Peptide Mass Fingerprint Data
Probability-Based Protein Identification for Post-Translational Modifications and Amino Acid Variants Using Peptide Mass Fingerprint Data Tong WW, McComb ME, Perlman DH, Huang H, O Connor PB, Costello
More informationClassifica4on. CSCI1950 Z Computa4onal Methods for Biology Lecture 18. Ben Raphael April 8, hip://cs.brown.edu/courses/csci1950 z/
CSCI1950 Z Computa4onal Methods for Biology Lecture 18 Ben Raphael April 8, 2009 hip://cs.brown.edu/courses/csci1950 z/ Binary classifica,on Given a set of examples (x i, y i ), where y i = + 1, from unknown
More informationLearning with Rare Cases and Small Disjuncts
Appears in Proceedings of the 12 th International Conference on Machine Learning, Morgan Kaufmann, 1995, 558-565. Learning with Rare Cases and Small Disjuncts Gary M. Weiss Rutgers University/AT&T Bell
More information11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES
Correlational Research Correlational Designs Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are
More information3. Model evaluation & selection
Foundations of Machine Learning CentraleSupélec Fall 2016 3. Model evaluation & selection Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr
More informationCorrelation and Regression
Dublin Institute of Technology ARROW@DIT Books/Book Chapters School of Management 2012-10 Correlation and Regression Donal O'Brien Dublin Institute of Technology, donal.obrien@dit.ie Pamela Sharkey Scott
More informationModeling Sentiment with Ridge Regression
Modeling Sentiment with Ridge Regression Luke Segars 2/20/2012 The goal of this project was to generate a linear sentiment model for classifying Amazon book reviews according to their star rank. More generally,
More information