Random Forest for Gene Expression Based Cancer Classification: Overlooked Issues


Oleg Okun 1 and Helen Priisalu 2
1 University of Oulu, Oulu 90014, Finland
2 Tallinn University of Technology, Tallinn 19086, Estonia

Abstract. A random forest is a collection (ensemble) of decision trees and a popular ensemble technique in pattern recognition. In this article, we apply random forest to cancer classification based on gene expression and address two issues that have so far been overlooked in other works. First, we demonstrate on two different real-world datasets that the performance of random forest is strongly influenced by dataset complexity. When estimated before running random forest, this complexity can serve as a useful performance indicator, and it can explain differences in performance across datasets. Second, we show that one should rely with caution on the feature importance used to rank genes: two forests, grown with different numbers of features per node split, may have very similar classification errors on the same dataset, yet the respective lists of genes ranked by feature importance can be weakly correlated.

1 Introduction

Gene expression based cancer classification is a supervised classification problem. However, unlike many other classification problems in machine learning, it is unusual because the number of features (gene expressions) far exceeds the number of cases (samples taken from patients). This atypical characteristic makes the task much more challenging than problems where the number of available cases is much larger than the number of features. In gene expression based cancer classification, a subset of the original genes is relevant and related to cancer, but the genes constituting this subset are frequently unknown and need to be discovered and selected by means of machine learning methods.
As remarked in [1], classification algorithms that provide measures of feature importance are of great interest for gene selection, especially if the classification algorithm itself ranks genes. One such algorithm is random forest. Random forest has not been frequently utilised in bioinformatics [1,2,3,4,5,6]. However, it has several properties that make it attractive, the most important being that 1) it does not overfit when the number of features exceeds the number of cases, 2) it implicitly performs feature selection, 3) it incorporates interactions among features, and 4) it returns a measure of feature importance. In addition, it has been claimed [1,2] that its performance is not much influenced by parameter choices.

J. Martí et al. (Eds.): IbPRIA 2007, Part II, LNCS 4478. © Springer-Verlag Berlin Heidelberg 2007

The most significant parameter of random forest is mtry, the number of features tried at each split of a decision tree. The authors of [2] claimed that the performance of random forest is often relatively insensitive to the choice of mtry, as long as mtry is far from its minimum or maximum possible values (1 or m, respectively, where m is the total number of features). Another parameter is the number of trees, which should be quite large (say, 500 to several thousand). In gene expression based cancer classification there are two goals: to achieve as high a classification rate as possible with as few genes as possible. Often researchers concentrate on high accuracy while overlooking analysis of the selected genes. Based on tests with two gene expression datasets, we discovered that although random forest performance in terms of error rate may be similar or identical for two different values of mtry, the gene rankings produced by the two forests on a given dataset can be weakly correlated. In other words, genes that are very important in one case can be almost irrelevant in the other. This is the first overlooked issue, emphasising that the feature importance provided by random forest should be treated with caution. The second overlooked issue concerns a less severe but nevertheless important problem. It is often said that random forests are competitive with other classifiers used in cancer research. We do not argue against this claim, but would like to emphasise that dataset complexity, computed before trying random forest on a dataset, can provide a useful performance estimate. We demonstrate, based on several complexity measures borrowed from [7], that the performance of random forest can be roughly predicted from these measures. Our goal was not to obtain precise numerical predictions but rather to obtain an indication of the expected performance without classifying a dataset.
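The mtry sensitivity check at the heart of our experiments can be sketched in code. The following is a minimal illustration using scikit-learn's random forest on synthetic data, not the Salford Systems software used in this paper; the dataset shape (74 cases, many features, only two truly informative) merely mimics the gene-expression setting:

```python
# Sketch: grow forests with three typical mtry values and compare
# OOB error and Gini-based importance rankings (synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_cases, n_features = 74, 200            # few cases, many features
X = rng.normal(size=(n_cases, n_features))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # only features 0 and 1 carry signal

sqrt_m = int(np.sqrt(n_features))
for mtry in (sqrt_m // 2, sqrt_m, 2 * sqrt_m):
    forest = RandomForestClassifier(
        n_estimators=500,     # the default forest size used in our experiments
        max_features=mtry,    # mtry: features tried at each node split
        oob_score=True,       # estimate accuracy on out-of-bag cases
        random_state=0,
    ).fit(X, y)
    # Ranking genes by Gini importance per mtry enables the
    # rank-correlation comparison discussed later.
    ranking = np.argsort(forest.feature_importances_)[::-1]
    print(f"mtry={mtry}: OOB error={1 - forest.oob_score_:.3f}, top-5={ranking[:5]}")
```

On this toy data the OOB errors stay close across mtry values, while the tails of the importance rankings can shuffle, which is the effect examined below.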
2 Random Forest

A random forest is a collection of fully grown CART-like decision trees (CART stands for Classification and Regression Tree) whose predictions are combined by averaging. For each tree, given that the total number of cases in a dataset is N, a training set is first generated by randomly choosing N times with replacement from all N cases (a bootstrap sample). It can be shown [8] that this bootstrap sample includes only about 2/3 of the original data. The remaining cases are used as a test (out-of-bag) set in order to estimate the out-of-bag (OOB) error of classification, which serves as a fair estimate of accuracy. If there are m features, a number mtry ≤ m is specified such that at each node, mtry out of the m features are randomly selected (thus, random forest uses two random mechanisms: bootstrap aggregation and random feature selection), and the best split on these mtry features is used to split the node. Various splitting criteria can be employed, such as the Gini index, information gain, or node impurity. The value of mtry is kept constant while the forest grows (typical values of mtry are approximately √m/2, √m, or 2√m). Unlike CART, each tree in the forest is fully grown without pruning. Each tree is a weak classifier, and because of this, averaging the predictions of many weak classifiers results in a significant accuracy improvement compared to a single tree. In other words, since the unpruned trees are low-bias, high-variance models, averaging over an ensemble of trees reduces variance while keeping bias low. In addition to being a useful estimate of classification accuracy, the out-of-bag error is also used to get estimates of feature importance. However, based on the out-of-bag error alone, it is difficult to define a sharp division between important and unimportant features.

3 Datasets

Two datasets were chosen for the experiments. They differ in the technology used to produce them and in dataset complexity. Dataset complexity is discussed in detail below.

3.1 SAGE Dataset

SAGE stands for Serial Analysis of Gene Expression [9,10]. It is a technology alternative to microarrays (cDNAs and oligonucleotides). Though SAGE was originally conceived for use in cancer studies, there is not much research applying ensembles of classifiers to SAGE datasets (to the best of our knowledge, this is the first study of random forests on SAGE data). SAGE provides a statistical description of the mRNA population present in a cell without prior selection of the genes to be studied [11]. This is the main distinction of SAGE from microarray approaches (cDNA and oligonucleotide), which are limited to the genes represented on the chip. SAGE counts the number of transcripts or tags for each gene, where the tags substitute for expression levels. As a result, counting sequence tags yields positive integer numbers, in contrast to microarray measurements. In the chosen dataset [12], there are expressions of 822 genes in 74 cases (24 cases are normal while 50 are cancerous) [13]. Unlike many other datasets with one or a few types of cancer, it contains 9 different types of cancer.
We decided to ignore the differences between cancer types and to treat all cancerous cases as belonging to a single class. No preprocessing was done.

3.2 Colon Dataset

This microarray (oligonucleotide) dataset [14], introduced in [15], contains expressions of 2,000 genes for 62 cases (22 normal and 40 colon tumour cases). Preprocessing consists of a logarithmic transformation to base 10, followed by normalisation to zero mean and unit variance, as is usually done with this dataset.

4 Dataset Complexity

It is known that the performance of individual classifiers and their ensembles is strongly data-dependent. It is often impossible to give any theoretical bounds on performance, or these bounds are limited to a few very specific cases and too weak to be useful in practice. To gain insight into a supervised classification problem such as gene expression based cancer classification, one can adopt the complexity measures introduced and studied in [7]. Knowing the dataset complexity can help to predict the behaviour of a certain classifier before it is applied to the dataset, though the prediction may not be absolute because of the finite dataset size. The complexity measures described below assume two-class problems, and they are classifier-independent, i.e., they do not rely on a particular classification model. Employing classifier-dependent measures would not provide an absolute scale for comparison. For example, it is well known that a nearest neighbour classifier can sometimes easily classify a highly nonlinear dataset. The following characteristics were adopted to estimate dataset complexity.

4.1 Fisher's Discriminant Ratio (F1)

Fisher's discriminant ratio for feature i is defined as f_i = (μ_1 − μ_2)² / (σ_1² + σ_2²), where μ_1, μ_2, σ_1², σ_2² are the means and variances of the two classes, respectively. The higher f_i (f_i → ∞ corresponds to two classes represented by two spatially separated points), the easier the classification problem. Hence F1 = max{f_i}, i = 1, ..., m.

4.2 Volume of Overlap Region (F2)

A similar measure is the overlap of the tails of the two class-conditional distributions. Let min(g_i, c_j) and max(g_i, c_j) be the minimum and maximum values of feature g_i in class c_j. Then the overlap measure F2 is defined as

F2 = ∏_{i=1}^{m} [MIN(max(g_i, c_1), max(g_i, c_2)) − MAX(min(g_i, c_1), min(g_i, c_2))] / [MAX(max(g_i, c_1), max(g_i, c_2)) − MIN(min(g_i, c_1), min(g_i, c_2))].

If F2 = 0, there is at least one feature for which the value ranges of the two classes do not overlap. In other words, the smaller F2, the easier the dataset is to classify.
4.3 Feature Efficiency (F3)

This measure accounts for how much each feature individually contributes to class separation. Each feature takes values in a certain interval. If the intervals of the two classes overlap, there is classification ambiguity in the overlapping region. The larger the number of cases lying outside this region, the easier the class separation. For linearly separable classes, the overlapping region is empty and therefore all cases lie outside it. For highly overlapped classes, this region is large and the number of cases lying outside it is small. Thus, feature efficiency is defined as the fraction of cases outside the overlapping region, and F3 is the maximum feature efficiency over all features.

5 Experimental Details

In all experiments below we used the Random Forest software from Salford Systems (San Diego, CA, USA), version 1.0. The number of trees in the forest was set to the default value, 500.
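The three measures of Sect. 4 can be computed in a few lines. Below is a sketch (our own illustration, not the reference implementation of [7]) for a two-class dataset with labels 0/1, applied to synthetic well-separated Gaussian classes:

```python
# Sketch of the complexity measures F1, F2, F3 for a two-class dataset.
import numpy as np

def complexity_measures(X, y):
    """Return (F1, F2, F3); X is cases x features, y holds labels 0/1."""
    A, B = X[y == 0], X[y == 1]
    # F1: maximum Fisher's discriminant ratio over all features
    fisher = (A.mean(axis=0) - B.mean(axis=0)) ** 2 / (A.var(axis=0) + B.var(axis=0))
    F1 = fisher.max()
    # Per-feature overlap region between the two classes
    lo = np.maximum(A.min(axis=0), B.min(axis=0))   # MAX of the class minima
    hi = np.minimum(A.max(axis=0), B.max(axis=0))   # MIN of the class maxima
    full = np.maximum(A.max(axis=0), B.max(axis=0)) - np.minimum(A.min(axis=0), B.min(axis=0))
    # F2: product of normalised overlap widths (0 if any feature separates the classes)
    F2 = np.prod(np.maximum(hi - lo, 0.0) / full)
    # F3: largest fraction of cases lying outside any single feature's overlap region
    F3 = ((X < lo) | (X > hi)).mean(axis=0).max()
    return F1, F2, F3

# Two well-separated Gaussian classes: expect high F1 and F3, near-zero F2
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (30, 5)), rng.normal(4.0, 1.0, (30, 5))])
y = np.repeat([0, 1], 30)
F1, F2, F3 = complexity_measures(X, y)
print(f"F1={F1:.2f}  F2={F2:.4f}  F3={F3:.2f}")
```

As the class separation grows, F1 and F3 increase while F2 shrinks toward zero, matching the intuition behind the definitions above.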

5.1 Complexity Measures

As can be seen from their definitions, the complexity measures are computed before classification. The values of all complexity measures for both datasets are summarised in Table 1. The SAGE dataset appears to be more complex to classify than the Colon dataset, so it is natural to expect a worse performance of random forest on the SAGE data. The higher complexity of the SAGE data is not very surprising, since this dataset comprises nine different types of cancer treated as one class, while the Colon data includes only one cancer type. Table 2 confirms this idea as well as the results from Table 1. Hence, it can be useful to estimate dataset complexity before applying random forest to a dataset in order to have a rough estimate of the achievable classification accuracy. Table 2 points to a dramatic performance degradation of random forest on the SAGE data compared to the Colon data. This in turn implies that random forest might not achieve acceptable performance on complex problems.

Table 1. Summary of dataset complexity measures (F1, F2, F3) for both datasets. Italicised values point to the more complex dataset according to each measure.

Table 2. OOB error rates. For each dataset, three typical values of mtry were tried.

5.2 Receiver Operating Characteristic

In addition to the OOB error, we also used the Receiver Operating Characteristic (ROC) for performance evaluation. The ROC is a plot of false positive rate (X-axis) versus true positive rate (Y-axis) of a binary classifier. The true positive rate (TPR) is defined as the ratio of the number of correctly classified positive cases to the total number of positive cases. The false positive rate (FPR) is defined as the ratio of incorrectly classified negative cases to the total number of negative cases. Cancer (normal) cases are positives (negatives).
TPR and FPR vary together as a threshold on a classifier's continuous output varies. The diagonal line y = x corresponds to a classifier that predicts class membership by random guessing; hence, all useful classifiers must have ROC curves above this line. The best possible classifier would yield a point in the upper left corner of the ROC space, i.e., all true positives are found and no false positives are produced.
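Sweeping the threshold over a classifier's continuous output generates the ROC points directly. The following sketch (toy scores, our own illustration) builds the curve and integrates it with the trapezoidal rule over consecutive FPR values:

```python
# Sketch: ROC points from threshold sweeping; cancer cases are positives.
import numpy as np

def roc_points(scores, labels):
    """FPR/TPR at every threshold on the classifier's continuous output."""
    order = np.argsort(scores)[::-1]            # sort cases by descending score
    labels = np.asarray(labels)[order]
    tpr = np.concatenate(([0.0], np.cumsum(labels == 1) / np.sum(labels == 1)))
    fpr = np.concatenate(([0.0], np.cumsum(labels == 0) / np.sum(labels == 0)))
    return fpr, tpr

# Toy continuous scores and true labels (1 = cancer, 0 = normal)
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2])
labels = np.array([1,   1,   0,   1,   1,    0,   0,   0])
fpr, tpr = roc_points(scores, labels)
# Area under the curve via the trapezoidal rule over consecutive FPR values
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
print(f"AUC = {auc:.3f}")   # 0.875 for this toy example
```

Here 14 of the 16 positive-negative pairs are ranked correctly, so the AUC of 14/16 = 0.875 also matches its probabilistic interpretation.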

The ROC curve is a two-dimensional depiction of classifier performance. To compare classifiers, one typically prefers a single scalar value: the Area Under the Curve (AUC). It is calculated by adding the areas under the ROC curve between each pair of consecutive FPR values, using, for example, the trapezoidal rule. Because the AUC is a portion of the area of the unit square, its value always lies between 0 and 1. Because random guessing produces the diagonal line between (0,0) and (1,1), which has an area of 0.5, no realistic classifier should have an AUC less than 0.5 [16]. The better a classifier performs, the higher its AUC. The AUC has an important statistical property: the AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive case higher than a randomly chosen negative case [16]. AUC values for both datasets and typical choices of mtry are shown in Table 3.

Table 3. AUC values. For each dataset, three typical values of mtry were tried.

Looking at Tables 2 and 3, one can notice that the performance of random forest on each dataset remains almost the same as mtry varies. This is the expected result, confirming the conclusions of other researchers. We went, however, one step further and analysed the gene rankings produced according to the Gini index of feature importance. The Gini index is computed as follows: for every node split by a feature in every tree in the forest, we have a measure of how much the split improved the separation between classes. Accumulating these improvements over the forest leads to scores that are then standardised. The most important gene always gets the maximum score and a rank of 1; the second most important gene gets a smaller score and a rank of 2, and so on. We used these ranks to compute rank correlation coefficients.
We opted for rank correlation coefficients such as Kendall's τ and Spearman's ρ instead of the linear (Pearson) correlation coefficient because they provide appropriate results even if the correlation between two variables is not linear. Both Kendall's τ and Spearman's ρ, with a correction for ties, were computed for all possible pairs of ranked gene lists (for details, see [17]). There were three pairs for each dataset because of the three values of mtry. Two statistical tests were done: a two-tailed test that the correlation is not zero and a one-tailed test that the correlation is greater than zero. For SAGE, the positive correlation, where it existed at significance levels 0.05 and 0.01, was weak at best, while for Colon it was even weaker. This means the gene ranks turned out to be almost uncorrelated. Hence, given two similar OOB error rates, one should use the feature importance provided by random forest with caution in order to avoid spurious conclusions about the biological relevance of top-ranked genes.
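The comparison of two importance rankings can be sketched with SciPy's tie-corrected rank correlations. The importance vectors below are synthetic stand-ins for the outputs of two forests grown with different mtry values, not our experimental data:

```python
# Sketch: rank correlation between two (synthetic) gene importance vectors.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
imp_forest_a = rng.random(100)   # stand-in for forest A's gene importances
imp_forest_b = rng.random(100)   # stand-in for forest B's gene importances

# Spearman's rho and Kendall's tau (both tie-corrected in SciPy);
# values near zero indicate almost uncorrelated rankings.
rho, p_rho = stats.spearmanr(imp_forest_a, imp_forest_b)
tau, p_tau = stats.kendalltau(imp_forest_a, imp_forest_b)
print(f"Spearman rho = {rho:.3f} (p = {p_rho:.3f})")
print(f"Kendall tau  = {tau:.3f} (p = {p_tau:.3f})")
```

With independently drawn importances, both coefficients hover near zero, which mirrors the near-uncorrelated rankings observed between forests with different mtry.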

The fact that different subsets of genes can be equally relevant for predicting cancer has already been highlighted in several works [18,19]. It was argued that one possible explanation for such multiplicity and non-uniqueness is the strong influence of the training set on gene selection. In other words, different groups of patients can lead to different gene importance rankings due to genuine differences between patients (cancer grade, stage, etc.). In random forest, the bootstrap naturally produces different training sets, and these sets overlap significantly. Although there are many trees in a random forest, it seems that multiplicity and non-uniqueness still cannot be avoided. This observation implies that, for random forest, the rank in the list is not necessarily a reliable indicator of gene importance. Despite this pessimistic conclusion, random forest remains a good predictive method, one that probably needs to be complemented by a more rigorous and careful analysis of the results.

6 Conclusion

We considered overlooked issues related to random forests for cancer classification based on gene expression. To facilitate biological interpretation, it is important to know which genes are relevant to cancer. It has been claimed that random forest can attach an importance measure to each gene, which may point to gene relevance. We showed that despite similar OOB errors for several typical choices of mtry, gene importance can vary significantly. Perhaps one alternative could be to combine explicit feature selection and random forest (see, e.g., [1,4]), but this needs extra verification, since it was reported in [1] (see "Stability (uniqueness) of results" there) that this strategy does not always lead to very stable results. In addition, dataset complexity computed before running random forest can be a useful performance predictor; based on it, users can decide whether to apply random forest or not.
References

1. Díaz-Uriarte, R., Alvarez de Andrés, S.: Gene Selection and Classification of Microarray Data Using Random Forest. BMC Bioinformatics 7 (2006)
2. Svetnik, V., Liaw, A., Tong, C., Wang, T.: Application of Breiman's Random Forest to Modeling Structure-Activity Relationships of Pharmaceutical Molecules. In: Roli, F., Kittler, J., Windeatt, T. (eds.) MCS 2004. LNCS, vol. 3077. Springer, Heidelberg (2004)
3. Wu, B., Abbot, T., Fishman, D., McMurray, W., Mor, G., Stone, K., Ward, D., Williams, K., Zhao, H.: Comparison of Statistical Methods for Classification of Ovarian Cancer Using Mass Spectrometry Data. Bioinformatics 19 (2003)
4. Geurts, P., Fillet, M., de Seny, D., Meuwis, M.-A., Malaise, M., Merville, M.-P., Wehenkel, L.: Proteomic Mass Spectra Classification Using Decision Tree Based Ensemble Methods. Bioinformatics 21 (2005)
5. Alvarez, S., Díaz-Uriarte, R., Osorio, A., Barroso, A., Melchor, L., Paz, M.F., Honrado, E., Rodríguez, R., Urioste, M., Valle, L., Díez, O., Cigudosa, J.C., Dopazo, J., Esteller, M., Benitez, J.: A Predictor Based on the Somatic Genomic Changes of the BRCA1/BRCA2 Breast Cancer Tumors Identifies the Non-BRCA1/BRCA2 Tumors with BRCA1 Promoter Hypermethylation. Clinical Cancer Research 11 (2005)
6. Gunther, E.C., Stone, D.J., Gerwein, R.W., Bento, P., Heyes, M.P.: Prediction of Clinical Drug Efficacy by Classification of Drug-Induced Genomic Expression Profiles in Vitro. Proc. Natl. Acad. Sci. 100 (2003)
7. Ho, T.K., Basu, M.: Complexity Measures of Supervised Classification Problems. IEEE Trans. Patt. Analysis and Machine Intell. 24 (2002)
8. Breiman, L.: Random Forests. Machine Learning 45, 5-32 (2001)
10. Velculescu, V.E., Zhang, L., Vogelstein, B., Kinzler, K.W.: Serial Analysis of Gene Expression. Science 270 (1995)
11. Aldaz, M.C.: Serial Analysis of Gene Expression (SAGE) in Cancer Research. In: Ladanyi, M., Gerald, W.L. (eds.) Expression Profiling of Human Tumors: Diagnostic and Research Applications. Humana Press, Totowa, NJ (2003)
12. Gandrillon, O.: Guide to the Gene Expression Data. In: Berka, P., Crémilleux, B. (eds.) Proc. the ECML/PKDD Discovery Challenge Workshop, Pisa, Italy (2004)
15. Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J.: Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays. Proc. Natl. Acad. Sci. 96 (1999)
16. Fawcett, T.: An Introduction to ROC Analysis. Patt. Recogn. Letters 27 (2006)
17. Sheskin, D.J.: Handbook of Parametric and Nonparametric Statistical Procedures. Chapman & Hall/CRC, Boca Raton (2004)
18. Ein-Dor, L., Kela, I., Getz, G., Givol, D., Domany, E.: Outcome Signature Genes in Breast Cancer: Is There a Unique Set? Bioinformatics 21 (2005)
19. Michiels, S., Koscielny, S., Hill, C.: Prediction of Cancer Outcome with Microarrays: a Multiple Random Validation Strategy. Lancet 365 (2005)


Introduction to Discrimination in Microarray Data Analysis Introduction to Discrimination in Microarray Data Analysis Jane Fridlyand CBMB University of California, San Francisco Genentech Hall Auditorium, Mission Bay, UCSF October 23, 2004 1 Case Study: Van t

More information

Identification of Tissue Independent Cancer Driver Genes

Identification of Tissue Independent Cancer Driver Genes Identification of Tissue Independent Cancer Driver Genes Alexandros Manolakos, Idoia Ochoa, Kartik Venkat Supervisor: Olivier Gevaert Abstract Identification of genomic patterns in tumors is an important

More information

Unit 1 Exploring and Understanding Data

Unit 1 Exploring and Understanding Data Unit 1 Exploring and Understanding Data Area Principle Bar Chart Boxplot Conditional Distribution Dotplot Empirical Rule Five Number Summary Frequency Distribution Frequency Polygon Histogram Interquartile

More information

The Logic of Data Analysis Using Statistical Techniques M. E. Swisher, 2016

The Logic of Data Analysis Using Statistical Techniques M. E. Swisher, 2016 The Logic of Data Analysis Using Statistical Techniques M. E. Swisher, 2016 This course does not cover how to perform statistical tests on SPSS or any other computer program. There are several courses

More information

4. Model evaluation & selection

4. Model evaluation & selection Foundations of Machine Learning CentraleSupélec Fall 2017 4. Model evaluation & selection Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr

More information

Using CART to Mine SELDI ProteinChip Data for Biomarkers and Disease Stratification

Using CART to Mine SELDI ProteinChip Data for Biomarkers and Disease Stratification Using CART to Mine SELDI ProteinChip Data for Biomarkers and Disease Stratification Kenna Mawk, D.V.M. Informatics Product Manager Ciphergen Biosystems, Inc. Outline Introduction to ProteinChip Technology

More information

PRACTICAL STATISTICS FOR MEDICAL RESEARCH

PRACTICAL STATISTICS FOR MEDICAL RESEARCH PRACTICAL STATISTICS FOR MEDICAL RESEARCH Douglas G. Altman Head of Medical Statistical Laboratory Imperial Cancer Research Fund London CHAPMAN & HALL/CRC Boca Raton London New York Washington, D.C. Contents

More information

Evaluating Classifiers for Disease Gene Discovery

Evaluating Classifiers for Disease Gene Discovery Evaluating Classifiers for Disease Gene Discovery Kino Coursey Lon Turnbull khc0021@unt.edu lt0013@unt.edu Abstract Identification of genes involved in human hereditary disease is an important bioinfomatics

More information

MODEL SELECTION STRATEGIES. Tony Panzarella

MODEL SELECTION STRATEGIES. Tony Panzarella MODEL SELECTION STRATEGIES Tony Panzarella Lab Course March 20, 2014 2 Preamble Although focus will be on time-to-event data the same principles apply to other outcome data Lab Course March 20, 2014 3

More information

Estimation of Area under the ROC Curve Using Exponential and Weibull Distributions

Estimation of Area under the ROC Curve Using Exponential and Weibull Distributions XI Biennial Conference of the International Biometric Society (Indian Region) on Computational Statistics and Bio-Sciences, March 8-9, 22 43 Estimation of Area under the ROC Curve Using Exponential and

More information

Machine Learning! Robert Stengel! Robotics and Intelligent Systems MAE 345,! Princeton University, 2017

Machine Learning! Robert Stengel! Robotics and Intelligent Systems MAE 345,! Princeton University, 2017 Machine Learning! Robert Stengel! Robotics and Intelligent Systems MAE 345,! Princeton University, 2017 A.K.A. Artificial Intelligence Unsupervised learning! Cluster analysis Patterns, Clumps, and Joining

More information

Data Mining in Bioinformatics Day 7: Clustering in Bioinformatics

Data Mining in Bioinformatics Day 7: Clustering in Bioinformatics Data Mining in Bioinformatics Day 7: Clustering in Bioinformatics Karsten Borgwardt February 21 to March 4, 2011 Machine Learning & Computational Biology Research Group MPIs Tübingen Karsten Borgwardt:

More information

Biomarker adaptive designs in clinical trials

Biomarker adaptive designs in clinical trials Review Article Biomarker adaptive designs in clinical trials James J. Chen 1, Tzu-Pin Lu 1,2, Dung-Tsa Chen 3, Sue-Jane Wang 4 1 Division of Bioinformatics and Biostatistics, National Center for Toxicological

More information

Mammogram Analysis: Tumor Classification

Mammogram Analysis: Tumor Classification Mammogram Analysis: Tumor Classification Term Project Report Geethapriya Raghavan geeragh@mail.utexas.edu EE 381K - Multidimensional Digital Signal Processing Spring 2005 Abstract Breast cancer is the

More information

Classification. Methods Course: Gene Expression Data Analysis -Day Five. Rainer Spang

Classification. Methods Course: Gene Expression Data Analysis -Day Five. Rainer Spang Classification Methods Course: Gene Expression Data Analysis -Day Five Rainer Spang Ms. Smith DNA Chip of Ms. Smith Expression profile of Ms. Smith Ms. Smith 30.000 properties of Ms. Smith The expression

More information

Review: Logistic regression, Gaussian naïve Bayes, linear regression, and their connections

Review: Logistic regression, Gaussian naïve Bayes, linear regression, and their connections Review: Logistic regression, Gaussian naïve Bayes, linear regression, and their connections New: Bias-variance decomposition, biasvariance tradeoff, overfitting, regularization, and feature selection Yi

More information

An Empirical and Formal Analysis of Decision Trees for Ranking

An Empirical and Formal Analysis of Decision Trees for Ranking An Empirical and Formal Analysis of Decision Trees for Ranking Eyke Hüllermeier Department of Mathematics and Computer Science Marburg University 35032 Marburg, Germany eyke@mathematik.uni-marburg.de Stijn

More information

6. Unusual and Influential Data

6. Unusual and Influential Data Sociology 740 John ox Lecture Notes 6. Unusual and Influential Data Copyright 2014 by John ox Unusual and Influential Data 1 1. Introduction I Linear statistical models make strong assumptions about the

More information

Statistical Assessment of the Global Regulatory Role of Histone. Acetylation in Saccharomyces cerevisiae. (Support Information)

Statistical Assessment of the Global Regulatory Role of Histone. Acetylation in Saccharomyces cerevisiae. (Support Information) Statistical Assessment of the Global Regulatory Role of Histone Acetylation in Saccharomyces cerevisiae (Support Information) Authors: Guo-Cheng Yuan, Ping Ma, Wenxuan Zhong and Jun S. Liu Linear Relationship

More information

Introduction to diagnostic accuracy meta-analysis. Yemisi Takwoingi October 2015

Introduction to diagnostic accuracy meta-analysis. Yemisi Takwoingi October 2015 Introduction to diagnostic accuracy meta-analysis Yemisi Takwoingi October 2015 Learning objectives To appreciate the concept underlying DTA meta-analytic approaches To know the Moses-Littenberg SROC method

More information

(CORRELATIONAL DESIGN AND COMPARATIVE DESIGN)

(CORRELATIONAL DESIGN AND COMPARATIVE DESIGN) UNIT 4 OTHER DESIGNS (CORRELATIONAL DESIGN AND COMPARATIVE DESIGN) Quasi Experimental Design Structure 4.0 Introduction 4.1 Objectives 4.2 Definition of Correlational Research Design 4.3 Types of Correlational

More information

Chapter 17 Sensitivity Analysis and Model Validation

Chapter 17 Sensitivity Analysis and Model Validation Chapter 17 Sensitivity Analysis and Model Validation Justin D. Salciccioli, Yves Crutain, Matthieu Komorowski and Dominic C. Marshall Learning Objectives Appreciate that all models possess inherent limitations

More information

A scored AUC Metric for Classifier Evaluation and Selection

A scored AUC Metric for Classifier Evaluation and Selection A scored AUC Metric for Classifier Evaluation and Selection Shaomin Wu SHAOMIN.WU@READING.AC.UK School of Construction Management and Engineering, The University of Reading, Reading RG6 6AW, UK Peter Flach

More information

The Analysis of Proteomic Spectra from Serum Samples. Keith Baggerly Biostatistics & Applied Mathematics MD Anderson Cancer Center

The Analysis of Proteomic Spectra from Serum Samples. Keith Baggerly Biostatistics & Applied Mathematics MD Anderson Cancer Center The Analysis of Proteomic Spectra from Serum Samples Keith Baggerly Biostatistics & Applied Mathematics MD Anderson Cancer Center PROTEOMICS 1 What Are Proteomic Spectra? DNA makes RNA makes Protein Microarrays

More information

An Improved Patient-Specific Mortality Risk Prediction in ICU in a Random Forest Classification Framework

An Improved Patient-Specific Mortality Risk Prediction in ICU in a Random Forest Classification Framework An Improved Patient-Specific Mortality Risk Prediction in ICU in a Random Forest Classification Framework Soumya GHOSE, Jhimli MITRA 1, Sankalp KHANNA 1 and Jason DOWLING 1 1. The Australian e-health and

More information

Colon cancer subtypes from gene expression data

Colon cancer subtypes from gene expression data Colon cancer subtypes from gene expression data Nathan Cunningham Giuseppe Di Benedetto Sherman Ip Leon Law Module 6: Applied Statistics 26th February 2016 Aim Replicate findings of Felipe De Sousa et

More information

Statistics 202: Data Mining. c Jonathan Taylor. Final review Based in part on slides from textbook, slides of Susan Holmes.

Statistics 202: Data Mining. c Jonathan Taylor. Final review Based in part on slides from textbook, slides of Susan Holmes. Final review Based in part on slides from textbook, slides of Susan Holmes December 5, 2012 1 / 1 Final review Overview Before Midterm General goals of data mining. Datatypes. Preprocessing & dimension

More information

Comments on Significance of candidate cancer genes as assessed by the CaMP score by Parmigiani et al.

Comments on Significance of candidate cancer genes as assessed by the CaMP score by Parmigiani et al. Comments on Significance of candidate cancer genes as assessed by the CaMP score by Parmigiani et al. Holger Höfling Gad Getz Robert Tibshirani June 26, 2007 1 Introduction Identifying genes that are involved

More information

1 Introduction. st0020. The Stata Journal (2002) 2, Number 3, pp

1 Introduction. st0020. The Stata Journal (2002) 2, Number 3, pp The Stata Journal (22) 2, Number 3, pp. 28 289 Comparative assessment of three common algorithms for estimating the variance of the area under the nonparametric receiver operating characteristic curve

More information

Introduction. We can make a prediction about Y i based on X i by setting a threshold value T, and predicting Y i = 1 when X i > T.

Introduction. We can make a prediction about Y i based on X i by setting a threshold value T, and predicting Y i = 1 when X i > T. Diagnostic Tests 1 Introduction Suppose we have a quantitative measurement X i on experimental or observed units i = 1,..., n, and a characteristic Y i = 0 or Y i = 1 (e.g. case/control status). The measurement

More information

Bootstrapped Integrative Hypothesis Test, COPD-Lung Cancer Differentiation, and Joint mirnas Biomarkers

Bootstrapped Integrative Hypothesis Test, COPD-Lung Cancer Differentiation, and Joint mirnas Biomarkers Bootstrapped Integrative Hypothesis Test, COPD-Lung Cancer Differentiation, and Joint mirnas Biomarkers Kai-Ming Jiang 1,2, Bao-Liang Lu 1,2, and Lei Xu 1,2,3(&) 1 Department of Computer Science and Engineering,

More information

MOST: detecting cancer differential gene expression

MOST: detecting cancer differential gene expression Biostatistics (2008), 9, 3, pp. 411 418 doi:10.1093/biostatistics/kxm042 Advance Access publication on November 29, 2007 MOST: detecting cancer differential gene expression HENG LIAN Division of Mathematical

More information

Bivariate variable selection for classification problem

Bivariate variable selection for classification problem Bivariate variable selection for classification problem Vivian W. Ng Leo Breiman Abstract In recent years, large amount of attention has been placed on variable or feature selection in various domains.

More information

RASA: Robust Alternative Splicing Analysis for Human Transcriptome Arrays

RASA: Robust Alternative Splicing Analysis for Human Transcriptome Arrays Supplementary Materials RASA: Robust Alternative Splicing Analysis for Human Transcriptome Arrays Junhee Seok 1*, Weihong Xu 2, Ronald W. Davis 2, Wenzhong Xiao 2,3* 1 School of Electrical Engineering,

More information

Clustering mass spectrometry data using order statistics

Clustering mass spectrometry data using order statistics Proteomics 2003, 3, 1687 1691 DOI 10.1002/pmic.200300517 1687 Douglas J. Slotta 1 Lenwood S. Heath 1 Naren Ramakrishnan 1 Rich Helm 2 Malcolm Potts 3 1 Department of Computer Science 2 Department of Wood

More information

Statistical Analysis of Single Nucleotide Polymorphism Microarrays in Cancer Studies

Statistical Analysis of Single Nucleotide Polymorphism Microarrays in Cancer Studies Statistical Analysis of Single Nucleotide Polymorphism Microarrays in Cancer Studies Stanford Biostatistics Workshop Pierre Neuvial with Henrik Bengtsson and Terry Speed Department of Statistics, UC Berkeley

More information

IEEE SIGNAL PROCESSING LETTERS, VOL. 13, NO. 3, MARCH A Self-Structured Adaptive Decision Feedback Equalizer

IEEE SIGNAL PROCESSING LETTERS, VOL. 13, NO. 3, MARCH A Self-Structured Adaptive Decision Feedback Equalizer SIGNAL PROCESSING LETTERS, VOL 13, NO 3, MARCH 2006 1 A Self-Structured Adaptive Decision Feedback Equalizer Yu Gong and Colin F N Cowan, Senior Member, Abstract In a decision feedback equalizer (DFE),

More information

Validating the Visual Saliency Model

Validating the Visual Saliency Model Validating the Visual Saliency Model Ali Alsam and Puneet Sharma Department of Informatics & e-learning (AITeL), Sør-Trøndelag University College (HiST), Trondheim, Norway er.puneetsharma@gmail.com Abstract.

More information

A Statistical Framework for Classification of Tumor Type from microrna Data

A Statistical Framework for Classification of Tumor Type from microrna Data DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2016 A Statistical Framework for Classification of Tumor Type from microrna Data JOSEFINE RÖHSS KTH ROYAL INSTITUTE OF TECHNOLOGY

More information

Detection Theory: Sensitivity and Response Bias

Detection Theory: Sensitivity and Response Bias Detection Theory: Sensitivity and Response Bias Lewis O. Harvey, Jr. Department of Psychology University of Colorado Boulder, Colorado The Brain (Observable) Stimulus System (Observable) Response System

More information

Department of Epidemiology, Rollins School of Public Health, Emory University, Atlanta GA, USA.

Department of Epidemiology, Rollins School of Public Health, Emory University, Atlanta GA, USA. A More Intuitive Interpretation of the Area Under the ROC Curve A. Cecile J.W. Janssens, PhD Department of Epidemiology, Rollins School of Public Health, Emory University, Atlanta GA, USA. Corresponding

More information

Business Statistics Probability

Business Statistics Probability Business Statistics The following was provided by Dr. Suzanne Delaney, and is a comprehensive review of Business Statistics. The workshop instructor will provide relevant examples during the Skills Assessment

More information

Classification of cancer profiles. ABDBM Ron Shamir

Classification of cancer profiles. ABDBM Ron Shamir Classification of cancer profiles 1 Background: Cancer Classification Cancer classification is central to cancer treatment; Traditional cancer classification methods: location; morphology, cytogenesis;

More information

INTRODUCTION TO STATISTICS SORANA D. BOLBOACĂ

INTRODUCTION TO STATISTICS SORANA D. BOLBOACĂ INTRODUCTION TO STATISTICS SORANA D. BOLBOACĂ OBJECTIVES Definitions Stages of Scientific Knowledge Quantification and Accuracy Types of Medical Data Population and sample Sampling methods DEFINITIONS

More information

Performance Analysis of Different Classification Methods in Data Mining for Diabetes Dataset Using WEKA Tool

Performance Analysis of Different Classification Methods in Data Mining for Diabetes Dataset Using WEKA Tool Performance Analysis of Different Classification Methods in Data Mining for Diabetes Dataset Using WEKA Tool Sujata Joshi Assistant Professor, Dept. of CSE Nitte Meenakshi Institute of Technology Bangalore,

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Midterm, 2016

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Midterm, 2016 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Midterm, 2016 Exam policy: This exam allows one one-page, two-sided cheat sheet; No other materials. Time: 80 minutes. Be sure to write your name and

More information

METHODS FOR DETECTING CERVICAL CANCER

METHODS FOR DETECTING CERVICAL CANCER Chapter III METHODS FOR DETECTING CERVICAL CANCER 3.1 INTRODUCTION The successful detection of cervical cancer in a variety of tissues has been reported by many researchers and baseline figures for the

More information

Applying Machine Learning Methods in Medical Research Studies

Applying Machine Learning Methods in Medical Research Studies Applying Machine Learning Methods in Medical Research Studies Daniel Stahl Department of Biostatistics and Health Informatics Psychiatry, Psychology & Neuroscience (IoPPN), King s College London daniel.r.stahl@kcl.ac.uk

More information

The Loss of Heterozygosity (LOH) Algorithm in Genotyping Console 2.0

The Loss of Heterozygosity (LOH) Algorithm in Genotyping Console 2.0 The Loss of Heterozygosity (LOH) Algorithm in Genotyping Console 2.0 Introduction Loss of erozygosity (LOH) represents the loss of allelic differences. The SNP markers on the SNP Array 6.0 can be used

More information

On testing dependency for data in multidimensional contingency tables

On testing dependency for data in multidimensional contingency tables On testing dependency for data in multidimensional contingency tables Dominika Polko 1 Abstract Multidimensional data analysis has a very important place in statistical research. The paper considers the

More information

BayesRandomForest: An R

BayesRandomForest: An R BayesRandomForest: An R implementation of Bayesian Random Forest for Regression Analysis of High-dimensional Data Oyebayo Ridwan Olaniran (rid4stat@yahoo.com) Universiti Tun Hussein Onn Malaysia Mohd Asrul

More information

Mammogram Analysis: Tumor Classification

Mammogram Analysis: Tumor Classification Mammogram Analysis: Tumor Classification Literature Survey Report Geethapriya Raghavan geeragh@mail.utexas.edu EE 381K - Multidimensional Digital Signal Processing Spring 2005 Abstract Breast cancer is

More information

Stage-Specific Predictive Models for Cancer Survivability

Stage-Specific Predictive Models for Cancer Survivability University of Wisconsin Milwaukee UWM Digital Commons Theses and Dissertations December 2016 Stage-Specific Predictive Models for Cancer Survivability Elham Sagheb Hossein Pour University of Wisconsin-Milwaukee

More information

Psychology, 2010, 1: doi: /psych Published Online August 2010 (

Psychology, 2010, 1: doi: /psych Published Online August 2010 ( Psychology, 2010, 1: 194-198 doi:10.4236/psych.2010.13026 Published Online August 2010 (http://www.scirp.org/journal/psych) Using Generalizability Theory to Evaluate the Applicability of a Serial Bayes

More information

Gene Ontology and Functional Enrichment. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein

Gene Ontology and Functional Enrichment. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein Gene Ontology and Functional Enrichment Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein The parsimony principle: A quick review Find the tree that requires the fewest

More information

Detection and Classification of Diabetic Retinopathy using Retinal Images

Detection and Classification of Diabetic Retinopathy using Retinal Images Detection and Classification of Diabetic Retinopathy using Retinal Images Kanika Verma, Prakash Deep and A. G. Ramakrishnan, Senior Member, IEEE Medical Intelligence and Language Engineering Lab Department

More information

A Hybrid Approach for Mining Metabolomic Data

A Hybrid Approach for Mining Metabolomic Data A Hybrid Approach for Mining Metabolomic Data Dhouha Grissa 1,3, Blandine Comte 1, Estelle Pujos-Guillot 2, and Amedeo Napoli 3 1 INRA, UMR1019, UNH-MAPPING, F-63000 Clermont-Ferrand, France, 2 INRA, UMR1019,

More information

Chapter 9: Comparing two means

Chapter 9: Comparing two means Chapter 9: Comparing two means Smart Alex s Solutions Task 1 Is arachnophobia (fear of spiders) specific to real spiders or will pictures of spiders evoke similar levels of anxiety? Twelve arachnophobes

More information

MODEL-BASED CLUSTERING IN GENE EXPRESSION MICROARRAYS: AN APPLICATION TO BREAST CANCER DATA

MODEL-BASED CLUSTERING IN GENE EXPRESSION MICROARRAYS: AN APPLICATION TO BREAST CANCER DATA International Journal of Software Engineering and Knowledge Engineering Vol. 13, No. 6 (2003) 579 592 c World Scientific Publishing Company MODEL-BASED CLUSTERING IN GENE EXPRESSION MICROARRAYS: AN APPLICATION

More information

Sheila Barron Statistics Outreach Center 2/8/2011

Sheila Barron Statistics Outreach Center 2/8/2011 Sheila Barron Statistics Outreach Center 2/8/2011 What is Power? When conducting a research study using a statistical hypothesis test, power is the probability of getting statistical significance when

More information

Supplementary Materials

Supplementary Materials Supplementary Materials July 2, 2015 1 EEG-measures of consciousness Table 1 makes explicit the abbreviations of the EEG-measures. Their computation closely follows Sitt et al. (2014) (supplement). PE

More information

Probability-Based Protein Identification for Post-Translational Modifications and Amino Acid Variants Using Peptide Mass Fingerprint Data

Probability-Based Protein Identification for Post-Translational Modifications and Amino Acid Variants Using Peptide Mass Fingerprint Data Probability-Based Protein Identification for Post-Translational Modifications and Amino Acid Variants Using Peptide Mass Fingerprint Data Tong WW, McComb ME, Perlman DH, Huang H, O Connor PB, Costello

More information

Classifica4on. CSCI1950 Z Computa4onal Methods for Biology Lecture 18. Ben Raphael April 8, hip://cs.brown.edu/courses/csci1950 z/

Classifica4on. CSCI1950 Z Computa4onal Methods for Biology Lecture 18. Ben Raphael April 8, hip://cs.brown.edu/courses/csci1950 z/ CSCI1950 Z Computa4onal Methods for Biology Lecture 18 Ben Raphael April 8, 2009 hip://cs.brown.edu/courses/csci1950 z/ Binary classifica,on Given a set of examples (x i, y i ), where y i = + 1, from unknown

More information

Learning with Rare Cases and Small Disjuncts

Learning with Rare Cases and Small Disjuncts Appears in Proceedings of the 12 th International Conference on Machine Learning, Morgan Kaufmann, 1995, 558-565. Learning with Rare Cases and Small Disjuncts Gary M. Weiss Rutgers University/AT&T Bell

More information

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES Correlational Research Correlational Designs Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are

More information

3. Model evaluation & selection

3. Model evaluation & selection Foundations of Machine Learning CentraleSupélec Fall 2016 3. Model evaluation & selection Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr

More information

Correlation and Regression

Correlation and Regression Dublin Institute of Technology ARROW@DIT Books/Book Chapters School of Management 2012-10 Correlation and Regression Donal O'Brien Dublin Institute of Technology, donal.obrien@dit.ie Pamela Sharkey Scott

More information

Modeling Sentiment with Ridge Regression

Modeling Sentiment with Ridge Regression Modeling Sentiment with Ridge Regression Luke Segars 2/20/2012 The goal of this project was to generate a linear sentiment model for classifying Amazon book reviews according to their star rank. More generally,

More information