A methodology for the analysis of medical data

Please cite this book chapter as: A. Tsanas, M.A. Little, P.E. McSharry, A methodology for the analysis of medical data, Handbook of Systems and Complexity in Health, Springer, New York, pp , 2013

A methodology for the analysis of medical data

by A. Tsanas 1,2,*, M.A. Little 2,3, P.E. McSharry 1,2,4
* Asterisk denotes corresponding author (tsanas@maths.ox.ac.uk, tsanasthanasis@gmail.com)
1 Systems Analysis, Modelling and Prediction (SAMP), Department of Engineering Science, University of Oxford, Oxford, UK
2 Oxford Centre for Industrial and Applied Mathematics (OCIAM), University of Oxford, Oxford, UK
3 Media Lab, Massachusetts Institute of Technology, Cambridge, MA, USA
4 Smith School of Enterprise and the Environment, University of Oxford, Oxford, UK

Abstract

This chapter aims to provide a methodology for the quantitative analysis of certain kinds of medical data. It is mainly aimed at clinical practitioners who are interested in data analysis, and intends to offer a succinct guide that may prove useful across a wide range of medical applications. To illustrate the proposed steps in this methodological guide, we use a classical medical dataset to show how these steps are applied in practice. The generic applicability of this guide is demonstrated by investigating several publicly available datasets in diverse medical applications, ranging from cancer diagnosis to micro-array data analysis. We start by introducing some commonly used terminology, and briefly discuss some standard ideas behind statistical analysis and hypothesis testing. Subsequently, we describe classification techniques to predict the outcome given the explanatory variables. These techniques offer an improved understanding that goes beyond the use of correlations and

to quantify importance (or, more mathematically accurately, statistical significance). The key finding is that a powerful nonlinear classifier is consistently superior to logistic regression (a classifier often used by clinicians), offering a relative improvement in performance of 36% in predicting the outcome across six different datasets. We urge clinicians to apply a similar methodology for investigating the predictive information contained within their datasets, which may otherwise remain concealed at the initial step of statistical exploration.

Introduction and terminology

Imagine a subject going to the clinic for a medical diagnosis, for example to assess the functionality of his cardiovascular system. The doctor requests a number of clinical tests (for example, a stress test to obtain the electrocardiogram (ECG), and Doppler ultrasound), takes into account a number of other factors (for example the demographics of the subject), and makes his final diagnosis using the current data and his prior knowledge. For his diagnosis, the doctor will usually compute some characteristics of the original raw signal. For example, when the raw signal is the ECG, clinicians may want to use the mean heart rate or the heart rate variability (these characteristics may also be readily provided by medical software) because experience has taught them that these characteristics are useful in diagnosis. The discipline of statistical machine learning (informally, data analysis) offers a framework which allows researchers to decipher what the computed characteristics reveal, and how these characteristics could be used to offer a decision support tool. A further aim is to investigate whether additional characteristics, which may have been previously ignored, could or should be taken into account. A guide on detecting patterns is outside the scope of this study; instead, we focus on the case where a number of characteristics have been

collected (as indicated above, these characteristics might have been extracted from the original raw signals, demographic data, values of genes, the concentration of a particular component in a given area, and others). Characteristics which are qualitative can be assigned to an ordinal scale 1. For example, in the case that medical practitioners characterize cells as having (a) low concentrations, (b) moderate concentrations, and (c) large concentrations, we could define a scale that would read: low = 1, moderate = 2, large = 3. For reasons that will become clear later, it is advisable to use progressively increasing values of the ordinal scale, starting with the healthy condition and characterizing pathological situations with higher values. Each characteristic is represented by a single scalar value. We have purposefully avoided the use of mathematical terms so far, but here we need to define some terms that are commonly used in statistical settings: explanatory variables (or features), and response variable (or simply response). The term feature is equivalent to the computed characteristic, and the term response variable can be thought of as equivalent to the diagnosis or the clinical outcome. In most medical settings, the diagnosis or clinical outcome can take a small range of possible values. For example, the final diagnosis of a clinician may simply be a yes or no to a question (e.g. whether a subject has cancer), and might also include a third clinical outcome, e.g. possibly. These two or three possible outcomes can be represented using an ordinal scale as indicated previously for the characteristics, i.e. the response variable takes possible values such as 0, 1 (and, with a third outcome, 2). This example can be generalized to a broader setting, where for example a number of explanatory variables are used to assign subjects to different pathologies. The possible values that the response variable can take are simply known as categories or classes.
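The ordinal encoding just described can be sketched in a few lines of Python (a minimal illustration; the scale values and variable names here are our own):

```python
# Map qualitative cell-concentration labels to an ordinal scale with
# progressively increasing values (our own illustrative encoding).
concentration_scale = {"low": 1, "moderate": 2, "large": 3}

cells = ["low", "large", "moderate", "low"]
encoded = [concentration_scale[c] for c in cells]
print(encoded)  # [1, 3, 2, 1]
```

Each qualitative characteristic then becomes a single scalar value per sample, as required by the analysis below.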
1 The term ordinal scale refers to a hierarchical ordering of values to differentiate the different possible outcomes, where the difference between successive values within the scale is not necessarily equal.

When the response variable can take any of a finite number of classes, the

problem of predicting the response variable is known as classification 2. When the response variable can take any real value (any number from −∞ to +∞), the problem is known as regression. Classification problems are met considerably more frequently in medical applications, and hence we focus exclusively on those cases. It is important to stress that accurate statistical inference is only possible when a relatively large number of data samples is collected. A good rule of thumb is to use at least 15 data samples from each class of the response variable. Data in most medical applications can be represented in the form of a design matrix (or data matrix) X with N rows and M columns, where each row includes the explanatory variables for one subject. That is, each row contains the concatenated vector of the explanatory variables which characterize the subject's condition. For example, the first row could contain the age of the first subject, the gender, the mean heart rate, the heart rate variability, and so on. Effectively, X simply summarizes the explanatory variables for the N observations (samples); each row usually refers to a different subject. Each column of X contains the values of one explanatory variable across all samples, and is denoted f_i, i = 1, …, M. The response variable y is believed to be associated with X based on prior knowledge of the given problem. It is populated with the outcomes for each sample: for example, the first entry could be 0 to denote a healthy state, the second 1 to denote a pathological state for the second subject, and so on. Once the data is summarized in a format like the one presented above, with X and y, the aim is to decipher the concealed information. Questions such as the following are frequently met in medical contexts (the list is only indicative):
1. How can we associate X and y? That is, what is the relationship between the explanatory variables and the response variable?
2 When the response variable can only be one of two classes, the problem is referred to as binary classification; when there are more than two classes, the problem is known as multi-class classification. Binary classification problems are met very frequently in medical applications, for example differentiating whether patients live or die.

2. Is there a convenient way to estimate the response variable when presented with the explanatory variables of a subject?
3. Which of the explanatory variables are useful in actually determining the response variable?
4. What is the relationship between the explanatory variables? Is it possible that some of the explanatory variables are redundant and need not be computed?

We will demonstrate that when analysis is confined to reporting statistical significance values (p-values), these clinically important questions cannot be adequately answered.

Data exploration and statistical analysis

Usually, the first step in data analysis is to explore the statistical properties of the data, and to produce some plots to get an intuitive feel. Initially, the probability densities of the explanatory variables can be plotted; the simplest approach is to use histograms 3. Histograms provide a nice overview of the distribution of values for each explanatory variable, and for the response variable. They use a number of bins (for example 10) which span the range of possible values of the investigated variable, and count the number of data samples that fall into the range of each bin, thus providing a general impression of the spread of the values for this variable. In addition to density plots, we suggest using scatter plots: a scatter plot has one explanatory variable on the x-axis and the response variable on the y-axis. Scatter plots are useful to visualize whether there is any obvious relationship between the investigated explanatory variable and the response variable. Scatter plots can be used for each of the explanatory variables to present the (f_i, y) pairs very simply in a figure.

3 In general, histograms are considered a simple but rather crude approach. Kernel density estimation is typically preferable (see Hastie et al. (2009) for more details), and can be thought of as a smoothed version of a histogram.
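The histogram step can be sketched with NumPy (a minimal illustration on synthetic feature values of our own making):

```python
import numpy as np

# Bin one explanatory variable into 10 histogram bins spanning its range.
# The "heart-rate-like" feature below is synthetic, for illustration only.
rng = np.random.default_rng(0)
feature = rng.normal(loc=120, scale=15, size=200)

counts, bin_edges = np.histogram(feature, bins=10)

print(counts.sum())    # 200: every sample falls in exactly one bin
print(len(bin_edges))  # 11: ten bins need eleven edges
```

The `counts` vector is what a histogram plot displays; a scatter plot would instead pair each feature value with the corresponding response value.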

Visual inspection of density plots and scatter plots is usually followed by formal statistical tests in order to determine qualitatively and quantitatively how well the explanatory variables are related to the response variable. Correlation analysis offers a good indication of the association between each explanatory variable and the response variable, and between explanatory variables (pairwise correlations). However, we emphasize that correlation does not, in general, imply causation (a change in the values of the explanatory variable affecting the response variable) (Aldrich, 1995). Correlation coefficients should be regarded as a valuable hint indicating a potential relationship between the explanatory variable and the response. We endorse the use of the Spearman rank correlation coefficient, which can account for general monotonic relationships and is in general preferable to the linear (Pearson) correlation coefficient (which is more appropriate in linear settings). Strictly speaking, formal statistical hypothesis tests (see the following paragraph) should be used to check whether the data follow normality (one characteristic of normality is a histogram resembling a bell-shaped curve). In practice, medical data will typically deviate from normality, and hence the Spearman correlation coefficient should generally be used. Both the Spearman rank correlation coefficient and the linear correlation coefficient lie in the numeric range [−1, 1], and are interpreted using (a) the sign of the correlation coefficient, which denotes the direction of the relationship, and (b) the magnitude (absolute value) of the correlation coefficient. A negative sign indicates that the relationship between the variables runs in opposite directions: an increase in the values of one variable is associated with a decrease in the values of the other. The larger the magnitude of the correlation coefficient, the stronger the statistical relationship between the variables.
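Computing the Spearman rank correlation coefficient, together with its p-value, can be sketched as follows (synthetic data of our own; note that the monotonic but nonlinear relationship is still captured):

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic explanatory variable x and response-like variable y with a
# monotonic but nonlinear relationship (illustration only).
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=100)
y = x**3 + 0.05 * rng.normal(size=100)

rho, p_value = spearmanr(x, y)
print(round(rho, 2))  # close to 1: Spearman captures the monotonic trend
```

A Pearson coefficient on the same data would be noticeably lower, since the relationship is not linear; the returned `p_value` is the significance value discussed below.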
There is no general rule to determine when a relationship is statistically strong; it depends on the specifics of the application (Cohen et al., 2002). In medical contexts, statistical relationships are often fairly weak, and typically the magnitude of the correlation coefficient is lower than 0.3 (once again, we stress that this

value can only be used as guidance, and referring to relationships between variables as statistically strong when the magnitude of the correlation coefficient exceeds a certain threshold is considered arbitrary). To differentiate the relationships between feature and response, and between features, we introduce some additional terminology. The correlation coefficient between a feature and the response variable is denoted ρ(f_i, y) and is known as relevance; similarly, the correlation coefficient between one feature and another is denoted ρ(f_i, f_j) and is known as redundancy. Statistical hypothesis tests are commonly used in data analysis applications to determine whether the observed result follows a certain hypothesis, which in statistical terminology is known as the null hypothesis. Often, the null hypothesis is the opposite of what we aim to demonstrate; therefore, in practice, the objective is often met when we can reject the null hypothesis in favour of the alternative hypothesis. Statistical hypothesis tests compute significance values, the well-established p-values, which can be interpreted as the probability of obtaining a similar result by chance if the null hypothesis is true. The null hypothesis is rejected when the p-value is lower than a pre-specified significance level, typically 0.05 or 0.01, and the result is deemed statistically significant at the chosen significance level. Thus, for example, p < 0.05 denotes a statistically significant result at the 5% significance level (i.e. there is less than 5% probability that the observed relationship is due to chance).

Reducing the number of explanatory variables

A common problem in data analysis applications arises when using a large number of explanatory variables, and is known as the curse of dimensionality: potentially, using fewer explanatory variables could lead to a simpler model which may allow more accurate estimation of the response variable (Hastie et al., 2009).
This initially puzzling assertion (one

could imagine that collecting as much information as possible in the form of explanatory variables could only help when inferring properties from the data) occurs because, in practice, we do not have an infinite number of samples. The problem is exacerbated when the number of explanatory variables is larger than the number of samples (e.g. in microarray settings, where the number of genes is typically in the order of thousands and the number of samples in the order of tens). Moreover, in practice, some explanatory variables contribute little information to predicting the response variable. In other scenarios, some explanatory variables can be considered redundant in the presence of other explanatory variables (i.e. they contribute little additional information towards predicting the response variable when some other explanatory variables are already used). There are two fundamentally different approaches to reducing the number of features: feature transformation and feature selection. Feature transformation aims to transform the original features into new features, which may be more appropriate for quantifying the information in the dataset towards predicting the response variable. However, feature transformation is problematic in settings with a very large number of features (Torkkola, 2003), and is not easily interpretable because the physical meaning of the original features cannot be retrieved. In contrast, feature selection is particularly desirable in many disciplines because the originally computed features typically quantify some characteristic which is interpretable to experts in that domain. There is a substantial body of literature addressing the topic of feature selection from many angles, and we refer to Guyon (2006) for a more extensive discussion. For our purposes, we suggest using a simple technique which aims to select a subset of the original (large) pool of explanatory variables.
Informally, we aim to select a small number of columns from the design matrix and delete the remaining columns. The new design matrix will have the same number of samples, but a lower number of explanatory variables, m, where m remains to be decided. It is reasonable, then, to select those explanatory

variables which are highly correlated with the response variable. However, one problem with this approach is that some explanatory variables will potentially be highly correlated with each other, which means that they will be redundant (as stated above, they would contribute little additional information towards predicting the response variable). Therefore, we need to find a compromise that accounts both for (a) including the explanatory variables most relevant to predicting the response variable, and (b) excluding the most redundant explanatory variables. Although there are many feature selection techniques in the literature, we endorse a conceptually simple and intuitively appealing idea proposed by Peng et al. (2005), known as mRMR, which we modify slightly here for simplicity. Specifically, mRMR relies on an intuitive heuristic criterion compromising between feature relevance and feature redundancy, which can be expressed with the following equation:

max_{j ∉ S} [ |ρ(f_j, y)| − (1/|S|) Σ_{s ∈ S} |ρ(f_j, f_s)| ]   (1)

where f_j denotes the j-th feature amongst the M initial features, and f_s is a feature that has already been selected into the feature index subset S (|S| is the number of selected features; Q contains the indices of all the features in the initial feature space, that is Q = {1, …, M}; S contains the indices of the selected features, and the features not in the selected subset have indices in Q∖S). Peng et al. (2005) used a more complicated criterion to quantify relevance and redundancy instead of the correlation coefficient used here, but the conceptual idea remains the same. The steps used to incrementally select features are described in Table 1. We remark that when a single dataset is used to train and test the classifier, the features should be selected in a cross-validation setting (see the following section). That is, a subset of the original data samples should be selected and the feature selection process should be run on this subset.
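The correlation-based criterion in Eq. (1) can be sketched as a greedy selection loop (our own minimal implementation; the function and variable names are ours, not from the original studies):

```python
import numpy as np
from scipy.stats import spearmanr

def mrmr_select(X, y, m):
    """Greedy correlation-based mRMR sketch: repeatedly pick the feature
    maximizing |relevance| minus the mean |redundancy| with the
    already-selected features (a minimal illustration, not Peng et al.'s
    mutual-information version)."""
    n_features = X.shape[1]
    relevance = np.array([abs(spearmanr(X[:, j], y)[0]) for j in range(n_features)])
    selected = [int(np.argmax(relevance))]      # first pick: most relevant feature
    while len(selected) < m:                    # incrementally add features
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean([abs(spearmanr(X[:, j], X[:, s])[0]) for s in selected])
            score = relevance[j] - redundancy   # relevance/redundancy compromise, Eq. (1)
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

# Tiny demonstration on synthetic data: f1 nearly duplicates f0, f2 is
# independent noise, so the redundant duplicate should be skipped.
rng = np.random.default_rng(2)
y = rng.integers(0, 2, size=120).astype(float)
f0 = y + 0.1 * rng.normal(size=120)
f1 = f0 + 0.01 * rng.normal(size=120)
f2 = rng.normal(size=120)
X = np.column_stack([f0, f1, f2])
print(mrmr_select(X, y, m=2))
```

Even though f1 is highly relevant on its own, its redundancy with the already-selected f0 pushes its score below that of the weakly relevant but non-redundant f2, which is exactly the compromise Eq. (1) encodes.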
It is advisable to repeat this feature selection process a number of times, where each time a different sample subset is drawn from the original dataset (we

suggest using 100 iterations, where on each iteration we randomly select 90% of the data and use this subset for selecting the features). Theoretically, the selected features should be identical across all subsets: this would be the true ordering. In practice, however, it is likely that a different feature ordering will result for each sample subset drawn from the original data matrix. Then, we can either select the feature subset that occurs most often, or select individually the features which appear most often. One robust mechanism to select features in this case is described in Tsanas et al. (2012).

Table 1: Incremental feature selection steps in mRMR
1. (Selecting the first feature index) Include the feature index j = argmax_{j ∈ Q} |ρ(f_j, y)| in the initially empty set S, that is S ← {j}.
2. (Selecting the next features, one at each step) Repeatedly apply the criterion in Eq. (1) to incrementally select the next feature index j, and include it in the set: S ← S ∪ {j}, until m features have been selected.
3. Obtain the feature subset by selecting the features {f_j}, j ∈ S, from the original data matrix X.

Mapping explanatory variables to the response variable

As we mentioned at the beginning of this chapter, in a wide range of problems we are interested in determining a functional mapping of the explanatory variables to the response variable, that is, finding a function h which uses X to predict y: ŷ = h(X). This can be achieved in two ways: (a) we can impose a structure on the functional form of h and aim to determine the parameters of that functional form (parametric setting), or (b) we can allow the data itself to determine the structure and the parameters of that structure (non-parametric setting). Both approaches have merit, and there is considerable interest amongst

statisticians regarding the merits of either approach (Breiman, 2001a). One example of the parametric setting has the linear form h(x) = w_0 + w_1 x_1 + … + w_M x_M, where h denotes the mapping from explanatory variables to response and the vector of parameters w = (w_0, w_1, …, w_M) needs to be estimated. We remind the reader that M represents the number of explanatory variables in the design matrix we use. In case we have selected a smaller subset of m explanatory variables (e.g. using the feature selection technique mRMR described in the previous section), the parametric model would use fewer parameters accordingly. Parametric settings are generally simpler than non-parametric settings, and may be more easily interpretable. If the functional form (model structure) is known a priori, then parametric settings can be very useful. However, imposing an inaccurate model structure may lead to false interpretation of the properties of the data. Hence, in practice, using a non-parametric functional form may often be more appropriate. We will now briefly introduce some widely used classifiers, which our readers may already be familiar with, and one more complicated classifier which often works very well in practice. For specific algorithmic details we refer to Bishop (2007) and Hastie et al. (2009). The first classifier is known by the name Logistic Regression (LR), which may be considered a misnomer since, by definition, it works on classification problems (Bishop, 2007). This classifier is frequently used by clinicians when constructing the functional form to identify the effects of features on the response (Breiman, 2001a). Originally LR was proposed for binary classification settings, but it has been generalized to multi-class classification problems as well. Conceptually, this classifier provides a model which relies linearly on the explanatory variables. LR models have found extensive use in medical applications where the aim is to understand how the explanatory variables affect the response variable (i.e.
it can be considered a conceptual extension of the correlation coefficients we mentioned previously) by looking at the coefficients associated with each feature. Nevertheless, practice has shown that LR models can lead to faulty conclusions

in the presence of correlated features (Bishop, 2007), and hence we suggest extreme caution in interpreting the values of the LR coefficients. LR models require a relatively large number of samples compared to the number of explanatory variables in order to have confidence in the computed results (Bishop, 2007; Hastie et al., 2009). Random Forests (RF) is a powerful non-parametric classifier, which can provide a model where the explanatory variables combine nonlinearly to estimate the response variable (Breiman, 2001b). It is constructed by combining many base experts, the trees (by default 500 trees), and then uses majority voting over the trees to decide on the final output. We will not go into the mathematical details of the construction of the trees, since they are readily available elsewhere (Breiman, 2001b; Hastie et al., 2009); instead, we provide an intuitive overview of this powerful machine learning approach. Interestingly, the way trees are built is not dissimilar to the mindset of clinicians: there are successive binary splits of the data before reaching a conclusion on how to classify a new sample. Effectively, trees partition the data based on a single feature at each decision point (node) to split the population. This approach can be compared to the way clinicians decide on the optimal course of treatment for a patient. For example, their first criterion could be age, where the optimal treatment is different for people who are over 50 years old. Then, for those patients who are under 50, gender may be crucial, and a different therapy is applied. Similarly, for those who are over 50, gender is possibly not important, but there is some other parameter that the clinicians would consider before deciding on the treatment. This scenario is presented schematically in Fig.
1, and this is how trees actually work (note that a particular feature can be used more than once, and it is possible that some features will not be used at all). Similarly to a clinician, the final decision of the tree is reached when we follow the nodes in the tree and we are confident there are no more useful

splits. In practice, the trees are grown until we have reached a pre-specified minimum number of samples assigned to each node.

Fig. 1: Example of a tree. In each node, the decision is to go to the left side of the tree if the statement is true, and to the right side of the tree if the statement is false. The point where a decision is made to assign the output to a specific value is known as a leaf. Here, we have used variables which may be familiar to clinicians over which we split the data.

Many dissimilar trees constitute the random forest. The algorithmic trick to grow diverse trees is to limit the number of features that can be used at each node by each tree. The binary splits used in the tree generation process lead to nonlinear combinations between the features, and hence RF often outperforms linear classifiers such as LR (we will see specific examples in the following sections).

Model validation and generalization

Once the functional form has been determined, we need to establish how accurate the mapping could be if a new dataset with similar properties to the dataset used to

obtain the model is collected. This is known as the generalization performance of the model, which is typically estimated using (a) cross validation, (b) bootstrapping, or (c) an additional dataset which has not been used to train the model (i.e. in the determination of the model parameters). We endorse the use of cross validation (CV), a well-known statistical re-sampling technique (Webb, 2002), because it is usually the simplest approach. Specifically, in CV the original dataset is split into a training subset, which is used to determine the model, and a testing subset, which is used to assess the classifier's generalization performance. The number of splits is determined by the researcher and is known as K-fold cross validation, with typical choices being 5-fold and 10-fold (Hastie et al., 2009). The model parameters are determined using the training subset, and errors are computed using the testing subset (out-of-sample error or testing error). The process should be repeated a large number of times (e.g. 100), where the dataset is randomly permuted in each run prior to splitting into training and testing subsets, in order to obtain statistical confidence in this assessment. Depending on the requirements of the problem, different loss functions can be introduced. In all cases, on each repetition we record an error of the form L(y_i, ŷ_i), i = 1, …, N, where N represents the number of samples in the training or testing subset, y_i is the true class, and ŷ_i is the estimated class of the i-th sample. The choice of loss function is critical and depends on the demands of the application. The simplest loss function is misclassification, i.e. counting the number of samples incorrectly assigned to a different class compared to the true class. In multi-class classification settings where the response variable classes are ordinal or continuous, it may be useful to have a more convoluted loss function.
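The repeated cross-validation loop with the misclassification loss can be sketched as follows; the nearest-class-mean "classifier" below is a deliberately simple stand-in for LR or RF, and all data are synthetic (our own illustration):

```python
import numpy as np

# Synthetic binary-classification data: two weakly informative features.
rng = np.random.default_rng(3)
n = 200
y = rng.integers(0, 2, size=n)
X = y[:, None] + 0.5 * rng.normal(size=(n, 2))

def nearest_mean_predict(X_train, y_train, X_test):
    """Assign each test sample to the class with the nearest training mean
    (a toy classifier standing in for LR or RF)."""
    means = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
    dists = ((X_test[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

errors = []
for repetition in range(100):                      # repeat with random permutations
    perm = rng.permutation(n)
    split = n // 10                                # hold out ~10% for testing
    test_idx, train_idx = perm[:split], perm[split:]
    y_hat = nearest_mean_predict(X[train_idx], y[train_idx], X[test_idx])
    errors.append(np.mean(y_hat != y[test_idx]))   # misclassification loss

print(round(100 * float(np.mean(errors)), 1), "% out-of-sample error")
```

Averaging the per-repetition errors gives the out-of-sample error reported throughout this chapter; swapping in a different loss function only changes the line where `errors.append` is computed.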
One commonly used loss function, applicable in settings where the response variable is continuous, is the mean absolute error (MAE):

MAE = (1/N) Σ_{i ∈ T} |y_i − ŷ_i|   (2)

where T contains the indices of the training or testing set and N is its size. Errors from all repetitions are averaged, and the generalization performance of the classifier is determined using the out-of-sample error. Note that in binary classification settings, MAE is equivalent to misclassification (that is, counting the number of samples assigned by the classifier to a different class than the true class). For convenience, we have expressed all results as percentage scores, i.e.

error (%) = 100 × MAE   (3)

Summary of the proposed methodology

This chapter provided a succinct data analysis guide, pruning away all mathematically complicated details and distilling only the essential knowledge for the successful practical application of the algorithmic tools. We summarize the proposed methodology in the following steps:
1) Apply various statistical tests to the data. These could include testing for the assumption of Gaussianity, determining p-values and correlation coefficients, and understanding the underlying structure of the quantities involved in the analysis. Produce density plots and scatter plots to visualize the data. These plots may suggest transforming some of the features, for example using the log-transformation if some features are not well spread out.
2) Apply standard classification algorithms (e.g. logistic regression) and also use more complicated, nonlinear methods such as random forests. All the explanatory variables are used to predict the response variable(s).

3) Select features using mRMR to derive parsimonious subsets. Use each of these subsets as inputs in step 2 and record the predictions and errors.
4) Potentially, in some datasets, reducing the number of input variables can in itself reduce the error metric (curse of dimensionality), while in other cases the use of a larger number of features may offer an insignificant performance improvement or a slight deterioration in the computed performance. By definition, a large number of features makes the resulting model computationally expensive and occludes its interpretability, and is therefore undesirable. There are various tools to address this somewhat subjective compromise between the number of features and model performance (we want the classifier to give results as accurate as possible). One approach is to use information criteria, and another is to use the one-standard-error rule. We refer to Hastie et al. (2009) for more details on both approaches.
5) Use new data or, more likely, 10-fold cross-validation with at least 100 repetitions to ensure the results are robust.

The list can easily be modified and is purposefully general, so that it is applicable to a wide range of medical applications. The field of exploratory data analysis and knowledge discovery cannot possibly be covered adequately here; we refer to the survey of Kurgan and Musilek (2006) for a relatively recent authoritative overview.

Example applying the proposed methodology in a medical problem

To demonstrate the proposed methodology in a practical problem we use the Hepatitis dataset, which has been widely studied in the literature. The dataset is available for download from the UCI machine learning repository. The problem is to investigate whether a set of features can be used to predict whether the

patient lives or dies. The design matrix has 155 × 19 elements; that is, it comprises 155 samples and 19 features. Each of the 19 features quantifies some characteristic which the researchers who collected the data believed affects the response. The response in this problem can take one of two possible values, denoting whether the patient lives or dies, i.e. we have a binary classification problem.

Table 2: Correlations between features and the response variable for the Hepatitis dataset
Feature name | Spearman correlation coefficient | Statistical significance of the correlation (p) | Samples used
ALBUMIN
ASCITES
PROTIME
SPIDERS
BILIRUBIN
VARICES
MALAISE
HISTOLOGY
FATIGUE
AGE
SPLEEN PALPABLE
ALK PHOSPHATE
SEX
STEROID
ANOREXIA
ANTIVIRALS
SGOT
LIVER BIG
LIVER FIRM

The last column in this table indicates the number of samples available for that feature. In practice, some characteristics may not be measured, which explains why we do not have 155 samples for all features. Diaconis and Efron (1983) reported 17% misclassification, whilst Breiman (2001a) reported 13% misclassification using 10-fold cross validation. To get an intuitive feel for the data, we generate density plots using histograms; in addition, we generate scatter plots to visualize the relationship between each explanatory variable and the response. Next, we compute the Spearman correlation coefficients between the explanatory variables and the response, which are presented in Table 2. The histograms and

18 the scatter plots of the five most highly correlated features appear in Fig. 2. These results give a good indication that some of the features in the dataset are well correlated with the response. The fact that the spread of the features is not bell-shaped may inspire some transformation so that the distributions become more evenly spread. For example, we could use some simple transformation of the features, e.g. the log-transformation: we compute the logarithm of all samples in a feature that we want to transform (see Tsanas et al. (2010b)). Here, we will not experiment further with any transformation of the features. Fig. 2: Histogram and scatter plots of the five most correlated features with the response for the Hepatitis dataset. The horizontal axes in the scatter plots are the normalized features to facilitate direct comparison, and the vertical axes correspond to the response variable. The gray lines are the best linear fit of the data, giving a visual impression of the behaviour of the feature. This preliminary analysis concludes the first step in the proposed methodological guide. The subsequent steps will be integrated in the following section, where in addition to the Hepatitis dataset, additional indicative medical datasets are investigated. 18
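The exploratory step just described (ranking features by their Spearman correlation with the response, and recording the per-feature sample counts that arise from missing values) can be sketched in a few lines of Python. The data and feature names below are synthetic stand-ins for the Hepatitis dataset, which must be obtained separately from the UCI repository.

```python
# Rank features by the absolute Spearman correlation between each feature
# and the binary response, using pairwise-complete samples (the Hepatitis
# data has per-feature missingness). Data and names are illustrative.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 155                                    # samples, as in the Hepatitis dataset
X = rng.normal(size=(n, 3))                # stand-in design matrix
y = (X[:, 0] + 0.1 * rng.normal(size=n) > 0).astype(int)  # binary response
X[rng.random((n, 3)) < 0.1] = np.nan       # inject some missing values
names = ["ALBUMIN", "ASCITES", "PROTIME"]  # illustrative subset of features

rows = []
for j, name in enumerate(names):
    keep = ~np.isnan(X[:, j])              # pairwise-complete samples only
    rho, p = spearmanr(X[keep, j], y[keep])
    rows.append((name, rho, p, int(keep.sum())))

# Sort by decreasing |rho|, mirroring the presentation of Table 2
for name, rho, p, m in sorted(rows, key=lambda r: -abs(r[1])):
    print(f"{name:10s} rho={rho:+.3f} p={p:.2e} n={m}")
```

Note that the number of usable samples is reported per feature, exactly as in the last column of Table 2.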

Comparison of logistic regression and random forests in various datasets

In this section we introduce additional datasets from real-world medical problems to demonstrate how predictive modelling is influenced by the choice between LR and RF. Moreover, we investigate the effect of using feature selection to obtain a robust parsimonious feature subset, and feed this subset into the LR or RF classifier. All the datasets are available from the UCI machine learning repository (4), with the exception of the SRBCT dataset, which was downloaded from a separate source (see below). Due to space constraints we keep the description of these datasets to a minimum, and refer the reader to the original studies cited in Table 3 and to the UCI machine learning repository for further details.

The Parkinson's dataset was generated in Little et al. (2009). The aim is to characterize speech signals by computing some distinctive characteristics (features) in the voices of people with Parkinson's disease versus healthy controls (a binary classification problem). An extension of this concept is inferring Parkinson's disease symptom severity using speech signals (Tsanas et al., 2010a; Tsanas et al., 2011). Little et al. (2009) reported approximately 10% misclassified cases when using a subset of only four of the 22 originally computed features. The Liver dataset is also a binary classification dataset, where the explanatory variables refer to blood tests and the number of drinks per day. The Lymphography dataset poses a four-class classification problem (the four classes are: healthy control, metastases, malign lymph, fibrosis), where the explanatory variables are various characteristics that oncologists consider relevant, such as lymphatics, changes in lymphoma type, and node characteristics. The Breast tissue dataset has nine features from impedance measurements, used to predict the type of tissue, such as carcinoma and adipose tissue (six classes). The SRBCT dataset contains 83 samples and 2308 gene expression values. The response variable denotes the tumour type (4 classes). We downloaded the dataset from a source which has split the original dataset into two subsets: a training subset with 63 samples and a testing subset with 20 samples. The SRBCT dataset is particularly challenging because the number of features (the genes in this case) is considerably larger than the number of samples. This is known to be a scenario where LR often fails to generalize well.

(4) The UCI machine learning repository hosts many datasets which are freely available online.

Table 4 presents the misclassifications computed for each dataset (we report the results on the data kept for testing). All the features are used to obtain these results. We also experiment with mRMR to select features, and feed the selected features into the classifier (LR or RF). These results appear in Fig. 3. Collectively, these findings suggest that RF consistently and significantly outperforms LR, with a mean relative improvement across datasets of about 26% (without including the SRBCT dataset, where LR massively underperforms). Moreover, in settings where the number of features is larger than the number of samples, the improvement with RF is even more impressive. Interestingly, in some datasets a lower number of features leads to a lower misclassification error. This is a manifestation of the curse of dimensionality, where additional features decrease the signal-to-noise ratio in the data and are detrimental to the performance of the classifier. We remark that RF is fairly robust, and that LR is particularly sensitive to the ratio of the number of features to the number of samples, thus verifying previous reports (Breiman, 2001a; Hastie et al., 2009).

Overall, the aim of this study was to encourage researchers to investigate data beyond simply reporting correlations and statistical significance values. We believe that following the simple steps outlined in this study provides a concise guide towards inferring key properties of the examined dataset.

Table 3: Summary of datasets

Dataset | Design matrix | Associated task | Feature type
Hepatitis (Diaconis and Efron, 1983) | | Classification (2 classes) | C (17), D (2)
Parkinson's (Little et al., 2009) | | Classification (2 classes) | C (22)
Liver | | Classification (2 classes) | D (6)
Lymphography | | Classification (4 classes) | D (18)
Breast tissue (Jossinet, 1996) | | Classification (6 classes) | C (9)
SRBCT (Khan et al., 2001) | | Classification (4 classes) | C (2308)

The design matrix is in the form N × M, where N is the number of samples and M is the number of features. Samples with missing values were removed. Feature type denotes whether the features in the design matrix are continuous (C) or discrete (D).

Table 4: Comparison of LR and RF when using all explanatory variables

Dataset | Misclassification (%) with LR | Misclassification (%) with RF | Misclassification difference LR-RF | Relative improvement (%) | Validation scheme
Hepatitis | ± | ± | | | 10-fold CV
Parkinson's | ± | ± | | | 10-fold CV
Liver | ± | ± | | | 10-fold CV
Lymphography | | | | | LOO
Breast tissue | | | | | LOO
SRBCT | | | | | Test set

The misclassification (%) results are reported in the form mean ± standard deviation. LR stands for logistic regression and RF for random forests. The misclassification difference between LR and RF was in all cases statistically significant using the Mann-Whitney statistical hypothesis test. The relative improvement expresses in % terms the performance boost when using RF over LR. It was defined as: relative improvement = 100 × [misclassification(LR) - misclassification(RF)] / misclassification(LR). The closer the relative improvement is to 100%, the greater the improvement RF provides over LR. The validation scheme we used depended on the number of samples and the number of classes in the dataset: we used 10-fold cross-validation (CV) when we had a relatively large number of samples for the number of classes (at least 15 samples for each class), and leave-one-sample-out (LOO) when we had fewer than 15 samples per class. When 10-fold CV was used to validate the accuracy of the classifier, we also used 100 repetitions for statistical confidence.
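As a concrete, deliberately simplified sketch of the comparison in Table 4, the following Python code runs repeated stratified 10-fold cross-validation for LR and RF, applies the Mann-Whitney test to the per-repetition misclassification rates, and computes the relative improvement as defined in the notes to Table 4. The scikit-learn breast-cancer data is only a stand-in for the chapter's datasets, the repetitions are cut from 100 to 5 for speed, and on this particular stand-in dataset either classifier may come out ahead.

```python
# Repeated 10-fold CV comparison of logistic regression (LR) and
# random forests (RF), with a Mann-Whitney test on the per-repetition
# misclassification rates and the relative-improvement figure.
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
rf = RandomForestClassifier(n_estimators=100, random_state=0)

def misclassification(model, reps=5):
    """Misclassification (%) for each repetition of 10-fold CV."""
    rates = []
    for rep in range(reps):
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=rep)
        acc = cross_val_score(model, X, y, cv=cv).mean()
        rates.append(100.0 * (1.0 - acc))
    return np.array(rates)

m_lr, m_rf = misclassification(lr), misclassification(rf)
_, p = mannwhitneyu(m_lr, m_rf)  # two-sided test on per-repetition rates
rel = 100.0 * (m_lr.mean() - m_rf.mean()) / m_lr.mean()  # Table 4 definition
print(f"LR {m_lr.mean():.2f}%  RF {m_rf.mean():.2f}%  "
      f"relative improvement {rel:.1f}%  p = {p:.3f}")
```

Varying the fold assignment across repetitions (via `random_state=rep`) is what gives the Mann-Whitney test a sample of misclassification rates to compare.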

Fig. 3: Misclassification percentages as a function of the number of features, selected using mRMR, that are included in the classifier. These results suggest that random forests (RF) consistently outperform logistic regression (LR) across a wide variety of medical datasets.
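The experiment behind Fig. 3 can be sketched as follows: misclassification as a function of how many top-ranked features are fed to each classifier. scikit-learn does not provide mRMR, so a univariate mutual-information ranking (`SelectKBest` with `mutual_info_classif`) stands in for it here, the breast-cancer data again substitutes for the chapter's datasets, and 5-fold CV keeps the run short; with a genuine mRMR implementation the selected subsets, and hence the curves, would differ.

```python
# Misclassification (%) versus number of selected features, for LR and RF,
# using a mutual-information feature ranking as a stand-in for mRMR.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

results = {}
for k in (2, 5, 10, X.shape[1]):  # number of top-ranked features to keep
    for name, clf in (("LR", LogisticRegression(max_iter=1000)),
                      ("RF", RandomForestClassifier(n_estimators=100,
                                                    random_state=0))):
        # Selection happens inside the pipeline, so each CV fold re-ranks
        # the features on its own training split (no selection bias).
        model = make_pipeline(StandardScaler(),
                              SelectKBest(mutual_info_classif, k=k),
                              clf)
        acc = cross_val_score(model, X, y, cv=5).mean()
        results[(name, k)] = 100.0 * (1.0 - acc)
        print(f"{name}, k={k:2d}: misclassification {results[(name, k)]:.2f}%")
```

Placing the selector inside the pipeline matters: ranking features on the full dataset before cross-validating would leak test information into the selection step.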

Take-home message

Merely reporting correlations and p-values is not indicative of how accurately the outcome can be estimated using the explanatory variables. Selecting a subset of the originally computed explanatory variables is always beneficial in terms of interpretation, and also often in terms of estimation accuracy. Random forests demonstrate a relative improvement over logistic regression of 36%, and are substantially better in settings where the number of explanatory variables is larger than the number of samples.

Acknowledgments

A. Tsanas gratefully acknowledges the financial support of Intel Corporation and the Engineering and Physical Sciences Research Council (EPSRC). M.A. Little acknowledges the financial support of the Wellcome Trust, grant number WT090651MF. P.E. McSharry acknowledges the financial support of the European Commission through the SafeWind project (ENK7-CT ). We also want to thank all the researchers who deposited their datasets in the UCI machine learning repository.

References

J. Aldrich, Correlations Genuine and Spurious in Pearson and Yule, Statistical Science, Vol. 10(4), 1995

C. Bishop, Pattern Recognition and Machine Learning, Springer, 2007

L. Breiman, Statistical modelling: the two cultures, Statistical Science, Vol. 16(3) (with comments and discussion), 2001a

L. Breiman, Random forests, Machine Learning, Vol. 45, pp. 5-32, 2001b

J. Cohen, P. Cohen, S.G. West, L.S. Aiken, Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, Routledge Academic, 3rd ed.

P. Diaconis and B. Efron, Computer-intensive methods in statistics, Scientific American, Vol. 248, 1983

I. Guyon, S. Gunn, M. Nikravesh, L.A. Zadeh (Eds.), Feature Extraction: Foundations and Applications, Springer, 2006

T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2nd ed., 2009

J. Jossinet, Variability of impedivity in normal and pathological breast tissue, Med. & Biol. Eng. & Comput., Vol. 34, 1996

J. Khan, J.S. Wei, M. Ringner, L.H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C.R. Antonescu, C. Peterson, P.S. Meltzer, Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks, Nature Medicine, Vol. 7, 2001

L.A. Kurgan and P. Musilek, A survey of Knowledge Discovery and Data Mining process models, The Knowledge Engineering Review, Vol. 21(1), pp. 1-24, 2006

M.A. Little, P.E. McSharry, E.J. Hunter, J. Spielman, L.O. Ramig, Suitability of dysphonia measurements for telemonitoring of Parkinson's disease, IEEE Transactions on Biomedical Engineering, Vol. 56(4), 2009

H. Peng, F. Long and C. Ding, Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27(8), 2005

K. Torkkola, Feature extraction by non-parametric mutual information maximization, Journal of Machine Learning Research, Vol. 3, 2003

A. Tsanas, M.A. Little, P.E. McSharry, L.O. Ramig, Accurate telemonitoring of Parkinson's disease progression by non-invasive speech tests, IEEE Transactions on Biomedical Engineering, Vol. 57, 2010a

A. Tsanas, M.A. Little, P.E. McSharry, L.O. Ramig, Enhanced classical dysphonia measures and sparse regression for telemonitoring of Parkinson's disease progression, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Dallas, Texas, US, March 2010b

A. Tsanas, M.A. Little, P.E. McSharry, L.O. Ramig, Nonlinear speech analysis algorithms mapped to a standard metric achieve clinically useful quantification of average Parkinson's disease symptom severity, Journal of the Royal Society Interface, Vol. 8, 2011

A. Tsanas, M.A. Little, P.E. McSharry, J. Spielman, L.O. Ramig, Novel speech signal processing algorithms for high-accuracy classification of Parkinson's disease, IEEE Transactions on Biomedical Engineering, Vol. 59, 2012

A. Webb, Statistical Pattern Recognition, John Wiley and Sons Ltd


More information

A HMM-based Pre-training Approach for Sequential Data

A HMM-based Pre-training Approach for Sequential Data A HMM-based Pre-training Approach for Sequential Data Luca Pasa 1, Alberto Testolin 2, Alessandro Sperduti 1 1- Department of Mathematics 2- Department of Developmental Psychology and Socialisation University

More information

Feasibility Study in Digital Screening of Inflammatory Breast Cancer Patients using Selfie Image

Feasibility Study in Digital Screening of Inflammatory Breast Cancer Patients using Selfie Image Feasibility Study in Digital Screening of Inflammatory Breast Cancer Patients using Selfie Image Reshma Rajan and Chang-hee Won CSNAP Lab, Temple University Technical Memo Abstract: Inflammatory breast

More information

Section 6: Analysing Relationships Between Variables

Section 6: Analysing Relationships Between Variables 6. 1 Analysing Relationships Between Variables Section 6: Analysing Relationships Between Variables Choosing a Technique The Crosstabs Procedure The Chi Square Test The Means Procedure The Correlations

More information

Accurate telemonitoring of Parkinson s disease progression by non-invasive speech tests

Accurate telemonitoring of Parkinson s disease progression by non-invasive speech tests Non-invasive telemonitoring of Parkinson s disease, Tsanas et al. 1 Accurate telemonitoring of Parkinson s disease progression by non-invasive speech tests Athanasios Tsanas*, Max A. Little, Member, IEEE,

More information

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES 24 MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES In the previous chapter, simple linear regression was used when you have one independent variable and one dependent variable. This chapter

More information

Personalized Colorectal Cancer Survivability Prediction with Machine Learning Methods*

Personalized Colorectal Cancer Survivability Prediction with Machine Learning Methods* Personalized Colorectal Cancer Survivability Prediction with Machine Learning Methods* 1 st Samuel Li Princeton University Princeton, NJ seli@princeton.edu 2 nd Talayeh Razzaghi New Mexico State University

More information

Credal decision trees in noisy domains

Credal decision trees in noisy domains Credal decision trees in noisy domains Carlos J. Mantas and Joaquín Abellán Department of Computer Science and Artificial Intelligence University of Granada, Granada, Spain {cmantas,jabellan}@decsai.ugr.es

More information

INADEQUACIES OF SIGNIFICANCE TESTS IN

INADEQUACIES OF SIGNIFICANCE TESTS IN INADEQUACIES OF SIGNIFICANCE TESTS IN EDUCATIONAL RESEARCH M. S. Lalithamma Masoomeh Khosravi Tests of statistical significance are a common tool of quantitative research. The goal of these tests is to

More information

Bayesian and Frequentist Approaches

Bayesian and Frequentist Approaches Bayesian and Frequentist Approaches G. Jogesh Babu Penn State University http://sites.stat.psu.edu/ babu http://astrostatistics.psu.edu All models are wrong But some are useful George E. P. Box (son-in-law

More information

Technical Specifications

Technical Specifications Technical Specifications In order to provide summary information across a set of exercises, all tests must employ some form of scoring models. The most familiar of these scoring models is the one typically

More information

ParkDiag: A Tool to Predict Parkinson Disease using Data Mining Techniques from Voice Data

ParkDiag: A Tool to Predict Parkinson Disease using Data Mining Techniques from Voice Data ParkDiag: A Tool to Predict Parkinson Disease using Data Mining Techniques from Voice Data Tarigoppula V.S. Sriram 1, M. Venkateswara Rao 2, G.V. Satya Narayana 3 and D.S.V.G.K. Kaladhar 4 1 CSE, Raghu

More information

Multichannel Classification of Single EEG Trials with Independent Component Analysis

Multichannel Classification of Single EEG Trials with Independent Component Analysis In J. Wang et a]. (Eds.), Advances in Neural Networks-ISNN 2006, Part 111: 54 1-547. Berlin: Springer. Multichannel Classification of Single EEG Trials with Independent Component Analysis Dik Kin Wong,

More information

Funnelling Used to describe a process of narrowing down of focus within a literature review. So, the writer begins with a broad discussion providing b

Funnelling Used to describe a process of narrowing down of focus within a literature review. So, the writer begins with a broad discussion providing b Accidental sampling A lesser-used term for convenience sampling. Action research An approach that challenges the traditional conception of the researcher as separate from the real world. It is associated

More information

Computer Age Statistical Inference. Algorithms, Evidence, and Data Science. BRADLEY EFRON Stanford University, California

Computer Age Statistical Inference. Algorithms, Evidence, and Data Science. BRADLEY EFRON Stanford University, California Computer Age Statistical Inference Algorithms, Evidence, and Data Science BRADLEY EFRON Stanford University, California TREVOR HASTIE Stanford University, California ggf CAMBRIDGE UNIVERSITY PRESS Preface

More information

Early Detection of Lung Cancer

Early Detection of Lung Cancer Early Detection of Lung Cancer Aswathy N Iyer Dept Of Electronics And Communication Engineering Lymie Jose Dept Of Electronics And Communication Engineering Anumol Thomas Dept Of Electronics And Communication

More information

Agents with Attitude: Exploring Coombs Unfolding Technique with Agent-Based Models

Agents with Attitude: Exploring Coombs Unfolding Technique with Agent-Based Models Int J Comput Math Learning (2009) 14:51 60 DOI 10.1007/s10758-008-9142-6 COMPUTER MATH SNAPHSHOTS - COLUMN EDITOR: URI WILENSKY* Agents with Attitude: Exploring Coombs Unfolding Technique with Agent-Based

More information

South Australian Research and Development Institute. Positive lot sampling for E. coli O157

South Australian Research and Development Institute. Positive lot sampling for E. coli O157 final report Project code: Prepared by: A.MFS.0158 Andreas Kiermeier Date submitted: June 2009 South Australian Research and Development Institute PUBLISHED BY Meat & Livestock Australia Limited Locked

More information

Application of Local Control Strategy in analyses of the effects of Radon on Lung Cancer Mortality for 2,881 US Counties

Application of Local Control Strategy in analyses of the effects of Radon on Lung Cancer Mortality for 2,881 US Counties Application of Local Control Strategy in analyses of the effects of Radon on Lung Cancer Mortality for 2,881 US Counties Bob Obenchain, Risk Benefit Statistics, August 2015 Our motivation for using a Cut-Point

More information

Introduction & Basics

Introduction & Basics CHAPTER 1 Introduction & Basics 1.1 Statistics the Field... 1 1.2 Probability Distributions... 4 1.3 Study Design Features... 9 1.4 Descriptive Statistics... 13 1.5 Inferential Statistics... 16 1.6 Summary...

More information

DPPred: An Effective Prediction Framework with Concise Discriminative Patterns

DPPred: An Effective Prediction Framework with Concise Discriminative Patterns IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, MANUSCRIPT ID DPPred: An Effective Prediction Framework with Concise Discriminative Patterns Jingbo Shang, Meng Jiang, Wenzhu Tong, Jinfeng Xiao, Jian

More information

EECS 433 Statistical Pattern Recognition

EECS 433 Statistical Pattern Recognition EECS 433 Statistical Pattern Recognition Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1 / 19 Outline What is Pattern

More information

The Long Tail of Recommender Systems and How to Leverage It

The Long Tail of Recommender Systems and How to Leverage It The Long Tail of Recommender Systems and How to Leverage It Yoon-Joo Park Stern School of Business, New York University ypark@stern.nyu.edu Alexander Tuzhilin Stern School of Business, New York University

More information

Intelligent Edge Detector Based on Multiple Edge Maps. M. Qasim, W.L. Woon, Z. Aung. Technical Report DNA # May 2012

Intelligent Edge Detector Based on Multiple Edge Maps. M. Qasim, W.L. Woon, Z. Aung. Technical Report DNA # May 2012 Intelligent Edge Detector Based on Multiple Edge Maps M. Qasim, W.L. Woon, Z. Aung Technical Report DNA #2012-10 May 2012 Data & Network Analytics Research Group (DNA) Computing and Information Science

More information

On the Combination of Collaborative and Item-based Filtering

On the Combination of Collaborative and Item-based Filtering On the Combination of Collaborative and Item-based Filtering Manolis Vozalis 1 and Konstantinos G. Margaritis 1 University of Macedonia, Dept. of Applied Informatics Parallel Distributed Processing Laboratory

More information

Overview of Non-Parametric Statistics

Overview of Non-Parametric Statistics Overview of Non-Parametric Statistics LISA Short Course Series Mark Seiss, Dept. of Statistics April 7, 2009 Presentation Outline 1. Homework 2. Review of Parametric Statistics 3. Overview Non-Parametric

More information

RESPONSE SURFACE MODELING AND OPTIMIZATION TO ELUCIDATE THE DIFFERENTIAL EFFECTS OF DEMOGRAPHIC CHARACTERISTICS ON HIV PREVALENCE IN SOUTH AFRICA

RESPONSE SURFACE MODELING AND OPTIMIZATION TO ELUCIDATE THE DIFFERENTIAL EFFECTS OF DEMOGRAPHIC CHARACTERISTICS ON HIV PREVALENCE IN SOUTH AFRICA RESPONSE SURFACE MODELING AND OPTIMIZATION TO ELUCIDATE THE DIFFERENTIAL EFFECTS OF DEMOGRAPHIC CHARACTERISTICS ON HIV PREVALENCE IN SOUTH AFRICA W. Sibanda 1* and P. Pretorius 2 1 DST/NWU Pre-clinical

More information

Research Supervised clustering of genes Marcel Dettling and Peter Bühlmann

Research Supervised clustering of genes Marcel Dettling and Peter Bühlmann http://genomebiology.com/22/3/2/research/69. Research Supervised clustering of genes Marcel Dettling and Peter Bühlmann Address: Seminar für Statistik, Eidgenössische Technische Hochschule (ETH) Zürich,

More information

Keywords Artificial Neural Networks (ANN), Echocardiogram, BPNN, RBFNN, Classification, survival Analysis.

Keywords Artificial Neural Networks (ANN), Echocardiogram, BPNN, RBFNN, Classification, survival Analysis. Design of Classifier Using Artificial Neural Network for Patients Survival Analysis J. D. Dhande 1, Dr. S.M. Gulhane 2 Assistant Professor, BDCE, Sevagram 1, Professor, J.D.I.E.T, Yavatmal 2 Abstract The

More information

Generalization and Theory-Building in Software Engineering Research

Generalization and Theory-Building in Software Engineering Research Generalization and Theory-Building in Software Engineering Research Magne Jørgensen, Dag Sjøberg Simula Research Laboratory {magne.jorgensen, dagsj}@simula.no Abstract The main purpose of this paper is

More information

Performance of Median and Least Squares Regression for Slightly Skewed Data

Performance of Median and Least Squares Regression for Slightly Skewed Data World Academy of Science, Engineering and Technology 9 Performance of Median and Least Squares Regression for Slightly Skewed Data Carolina Bancayrin - Baguio Abstract This paper presents the concept of

More information

Predicting Kidney Cancer Survival from Genomic Data

Predicting Kidney Cancer Survival from Genomic Data Predicting Kidney Cancer Survival from Genomic Data Christopher Sauer, Rishi Bedi, Duc Nguyen, Benedikt Bünz Abstract Cancers are on par with heart disease as the leading cause for mortality in the United

More information

Impute vs. Ignore: Missing Values for Prediction

Impute vs. Ignore: Missing Values for Prediction Proceedings of International Joint Conference on Neural Networks, Dallas, Texas, USA, August 4-9, 2013 Impute vs. Ignore: Missing Values for Prediction Qianyu Zhang, Ashfaqur Rahman, and Claire D Este

More information

Quantitative Methods in Computing Education Research (A brief overview tips and techniques)

Quantitative Methods in Computing Education Research (A brief overview tips and techniques) Quantitative Methods in Computing Education Research (A brief overview tips and techniques) Dr Judy Sheard Senior Lecturer Co-Director, Computing Education Research Group Monash University judy.sheard@monash.edu

More information

Multiple Bivariate Gaussian Plotting and Checking

Multiple Bivariate Gaussian Plotting and Checking Multiple Bivariate Gaussian Plotting and Checking Jared L. Deutsch and Clayton V. Deutsch The geostatistical modeling of continuous variables relies heavily on the multivariate Gaussian distribution. It

More information

CHAPTER VI RESEARCH METHODOLOGY

CHAPTER VI RESEARCH METHODOLOGY CHAPTER VI RESEARCH METHODOLOGY 6.1 Research Design Research is an organized, systematic, data based, critical, objective, scientific inquiry or investigation into a specific problem, undertaken with the

More information

A STATISTICAL PATTERN RECOGNITION PARADIGM FOR VIBRATION-BASED STRUCTURAL HEALTH MONITORING

A STATISTICAL PATTERN RECOGNITION PARADIGM FOR VIBRATION-BASED STRUCTURAL HEALTH MONITORING A STATISTICAL PATTERN RECOGNITION PARADIGM FOR VIBRATION-BASED STRUCTURAL HEALTH MONITORING HOON SOHN Postdoctoral Research Fellow ESA-EA, MS C96 Los Alamos National Laboratory Los Alamos, NM 87545 CHARLES

More information

Six Sigma Glossary Lean 6 Society

Six Sigma Glossary Lean 6 Society Six Sigma Glossary Lean 6 Society ABSCISSA ACCEPTANCE REGION ALPHA RISK ALTERNATIVE HYPOTHESIS ASSIGNABLE CAUSE ASSIGNABLE VARIATIONS The horizontal axis of a graph The region of values for which the null

More information

Basic Biostatistics. Chapter 1. Content

Basic Biostatistics. Chapter 1. Content Chapter 1 Basic Biostatistics Jamalludin Ab Rahman MD MPH Department of Community Medicine Kulliyyah of Medicine Content 2 Basic premises variables, level of measurements, probability distribution Descriptive

More information