A methodology for the analysis of medical data

Please cite this book chapter as: A. Tsanas, M.A. Little, P.E. McSharry, A methodology for the analysis of medical data, Handbook of Systems and Complexity in Health, Springer, New York, pp , 2013

A methodology for the analysis of medical data

by A. Tsanas 1,2,*, M.A. Little 2,3, P.E. McSharry 1,2,4
* Asterisk denotes corresponding author (tsanas@maths.ox.ac.uk, tsanasthanasis@gmail.com)
1 Systems Analysis, Modelling and Prediction (SAMP), Department of Engineering Science, University of Oxford, Oxford, UK
2 Oxford Centre for Industrial and Applied Mathematics (OCIAM), University of Oxford, Oxford, UK
3 Media Lab, Massachusetts Institute of Technology, Cambridge, MA, USA
4 Smith School of Enterprise and the Environment, University of Oxford, Oxford, UK

Abstract

This chapter aims to provide a methodology for the quantitative analysis of certain kinds of medical data. It is mainly aimed at clinical practitioners who are interested in data analysis, and intends to offer a succinct guide that may prove useful across a wide range of medical applications. To illustrate the proposed steps in this methodological guide, we use a classical medical dataset to show how these steps are applied in practice. The generic applicability of this guide is demonstrated by investigating several publicly available datasets in diverse medical applications, ranging from cancer diagnosis to micro-array data analysis. We start by introducing some commonly used terminology, and briefly discuss some standard ideas behind statistical analysis and hypothesis testing. Subsequently, we describe classification techniques to predict the outcome given the explanatory variables. These techniques offer an improved understanding that goes beyond the use of correlations and

to quantify importance (or, more mathematically accurately, statistical significance). The key finding is that a powerful nonlinear classifier is consistently superior to logistic regression (a classifier often used by clinicians), offering a relative improvement in performance of 36% in predicting the outcome across six different datasets. We urge clinicians to apply a similar methodology for investigating the predictive information contained within their datasets, which may otherwise remain concealed at the initial step of statistical exploration.

Introduction and terminology

Imagine a subject going to the clinic for a medical diagnosis, for example to assess the functionality of his cardiovascular system. The doctor requests a number of clinical tests (for example, a stress test to obtain the electrocardiogram (ECG), and Doppler ultrasound), takes into account a number of other factors (for example the demographics of the subject), and makes his final diagnosis using the current data and his prior knowledge. For his diagnosis, the doctor will usually compute some characteristics of the original raw signal. For example, when the raw signal is the ECG, clinicians may want to use the mean heart rate or the heart rate variability (these characteristics may also be readily provided by medical software) because experience has taught them that these characteristics are useful in diagnosis. The discipline of statistical machine learning (informally, data analysis) offers a framework which allows researchers to decipher what the computed characteristics reveal, and how these characteristics could be used to offer a decision support tool. A further aim is to investigate whether additional characteristics, which may have been previously ignored, could or should be taken into account. A guide on detecting patterns is outside the scope of this study; instead, we focus on the case where a number of characteristics have been

collected (as indicated above, these characteristics might have been extracted from the original raw signals, demographic data, values of genes, the concentration of a particular component in a given area, and others). Characteristics which are qualitative can be assigned to an ordinal scale 1. For example, in the case that medical practitioners characterize cells as having (a) low concentrations, (b) moderate concentrations, and (c) large concentrations, we could define a scale that would read: low = 1, moderate = 2, large = 3. For reasons that will become clear later, it is advisable to use progressively increasing values of the ordinal scale, starting with the healthy condition and characterizing pathological situations with higher values. Each characteristic is represented by a single scalar value. We have purposefully avoided the use of mathematical terms so far, but here we need to define some terms that are commonly used in statistical settings: explanatory variables (or features), and response variable (or simply response). The term feature is equivalent to the computed characteristic, and the term response variable can be thought of as equivalent to the diagnosis or the clinical outcome. In most medical settings, the diagnosis or clinical outcome can take a small range of possible values. For example, the final diagnosis of a clinician may simply be a yes or no to a question (e.g. whether a subject has cancer), and might also include a third clinical outcome, e.g. possibly. These two or three possible outcomes can be represented using an ordinal scale as indicated previously for the characteristics, i.e. the response variable takes possible values such as 0, 1 (and, with a third outcome, 2). This example can be generalized to a broader setting, where for example a number of explanatory variables are used to assign subjects to different pathologies. The possible values that the response variable can take are simply known as categories or classes.
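The ordinal encoding just described can be sketched in a few lines of Python (a minimal illustration; the scale values and variable names here are our own):

```python
# Map qualitative cell-concentration labels to an ordinal scale with
# progressively increasing values (our own illustrative encoding).
concentration_scale = {"low": 1, "moderate": 2, "large": 3}

cells = ["low", "large", "moderate", "low"]
encoded = [concentration_scale[c] for c in cells]
print(encoded)  # [1, 3, 2, 1]
```

Each qualitative characteristic then becomes a single scalar value per sample, as required by the analysis below.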
1 The term ordinal scale refers to a hierarchical ordering of values to differentiate the different possible outcomes, where the difference between successive values within the scale is not necessarily equal.

When the response variable can take any of a finite number of classes, the

problem of predicting the response variable is known as classification 2. When the response variable can take any real value (any number from −∞ to +∞), the problem is known as regression. Classification problems are met considerably more frequently in medical applications, and hence we focus exclusively on those cases. It is important to stress that accurate statistical inference is only possible when a relatively large number of data samples is collected. A good rule of thumb is to use at least 15 data samples from each class of the response variable. Data in most medical applications can be represented in the form of a design matrix (or data matrix) X with N rows and M columns, where each row includes the explanatory variables for one subject. That is, each row contains the concatenated vector of the explanatory variables which characterize the subject's condition. For example, the first row could contain the age of the first subject, the gender, the mean heart rate, the heart rate variability, and so on. Effectively, X simply summarizes the explanatory variables for the N observations (samples); each row usually refers to a different subject. Each column of X contains the values of one explanatory variable across all samples, and is denoted f_i, i = 1, …, M. The response variable y is believed to be associated with X based on prior knowledge of the given problem. It is populated with the outcomes for each sample: for example, the first entry could be 0 to denote a healthy state, the second 1 to denote a pathological state for the second subject, and so on. Once the data is summarized in a format like the one presented above, with X and y, the aim is to decipher the concealed information. Questions such as the following are frequently met in medical contexts (the list is only indicative):
1. How can we associate X and y? That is, what is the relationship between the explanatory variables and the response variable?
2 When the response variable can only be one of two classes, the problem is referred to as binary classification; when there are more than two classes, the problem is known as multi-class classification. Binary classification problems are met very frequently in medical applications, for example differentiating whether patients live or die.

2. Is there a convenient way to estimate the response variable when presented with the explanatory variables of a subject?
3. Which of the explanatory variables are useful in actually determining the response variable?
4. What is the relationship between the explanatory variables? Is it possible that some of the explanatory variables are redundant and need not be computed?

We will demonstrate that when analysis is confined to reporting statistical significance values (p-values), these clinically important questions cannot be adequately answered.

Data exploration and statistical analysis

Usually, the first step in data analysis is to explore the statistical properties of the data, and to produce some plots to get an intuitive feel. Initially, the probability densities of the explanatory variables can be plotted; the simplest approach is to use histograms 3. Histograms provide a nice overview of the distribution of values for each explanatory variable, and for the response variable. They use a number of bins (for example 10) which span the range of possible values of the investigated variable, and count the number of data samples that fall into the range of each bin, thus providing a general impression of the spread of the values for this variable. In addition to density plots, we suggest using scatter plots: a scatter plot has one explanatory variable on the x-axis and the response variable on the y-axis. Scatter plots are useful to visualize whether there is any obvious relationship between the investigated explanatory variable and the response variable. Scatter plots can be used for each of the explanatory variables to present the (f_i, y) pairs very simply in a figure.

3 In general, histograms are considered a simple but rather crude approach. Kernel density estimation is typically preferable (see Hastie et al. (2009) for more details), and can be thought of as a smoothed version of a histogram.
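The histogram step can be sketched with NumPy (a minimal illustration on synthetic feature values of our own making):

```python
import numpy as np

# Bin one explanatory variable into 10 histogram bins spanning its range.
# The "heart-rate-like" feature below is synthetic, for illustration only.
rng = np.random.default_rng(0)
feature = rng.normal(loc=120, scale=15, size=200)

counts, bin_edges = np.histogram(feature, bins=10)

print(counts.sum())    # 200: every sample falls in exactly one bin
print(len(bin_edges))  # 11: ten bins need eleven edges
```

The `counts` vector is what a histogram plot displays; a scatter plot would instead pair each feature value with the corresponding response value.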

Visual inspection of density plots and scatter plots is usually followed by formal statistical tests in order to determine qualitatively and quantitatively how well the explanatory variables are related to the response variable. Correlation analysis offers a good indication of the association between each explanatory variable and the response variable, and between explanatory variables (pairwise correlations). However, we emphasize that correlation does not, in general, imply causation (a change in the values of the explanatory variable affecting the response variable) (Aldrich, 1995). Correlation coefficients should be regarded as a valuable hint indicating a potential relationship between the explanatory variable and the response. We endorse the use of the Spearman rank correlation coefficient, which can account for general monotonic relationships and is in general preferable to the linear (Pearson) correlation coefficient (which is more appropriate in linear settings). Strictly speaking, formal statistical hypothesis tests (see the following paragraph) should be used to check whether the data follow normality (one characteristic of normality is a histogram resembling a bell-shaped curve). In practice, medical data will typically deviate from normality, and hence the Spearman correlation coefficient should generally be used. Both the Spearman rank correlation coefficient and the linear correlation coefficient lie in the numeric range [−1, 1], and are interpreted using (a) the sign of the correlation coefficient, which denotes the direction of the relationship, and (b) the magnitude (absolute value) of the correlation coefficient. A negative sign indicates that the relationship between the variables runs in opposite directions: an increase in the values of one variable is associated with a decrease in the values of the other. The larger the magnitude of the correlation coefficient, the stronger the statistical relationship between the variables.
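Computing the Spearman rank correlation coefficient, together with its p-value, can be sketched as follows (synthetic data of our own; note that the monotonic but nonlinear relationship is still captured):

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic explanatory variable x and response-like variable y with a
# monotonic but nonlinear relationship (illustration only).
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=100)
y = x**3 + 0.05 * rng.normal(size=100)

rho, p_value = spearmanr(x, y)
print(round(rho, 2))  # close to 1: Spearman captures the monotonic trend
```

A Pearson coefficient on the same data would be noticeably lower, since the relationship is not linear; the returned `p_value` is the significance value discussed below.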
There is no general rule to determine when a relationship is statistically strong; it depends on the specifics of the application (Cohen et al., 2002). In medical contexts, statistical relationships are often fairly weak, and typically the magnitude of the correlation coefficient is lower than 0.3 (once again, we stress that this

value can only be used as guidance, and referring to relationships between variables as statistically strong when the magnitude of the correlation coefficient exceeds a certain threshold is considered arbitrary). To differentiate the relationships between feature and response, and between features, we introduce some additional terminology. The correlation coefficient between a feature and the response variable is denoted ρ(f_i, y) and is known as relevance; similarly, the correlation coefficient between one feature and another is denoted ρ(f_i, f_j) and is known as redundancy. Statistical hypothesis tests are commonly used in data analysis applications to determine whether the observed result follows a certain hypothesis, which in statistical terminology is known as the null hypothesis. Often, the null hypothesis is the opposite of what we aim to demonstrate; therefore, in practice, the objective is often met when we can reject the null hypothesis in favour of the alternative hypothesis. Statistical hypothesis tests compute significance values, the well-established p-values, which can be interpreted as the probability of obtaining a similar result by chance if the null hypothesis is true. The null hypothesis is rejected when the p-value is lower than a pre-specified significance level, typically 0.05 or 0.01, and the result is deemed statistically significant at the chosen significance level. Thus, for example, p < 0.05 denotes a statistically significant result at the 5% significance level (i.e. there is less than 5% probability that the observed relationship is due to chance).

Reducing the number of explanatory variables

A common problem in data analysis applications arises when using a large number of explanatory variables, and is known as the curse of dimensionality: potentially, using fewer explanatory variables could lead to a simpler model which may allow more accurate estimation of the response variable (Hastie et al., 2009).
This initially puzzling assertion (one

could imagine that collecting as much information as possible in the form of explanatory variables could only help when inferring properties from the data) occurs because, in practice, we do not have an infinite number of samples. The problem is exacerbated when the number of explanatory variables is larger than the number of samples (e.g. in microarray settings, where the number of genes is typically in the order of thousands and the number of samples in the order of tens). Moreover, in practice, some explanatory variables contribute little information to predicting the response variable. In other scenarios, some explanatory variables can be considered redundant in the presence of other explanatory variables (i.e. they contribute little additional information towards predicting the response variable when some other explanatory variables are already used). There are two fundamentally different approaches to reducing the number of features: feature transformation and feature selection. Feature transformation aims to transform the original features into new features, which may be more appropriate for quantifying the information in the dataset towards predicting the response variable. However, feature transformation is problematic in settings with a very large number of features (Torkkola, 2003), and is not easily interpretable because the physical meaning of the original features cannot be retrieved. In contrast, feature selection is particularly desirable in many disciplines because the originally computed features typically quantify some characteristic which is interpretable to experts in that domain. There is a substantial body of literature addressing the topic of feature selection from many angles, and we refer to Guyon (2006) for a more extensive discussion. For our purposes, we suggest using a simple technique which aims to select a subset of the original (large) pool of explanatory variables.
Informally, we aim to select a small number of columns from the design matrix and delete the remaining columns. The new design matrix will have the same number of samples, but a lower number of explanatory variables, m, where m remains to be decided. It is reasonable, then, to select those explanatory

variables which are highly correlated with the response variable. However, one problem with this approach is that some explanatory variables will potentially be highly correlated with each other, which means that they will be redundant (as stated above, they would contribute little additional information towards predicting the response variable). Therefore, we need to find a compromise that accounts both for (a) including the explanatory variables most relevant to predicting the response variable, and (b) excluding the most redundant explanatory variables. Although there are many feature selection techniques in the literature, we endorse a conceptually simple and intuitively appealing idea proposed by Peng et al. (2005), known as mRMR, which we modify slightly here for simplicity. Specifically, mRMR relies on an intuitive heuristic criterion compromising between feature relevance and feature redundancy, which can be expressed with the following equation:

max_{j ∉ S} [ |ρ(f_j, y)| − (1/|S|) Σ_{s ∈ S} |ρ(f_j, f_s)| ]   (1)

where f_j denotes the j-th feature amongst the M initial features, and f_s is a feature that has already been selected into the feature index subset S (|S| is the number of selected features; Q contains the indices of all the features in the initial feature space, that is Q = {1, …, M}; S contains the indices of the selected features, and the features not in the selected subset have indices in Q∖S). Peng et al. (2005) used a more complicated criterion to quantify relevance and redundancy instead of the correlation coefficient used here, but the conceptual idea remains the same. The steps used to incrementally select features are described in Table 1. We remark that when a single dataset is used to train and test the classifier, the features should be selected in a cross-validation setting (see the following section). That is, a subset of the original data samples should be selected and the feature selection process should be run on this subset.
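The correlation-based criterion in Eq. (1) can be sketched as a greedy selection loop (our own minimal implementation; the function and variable names are ours, not from the original studies):

```python
import numpy as np
from scipy.stats import spearmanr

def mrmr_select(X, y, m):
    """Greedy correlation-based mRMR sketch: repeatedly pick the feature
    maximizing |relevance| minus the mean |redundancy| with the
    already-selected features (a minimal illustration, not Peng et al.'s
    mutual-information version)."""
    n_features = X.shape[1]
    relevance = np.array([abs(spearmanr(X[:, j], y)[0]) for j in range(n_features)])
    selected = [int(np.argmax(relevance))]      # first pick: most relevant feature
    while len(selected) < m:                    # incrementally add features
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean([abs(spearmanr(X[:, j], X[:, s])[0]) for s in selected])
            score = relevance[j] - redundancy   # relevance/redundancy compromise, Eq. (1)
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

# Tiny demonstration on synthetic data: f1 nearly duplicates f0, f2 is
# independent noise, so the redundant duplicate should be skipped.
rng = np.random.default_rng(2)
y = rng.integers(0, 2, size=120).astype(float)
f0 = y + 0.1 * rng.normal(size=120)
f1 = f0 + 0.01 * rng.normal(size=120)
f2 = rng.normal(size=120)
X = np.column_stack([f0, f1, f2])
print(mrmr_select(X, y, m=2))
```

Even though f1 is highly relevant on its own, its redundancy with the already-selected f0 pushes its score below that of the weakly relevant but non-redundant f2, which is exactly the compromise Eq. (1) encodes.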
It is advisable to repeat this feature selection process a number of times, where each time a different sample subset is drawn from the original dataset (we

suggest using 100 iterations, where on each iteration we randomly select 90% of the data and use this subset for selecting the features). Theoretically, the selected features should be identical across all subsets: this would be the true ordering. In practice, however, it is likely that a different feature ordering will result for each sample subset drawn from the original data matrix. Then, we can either select the feature subset that occurs most often, or select individually the features which appear most often. One robust mechanism to select features in this case is described in Tsanas et al. (2012).

Table 1: Incremental feature selection steps in mRMR
1. (Selecting the first feature index) Include the feature index j = argmax_{j ∈ Q} |ρ(f_j, y)| in the initially empty set S, that is S ← {j}.
2. (Selecting the next features, one at each step) Repeatedly apply the criterion in Eq. (1) to incrementally select the next feature index j, and include it in the set: S ← S ∪ {j}, until m features have been selected.
3. Obtain the feature subset by selecting the features {f_j}, j ∈ S, from the original data matrix X.

Mapping explanatory variables to the response variable

As we mentioned at the beginning of this chapter, in a wide range of problems we are interested in determining a functional mapping of the explanatory variables to the response variable, that is, finding a function h which uses X to predict y: ŷ = h(X). This can be achieved in two ways: (a) we can impose a structure on the functional form of h and aim to determine the parameters of that functional form (parametric setting), or (b) we can allow the data itself to determine the structure and the parameters of that structure (non-parametric setting). Both approaches have merit, and there is considerable interest amongst

statisticians regarding the merits of either approach (Breiman, 2001a). One example of the parametric setting has the linear form h(x) = w_0 + w_1 x_1 + … + w_M x_M, where h denotes the mapping from explanatory variables to response and the vector of parameters w = (w_0, w_1, …, w_M) needs to be estimated. We remind the reader that M represents the number of explanatory variables in the design matrix we use. In case we have selected a smaller subset of m explanatory variables (e.g. using the feature selection technique mRMR described in the previous section), the parametric model would use fewer parameters accordingly. Parametric settings are generally simpler than non-parametric settings, and may be more easily interpretable. If the functional form (model structure) is known a priori, then parametric settings can be very useful. However, imposing an inaccurate model structure may lead to false interpretation of the properties of the data. Hence, in practice, using a non-parametric functional form may often be more appropriate. We will now briefly introduce some widely used classifiers, which our readers may already be familiar with, and one more complicated classifier which often works very well in practice. For specific algorithmic details we refer to Bishop (2007) and Hastie et al. (2009). The first classifier is known by the name Logistic Regression (LR), which may be considered a misnomer since, by definition, it works on classification problems (Bishop, 2007). This classifier is frequently used by clinicians when constructing the functional form to identify the effects of features on the response (Breiman, 2001a). Originally LR was proposed for binary classification settings, but it has been generalized to multi-class classification problems as well. Conceptually, this classifier provides a model which relies linearly on the explanatory variables. LR models have found extensive use in medical applications where the aim is to understand how the explanatory variables affect the response variable (i.e.
it can be considered a conceptual extension of the correlation coefficients we mentioned previously) by looking at the coefficients associated with each feature. Nevertheless, practice has shown that LR models can lead to faulty conclusions

in the presence of correlated features (Bishop, 2007), and hence we suggest extreme caution in interpreting the values of the LR coefficients. LR models require a relatively large number of samples compared to the number of explanatory variables in order to have confidence in the computed results (Bishop, 2007; Hastie et al., 2009). Random Forests (RF) is a powerful non-parametric classifier, which can provide a model where the explanatory variables combine nonlinearly to estimate the response variable (Breiman, 2001b). It is constructed by combining many base experts, the trees (by default 500 trees), and then uses majority voting over the trees to decide on the final output. We will not go into the mathematical details of the construction of the trees, since they are readily available elsewhere (Breiman, 2001b; Hastie et al., 2009); instead, we provide an intuitive overview of this powerful machine learning approach. Interestingly, the way trees are built is not dissimilar to the mindset of clinicians: there are successive binary splits of the data before reaching a conclusion on how to classify a new sample. Effectively, trees partition the data based on a single feature at each decision point (node) to split the population. This approach can be compared to the way clinicians decide on the optimal course of treatment for a patient. For example, their first criterion could be age, where the optimal treatment is different for people who are over 50 years old. Then, for those patients who are under 50, gender may be crucial, and a different therapy is applied. Similarly, for those who are over 50, gender is possibly not important, but there is some other parameter that the clinicians would consider before deciding on the treatment. This scenario is presented schematically in Fig.
1, and this is how trees actually work (note that a particular feature can be used more than once, and it is possible that some features will not be used at all). Similarly to a clinician, the final decision of the tree is reached when we follow the nodes in the tree and we are confident there are no more useful

splits. In practice, the trees are grown until we have reached a pre-specified minimum number of samples assigned to each node.

Fig. 1: Example of a tree. In each node, the decision is to go to the left side of the tree if the statement is true, and to the right side of the tree if the statement is false. The point where a decision is made to assign the output to a specific value is known as a leaf. Here, we have used variables which may be familiar to clinicians over which we split the data.

Many dissimilar trees constitute the random forest. The algorithmic trick to grow diverse trees is to limit the number of features that can be used at each node by each tree. The binary splits used in the tree generation process lead to nonlinear combinations between the features, and hence RF often outperforms linear classifiers such as LR (we will see specific examples in the following sections).

Model validation and generalization

Once the functional form has been determined, we need to establish how accurate the mapping could be if a new dataset with similar properties to the dataset used to

obtain the model is collected. This is known as the generalization performance of the model, which is typically estimated using (a) cross validation, (b) bootstrapping, or (c) an additional dataset which has not been used to train the model (i.e. in the determination of the model parameters). We endorse the use of cross validation (CV), a well-known statistical re-sampling technique (Webb, 2002), because it is usually the simplest approach. Specifically, in CV the original dataset is split into a training subset, which is used to determine the model, and a testing subset, which is used to assess the classifier's generalization performance. The number of splits is determined by the researcher and is known as K-fold cross validation, with typical choices being 5-fold and 10-fold (Hastie et al., 2009). The model parameters are determined using the training subset, and errors are computed using the testing subset (out-of-sample error or testing error). The process should be repeated a large number of times (e.g. 100), where the dataset is randomly permuted in each run prior to splitting into training and testing subsets, in order to obtain statistical confidence in this assessment. Depending on the requirements of the problem, different loss functions can be introduced. In all cases, on each repetition we record an error of the form L(y_i, ŷ_i), i = 1, …, N, where N represents the number of samples in the training or testing subset, y_i is the true class, and ŷ_i is the estimated class of the i-th sample. The choice of loss function is critical and depends on the demands of the application. The simplest loss function is misclassification, i.e. counting the number of samples incorrectly assigned to a different class compared to the true class. In multi-class classification settings where the response variable classes are ordinal or continuous, it may be useful to have a more convoluted loss function.
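The repeated cross-validation loop with the misclassification loss can be sketched as follows; the nearest-class-mean "classifier" below is a deliberately simple stand-in for LR or RF, and all data are synthetic (our own illustration):

```python
import numpy as np

# Synthetic binary-classification data: two weakly informative features.
rng = np.random.default_rng(3)
n = 200
y = rng.integers(0, 2, size=n)
X = y[:, None] + 0.5 * rng.normal(size=(n, 2))

def nearest_mean_predict(X_train, y_train, X_test):
    """Assign each test sample to the class with the nearest training mean
    (a toy classifier standing in for LR or RF)."""
    means = np.stack([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
    dists = ((X_test[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

errors = []
for repetition in range(100):                      # repeat with random permutations
    perm = rng.permutation(n)
    split = n // 10                                # hold out ~10% for testing
    test_idx, train_idx = perm[:split], perm[split:]
    y_hat = nearest_mean_predict(X[train_idx], y[train_idx], X[test_idx])
    errors.append(np.mean(y_hat != y[test_idx]))   # misclassification loss

print(round(100 * float(np.mean(errors)), 1), "% out-of-sample error")
```

Averaging the per-repetition errors gives the out-of-sample error reported throughout this chapter; swapping in a different loss function only changes the line where `errors.append` is computed.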
One commonly used loss function, applicable in settings where the response variable is continuous, is the mean absolute error (MAE):

MAE = (1/N) Σ_{i ∈ T} |y_i − ŷ_i|   (2)

where T contains the indices of the training or testing set and N is its size. Errors from all repetitions are averaged, and the generalization performance of the classifier is determined using the out-of-sample error. Note that in binary classification settings, MAE is equivalent to misclassification (that is, counting the number of samples assigned by the classifier to a different class than the true class). For convenience, we have expressed all results as percentage scores, i.e.

error (%) = 100 × MAE   (3)

Summary of the proposed methodology

This chapter provided a succinct data analysis guide, pruning away all mathematically complicated details and distilling only the essential knowledge for the successful practical application of the algorithmic tools. We summarize the proposed methodology in the following steps:
1) Apply various statistical tests to the data. These could include testing for the assumption of Gaussianity, determining p-values and correlation coefficients, and understanding the underlying structure of the quantities involved in the analysis. Produce density plots and scatter plots to visualize the data. These plots may suggest transforming some of the features, for example using the log-transformation if some features are not well spread out.
2) Apply standard classification algorithms (e.g. logistic regression) and also use more complicated, nonlinear methods such as random forests. All the explanatory variables are used to predict the response variable(s).

3) Select features using mRMR to derive parsimonious subsets. Use each of these subsets as inputs in step 2 and record the predictions and errors.
4) Potentially, in some datasets, reducing the number of input variables can in itself reduce the error metric (curse of dimensionality), while in other cases the use of a larger number of features may offer an insignificant performance improvement or a slight deterioration in the computed performance. By definition, a large number of features makes the resulting model computationally expensive and occludes its interpretability, and is therefore undesirable. There are various tools to address this somewhat subjective compromise between the number of features and model performance (we want the classifier to give results as accurate as possible). One approach is to use information criteria, and another is to use the one-standard-error rule. We refer to Hastie et al. (2009) for more details on both approaches.
5) Use new data or, more likely, 10-fold cross-validation with at least 100 repetitions to ensure the results are robust.

The list can easily be modified and is purposefully general, so that it is applicable to a wide range of medical applications. The field of exploratory data analysis and knowledge discovery cannot possibly be covered adequately here; we refer to the survey of Kurgan and Musilek (2006) for a relatively recent authoritative overview.

Example applying the proposed methodology in a medical problem

To demonstrate the proposed methodology in a practical problem we use the Hepatitis dataset, which has been widely studied in the literature. The dataset is available for download from the UCI machine learning repository. The problem is to investigate whether a set of features can be used to predict whether the

patient lives or dies. The design matrix has 155 × 19 elements; that is, it comprises 155 samples and 19 features. Each of the 19 features quantifies some characteristic which the researchers who collected the data believed affects the response. The response in this problem can take one of two possible values, denoting whether the patient lives or dies, i.e. we have a binary classification problem.

Table 2: Correlations between features and the response variable for the Hepatitis dataset
Feature name | Spearman correlation coefficient | Statistical significance of the correlation (p) | Samples used
ALBUMIN
ASCITES
PROTIME
SPIDERS
BILIRUBIN
VARICES
MALAISE
HISTOLOGY
FATIGUE
AGE
SPLEEN PALPABLE
ALK PHOSPHATE
SEX
STEROID
ANOREXIA
ANTIVIRALS
SGOT
LIVER BIG
LIVER FIRM

The last column in this table indicates the number of samples available for that feature. In practice, some characteristics may not be measured, which explains why we do not have 155 samples for all features. Diaconis and Efron (1983) reported 17% misclassification, whilst Breiman (2001a) reported 13% misclassification using 10-fold cross validation. To get an intuitive feel for the data, we generate density plots using histograms; in addition, we generate scatter plots to visualize the relationship between each explanatory variable and the response. Next, we compute the Spearman correlation coefficients between the explanatory variables and the response, which are presented in Table 2. The histograms and

18 the scatter plots of the five most highly correlated features appear in Fig. 2. These results give a good indication that some of the features in the dataset are well correlated with the response. The fact that the spread of the features is not bell-shaped may inspire some transformation so that the distributions become more evenly spread. For example, we could use some simple transformation of the features, e.g. the log-transformation: we compute the logarithm of all samples in a feature that we want to transform (see Tsanas et al. (2010b)). Here, we will not experiment further with any transformation of the features. Fig. 2: Histogram and scatter plots of the five most correlated features with the response for the Hepatitis dataset. The horizontal axes in the scatter plots are the normalized features to facilitate direct comparison, and the vertical axes correspond to the response variable. The gray lines are the best linear fit of the data, giving a visual impression of the behaviour of the feature. This preliminary analysis concludes the first step in the proposed methodological guide. The subsequent steps will be integrated in the following section, where in addition to the Hepatitis dataset, additional indicative medical datasets are investigated. 18
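The exploratory step just described (ranking features by their Spearman correlation with the response, and recording the per-feature sample counts that arise from missing values) can be sketched in a few lines of Python. The data and feature names below are synthetic stand-ins for the Hepatitis dataset, which must be obtained separately from the UCI repository.

```python
# Rank features by the absolute Spearman correlation between each feature
# and the binary response, using pairwise-complete samples (the Hepatitis
# data has per-feature missingness). Data and names are illustrative.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 155                                    # samples, as in the Hepatitis dataset
X = rng.normal(size=(n, 3))                # stand-in design matrix
y = (X[:, 0] + 0.1 * rng.normal(size=n) > 0).astype(int)  # binary response
X[rng.random((n, 3)) < 0.1] = np.nan       # inject some missing values
names = ["ALBUMIN", "ASCITES", "PROTIME"]  # illustrative subset of features

rows = []
for j, name in enumerate(names):
    keep = ~np.isnan(X[:, j])              # pairwise-complete samples only
    rho, p = spearmanr(X[keep, j], y[keep])
    rows.append((name, rho, p, int(keep.sum())))

# Sort by decreasing |rho|, mirroring the presentation of Table 2
for name, rho, p, m in sorted(rows, key=lambda r: -abs(r[1])):
    print(f"{name:10s} rho={rho:+.3f} p={p:.2e} n={m}")
```

Note that the number of usable samples is reported per feature, exactly as in the last column of Table 2.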

Comparison of logistic regression and random forests in various datasets

In this section we introduce additional datasets from real-world medical problems to demonstrate how predictive modelling is influenced by the choice between LR and RF. Moreover, we investigate the effect of using feature selection to obtain a robust parsimonious feature subset, and feed this subset into the LR or RF classifier. All the datasets are available from the UCI machine learning repository (4), with the exception of the SRBCT dataset, which was downloaded from a separate source (see below). Due to space constraints we keep the description of these datasets to a minimum, and refer the reader to the original studies cited in Table 3 and to the UCI machine learning repository for further details.

The Parkinson's dataset was generated in Little et al. (2009). The aim is to characterize speech signals by computing some distinctive characteristics (features) in the voices of people with Parkinson's disease versus healthy controls (a binary classification problem). An extension of this concept is inferring Parkinson's disease symptom severity using speech signals (Tsanas et al., 2010a; Tsanas et al., 2011). Little et al. (2009) reported approximately 10% misclassified cases when using a subset of only four of the 22 originally computed features. The Liver dataset is also a binary classification dataset, where the explanatory variables refer to blood tests and the number of drinks per day. The Lymphography dataset poses a four-class classification problem (the four classes are: healthy control, metastases, malign lymph, fibrosis), where the explanatory variables are various characteristics that oncologists consider relevant, such as lymphatics, changes in lymphoma type, and node characteristics. The Breast tissue dataset has nine features from impedance measurements, used to predict the type of tissue, such as carcinoma and adipose tissue (six classes). The SRBCT dataset contains 83 samples and 2308 gene expression values. The response variable denotes the tumour type (4 classes). We downloaded the dataset from a source which has split the original dataset into two subsets: a training subset with 63 samples and a testing subset with 20 samples. The SRBCT dataset is particularly challenging because the number of features (the genes in this case) is considerably larger than the number of samples. This is known to be a scenario where LR often fails to generalize well.

(4) The UCI machine learning repository hosts many datasets which are freely available online.

Table 4 presents the misclassifications computed for each dataset (we report the results on the data kept for testing). All the features are used to obtain these results. We also experiment with mRMR to select features, and feed the selected features into the classifier (LR or RF). These results appear in Fig. 3. Collectively, these findings suggest that RF consistently and significantly outperforms LR, with a mean relative improvement across datasets of about 26% (without including the SRBCT dataset, where LR massively underperforms). Moreover, in settings where the number of features is larger than the number of samples, the improvement with RF is even more impressive. Interestingly, in some datasets a lower number of features leads to a lower misclassification error. This is a manifestation of the curse of dimensionality, where additional features decrease the signal-to-noise ratio in the data and are detrimental to the performance of the classifier. We remark that RF is fairly robust, and that LR is particularly sensitive to the ratio of the number of features to the number of samples, thus verifying previous reports (Breiman, 2001a; Hastie et al., 2009).

Overall, the aim of this study was to encourage researchers to investigate data beyond simply reporting correlations and statistical significance values. We believe that following the simple steps outlined in this study provides a concise guide towards inferring key properties of the examined dataset.

Table 3: Summary of datasets

Dataset | Design matrix | Associated task | Feature type
Hepatitis (Diaconis and Efron, 1983) | | Classification (2 classes) | C (17), D (2)
Parkinson's (Little et al., 2009) | | Classification (2 classes) | C (22)
Liver | | Classification (2 classes) | D (6)
Lymphography | | Classification (4 classes) | D (18)
Breast tissue (Jossinet, 1996) | | Classification (6 classes) | C (9)
SRBCT (Khan et al., 2001) | | Classification (4 classes) | C (2308)

The design matrix is in the form N × M, where N is the number of samples and M is the number of features. Samples with missing values were removed. Feature type denotes whether the features in the design matrix are continuous (C) or discrete (D).

Table 4: Comparison of LR and RF when using all explanatory variables

Dataset | Misclassification (%) with LR | Misclassification (%) with RF | Misclassification difference LR-RF | Relative improvement (%) | Validation scheme
Hepatitis | ± | ± | | | 10-fold CV
Parkinson's | ± | ± | | | 10-fold CV
Liver | ± | ± | | | 10-fold CV
Lymphography | | | | | LOO
Breast tissue | | | | | LOO
SRBCT | | | | | Test set

The misclassification (%) results are reported in the form mean ± standard deviation. LR stands for logistic regression and RF for random forests. The misclassification difference between LR and RF was in all cases statistically significant using the Mann-Whitney statistical hypothesis test. The relative improvement expresses in % terms the performance boost when using RF over LR. It was defined as: relative improvement = 100 × [misclassification(LR) - misclassification(RF)] / misclassification(LR). The closer the relative improvement is to 100%, the greater the improvement RF provides over LR. The validation scheme we used depended on the number of samples and the number of classes in the dataset: we used 10-fold cross-validation (CV) when we had a relatively large number of samples for the number of classes (at least 15 samples for each class), and leave-one-sample-out (LOO) when we had fewer than 15 samples per class. When 10-fold CV was used to validate the accuracy of the classifier, we also used 100 repetitions for statistical confidence.
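As a concrete, deliberately simplified sketch of the comparison in Table 4, the following Python code runs repeated stratified 10-fold cross-validation for LR and RF, applies the Mann-Whitney test to the per-repetition misclassification rates, and computes the relative improvement as defined in the notes to Table 4. The scikit-learn breast-cancer data is only a stand-in for the chapter's datasets, the repetitions are cut from 100 to 5 for speed, and on this particular stand-in dataset either classifier may come out ahead.

```python
# Repeated 10-fold CV comparison of logistic regression (LR) and
# random forests (RF), with a Mann-Whitney test on the per-repetition
# misclassification rates and the relative-improvement figure.
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
rf = RandomForestClassifier(n_estimators=100, random_state=0)

def misclassification(model, reps=5):
    """Misclassification (%) for each repetition of 10-fold CV."""
    rates = []
    for rep in range(reps):
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=rep)
        acc = cross_val_score(model, X, y, cv=cv).mean()
        rates.append(100.0 * (1.0 - acc))
    return np.array(rates)

m_lr, m_rf = misclassification(lr), misclassification(rf)
_, p = mannwhitneyu(m_lr, m_rf)  # two-sided test on per-repetition rates
rel = 100.0 * (m_lr.mean() - m_rf.mean()) / m_lr.mean()  # Table 4 definition
print(f"LR {m_lr.mean():.2f}%  RF {m_rf.mean():.2f}%  "
      f"relative improvement {rel:.1f}%  p = {p:.3f}")
```

Varying the fold assignment across repetitions (via `random_state=rep`) is what gives the Mann-Whitney test a sample of misclassification rates to compare.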

Fig. 3: Misclassification percentages as a function of the number of features, selected using mRMR, that are included in the classifier. These results suggest that random forests (RF) consistently outperform logistic regression (LR) across a wide variety of medical datasets.
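The experiment behind Fig. 3 can be sketched as follows: misclassification as a function of how many top-ranked features are fed to each classifier. scikit-learn does not provide mRMR, so a univariate mutual-information ranking (`SelectKBest` with `mutual_info_classif`) stands in for it here, the breast-cancer data again substitutes for the chapter's datasets, and 5-fold CV keeps the run short; with a genuine mRMR implementation the selected subsets, and hence the curves, would differ.

```python
# Misclassification (%) versus number of selected features, for LR and RF,
# using a mutual-information feature ranking as a stand-in for mRMR.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

results = {}
for k in (2, 5, 10, X.shape[1]):  # number of top-ranked features to keep
    for name, clf in (("LR", LogisticRegression(max_iter=1000)),
                      ("RF", RandomForestClassifier(n_estimators=100,
                                                    random_state=0))):
        # Selection happens inside the pipeline, so each CV fold re-ranks
        # the features on its own training split (no selection bias).
        model = make_pipeline(StandardScaler(),
                              SelectKBest(mutual_info_classif, k=k),
                              clf)
        acc = cross_val_score(model, X, y, cv=5).mean()
        results[(name, k)] = 100.0 * (1.0 - acc)
        print(f"{name}, k={k:2d}: misclassification {results[(name, k)]:.2f}%")
```

Placing the selector inside the pipeline matters: ranking features on the full dataset before cross-validating would leak test information into the selection step.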

Take-home message

Merely reporting correlations and p-values is not indicative of how accurately the outcome can be estimated using the explanatory variables. Selecting a subset of the originally computed explanatory variables is always beneficial in terms of interpretation, and also often in terms of estimation accuracy. Random forests demonstrate a relative improvement over logistic regression of 36%, and are substantially better in settings where the number of explanatory variables is larger than the number of samples.

Acknowledgments

A. Tsanas gratefully acknowledges the financial support of Intel Corporation and the Engineering and Physical Sciences Research Council (EPSRC). M.A. Little acknowledges the financial support of the Wellcome Trust, grant number WT090651MF. P.E. McSharry acknowledges the financial support of the European Commission through the SafeWind project (ENK7-CT ). We also want to thank all the researchers who deposited their datasets in the UCI machine learning repository.

References

J. Aldrich, Correlations Genuine and Spurious in Pearson and Yule, Statistical Science, Vol. 10(4), 1995

C. Bishop, Pattern Recognition and Machine Learning, Springer, 2007

L. Breiman, Statistical modelling: the two cultures, Statistical Science, Vol. 16(3) (with comments and discussion), 2001a

L. Breiman, Random forests, Machine Learning, Vol. 45, pp. 5-32, 2001b

J. Cohen, P. Cohen, S.G. West, L.S. Aiken, Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, Routledge Academic, 3rd ed.

P. Diaconis and B. Efron, Computer-intensive methods in statistics, Scientific American, Vol. 248, 1983

I. Guyon, S. Gunn, M. Nikravesh, L.A. Zadeh (Eds.), Feature Extraction: Foundations and Applications, Springer, 2006

T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2nd ed., 2009

J. Jossinet, Variability of impedivity in normal and pathological breast tissue, Med. & Biol. Eng. & Comput., Vol. 34, 1996

J. Khan, J.S. Wei, M. Ringner, L.H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C.R. Antonescu, C. Peterson, P.S. Meltzer, Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks, Nature Medicine, Vol. 7, 2001

L.A. Kurgan and P. Musilek, A survey of Knowledge Discovery and Data Mining process models, The Knowledge Engineering Review, Vol. 21(1), pp. 1-24, 2006

M.A. Little, P.E. McSharry, E.J. Hunter, J. Spielman, L.O. Ramig, Suitability of dysphonia measurements for telemonitoring of Parkinson's disease, IEEE Transactions on Biomedical Engineering, Vol. 56(4), 2009

H. Peng, F. Long and C. Ding, Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27(8), 2005

K. Torkkola, Feature extraction by non-parametric mutual information maximization, Journal of Machine Learning Research, Vol. 3, 2003

A. Tsanas, M.A. Little, P.E. McSharry, L.O. Ramig, Accurate telemonitoring of Parkinson's disease progression by non-invasive speech tests, IEEE Transactions on Biomedical Engineering, Vol. 57, 2010a

A. Tsanas, M.A. Little, P.E. McSharry, L.O. Ramig, Enhanced classical dysphonia measures and sparse regression for telemonitoring of Parkinson's disease progression, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Dallas, Texas, US, March 2010b

A. Tsanas, M.A. Little, P.E. McSharry, L.O. Ramig, Nonlinear speech analysis algorithms mapped to a standard metric achieve clinically useful quantification of average Parkinson's disease symptom severity, Journal of the Royal Society Interface, Vol. 8, 2011

A. Tsanas, M.A. Little, P.E. McSharry, J. Spielman, L.O. Ramig, Novel speech signal processing algorithms for high-accuracy classification of Parkinson's disease, IEEE Transactions on Biomedical Engineering, Vol. 59, 2012

A. Webb, Statistical Pattern Recognition, John Wiley and Sons Ltd


More information

A HMM-based Pre-training Approach for Sequential Data

A HMM-based Pre-training Approach for Sequential Data A HMM-based Pre-training Approach for Sequential Data Luca Pasa 1, Alberto Testolin 2, Alessandro Sperduti 1 1- Department of Mathematics 2- Department of Developmental Psychology and Socialisation University

More information

Feasibility Study in Digital Screening of Inflammatory Breast Cancer Patients using Selfie Image

Feasibility Study in Digital Screening of Inflammatory Breast Cancer Patients using Selfie Image Feasibility Study in Digital Screening of Inflammatory Breast Cancer Patients using Selfie Image Reshma Rajan and Chang-hee Won CSNAP Lab, Temple University Technical Memo Abstract: Inflammatory breast

More information

Section 6: Analysing Relationships Between Variables

Section 6: Analysing Relationships Between Variables 6. 1 Analysing Relationships Between Variables Section 6: Analysing Relationships Between Variables Choosing a Technique The Crosstabs Procedure The Chi Square Test The Means Procedure The Correlations

More information

Accurate telemonitoring of Parkinson s disease progression by non-invasive speech tests

Accurate telemonitoring of Parkinson s disease progression by non-invasive speech tests Non-invasive telemonitoring of Parkinson s disease, Tsanas et al. 1 Accurate telemonitoring of Parkinson s disease progression by non-invasive speech tests Athanasios Tsanas*, Max A. Little, Member, IEEE,

More information

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES 24 MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES In the previous chapter, simple linear regression was used when you have one independent variable and one dependent variable. This chapter

More information

Personalized Colorectal Cancer Survivability Prediction with Machine Learning Methods*

Personalized Colorectal Cancer Survivability Prediction with Machine Learning Methods* Personalized Colorectal Cancer Survivability Prediction with Machine Learning Methods* 1 st Samuel Li Princeton University Princeton, NJ seli@princeton.edu 2 nd Talayeh Razzaghi New Mexico State University

More information

Credal decision trees in noisy domains

Credal decision trees in noisy domains Credal decision trees in noisy domains Carlos J. Mantas and Joaquín Abellán Department of Computer Science and Artificial Intelligence University of Granada, Granada, Spain {cmantas,jabellan}@decsai.ugr.es

More information

INADEQUACIES OF SIGNIFICANCE TESTS IN

INADEQUACIES OF SIGNIFICANCE TESTS IN INADEQUACIES OF SIGNIFICANCE TESTS IN EDUCATIONAL RESEARCH M. S. Lalithamma Masoomeh Khosravi Tests of statistical significance are a common tool of quantitative research. The goal of these tests is to

More information

Bayesian and Frequentist Approaches

Bayesian and Frequentist Approaches Bayesian and Frequentist Approaches G. Jogesh Babu Penn State University http://sites.stat.psu.edu/ babu http://astrostatistics.psu.edu All models are wrong But some are useful George E. P. Box (son-in-law

More information

Technical Specifications

Technical Specifications Technical Specifications In order to provide summary information across a set of exercises, all tests must employ some form of scoring models. The most familiar of these scoring models is the one typically

More information

ParkDiag: A Tool to Predict Parkinson Disease using Data Mining Techniques from Voice Data

ParkDiag: A Tool to Predict Parkinson Disease using Data Mining Techniques from Voice Data ParkDiag: A Tool to Predict Parkinson Disease using Data Mining Techniques from Voice Data Tarigoppula V.S. Sriram 1, M. Venkateswara Rao 2, G.V. Satya Narayana 3 and D.S.V.G.K. Kaladhar 4 1 CSE, Raghu

More information

Multichannel Classification of Single EEG Trials with Independent Component Analysis

Multichannel Classification of Single EEG Trials with Independent Component Analysis In J. Wang et a]. (Eds.), Advances in Neural Networks-ISNN 2006, Part 111: 54 1-547. Berlin: Springer. Multichannel Classification of Single EEG Trials with Independent Component Analysis Dik Kin Wong,

More information

Funnelling Used to describe a process of narrowing down of focus within a literature review. So, the writer begins with a broad discussion providing b

Funnelling Used to describe a process of narrowing down of focus within a literature review. So, the writer begins with a broad discussion providing b Accidental sampling A lesser-used term for convenience sampling. Action research An approach that challenges the traditional conception of the researcher as separate from the real world. It is associated

More information

Computer Age Statistical Inference. Algorithms, Evidence, and Data Science. BRADLEY EFRON Stanford University, California

Computer Age Statistical Inference. Algorithms, Evidence, and Data Science. BRADLEY EFRON Stanford University, California Computer Age Statistical Inference Algorithms, Evidence, and Data Science BRADLEY EFRON Stanford University, California TREVOR HASTIE Stanford University, California ggf CAMBRIDGE UNIVERSITY PRESS Preface

More information

Early Detection of Lung Cancer

Early Detection of Lung Cancer Early Detection of Lung Cancer Aswathy N Iyer Dept Of Electronics And Communication Engineering Lymie Jose Dept Of Electronics And Communication Engineering Anumol Thomas Dept Of Electronics And Communication

More information

Agents with Attitude: Exploring Coombs Unfolding Technique with Agent-Based Models

Agents with Attitude: Exploring Coombs Unfolding Technique with Agent-Based Models Int J Comput Math Learning (2009) 14:51 60 DOI 10.1007/s10758-008-9142-6 COMPUTER MATH SNAPHSHOTS - COLUMN EDITOR: URI WILENSKY* Agents with Attitude: Exploring Coombs Unfolding Technique with Agent-Based

More information

South Australian Research and Development Institute. Positive lot sampling for E. coli O157

South Australian Research and Development Institute. Positive lot sampling for E. coli O157 final report Project code: Prepared by: A.MFS.0158 Andreas Kiermeier Date submitted: June 2009 South Australian Research and Development Institute PUBLISHED BY Meat & Livestock Australia Limited Locked

More information

Application of Local Control Strategy in analyses of the effects of Radon on Lung Cancer Mortality for 2,881 US Counties

Application of Local Control Strategy in analyses of the effects of Radon on Lung Cancer Mortality for 2,881 US Counties Application of Local Control Strategy in analyses of the effects of Radon on Lung Cancer Mortality for 2,881 US Counties Bob Obenchain, Risk Benefit Statistics, August 2015 Our motivation for using a Cut-Point

More information

Introduction & Basics

Introduction & Basics CHAPTER 1 Introduction & Basics 1.1 Statistics the Field... 1 1.2 Probability Distributions... 4 1.3 Study Design Features... 9 1.4 Descriptive Statistics... 13 1.5 Inferential Statistics... 16 1.6 Summary...

More information

DPPred: An Effective Prediction Framework with Concise Discriminative Patterns

DPPred: An Effective Prediction Framework with Concise Discriminative Patterns IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, MANUSCRIPT ID DPPred: An Effective Prediction Framework with Concise Discriminative Patterns Jingbo Shang, Meng Jiang, Wenzhu Tong, Jinfeng Xiao, Jian

More information

EECS 433 Statistical Pattern Recognition

EECS 433 Statistical Pattern Recognition EECS 433 Statistical Pattern Recognition Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1 / 19 Outline What is Pattern

More information

The Long Tail of Recommender Systems and How to Leverage It

The Long Tail of Recommender Systems and How to Leverage It The Long Tail of Recommender Systems and How to Leverage It Yoon-Joo Park Stern School of Business, New York University ypark@stern.nyu.edu Alexander Tuzhilin Stern School of Business, New York University

More information

Intelligent Edge Detector Based on Multiple Edge Maps. M. Qasim, W.L. Woon, Z. Aung. Technical Report DNA # May 2012

Intelligent Edge Detector Based on Multiple Edge Maps. M. Qasim, W.L. Woon, Z. Aung. Technical Report DNA # May 2012 Intelligent Edge Detector Based on Multiple Edge Maps M. Qasim, W.L. Woon, Z. Aung Technical Report DNA #2012-10 May 2012 Data & Network Analytics Research Group (DNA) Computing and Information Science

More information

On the Combination of Collaborative and Item-based Filtering

On the Combination of Collaborative and Item-based Filtering On the Combination of Collaborative and Item-based Filtering Manolis Vozalis 1 and Konstantinos G. Margaritis 1 University of Macedonia, Dept. of Applied Informatics Parallel Distributed Processing Laboratory

More information

Overview of Non-Parametric Statistics

Overview of Non-Parametric Statistics Overview of Non-Parametric Statistics LISA Short Course Series Mark Seiss, Dept. of Statistics April 7, 2009 Presentation Outline 1. Homework 2. Review of Parametric Statistics 3. Overview Non-Parametric

More information

RESPONSE SURFACE MODELING AND OPTIMIZATION TO ELUCIDATE THE DIFFERENTIAL EFFECTS OF DEMOGRAPHIC CHARACTERISTICS ON HIV PREVALENCE IN SOUTH AFRICA

RESPONSE SURFACE MODELING AND OPTIMIZATION TO ELUCIDATE THE DIFFERENTIAL EFFECTS OF DEMOGRAPHIC CHARACTERISTICS ON HIV PREVALENCE IN SOUTH AFRICA RESPONSE SURFACE MODELING AND OPTIMIZATION TO ELUCIDATE THE DIFFERENTIAL EFFECTS OF DEMOGRAPHIC CHARACTERISTICS ON HIV PREVALENCE IN SOUTH AFRICA W. Sibanda 1* and P. Pretorius 2 1 DST/NWU Pre-clinical

More information

Research Supervised clustering of genes Marcel Dettling and Peter Bühlmann

Research Supervised clustering of genes Marcel Dettling and Peter Bühlmann http://genomebiology.com/22/3/2/research/69. Research Supervised clustering of genes Marcel Dettling and Peter Bühlmann Address: Seminar für Statistik, Eidgenössische Technische Hochschule (ETH) Zürich,

More information

Keywords Artificial Neural Networks (ANN), Echocardiogram, BPNN, RBFNN, Classification, survival Analysis.

Keywords Artificial Neural Networks (ANN), Echocardiogram, BPNN, RBFNN, Classification, survival Analysis. Design of Classifier Using Artificial Neural Network for Patients Survival Analysis J. D. Dhande 1, Dr. S.M. Gulhane 2 Assistant Professor, BDCE, Sevagram 1, Professor, J.D.I.E.T, Yavatmal 2 Abstract The

More information

Generalization and Theory-Building in Software Engineering Research

Generalization and Theory-Building in Software Engineering Research Generalization and Theory-Building in Software Engineering Research Magne Jørgensen, Dag Sjøberg Simula Research Laboratory {magne.jorgensen, dagsj}@simula.no Abstract The main purpose of this paper is

More information

Performance of Median and Least Squares Regression for Slightly Skewed Data

Performance of Median and Least Squares Regression for Slightly Skewed Data World Academy of Science, Engineering and Technology 9 Performance of Median and Least Squares Regression for Slightly Skewed Data Carolina Bancayrin - Baguio Abstract This paper presents the concept of

More information

Predicting Kidney Cancer Survival from Genomic Data

Predicting Kidney Cancer Survival from Genomic Data Predicting Kidney Cancer Survival from Genomic Data Christopher Sauer, Rishi Bedi, Duc Nguyen, Benedikt Bünz Abstract Cancers are on par with heart disease as the leading cause for mortality in the United

More information

Impute vs. Ignore: Missing Values for Prediction

Impute vs. Ignore: Missing Values for Prediction Proceedings of International Joint Conference on Neural Networks, Dallas, Texas, USA, August 4-9, 2013 Impute vs. Ignore: Missing Values for Prediction Qianyu Zhang, Ashfaqur Rahman, and Claire D Este

More information

Quantitative Methods in Computing Education Research (A brief overview tips and techniques)

Quantitative Methods in Computing Education Research (A brief overview tips and techniques) Quantitative Methods in Computing Education Research (A brief overview tips and techniques) Dr Judy Sheard Senior Lecturer Co-Director, Computing Education Research Group Monash University judy.sheard@monash.edu

More information

Multiple Bivariate Gaussian Plotting and Checking

Multiple Bivariate Gaussian Plotting and Checking Multiple Bivariate Gaussian Plotting and Checking Jared L. Deutsch and Clayton V. Deutsch The geostatistical modeling of continuous variables relies heavily on the multivariate Gaussian distribution. It

More information

CHAPTER VI RESEARCH METHODOLOGY

CHAPTER VI RESEARCH METHODOLOGY CHAPTER VI RESEARCH METHODOLOGY 6.1 Research Design Research is an organized, systematic, data based, critical, objective, scientific inquiry or investigation into a specific problem, undertaken with the

More information

A STATISTICAL PATTERN RECOGNITION PARADIGM FOR VIBRATION-BASED STRUCTURAL HEALTH MONITORING

A STATISTICAL PATTERN RECOGNITION PARADIGM FOR VIBRATION-BASED STRUCTURAL HEALTH MONITORING A STATISTICAL PATTERN RECOGNITION PARADIGM FOR VIBRATION-BASED STRUCTURAL HEALTH MONITORING HOON SOHN Postdoctoral Research Fellow ESA-EA, MS C96 Los Alamos National Laboratory Los Alamos, NM 87545 CHARLES

More information

Six Sigma Glossary Lean 6 Society

Six Sigma Glossary Lean 6 Society Six Sigma Glossary Lean 6 Society ABSCISSA ACCEPTANCE REGION ALPHA RISK ALTERNATIVE HYPOTHESIS ASSIGNABLE CAUSE ASSIGNABLE VARIATIONS The horizontal axis of a graph The region of values for which the null

More information

Basic Biostatistics. Chapter 1. Content

Basic Biostatistics. Chapter 1. Content Chapter 1 Basic Biostatistics Jamalludin Ab Rahman MD MPH Department of Community Medicine Kulliyyah of Medicine Content 2 Basic premises variables, level of measurements, probability distribution Descriptive

More information