Chapter 1: Introduction


Index
1.1. Background
1.2. Problem statement
1.3. Aims and objectives

1.1. Background

HIV/AIDS is a leading health problem in the sub-Saharan African region. The need to formulate well-thought-out and effective measures to understand the dynamics of HIV/AIDS cannot be overemphasized. Seroprevalence data are HIV data collected through blood surveys conducted on expectant mothers visiting antenatal clinics throughout the Republic of South Africa. It is well known that data collected from antenatal seroprevalence surveys tend to overestimate HIV prevalence, because the information is observed from only one sector of the population, namely pregnant women. It is also known that women infected with HIV have lower pregnancy rates than uninfected women. Notwithstanding these shortcomings, antenatal seroprevalence surveys still rank highly as a reliable approach for estimating HIV prevalence in the entire adult population of a country. In South Africa, the prevalence of HIV has been used for many years to gauge the spread of the HIV pandemic. The introduction of life-saving anti-retroviral drugs (ARVs) has increased the difficulty of interpreting prevalence data, owing to changes in the survival period from infection to death. In that regard, the incidence of HIV infection (i.e. the rate at which new infections are acquired over a defined period of time) is a much more sensitive measure of the current state of the epidemic and of the impact of programs. Mathematical and statistical models are essential in enhancing our understanding of changes in the behavior of the HIV epidemic. On that basis, the aim of any mathematical and statistical modeling methodology is to extract as much useful knowledge as possible from a given database. A number of different models of HIV and AIDS have been developed, ranging from simple extrapolations of past curves to complex transmission models (UNAIDS, 2010).

1.2. Problem statement

The antenatal HIV seroprevalence data comprise the following demographic characteristics for each pregnant woman: age, partner's age, population group, level of education, gravidity, parity, marital status, province, region, HIV status and syphilis status. It is therefore clear that the seroprevalence database presents a wealth of information. Judging from the existing modeling techniques and research surveys, very little work has been done to fully understand this vast amount of data. This research will attempt to answer questions such as: what does the antenatal HIV seroprevalence database tell us, and how can this database be used to improve the interventions conducted by the government to curb the spread of the HIV pandemic? This will therefore entail using relevant statistical techniques to fully understand the database (Sibanda & Pretorius 2011). Central to this research will be the objective of understanding in detail the differential effects of the demographic characteristics of pregnant women on their risk of acquiring HIV infection, using unorthodox methodologies such as design of experiments, artificial neural networks and binary logistic regression. Design of experiments is traditionally a structured, intensive methodology used for finding solutions to problems of an engineering nature. The technique enables the formulation of sound engineering solutions. Neural networks consist of artificial neurons that process information. In most cases, a neural network is an adaptive system that changes its structure during a learning phase. In that regard, neural networks are used to model complex relationships between inputs and outputs and to find patterns in data. Neural networks have been applied to a wide range of applications such as character recognition, image compression and stock market prediction. This research will therefore attempt to use neural networks in studying the antenatal HIV seroprevalence data.
Logistic regression is a statistical methodology for inferring the outcomes of a categorical dependent variable from one or more predictor variables. The probabilities describing the possible outcomes of a single event are modeled, as a function of the explanatory variables, using a logistic function. Statistically, the categorical outcomes may be binary or ordinal, and the predictor variables may be continuous or categorical. In this research, this will involve modeling the presence or absence of HIV infection using demographic characteristics as predictor variables.
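As a rough illustration of this kind of model (the study itself uses SAS; the synthetic "age" data, the effect size and the learning rate below are all invented for exposition), a binary logistic regression of HIV status on a single demographic predictor can be fitted by maximum likelihood:

```python
import numpy as np

# Illustrative sketch only: the synthetic data and effect size are invented.
rng = np.random.default_rng(0)
age = rng.uniform(15, 45, size=500)
# Hypothetical ground truth: log odds of infection decline with age.
p_true = 1.0 / (1.0 + np.exp(-(2.0 - 0.08 * age)))
hiv = rng.binomial(1, p_true)               # binary response, coded 0/1

# Standardize the predictor and fit by maximum likelihood
# (gradient ascent on the Bernoulli log-likelihood).
z = (age - age.mean()) / age.std()
X = np.column_stack([np.ones_like(z), z])   # intercept + standardized age
beta = np.zeros(2)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    beta += 0.001 * X.T @ (hiv - p)         # gradient of the log-likelihood

print(beta)  # fitted slope should be negative, matching the simulated effect
```

The fitted coefficients are on the log-odds scale, which is what gives logistic regression its direct epidemiological interpretation.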

1.3. Aims and objectives

With the enormous amount of data presented by the annual South African antenatal HIV seroprevalence surveys, it is important to develop powerful techniques to study and understand the data in order to generate valuable knowledge for sound decision making. Descriptive and predictive data mining techniques involving detailed data characterization, classification and outlier analysis will be used. Central to this study will be the characterization of the differential effects of demographic characteristics on the risk of acquiring HIV amongst pregnant women, using unorthodox statistical methods such as design of experiments and artificial neural networks. To validate the modeling results, a binary logistic regression methodology will be used. Most epidemiologists prefer the binary logistic model for studying epidemiological data, especially where binary categorical outcomes are involved. In addition, this research further investigates the usefulness of decision trees in understanding the effects of demographic characteristics on the risk of acquiring an HIV infection. The Tree node in SAS Enterprise Miner™ is part of the SAS SEMMA (Sample, Explore, Modify, Model, Assess) data mining tools. The tree represents a segmentation of the data created by applying a series of simple rules. Each rule is applied after another, resulting in a hierarchy of segments within segments, resembling a tree. In addition to nominal, binary and ordinal targets, the tree can also be used to predict outcomes for interval targets. It has been widely reported in the scientific literature that the advantage of decision trees over other modeling methodologies, such as neural networks, is that the technique produces a model that can easily be explained. Decision trees have the added advantage of being able to treat missing data as inputs. Receiver operating characteristic (ROC) curves developed using SAS Enterprise Miner™ (SAS Institute Inc. 2012) were used to compare the classification accuracy of the different modeling methodologies. In general, ROC curves are drawn by varying the cut-off point that determines which event probabilities are considered to predict the event.

Specific objectives of this project

Objective One

This study will attempt to utilize a screening design of experiments (DOE) technique to develop a list of demographic factors, ranked from important to unimportant, that affect the spread of HIV in the South African population (Sibanda & Pretorius 2011a).

Objective Two

This research step will explore the application of response surface methodology (RSM) to study the intricate relationships between the demographic characteristics in the antenatal data and the risk of acquiring an HIV infection. RSM techniques allow for the estimation of interaction and quadratic effects (Sibanda & Pretorius 2012a).

Objective Three

The third objective will compare results from two response surface methodologies in determining the effect of demographic characteristics on the HIV status of antenatal clinic attendees. The two response surface methodologies to be studied will be the central composite face-centered and Box-Behnken designs. The purpose of this study will be to show that the results obtained under research objective two are not design-specific and can thus be reproduced using a different response surface model (Sibanda & Pretorius 2013).

Objective Four

The fourth objective of this research will attempt to validate the response surface methodology results through the use of a binary logistic regression model. This aspect of our research was brought about by recommendations from epidemiologists that the S-shaped logistic function is most favored for the study of HIV risk amongst antenatal clinic attendees in South Africa. Furthermore, binary logistic regression models are the models of choice in the study of binary categorical data. This step is important, as design of experiment methodologies are not usually used for epidemiological modeling (Sibanda & Pretorius 2012b).

Objective Five

This aspect of our research will be focused on writing a review scientific report on the application of artificial neural networks to the study of HIV/AIDS. Traditionally, neural networks have been applied to a broad range of fields such as data mining, engineering and biology. In recent years, neural networks have found application in data mining projects for the purposes of prediction, classification, knowledge discovery, response modeling and time series analysis. In this work, an attempt will be made to highlight cutting-edge scientific research that used artificial neural networks to study HIV/AIDS. The review will cast the spotlight on this research as it pertains to human behavior, diagnostic, vaccine and biomedical research (Sibanda & Pretorius 2012c).

Objective Six

Objective six of this research will attempt the novel application of multilayer perceptron (MLP) neural networks to further study the effect of demographic characteristics on the risk of acquiring an HIV infection amongst antenatal clinic attendees in South Africa (Sibanda & Pretorius 2011b).

Objective Seven

This part of our research will involve the application of receiver operating characteristic (ROC) curves to compare the classification accuracy of the modeling methodologies used in this project, namely design of experiments, logistic regression, neural networks and decision trees. It is imperative to be able to use a scientifically sound technique to compare the performance of the different classifiers.

Objective Eight

To complete this study, a scorecard design was also employed to validate the results from the logistic regression, neural networks and decision trees. Scorecard design is generally a method used in the insurance industry to score credit applicants. It is therefore a technique for assessing the relative risk of providing credit to an applicant. For the purposes of this research, a table will be developed comprising a set of demographic characteristics, where each characteristic consists of various attributes, each assigned a number of points. The points will then be summed and compared to a decision threshold to determine the relative risk of each characteristic. The advantage of the scorecard is the ease with which the information can be interpreted. In addition, the risk factors and the corresponding bins are easy to interpret and are based on expert knowledge. Scorecards can also be made predictive by using logistic regression to combine the risk factors into a predictive scorecard (Viane et al., 2002).
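The points-summing mechanism described for Objective Eight can be sketched as follows. The characteristics, attribute bins, point values and decision threshold below are all hypothetical; a real scorecard would derive its points from logistic regression coefficients and expert knowledge:

```python
# Hypothetical scorecard: bins and point values are invented for illustration.
scorecard = {
    "age": {"15-24": 30, "25-34": 20, "35+": 10},
    "education": {"none": 25, "primary": 15, "secondary+": 5},
}

def score(record, card):
    """Sum the points of the attribute bin each characteristic falls into."""
    return sum(card[ch][attr] for ch, attr in record.items())

person = {"age": "15-24", "education": "primary"}
total = score(person, scorecard)
threshold = 40                      # hypothetical decision threshold
print(total, total >= threshold)    # 45 True
```

Because each attribute carries an explicit point value, the contribution of every characteristic to the final score can be read directly from the table, which is the interpretability advantage noted above.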

1.4. Design of Experiments

Introduction

Design of experiments was invented at Rothamsted Experimental Station in the 1920s. Although the experimental design method was first used in an agricultural context, it has found successful applications in the military, commerce and other industries. The fundamental principles of design of experiments are solutions to the problems in experimentation posed by the two types of nuisance factors, and serve to improve the efficiency of experiments. These fundamental principles are randomization, replication, blocking, orthogonality and factorial experimentation. Randomization is a method that protects against unknown biases distorting the results of the experiment. Orthogonality in an experiment results in the factor effects being uncorrelated and therefore more easily interpreted; the factors in an orthogonal experimental design are varied independently of each other. Factorial experimentation is a method in which the effects due to each factor, and to combinations of factors, are estimated. Factorial designs are geometrically constructed and vary all the factors simultaneously and orthogonally. The main uses of design of experiments are screening many factors, discovering interactions among factors, and optimizing a process.

Selection of a Design of Experiment (DOE)

The choice of a DOE depends on the aims of the investigation and the number of variables involved.

Experimental design objectives

Comparative objective

This approach is tailor-made for an experiment characterized by multiple variables, with the sole purpose of inferring the importance of one variable in the presence of other variables. The overriding objective is to ascertain whether a given variable is important or not. The randomized block design is a typical example of a comparative design.
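The factorial estimation of main and interaction effects described above can be sketched for the smallest case, a two-level design in two factors. The four response values are invented for illustration; in a real factorial analysis each run's response would come from the experiment itself:

```python
import itertools

# 2^2 full factorial in coded units (-1, +1) for two hypothetical factors.
runs = list(itertools.product([-1, 1], repeat=2))
# Invented responses for the four runs.
y = {(-1, -1): 10.0, (1, -1): 14.0, (-1, 1): 11.0, (1, 1): 19.0}

def effect(contrast):
    """Average response where the contrast is +1 minus average where it is -1."""
    hi = [y[r] for r in runs if contrast(r) == 1]
    lo = [y[r] for r in runs if contrast(r) == -1]
    return sum(hi) / len(hi) - sum(lo) / len(lo)

main_A = effect(lambda r: r[0])            # main effect of factor A
main_B = effect(lambda r: r[1])            # main effect of factor B
inter_AB = effect(lambda r: r[0] * r[1])   # A x B interaction effect
print(main_A, main_B, inter_AB)            # 6.0 3.0 2.0
```

Because the columns of coded levels are orthogonal, each effect is estimated independently of the others, which is exactly the interpretability benefit of orthogonality noted above.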

Screening objective

The aim of this approach is to identify the cardinal effects from among a large number of insignificant or unimportant effects. These designs are also called main effects designs. Typical examples of screening designs are the Plackett-Burman, full factorial and fractional factorial designs.

Plackett-Burman designs

These designs were developed by R.L. Plackett and J.P. Burman in 1946. The goal of these experiments is to determine the dependence of a response variable on a number of independent variables. In these designs, interactions between factors are considered negligible.

Factorial Designs

A full factorial design is an experiment made up of two or more variables that enables the investigation of the effects of the inputs on a response, as well as facilitating an understanding of the interactional effects of the inputs on a selected response.

Response Surface objective

These experiments are developed to investigate the possible synergistic interplay between variables. This provides insight into the local curvature of the response surface under investigation. Typical examples of response surface methodologies are the central composite and Box-Behnken designs.

Central Composite Designs (CCDs)

These designs, also called Box-Wilson CCDs, comprise a factorial design augmented with center points. In addition, these designs possess star points to facilitate the investigation of the curvature of the response surface. There are fundamentally three types of CCDs.

Central Composite Circumscribed (CCC)

These designs are characterized by circular or spherical symmetry. The designs require five levels for each input.

Central Composite Inscribed (CCI)

The main characteristic of these designs is that the specified cut-off points are true limits. Like the CCC designs, these designs also require five levels for each input.

Central Composite Face-centered (CCF)

For the CCF designs, the star points are positioned at the center of each face of the factorial space. In contrast to the CCC and CCI designs, these designs require only three levels for each factor.

Box-Behnken Designs

These independent quadratic designs do not contain an embedded factorial design. Box-Behnken designs are rotatable and require three levels of each factor, but have limited capability for orthogonal blocking compared to the central composite designs.

Regression Model

This is the use of a regression technique to model a response as a mathematical function of the factors.

Artificial Neural Networks

Introduction

The initial research on perceptrons was conducted by Frank Rosenblatt in 1958. The early perceptrons comprised three layers: the input layer, the middle layer, whose function was to combine the inputs with weights via a threshold function, and finally the output layer.

Classification of neural networks (NN)

Functionally, neural networks are classified into two broad categories, namely feed-forward and recurrent networks, as shown in Fig. 2.1. This classification is based on the training regime of the NN. Examples of feed-forward networks are single-layer, multi-layer and radial basis neural networks. Typical examples of recurrent NNs are competitive and Hopfield networks.

Fig. 2.1: Classification of neural network architectures

Multilayer neural networks have found increasing application in numerous scientific research areas. In some instances, neural networks have been found to be as robust as traditional statistical techniques. However, unlike traditional statistical methods, the multilayer perceptron does not make prior assumptions about the data distribution. The advantages of neural networks include their ability to model highly non-linear functions and to generalize accurately to previously unseen data.

The Multilayer Perceptron (MLP) Model

The MLP is made up of a network of interconnected neurons (Fig. 2.2). The neurons are connected by weights, and the output signal of a neuron is generated from the sum of the inputs to the neuron.

Fig. 2.2: A schematic representation of an MLP with three layers

The most widely used activation function for the multilayer perceptron is the logistic function (Fig. 2.3):

P(t) = 1 / (1 + e^(-t))   (2.1)

where the variable P stands for the population, e is Euler's number and the variable t is the time.

Fig. 2.3: Logistic transfer function

The outcome produced by a neuron is multiplied by the respective weight and fed forward to become input to the neurons in the next layer of the network; this is why MLPs are referred to as feed-forward NNs. There are many variations of the multilayer perceptron, mostly characterised by the number of layers and the number of neurons within each layer. Research has shown that an appropriate choice of connecting weights and transfer functions is important. Multilayer perceptrons learn through training, and for training to occur a training dataset must be generated. There are two types of training techniques for multilayer perceptrons, namely supervised and unsupervised training.

Types of Neural Network Training

Supervised Training

In this type of training, the MLP is supplied with a dataset as well as the expected output for each record in the dataset. This is the most widely used training regime for neural networks. The MLP undergoes a series of epochs until the resulting output closely matches the expected output, with a very low rate of error.
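The feed-forward computation described above (each neuron sums its weighted inputs, passes the sum through the logistic activation, and feeds the result forward to the next layer) can be sketched for a tiny network. The layer sizes and weight values below are hypothetical:

```python
import math

def logistic(x):
    """Logistic activation: maps any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, w_hidden, w_output):
    """One feed-forward pass: hidden outputs become inputs to the output layer."""
    hidden = [logistic(sum(w * x for w, x in zip(ws, inputs)))
              for ws in w_hidden]
    return [logistic(sum(w * h for w, h in zip(ws, hidden)))
            for ws in w_output]

# Hypothetical weights for a 2-3-1 network (bias terms omitted for brevity).
w_hidden = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
w_output = [[0.6, -0.1, 0.9]]
out = forward([1.0, 0.0], w_hidden, w_output)
print(out)  # a single output, bounded in (0, 1) by the logistic activation
```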

Unsupervised Training

In this type of training, on the other hand, the MLP is not provided with expected outputs. Unsupervised training is mostly used in situations where the NN is designed to place inputs into several groups. Just like supervised training, the training entails numerous iterations. The classification groups are revealed as the neural network trains.

Training a Multilayer Perceptron using the Back-propagation Algorithm

As discussed above, the training of a multilayer perceptron involves the modification and adjustment of weights. Progressively changing the weights and plotting the corresponding error generates an error surface (Fig. 2.4). The central aim of training a NN is to obtain the combination of weights resulting in the least error; the back-propagation training algorithm uses the gradient descent technique to seek the least possible error, although this is not always attainable.

Fig. 2.4: A three-dimensional error plot

The back-propagation algorithm

There are fundamentally two implementation methodologies for the back-propagation algorithm. The first is so-called on-line training, characterized by the modification of network weights following the presentation of each pattern. The other method is the batch training approach, which involves the summation of the errors of all patterns. In practice, numerous training iterations (sometimes thousands) are needed before an acceptable level of error is attained. As a rule of thumb, training should ideally be terminated when the neural network achieves maximum performance on independent test data; however, that might not coincide with the minimum network error.

A. Initialization of network weights
B. Presentation of an input from the training data to the network
C. Propagation of the input vector through the network
D. Calculation of the error signal by comparing the actual output to the desired output
E. Propagation of the error back through the network
F. Adjustment of the weights to minimize the error
G. Repeat steps B-F until the error is satisfactory

Fig. 2.5: The back-propagation algorithm

Validating Neural Network Training

Validating a neural network is important because it allows one to determine whether more training is required. In order to validate a NN, a validation dataset is required.

Determining the Number of Hidden Layers

It has been shown that multilayer perceptrons with only one hidden layer are universal approximators (Hornik et al., 1989). More hidden layers can make the problem easier or harder.
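The on-line back-propagation steps of Fig. 2.5 can be sketched as a minimal implementation. The network size (2-2-1), learning rate and toy training set (logical OR) below are all invented for illustration:

```python
import numpy as np

# Minimal on-line back-propagation sketch; comments mark the steps of Fig. 2.5.
rng = np.random.default_rng(1)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([0., 1., 1., 1.])                  # toy targets: logical OR

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))

# A. Initialize network weights (bias handled as a constant extra input).
W1 = rng.normal(0.0, 0.5, (3, 2))               # input(+bias) -> hidden
W2 = rng.normal(0.0, 0.5, (3,))                 # hidden(+bias) -> output

lr = 1.0
for _ in range(5000):                           # G. repeat until error is low
    for x, t in zip(X, T):                      # B. present a training pattern
        xb = np.append(x, 1.0)
        h = sig(xb @ W1)                        # C. propagate input forward
        hb = np.append(h, 1.0)
        y = sig(hb @ W2)
        err = t - y                             # D. error: desired - actual
        d2 = err * y * (1.0 - y)                # E. propagate error backwards
        d1 = d2 * W2[:2] * h * (1.0 - h)
        W2 += lr * d2 * hb                      # F. adjust weights (gradient descent)
        W1 += lr * np.outer(xb, d1)

H = sig(np.hstack([X, np.ones((4, 1))]) @ W1)
pred = sig(np.hstack([H, np.ones((4, 1))]) @ W2)
print((pred > 0.5).astype(int))                 # network has learned the OR targets
```

Each pass over the four patterns is one epoch; as the text notes, thousands of such iterations are often needed before the error settles to an acceptable level.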

Table 2.2: Number of hidden layers

Number of Hidden Layers   Result
None                      Can only represent linear functions
1                         Can approximate any continuous function
2                         Can represent an arbitrary decision boundary using appropriate activation functions

Number of Neurons in the Hidden Layer

The determination of the number of neurons within a hidden layer is paramount in the decision of the final NN architecture. Even though the hidden layers are not directly connected to the external environment, they still greatly influence the final outcome. It is therefore very important to select carefully the number of hidden layers and the number of neurons in each hidden layer. Too few hidden neurons result in under-fitting, whilst too many result in over-fitting.

Activation Functions

The vast majority of neural networks use activation functions, which map the output of a neuron into an appropriate range. Examples of activation functions include the sigmoid function, the hyperbolic tangent and the linear function.

Sigmoid Activation Function

In general, sigmoid activation functions utilize a sigmoid function to attain the desired activation. A sigmoid curve is S-shaped, and the sigmoid function is defined as shown in equation 2.2:

f(x) = 1 / (1 + e^(-x))   (2.2)

Fig. 2.6: The Sigmoid Function

It is important to note that the sigmoid activation function only returns positive values; a neural network using the sigmoid function will therefore never output negative numbers.

Hyperbolic Tangent Activation Function

Unlike the sigmoid activation function, the hyperbolic tangent function does return values less than zero. The equation of the hyperbolic tangent activation function (tanh) is:

f(x) = (e^x - e^(-x)) / (e^x + e^(-x))   (2.3)

Fig. 2.7: The Hyperbolic Tangent Activation Function
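The contrast between the two activation ranges can be checked numerically: the sigmoid never returns a negative value, whereas tanh ranges over (-1, 1) and goes negative for negative inputs:

```python
import math

def sigmoid(x):
    """Logistic sigmoid: outputs lie strictly in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

xs = [-3.0, -1.0, 0.0, 1.0, 3.0]
# Sigmoid outputs are all positive; tanh outputs are negative for negative inputs.
print([round(sigmoid(x), 3) for x in xs])
print([round(math.tanh(x), 3) for x in xs])
```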

Linear Activation Function

This function is in reality not an activation function, and it is probably the least commonly used; it does not modify a pattern before passing it on. The equation for the linear activation function is:

f(x) = x   (2.4)

This activation function can be useful in applications where the output must span the whole range of numbers.

Fig. 2.8: The linear activation function

Logistic Regression

Introduction

Binary Data

For each observation i, the response Y_i can take only two values, coded 0 and 1. For this research the coded values stand for HIV positive (1) and HIV negative (0). Assuming p_i is the success probability for observation i, Y_i has a Bernoulli distribution.
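This 0/1 coding can be illustrated with a quick simulation (the success probability used here is invented): each observation is a Bernoulli trial, and summing n independent Bernoulli trials with a common p yields a Binomial(n, p) count, so a Binomial with n = 1 is exactly a Bernoulli:

```python
import random

# Sketch of the response coding: 1 = HIV positive, 0 = HIV negative.
# The success probability p is invented for illustration.
random.seed(0)
p = 0.3
y = [1 if random.random() < p else 0 for _ in range(10000)]

count = sum(y)            # one realisation of a Binomial(10000, 0.3) count
print(count / len(y))     # observed proportion, close to p
```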

Binomial Data

Each observation i is a count of r_i successes out of n_i trials. Assuming p_i is the success probability for observation i, r_i has a Binomial distribution, r_i ~ B(n_i, p_i). A binomial distribution with n_i = 1 is a Bernoulli distribution.

Models for Binomial and Binary Data

An important approach to analysing binary response data is the use of a statistical model to describe the relationship between the response and the input variables. This approach is equally applicable to data from experimental studies, where individual experimental units have been randomized to a number of treatment groups, and to observational studies, where individuals have been sampled from some conceptual population by random sampling.

Statistical Modelling

At the centre of any modelling exercise is the need to develop a mathematical representation of the inherent relationship between a response variable and a number of input variables.

Uses of Statistical Modelling

(i) To investigate the possible relationship between a given response and a number of variables,
(ii) to study the pattern of any relationship between a particular response and the variables,
(iii) to motivate the study of the underlying reasons for any model structure,
(iv) to estimate in what way the response would change if certain explanatory variables change.

Methods of Estimation

The process of fitting a model to a dataset involves the determination of the unknown parameters in the model.
The two widely used methods of fitting linear models are the least squares and maximum likelihood approaches.

The Method of Least Squares

There are two reasons for the use of the method of least squares: (i) it minimizes the difference between the observations and their expected values; (ii) the parameter estimates and their derived quantities, such as fitted values, have a number of optimality properties, such as being unbiased and having minimum variance when compared with all other unbiased linear estimators. Moreover, if the data are assumed to have a normal distribution, the residual sum of squares on fitting a linear model has a chi-square distribution. This is the basis for the use of F-tests to examine the significance of a regression or to compare two models.

The Method of Maximum Likelihood

While the method of least squares is usually adopted for fitting linear regression models, the maximum likelihood method is most frequently used for models of binary data. This method is based on the construction of the likelihood of the unknown parameters in the model, given the sample data.

Transformation of Binomial Response Data

This involves the transformation of the probability scale from the range (0, 1) to (-∞, +∞). Such transformations include the following.

The Logistic Transformation

The logistic transformation of p is log{p / (1 - p)}, written logit(p). The quantity p / (1 - p) is the odds of a success, and so the logistic transformation of p is the log odds of a success. The function logit(p) is a sigmoid curve that is symmetric about p = 0.5.

Fig. 2.9: The logit and probit transformations of p, as functions of p.
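The two defining properties of the logit transformation can be verified numerically: it sends p = 0.5 to zero, and it is antisymmetric about 0.5, i.e. logit(1 - p) = -logit(p):

```python
import math

def logit(p):
    """Logistic transformation: the log odds of a success."""
    return math.log(p / (1.0 - p))

# logit maps (0, 1) onto the whole real line and is zero at p = 0.5.
print(logit(0.5))
# Equal magnitude, opposite sign, on either side of p = 0.5.
print(round(logit(0.8), 4), round(logit(0.2), 4))
```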

The Probit Transformation

The probit function is symmetric in p, and for any value of p in the range (0, 1), the corresponding value of probit(p) lies between -∞ and +∞. When p = 0.5, probit(p) = 0. The probit transformation of p has the same general form as the logistic transformation.

Advantages of the Logit over the Probit Transformation

There are three reasons why the logit transformation is preferred to the probit transformation:

1. It has a direct interpretation in terms of the logarithm of the odds of a success. This interpretation is particularly important in the analysis of data from epidemiological studies.
2. Models based on the logistic transformation are particularly appropriate for the analysis of data that have been collected retrospectively.
3. Binary data can be summarized in terms of quantities called sufficient statistics when the logistic transformation is used.

It is for the above reasons that the logistic transformation is going to be used in this study.

Goodness-of-fit of a Logistic Regression

Following the successful fitting of the model to a given dataset, the next step is to compare the predicted values to the observed values; if there is good agreement, then the model is considered acceptable. The measure of model adequacy is termed goodness-of-fit, which is described in terms of the deviance, Pearson's chi-square statistic, the Hosmer-Lemeshow statistic and analogues of the R^2 statistic.

The Deviance Statistic

The D-statistic, often called the deviance, measures the extent to which a current model (with maximized likelihood L_c) deviates from the full model (with maximized likelihood L_f). The full model is not useful in its own right, since it does not provide a simpler summary of the data than the individual observations themselves. However, by comparing L_c with L_f, the extent to which the current model adequately represents the data can be judged.
To compare L_c and L_f, it is convenient to use minus twice the logarithm of the ratio of these maximized likelihoods, to give

D = -2 log(L_c / L_f) = -2{log L_c - log L_f}   (2.5)

Large values of D will be encountered when L_c is small relative to L_f, indicating that the current model is poor.

Pearson's Chi-square Statistic

One of the most popular alternatives to the deviance is Pearson's chi-square statistic, defined by

X^2 = Σ_{i=1}^{n} (O_i - E_i)^2 / E_i   (2.6)

where X^2 is Pearson's cumulative test statistic, which asymptotically approaches a χ^2 distribution; O_i is an observed frequency; E_i is the expected (theoretical) frequency asserted by the null hypothesis; and n is the number of cells in the table. The deviance and Pearson's chi-square statistic have the same asymptotic chi-square distribution when the model is correctly fitted. The numerical values of the two statistics will generally differ, but the difference will seldom be of practical importance. Since the maximum likelihood estimates of the success probabilities maximize the likelihood function for the current model, the deviance is the goodness-of-fit statistic that is minimized by these estimates. On that basis, it is more appropriate to use the deviance than Pearson's chi-square statistic as a measure of goodness-of-fit when linear logistic models are fitted.

The Hosmer-Lemeshow Statistic

In contrast to the deviance, the Hosmer-Lemeshow statistic is a measure of the goodness-of-fit of a model that can be used when modelling ungrouped binary data. Indeed, if the data are recorded in grouped form, they must be ungrouped before this statistic can be evaluated.

Strategy for Model Selection

Ideally, the process of modelling should lead to the identification of the input factors to be included in the final statistical model for a given binary response dataset. The model selection strategy depends on the underlying purpose of the study. In this current study, the aim is to determine which of the many demographic characteristics have a significant effect on the risk of acquiring an HIV infection amongst pregnant women in South Africa.
In a nutshell, therefore, the central aim of any modelling exercise is to evaluate the dependence of the response probability on the variables of interest. When the number of potential explanatory variables, including interactions, non-linear terms and so on, is too large, it might not be feasible to fit all possible combinations of terms, paying due regard to the hierarchical principle. Models that are not hierarchical are difficult to interpret.

Model Checking

This involves verifying whether the model fitted to a given dataset is appropriate and accurate. Indeed, a thorough examination of the extent to which the fitted model provides an appropriate description of the observed data is a vital aspect of the modelling process. Measures of model checking include residual, outlier and influential observation analysis.

Residuals

The measure of agreement between an observation on a response variable and the corresponding fitted value is termed the residual. Residuals are therefore a measure of the adequacy of a fitted model.

Outliers

Observations that are surprisingly distant from the remaining observations in the sample are termed outliers. Such values may result from measurement error, i.e. an error in reading, calculating or recording a numerical value; they may be due to an execution error; or they may be an extreme manifestation of natural variability.

Influential Observations

A given observation is considered influential if its omission from the dataset results in disproportionate changes to the model under review. Although outliers may also be influential observations, an influential observation need not necessarily be an outlier.

Comparison of Models using ROC Curves

In general, a binary classification technique aims to categorise events into two broad classes, namely positive and negative. This in turn leads to four possible classifications for each event: a true positive, a true negative, a false positive or a false negative. This scenario is generally summarized in a confusion matrix (Fig. 2.10). A confusion matrix can be used to calculate various model performance measures, as shown in equations 2.7, 2.8 and 2.9.

Measure of Accuracy = (True Positive + True Negative) / (True Positive + True Negative + False Positive + False Negative)   2.7

Measure of Precision = True Positive / (True Positive + False Positive)   2.8

Measure of Recall = True Positive / (True Positive + False Negative)   2.9

                              Observed
                        Positive       Negative
Predicted   Positive    3361 (TP)      1294 (FP)
            Negative     375 (FN)      2370 (TN)

Fig. 2.10: Format of a Confusion Matrix

Based on Fig. 2.10, Accuracy = 0.77, Precision = 0.72 and Recall = 0.90.

                              Observed
                        Positive       Negative
Predicted   Positive    3361 (TP)       101 (FP)
            Negative     375 (FN)       198 (TN)

Fig. 2.11: Effect of changes in false positives and true negatives on the measures of accuracy

Based on Fig. 2.11, Accuracy = 0.88, Precision = 0.97 and Recall = 0.90.

The Basics of ROC Curves

Receiver operating characteristic (ROC) curves are graphs used to indicate the performance of a model over different threshold levels. These graphs were initially developed to determine the best operating points for signal processing apparatus. ROC graphs are drawn by plotting the true positive rate against the false positive rate. Fig. 2.12 shows the various regions covered by the ROC curve.
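As an illustration outside the SAS workflow used in this study, the three measures in equations 2.7 to 2.9 can be computed directly from the confusion-matrix counts of Figs. 2.10 and 2.11; a minimal Python sketch:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall (equations 2.7, 2.8 and 2.9)
    from the four cells of a confusion matrix."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Counts from Fig. 2.10
acc, prec, rec = confusion_metrics(tp=3361, fp=1294, fn=375, tn=2370)
print(round(acc, 2), round(prec, 2), round(rec, 2))  # 0.77 0.72 0.9

# Counts from Fig. 2.11 (fewer false positives, fewer true negatives)
acc, prec, rec = confusion_metrics(tp=3361, fp=101, fn=375, tn=198)
print(round(acc, 2), round(prec, 2), round(rec, 2))  # 0.88 0.97 0.9
```

Note that recall is unchanged between the two matrices, since the true positive and false negative counts are the same.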

[Fig. 2.12: Different regions of the ROC curve — true positive rate plotted against false positive rate, showing (a) perfect and liberal performance, (b) conservative performance, the random-performance diagonal, (c) worse-than-random performance, and the always-negative classification at the origin.]

Methods of model evaluation

The central aim of any modelling technique is to improve predictive accuracy. In the study of risk, a small improvement in predictive capability can lead to a substantial increase in benefit. The important question for an analyst is to determine whether a given model has predictive superiority over another. It is imperative for researchers who utilize predictive models for binary classification to understand the circumstances under which each evaluation method is most appropriate.

(i) Global classification rate

Table 2.3: Global classification

                          True HIV negative   True HIV positive   Total
Predicted HIV negative           x                    m           x + m
Predicted HIV positive           y                    n           y + n
Total                          x + y                m + n         (x + m) + (y + n)

The above model might have a global percentage classification rate for HIV negative of:

Global classification rate (HIV negative) = x / [(x + m) + (y + n)] × 100
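Assuming the global classification rate for HIV negative is read as the correctly predicted HIV-negative cases (cell x of Table 2.3) expressed as a percentage of all observations, it can be sketched in Python (the cell counts below are hypothetical):

```python
def global_classification_rate_negative(x, m, y, n):
    """Global classification rate for HIV negative, using the cell labels
    of Table 2.3: correctly predicted HIV-negative cases (x) as a
    percentage of all observations."""
    return 100.0 * x / ((x + m) + (y + n))

# Hypothetical cell counts for illustration
print(global_classification_rate_negative(x=60, m=10, y=20, n=110))  # 30.0
```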

The global classification rate is ideal provided the underlying costs associated with each error are known or presumed to be the same. In this regard, the model with the highest classification rate would be chosen.

(ii) Kolmogorov-Smirnov statistic (K-S test)

This is one of the methods used for evaluating predictive binary classification models, and it measures the distance between the distribution functions of two classifications. The predictive model generating the largest separability is considered to be superior. A graphical example of a K-S test is shown in Fig. 2.13.

[Fig. 2.13: K-S test — cumulative percentage of observations plotted against score cut-off for the HIV-negative and HIV-positive distributions; the greatest separation of the distributions occurs at a particular score.]

A disadvantage of the K-S test is that the methodology assumes the inherent costs of misclassification errors to be equal.

(iii) Individual misclassification

In reality, however, the costs of certain misclassifications are greater than others. A thorough understanding of the situation at hand is required in order to rank the costs of misclassifications. For this current research, a greater mistake might be a false negative, in which a pregnant

woman is told that she is not infected with HIV, resulting in the individual not being enrolled for life-saving anti-retroviral treatment (ARVs). On the other hand, a false positive verdict causes unnecessary emotional distress, as the individual is needlessly put on ARVs.

(iv) The Receiver Operating Characteristic (ROC) curve

The ROC curve plots the sensitivity of a model on the vertical axis against 1 − specificity on the horizontal axis. The area under the ROC curve (AUROC) allows for the comparison of different binary classification models. The technique is ideal in situations where there is a paucity of information on the costs of wrongly classifying events. The AUROC measure is equivalent to the Gini index, the c-statistic and the metric θ (Thomas et al., 2002).

[Fig. 2.14: ROC curve illustration — sensitivity plotted against 1 − specificity; θ = area under the curve, with 0.5 < θ < 1.0.]

(v) Area under the Receiver Operating Characteristic curve (AUROC)

This statistic is also used for model validation, with an area value of 0.5 suggesting a random model with very minimal discriminative advantage. On the other hand, an area value of 1.0 suggests a perfect model.

Choosing the Right Model

The SAS Enterprise Miner TM (SAS Inc. 2002) programme can be successfully used to generate a number of model types that include scorecards, decision trees, logistic regression and neural networks. Some of the considerations for selecting the best model include ease of application,

understanding and justification. The researcher should also consider the predictive performance of the model when selecting the best model.

Scorecards

The scorecard model is one of the traditional forms of scoring models. The scorecard is made up of a table containing characteristics with their corresponding attributes. Points are allocated for each attribute, and the points vary depending on whether the attribute is high or low risk. More points are granted to attributes that are low risk. The overall score is considered relative to a stipulated threshold number of points.

Decision Trees

It is generally believed that a decision tree has the capability to outperform a scorecard model with regard to its ability to accurately predict outcomes. This belief is based on the fact that decision trees are able to analyse interactions between attributes. In that regard, the decision tree does add value to the understanding of the risk levels of different attributes.

Neural Networks

In general, neural networks present better accuracy of prediction compared to scorecards and decision trees. The disadvantages of neural networks are that they are black boxes, and they present difficulty in attempting to explain and justify the decisions they arrive at.

Development of a Scorecard

(i) Development of sample

The input dataset comprised HIV-positive and HIV-negative individuals. The data partition node in SAS Enterprise Miner (SAS, Inc.) divided the dataset into 50% training, 25% validation and 25% test partitions. Models will be compared based on the validation data.
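A 50/25/25 partition of the kind produced by the data partition node can be sketched in plain Python (the seed and record list are illustrative, not the SAS defaults):

```python
import random

def partition(records, seed=0):
    """Randomly divide a dataset into 50% training, 25% validation and
    25% test partitions (a plain-Python analogue of the data partition
    step; the seed value is arbitrary)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    return (shuffled[: n // 2],              # 50% training
            shuffled[n // 2 : (3 * n) // 4], # 25% validation
            shuffled[(3 * n) // 4 :])        # 25% test

train, validation, testset = partition(range(100))
print(len(train), len(validation), len(testset))  # 50 25 25
```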

(ii) Classing

This is a procedure that involves placing input variables into bins. Points are assigned to individual attributes on the basis of their relative risk. The relative risk of an attribute is determined by the attribute's weight-of-evidence (WOE). On the other hand, the significance of the characteristic is determined by its coefficient in a logistic regression.

Weight of evidence = ln(Distribution of HIV Negative / Distribution of HIV Positive) for each group i of a characteristic   2.1

The classing process determines how many points an attribute is worth relative to the other attributes of the same characteristic. After classing has defined the attributes of a characteristic, the characteristic's predictive power (i.e. its ability to separate high risk from low risk) can be assessed with the Information Value (IV) measure. This will aid in the selection of attributes to be included in the scorecard. The IV is the weighted sum of the WOE of the characteristic's attributes. The sum is weighted by the difference between the proportions of HIV-negative and HIV-positive individuals in the respective attribute.

Information value = Σ (i = 1 to L) (Distr HIV Negative − Distr HIV Positive) × ln(Distribution HIV Negative / Distribution HIV Positive)   2.2

where L is the number of attributes (levels) of the characteristic.

Following the identification of the relative risks of attributes within a given demographic characteristic, a logistic regression is used to measure the demographic characteristics against each other. A number of selection methods, such as forward, backward and stepwise selection, can be used in the scorecard node to eliminate the insignificant demographic characteristics.
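Equations 2.1 and 2.2 can be sketched in Python. The attribute counts below are hypothetical; the distributions are the within-class proportions of HIV-negative and HIV-positive individuals falling in each attribute:

```python
import math

def woe_iv(neg_counts, pos_counts):
    """Weight-of-evidence per attribute (eq. 2.1) and the characteristic's
    information value (eq. 2.2). neg_counts/pos_counts hold the number of
    HIV-negative and HIV-positive individuals in each attribute (level)."""
    total_neg, total_pos = sum(neg_counts), sum(pos_counts)
    woe, iv = [], 0.0
    for neg, pos in zip(neg_counts, pos_counts):
        dist_neg = neg / total_neg   # distribution of HIV negative
        dist_pos = pos / total_pos   # distribution of HIV positive
        w = math.log(dist_neg / dist_pos)
        woe.append(w)
        iv += (dist_neg - dist_pos) * w
    return woe, iv

# Hypothetical three-level characteristic
woe, iv = woe_iv(neg_counts=[400, 300, 300], pos_counts=[100, 300, 600])
print([round(w, 2) for w in woe], round(iv, 3))  # [1.39, 0.0, -0.69] 0.624
```

A level with identical distributions in both classes gets a WOE of zero and contributes nothing to the IV.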

Table 2.4: Example of Scorecard

Characteristic          Attribute      Coded level   Scorepoints
Women's age (years)     < / >
Partner's age (years)   < / >
Education (grades)      <
Parity                  > 2            1             -

(iii) Logistic regression

Following the determination of the relative risks for the attributes, a logistic regression is used to calculate the regression coefficients, which in turn are multiplied by the WOE values of the attributes to form the basis for the score points in the scorecard. Table 2.4 shows an example of a scorecard.

(iv) Scorepoints scaling

The scaling of the scorecard points facilitates the attainment of scorepoints that are easy to interpret.

Score points = Weight of Evidence × Regression Coefficient

(v) Scorecard assessment

The SAS Enterprise Miner TM provides various charts that are used to assess the quality of the scorecard:

(a) Scorecard distribution chart – this shows which scores are most frequent and provides insight into whether or not the distribution is normal and whether there are outliers present.

(b) Kolmogorov-Smirnov (K-S) statistic

(c) Gini coefficient

(d) Area under the ROC curve (AUROC)

The K-S statistic, Gini coefficient and AUROC are used to measure the discriminatory power of the scorecard.

(vi) Model comparison

This involved the comparison of the predictive accuracy of neural networks, logistic regression and decision trees using the Model Comparison node in SAS Enterprise Miner TM (SAS Inc. 2012). The AUROC statistic was used to achieve model comparison, and the results were validated using the K-S and Gini statistics.
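The K-S statistic used to assess discriminatory power can be sketched in plain Python as the largest vertical gap between the empirical cumulative score distributions of the two groups (the scores below are hypothetical; this is a sketch outside the SAS tooling used in the study):

```python
def ks_statistic(scores_neg, scores_pos):
    """Kolmogorov-Smirnov statistic: the largest vertical distance between
    the empirical cumulative distributions of two groups of model scores,
    scanned over all candidate score cut-offs."""
    best = 0.0
    for cutoff in sorted(set(scores_neg) | set(scores_pos)):
        cdf_neg = sum(s <= cutoff for s in scores_neg) / len(scores_neg)
        cdf_pos = sum(s <= cutoff for s in scores_pos) / len(scores_pos)
        best = max(best, abs(cdf_neg - cdf_pos))
    return best

# Hypothetical model scores for the HIV-negative and HIV-positive groups
neg_scores = [0.9, 0.8, 0.7, 0.6, 0.4]
pos_scores = [0.5, 0.4, 0.3, 0.2, 0.1]
print(round(ks_statistic(neg_scores, pos_scores), 2))  # 0.8
```

A model whose two score distributions barely overlap yields a statistic near 1; fully overlapping distributions yield a statistic near 0.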

Introduction

In this chapter, the aspects of the experimental methods, planning and design, as well as the tools and procedures for the analysis, are presented and motivated. Some additional details of the different experimental methodologies are explained in chapter 4 in the context of the experimental results.

Research Outline

[Fig. 3.1: Research Study Plan — A. Data exploration and classification; B. Screening design; C. Response surface design (central composite function); D. Comparative study of two response surface methodologies (RSMs); E. Comparative study of RSM with binary logistic regression; G. Application of multilayer perceptron to model demographic characteristics; H. A review of the application of neural networks in modeling HIV/AIDS; I. Model assessment with ROC curves: validation using a scorecard design; J. Development and validation of an HIV risk scorecard; all models compared with the full regression. Additional research not included in the initial research proposal was added to add value to the research project.]

Step One: Data Exploration and Classification

As explained in chapter 2, the methodology of classification will enable the summarization of voluminous and complex datasets; facilitate the detection of relationships and structure within the dataset; allow for more efficient organization and retrieval of information; allow investigators to make predictions or discover hypotheses to account for the structure in the data; and facilitate the formulation of general hypotheses to account for the observed data.

Step Two: Screening Design

This step is undertaken when the experiment has a large number of input variables that have the capacity to influence the response. It is aimed at reducing the number of variables to include only the significant ones. In this current research project, a screening design is going to be used to rank the importance of demographic characteristics in influencing the risk of acquiring HIV infection. As stated in the introduction to this thesis, each pregnant woman attending an antenatal clinic in South Africa is described using various demographic characteristics, such as population group, level of education, age, partner's age, parity, gravidity, etc. In the literature to date, no recorded work has attempted to understand whether these demographic characteristics predispose an individual to acquiring HIV. In other words, this work is geared towards ascertaining whether or not there is a link between demographic characteristics and the risk of acquiring HIV and, if so, applying a screening design to rank the differential effects of these characteristics on the risk of acquiring the HIV infection. However, the screening design has the disadvantage of not being able to effectively characterize possible interactions between demographic characteristics (Sibanda & Pretorius 2011).

Step Three: Response Surface Methodology

As already indicated in the screening objective above, the easiest way of estimating a first-degree polynomial is to use a factorial design, and this technique is sufficient to detect the main effects. However, if it is suspected that there are interactions between explanatory variables, then a more complicated design, such as a response surface methodology, needs to be implemented to estimate a second-degree polynomial model.
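The factorial main-effects estimation just mentioned can be computed by hand: each factor's main effect is the mean response at its high (+1) level minus the mean response at its low (−1) level, and factors are ranked by absolute effect. A Python sketch with hypothetical coded runs and responses (not the SAS screening analysis used in this study):

```python
def main_effects(runs):
    """Rank factors in a two-level (coded -1/+1) screening design by main
    effect: mean response at the high level minus mean response at the
    low level, sorted by absolute magnitude."""
    effects = {}
    for factor in sorted(runs[0][0]):
        high = [y for settings, y in runs if settings[factor] == 1]
        low = [y for settings, y in runs if settings[factor] == -1]
        effects[factor] = sum(high) / len(high) - sum(low) / len(low)
    return sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Hypothetical 2x2 factorial: coded settings for two demographic factors
# and an illustrative response (e.g. an HIV-prevalence measure)
runs = [({"age": -1, "education": -1}, 20.0),
        ({"age": 1, "education": -1}, 32.0),
        ({"age": -1, "education": 1}, 18.0),
        ({"age": 1, "education": 1}, 30.0)]
print(main_effects(runs))  # [('age', 12.0), ('education', -2.0)]
```

In this toy example, age would be retained as the dominant factor, while the small effect of education might be screened out.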
In this study, a central composite face-centred design was used to estimate the coefficients of a second-order polynomial model in the four selected factors believed to influence the risk of acquiring HIV infection (Sibanda & Pretorius 2012).

Step Four: Comparison of two response surface methodologies

This step of the research will be conducted to compare results from two response surface methodologies. This is important as it confirms whether the results obtained are design-specific, and it provides a measure of repeatability by using a different RSM technique. A central composite design, as shown in step three, is the most common response surface design and is built on a factorial design. It requires five factor levels. On the other hand, Box-Behnken designs use the midpoints of the cube edges instead of the corner points, which results in fewer runs; but, unlike the central composite design, all the runs must be done even if there is no curvature. Furthermore, the Box-Behnken design uses only three factor levels, and should be used when the screening experiment indicates curvature to be significant (Sibanda & Pretorius 2013).

Step Five: Comparison of response surface methodology and binary logistic regression results

This step of the research will be conducted to compare the results of modeling the effects of demographic characteristics on the risk of acquiring HIV using a Box-Behnken design and a binary logistic regression. Logistic regression is used in epidemiology to study the relationships between a disease with two modalities (diseased or disease-free) and risk factors, which may be qualitative or quantitative variables. This step of the research is used to benchmark the performance of the design-of-experiments methodologies, as the latter techniques are not traditionally used in disease modeling (Sibanda & Pretorius 2012).

Step Six: Application of MLPs to model demographic characteristics

MLPs are feed-forward artificial neural networks comprising several layers that are fully connected to each other. MLPs employ a supervised learning technique called backpropagation. MLPs will be trained and validated on the given antenatal data, and thereafter used to predict or classify new data. Demographic characteristics will be used as input variables, while the HIV status will be the response parameter (Sibanda & Pretorius 2011).

Step Seven: A review of the application of neural networks in modeling HIV/AIDS

Neural networks are finding increasing application in various fields, ranging from the engineering sciences to the life sciences. This review aims to highlight the use of neural networks in the study of HIV/AIDS (Sibanda & Pretorius 2012).

Step Eight: Model comparison using ROC curves

Receiver operating characteristic (ROC) curves will be used to compare the classification accuracy of the models (Sibanda & Pretorius 2013).

Step Nine: Development and validation of an HIV risk scorecard

This research paper will cover the development of an HIV risk scorecard using SAS Enterprise Miner TM. The project will encompass the selection of the data sample, classing, selection of demographic characteristics, fitting of a regression model, generation of weights-of-evidence (WOE), calculation of information values (IVs), and the creation and validation of an HIV risk scorecard (Sibanda & Pretorius 2013).

Software tools

The design of experiments, neural network and logistic regression analyses in this study were carried out using SAS software, produced by the SAS Institute, Cary, NC, USA. SAS Enterprise Miner TM was used to compare the results from the three modeling methodologies, namely design of experiments, artificial neural networks and binary logistic regression.
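The AUROC statistic that underpins the ROC comparison in step eight has a convenient rank interpretation: it equals the probability that a randomly chosen positive case receives a higher risk score than a randomly chosen negative case, with ties counted as one half. A Python sketch with hypothetical scores:

```python
def auroc(pos_scores, neg_scores):
    """Area under the ROC curve via its rank interpretation: the
    probability that a randomly chosen positive case scores higher than a
    randomly chosen negative case, counting ties as one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical risk scores for positive and negative cases
print(auroc([0.9, 0.8, 0.6, 0.6], [0.7, 0.5, 0.4, 0.2]))  # 0.875
```

A value of 0.5 corresponds to a random model and 1.0 to a perfect model, matching the interpretation of the θ metric.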


Knowledge Discovery and Data Mining. Testing. Performance Measures. Notes. Lecture 15 - ROC, AUC & Lift. Tom Kelsey. Notes

Knowledge Discovery and Data Mining. Testing. Performance Measures. Notes. Lecture 15 - ROC, AUC & Lift. Tom Kelsey. Notes Knowledge Discovery and Data Mining Lecture 15 - ROC, AUC & Lift Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-17-AUC

More information

Identification of Tissue Independent Cancer Driver Genes

Identification of Tissue Independent Cancer Driver Genes Identification of Tissue Independent Cancer Driver Genes Alexandros Manolakos, Idoia Ochoa, Kartik Venkat Supervisor: Olivier Gevaert Abstract Identification of genomic patterns in tumors is an important

More information

BACKPROPOGATION NEURAL NETWORK FOR PREDICTION OF HEART DISEASE

BACKPROPOGATION NEURAL NETWORK FOR PREDICTION OF HEART DISEASE BACKPROPOGATION NEURAL NETWORK FOR PREDICTION OF HEART DISEASE NABEEL AL-MILLI Financial and Business Administration and Computer Science Department Zarqa University College Al-Balqa' Applied University

More information

Gene Selection for Tumor Classification Using Microarray Gene Expression Data

Gene Selection for Tumor Classification Using Microarray Gene Expression Data Gene Selection for Tumor Classification Using Microarray Gene Expression Data K. Yendrapalli, R. Basnet, S. Mukkamala, A. H. Sung Department of Computer Science New Mexico Institute of Mining and Technology

More information

ARTIFICIAL NEURAL NETWORKS TO DETECT RISK OF TYPE 2 DIABETES

ARTIFICIAL NEURAL NETWORKS TO DETECT RISK OF TYPE 2 DIABETES ARTIFICIAL NEURAL NETWORKS TO DETECT RISK OF TYPE DIABETES B. Y. Baha Regional Coordinator, Information Technology & Systems, Northeast Region, Mainstreet Bank, Yola E-mail: bybaha@yahoo.com and G. M.

More information

PREDICTION OF BREAST CANCER USING STACKING ENSEMBLE APPROACH

PREDICTION OF BREAST CANCER USING STACKING ENSEMBLE APPROACH PREDICTION OF BREAST CANCER USING STACKING ENSEMBLE APPROACH 1 VALLURI RISHIKA, M.TECH COMPUTER SCENCE AND SYSTEMS ENGINEERING, ANDHRA UNIVERSITY 2 A. MARY SOWJANYA, Assistant Professor COMPUTER SCENCE

More information

Biostatistics II

Biostatistics II Biostatistics II 514-5509 Course Description: Modern multivariable statistical analysis based on the concept of generalized linear models. Includes linear, logistic, and Poisson regression, survival analysis,

More information

Prediction of Malignant and Benign Tumor using Machine Learning

Prediction of Malignant and Benign Tumor using Machine Learning Prediction of Malignant and Benign Tumor using Machine Learning Ashish Shah Department of Computer Science and Engineering Manipal Institute of Technology, Manipal University, Manipal, Karnataka, India

More information

Still important ideas

Still important ideas Readings: OpenStax - Chapters 1 13 & Appendix D & E (online) Plous Chapters 17 & 18 - Chapter 17: Social Influences - Chapter 18: Group Judgments and Decisions Still important ideas Contrast the measurement

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014 Exam policy: This exam allows two one-page, two-sided cheat sheets (i.e. 4 sides); No other materials. Time: 2 hours. Be sure to write

More information

Survey research (Lecture 1) Summary & Conclusion. Lecture 10 Survey Research & Design in Psychology James Neill, 2015 Creative Commons Attribution 4.

Survey research (Lecture 1) Summary & Conclusion. Lecture 10 Survey Research & Design in Psychology James Neill, 2015 Creative Commons Attribution 4. Summary & Conclusion Lecture 10 Survey Research & Design in Psychology James Neill, 2015 Creative Commons Attribution 4.0 Overview 1. Survey research 2. Survey design 3. Descriptives & graphing 4. Correlation

More information

Survey research (Lecture 1)

Survey research (Lecture 1) Summary & Conclusion Lecture 10 Survey Research & Design in Psychology James Neill, 2015 Creative Commons Attribution 4.0 Overview 1. Survey research 2. Survey design 3. Descriptives & graphing 4. Correlation

More information

A hybrid Model to Estimate Cirrhosis Using Laboratory Testsand Multilayer Perceptron (MLP) Neural Networks

A hybrid Model to Estimate Cirrhosis Using Laboratory Testsand Multilayer Perceptron (MLP) Neural Networks IOSR Journal of Nursing and Health Science (IOSR-JNHS) e-issn: 232 1959.p- ISSN: 232 194 Volume 7, Issue 1 Ver. V. (Jan.- Feb.218), PP 32-38 www.iosrjournals.org A hybrid Model to Estimate Cirrhosis Using

More information

THE USE OF MULTIVARIATE ANALYSIS IN DEVELOPMENT THEORY: A CRITIQUE OF THE APPROACH ADOPTED BY ADELMAN AND MORRIS A. C. RAYNER

THE USE OF MULTIVARIATE ANALYSIS IN DEVELOPMENT THEORY: A CRITIQUE OF THE APPROACH ADOPTED BY ADELMAN AND MORRIS A. C. RAYNER THE USE OF MULTIVARIATE ANALYSIS IN DEVELOPMENT THEORY: A CRITIQUE OF THE APPROACH ADOPTED BY ADELMAN AND MORRIS A. C. RAYNER Introduction, 639. Factor analysis, 639. Discriminant analysis, 644. INTRODUCTION

More information

METHODS FOR DETECTING CERVICAL CANCER

METHODS FOR DETECTING CERVICAL CANCER Chapter III METHODS FOR DETECTING CERVICAL CANCER 3.1 INTRODUCTION The successful detection of cervical cancer in a variety of tissues has been reported by many researchers and baseline figures for the

More information

Chapter 17 Sensitivity Analysis and Model Validation

Chapter 17 Sensitivity Analysis and Model Validation Chapter 17 Sensitivity Analysis and Model Validation Justin D. Salciccioli, Yves Crutain, Matthieu Komorowski and Dominic C. Marshall Learning Objectives Appreciate that all models possess inherent limitations

More information

6. Unusual and Influential Data

6. Unusual and Influential Data Sociology 740 John ox Lecture Notes 6. Unusual and Influential Data Copyright 2014 by John ox Unusual and Influential Data 1 1. Introduction I Linear statistical models make strong assumptions about the

More information

Technical Specifications

Technical Specifications Technical Specifications In order to provide summary information across a set of exercises, all tests must employ some form of scoring models. The most familiar of these scoring models is the one typically

More information

Lecture Outline. Biost 517 Applied Biostatistics I. Purpose of Descriptive Statistics. Purpose of Descriptive Statistics

Lecture Outline. Biost 517 Applied Biostatistics I. Purpose of Descriptive Statistics. Purpose of Descriptive Statistics Biost 517 Applied Biostatistics I Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington Lecture 3: Overview of Descriptive Statistics October 3, 2005 Lecture Outline Purpose

More information

POST GRADUATE DIPLOMA IN BIOETHICS (PGDBE) Term-End Examination June, 2016 MHS-014 : RESEARCH METHODOLOGY

POST GRADUATE DIPLOMA IN BIOETHICS (PGDBE) Term-End Examination June, 2016 MHS-014 : RESEARCH METHODOLOGY No. of Printed Pages : 12 MHS-014 POST GRADUATE DIPLOMA IN BIOETHICS (PGDBE) Term-End Examination June, 2016 MHS-014 : RESEARCH METHODOLOGY Time : 2 hours Maximum Marks : 70 PART A Attempt all questions.

More information

Summary & Conclusion. Lecture 10 Survey Research & Design in Psychology James Neill, 2016 Creative Commons Attribution 4.0

Summary & Conclusion. Lecture 10 Survey Research & Design in Psychology James Neill, 2016 Creative Commons Attribution 4.0 Summary & Conclusion Lecture 10 Survey Research & Design in Psychology James Neill, 2016 Creative Commons Attribution 4.0 Overview 1. Survey research and design 1. Survey research 2. Survey design 2. Univariate

More information

Modelling Research Productivity Using a Generalization of the Ordered Logistic Regression Model

Modelling Research Productivity Using a Generalization of the Ordered Logistic Regression Model Modelling Research Productivity Using a Generalization of the Ordered Logistic Regression Model Delia North Temesgen Zewotir Michael Murray Abstract In South Africa, the Department of Education allocates

More information

Understandable Statistics

Understandable Statistics Understandable Statistics correlated to the Advanced Placement Program Course Description for Statistics Prepared for Alabama CC2 6/2003 2003 Understandable Statistics 2003 correlated to the Advanced Placement

More information

Statistical reports Regression, 2010

Statistical reports Regression, 2010 Statistical reports Regression, 2010 Niels Richard Hansen June 10, 2010 This document gives some guidelines on how to write a report on a statistical analysis. The document is organized into sections that

More information

Emotion Recognition using a Cauchy Naive Bayes Classifier

Emotion Recognition using a Cauchy Naive Bayes Classifier Emotion Recognition using a Cauchy Naive Bayes Classifier Abstract Recognizing human facial expression and emotion by computer is an interesting and challenging problem. In this paper we propose a method

More information

Classification of Smoking Status: The Case of Turkey

Classification of Smoking Status: The Case of Turkey Classification of Smoking Status: The Case of Turkey Zeynep D. U. Durmuşoğlu Department of Industrial Engineering Gaziantep University Gaziantep, Turkey unutmaz@gantep.edu.tr Pınar Kocabey Çiftçi Department

More information

Section 6: Analysing Relationships Between Variables

Section 6: Analysing Relationships Between Variables 6. 1 Analysing Relationships Between Variables Section 6: Analysing Relationships Between Variables Choosing a Technique The Crosstabs Procedure The Chi Square Test The Means Procedure The Correlations

More information

A Brief Introduction to Bayesian Statistics

A Brief Introduction to Bayesian Statistics A Brief Introduction to Statistics David Kaplan Department of Educational Psychology Methods for Social Policy Research and, Washington, DC 2017 1 / 37 The Reverend Thomas Bayes, 1701 1761 2 / 37 Pierre-Simon

More information

PMR5406 Redes Neurais e Lógica Fuzzy. Aula 5 Alguns Exemplos

PMR5406 Redes Neurais e Lógica Fuzzy. Aula 5 Alguns Exemplos PMR5406 Redes Neurais e Lógica Fuzzy Aula 5 Alguns Exemplos APPLICATIONS Two examples of real life applications of neural networks for pattern classification: RBF networks for face recognition FF networks

More information

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo Please note the page numbers listed for the Lind book may vary by a page or two depending on which version of the textbook you have. Readings: Lind 1 11 (with emphasis on chapters 10, 11) Please note chapter

More information

Learning Classifier Systems (LCS/XCSF)

Learning Classifier Systems (LCS/XCSF) Context-Dependent Predictions and Cognitive Arm Control with XCSF Learning Classifier Systems (LCS/XCSF) Laurentius Florentin Gruber Seminar aus Künstlicher Intelligenz WS 2015/16 Professor Johannes Fürnkranz

More information

Readings: Textbook readings: OpenStax - Chapters 1 13 (emphasis on Chapter 12) Online readings: Appendix D, E & F

Readings: Textbook readings: OpenStax - Chapters 1 13 (emphasis on Chapter 12) Online readings: Appendix D, E & F Readings: Textbook readings: OpenStax - Chapters 1 13 (emphasis on Chapter 12) Online readings: Appendix D, E & F Plous Chapters 17 & 18 Chapter 17: Social Influences Chapter 18: Group Judgments and Decisions

More information

Logistic Regression and Bayesian Approaches in Modeling Acceptance of Male Circumcision in Pune, India

Logistic Regression and Bayesian Approaches in Modeling Acceptance of Male Circumcision in Pune, India 20th International Congress on Modelling and Simulation, Adelaide, Australia, 1 6 December 2013 www.mssanz.org.au/modsim2013 Logistic Regression and Bayesian Approaches in Modeling Acceptance of Male Circumcision

More information

DIABETIC RISK PREDICTION FOR WOMEN USING BOOTSTRAP AGGREGATION ON BACK-PROPAGATION NEURAL NETWORKS

DIABETIC RISK PREDICTION FOR WOMEN USING BOOTSTRAP AGGREGATION ON BACK-PROPAGATION NEURAL NETWORKS International Journal of Computer Engineering & Technology (IJCET) Volume 9, Issue 4, July-Aug 2018, pp. 196-201, Article IJCET_09_04_021 Available online at http://www.iaeme.com/ijcet/issues.asp?jtype=ijcet&vtype=9&itype=4

More information

11/24/2017. Do not imply a cause-and-effect relationship

11/24/2017. Do not imply a cause-and-effect relationship Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are highly extraverted people less afraid of rejection

More information

How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis?

How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis? How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis? Richards J. Heuer, Jr. Version 1.2, October 16, 2005 This document is from a collection of works by Richards J. Heuer, Jr.

More information

COMPARATIVE STUDY ON FEATURE EXTRACTION METHOD FOR BREAST CANCER CLASSIFICATION

COMPARATIVE STUDY ON FEATURE EXTRACTION METHOD FOR BREAST CANCER CLASSIFICATION COMPARATIVE STUDY ON FEATURE EXTRACTION METHOD FOR BREAST CANCER CLASSIFICATION 1 R.NITHYA, 2 B.SANTHI 1 Asstt Prof., School of Computing, SASTRA University, Thanjavur, Tamilnadu, India-613402 2 Prof.,

More information

Index. E Eftekbar, B., 152, 164 Eigenvectors, 6, 171 Elastic net regression, 6 discretization, 28 regularization, 42, 44, 46 Exponential modeling, 135

Index. E Eftekbar, B., 152, 164 Eigenvectors, 6, 171 Elastic net regression, 6 discretization, 28 regularization, 42, 44, 46 Exponential modeling, 135 A Abrahamowicz, M., 100 Akaike information criterion (AIC), 141 Analysis of covariance (ANCOVA), 2 4. See also Canonical regression Analysis of variance (ANOVA) model, 2 4, 255 canonical regression (see

More information

CS 453X: Class 18. Jacob Whitehill

CS 453X: Class 18. Jacob Whitehill CS 453X: Class 18 Jacob Whitehill More on k-means Exercise: Empty clusters (1) Assume that a set of distinct data points { x (i) } are initially assigned so that none of the k clusters is empty. How can

More information

Classıfıcatıon of Dıabetes Dısease Usıng Backpropagatıon and Radıal Basıs Functıon Network

Classıfıcatıon of Dıabetes Dısease Usıng Backpropagatıon and Radıal Basıs Functıon Network UTM Computing Proceedings Innovations in Computing Technology and Applications Volume 2 Year: 2017 ISBN: 978-967-0194-95-0 1 Classıfıcatıon of Dıabetes Dısease Usıng Backpropagatıon and Radıal Basıs Functıon

More information

Computational Cognitive Neuroscience

Computational Cognitive Neuroscience Computational Cognitive Neuroscience Computational Cognitive Neuroscience Computational Cognitive Neuroscience *Computer vision, *Pattern recognition, *Classification, *Picking the relevant information

More information

A STATISTICAL PATTERN RECOGNITION PARADIGM FOR VIBRATION-BASED STRUCTURAL HEALTH MONITORING

A STATISTICAL PATTERN RECOGNITION PARADIGM FOR VIBRATION-BASED STRUCTURAL HEALTH MONITORING A STATISTICAL PATTERN RECOGNITION PARADIGM FOR VIBRATION-BASED STRUCTURAL HEALTH MONITORING HOON SOHN Postdoctoral Research Fellow ESA-EA, MS C96 Los Alamos National Laboratory Los Alamos, NM 87545 CHARLES

More information

Chapter 3: Describing Relationships

Chapter 3: Describing Relationships Chapter 3: Describing Relationships Objectives: Students will: Construct and interpret a scatterplot for a set of bivariate data. Compute and interpret the correlation, r, between two variables. Demonstrate

More information

A Practical Guide to Getting Started with Propensity Scores

A Practical Guide to Getting Started with Propensity Scores Paper 689-2017 A Practical Guide to Getting Started with Propensity Scores Thomas Gant, Keith Crowland Data & Information Management Enhancement (DIME) Kaiser Permanente ABSTRACT This paper gives tools

More information

Modern Regression Methods

Modern Regression Methods Modern Regression Methods Second Edition THOMAS P. RYAN Acworth, Georgia WILEY A JOHN WILEY & SONS, INC. PUBLICATION Contents Preface 1. Introduction 1.1 Simple Linear Regression Model, 3 1.2 Uses of Regression

More information

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1 Nested Factor Analytic Model Comparison as a Means to Detect Aberrant Response Patterns John M. Clark III Pearson Author Note John M. Clark III,

More information

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality

More information

Artificial Neural Networks and Near Infrared Spectroscopy - A case study on protein content in whole wheat grain

Artificial Neural Networks and Near Infrared Spectroscopy - A case study on protein content in whole wheat grain A White Paper from FOSS Artificial Neural Networks and Near Infrared Spectroscopy - A case study on protein content in whole wheat grain By Lars Nørgaard*, Martin Lagerholm and Mark Westerhaus, FOSS *corresponding

More information

Fundamental Clinical Trial Design

Fundamental Clinical Trial Design Design, Monitoring, and Analysis of Clinical Trials Session 1 Overview and Introduction Overview Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics, University of Washington February 17-19, 2003

More information

3. Model evaluation & selection

3. Model evaluation & selection Foundations of Machine Learning CentraleSupélec Fall 2016 3. Model evaluation & selection Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr

More information

Assignment 4: True or Quasi-Experiment

Assignment 4: True or Quasi-Experiment Assignment 4: True or Quasi-Experiment Objectives: After completing this assignment, you will be able to Evaluate when you must use an experiment to answer a research question Develop statistical hypotheses

More information