Chapter 1: Introduction


Index
1.1. Background
1.2. Problem statement
1.3. Aims and objectives

1.1. Background

HIV/AIDS is a leading health problem in the sub-Saharan African region. The need to formulate well-thought-out and effective measures to understand the dynamics of HIV/AIDS cannot be overemphasized. Seroprevalence data are HIV data collected through blood surveys conducted on expectant mothers visiting antenatal clinics throughout the Republic of South Africa. It is well known that data collected from antenatal seroprevalence surveys tend to overestimate HIV prevalence, because the information is observed from only one sector of the population, namely pregnant women. It is also known that women infected with HIV have lower pregnancy rates than uninfected women. Notwithstanding these shortcomings, antenatal seroprevalence surveys still rank highly as a reliable approach for estimating HIV prevalence in the entire adult population of a country. In South Africa, the prevalence of HIV has been used for many years to gauge the spread of the HIV pandemic. The introduction of life-saving anti-retroviral drugs (ARVs) has increased the difficulty of interpreting prevalence data, owing to changes in the survival period from infection to death. In that regard, the incidence of HIV infection (i.e. the rate at which new infections are acquired over a defined period of time) is a much more sensitive measure of the current state of the epidemic and of the impact of programs. Mathematical and statistical models are essential in enhancing our understanding of changes in the behavior of the HIV epidemic. On that basis, the aim of any mathematical and statistical modeling methodology is to extract as much useful knowledge as possible from a given database. A number of different models of HIV and AIDS have been developed, ranging from simple extrapolations of past curves to complex transmission models (UNAIDS, 2010).

1.2. Problem statement

The antenatal HIV seroprevalence data comprise the following demographic characteristics for each pregnant woman: age, partner's age, population group, level of education, gravidity, parity, marital status, province, region, HIV status and syphilis status. It is therefore clear that the seroprevalence database presents a wealth of information. Judging from the existing modeling techniques and research surveys, very little work has been done to fully understand this vast amount of data. This research will attempt to answer questions such as: what does the antenatal HIV seroprevalence database tell us, and how can this database be used to improve the interventions conducted by the government to curb the spread of the HIV pandemic? This will therefore entail using relevant statistical techniques to fully understand the database (Sibanda & Pretorius 2011). Central to this research will be the objective of understanding in detail the differential effects of the demographic characteristics of pregnant women on their risk of acquiring HIV infection, using unorthodox methodologies such as design of experiments, artificial neural networks and binary logistic regression. Design of experiments is traditionally a structured, intensive methodology used for finding solutions to problems of an engineering nature. The technique enables the formulation of sound engineering solutions. Neural networks consist of artificial neurons that process information. In most cases, a neural network is an adaptive system that changes its structure during a learning phase. In that regard, neural networks are used to model complex relationships between inputs and outputs and to find patterns in data. Neural networks have been applied to a wide range of applications such as character recognition, image compression and stock market prediction. This research will therefore attempt to use neural networks in studying the antenatal HIV seroprevalence data.
Logistic regression is a statistical methodology for inferring the outcomes of a categorical dependent variable from one or more predictor variables. The probabilities describing the possible outcomes of a single event are modeled, as a function of the explanatory variables, using a logistic function. Statistically, the categorical outcomes may be binary or ordinal, and the predictor variables may be continuous or categorical. In this research, this will involve modeling the presence or absence of HIV infection using demographic characteristics as predictor variables.
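As a rough illustration of this kind of model (the study itself uses SAS; the synthetic "age" data, the effect size and the learning rate below are all invented for exposition), a binary logistic regression of HIV status on a single demographic predictor can be fitted by maximum likelihood:

```python
import numpy as np

# Illustrative sketch only: the synthetic data and effect size are invented.
rng = np.random.default_rng(0)
age = rng.uniform(15, 45, size=500)
# Hypothetical ground truth: log odds of infection decline with age.
p_true = 1.0 / (1.0 + np.exp(-(2.0 - 0.08 * age)))
hiv = rng.binomial(1, p_true)               # binary response, coded 0/1

# Standardize the predictor and fit by maximum likelihood
# (gradient ascent on the Bernoulli log-likelihood).
z = (age - age.mean()) / age.std()
X = np.column_stack([np.ones_like(z), z])   # intercept + standardized age
beta = np.zeros(2)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    beta += 0.001 * X.T @ (hiv - p)         # gradient of the log-likelihood

print(beta)  # fitted slope should be negative, matching the simulated effect
```

The fitted coefficients are on the log-odds scale, which is what gives logistic regression its direct epidemiological interpretation.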

1.3. Aims and objectives

With the enormous amount of data presented by the annual South African antenatal HIV seroprevalence surveys, it is important to develop powerful techniques to study and understand the data in order to generate valuable knowledge for sound decision making. Descriptive and predictive data mining techniques involving detailed data characterization, classification and outlier analysis will be used. Central to this study will be the characterization of the differential effects of demographic characteristics on the risk of acquiring HIV amongst pregnant women, using unorthodox statistical methods such as design of experiments and artificial neural networks. To validate the modeling results, a binary logistic regression methodology will be used. Most epidemiologists prefer the binary logistic model for studying epidemiological data, especially where binary categorical outcomes are involved. In addition, this research further investigates the usefulness of decision trees in understanding the effects of demographic characteristics on the risk of acquiring an HIV infection. The Tree node in SAS Enterprise Miner™ is part of the SAS SEMMA (Sample, Explore, Modify, Model, Assess) data mining tools. The tree represents a segmentation of the data created by applying a series of simple rules. Each rule is applied after another, resulting in a hierarchy of segments within segments, resembling a tree. In addition to nominal, binary and ordinal targets, the tree can also be used to predict outcomes for interval targets. It has been widely reported in the scientific literature that the advantage of decision trees over other modeling methodologies, such as neural networks, is that the technique produces a model that can easily be explained. Decision trees have the added advantage of being able to treat missing data as inputs. Receiver operating characteristic (ROC) curves developed using SAS Enterprise Miner™ (SAS Institute Inc. 2012) were used to compare the classification accuracy of the different modeling methodologies. In general, ROC curves are drawn by varying the cut-off point that determines which event probabilities are considered to predict the event.

Specific objectives of this project

Objective One

This study will attempt to utilize a screening design of experiments (DOE) technique to develop a list of demographic factors, ranked from important to unimportant, that affect the spread of HIV in the South African population (Sibanda & Pretorius 2011a).

Objective Two

This research step will explore the application of response surface methodology (RSM) to study the intricate relationships between the demographic characteristics in the antenatal data and the risk of acquiring an HIV infection. RSM techniques allow for the estimation of interaction and quadratic effects (Sibanda & Pretorius 2012a).

Objective Three

The third objective will compare results from two response surface methodologies in determining the effect of demographic characteristics on the HIV status of antenatal clinic attendees. The two response surface methodologies to be studied will be the central composite face-centered and Box-Behnken designs. The purpose of this study will be to show that the results obtained under research objective two are not design-specific and can thus be reproduced using a different response surface model (Sibanda & Pretorius 2013).

Objective Four

The fourth objective of this research will attempt to validate the response surface methodology results through the use of a binary logistic regression model. This aspect of our research was brought about by recommendations from epidemiologists that the S-shaped logistic function is most favored for the study of HIV risk amongst antenatal clinic attendees in South Africa. Furthermore, binary logistic regression models are the models of choice in the study of binary categorical data. This step is important, as design of experiment methodologies are not usually used for epidemiological modeling (Sibanda & Pretorius 2012b).

Objective Five

This aspect of our research will be focused on writing a review scientific report on the application of artificial neural networks to the study of HIV/AIDS. Traditionally, neural networks have been applied to a broad range of fields such as data mining, engineering and biology. In recent years, neural networks have found application in data mining projects for the purposes of prediction, classification, knowledge discovery, response modeling and time series analysis. In this work, an attempt will be made to highlight cutting-edge scientific research that used artificial neural networks to study HIV/AIDS. The review will cast the spotlight on this research as it pertains to human behavior, diagnostic, vaccine and biomedical research (Sibanda & Pretorius 2012c).

Objective Six

Objective six of this research will attempt the novel application of multilayer perceptron (MLP) neural networks to further study the effect of demographic characteristics on the risk of acquiring an HIV infection amongst antenatal clinic attendees in South Africa (Sibanda & Pretorius 2011b).

Objective Seven

This part of our research will involve the application of receiver operating characteristic (ROC) curves to compare the classification accuracy of the modeling methodologies used in this project, namely design of experiments, logistic regression, neural networks and decision trees. It is imperative to be able to use a scientifically sound technique to compare the performance of the different classifiers.

Objective Eight

To complete this study, a scorecard design was also employed to validate the results from the logistic regression, neural networks and decision trees. Scorecard design is generally a method used in the insurance industry to score credit applicants. It is therefore a technique for assessing the relative risk of providing credit to an applicant. For the purposes of this research, a table will be developed comprising a set of demographic characteristics, where each characteristic consists of various attributes, each assigned a number of points. The points will then be summed and compared to a decision threshold to determine the relative risk of each characteristic. The advantage of the scorecard is the ease with which the information can be interpreted. In addition, the risk factors and the corresponding bins are easy to interpret and are based on expert knowledge. Scorecards can also be made predictive by using logistic regression to combine the risk factors into a predictive scorecard (Viane et al., 2002).
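The points-summing mechanism described for Objective Eight can be sketched as follows. The characteristics, attribute bins, point values and decision threshold below are all hypothetical; a real scorecard would derive its points from logistic regression coefficients and expert knowledge:

```python
# Hypothetical scorecard: bins and point values are invented for illustration.
scorecard = {
    "age": {"15-24": 30, "25-34": 20, "35+": 10},
    "education": {"none": 25, "primary": 15, "secondary+": 5},
}

def score(record, card):
    """Sum the points of the attribute bin each characteristic falls into."""
    return sum(card[ch][attr] for ch, attr in record.items())

person = {"age": "15-24", "education": "primary"}
total = score(person, scorecard)
threshold = 40                      # hypothetical decision threshold
print(total, total >= threshold)    # 45 True
```

Because each attribute carries an explicit point value, the contribution of every characteristic to the final score can be read directly from the table, which is the interpretability advantage noted above.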

1.4. Design of Experiments

Introduction

Design of experiments was invented at Rothamsted Experimental Station in the 1920s. Although the experimental design method was first used in an agricultural context, it has found successful applications in the military, commerce and other industries. The fundamental principles of design of experiments are solutions to the problems in experimentation posed by the two types of nuisance factors, and serve to improve the efficiency of experiments. These fundamental principles are randomization, replication, blocking, orthogonality and factorial experimentation. Randomization is a method that protects against unknown biases distorting the results of the experiment. Orthogonality in an experiment results in the factor effects being uncorrelated and therefore more easily interpreted; the factors in an orthogonal experimental design are varied independently of each other. Factorial experimentation is a method in which the effects due to each factor, and to combinations of factors, are estimated. Factorial designs are geometrically constructed and vary all the factors simultaneously and orthogonally. The main uses of design of experiments are screening many factors, discovering interactions among factors, and optimizing a process.

Selection of a Design of Experiment (DOE)

The choice of a DOE depends on the aims of the investigation and the number of variables involved.

Experimental design objectives

Comparative objective

This approach is tailor-made for an experiment characterized by multiple variables, with the sole purpose of inferring the importance of one variable in the presence of other variables. The overriding objective is to ascertain whether a given variable is important or not. The randomized block design is a typical example of a comparative design.
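The factorial estimation of main and interaction effects described above can be sketched for the smallest case, a two-level design in two factors. The four response values are invented for illustration; in a real factorial analysis each run's response would come from the experiment itself:

```python
import itertools

# 2^2 full factorial in coded units (-1, +1) for two hypothetical factors.
runs = list(itertools.product([-1, 1], repeat=2))
# Invented responses for the four runs.
y = {(-1, -1): 10.0, (1, -1): 14.0, (-1, 1): 11.0, (1, 1): 19.0}

def effect(contrast):
    """Average response where the contrast is +1 minus average where it is -1."""
    hi = [y[r] for r in runs if contrast(r) == 1]
    lo = [y[r] for r in runs if contrast(r) == -1]
    return sum(hi) / len(hi) - sum(lo) / len(lo)

main_A = effect(lambda r: r[0])            # main effect of factor A
main_B = effect(lambda r: r[1])            # main effect of factor B
inter_AB = effect(lambda r: r[0] * r[1])   # A x B interaction effect
print(main_A, main_B, inter_AB)            # 6.0 3.0 2.0
```

Because the columns of coded levels are orthogonal, each effect is estimated independently of the others, which is exactly the interpretability benefit of orthogonality noted above.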

Screening objective

The aim of this approach is to identify the cardinal effects from among a large number of insignificant or unimportant effects. These designs are also called main effects designs. Typical examples of screening designs are the Plackett-Burman, full factorial and fractional factorial designs.

Plackett-Burman designs

These designs were developed by R.L. Plackett and J.P. Burman in 1946. The goal of these experiments is to determine the dependence of a response variable on a number of independent variables. In these designs, interactions between factors are considered negligible.

Factorial Designs

A full factorial design is an experiment made up of two or more variables that enables the investigation of the effects of the inputs on a response, as well as facilitating an understanding of the interactional effects of the inputs on a selected response.

Response Surface objective

These experiments are developed to investigate the possible synergistic interplay between variables. This provides insight into the local curvature of the response surface under investigation. Typical examples of response surface methodologies are the central composite and Box-Behnken designs.

Central Composite Designs (CCDs)

These designs, also called Box-Wilson CCDs, comprise a factorial design augmented with center points. In addition, these designs possess star points to facilitate the investigation of the curvature of the response surface. There are fundamentally three types of CCDs.

Central Composite Circumscribed (CCC)

These designs are characterized by circular or spherical symmetry. The designs require five levels for each input.

Central Composite Inscribed (CCI)

The main characteristic of these designs is that the specified cut-off points are true limits. Like the CCC designs, these designs also require five levels for each input.

Central Composite Face-centered (CCF)

For the CCF designs, the star points are positioned at the center of each face of the factorial space. In contrast to the CCC and CCI designs, these designs require only three levels for each factor.

Box-Behnken Designs

These independent quadratic designs do not contain an embedded factorial design. Box-Behnken designs are rotatable and require three levels of each factor, but have limited capability for orthogonal blocking compared to the central composite designs.

Regression Model

This is the use of a regression technique to model a response as a mathematical function of the factors.

Artificial Neural Networks

Introduction

The initial research on perceptrons was conducted by Frank Rosenblatt in 1958. The early perceptrons comprised three layers: the input layer, the middle layer, whose function was to combine the inputs with weights via a threshold function, and finally the output layer.

Classification of neural networks (NN)

Functionally, neural networks are classified into two broad categories, namely feed-forward and recurrent networks, as shown in Fig. 2.1. This classification is based on the training regime of the NN. Examples of feed-forward networks are single-layer, multi-layer and radial basis neural networks. Typical examples of recurrent NNs are competitive and Hopfield networks.

Fig. 2.1: Classification of neural network architectures

Multilayer neural networks have found increasing application in numerous scientific research areas. In some instances, neural networks have been found to be as robust as traditional statistical techniques. However, unlike traditional statistical methods, the multilayer perceptron does not make prior assumptions about the data distribution. The advantages of neural networks include their ability to model highly non-linear functions and to generalize accurately to previously unseen data.

The Multilayer Perceptron (MLP) Model

The MLP is made up of a network of interconnected neurons (Fig. 2.2). The neurons are connected by weights, and the output signal of a neuron is generated from the sum of the inputs to the neuron.

Fig. 2.2: A schematic representation of an MLP with three layers

The most widely used activation function for the multilayer perceptron is the logistic function (Fig. 2.3):

P(t) = 1 / (1 + e^(-t))   (2.1)

where the variable P stands for the population, e is Euler's number and the variable t is the time.

Fig. 2.3: Logistic transfer function

The outcome produced by a neuron is multiplied by the respective weight and fed forward to become input to the neurons in the next layer of the network; this is why MLPs are referred to as feed-forward NNs. There are many variations of the multilayer perceptron, mostly characterised by the number of layers and the number of neurons within each layer. Research has shown that an appropriate choice of connecting weights and transfer functions is important. Multilayer perceptrons learn through training, and for training to occur a training dataset must be generated. There are two types of training techniques for multilayer perceptrons, namely supervised and unsupervised training.

Types of Neural Network Training

Supervised Training

In this type of training, the MLP is supplied with a dataset as well as the expected output for each record in the dataset. This is the most widely used training regime for neural networks. The MLP undergoes a series of epochs until the resulting output closely matches the expected output, with a very low rate of error.
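The feed-forward computation described above (each neuron sums its weighted inputs, passes the sum through the logistic activation, and feeds the result forward to the next layer) can be sketched for a tiny network. The layer sizes and weight values below are hypothetical:

```python
import math

def logistic(x):
    """Logistic activation: maps any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, w_hidden, w_output):
    """One feed-forward pass: hidden outputs become inputs to the output layer."""
    hidden = [logistic(sum(w * x for w, x in zip(ws, inputs)))
              for ws in w_hidden]
    return [logistic(sum(w * h for w, h in zip(ws, hidden)))
            for ws in w_output]

# Hypothetical weights for a 2-3-1 network (bias terms omitted for brevity).
w_hidden = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
w_output = [[0.6, -0.1, 0.9]]
out = forward([1.0, 0.0], w_hidden, w_output)
print(out)  # a single output, bounded in (0, 1) by the logistic activation
```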

Unsupervised Training

In this type of training, on the other hand, the MLP is not provided with expected outputs. Unsupervised training is mostly used in situations where the NN is designed to place inputs into several groups. Just like supervised training, the training entails numerous iterations. The classification groups are revealed as the neural network trains.

Training a Multilayer Perceptron using the Back-propagation Algorithm

As discussed above, the training of a multilayer perceptron involves the modification and adjustment of weights. Progressively changing the weights and plotting the corresponding error generates an error surface (Fig. 2.4). The central aim of training a NN is to obtain the combination of weights resulting in the least error; the back-propagation training algorithm uses the gradient descent technique to seek the least possible error, although this is not always attainable.

Fig. 2.4: A three-dimensional error plot

The back-propagation algorithm

There are fundamentally two implementation methodologies for the back-propagation algorithm. The first is so-called on-line training, characterized by the modification of network weights following the presentation of each pattern. The other method is the batch training approach, which involves the summation of the errors of all patterns. In practice, numerous training iterations (sometimes thousands) are needed before an acceptable level of error is attained. As a rule of thumb, training should ideally be terminated when the neural network achieves maximum performance on independent test data; however, that might not coincide with the minimum network error.

A. Initialization of network weights
B. Presentation of an input from the training data to the network
C. Propagation of the input vector through the network
D. Calculation of the error signal by comparing the actual output to the desired output
E. Propagation of the error back through the network
F. Adjustment of the weights to minimize the error
G. Repeat steps B-F until the error is satisfactory

Fig. 2.5: The back-propagation algorithm

Validating Neural Network Training

Validating a neural network is important because it allows one to determine whether more training is required. In order to validate a NN, a validation dataset is required.

Determining the Number of Hidden Layers

It has been shown that multilayer perceptrons with only one hidden layer are universal approximators (Hornik et al., 1989). More hidden layers can make the problem easier or harder.
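The on-line back-propagation steps of Fig. 2.5 can be sketched as a minimal implementation. The network size (2-2-1), learning rate and toy training set (logical OR) below are all invented for illustration:

```python
import numpy as np

# Minimal on-line back-propagation sketch; comments mark the steps of Fig. 2.5.
rng = np.random.default_rng(1)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([0., 1., 1., 1.])                  # toy targets: logical OR

def sig(z):
    return 1.0 / (1.0 + np.exp(-z))

# A. Initialize network weights (bias handled as a constant extra input).
W1 = rng.normal(0.0, 0.5, (3, 2))               # input(+bias) -> hidden
W2 = rng.normal(0.0, 0.5, (3,))                 # hidden(+bias) -> output

lr = 1.0
for _ in range(5000):                           # G. repeat until error is low
    for x, t in zip(X, T):                      # B. present a training pattern
        xb = np.append(x, 1.0)
        h = sig(xb @ W1)                        # C. propagate input forward
        hb = np.append(h, 1.0)
        y = sig(hb @ W2)
        err = t - y                             # D. error: desired - actual
        d2 = err * y * (1.0 - y)                # E. propagate error backwards
        d1 = d2 * W2[:2] * h * (1.0 - h)
        W2 += lr * d2 * hb                      # F. adjust weights (gradient descent)
        W1 += lr * np.outer(xb, d1)

H = sig(np.hstack([X, np.ones((4, 1))]) @ W1)
pred = sig(np.hstack([H, np.ones((4, 1))]) @ W2)
print((pred > 0.5).astype(int))                 # network has learned the OR targets
```

Each pass over the four patterns is one epoch; as the text notes, thousands of such iterations are often needed before the error settles to an acceptable level.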

Table 2.2: Number of hidden layers

Number of Hidden Layers   Result
None                      Can only represent linear functions
1                         Can approximate any continuous function
2                         Can represent an arbitrary decision boundary using appropriate activation functions

Number of Neurons in the Hidden Layer

The determination of the number of neurons within a hidden layer is paramount in the decision of the final NN architecture. Even though the hidden layers are not directly connected to the external environment, they still greatly influence the final outcome. It is therefore very important to select carefully the number of hidden layers and the number of neurons in each hidden layer. Too few hidden neurons result in under-fitting, whilst too many result in over-fitting.

Activation Functions

The vast majority of neural networks use activation functions, which map the output of a neuron into an appropriate range. Examples of activation functions include the sigmoid function, the hyperbolic tangent and the linear function.

Sigmoid Activation Function

In general, sigmoid activation functions utilize a sigmoid function to attain the desired activation. A sigmoid curve is S-shaped, and the sigmoid function is defined as shown in equation 2.2:

f(x) = 1 / (1 + e^(-x))   (2.2)

Fig. 2.6: The Sigmoid Function

It is important to note that the sigmoid activation function only returns positive values; a neural network using the sigmoid function will therefore never output negative numbers.

Hyperbolic Tangent Activation Function

Unlike the sigmoid activation function, the hyperbolic tangent function does return values less than zero. The equation of the hyperbolic tangent activation function (tanh) is:

f(x) = (e^x - e^(-x)) / (e^x + e^(-x))   (2.3)

Fig. 2.7: The Hyperbolic Tangent Activation Function
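The contrast between the two activation ranges can be checked numerically: the sigmoid never returns a negative value, whereas tanh ranges over (-1, 1) and goes negative for negative inputs:

```python
import math

def sigmoid(x):
    """Logistic sigmoid: outputs lie strictly in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

xs = [-3.0, -1.0, 0.0, 1.0, 3.0]
# Sigmoid outputs are all positive; tanh outputs are negative for negative inputs.
print([round(sigmoid(x), 3) for x in xs])
print([round(math.tanh(x), 3) for x in xs])
```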

Linear Activation Function

This function is in reality not an activation function, and it is probably the least commonly used; it does not modify a pattern before passing it on. The equation for the linear activation function is:

f(x) = x   (2.4)

This activation function can be useful in applications where the output must span the whole range of numbers.

Fig. 2.8: The linear activation function

Logistic Regression

Introduction

Binary Data

For each observation i, the response Y_i can take only two values, coded 0 and 1. For this research the coded values stand for HIV positive (1) and HIV negative (0). Assuming p_i is the success probability for observation i, Y_i has a Bernoulli distribution.
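This 0/1 coding can be illustrated with a quick simulation (the success probability used here is invented): each observation is a Bernoulli trial, and summing n independent Bernoulli trials with a common p yields a Binomial(n, p) count, so a Binomial with n = 1 is exactly a Bernoulli:

```python
import random

# Sketch of the response coding: 1 = HIV positive, 0 = HIV negative.
# The success probability p is invented for illustration.
random.seed(0)
p = 0.3
y = [1 if random.random() < p else 0 for _ in range(10000)]

count = sum(y)            # one realisation of a Binomial(10000, 0.3) count
print(count / len(y))     # observed proportion, close to p
```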

Binomial Data

Each observation i is a count of r_i successes out of n_i trials. Assuming p_i is the success probability for observation i, r_i has a Binomial distribution, r_i ~ B(n_i, p_i). A binomial distribution with n_i = 1 is a Bernoulli distribution.

Models for Binomial and Binary Data

An important approach to analysing binary response data is the use of a statistical model to describe the relationship between the response and the input variables. This approach is equally applicable to data from experimental studies, where individual experimental units have been randomized to a number of treatment groups, and to observational studies, where individuals have been sampled from some conceptual population by random sampling.

Statistical Modelling

At the centre of any modelling exercise is the need to develop a mathematical representation of the inherent relationship between a response variable and a number of input variables.

Uses of Statistical Modelling

(i) To investigate the possible relationship between a given response and a number of variables,
(ii) to study the pattern of any relationship between a particular response and the variables,
(iii) to motivate the study of the underlying reasons for any model structure,
(iv) to estimate in what way the response would change if certain explanatory variables change.

Methods of Estimation

The process of fitting a model to a dataset involves the determination of the unknown parameters in the model.
The two widely used methods of fitting linear models are the least squares and maximum likelihood approaches.

The Method of Least Squares

There are two reasons for the use of the method of least squares: (i) it minimizes the difference between the observations and their expected values; (ii) the parameter estimates and their derived quantities, such as fitted values, have a number of optimality properties, such as being unbiased and having minimum variance when compared with all other unbiased linear estimators. Moreover, if the data are assumed to have a normal distribution, the residual sum of squares on fitting a linear model has a chi-square distribution. This is the basis for the use of F-tests to examine the significance of a regression or to compare two models.

The Method of Maximum Likelihood

While the method of least squares is usually adopted for fitting linear regression models, the maximum likelihood method is most frequently used for models of binary data. This method is based on the construction of the likelihood of the unknown parameters in the model, given the sample data.

Transformation of Binomial Response Data

This involves the transformation of the probability scale from the range (0, 1) to (-∞, +∞). Such transformations include the following.

The Logistic Transformation

The logistic transformation of p is log{p / (1 - p)}, written logit(p). The quantity p / (1 - p) is the odds of a success, and so the logistic transformation of p is the log odds of a success. The function logit(p) is a sigmoid curve that is symmetric about p = 0.5.

Fig. 2.9: The logit and probit transformations of p, as functions of p.
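The two defining properties of the logit transformation can be verified numerically: it sends p = 0.5 to zero, and it is antisymmetric about 0.5, i.e. logit(1 - p) = -logit(p):

```python
import math

def logit(p):
    """Logistic transformation: the log odds of a success."""
    return math.log(p / (1.0 - p))

# logit maps (0, 1) onto the whole real line and is zero at p = 0.5.
print(logit(0.5))
# Equal magnitude, opposite sign, on either side of p = 0.5.
print(round(logit(0.8), 4), round(logit(0.2), 4))
```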

The Probit Transformation

The probit function is symmetric in p, and for any value of p in the range (0, 1), the corresponding value of probit(p) lies between -∞ and +∞. When p = 0.5, probit(p) = 0. The probit transformation of p has the same general form as the logistic transformation.

Advantages of the Logit over the Probit Transformation

There are three reasons why the logit transformation is preferred to the probit transformation:

1. It has a direct interpretation in terms of the logarithm of the odds of a success. This interpretation is particularly important in the analysis of data from epidemiological studies.
2. Models based on the logistic transformation are particularly appropriate for the analysis of data that have been collected retrospectively.
3. Binary data can be summarized in terms of quantities called sufficient statistics when the logistic transformation is used.

It is for the above reasons that the logistic transformation is going to be used in this study.

Goodness-of-fit of a Logistic Regression

Following the successful fitting of the model to a given dataset, the next step is to compare the predicted values to the observed values; if there is good agreement, then the model is considered acceptable. The measure of model adequacy is termed goodness-of-fit, which is described in terms of the deviance, Pearson's chi-square statistic, the Hosmer-Lemeshow statistic and analogues of the R^2 statistic.

The Deviance Statistic

The D-statistic, often called the deviance, measures the extent to which a current model (with maximized likelihood L_c) deviates from the full model (with maximized likelihood L_f). The full model is not useful in its own right, since it does not provide a simpler summary of the data than the individual observations themselves. However, by comparing L_c with L_f, the extent to which the current model adequately represents the data can be judged.
To compare L_c and L_f, it is convenient to use minus twice the logarithm of the ratio of these maximized likelihoods, to give

D = -2 log(L_c / L_f) = -2{log L_c - log L_f}   (2.5)

Large values of D will be encountered when L_c is small relative to L_f, indicating that the current model is poor.

Pearson's Chi-square Statistic

One of the most popular alternatives to the deviance is Pearson's chi-square statistic, defined by

X^2 = Σ_{i=1}^{n} (O_i - E_i)^2 / E_i   (2.6)

where X^2 is Pearson's cumulative test statistic, which asymptotically approaches a χ^2 distribution; O_i is an observed frequency; E_i is the expected (theoretical) frequency asserted by the null hypothesis; and n is the number of cells in the table. The deviance and Pearson's chi-square statistic have the same asymptotic chi-square distribution when the model is correctly fitted. The numerical values of the two statistics will generally differ, but the difference will seldom be of practical importance. Since the maximum likelihood estimates of the success probabilities maximize the likelihood function for the current model, the deviance is the goodness-of-fit statistic that is minimized by these estimates. On that basis, it is more appropriate to use the deviance than Pearson's chi-square statistic as a measure of goodness-of-fit when linear logistic models are fitted.

The Hosmer-Lemeshow Statistic

In contrast to the deviance, the Hosmer-Lemeshow statistic is a measure of the goodness-of-fit of a model that can be used when modelling ungrouped binary data. Indeed, if the data are recorded in grouped form, they must be ungrouped before this statistic can be evaluated.

Strategy for Model Selection

Ideally, the process of modelling should lead to the identification of the input factors to be included in the final statistical model for a given binary response dataset. The model selection strategy depends on the underlying purpose of the study. In this current study, the aim is to determine which of the many demographic characteristics have a significant effect on the risk of acquiring an HIV infection amongst pregnant women in South Africa.
In a nutshell, therefore, the central aim of any modelling exercise is to evaluate the dependence of the response probability on the variables of interest. When the number of potential explanatory variables, including interactions, non-linear terms and so on, is too large, it might not be feasible to fit all possible combinations of terms, paying due regard to the hierarchical principle. Models that are not hierarchical are difficult to interpret.

Model Checking

This involves verifying whether the model fitted to a given dataset is appropriate and accurate. Indeed, a thorough examination of the extent to which the fitted model provides an appropriate description of the observed data is a vital aspect of the modelling process. Measures of model checking include residual, outlier and influential observation analysis.

Residuals

The measure of agreement between an observation on a response variable and the corresponding fitted value is termed the residual. Residuals are therefore a measure of the adequacy of a fitted model.

Outliers

Observations that are surprisingly distant from the remaining observations in the sample are termed outliers. Such values may result from measurement error, i.e. an error in reading, calculating or recording a numerical value; they may be due to an execution error; or they may be an extreme manifestation of natural variability.

Influential Observations

A given observation is considered influential if its omission from the dataset results in disproportionate changes to the model under review. Although outliers may also be influential observations, an influential observation need not necessarily be an outlier.

Comparison of Models using ROC Curves

In general, a binary classification technique aims to categorise events into two broad classes, namely positive and negative. This in turn leads to four possible classifications for each event: a true positive, a true negative, a false positive or a false negative. This scenario is generally summarized in a confusion matrix (Fig. 2.10). A confusion matrix can be used to calculate various model performance measures, as shown in equations 2.7, 2.8 and 2.9.

Measure of Accuracy = (True Positive + True Negative) / (True Positive + True Negative + False Positive + False Negative)   2.7

Measure of Precision = True Positive / (True Positive + False Positive)   2.8

Measure of Recall = True Positive / (True Positive + False Negative)   2.9

                              Observed
                        Positive       Negative
Predicted   Positive    3361 (TP)      1294 (FP)
            Negative     375 (FN)      2370 (TN)

Fig. 2.10: Format of a Confusion Matrix

Based on Fig. 2.10, Accuracy = 0.77, Precision = 0.72 and Recall = 0.90.

                              Observed
                        Positive       Negative
Predicted   Positive    3361 (TP)       101 (FP)
            Negative     375 (FN)       198 (TN)

Fig. 2.11: Effect of changes in false positives and true negatives on the measures of accuracy

Based on Fig. 2.11, Accuracy = 0.88, Precision = 0.97 and Recall = 0.90.

The Basics of ROC Curves

Receiver operating characteristic (ROC) curves are graphs used to indicate the performance of a model over different threshold levels. These graphs were initially developed to determine the best operating points for signal processing apparatus. ROC graphs are drawn by plotting the true positive rate against the false positive rate. Fig. 2.12 shows the various regions covered by the ROC curve.
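As an illustration outside the SAS workflow used in this study, the three measures in equations 2.7 to 2.9 can be computed directly from the confusion-matrix counts of Figs. 2.10 and 2.11; a minimal Python sketch:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall (equations 2.7, 2.8 and 2.9)
    from the four cells of a confusion matrix."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Counts from Fig. 2.10
acc, prec, rec = confusion_metrics(tp=3361, fp=1294, fn=375, tn=2370)
print(round(acc, 2), round(prec, 2), round(rec, 2))  # 0.77 0.72 0.9

# Counts from Fig. 2.11 (fewer false positives, fewer true negatives)
acc, prec, rec = confusion_metrics(tp=3361, fp=101, fn=375, tn=198)
print(round(acc, 2), round(prec, 2), round(rec, 2))  # 0.88 0.97 0.9
```

Note that recall is unchanged between the two matrices, since the true positive and false negative counts are the same.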

[Fig. 2.12: Different regions of the ROC curve — true positive rate plotted against false positive rate, showing (a) perfect and liberal performance, (b) conservative performance, the random-performance diagonal, (c) worse-than-random performance, and the always-negative classification at the origin.]

Methods of model evaluation

The central aim of any modelling technique is to improve predictive accuracy. In the study of risk, a small improvement in predictive capability can lead to a substantial increase in benefit. The important question for an analyst is to determine whether a given model has predictive superiority over another. It is imperative for researchers who utilize predictive models for binary classification to understand the circumstances under which each evaluation method is most appropriate.

(i) Global classification rate

Table 2.3: Global classification

                          True HIV negative   True HIV positive   Total
Predicted HIV negative           x                    m           x + m
Predicted HIV positive           y                    n           y + n
Total                          x + y                m + n         (x + m) + (y + n)

The above model might have a global percentage classification rate for HIV negative of:

Global classification rate (HIV negative) = x / [(x + m) + (y + n)] × 100
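Assuming the global classification rate for HIV negative is read as the correctly predicted HIV-negative cases (cell x of Table 2.3) expressed as a percentage of all observations, it can be sketched in Python (the cell counts below are hypothetical):

```python
def global_classification_rate_negative(x, m, y, n):
    """Global classification rate for HIV negative, using the cell labels
    of Table 2.3: correctly predicted HIV-negative cases (x) as a
    percentage of all observations."""
    return 100.0 * x / ((x + m) + (y + n))

# Hypothetical cell counts for illustration
print(global_classification_rate_negative(x=60, m=10, y=20, n=110))  # 30.0
```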

The global classification rate is ideal provided the underlying costs associated with each error are known or presumed to be the same. In this regard, the model with the highest classification rate would be chosen.

(ii) Kolmogorov-Smirnov statistic (K-S test)

This is one of the methods used for evaluating predictive binary classification models, and it measures the distance between the distribution functions of two classifications. The predictive model generating the largest separability is considered to be superior. A graphical example of a K-S test is shown in Fig. 2.13.

[Fig. 2.13: K-S test — cumulative percentage of observations plotted against score cut-off for the HIV-negative and HIV-positive distributions; the greatest separation of the distributions occurs at a particular score.]

A disadvantage of the K-S test is that the methodology assumes the inherent costs of misclassification errors to be equal.

(iii) Individual misclassification

In reality, however, the costs of certain misclassifications are greater than others. A thorough understanding of the situation at hand is required in order to rank the costs of misclassifications. For this current research, a greater mistake might be a false negative, in which a pregnant

woman is told that she is not infected with HIV, resulting in the individual not being enrolled for life-saving anti-retroviral treatment (ARVs). On the other hand, a false positive verdict causes unnecessary emotional distress, as the individual is needlessly put on ARVs.

(iv) The Receiver Operating Characteristic (ROC) curve

The ROC curve plots the sensitivity of a model on the vertical axis against 1 − specificity on the horizontal axis. The area under the ROC curve (AUROC) allows for the comparison of different binary classification models. The technique is ideal in situations where there is a paucity of information on the costs of wrongly classifying events. The AUROC measure is equivalent to the Gini index, the c-statistic and the metric θ (Thomas et al., 2002).

[Fig. 2.14: ROC curve illustration — sensitivity plotted against 1 − specificity; θ = area under the curve, with 0.5 < θ < 1.0.]

(v) Area under the Receiver Operating Characteristic curve (AUROC)

This statistic is also used for model validation, with an area value of 0.5 suggesting a random model with very minimal discriminative advantage. On the other hand, an area value of 1.0 suggests a perfect model.

Choosing the Right Model

The SAS Enterprise Miner TM (SAS Inc. 2002) programme can be successfully used to generate a number of model types that include scorecards, decision trees, logistic regression and neural networks. Some of the considerations for selecting the best model include ease of application,

understanding and justification. The researcher should also consider the predictive performance of the model when selecting the best model.

Scorecards

The scorecard model is one of the traditional forms of scoring models. The scorecard is made up of a table containing characteristics with their corresponding attributes. Points are allocated for each attribute, and the points vary depending on whether the attribute is high or low risk. More points are granted to attributes that are low risk. The overall score is considered relative to a stipulated threshold number of points.

Decision Trees

It is generally believed that a decision tree has the capability to outperform a scorecard model with regard to its ability to accurately predict outcomes. This belief is based on the fact that decision trees are able to analyse interactions between attributes. In that regard, the decision tree does add value to the understanding of the risk levels of different attributes.

Neural Networks

In general, neural networks present better accuracy of prediction compared to scorecards and decision trees. The disadvantages of neural networks are that they are black boxes, and they present difficulty in attempting to explain and justify the decisions they arrive at.

Development of a Scorecard

(i) Development of sample

The input dataset comprised HIV-positive and HIV-negative individuals. The data partition node in SAS Enterprise Miner (SAS, Inc.) divided the dataset into 50% training, 25% validation and 25% test partitions. Models will be compared based on the validation data.
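A 50/25/25 partition of the kind produced by the data partition node can be sketched in plain Python (the seed and record list are illustrative, not the SAS defaults):

```python
import random

def partition(records, seed=0):
    """Randomly divide a dataset into 50% training, 25% validation and
    25% test partitions (a plain-Python analogue of the data partition
    step; the seed value is arbitrary)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    return (shuffled[: n // 2],              # 50% training
            shuffled[n // 2 : (3 * n) // 4], # 25% validation
            shuffled[(3 * n) // 4 :])        # 25% test

train, validation, testset = partition(range(100))
print(len(train), len(validation), len(testset))  # 50 25 25
```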

(ii) Classing

This is a procedure that involves placing input variables into bins. Points are assigned to individual attributes on the basis of their relative risk. The relative risk of an attribute is determined by the attribute's weight-of-evidence (WOE). On the other hand, the significance of the characteristic is determined by its coefficient in a logistic regression.

Weight of evidence = ln(Distribution of HIV Negative / Distribution of HIV Positive) for each group i of a characteristic   2.1

The classing process determines how many points an attribute is worth relative to the other attributes of the same characteristic. After classing has defined the attributes of a characteristic, the characteristic's predictive power (i.e. its ability to separate high risk from low risk) can be assessed with the Information Value (IV) measure. This will aid in the selection of attributes to be included in the scorecard. The IV is the weighted sum of the WOE of the characteristic's attributes. The sum is weighted by the difference between the proportions of HIV-negative and HIV-positive individuals in the respective attribute.

Information value = Σ (i = 1 to L) (Distr HIV Negative − Distr HIV Positive) × ln(Distribution HIV Negative / Distribution HIV Positive)   2.2

where L is the number of attributes (levels) of the characteristic.

Following the identification of the relative risks of attributes within a given demographic characteristic, a logistic regression is used to measure the demographic characteristics against each other. A number of selection methods, such as forward, backward and stepwise selection, can be used in the scorecard node to eliminate the insignificant demographic characteristics.
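Equations 2.1 and 2.2 can be sketched in Python. The attribute counts below are hypothetical; the distributions are the within-class proportions of HIV-negative and HIV-positive individuals falling in each attribute:

```python
import math

def woe_iv(neg_counts, pos_counts):
    """Weight-of-evidence per attribute (eq. 2.1) and the characteristic's
    information value (eq. 2.2). neg_counts/pos_counts hold the number of
    HIV-negative and HIV-positive individuals in each attribute (level)."""
    total_neg, total_pos = sum(neg_counts), sum(pos_counts)
    woe, iv = [], 0.0
    for neg, pos in zip(neg_counts, pos_counts):
        dist_neg = neg / total_neg   # distribution of HIV negative
        dist_pos = pos / total_pos   # distribution of HIV positive
        w = math.log(dist_neg / dist_pos)
        woe.append(w)
        iv += (dist_neg - dist_pos) * w
    return woe, iv

# Hypothetical three-level characteristic
woe, iv = woe_iv(neg_counts=[400, 300, 300], pos_counts=[100, 300, 600])
print([round(w, 2) for w in woe], round(iv, 3))  # [1.39, 0.0, -0.69] 0.624
```

A level with identical distributions in both classes gets a WOE of zero and contributes nothing to the IV.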

Table 2.4: Example of Scorecard

Characteristic          Attribute      Coded level   Scorepoints
Women's age (years)     < / >
Partner's age (years)   < / >
Education (grades)      <
Parity                  > 2            1             -

(iii) Logistic regression

Following the determination of the relative risks for the attributes, a logistic regression is used to calculate the regression coefficients, which in turn are multiplied by the WOE values of the attributes to form the basis for the score points in the scorecard. Table 2.4 shows an example of a scorecard.

(iv) Scorepoints scaling

The scaling of the scorecard points facilitates the attainment of scorepoints that are easy to interpret.

Score points = Weight of Evidence × Regression Coefficient

(v) Scorecard assessment

The SAS Enterprise Miner TM provides various charts that are used to assess the quality of the scorecard:

(a) Scorecard distribution chart – this shows which scores are most frequent and provides insight into whether or not the distribution is normal and whether there are outliers present.

(b) Kolmogorov-Smirnov (K-S) statistic

(c) Gini coefficient

(d) Area under the ROC curve (AUROC)

The K-S statistic, Gini coefficient and AUROC are used to measure the discriminatory power of the scorecard.

(vi) Model comparison

This involved the comparison of the predictive accuracy of neural networks, logistic regression and decision trees using the Model Comparison node in SAS Enterprise Miner TM (SAS Inc. 2012). The AUROC statistic was used to achieve model comparison, and the results were validated using the K-S and Gini statistics.
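The K-S statistic used to assess discriminatory power can be sketched in plain Python as the largest vertical gap between the empirical cumulative score distributions of the two groups (the scores below are hypothetical; this is a sketch outside the SAS tooling used in the study):

```python
def ks_statistic(scores_neg, scores_pos):
    """Kolmogorov-Smirnov statistic: the largest vertical distance between
    the empirical cumulative distributions of two groups of model scores,
    scanned over all candidate score cut-offs."""
    best = 0.0
    for cutoff in sorted(set(scores_neg) | set(scores_pos)):
        cdf_neg = sum(s <= cutoff for s in scores_neg) / len(scores_neg)
        cdf_pos = sum(s <= cutoff for s in scores_pos) / len(scores_pos)
        best = max(best, abs(cdf_neg - cdf_pos))
    return best

# Hypothetical model scores for the HIV-negative and HIV-positive groups
neg_scores = [0.9, 0.8, 0.7, 0.6, 0.4]
pos_scores = [0.5, 0.4, 0.3, 0.2, 0.1]
print(round(ks_statistic(neg_scores, pos_scores), 2))  # 0.8
```

A model whose two score distributions barely overlap yields a statistic near 1; fully overlapping distributions yield a statistic near 0.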

Introduction

In this chapter, the aspects of the experimental methods, planning and design, as well as the tools and procedures for the analysis, are presented and motivated. Some additional details of the different experimental methodologies are explained in chapter 4 in the context of the experimental results.

Research Outline

[Fig. 3.1: Research Study Plan — A. Data exploration and classification; B. Screening design; C. Response surface design (central composite function); D. Comparative study of two response surface methodologies (RSMs); E. Comparative study of RSM with binary logistic regression; G. Application of multilayer perceptron to model demographic characteristics; H. A review of the application of neural networks in modeling HIV/AIDS; I. Model assessment with ROC curves: validation using a scorecard design; J. Development and validation of an HIV risk scorecard; all models compared with the full regression. Additional research not included in the initial research proposal was added to add value to the research project.]

Step One: Data Exploration and Classification

As explained in chapter 2, the methodology of classification will enable the summarization of voluminous and complex datasets; facilitate the detection of relationships and structure within the dataset; allow for more efficient organization and retrieval of information; allow investigators to make predictions or discover hypotheses to account for the structure in the data; and facilitate the formulation of general hypotheses to account for the observed data.

Step Two: Screening Design

This step is undertaken when the experiment has a large number of input variables that have the capacity to influence the response. It is aimed at reducing the number of variables to include only the significant ones. In this current research project, a screening design is going to be used to rank the importance of demographic characteristics in influencing the risk of acquiring HIV infection. As stated in the introduction to this thesis, each pregnant woman attending an antenatal clinic in South Africa is described using various demographic characteristics, such as population group, level of education, age, partner's age, parity, gravidity, etc. In the literature to date, no recorded work has attempted to understand whether these demographic characteristics predispose an individual to acquiring HIV. In other words, this work is geared towards ascertaining whether or not there is a link between demographic characteristics and the risk of acquiring HIV and, if so, applying a screening design to rank the differential effects of these characteristics on the risk of acquiring the HIV infection. However, the screening design has the disadvantage of not being able to effectively characterize possible interactions between demographic characteristics (Sibanda & Pretorius 2011).

Step Three: Response Surface Methodology

As already indicated in the screening objective above, the easiest way of estimating a first-degree polynomial is to use a factorial design, and this technique is sufficient to detect the main effects. However, if it is suspected that there are interactions between explanatory variables, then a more complicated design, such as a response surface methodology, needs to be implemented to estimate a second-degree polynomial model.
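The factorial main-effects estimation just mentioned can be computed by hand: each factor's main effect is the mean response at its high (+1) level minus the mean response at its low (−1) level, and factors are ranked by absolute effect. A Python sketch with hypothetical coded runs and responses (not the SAS screening analysis used in this study):

```python
def main_effects(runs):
    """Rank factors in a two-level (coded -1/+1) screening design by main
    effect: mean response at the high level minus mean response at the
    low level, sorted by absolute magnitude."""
    effects = {}
    for factor in sorted(runs[0][0]):
        high = [y for settings, y in runs if settings[factor] == 1]
        low = [y for settings, y in runs if settings[factor] == -1]
        effects[factor] = sum(high) / len(high) - sum(low) / len(low)
    return sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Hypothetical 2x2 factorial: coded settings for two demographic factors
# and an illustrative response (e.g. an HIV-prevalence measure)
runs = [({"age": -1, "education": -1}, 20.0),
        ({"age": 1, "education": -1}, 32.0),
        ({"age": -1, "education": 1}, 18.0),
        ({"age": 1, "education": 1}, 30.0)]
print(main_effects(runs))  # [('age', 12.0), ('education', -2.0)]
```

In this toy example, age would be retained as the dominant factor, while the small effect of education might be screened out.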
In this study, a central composite face-centred design was used to estimate the coefficients of a second-order polynomial model in the four selected factors believed to influence the risk of acquiring HIV infection (Sibanda & Pretorius 2012).

Step Four: Comparison of two response surface methodologies

This step of the research will be conducted to compare results from two response surface methodologies. This is important as it confirms whether the results obtained are design-specific, and it provides a measure of repeatability by using a different RSM technique. A central composite design, as shown in step three, is the most common response surface design and is built on a factorial design. It requires five factor levels. On the other hand, Box-Behnken designs use the midpoints of the cube edges instead of the corner points, which results in fewer runs; but, unlike the central composite design, all the runs must be done even if there is no curvature. Furthermore, the Box-Behnken design uses only three factor levels, and should be used when the screening experiment indicates curvature to be significant (Sibanda & Pretorius 2013).

Step Five: Comparison of response surface methodology and binary logistic regression results

This step of the research will be conducted to compare the results of modeling the effects of demographic characteristics on the risk of acquiring HIV using a Box-Behnken design and a binary logistic regression. Logistic regression is used in epidemiology to study the relationships between a disease with two modalities (diseased or disease-free) and risk factors, which may be qualitative or quantitative variables. This step of the research is used to benchmark the performance of the design-of-experiments methodologies, as the latter techniques are not traditionally used in disease modeling (Sibanda & Pretorius 2012).

Step Six: Application of MLPs to model demographic characteristics

MLPs are feed-forward artificial neural networks comprising several layers that are fully connected to each other. MLPs employ a supervised learning technique called backpropagation. MLPs will be trained and validated on the given antenatal data, and thereafter used to predict or classify new data. Demographic characteristics will be used as input variables, while the HIV status will be the response parameter (Sibanda & Pretorius 2011).

Step Seven: A review of the application of neural networks in modeling HIV/AIDS

Neural networks are finding increasing application in various fields, ranging from the engineering sciences to the life sciences. This review aims to highlight the use of neural networks in the study of HIV/AIDS (Sibanda & Pretorius 2012).

Step Eight: Model comparison using ROC curves

Receiver operating characteristic (ROC) curves will be used to compare the classification accuracy of the models (Sibanda & Pretorius 2013).

Step Nine: Development and validation of an HIV risk scorecard

This research paper will cover the development of an HIV risk scorecard using SAS Enterprise Miner TM. The project will encompass the selection of the data sample, classing, selection of demographic characteristics, fitting of a regression model, generation of weights-of-evidence (WOE), calculation of information values (IVs), and the creation and validation of an HIV risk scorecard (Sibanda & Pretorius 2013).

Software tools

The design of experiments, neural network and logistic regression analyses in this study were carried out using SAS software, produced by the SAS Institute, Cary, NC, USA. SAS Enterprise Miner TM was used to compare the results from the three modeling methodologies, namely design of experiments, artificial neural networks and binary logistic regression.
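The AUROC statistic that underpins the ROC comparison in step eight has a convenient rank interpretation: it equals the probability that a randomly chosen positive case receives a higher risk score than a randomly chosen negative case, with ties counted as one half. A Python sketch with hypothetical scores:

```python
def auroc(pos_scores, neg_scores):
    """Area under the ROC curve via its rank interpretation: the
    probability that a randomly chosen positive case scores higher than a
    randomly chosen negative case, counting ties as one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical risk scores for positive and negative cases
print(auroc([0.9, 0.8, 0.6, 0.6], [0.7, 0.5, 0.4, 0.2]))  # 0.875
```

A value of 0.5 corresponds to a random model and 1.0 to a perfect model, matching the interpretation of the θ metric.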


Knowledge Discovery and Data Mining. Testing. Performance Measures. Notes. Lecture 15 - ROC, AUC & Lift. Tom Kelsey. Notes

Knowledge Discovery and Data Mining. Testing. Performance Measures. Notes. Lecture 15 - ROC, AUC & Lift. Tom Kelsey. Notes Knowledge Discovery and Data Mining Lecture 15 - ROC, AUC & Lift Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-17-AUC

More information

Identification of Tissue Independent Cancer Driver Genes

Identification of Tissue Independent Cancer Driver Genes Identification of Tissue Independent Cancer Driver Genes Alexandros Manolakos, Idoia Ochoa, Kartik Venkat Supervisor: Olivier Gevaert Abstract Identification of genomic patterns in tumors is an important

More information

BACKPROPOGATION NEURAL NETWORK FOR PREDICTION OF HEART DISEASE

BACKPROPOGATION NEURAL NETWORK FOR PREDICTION OF HEART DISEASE BACKPROPOGATION NEURAL NETWORK FOR PREDICTION OF HEART DISEASE NABEEL AL-MILLI Financial and Business Administration and Computer Science Department Zarqa University College Al-Balqa' Applied University

More information

Gene Selection for Tumor Classification Using Microarray Gene Expression Data

Gene Selection for Tumor Classification Using Microarray Gene Expression Data Gene Selection for Tumor Classification Using Microarray Gene Expression Data K. Yendrapalli, R. Basnet, S. Mukkamala, A. H. Sung Department of Computer Science New Mexico Institute of Mining and Technology

More information

ARTIFICIAL NEURAL NETWORKS TO DETECT RISK OF TYPE 2 DIABETES

ARTIFICIAL NEURAL NETWORKS TO DETECT RISK OF TYPE 2 DIABETES ARTIFICIAL NEURAL NETWORKS TO DETECT RISK OF TYPE DIABETES B. Y. Baha Regional Coordinator, Information Technology & Systems, Northeast Region, Mainstreet Bank, Yola E-mail: bybaha@yahoo.com and G. M.

More information

PREDICTION OF BREAST CANCER USING STACKING ENSEMBLE APPROACH

PREDICTION OF BREAST CANCER USING STACKING ENSEMBLE APPROACH PREDICTION OF BREAST CANCER USING STACKING ENSEMBLE APPROACH 1 VALLURI RISHIKA, M.TECH COMPUTER SCENCE AND SYSTEMS ENGINEERING, ANDHRA UNIVERSITY 2 A. MARY SOWJANYA, Assistant Professor COMPUTER SCENCE

More information

Biostatistics II

Biostatistics II Biostatistics II 514-5509 Course Description: Modern multivariable statistical analysis based on the concept of generalized linear models. Includes linear, logistic, and Poisson regression, survival analysis,

More information

Prediction of Malignant and Benign Tumor using Machine Learning

Prediction of Malignant and Benign Tumor using Machine Learning Prediction of Malignant and Benign Tumor using Machine Learning Ashish Shah Department of Computer Science and Engineering Manipal Institute of Technology, Manipal University, Manipal, Karnataka, India

More information

Still important ideas

Still important ideas Readings: OpenStax - Chapters 1 13 & Appendix D & E (online) Plous Chapters 17 & 18 - Chapter 17: Social Influences - Chapter 18: Group Judgments and Decisions Still important ideas Contrast the measurement

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014 Exam policy: This exam allows two one-page, two-sided cheat sheets (i.e. 4 sides); No other materials. Time: 2 hours. Be sure to write

More information

Survey research (Lecture 1) Summary & Conclusion. Lecture 10 Survey Research & Design in Psychology James Neill, 2015 Creative Commons Attribution 4.

Survey research (Lecture 1) Summary & Conclusion. Lecture 10 Survey Research & Design in Psychology James Neill, 2015 Creative Commons Attribution 4. Summary & Conclusion Lecture 10 Survey Research & Design in Psychology James Neill, 2015 Creative Commons Attribution 4.0 Overview 1. Survey research 2. Survey design 3. Descriptives & graphing 4. Correlation

More information

Survey research (Lecture 1)

Survey research (Lecture 1) Summary & Conclusion Lecture 10 Survey Research & Design in Psychology James Neill, 2015 Creative Commons Attribution 4.0 Overview 1. Survey research 2. Survey design 3. Descriptives & graphing 4. Correlation

More information

A hybrid Model to Estimate Cirrhosis Using Laboratory Testsand Multilayer Perceptron (MLP) Neural Networks

A hybrid Model to Estimate Cirrhosis Using Laboratory Testsand Multilayer Perceptron (MLP) Neural Networks IOSR Journal of Nursing and Health Science (IOSR-JNHS) e-issn: 232 1959.p- ISSN: 232 194 Volume 7, Issue 1 Ver. V. (Jan.- Feb.218), PP 32-38 www.iosrjournals.org A hybrid Model to Estimate Cirrhosis Using

More information

THE USE OF MULTIVARIATE ANALYSIS IN DEVELOPMENT THEORY: A CRITIQUE OF THE APPROACH ADOPTED BY ADELMAN AND MORRIS A. C. RAYNER

THE USE OF MULTIVARIATE ANALYSIS IN DEVELOPMENT THEORY: A CRITIQUE OF THE APPROACH ADOPTED BY ADELMAN AND MORRIS A. C. RAYNER THE USE OF MULTIVARIATE ANALYSIS IN DEVELOPMENT THEORY: A CRITIQUE OF THE APPROACH ADOPTED BY ADELMAN AND MORRIS A. C. RAYNER Introduction, 639. Factor analysis, 639. Discriminant analysis, 644. INTRODUCTION

More information

METHODS FOR DETECTING CERVICAL CANCER

METHODS FOR DETECTING CERVICAL CANCER Chapter III METHODS FOR DETECTING CERVICAL CANCER 3.1 INTRODUCTION The successful detection of cervical cancer in a variety of tissues has been reported by many researchers and baseline figures for the

More information

Chapter 17 Sensitivity Analysis and Model Validation

Chapter 17 Sensitivity Analysis and Model Validation Chapter 17 Sensitivity Analysis and Model Validation Justin D. Salciccioli, Yves Crutain, Matthieu Komorowski and Dominic C. Marshall Learning Objectives Appreciate that all models possess inherent limitations

More information

6. Unusual and Influential Data

6. Unusual and Influential Data Sociology 740 John ox Lecture Notes 6. Unusual and Influential Data Copyright 2014 by John ox Unusual and Influential Data 1 1. Introduction I Linear statistical models make strong assumptions about the

More information

Technical Specifications

Technical Specifications Technical Specifications In order to provide summary information across a set of exercises, all tests must employ some form of scoring models. The most familiar of these scoring models is the one typically

More information

Lecture Outline. Biost 517 Applied Biostatistics I. Purpose of Descriptive Statistics. Purpose of Descriptive Statistics

Lecture Outline. Biost 517 Applied Biostatistics I. Purpose of Descriptive Statistics. Purpose of Descriptive Statistics Biost 517 Applied Biostatistics I Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington Lecture 3: Overview of Descriptive Statistics October 3, 2005 Lecture Outline Purpose

More information

POST GRADUATE DIPLOMA IN BIOETHICS (PGDBE) Term-End Examination June, 2016 MHS-014 : RESEARCH METHODOLOGY

POST GRADUATE DIPLOMA IN BIOETHICS (PGDBE) Term-End Examination June, 2016 MHS-014 : RESEARCH METHODOLOGY No. of Printed Pages : 12 MHS-014 POST GRADUATE DIPLOMA IN BIOETHICS (PGDBE) Term-End Examination June, 2016 MHS-014 : RESEARCH METHODOLOGY Time : 2 hours Maximum Marks : 70 PART A Attempt all questions.

More information

Summary & Conclusion. Lecture 10 Survey Research & Design in Psychology James Neill, 2016 Creative Commons Attribution 4.0

Summary & Conclusion. Lecture 10 Survey Research & Design in Psychology James Neill, 2016 Creative Commons Attribution 4.0 Summary & Conclusion Lecture 10 Survey Research & Design in Psychology James Neill, 2016 Creative Commons Attribution 4.0 Overview 1. Survey research and design 1. Survey research 2. Survey design 2. Univariate

More information

Modelling Research Productivity Using a Generalization of the Ordered Logistic Regression Model

Modelling Research Productivity Using a Generalization of the Ordered Logistic Regression Model Modelling Research Productivity Using a Generalization of the Ordered Logistic Regression Model Delia North Temesgen Zewotir Michael Murray Abstract In South Africa, the Department of Education allocates

More information

Understandable Statistics

Understandable Statistics Understandable Statistics correlated to the Advanced Placement Program Course Description for Statistics Prepared for Alabama CC2 6/2003 2003 Understandable Statistics 2003 correlated to the Advanced Placement

More information

Statistical reports Regression, 2010

Statistical reports Regression, 2010 Statistical reports Regression, 2010 Niels Richard Hansen June 10, 2010 This document gives some guidelines on how to write a report on a statistical analysis. The document is organized into sections that

More information

Emotion Recognition using a Cauchy Naive Bayes Classifier

Emotion Recognition using a Cauchy Naive Bayes Classifier Emotion Recognition using a Cauchy Naive Bayes Classifier Abstract Recognizing human facial expression and emotion by computer is an interesting and challenging problem. In this paper we propose a method

More information

Classification of Smoking Status: The Case of Turkey

Classification of Smoking Status: The Case of Turkey Classification of Smoking Status: The Case of Turkey Zeynep D. U. Durmuşoğlu Department of Industrial Engineering Gaziantep University Gaziantep, Turkey unutmaz@gantep.edu.tr Pınar Kocabey Çiftçi Department

More information

Section 6: Analysing Relationships Between Variables

Section 6: Analysing Relationships Between Variables 6. 1 Analysing Relationships Between Variables Section 6: Analysing Relationships Between Variables Choosing a Technique The Crosstabs Procedure The Chi Square Test The Means Procedure The Correlations

More information

A Brief Introduction to Bayesian Statistics

A Brief Introduction to Bayesian Statistics A Brief Introduction to Statistics David Kaplan Department of Educational Psychology Methods for Social Policy Research and, Washington, DC 2017 1 / 37 The Reverend Thomas Bayes, 1701 1761 2 / 37 Pierre-Simon

More information

PMR5406 Redes Neurais e Lógica Fuzzy. Aula 5 Alguns Exemplos

PMR5406 Redes Neurais e Lógica Fuzzy. Aula 5 Alguns Exemplos PMR5406 Redes Neurais e Lógica Fuzzy Aula 5 Alguns Exemplos APPLICATIONS Two examples of real life applications of neural networks for pattern classification: RBF networks for face recognition FF networks

More information

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo Please note the page numbers listed for the Lind book may vary by a page or two depending on which version of the textbook you have. Readings: Lind 1 11 (with emphasis on chapters 10, 11) Please note chapter

More information

Learning Classifier Systems (LCS/XCSF)

Learning Classifier Systems (LCS/XCSF) Context-Dependent Predictions and Cognitive Arm Control with XCSF Learning Classifier Systems (LCS/XCSF) Laurentius Florentin Gruber Seminar aus Künstlicher Intelligenz WS 2015/16 Professor Johannes Fürnkranz

More information

Readings: Textbook readings: OpenStax - Chapters 1 13 (emphasis on Chapter 12) Online readings: Appendix D, E & F

Readings: Textbook readings: OpenStax - Chapters 1 13 (emphasis on Chapter 12) Online readings: Appendix D, E & F Readings: Textbook readings: OpenStax - Chapters 1 13 (emphasis on Chapter 12) Online readings: Appendix D, E & F Plous Chapters 17 & 18 Chapter 17: Social Influences Chapter 18: Group Judgments and Decisions

More information

Logistic Regression and Bayesian Approaches in Modeling Acceptance of Male Circumcision in Pune, India

Logistic Regression and Bayesian Approaches in Modeling Acceptance of Male Circumcision in Pune, India 20th International Congress on Modelling and Simulation, Adelaide, Australia, 1 6 December 2013 www.mssanz.org.au/modsim2013 Logistic Regression and Bayesian Approaches in Modeling Acceptance of Male Circumcision

More information

DIABETIC RISK PREDICTION FOR WOMEN USING BOOTSTRAP AGGREGATION ON BACK-PROPAGATION NEURAL NETWORKS

DIABETIC RISK PREDICTION FOR WOMEN USING BOOTSTRAP AGGREGATION ON BACK-PROPAGATION NEURAL NETWORKS International Journal of Computer Engineering & Technology (IJCET) Volume 9, Issue 4, July-Aug 2018, pp. 196-201, Article IJCET_09_04_021 Available online at http://www.iaeme.com/ijcet/issues.asp?jtype=ijcet&vtype=9&itype=4

More information

11/24/2017. Do not imply a cause-and-effect relationship

11/24/2017. Do not imply a cause-and-effect relationship Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are highly extraverted people less afraid of rejection

More information

How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis?

How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis? How Does Analysis of Competing Hypotheses (ACH) Improve Intelligence Analysis? Richards J. Heuer, Jr. Version 1.2, October 16, 2005 This document is from a collection of works by Richards J. Heuer, Jr.

More information

COMPARATIVE STUDY ON FEATURE EXTRACTION METHOD FOR BREAST CANCER CLASSIFICATION

COMPARATIVE STUDY ON FEATURE EXTRACTION METHOD FOR BREAST CANCER CLASSIFICATION COMPARATIVE STUDY ON FEATURE EXTRACTION METHOD FOR BREAST CANCER CLASSIFICATION 1 R.NITHYA, 2 B.SANTHI 1 Asstt Prof., School of Computing, SASTRA University, Thanjavur, Tamilnadu, India-613402 2 Prof.,

More information

Index. E Eftekbar, B., 152, 164 Eigenvectors, 6, 171 Elastic net regression, 6 discretization, 28 regularization, 42, 44, 46 Exponential modeling, 135

Index. E Eftekbar, B., 152, 164 Eigenvectors, 6, 171 Elastic net regression, 6 discretization, 28 regularization, 42, 44, 46 Exponential modeling, 135 A Abrahamowicz, M., 100 Akaike information criterion (AIC), 141 Analysis of covariance (ANCOVA), 2 4. See also Canonical regression Analysis of variance (ANOVA) model, 2 4, 255 canonical regression (see

More information

CS 453X: Class 18. Jacob Whitehill

CS 453X: Class 18. Jacob Whitehill CS 453X: Class 18 Jacob Whitehill More on k-means Exercise: Empty clusters (1) Assume that a set of distinct data points { x (i) } are initially assigned so that none of the k clusters is empty. How can

More information

Classıfıcatıon of Dıabetes Dısease Usıng Backpropagatıon and Radıal Basıs Functıon Network

Classıfıcatıon of Dıabetes Dısease Usıng Backpropagatıon and Radıal Basıs Functıon Network UTM Computing Proceedings Innovations in Computing Technology and Applications Volume 2 Year: 2017 ISBN: 978-967-0194-95-0 1 Classıfıcatıon of Dıabetes Dısease Usıng Backpropagatıon and Radıal Basıs Functıon

More information

Computational Cognitive Neuroscience

Computational Cognitive Neuroscience Computational Cognitive Neuroscience Computational Cognitive Neuroscience Computational Cognitive Neuroscience *Computer vision, *Pattern recognition, *Classification, *Picking the relevant information

More information

A STATISTICAL PATTERN RECOGNITION PARADIGM FOR VIBRATION-BASED STRUCTURAL HEALTH MONITORING

A STATISTICAL PATTERN RECOGNITION PARADIGM FOR VIBRATION-BASED STRUCTURAL HEALTH MONITORING A STATISTICAL PATTERN RECOGNITION PARADIGM FOR VIBRATION-BASED STRUCTURAL HEALTH MONITORING HOON SOHN Postdoctoral Research Fellow ESA-EA, MS C96 Los Alamos National Laboratory Los Alamos, NM 87545 CHARLES

More information

Chapter 3: Describing Relationships

Chapter 3: Describing Relationships Chapter 3: Describing Relationships Objectives: Students will: Construct and interpret a scatterplot for a set of bivariate data. Compute and interpret the correlation, r, between two variables. Demonstrate

More information

A Practical Guide to Getting Started with Propensity Scores

A Practical Guide to Getting Started with Propensity Scores Paper 689-2017 A Practical Guide to Getting Started with Propensity Scores Thomas Gant, Keith Crowland Data & Information Management Enhancement (DIME) Kaiser Permanente ABSTRACT This paper gives tools

More information

Modern Regression Methods

Modern Regression Methods Modern Regression Methods Second Edition THOMAS P. RYAN Acworth, Georgia WILEY A JOHN WILEY & SONS, INC. PUBLICATION Contents Preface 1. Introduction 1.1 Simple Linear Regression Model, 3 1.2 Uses of Regression

More information

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1 Nested Factor Analytic Model Comparison as a Means to Detect Aberrant Response Patterns John M. Clark III Pearson Author Note John M. Clark III,

More information

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality

More information

Artificial Neural Networks and Near Infrared Spectroscopy - A case study on protein content in whole wheat grain

Artificial Neural Networks and Near Infrared Spectroscopy - A case study on protein content in whole wheat grain A White Paper from FOSS Artificial Neural Networks and Near Infrared Spectroscopy - A case study on protein content in whole wheat grain By Lars Nørgaard*, Martin Lagerholm and Mark Westerhaus, FOSS *corresponding

More information

Fundamental Clinical Trial Design

Fundamental Clinical Trial Design Design, Monitoring, and Analysis of Clinical Trials Session 1 Overview and Introduction Overview Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics, University of Washington February 17-19, 2003

More information

3. Model evaluation & selection

3. Model evaluation & selection Foundations of Machine Learning CentraleSupélec Fall 2016 3. Model evaluation & selection Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr

More information

Assignment 4: True or Quasi-Experiment

Assignment 4: True or Quasi-Experiment Assignment 4: True or Quasi-Experiment Objectives: After completing this assignment, you will be able to Evaluate when you must use an experiment to answer a research question Develop statistical hypotheses

More information