RESPONSE SURFACE MODELING AND OPTIMIZATION TO ELUCIDATE THE DIFFERENTIAL EFFECTS OF DEMOGRAPHIC CHARACTERISTICS ON HIV PREVALENCE IN SOUTH AFRICA

W. Sibanda 1* and P. Pretorius 2
1 DST/NWU Pre-clinical platform, North West University, South Africa, wilbert.sibanda@nwu.ac.za
2 School of Information Technology, North West University, South Africa, philip.pretorius@nwu.ac.za
* Corresponding Author

ABSTRACT

In this study, a Central Composite Face Centered (CCF) design was employed to study the individual and interaction effects of demographic characteristics on the spread of HIV in South Africa. The demographic characteristics studied for each pregnant mother attending an antenatal clinic in South Africa were the mother's age, the partner's age, the mother's level of education and parity. Using the 2007 South African annual antenatal HIV and syphilis seroprevalence data, the HIV status of an antenatal clinic attendee was found to be highly sensitive to changes in the pregnant woman's age and her partner's age. Individually, the pregnant woman's level of education and parity had no significant effect on HIV status. However, these two demographic characteristics exhibited significant effects on the HIV status of antenatal clinic attendees in two-way interactions with other demographic characteristics. A 3D response surface plot indicated that the highest rate of HIV-positive individuals occurred at the highest age of the pregnant women and the lowest age of their partners.

CIE42 Proceedings, 16-18 July 2012, Cape Town, South Africa, 2012 CIE & SAIIE

1 INTRODUCTION

In South Africa, the annual antenatal HIV survey is the only existing national surveillance activity for determining HIV prevalence, and it is therefore a vitally important tool for tracking the geographic and temporal trends of the epidemic (Department of Health) [1]. Antenatal clinic data contain the following demographic characteristics for each pregnant woman: age (herein called mothage), population group (race), level of education (herein called education), gravidity (number of pregnancies), parity (number of children born), partner's age (herein called fathage), name of clinic, and HIV and syphilis results (Department of Health) [2].

This research paper explores the application of response surface methodology (RSM) to study the intricate relationships between the demographic characteristics in the antenatal data and one response variable (HIV prevalence). RSM is a collection of mathematical and statistical techniques used for modelling and analysing problems in which a response of interest is influenced by several variables and the objective is to optimize this response (Montgomery) [4]. The specific RSM design used in this research is the Central Composite Face Centred (CCF) design, a variant of the central composite design first proposed by G. E. P. Box and K. B. Wilson in 1951.

This study follows up on our previous work (Sibanda) [3], in which a two-level fractional factorial design was used to rank the demographic characteristics, from most to least important, by their effect on the HIV status of pregnant mothers attending antenatal clinics for the first time in South Africa. The two-level fractional factorial design demonstrated that, among the demographic characteristics, the mother's age had the greatest influence on the HIV status of antenatal clinic attendees. The effects of the remaining demographic characteristics were ranked using Lenth's plot (figure 1) as follows: mother's age > level of education > parity > father's age > gravidity > syphilis.

Figure 1: Lenth Plot

The results of the two-level fractional factorial design are summarized in Tables 1 and 2 below.

Table 1: Summary of Results for Two-Level Fractional Factorial Design

Summary statistic              Value
R²                             0.84
R² adjusted                    0.76
Standard Error                 0.18
PRESS                          0.52
R² for Prediction              0.34
First Order Autocorrelation    -0.74
Collinearity                   0.83
Coefficient of Variation       52.17
Precision Index                9.96

Table 2: ANOVA for Two-Level Fractional Factorial Design

Source        SS      SS%    MS      F        F signif   df
Regression    0.66    84     0.33    10.64    0.03       2
Residual      0.12    16     0.03                        4
LOF Error     0.05    43     0.05    2026     0.23       1
Pure Error    0.07    57     0.02                        3
Total         0.78    100                                6

As shown in Table 1, the adjusted R² (coefficient of determination) value for the fitted model was 0.76. The R² value gives the proportion of variability in a data set that is accounted for by the statistical model, and so indicates the goodness of fit of the model. The R² for prediction indicates how well future outcomes are likely to be predicted by the model. The moderate adjusted R² value of the fractional factorial model suggested that a response surface model (RSM) might help to elucidate the effect of interactions between demographic characteristics on the regression model. This view was further supported by the low value of the F-statistic (F = 10.64). The F value indicates the overall significance of the regression model and is used to decide whether the model as a whole has statistically significant predictive capability.
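The relationship between the ANOVA entries in Table 2 and the summary statistics in Table 1 can be sketched in a few lines of Python. This is a minimal illustration using the standard textbook formulas, not the output of the software used in the study; the function names are hypothetical. Because Table 2 prints rounded sums of squares, values recomputed from it differ slightly from the printed R² and F.

```python
def r_squared(ss_regression, ss_total):
    """Proportion of the total sum of squares explained by the model."""
    return ss_regression / ss_total


def adjusted_r_squared(r2, df_total, df_residual):
    """R-squared penalized for the number of model terms."""
    return 1.0 - (1.0 - r2) * df_total / df_residual


def f_statistic(ms_regression, ms_residual):
    """Overall model F value: regression mean square over residual mean square."""
    return ms_regression / ms_residual


# Rounded values taken from Tables 1 and 2.
r2_from_rounded_ss = r_squared(0.66, 0.78)        # about 0.85; Table 1 prints 0.84
adj_r2 = adjusted_r_squared(0.84, df_total=6, df_residual=4)
f_from_rounded_ms = f_statistic(0.33, 0.03)       # about 11; Table 2 prints 10.64

print(round(adj_r2, 2))
```

With the printed R² of 0.84, 6 total degrees of freedom and 4 residual degrees of freedom, the adjusted R² formula reproduces the 0.76 reported in Table 1.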

2 LITERATURE REVIEW

2.1 Response Surface Methodology (RSM)

RSM is a collection of statistical and mathematical methods that are useful for modelling and analysing designed experiments. RSM experiments are designed to allow estimation of interaction and even quadratic effects, and therefore give an idea of the local shape of the response surface being investigated. Linear terms alone produce models whose response surfaces are hyperplanes. The addition of interaction terms allows for warping of the hyperplane. Squared terms produce the simplest models in which the response surface has a maximum or minimum, and so an optimal response.

RSM fundamentally comprises three techniques (Myers) [5], namely:

Statistical experimental design
Regression modelling
Optimization

The steps involved in the design of experiments using RSM are outlined in figure 2.

1. Design of Experiments for measurement of response
2. Mathematical model development
3. Finding Optimal set of experimental parameters
4. Two or Three dimensional plots of interactive effects

Figure 2: Design procedure of an RSM

2.2 Central Composite Face Centred (CCF) Design

The Central Composite Face Centered (CCF) design is an RSM design that is widely used for fitting a second-order response surface (Mutnury) [6]. A CCF design combines two-level factorial points with axial points placed on the faces of the design cube and with center runs. The factorial points represent a variance-optimal design for a first-order model, and the center runs provide information about the existence of curvature in the system (Zhang) [7]. If curvature is found in the system, the addition of axial points allows for efficient estimation of the pure quadratic terms.
Therefore the CCF design is useful for experiments in which there is a need to fit a second-order response surface.

3 EXPERIMENTAL METHODOLOGY

3.1 Sources of Data

The seroprevalence data studied were obtained from the 2007 South African antenatal data, supplied by the National Department of Health of South Africa (Department of Health) [1]. The data consisted of about 32 000 subjects who attended antenatal clinics for the first time across the nine provinces of South Africa in 2007.
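The three kinds of points that make up a CCF design (Section 2.2) can be enumerated with a short Python sketch. This is an illustration under stated assumptions rather than the procedure used by the design software: the function name `ccf_design` is hypothetical, and the sketch uses a full two-level factorial, whereas the 21-run design in Table 3 contains only 8 factorial-type runs and therefore evidently uses a fraction of the factorial.

```python
from itertools import product


def ccf_design(n_factors, n_center=1):
    """Return the coded runs of a face-centered central composite design.

    Factorial points sit at the corners of the cube (+/-1 on every factor),
    axial points sit at the center of each face (+/-1 on one factor, 0 on
    the others), and center runs sit at the origin.
    """
    factorial = [list(levels) for levels in product((-1, 1), repeat=n_factors)]

    axial = []
    for i in range(n_factors):
        for level in (-1, 1):
            point = [0] * n_factors
            point[i] = level
            axial.append(point)

    center = [[0] * n_factors for _ in range(n_center)]
    return factorial + axial + center


design = ccf_design(4, n_center=5)
print(len(design))  # 16 factorial + 8 axial + 5 center runs
```

Because the axial points lie on the faces of the cube, every factor takes only the three coded levels -1, 0 and +1, which suits factors such as parity or education that cannot be set outside their observed range.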

3.2 Research Tools

This research utilized the following research tools:

1. Design Expert V8 Software (Design Expert) [8]
2. SAS 9.3, an integrated system of software products provided by SAS Institute Inc.
3. Essential Regression and Experimental Design, version 2.2 (Gibsonia, PA)

3.3 Sampling Procedure

To facilitate the experimental design, the data were completely randomized. This was done as a preprocessing step to reduce bias in the design of the experiment.

3.4 Missing Data

Out of the total of 31 808 cases in the 2007 South African antenatal seroprevalence database, 21 646 (68%) cases were found to be complete. The remaining 10 162 (32%) cases were incomplete and were discarded.

3.5 Variables

The variables used in the study were parity, education, mothage, fathage and HIV status. The integer value representing level of education stands for the highest grade successfully completed, with 13 representing tertiary education. Parity represents the number of times the individual has given birth. Parity is important because it reflects both the reproductive activity and the reproductive health of the women. HIV status is binary coded: 1 represents a positive status and 0 a negative status.

3.6 Experimental Design

In this study, the aim was to use a Central Composite Face Centered (CCF) design to study the individual and interaction effects of demographic characteristics on the HIV status of a pregnant mother using seroprevalence data. The CCF design with four factors and one response variable was developed as shown in Table 3. A two-factor interaction (2FI) design model was used, with 21 runs and no blocks. The levels -1 and +1 denote the minimum and maximum levels of the factors respectively.
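The preprocessing described in Sections 3.3 to 3.5 (discarding incomplete cases, randomizing the order, and coding each factor to the levels -1, 0 and +1) could be sketched as follows. This is a hypothetical illustration, not the authors' SAS or Design Expert workflow. The record layout, the function names and the seed are assumptions. The cut points follow Table 6, but that table leaves the boundary values (for example a mother aged exactly 20 or 30) unassigned, so their treatment here is also an assumption.

```python
import random

FIELDS = ("mothage", "fathage", "education", "parity", "hiv")

# (highest value coded -1, lowest value coded +1), read from Table 6.
# Assigning the boundary values themselves to -1 / +1 is an assumption.
CUTS = {
    "mothage": (20, 30),
    "fathage": (24, 34),
    "education": (8, 12),
    "parity": (0, 2),
}


def complete_cases(records):
    """Keep only records in which every study variable is present."""
    return [r for r in records if all(r.get(f) is not None for f in FIELDS)]


def code_level(value, low_max, high_min):
    """Map a raw value to the coded level -1, 0 or +1."""
    if value <= low_max:
        return -1
    if value >= high_min:
        return 1
    return 0


def preprocess(records, seed=0):
    """Drop incomplete cases, shuffle the rest, and code the four factors."""
    rows = complete_cases(records)
    random.Random(seed).shuffle(rows)
    coded = []
    for r in rows:
        row = {f: code_level(r[f], *CUTS[f]) for f in CUTS}
        row["hiv"] = r["hiv"]  # already binary: 1 positive, 0 negative
        coded.append(row)
    return coded


# Made-up records for illustration only.
sample = [
    {"mothage": 19, "fathage": 23, "education": 7, "parity": 0, "hiv": 0},
    {"mothage": 25, "fathage": 30, "education": 10, "parity": 1, "hiv": 1},
    {"mothage": 35, "fathage": None, "education": 12, "parity": 3, "hiv": 0},
]
print(preprocess(sample))
```

The third made-up record is dropped because the partner's age is missing, mirroring the complete-case rule in Section 3.4.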
Table 3: The CCF Design Matrix with 4 Factors, 1 Response Variable and 4 Center Points

Run   Mothage   Fathage   Education   Parity   HIV
1     1         -1        -1          1        -
2     0         0         0           0        0.34
3     -1        1         -1          1        0.13
4     -1        1         1           1        -
5     0         0         0           0        0.34
6     0         1         0           0        0.30
7     1         0         0           0        -
8     0         0         0           0        0.34
9     0         0         0           1        0.31
10    -1        -1        -1          -1       0.14
11    1         -1        1           1        0.21
12    -1        -1        1           -1       0.00
13    0         0         0           0        0.34
14    0         -1        0           0        0.37
15    0         0         1           0        0.33
16    0         0         0           -1       0.30
17    0         0         -1          0        -
18    -1        0         0           0        0.10
19    1         1         1           -1       0.36
20    1         1         -1          -1       -
21    0         0         0           0        0.34

3.6.1 Design Matrix Evaluation

Degrees of Freedom

Design matrix evaluation showed that there were no aliases for the 2FI model. The degrees of freedom for the matrix are shown in Table 4. As a rule of thumb, a minimum of 3 lack-of-fit df and 4 pure error df ensures a valid lack-of-fit test. Fewer df tend to lead to a test that may not detect lack of fit (Design Expert) [8].

Table 4: Degrees of Freedom for matrix evaluation

Model          10
Residuals      10
Lack of Fit    6
Pure Error     4
Corr total     20

Standard Errors

The standard errors of the design are shown in figure 3. These errors are larger at the edges of the design, so it is advisable to work well within the design margins to achieve a greater degree of accuracy.

Figure 3: 3D Plot of standard error of design (axes: A: mothage, B: fathage; vertical axis: Std Error of Design)

Variance Inflation Factor (VIF)

The Variance Inflation Factor (VIF) quantifies the severity of multicollinearity in an ordinary least squares regression analysis. It provides an index that measures how much the variance of an estimated regression coefficient is increased because of collinearity. Ideally, VIF values should be 1, and values greater than 10 indicate that coefficients are poorly estimated due to multicollinearity (Design Expert) [8]. The VIF values in Table 5 indicate that the coefficients of the individual demographic characteristics and their interactions are estimated adequately, without multicollinearity. However, the quadratic terms displayed a higher degree of multicollinearity.

Table 5: Signal to noise ratio with the design matrix

Term   VIF   Ri-squared
A      1.0   0.0
B      1.0   0.0
C      1.0   0.0
D      1.0   0.0
E      1.0   0.0
AB     1.0   0.0
AC     1.0   0.0
AD     1.0   0.0
AE     1.0   0.0
BC     1.0   0.0
BD     1.0   0.0
BE     1.0   0.0
CD     1.0   0.0
CE     1.0   0.0
DE     1.0   0.0
A²     4     0.77
B²     4     0.77
C²     4     0.77
D²     4     0.77
E²     4     0.77

Ri-squared

In general, high Ri-squared values mean that the terms are correlated with each other, leading to a poor model. For this experiment, low Ri-squared values were obtained for the individual factors and their interactions, but higher Ri-squared values were obtained for the quadratic terms, as shown in Table 5.

Fraction of Design Space (FDS)

The FDS curve (figure 4) shows the percentage of the design space volume that has a given standard error of prediction or less. A flatter FDS curve means that the prediction error is more nearly constant over the design space. In general, the larger the standard error of prediction, the less likely it is that the results can be repeated, and the less likely it is that a significant effect will be detected.

Figure 4: FDS Plot of the Standard Error over the Design Space (x-axis: Fraction of Design Space; y-axis: Std Error Mean)
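The VIF and Ri-squared columns of Table 5 are linked by the standard definition VIF = 1 / (1 - Ri²), where Ri² is the coefficient of determination obtained when term i is regressed on all the other model terms. A minimal sketch (the function name `vif` is hypothetical):

```python
def vif(r_i_squared):
    """Variance inflation factor for a term with the given Ri-squared."""
    if not 0.0 <= r_i_squared < 1.0:
        raise ValueError("Ri-squared must lie in [0, 1)")
    return 1.0 / (1.0 - r_i_squared)


print(vif(0.0))              # terms uncorrelated with the others
print(round(vif(0.77), 2))   # quadratic terms in Table 5
```

An Ri-squared of 0 gives a VIF of exactly 1, as for the main effects and interactions in Table 5, while an Ri-squared of 0.77 gives a VIF of roughly 4.3, consistent with the rounded value of 4 listed for the quadratic terms and well below the threshold of 10.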

3.6.2 Choice of Levels for the Factors

Table 6: Factor Levels

Factor                      Level -1   Level 0   Level 1
Parity (No. of children)    0          1         > 2
Education (Grades)          < 8        9-11      12-13
Mothage (years)             < 20       21-29     > 30
Fathage (years)             < 24       25-33     > 34

4 RESULTS

4.1 Response Transformations

A ratio of maximum to minimum response greater than 10 usually indicates that a transformation is required, whereas a ratio below 10 indicates that a power transformation will have little effect. Based on the ratio in Table 7, the response (HIV) was not transformed in this study.

Table 7: Response Ratio

                  Minimum   Maximum   Ratio
Response (HIV)    0.09      0.33      0.33/0.00 = 0

4.2 Fit Summary

4.2.1 Model Summary Statistics

Table 8: Model Summary Statistics

Source   Sequential p-value   Lack-of-fit p-value   R²     Adjusted R²   Adeq. Precision
Linear   0.0252               0.0002                0.64   0.50
2FI      0.0005               0.0103                0.99   0.98          25

The R² and adjusted R² statistics of the 2FI model are high, at 0.99 and 0.98 respectively, as shown in Table 8. High R² values imply that a large proportion of the variation in the observed values is explained by the model. In addition, the lack-of-fit p-value of 0.0103 for the 2FI model indicates that lack of fit is not significant at the 1% level.

4.2.2 ANOVA for 2FI Response Surface

From the ANOVA results (Table 9), it is evident that the mother's age and the father's age are significant terms in the 2FI model, while educational level and parity individually are not. However, the individually non-significant terms are significant in two-way interactions with other demographic characteristics. The model F-value (Table 9) of 63.77 implies that the model is significant: there is only a 0.01% chance that a model F-value this large could be due to noise.

Table 9: ANOVA Results

Source           Sum of Squares   df   Mean square   F value   P value
Model            0.18             9    0.2           63.77     0.0001
A - Mothage      0.047            1    0.047         146.5     <0.0001
B - Fathage      0.002            1    0.002         7.72      0.0390
C - Education    0.000001         1    0.000001      0.004     0.9498
D - Parity       0.00005          1    0.00005       0.16      0.7079
AB               0.11             1    0.11          33.11     0.0022
AC               0.038            1    0.038         118.44    0.0001
AD               0.007            1    0.007         20.58     0.0062
BC               0.024            1    0.024         75.8      0.0003
BD               0.011            1    0.011         33.63     0.0021
CD               0.000            0    0.000

Adequate precision (Adeq. Precision in Table 8) measures the signal-to-noise ratio. A ratio greater than 4 is desirable, and for this experiment a ratio of 25 indicates an adequate signal. Therefore this model can be used to navigate the design space.

5 RESIDUAL ANALYSIS

There are many statistical tools for model validation, but the primary tool for most process modelling applications is graphical residual analysis. Residual plots assist in examining the underlying statistical assumptions about residuals (see Table 10). Residual analysis is therefore a useful class of techniques for evaluating the goodness of a fitted model. One method of residual analysis is the normal plot of residuals.
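The coordinates of a normal plot of residuals can be computed with the Python standard library alone. The sketch below is illustrative only: the function name and the residual values are made up, and the plotting positions (i - 0.5)/n are one common convention among several, which may differ from the convention used by Design Expert.

```python
from statistics import NormalDist


def normal_plot_points(residuals):
    """Pair each sorted residual with its theoretical standard normal quantile.

    Uses the plotting positions (i - 0.5) / n for i = 1..n. Points that fall
    close to a straight line suggest approximately normal residuals.
    """
    ordered = sorted(residuals)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [
        (std_normal.inv_cdf((i + 0.5) / n), r)  # i runs from 0, hence + 0.5
        for i, r in enumerate(ordered)
    ]


# Hypothetical studentized residuals, for illustration only.
example = [-1.2, 0.4, -0.3, 1.1, 0.0]
for quantile, residual in normal_plot_points(example):
    print(f"{quantile:6.2f}  {residual:5.2f}")
```

Plotting the second column against the first gives the normal plot; systematic curvature or isolated points far from the line are the patterns to look for.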

Table 10: Statistical assumptions about residuals

Independence        Whether response variables are independent
Normality           Whether response variables are normally distributed
Homoscedasticity    Whether all response variables have the same variance
Linearity           Whether the true relationship between response and explanatory variables is a straight line

5.1 Normal Plot of Residuals

The normal plot of residuals (Figure 5) is used to check whether the residuals are normally distributed and whether there are outliers in the dataset. All the points lie on the diagonal, implying that the residuals constitute normally distributed noise. A curved pattern would indicate non-modelled quadratic relations or incorrect transformations.

Figure 5: Normal plot of residuals (x-axis: Internally Studentized Residuals; y-axis: Normal % Probability)

6 FINAL EQUATION OF THE RESPONSE MODEL

The final equation of the HIV response model, in coded units, was as follows:

HIV = + 0.33
      + 0.23  * Mothage
      - 0.035 * Fathage
      - 0.013 * Education
      + 0.005 * Parity
      - 0.140 * Mothage * Fathage
      - 0.120 * Mothage * Education
      - 0.070 * Mothage * Parity
      - 0.020 * Fathage * Parity
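The final equation can be evaluated directly at any point of the coded design space. The sketch below simply transcribes the coefficients printed above into a Python function (the function name `predict_hiv` is hypothetical) and evaluates the four corners of the mothage and fathage plane with education and parity held at their center level.

```python
def predict_hiv(mothage, fathage, education, parity):
    """Predicted HIV response for factor settings given in coded units (-1..+1)."""
    return (
        0.33
        + 0.23 * mothage
        - 0.035 * fathage
        - 0.013 * education
        + 0.005 * parity
        - 0.140 * mothage * fathage
        - 0.120 * mothage * education
        - 0.070 * mothage * parity
        - 0.020 * fathage * parity
    )


# Corners of the mothage/fathage plane, education and parity at level 0.
for a in (-1, 1):
    for b in (-1, 1):
        print(a, b, round(predict_hiv(a, b, 0, 0), 3))
```

Among these four corners the equation gives its largest value at the highest mother's age and lowest father's age, which is the pattern described for the 3D response surface plot.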

A coefficient plot (figure 6) was drawn to represent the information provided by the 2FI response model equation. Coefficient plots show clearly the relative importance of each term in the model equation.

Figure 6: Coefficient Plot of the Different Demographic Characteristics (terms shown: mothage, fathage, education, parity, mot*fat, mot*edu, mot*parity, fat*parity)

Inspection of the regression coefficients (figure 6) indicates that the two main-effect terms, level of education and parity, are not significant and can be removed from the model.

7 PERTURBATION PLOT

The perturbation plot (Figure 7) compares the effect of all factors at a particular point in the design space. A steep slope or curvature in a factor shows that the response is sensitive to that factor. A relatively flat line shows insensitivity to change in that particular factor. However, the perturbation plot does not show interactions. From figure 7, the perturbation plot indicates that the effects of the demographic characteristics on the response are in the order:

Mothage (A) > Fathage (B) > Education (C) > Parity (D)

Figure 7: Perturbation Plot (x-axis: Deviation from Reference Point (Coded Units); y-axis: HIV; traces A, B, C, D)
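The ordering read from the perturbation plot can be reproduced from the model equation of Section 6 by moving one factor at a time from -1 to +1 while holding the others at a reference point. The sketch below assumes that the reference point is the center of the design (all factors at coded level 0), which the text does not state explicitly; the names `predict` and `perturbation_range` are hypothetical.

```python
INTERCEPT = 0.33
MAIN = {"mothage": 0.23, "fathage": -0.035, "education": -0.013, "parity": 0.005}
INTERACTION = {
    ("mothage", "fathage"): -0.140,
    ("mothage", "education"): -0.120,
    ("mothage", "parity"): -0.070,
    ("fathage", "parity"): -0.020,
}


def predict(x):
    """Evaluate the 2FI model for a dict of coded factor levels."""
    y = INTERCEPT + sum(coef * x[f] for f, coef in MAIN.items())
    y += sum(coef * x[a] * x[b] for (a, b), coef in INTERACTION.items())
    return y


def perturbation_range(factor, reference=None):
    """Change in the response as one factor moves from -1 to +1,
    with all other factors held at the reference point (assumed center)."""
    ref = dict.fromkeys(MAIN, 0.0) if reference is None else dict(reference)
    return predict({**ref, factor: 1.0}) - predict({**ref, factor: -1.0})


ranking = sorted(MAIN, key=lambda f: abs(perturbation_range(f)), reverse=True)
print(ranking)
```

At the center point every interaction term vanishes, so each trace is a straight line whose slope is the main-effect coefficient. This is why the plot ranks the factors by their main effects and cannot reveal interactions.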

8 3D RESPONSE SURFACE PLOT

Figure 8 shows the 3D plot of the influences of mothage and fathage on the HIV response. The highest rate of HIV is observed at the highest age of the mother and the lowest age of the father.

Figure 8: 3D Response Surface plot (axes: A: mothage, B: fathage; vertical axis: HIV)

9 DISCUSSION

A central composite face centered (CCF) design was found to be suitable for studying the involvement of demographic characteristics in determining the HIV status of pregnant women attending antenatal clinics in South Africa. The 2FI polynomial function for mothage, fathage, education and parity obtained using StatEase Design Expert was found to be statistically significant. The measured HIV prevalence response was in close agreement with the predicted values, as shown in Figure 9.

Figure 9: Plot of Predicted vs. Actual Response (x-axis: Actual; y-axis: Predicted)

10 CONCLUSION

The CCF design confirmed the results obtained with the fractional factorial design (Sibanda) [3], namely that the mother's age had the greatest effect on the HIV status of an antenatal clinic attendee. However, the CCF design further demonstrated that interactions between factors had a significant effect on an individual's HIV status. The R² value of the predictive model improved from 33.5% (fractional factorial design in the previous study) to 98% (CCF). The latter result demonstrates that the relationship between the demographic characteristics and the HIV response was better modelled by a 2FI function.

11 ACKNOWLEDGEMENTS

Wilbert Sibanda acknowledges doctoral funding from the South African Centre for Epidemiological Modelling (SACEMA), the Medical Research Council (MRC) and North-West University. Special thanks to Cathrine Tlaleng Sibanda and the National Department of Health (South Africa) for the antenatal seroprevalence data (2006-2007).

12 REFERENCES

[1] Department of Health. 2010. National Antenatal Sentinel HIV and Syphilis Prevalence in South Africa.
[2] Department of Health. 2010. Protocol for implementing the National Antenatal Sentinel HIV and Syphilis Prevalence Survey in South Africa.
[3] Sibanda, W. 2011. Application of Two-level Fractional Factorial Design to Determine and Optimize the Effect of Demographic Characteristics on HIV Prevalence using the 2006 South African Annual Antenatal HIV and Syphilis Seroprevalence data, International Journal of Computer Applications, 35 (12).
[4] Montgomery, D.C. 2008. Design and Analysis of Experiments, John Wiley and Sons.
[5] Myers, R.H. 2002. Response Surface Methodology: Process and Product Optimization Using Designed Experiments, 2nd Edition, John Wiley and Sons.
[6] Mutnury, B. 2011. Modeling and Characterization of High Speed Interfaces in Blade and Rack Servers Using Response Surface Model, Electronic Components and Technology Conference (ECTC).
[7] Zhang, Z. 2008. Comparison about the Three Central Composite Designs with Simulation, International Conference on Advanced Computer Control (ICACC).
[8] Design Expert 8.0.71. StatEase software.