Measuring Goodness of Fit for the Double-Bounded Logit Model

Barbara J. Kanninen and M. Sami Khawaja

The traditional approaches to measuring goodness of fit are shown to be inappropriate in the case of the double-bounded logit model. An alternate approach, called the "sequential classification procedure," is presented as a possible alternative to the standard tests. The double-bounded logit model is reviewed along with the standard goodness-of-fit measures. The sequential classification procedure and its features are presented in the context of an empirical example.

Key words: contingent valuation, discrete choice, double-bounded model, goodness-of-fit test, logit.

Contingent valuation (CV) methods are used to elicit information about willingness to pay (WTP) for nonmarket public goods, such as environmental quality. Because of the difficulty of obtaining reliable responses to direct WTP questions, researchers typically use discrete response techniques to obtain the information they desire. In the standard binary logit ("single-bounded") approach, survey respondents are asked to provide one yes/no response to a question that asks them whether they are willing to pay a stated dollar amount for the program or policy in question. Recently a "double-bounded" approach has been introduced in which one additional dollar amount is offered as a follow-up to the initial response (Carson, Hanemann, and Mitchell). Hanemann, Loomis, and Kanninen have shown that this approach improves the efficiency of the estimated parameters of WTP.

This paper addresses the question of measuring the goodness of fit of double-bounded logit models. We show that the standard goodness-of-fit measures for discrete response models are inappropriate (and in most cases not even possible to estimate) in the case of the double-bounded logit model. We then offer a procedure that we call the "sequential classification procedure" as a possible alternative to the standard goodness-of-fit approaches.
In the first section, we review the double-bounded logit model. In the second section, we discuss standard goodness-of-fit measures and show that they are inappropriate for use with the double-bounded approach. In the third section, we propose the sequential classification procedure as a way to measure goodness of fit and discuss the features of this procedure. In the fourth section, we consider an empirical example, and in the fifth section, we discuss appropriate values for the goodness-of-fit measures derived using our approach.

Double-Bounded Logit Model

In a double-bounded approach, respondents are engaged in two rounds of questions. In a WTP experiment, if the response to the initial question "Are you willing to pay $BID for the program just described?" is yes, the follow-up question uses a higher bid value; alternatively, if the response is no, then the follow-up question uses a lower bid value. As a result, the researcher is able to place each respondent in one of four categories: "yes/yes," "yes/no," "no/yes," and "no/no," all of which correspond to smaller, more informative intervals around each respondent's WTP amount.

The mathematics of the double-bounded model are a straightforward extension of the single-bounded model. The probability of a respondent saying yes to the initial bid value offered (BID) is

(1)  P_i^Y = prob(yes) = prob(WTP_i >= BID)

Barbara J. Kanninen is assistant professor at the Hubert H. Humphrey Institute of Public Affairs, and M. Sami Khawaja is project director at Barakat & Chamberlin, Inc.

Amer. J. Agr. Econ. 77 (November 1995): 885-890
Copyright 1995 American Agricultural Economics Association

and the probability of obtaining a no response is (1 - P_i^Y). In this discussion, we use the logit model, so that P_i^Y takes the following form[1]

(2)  P_i^Y = 1 / (1 + e^{-(α + β BID_i)})

which leads to the standard binary choice log-likelihood function

(3)  L_SB = Σ_i y_i log P_i^Y + Σ_i (1 - y_i) log(1 - P_i^Y)

where y_i equals 1 if the response is yes, and 0 otherwise.

Now we consider the double-bounded format, where each participant is presented with two sequential bid values and the second bid value is conditional on the first bid value. Following Hanemann, Loomis, and Kanninen, the following response probabilities are obtained for the logit model:

(4)  P_i^YY = 1 / (1 + e^{-(α + β HIGHBID)})

(5)  P_i^NN = 1 - 1 / (1 + e^{-(α + β LOWBID)})

(6)  P_i^YN = 1 / (1 + e^{-(α + β STBID)}) - 1 / (1 + e^{-(α + β HIGHBID)})

(7)  P_i^NY = 1 / (1 + e^{-(α + β LOWBID)}) - 1 / (1 + e^{-(α + β STBID)})

where STBID represents the starting bid value, LOWBID represents the follow-up lower bid value, and HIGHBID represents the follow-up higher bid value. The double-bounded log-likelihood function now has four parts:

(8)  L_DB = Σ_i [ y_i^YY log P_i^YY + y_i^YN log P_i^YN + y_i^NY log P_i^NY + y_i^NN log P_i^NN ]

where y_i^XX equals 1 if respondent i falls in response category XX, and 0 otherwise.

The double-bounded model is premised on an important assumption: that the responses to both the initial and the follow-up WTP questions are consistent, or that respondents have the same WTP value in mind when they answer both questions.[2]

Measuring Goodness of Fit

Goodness-of-fit measures are used to assess how well an econometric model explains the observed data, or how well fitted values of the response variable compare to the actual values.[3] There are several options for measuring goodness of fit when using binary discrete choice data: the McFadden pseudo R², the Pearson chi-square test, and the classification procedure (Maddala). Unfortunately, as discussed below, each of these methods is inappropriate for use with the double-bounded logit model.

McFadden Pseudo R²

A popular goodness-of-fit measure for binary discrete response data is the McFadden pseudo R², which is written as

(9)  ρ² = 1 - L_max / L_0

where L_0 is the log-likelihood in the null case (where all coefficients β are assumed equal to zero) and L_max is the log-likelihood at convergence. In the single-bounded model, L_0 is calculated by placing the response proportion n_Y/N, where n_Y equals the total number of yes responses obtained and N equals the sample size, into the likelihood function [equation (3)]. The probability of obtaining a no response is 1 - n_Y/N. Using the logit function, these probabilities correspond to a model with only a constant term.

[1] This expression can be derived using the Hanemann approach, assuming that utility is linear in income, with an error term that is distributed following the extreme value distribution.
[2] An alternative assumption is that respondents change their WTP values between the first and the second question, leading to a sequence of two correlated single-bounded responses. See Cameron and Quiggin, and Alberini, Kanninen, and Carson.
[3] Collett describes several possible reasons that a fitted model might be inadequate: the functional form of the response probability may be misspecified; explanatory variables may be omitted or in need of transformation; the data may contain outliers; or the data may contain influential observations that have an undue impact on the parameter estimation.
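As an illustration, the single-bounded null log-likelihood and pseudo R² just described can be sketched in Python (a minimal sketch under assumed inputs: the 0/1 responses, bid values, and fitted coefficients `alpha` and `beta` below are hypothetical, and covariates are omitted):

```python
import math

def pseudo_r2(yes, bids, alpha, beta):
    """McFadden pseudo R^2 for a single-bounded logit model.

    yes   -- list of 0/1 initial responses
    bids  -- list of bid values offered to each respondent
    alpha, beta -- fitted coefficients (assumed already estimated)
    """
    n = len(yes)
    p_null = sum(yes) / n               # constant-only model fits n_Y / N
    l0 = sum(y * math.log(p_null) + (1 - y) * math.log(1 - p_null)
             for y in yes)              # equation (3) evaluated at the null fit
    lmax = 0.0
    for y, b in zip(yes, bids):
        p = 1.0 / (1.0 + math.exp(-(alpha + beta * b)))   # equation (2)
        lmax += y * math.log(p) + (1 - y) * math.log(1 - p)
    return 1.0 - lmax / l0              # equation (9)
```

As argued next, no analogous null case exists for the double-bounded model, so this measure does not carry over.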

In the null model, the fitted initial-response probability is simply

(10)  P̂^Y = n_Y / N = 1 / (1 + e^{-α̂})

We have already seen that the double-bounded model is premised on the assumption that the responses to the initial WTP question and the follow-up question are consistent. This seemingly innocuous assumption gives the double-bounded model an unusual property: it does not have a standard null case. This property can be illustrated by examining the yes/no or no/yes response probabilities. The probability of obtaining a yes/no response in the null case, for example, would be equal to

(11)  P^YN = 1 / (1 + e^{-α}) - 1 / (1 + e^{-α}) = 0

The null hypothesis that all coefficients are equal to zero implies that the probability of obtaining responses in either of the yes/no or no/yes categories is equal to zero, because the null hypothesis assumes that the bid value has no impact on the response probabilities. But the conditional nature of the value of the follow-up bid in the double-bounded procedure assumes a bid value effect. The nature of the double-bounded procedure therefore rules out the possibility of using the pseudo R² as a measure of goodness of fit.[4]

Pearson Chi-Square

A CV survey typically has several versions, where each version uses a different set of bid values for the WTP questions. In this sense, CV data can be thought of as grouped data, where the responses are grouped by survey version. With grouped data, the standard Pearson chi-square measure for goodness of fit can be used. The Pearson chi-square can be expressed as

(12)  X² = Σ_{i=1}^{N} (O_i - E_i)² / E_i

where N is the number of groups defined, O_i represents the observed frequencies, and E_i represents the predicted frequency for each group and response possibility.

[4] This problem would not occur with the models described in footnote 2 because those models do not impose the structure found in equation (11). For these models, the pseudo R² is an appropriate goodness-of-fit measure.
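A minimal Python sketch of equation (12) follows (the observed and expected cell counts here are hypothetical; in practice the expected counts come from the fitted model):

```python
def pearson_chi_square(observed, expected):
    """Equation (12): sum of (O_i - E_i)^2 / E_i over the cells defined
    by survey version and response possibility."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Toy example: two cells with observed counts 40 and 60 against
# model-predicted counts of 50 and 50.
print(pearson_chi_square([40, 60], [50, 50]))  # -> 4.0
```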
This statistic converges to a χ² distribution (with k - 1 degrees of freedom, where k is the number of categories or survey versions) as the number of observations in each group tends toward infinity. One drawback to this measure is that groups must be defined so that respondents in each group are considered homogeneous. This is possible only when relevant explanatory variables are few and in discrete form. It is easy to see, however, that having even a few explanatory variables that take multiple values can hinder the calculation of this measure when the sample size is modest. For example, if we have five versions of the survey and three variables that each take one of four values, then there are sixty groups to consider.[5] The sample size would have to be quite large in this case to take advantage of the convergence property mentioned above. The Pearson chi-square is therefore of only limited practical use as a goodness-of-fit measure for cases with several covariates and/or modest sample sizes.

Classification Procedure

Another popular approach to goodness of fit is the 2 x 2 classification table that counts the percentages of "hits and misses" obtained when the predicted outcomes are compared with the actual outcomes. Typically, prediction uses the simple rule that an outcome is predicted to be positive when the response probability is greater than 0.5, and negative otherwise. This rule is sufficient when there are only two response possibilities, but double-bounded data have four potential response possibilities. The idea of assigning an observation to the interval that obtains at least a 0.25 predicted probability seems inadequate. In the next section we extend the classification procedure to address the specific nature of the double-bounded model.

Introducing the Sequential Classification Procedure

We propose an extended procedure for measuring goodness of fit with the double-bounded model that explicitly takes the sequential, conditional nature of the double-bounded model into account. Our procedure is to sequentially count the proportion of "fully, correctly classified cases." That is, we count the correctly classified cases with respect to the first question alone, then use only the observations that were correctly classified according to the first question to count the correctly classified cases for the second question. These cases are classified using the fact that the responses are conditional on the (correctly classified) initial responses. The procedure works as follows:

1. For each respondent i, estimate the probabilities of obtaining a yes and a no response to the initial bid using the standard single-bounded probability as determined by the double-bounded parameter estimates

(13)  P_i^Y = 1 / (1 + e^{-(α̂ + β̂ STBID_i)})

2. Allocate each respondent to either the yes or the no group according to the higher of the two probabilities and compare each allocated outcome to the actual outcome.

3. Keep only the correctly classified respondents from step 2. Call these the initially correctly classified cases (ICCC).

4. Estimate the joint (double-bounded) probabilities using equations (4)-(7) for each respondent.

5. For the respondents who answered yes to the initial bid and were classified correctly in step 2, estimate the following conditional probabilities and assign each respondent to the appropriate group (yes/yes or yes/no) based on the higher probability (just as in the initial response in step 2). Because we are conditioning here, we compare only two probabilities for this group:

(14)  P_i^{YY|Y} = P_i^YY / P_i^Y

(15)  P_i^{YN|Y} = P_i^YN / P_i^Y

6. For the respondents who answered no to the initial bid and were classified correctly in step 2, estimate the following conditional probabilities and assign each respondent to the group no/yes or no/no with the higher probability. Again, we compare only two probabilities for this group because of the conditioning. These formulae are as follows:

(16)  P_i^{NY|N} = P_i^NY / P_i^N

(17)  P_i^{NN|N} = P_i^NN / P_i^N

where P_i^N = 1 - P_i^Y.

7. Add the number of respondents classified correctly in steps 5 and 6 to obtain the sum of all fully, correctly classified cases (n). Estimate the percentage of fully, correctly classified cases (FCCC) as

(18)  FCCC = n / N

We propose this measure as the primary measure of goodness of fit for the double-bounded logit model. An important feature of the FCCC is that it correctly accounts for the true, sequential nature of the double-bounded model. Because we are interested in testing the adequacy of the double-bounded model, it is essential that our goodness-of-fit measure correctly evaluates the fit of both WTP responses according to the assumptions of the model. It would be inappropriate, for example, to use the ICCC as an equally valid measure of goodness of fit because this measure does not address the sequential nature of the double-bounded data. We do, however, propose using the ICCC as a secondary measure of goodness of fit. In cases where two FCCC measures are close, it might be appropriate to favor the model that demonstrates greater success in correctly classifying initial responses.

The FCCC/ICCC measures we have proposed are essentially extensions of the standard classification approach, and they have several of the same advantages and disadvantages as that approach. The measures are straightforward, intuitive, and easy to calculate. They are also conservative measures in terms of assessing the fit of a model. Credit is given only for correctly predicted responses, no matter how close a prediction might be to the true probability of a given response.

[5] In our case, as in many cases, numerous covariates are used in the model. Furthermore, covariates often are continuous and offer no easy way of "data grouping." For example, two models were estimated in our study: "detailed" and "simplified." The detailed model contained twenty-five covariates and the simplified model contained ten covariates, making calculation of the Pearson chi-square close to impossible even with our large sample of nearly 3,500 surveys.
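The seven steps above can be sketched as follows (illustrative Python only; the coefficient values, bid values, and the 'YY'/'YN'/'NY'/'NN' outcome coding are assumptions of this sketch, and covariates are omitted for brevity):

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def fccc(data, alpha, beta):
    """Sequential classification procedure for a double-bounded logit.

    data -- list of (stbid, lowbid, highbid, outcome) tuples, where
            outcome is one of 'YY', 'YN', 'NY', 'NN'.
    Returns (ICCC, FCCC) as proportions of the full sample.
    """
    n = len(data)
    iccc = 0      # initially correctly classified cases (step 3)
    full = 0      # fully, correctly classified cases (step 7)
    for stbid, lowbid, highbid, outcome in data:
        p_yes = logistic(alpha + beta * stbid)     # equation (13)
        first = 'Y' if p_yes > 0.5 else 'N'        # step 2
        if first != outcome[0]:
            continue                               # step 3: keep only the ICCC
        iccc += 1
        p_hi = logistic(alpha + beta * highbid)    # equation (4)
        p_lo = logistic(alpha + beta * lowbid)
        if first == 'Y':
            # step 5: conditional P(YY|Y); P(YN|Y) is its complement
            p_yy = p_hi / p_yes                    # equation (14)
            second = 'Y' if p_yy > 0.5 else 'N'
        else:
            # step 6: conditional P(NN|N); P(NY|N) is its complement
            p_nn = (1.0 - p_lo) / (1.0 - p_yes)    # equation (17)
            second = 'N' if p_nn > 0.5 else 'Y'
        if second == outcome[1]:
            full += 1
    return iccc / n, full / n                      # equation (18)
```

Note that each second-stage comparison involves only two conditional probabilities, mirroring the conditioning in steps 5 and 6.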
For example, if the true probability is 0.52, but the model predicts a probability of 0.48, then the observation would be incorrectly classified as a no, even though the model predicted the true probability quite well. There is no theoretical basis for establishing a critical value for a rejection criterion, and because of this, we recommend that the FCCC/ICCC be used primarily to compare two competing models.

Empirical Example

We used the FCCC approach in a water study conducted in California (Barakat & Chamberlin) to estimate people's willingness to pay to avoid certain shortage levels with certain frequencies. The primary purpose of conducting this study was to determine the value residential customers place on water supply reliability, specifically, how much they are willing to pay to avoid water shortages of varying magnitude and frequency. Respondents were asked whether they would vote yes or no in a hypothetical referendum. Respondents were told that if the majority voted yes, water bills would increase by a certain amount and there would be no water shortages. In the spirit of the double-bounded logit, when respondents answered yes, their responses were sought for higher bids, and when they answered no, their responses were sought for lower bids. Willingness to pay varied from $2 to $7 monthly to avoid a 20% shortage once every thirty years to a 50% shortage every twenty years, respectively. Response frequencies are presented in table 1.

Table 1. Frequency of Responses, Percentage of Respondents (n = 3,647 surveys). Initial bids of $5, $10, $15, and $20 were offered, and each respondent fell into one of the categories Yes/Yes, Yes/No, No/Yes, and No/No; the No/Yes category is the most frequent overall, at 27% of respondents.

We modeled WTP using two double-bounded logit models. The first model (which we refer to as the "detailed model") included several survey-collected customer characteristics, i.e., twenty-five covariates. The second model (the "simplified model") used only characteristics available for the population as a whole, i.e., ten covariates. The FCCC approach was used to compare the two models.

We estimated the probabilities of belonging to one of two groups (the yes or no groups) for the first bid. For the observations that were allocated accurately for the first bid, we next estimated the respective probabilities that they belonged to the two groups for the second bid and calculated the FCCC. For an observation to be described as "fully correctly classified," the model had to predict its group membership correctly for both bids; that is, an observation had to be classified correctly for both the first response and the second response. The results of our goodness-of-fit calculations are presented in table 2.

Table 2. FCCC Comparison of Two Double-Bounded Logit Models

                      ICCC      FCCC
Detailed model        59%       35%
Simplified model       -        33%

The results show that the detailed model has only slightly more explanatory power than the simplified model. These results, together with the similarity of the WTP results for the two models, indicate that California water agencies could simply apply the simplified model to estimate WTP rather than going to the extra trouble of using the detailed model.

Appropriate Values for FCCC

We believe that the most appropriate use of the FCCC approach is for evaluation of different models' performance. In other words, as with the standard R² measure, there is no theoretical basis for establishing a rejection criterion for values of the FCCC. However, there is a need to provide some absolute level of judgment regarding the values derived from the FCCC approach. One possibility is to use the "maximum chance" criterion for this purpose (Hair, Anderson, and Tatham, pp. 89-90). This criterion is usually used in discriminant analysis applications as the threshold that the model needs to exceed before it is considered valid. The maximum chance criterion is the percentage of correctly classified cases that would result if all observations were placed in the group with the largest proportion of cases. Table 1 illustrates the use of this criterion. In this case, C_max would be 27% (the group with the highest probability of occurrence).
This percentage reflects the percentage of correctly classified cases that would have resulted if we had simply classified all respondents as the most frequent case: no/yes. Thus, our models must outperform 27%. As table 2 shows, both the detailed and the simplified models do so.

Conclusion

We have presented an intuitive and straightforward approach for measuring goodness of fit for use with double-bounded logit models. Our approach measures the ability of a model to place both WTP responses correctly. Although our FCCC approach is a useful measure of a model's predictive and explanatory power, this approach is inherently conservative. For a model to receive "credit" for a correct prediction using our approach, an observation must be classified correctly on both the first response and the second response. This requirement is considerably more demanding than the requirement for a single-bounded model. For example, while a single-bounded model may yield 60% accurate predictions, an equivalent double-bounded model that correctly predicts 60% of both first and second responses would yield an FCCC measure of only 36%.

Because the FCCC does not converge to a known distribution, our proposed approach does not allow for establishing rejection and nonrejection criteria. We proposed C_max as a benchmark against which to compare FCCC values. However, we believe that the most appropriate use of the FCCC approach is for comparing different models.

[Received July 1994; final revision received June 1995.]

References

Alberini, A., B. Kanninen, and R. Carson. "Random Effect Models of Willingness to Pay Using Discrete Response CV Survey Data." Resources for the Future, Discussion Paper 94-34, 1994.

Barakat & Chamberlin, Inc. "The Value of Water Supply Reliability: Results of a Contingent Valuation Survey," 1994.

Cameron, T.A., and J. Quiggin. "Estimation Using Contingent Valuation Data from a 'Dichotomous Choice with Follow-up' Questionnaire." J. Environ. Econ. and Manage. 27(November 1994):218-34.

Carson, R.T., W.M. Hanemann, and R.C. Mitchell. "Determining the Demand for Public Goods by Simulating Referendums at Different Tax Prices." University of California, San Diego, Department of Economics, 1986.

Collett, D. Modelling Binary Data. London: Chapman and Hall, 1991.

Hair, J.F., R.E. Anderson, and R.L. Tatham. Multivariate Data Analysis. New York: Macmillan, 1984.

Hanemann, W.M. "Welfare Evaluations in Contingent Valuation Experiments with Discrete Responses." Amer. J. Agr. Econ. 66(August 1984):332-41.

Hanemann, W.M., J.B. Loomis, and B.J. Kanninen. "Statistical Efficiency of Double-Bounded Dichotomous Choice Contingent Valuation." Amer. J. Agr. Econ. 73(November 1991):1255-63.

Maddala, G.S. Limited Dependent and Qualitative Variables in Econometrics. Cambridge UK: Cambridge University Press, 1983.