Statistical Models for Enhancing Cross-Population Comparability

A. Tandon, C.J.L. Murray, J.A. Salomon, and G. King

Global Programme on Evidence for Health Policy Discussion Paper No. 42
World Health Organization, Geneva, Switzerland
January 23, 22

1 Introduction

Measuring the health state of individuals is important for the evaluation of health interventions, for monitoring individual health progress, and as a critical step in measuring the health of populations. Self-report responses in household survey data are widely used for assessing the non-fatal health status of populations. These data typically take the form of ordered categorical (ordinal) responses. Over the past three decades, there has been great progress in developing instruments to measure the multiple domains of health that are reliable and demonstrate within-population validity [31],[22].

One key analytical issue is that these self-report ordinal responses are not comparable across populations, primarily because of response category cut-point shifts. The observed responses can be conceptualized as resulting from a mapping between an underlying unobserved latent variable (e.g., ability on the domain of mobility) and the categorical response categories. Cut-points are threshold levels on the latent variable that characterize the transition from one observed categorical response to the next. If cut-points differ systematically across populations, or even across sociodemographic groups within a population, then the observed ordinal responses are not cross-population comparable, since they will not imply the same level on the underlying latent variable that we are trying to measure (Figure 1). Another way of characterizing this problem is that, for the same level of the latent variable on any given domain, the probability of an individual responding in any given response category is different across populations.
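The cut-point mechanism described above can be illustrated with a small sketch. The cut-point values and the latent scale below are hypothetical and chosen only for illustration; they are not taken from the paper or from any survey.

```python
from bisect import bisect_right

LABELS = ["Extreme", "Severe", "Moderate", "Mild", "None"]

def categorize(latent, cutpoints):
    """Return the response category (1..5) implied by a latent value.

    cutpoints holds four increasing thresholds; a latent value lying between
    threshold k-1 and threshold k is reported as category k.
    """
    return bisect_right(cutpoints, latent) + 1

# Hypothetical cut-points for two populations on the same latent mobility scale.
cutpoints_a = [-2.0, -1.0, 0.0, 1.0]
cutpoints_b = [-1.0, 0.0, 1.0, 2.0]   # every threshold shifted upward

latent_mobility = 0.5                 # identical true mobility in both populations
cat_a = categorize(latent_mobility, cutpoints_a)
cat_b = categorize(latent_mobility, cutpoints_b)
print(LABELS[cat_a - 1], LABELS[cat_b - 1])
```

Under these assumed thresholds the same latent mobility is reported in two different categories, which is the sense in which the raw responses are not comparable.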
This issue of cross-population comparability is not limited to health surveys: it is of equal relevance to self-report surveys on responsiveness of health systems, as well as to numerous other questions that rely on ordinal responses. One example of self-report health data comes from the WHO Multi-Country Household Survey Study on Health and Responsiveness [28]. The main self-report question on the domain of mobility is: Overall in the past 3 days, how much difficulty did you have with moving around? Respondents are asked to classify themselves using one of five response categories: 1=Extreme/Cannot do; 2=Severe difficulty; 3=Moderate difficulty; 4=Mild difficulty; 5=No difficulty.

[Figure 1: Mapping from unobserved latent variable to observed response categories. Three panels (A, B, C) place the cut-points differently along the latent mobility scale. N = None, Mi = Mild, Mo = Moderate, S = Severe, E = Extreme.]

We can hypothesize that cut-points may vary between populations because of different cultural or other expectations for domains of health. Cut-points are also likely to vary within a cultural or sociodemographic group. The cut-points for older individuals may shift as their expectations for a domain diminish with age. Men may be more likely to deny declines in health, so that their cut-points may be systematically shifted compared with those of women. Contact with health services may influence expectations for a domain and thus shift cut-points [2].

Empirical examples suggesting cross-population cut-point shifts in health surveys abound [23]. For instance, in Australian national health surveys comparing the self-reported health status of Aboriginals with that of the general population, only around 12% of the Aboriginal population characterized their own health status as fair or poor, while more than 20% of the general population rated their health in these low categories. By any other major indicator of mortality and morbidity, the Aboriginal population fares much worse than the general population, which suggests that there may be important differences in the interpretation of categorical responses in the different sub-populations due to shifts in response category cut-points. Residents of the state of Kerala in India, which has the lowest rates of infant and child mortality and the highest rates of literacy in India, consistently report the highest incidences of morbidity in the country [19].

The object of this document is to elaborate on several statistical models used in the analysis of survey data. First, we focus on off-the-shelf models that are widely available as part of any standard statistical software.
In particular, we demonstrate the problems of inference that arise from these standard methods when the underlying data are not cross-population comparable. In later sections, we introduce methods that modify these standard routines to enhance the cross-population comparability of survey analyses.

2 Models for Analyzing Ordinal Survey Responses

We begin by describing the application of existing statistical models for the analysis of ordinal survey data. These models serve as the building blocks for the methodological innovations introduced in subsequent sections. In particular, the focus is on two off-the-shelf methods: (a) the ordered probit model (widely used by econometricians and other social scientists), and (b) the partial credit model (from psychometrics). Both these models are used in the analysis of ordered categorical response data. The partial credit model is a multiple-category generalization of the Rasch model and is part of a large body of literature often referred to as item response theory (IRT), which has its roots in educational testing using standardized exams.

One needs to be careful, though, in using these standard models in the analysis of data that may not be cross-population comparable. In other words, if there are good reasons to believe that respondents saying they are in good health in Ethiopia and in Denmark mean very different things in terms of an underlying latent variable measure, then the use of these methods without correction may lead to very misleading conclusions regarding the actual levels of health in these two populations.

In order to better demonstrate this point, and to subsequently introduce some methodological innovations dealing with cross-population comparability, a simulated dataset is utilized. The simulated dataset consists of 1,000 respondents each from two hypothetical populations (countries A and B) for which the level of health on a domain, say mobility, is to be estimated based on self-report categorical responses to three questions (one core question and two auxiliary questions). These questions are (footnote 1):

Main Question: Overall in the past 3 days, how much difficulty did you have moving around?
Auxiliary Question 1: Overall in the past 3 days, how much difficulty did you have standing for long periods such as 3 minutes?
Auxiliary Question 2: Overall in the past 3 days, how much difficulty did you have climbing several flights of stairs or walking up a steep hill?

Each of the questions asks the respondents to pick one of five responses:

1 = Extreme/Cannot Do
2 = Severe
3 = Moderate
4 = Mild
5 = None

Since this is simulated data, the true mobility levels are known for each respondent. This enables a comparison of the estimated mobility levels versus truth for the different models.

Footnote 1: The questions mirror those in the WHO Multi-Country Study.

The simulated data are generated based on the assumption that true mobility is a function of age, sex, education, and country of residence for each respondent. An individual-level random effect term is also added to represent other individual-specific unobserved factors that might affect mobility. Table 1 reports the mean age, education level, and sex distribution in the simulated sample.

Table 1: Descriptive statistics (simulated data)

Country   Mean Age   Mean Education   Female   N
A         38.72      4.72             500      1,000
B         38.63      7.33             492      1,000

In addition, the simulation allows cut-points for each question to differ by sociodemographic group. The response category cut-points are generated as functions of age, sex, education, and country of residence. Figure 2 plots the distribution of the simulated observed categorical responses for the three questions for countries A and B (footnote 2). At first glance, the distribution of self-report responses in the two countries does not look very different.

[Figure 2: Distribution of responses for three self-report questions in countries A and B. Six panels (country A and country B, each for the main question, auxiliary question 1, and auxiliary question 2) show the fraction of respondents in each response category, 1 to 5.]

Footnote 2: In generating the categorical responses, a stochastic error term with variances ranging from 15 to 25 units was used (assumed different across questions, with auxiliary question 2 being the noisiest question).

In the next two sub-sections, these data are analyzed using both the ordered probit model and the Rasch-based partial credit model. It is assumed that the data analyst has access to the self-report categorical responses as well as standard demographic variables such as age, sex, education, and country of residence for each of the respondents. The goal is to estimate mobility levels in the two simulated populations using these data. In later

sections, we introduce models that allow response category cut-points also to be functions of covariates. In such models, the direction of shift for the response category cut-points is also of substantive interest (e.g., to test the hypotheses that more educated respondents have higher cut-points indicative of higher norms, or that older individuals respond based on norms for their age category, and so on). Of course, such models can also be used for testing hypotheses relating to causal inferences and other tests of statistical significance.

2.1 The Ordered Probit Model

The ordered probit model assumes there is an unobserved latent variable $Y_i$ (mobility) distributed with mean $\mu_i$ and variance 1, where $i$ refers to the respondent (footnote 3). The mean level of the latent variable is a function of individual-level sociodemographic characteristics such as age, sex, education, and country of residence:

$$Y_i \sim N(\mu_i, 1), \quad i = 1, \ldots, N$$

$$\mu_i = Z_i \beta.$$

Let $y_i$ be the observed categorical response of individual $i$ to the main self-report question. The ordered probit model stipulates an observation mechanism such that:

$$y_i = k \quad \text{if} \quad \tau^{k-1} \le Y_i < \tau^{k}; \quad \text{for } \tau^{0} = -\infty,\ \tau^{5} = \infty,\ \forall i \text{ and } k = 1, \ldots, 5.$$

Also, it follows from the set-up of the model that $\tau^{1} < \tau^{2} < \tau^{3} < \tau^{4}$. Given this structure, the probabilities of responding in any given category $k = 1, \ldots, 5$, conditional on a vector of covariates $Z_i$, can be derived as:

$$\Pr(y_i = k) = \begin{cases}
F(\tau^{1} - Z_i\beta), & k = 1 \\
F(\tau^{2} - Z_i\beta) - F(\tau^{1} - Z_i\beta), & k = 2 \\
F(\tau^{3} - Z_i\beta) - F(\tau^{2} - Z_i\beta), & k = 3 \\
F(\tau^{4} - Z_i\beta) - F(\tau^{3} - Z_i\beta), & k = 4 \\
1 - F(\tau^{4} - Z_i\beta), & k = 5,
\end{cases} \tag{1}$$

where $F(\cdot)$ is the standard normal cumulative distribution function. If the observations are assumed independent across individuals, then the likelihood function is simply the product of the probabilities of observing each value of $y_i$ in the dataset. Estimates of the $\beta$ vector as well as the cut-points $\tau^{k}$ may then be obtained using maximum likelihood methods.
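The category probabilities and the likelihood just described can be sketched in a few lines. This is a minimal illustration rather than the estimation routine used for the results reported here: the cut-points, latent means, and responses are hypothetical, and the function names are introduced only for this example.

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function F."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def category_probs(mu, taus):
    """Pr(y = k) for k = 1..5, given latent mean mu and four increasing cut-points."""
    cdf = [0.0] + [norm_cdf(t - mu) for t in taus] + [1.0]
    return [cdf[k] - cdf[k - 1] for k in range(1, len(cdf))]

def log_likelihood(responses, means, taus):
    """Log-likelihood for independent respondents: sum of log Pr(y_i)."""
    return sum(math.log(category_probs(mu, taus)[y - 1])
               for y, mu in zip(responses, means))

taus = [-1.5, -0.5, 0.5, 1.5]     # hypothetical cut-points, shared by everyone
probs = category_probs(0.0, taus)  # one respondent with latent mean 0
ll = log_likelihood([3, 5, 1], [0.0, 1.0, -1.0], taus)
```

A maximum likelihood routine would search over the coefficient vector and the cut-points to maximize `log_likelihood`; the sketch only evaluates it at fixed values.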
Footnote 3: Since the latent variable is unobserved, the variance of the latent variable conditional on determinants is arbitrarily set to 1 in the ordered probit model. In addition, in order to identify the model, the constant term is set to 0. These conventions produce a scale that is unique up to any positive affine transformation, i.e., the latent scale has so-called interval properties.

It is important to note that the standard ordered probit model assumes the same set of cut-points for the entire sample. Table 2 reports the results from a run of the ordered probit model for our simulated data for the main question in both countries. Figure 3 plots the cut-points estimated from the ordered probit model versus true cut-points for the main question. Because the true cut-points may vary across individuals but

the model assumes that they are fixed, each predicted cut-point is associated with a range of different true values.

Table 2: Estimation results: ordered probit

Variable          Coefficient   (Std. Err.)
Age 30-44         -.79          (.65)
Age 45-59         -.166         (.77)
Age 60+           -.498         (.88)
Male              -.62          (.53)
1 < Educ ≤ 6      .124          (.91)
6 < Educ ≤ 11     .245          (.96)
Educ > 11         .344          (.113)
Country B         -.232         (.56)
τ^1               -1.612        (.12)
τ^2               -1.335        (.1)
τ^3               -1.1          (.98)
τ^4               -.365         (.96)

[Figure 3: Predicted versus true cut-points: ordered probit for main question. Axes: predicted cut-points and true cut-points; separate series for the first, second, third, and fourth cut-point.]

Figure 4 is a plot of true mobility versus estimated average mobility using the standard ordered probit model. As reported in the graph, the R-squared value is only about .11. Not only does the ordered probit model predict the mean mobility poorly, it also predicts that the average mobility is lower in country B (see the coefficient on country B in Table 2) even though the true level of mobility is higher in country B in the simulated data.

The basic point of this simulation experiment is simple: if there are significant cut-point shifts in the underlying data-generating mechanism, then using standard procedures such as the ordered probit model to analyze the data can be very misleading. Since the ordered probit model is a probability model, we can also obtain the predicted probabilities of responding in each of the five categories for the main question, given any particular level on the underlying latent variable scale (Figure 5).

We have used only the main question for analyzing the data using the ordered probit model. One way to analyze

multiple questions using this model would be to pool the data and allow for a dummy variable per question (since the cut-points will be assumed to be the same for all questions). However, doing this will yield a different mean value of the latent variable per question for each individual. Running the model in this way is potentially confusing, since we assume that an individual has a single value on the latent variable of interest that informs answers to all three questions, but this procedure would allow estimates of this latent variable to differ by question.

[Figure 4: Predicted versus true mobility: ordered probit for main question. Axes: predicted mobility and true mobility. R-squared = .11, RMSE = 21.264.]

[Figure 5: Predicted probabilities: ordered probit for main question. Curves Pr(k = 1) through Pr(k = 5) plotted against the latent mobility scale.]

2.2 The Partial Credit Model

A second model that is often used in the analysis of ordinal data is the partial credit model from item response theory. This is basically a polytomous extension of the binary-response

Rasch model [16],[17],[18] (footnote 4). Suppose there are $N$ respondents, each answering $J$ questions on a given domain. Individual $i = 1, \ldots, N$ chooses response category $k = 1, \ldots, 5$ for question $j = 1, \ldots, J$. The partial credit model conceptualizes the ordinal nature of the categorical data as a series of dichotomies or steps (footnote 5). These dichotomies are modeled such that the probability that a respondent chooses response category $k$, given the choice between response category $k-1$ or $k$, is, for $k = 2, \ldots, 5$:

$$\phi_{ij}^{k} = \frac{\Pr(y_{ij} = k)}{\Pr(y_{ij} = k-1) + \Pr(y_{ij} = k)} = \frac{\exp(\beta_i - \delta_j^{k-1})}{1 + \exp(\beta_i - \delta_j^{k-1})}$$

Here, $\Pr(y_{ij} = k)$ is the probability that individual $i$ responds in category $k$ for question $j$, and $\phi_{ij}^{k}$ is the corresponding probability of responding in category $k$ conditional on responding either in category $k-1$ or $k$. $\beta_i$ is the ability of individual $i$, and $\delta_j^{k-1}$ is the difficulty associated with the step from category $k-1$ to category $k$ in question $j$, so that the four steps of a five-category question are indexed $\delta_j^{1}, \ldots, \delta_j^{4}$, as in the expressions that follow. In other words, the probability of responding in category $k$, conditional on responding either in category $k-1$ or $k$, is modeled as a positive function of a person's ability and a negative function of the difficulty for the question category.

Making use of the condition that the probabilities of responding in a category must sum to 1 across all five categories for each individual $i$ and question $j$, i.e.,

$$\Pr(y_{ij} = 1) + \Pr(y_{ij} = 2) + \Pr(y_{ij} = 3) + \Pr(y_{ij} = 4) + \Pr(y_{ij} = 5) = 1,$$

a general expression for the probability of responding in the $k$-th category (where $k = 1, \ldots, 5$) can be derived:

$$\Pr(y_{ij} = k) = \frac{\exp\left[(k-1)\beta_i - \sum_{m=0}^{k-1} \delta_j^{m}\right]}{\sum_{s=1}^{5} \exp\left[(s-1)\beta_i - \sum_{m=0}^{s-1} \delta_j^{m}\right]},$$

where, for notational convenience, $\sum_{m=0}^{0} \delta_j^{m} \equiv 0$.
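The general expression can be evaluated directly. The sketch below is illustrative only: the ability and step difficulties are hypothetical, and `pcm_probs` is a name introduced for this example rather than a routine from any IRT package.

```python
import math

def pcm_probs(beta, deltas):
    """Category probabilities under the partial credit model.

    beta   : ability of the respondent
    deltas : step difficulties for one question (four steps give five categories)
    Category k has exponent (k - 1) * beta minus the sum of the first k - 1 steps.
    """
    exponents = [0.0]                       # category 1: exp(0) = 1
    for d in deltas:
        exponents.append(exponents[-1] + beta - d)
    top = max(exponents)                    # subtract the maximum for numerical stability
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

deltas = [-1.0, -0.5, 0.5, 1.0]             # hypothetical step difficulties
probs = pcm_probs(0.3, deltas)
at_step = pcm_probs(0.5, deltas)            # ability equal to the third step difficulty
```

When ability equals a step difficulty, the two categories on either side of that step are equally likely, which `at_step[2]` and `at_step[3]` show for the third step.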
For the case of five categories, the probabilities of responding in each category can be written as:

$$\Pr(y_{ij} = k) = \begin{cases}
1/A, & k = 1 \\
\exp(\beta_i - \delta_j^{1})/A, & k = 2 \\
\exp(2\beta_i - \delta_j^{1} - \delta_j^{2})/A, & k = 3 \\
\exp(3\beta_i - \delta_j^{1} - \delta_j^{2} - \delta_j^{3})/A, & k = 4 \\
\exp(4\beta_i - \delta_j^{1} - \delta_j^{2} - \delta_j^{3} - \delta_j^{4})/A, & k = 5,
\end{cases} \tag{2}$$

where $A$ is the expression

$$A \equiv 1 + \exp(\beta_i - \delta_j^{1}) + \exp(2\beta_i - \delta_j^{1} - \delta_j^{2}) + \exp(3\beta_i - \delta_j^{1} - \delta_j^{2} - \delta_j^{3}) + \exp(4\beta_i - \delta_j^{1} - \delta_j^{2} - \delta_j^{3} - \delta_j^{4}).$$

Footnote 4: The Rasch model is a fixed-effect logit model and can also be reformulated as a quasi-symmetry loglinear model [27],[8].

Footnote 5: In this sense, the partial credit model can be viewed as an adjacent category logit model.

For a fixed number of questions, the unconditional estimation of the likelihood function yields difficulty parameters that are inconsistent [16],[3]. Consistent estimates of the difficulty parameters can be obtained by conditioning on the raw score (i.e., on the sum of

responses across questions for each individual). So, for example, the conditional probability that a person responds in category 2 for all 3 questions is calculated as the joint probability divided by the probability of getting a raw score $r$ of 6 across the questions:

$$\frac{\Pr(y_{i1} = 2)\,\Pr(y_{i2} = 2)\,\Pr(y_{i3} = 2)}{\Pr(r = 6)}$$

The likelihood written in this manner is free of the ability parameter $\beta$. Once the difficulty parameters have been estimated using the conditional approach, estimates of $\beta_r$ can be obtained using the unconditional likelihood derived from:

$$\Pr(y_{ij} = k) = \begin{cases}
1/A, & k = 1 \\
\exp(\beta_r - \hat{\delta}_j^{1})/A, & k = 2 \\
\exp(2\beta_r - \hat{\delta}_j^{1} - \hat{\delta}_j^{2})/A, & k = 3 \\
\exp(3\beta_r - \hat{\delta}_j^{1} - \hat{\delta}_j^{2} - \hat{\delta}_j^{3})/A, & k = 4 \\
\exp(4\beta_r - \hat{\delta}_j^{1} - \hat{\delta}_j^{2} - \hat{\delta}_j^{3} - \hat{\delta}_j^{4})/A, & k = 5,
\end{cases}$$

where $A$ is the sum of the five numerators. The notation changes to $\beta_r$ because this method requires only one estimate of ability for every possible sum score of responses across all questions.

In the partial credit model, the difficulty parameters are points on the latent variable scale where the probabilities of responding in one category or the next are equal. Alternatively, the difficulty parameters are points where the probability of responding in category $k$, conditional on responding in categories $k-1$ or $k$, is .5. The ability parameters can be thought of as estimates of the individual's underlying latent variable. The estimates of ability levels can be compared to true mobility for the simulated data to assess the performance of this model. This simple version of the partial credit model assumes that the difficulty parameters do not vary by sociodemographic characteristics, which in the language of psychometrics is akin to saying that it assumes there is no differential item functioning.

Table 3 reports the difficulty parameters for the simulated data obtained by running the conditional likelihood procedure in STATA (for identification, $\delta^{1}$ is set to zero for the main question) (footnote 6). Figure 6 plots the estimated ability parameters versus the true mobility.
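The claim that conditioning on the raw score removes the ability parameter can be checked numerically. The sketch below is an illustration under assumed values, not the STATA procedure used for Table 3: the step difficulties are hypothetical and the function names are introduced only for this example. Every response pattern with the same raw score shares the same ability term, so only the ability-free weight of each pattern survives in the ratio.

```python
import math
from itertools import product

def pattern_weight(pattern, deltas_by_question):
    """Ability-free part of a pattern's probability: exp(-sum of the steps passed)."""
    passed = 0.0
    for y, deltas in zip(pattern, deltas_by_question):
        passed += sum(deltas[: y - 1])      # a response in category y passes steps 1..y-1
    return math.exp(-passed)

def conditional_prob(pattern, deltas_by_question):
    """Pr(pattern | raw score), where the raw score is the sum of the responses."""
    n_categories = len(deltas_by_question[0]) + 1
    raw_score = sum(pattern)
    denominator = sum(
        pattern_weight(p, deltas_by_question)
        for p in product(range(1, n_categories + 1), repeat=len(pattern))
        if sum(p) == raw_score
    )
    return pattern_weight(pattern, deltas_by_question) / denominator

# Hypothetical step difficulties for the main question and two auxiliary questions.
example_deltas = [
    [-1.0, -0.5, 0.5, 1.0],
    [-0.5, 0.0, 1.0, 1.5],
    [0.0, 0.5, 1.5, 2.0],
]
p_222 = conditional_prob((2, 2, 2), example_deltas)   # category 2 on all three questions
```

No ability value is passed to `conditional_prob`, which is the point: the conditional likelihood depends on the difficulties alone.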
As with the ordered probit model, Figure 7 reports the predicted probabilities from the model for given values of ability. The predicted probabilities are quite similar to those predicted by the ordered probit model (Figure 5). As the value of the latent variable increases, the probability of responding in the lowest category becomes small and the probability of responding in higher categories increases.

The partial credit model does better than the ordered probit model in predicting the true level of mobility. The R-squared value is much higher than that of the ordered probit model. However, the comparison between the two models is not entirely fair, since we use only one question for the ordered probit model and all three questions in the partial credit model. In the formulation introduced here, the partial credit model uses no extraneous information (i.e., covariates such as sex, age, and education) in the estimation of the abilities.

Footnote 6: Estimates of the difficulty and ability parameters using STATA were of the same magnitude as those obtained using IRT software such as WINMIRA and RUMM.

[Figure 6: Predicted versus true mobility: two-stage partial credit. Axes: predicted mobility and true mobility. R-squared = .221, RMSE = 17.437.]

[Figure 7: Predicted probabilities: two-stage partial credit for main question. Curves Pr(k = 1) through Pr(k = 5) plotted against the latent mobility scale.]

Table 3: Estimation results: two-stage partial credit

Variable            Coefficient   (Std. Err.)
δ^1
  Dummy Aux 1       .27           (.183)
  Dummy Aux 2       1.615         (.178)
δ^2
  Dummy Aux 1       .225          (.186)
  Dummy Aux 2       .723          (.183)
  Main question     -.795         (.267)
δ^3
  Dummy Aux 1       1.277         (.154)
  Dummy Aux 2       1.797         (.151)
  Main question     -.933         (.187)
δ^4
  Dummy Aux 1       -1.267        (.11)
  Dummy Aux 2       1.291         (.131)
  Main question     -.544         (.175)

In the next subsection, we present an alternative specification of the model that includes covariates.

2.3 The Partial Credit Model with Covariates

The partial credit model can be reformulated so that, instead of having a dummy variable per individual ($\beta_i$), variables such as age, sex, education, and country of residence can be introduced. Such a modification to the partial credit model is especially useful in the analysis of health survey data, given that sociodemographic variables are usually collected in such surveys. Equation (2) with covariates can be written as the probability that individual $i$ responds in category $k$ for each of the questions $j$, conditional on a vector of covariates $Z_i$:

$$\Pr(y_{ij} = k) = \begin{cases}
1/A, & k = 1 \\
\exp(Z_i\beta - \delta_j^{1})/A, & k = 2 \\
\exp(2Z_i\beta - \delta_j^{1} - \delta_j^{2})/A, & k = 3 \\
\exp(3Z_i\beta - \delta_j^{1} - \delta_j^{2} - \delta_j^{3})/A, & k = 4 \\
\exp(4Z_i\beta - \delta_j^{1} - \delta_j^{2} - \delta_j^{3} - \delta_j^{4})/A, & k = 5,
\end{cases} \tag{3}$$

where $A$ is the expression

$$A \equiv 1 + \exp(Z_i\beta - \delta_j^{1}) + \exp(2Z_i\beta - \delta_j^{1} - \delta_j^{2}) + \exp(3Z_i\beta - \delta_j^{1} - \delta_j^{2} - \delta_j^{3}) + \exp(4Z_i\beta - \delta_j^{1} - \delta_j^{2} - \delta_j^{3} - \delta_j^{4}).$$

Assuming independence across observations and questions, estimates can be computed using maximum likelihood. The mean predicted level of mobility versus truth is plotted in Figure 8 and the estimates are in Table 4.

Table 4: Estimation results: partial credit with covariates

Variable            Coefficient   (Std. Err.)
Mean
  Age 30-44         -.134         (.24)
  Age 45-59         -.23          (.28)
  Age 60+           -.336         (.32)
  Male              -.77          (.19)
  1 < Educ ≤ 6      .49           (.33)
  6 < Educ ≤ 11     .19           (.34)
  Educ > 11         .16           (.41)
  Country B         -.75          (.2)
δ^1
  Dummy Aux 1       .274          (.185)
  Dummy Aux 2       1.261         (.163)
  Main question     .272          (.144)
δ^2
  Dummy Aux 1       .92           (.185)
  Dummy Aux 2       -.76          (.166)
  Main question     -.747         (.14)
δ^3
  Dummy Aux 1       1.261         (.151)
  Dummy Aux 2       1.247         (.126)
  Main question     -1.124        (.1)
δ^4
  Dummy Aux 1       -1.319        (.19)
  Dummy Aux 2       .746          (.99)
  Main question     -1.22         (.66)

The mean level of the estimated latent variable that is plotted in Figure 8 does not account for the fact that the deterministic variation in the latent variable will be imperfectly captured by the limited set of included covariates. In the absence of a random effect, the model will overestimate the amount of stochastic variability in the data. The next subsection introduces a method for accounting for this by using Bayes' theorem to estimate the predicted mobility.

2.4 Random Effects and Latent Variable Estimation using Bayes' Theorem

If there is an individual-level random effect in the data (i.e., when the covariates in our model do not capture all the systematic variation in the latent variable), then there remains information content in the set of responses across questions for each individual that has not been fully exploited. The partial credit model with covariates and a random effect $\nu_i$ with

mean zero and variance $\sigma_\nu^{2}$ can be written out as follows:

$$\Pr(y_{ij} = k) = \begin{cases}
1/A, & k = 1 \\
\exp[(Z_i\beta + \nu_i) - \delta_j^{1}]/A, & k = 2 \\
\exp[2(Z_i\beta + \nu_i) - \delta_j^{1} - \delta_j^{2}]/A, & k = 3 \\
\exp[3(Z_i\beta + \nu_i) - \delta_j^{1} - \delta_j^{2} - \delta_j^{3}]/A, & k = 4 \\
\exp[4(Z_i\beta + \nu_i) - \delta_j^{1} - \delta_j^{2} - \delta_j^{3} - \delta_j^{4}]/A, & k = 5,
\end{cases} \tag{4}$$

where $A$ is the expression

$$A \equiv 1 + \exp[(Z_i\beta + \nu_i) - \delta_j^{1}] + \exp[2(Z_i\beta + \nu_i) - \delta_j^{1} - \delta_j^{2}] + \exp[3(Z_i\beta + \nu_i) - \delta_j^{1} - \delta_j^{2} - \delta_j^{3}] + \exp[4(Z_i\beta + \nu_i) - \delta_j^{1} - \delta_j^{2} - \delta_j^{3} - \delta_j^{4}].$$

[Figure 8: Predicted versus true mobility: partial credit with covariates. Axes: predicted mobility and true mobility. R-squared = .55, RMSE = 2.787.]

In order to exploit the information content in the set of responses, we can make use of Bayes' theorem to obtain estimates of the mean level of mobility conditional on the observed set of responses. That is, we can estimate $\Pr(\mu_i \mid y_i)$ using Bayes' formula:

$$\Pr(\mu_i \mid y_i) = \frac{\Pr(y_i \mid \mu_i)\,\Pr(\mu_i)}{\int \Pr(y_i \mid \mu_i)\,\Pr(\mu_i)\,d\mu_i}, \tag{5}$$

where $y_i$ represents the vector of categorical responses on all questions for individual $i$. This can be implemented as follows. First, we use the model with a random effect and estimate all the parameters, including the variance of the random effect. This estimate of the variance can be used to simulate 1 different values of $\mu_i$ around the predicted $Z_i\beta$ of the latent variable for each individual in the sample. Hence, for each simulated value of $\mu_i$, $\Pr(\mu_i)$ can be calculated. $\Pr(y_i \mid \mu_i)$ can be calculated using the probability specifications given in equation (4). Integrating over all simulated values of $\mu_i$ for each individual gives us the denominator of equation (5).

In the absence of a model that estimates the variance of this individual-specific random effect, one can assume that the random effect captures about 5% of the variation

in estimated variance of the error term. Under this assumption, the Bayesian prediction of mobility conditional on the observed pattern of responses is plotted in Figure 9 for the partial credit model with covariates (footnote 7).

[Figure 9: Predicted versus true mobility: partial credit with covariates (Bayesian). Axes: predicted mobility and true mobility. R-squared = .334, RMSE = 17.441.]

It is quite remarkable that the Bayesian correction significantly improves the estimation of mobility (Figure 9) when compared with the estimation of abilities using the two-step conditional procedure for the partial credit model (Figure 6), as judged by the R-squared values. In other words, if the goal of the analyst is to estimate the underlying latent variable, then a modification of the partial credit model that allows for covariates and a random effect outperforms the simple version of the partial credit model.

2.5 Ordered Probit versus Partial Credit

We have introduced two basic types of models that are widely used in the analysis of categorical data, namely the ordered probit model and the partial credit model (with ability dummies and with covariates). Fundamentally, both models assume some sort of latent variable that gives rise to an observation mechanism governed by the probabilities given in equations (1) and (2). Viewed this way, the two models are quite similar, differing only in the functional form of the data-generating mechanism and in their approach to modeling the probabilities: the ordered probit model works with the cumulative probability function, whereas the partial credit model focuses on adjacent categories.
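The Bayesian step of Section 2.4 (the posterior for an individual's latent value given the responses to all questions) can be sketched as follows. This is a rough illustration under stated assumptions, not the procedure used for Figure 9: it replaces the simulated draws with an evenly spaced grid so that the result is reproducible, it assumes a normal prior centred on the covariate prediction, and the step difficulties, prior mean, and prior standard deviation are hypothetical.

```python
import math

def pcm_probs(mu, deltas):
    """Partial credit category probabilities for latent value mu (covariate mean plus random effect)."""
    exponents = [0.0]
    for d in deltas:
        exponents.append(exponents[-1] + mu - d)
    top = max(exponents)
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_mean(responses, deltas_by_question, prior_mean, prior_sd,
                   n_points=201, width=4.0):
    """Posterior mean of the latent value, integrating numerically over a grid
    spanning prior_mean +/- width * prior_sd."""
    numerator = denominator = 0.0
    for g in range(n_points):
        mu = prior_mean + prior_sd * width * (2.0 * g / (n_points - 1) - 1.0)
        likelihood = 1.0
        for y, deltas in zip(responses, deltas_by_question):
            likelihood *= pcm_probs(mu, deltas)[y - 1]     # Pr(y_i | mu)
        weight = likelihood * normal_pdf(mu, prior_mean, prior_sd)
        numerator += mu * weight
        denominator += weight                               # approximates the integral
    return numerator / denominator

# Hypothetical step difficulties for three questions.
example_deltas = [
    [-1.0, -0.5, 0.5, 1.0],
    [-0.5, 0.0, 1.0, 1.5],
    [0.0, 0.5, 1.5, 2.0],
]
low = posterior_mean((1, 1, 1), example_deltas, prior_mean=0.0, prior_sd=1.0)
mid = posterior_mean((3, 3, 3), example_deltas, prior_mean=0.0, prior_sd=1.0)
high = posterior_mean((5, 5, 5), example_deltas, prior_mean=0.0, prior_sd=1.0)
```

Two respondents with identical covariates share the same prior mean, yet their posterior means differ once their response patterns differ, which is the extra information the correction exploits.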
Footnote 7: We have developed working versions of the models with random effects. However, they are very slow to run and we are currently trying to improve the speed of estimation.

Apart from poor predictions of the underlying latent variable, both the ordered probit and the partial credit models suffer from the problem that one cannot allow the response category cut-points (the $\tau$'s), or the so-called difficulty parameters (the $\delta$'s), to be functions of the same covariates as the mean value of the latent variable. This is because there will be a clear

identification problem if one does so: in the absence of additional exogenous information, neither model will be able to detect whether the effects of the covariates are on the mean value of the latent variable or on the cut-points or difficulties. This is easy to see from the equations for the predicted probabilities [equations (1) and (2)]. This is likely to be a serious shortcoming of both models in estimating cross-population comparable differences in the latent variable of interest. In simple terms, these models do not allow for a world in which the Danish not only have a higher health status, but also have different expectations for their health status relative to Ethiopians.

In the next section, we introduce an innovation to both the ordered probit and partial credit models that allows for the introduction of exogenous information in the form of vignettes. Analyzing the self-report questions in conjunction with responses to vignettes allows us to identify the model such that the same set of covariates can be used to assess differences in the mean level of the underlying latent variable as well as in cut-points or difficulties.

3 Vignettes

We now introduce the use of vignettes as a means of correcting self-report responses in order to make them cross-population comparable. A vignette is a description of a concrete level of ability on a given domain that respondents are asked to evaluate with relation to the same main question and on the same categorical response scale as the main self-report question [24]. The vignette fixes the level of ability such that variations in categorical responses are attributable to variations in response category cut-points. This introduction of exogenous information in the form of responses to vignettes allows us to identify the effects of a set of sociodemographic covariates (such as age, sex, education, country of residence, etc.)
on both the level of the underlying latent variable that is being estimated as well as on the cut-points (in the ordered probit version of the model) and difficulties (in the partial credit version of the model) (footnote 8).

Footnote 8: An alternative method to set a comparable scale such that response category cut-point differences can be recovered is to use measured tests [26].

In the WHO Multi-Country Study, there are six vignettes for the domain of mobility, each designed to capture a different level of ability on this domain. The vignettes are:

Vignette 1: [Paul] is an active athlete who runs long distance races of 2 kilometers twice a week and engages in soccer with no problems.
Vignette 2: [Mary] has no problems with moving around or using her hands, arms and legs. She jogs 4 kilometers twice a week without any problems.
Vignette 3: [Rob] is able to walk distances of up to 2 meters without any problems but feels breathless after walking one kilometer or climbing up more than one flight of stairs. He has no problems with day-to-day physical activities, such as carrying food from the market.

Vignette 4: [Margaret] feels chest pain and gets breathless after walking distances of up to 2 meters, but is able to do so without assistance. Bending and lifting objects such as groceries produces pain.
Vignette 5: [Louis] is able to move his arms and legs, but requires assistance in standing up from a chair or walking around the house. Any bending is painful and lifting is impossible.
Vignette 6: [David] is paralyzed from the neck down. He is confined to bed and must be fed and bathed by somebody else.

Respondents are asked to classify each of these vignettes on the same five-point response category scale as the main question. So, for each individual, we not only have categorical responses to their self-report main question and several auxiliary questions, but we also have their categorical responses to a set of vignettes (ranging in number from six to eight across the different domains for health and responsiveness in the WHO Multi-Country Study).

In order to introduce statistical models designed around the use of vignettes, we have extended the simulated data set to include hypothetical ratings of seven mobility vignettes in countries A and B. We do so by assigning true mobility scores to the different vignettes and assuming that individuals use the categorical response scale the same way in assessing vignettes as they do in assessing their own levels of mobility on the main question. This assumption is critical for the estimation of the models, as discussed below.

The simulated vignette ratings for the two countries are summarized in Figures 10 and 11. Each graph shows the distribution of categorical responses for the set of vignettes (lighter colors signifying worse responses). The vignettes are ranked from 1 to 7 in decreasing order of ability: i.e., vignette 1 refers to a higher level of mobility than vignette 2, vignette 2 to a higher level than vignette 3, and so on.
From these graphs, it is clear that there are important differences in the cut-points between country A and country B. At lower levels of mobility, respondents in country B are more likely to characterize a vignette unfavorably than respondents in country A. In addition, the compression of the middle categories in country B suggests cut-points that are more narrowly spaced than those in country A. The types of variation in vignette ratings that we have generated in the simulated dataset closely parallel the variation observed in actual data from the WHO Multi-Country Survey Study. In a later section, we show the response distributions for China versus those for India for mobility vignettes.

In the following sections we describe how variants of the ordered probit model and partial credit model may be used in conjunction with vignette ratings in order to characterize these systematic cut-point differences more precisely. Both models are modified such that: (a) information from responses to vignettes is introduced in the likelihood function, and (b) cut-points and difficulties are allowed to be functions of the same covariates as those used in the estimation of the mean value of the latent variable.

3.1 Hierarchical Ordered Probit Model (HOPIT)

The hierarchical ordered probit (HOPIT) model is a modification of the standard ordered probit model described earlier. In order to incorporate information on vignette ratings and

[Figure 10: Distribution of vignette responses for country A. Horizontal axis: vignette (1 to 7); vertical axis: percentage of responses.]

[Figure 11: Distribution of vignette responses for country B. Horizontal axis: vignette (1 to 7); vertical axis: percentage of responses.]

multiple questions, the expanded model has several components to the likelihood function: the first component refers to estimation of cut-points using responses to vignettes, and the second component utilizes responses on the self-report main question. The remaining components are for auxiliary questions.

In formal terms, the first component of the likelihood function assumes there is an unobserved latent variable $Y^{v}_{ij}$ distributed with mean $\mu^{v}_{ij}$ and variance 1. Here, $i$ refers to the respondent, $j$ refers to the vignette number, and the $v$ superscript indicates that this refers to the vignette component of the model. In mathematical terms,

$$Y^{v}_{ij} \sim N(\mu^{v}_{ij}, 1), \quad i = 1, \ldots, N; \; j = 1, \ldots, V$$

$$\mu^{v}_{ij} = J_i'\alpha,$$

where $J_i$ is a vector of indicator variables for each of $V-1$ vignettes. Letting $y^{v}_{ij}$ denote the observed categorical response by individual $i$ to vignette $j$, the observation mechanism is defined as follows:

$$y^{v}_{ij} = k \ \text{ if } \ \tau^{k-1}_i \le Y^{v}_{ij} < \tau^{k}_i; \quad \text{for } \tau^{0}_i = -\infty, \ \tau^{5}_i = \infty, \ \forall\, i, j \ \text{ and } \ k = 1, \ldots, 5.$$

In addition, the cut-points are allowed to be functions of covariates:

$$\tau^{k}_i = X_i'\gamma^{k}.$$

As before, $\tau^{1}_i < \tau^{2}_i < \tau^{3}_i < \tau^{4}_i$.

The second component of the likelihood function utilizes information from the respondent's main self-report question (the one that is tied to the vignettes) and assumes there is an unobserved latent variable $Y^{s}_i$ distributed with mean $\mu^{s}_i$ and variance $\sigma^2$. Here, the $s$ superscript indicates that this component refers to self-report questions. This formulation is slightly different from the standard ordered probit model: since we are allowing the vignettes to drive the cut-point estimation, this second component of the likelihood function has more in common with an interval regression model (i.e., an ordered probit model with known cut-points).
Since the cut-point estimation is being driven by vignettes and the scale is set by the first estimation component, we are now able to obtain estimates of the variance of the latent variable (i.e., there is no need to set the variance equal to 1 as before). In mathematical terms, the model is:

$$Y^{s}_i \sim N(\mu^{s}_i, \sigma^2), \quad i = 1, \ldots, N$$

$$\mu^{s}_i = Z_i'\beta.$$

Let $y^{s}_i$ be the observed categorical responses on the self-report such that:

$$y^{s}_i = k \ \text{ if } \ \tau^{k-1}_i \le Y^{s}_i < \tau^{k}_i; \quad \text{for } \tau^{0}_i = -\infty, \ \tau^{5}_i = \infty, \ \forall\, i \ \text{ and } \ k = 1, \ldots, 5.$$

Similarly for the auxiliary questions, let $a^{j}_i$ be the observed categorical responses on the $j$-th auxiliary question such that:

$$a^{j}_i = k \ \text{ if } \ \tau^{j,k-1}_i \le Y^{s}_i < \tau^{j,k}_i; \quad \text{for } \tau^{j,0}_i = -\infty, \ \tau^{j,5}_i = \infty, \ \forall\, i \ \text{ and } \ k = 1, \ldots, 5$$

and

$$\tau^{j,k}_i = X_i'\gamma^{j,k}.$$
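As a rough illustration of how the vignette and self-report components share the same cut-points, the following Python sketch evaluates one respondent's contribution to the log-likelihood. It is a minimal sketch under stated assumptions, not the estimation code used for this paper: the function names and all parameter values are hypothetical, the first vignette is treated as the reference category with mean zero, and the auxiliary-question components are omitted.

```python
import math

def norm_cdf(x):
    """Standard normal CDF; math.erf also handles +/- infinity."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def cut_points(x, gammas):
    """tau^k = x'gamma^k for k = 1..4, padded with tau^0 = -inf and tau^5 = +inf."""
    taus = [sum(xi * g for xi, g in zip(x, gamma)) for gamma in gammas]
    if any(a >= b for a, b in zip(taus, taus[1:])):
        raise ValueError("cut-points must be strictly increasing")
    return [-math.inf] + taus + [math.inf]

def category_prob(k, mu, sigma, taus):
    """Pr(tau^{k-1} <= Y < tau^k) for Y ~ N(mu, sigma^2), with k in 1..5."""
    upper = norm_cdf((taus[k] - mu) / sigma)
    lower = norm_cdf((taus[k - 1] - mu) / sigma)
    return upper - lower

def respondent_loglik(vignette_ratings, self_rating, x, z,
                      vignette_means, beta, gammas, sigma):
    """Sum of the vignette and main-question log-likelihood terms for one respondent."""
    taus = cut_points(x, gammas)
    ll = 0.0
    # Vignette component: one mean per vignette, variance fixed at 1.
    for mean_j, k in zip(vignette_means, vignette_ratings):
        ll += math.log(category_prob(k, mean_j, 1.0, taus))
    # Self-report component: mean z'beta, free variance, same cut-points.
    mu_s = sum(zi * b for zi, b in zip(z, beta))
    ll += math.log(category_prob(self_rating, mu_s, sigma, taus))
    return ll

# Hypothetical values, for illustration only.
x = [1.0, 1.0]                      # intercept and a country indicator
z = [1.0, 1.0]
gammas = [[-2.0, 0.5], [-1.0, 0.5], [0.0, 0.5], [1.0, 0.5]]  # one gamma^k per cut-point
vignette_means = [0.0, -0.5, -1.5]  # first vignette is the reference
beta = [0.2, 0.3]
ll = respondent_loglik([5, 3, 2], 4, x, z, vignette_means, beta, gammas, sigma=1.5)
print(ll)
```

Summing this quantity over respondents gives a function that a general-purpose optimizer could maximize over the vignette means, `beta`, the `gammas`, and `sigma`.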

[Figure 12: Predicted versus true mobility: HOPIT. Axes: predicted mobility (horizontal), true mobility (vertical). R-squared = .495; RMSE = 15.186.]

It is assumed that $Y^{s}_i$ and $Y^{s}_{i'}$ are independent for all $i \neq i'$, conditional on $X_i$. $Y^{v}_{ij}$ and $Y^{s}_i$ are independent for all $i, j$, conditional on $X_i$, $J_i$ and $Z_i$. The probabilities associated with the observed responses to vignettes, the main question, and the auxiliary questions can be computed as in equation (1) with the adjustment for cut-point shifts being functions of covariates. The likelihood function can be written using these probabilities as three separate components. The three components of the likelihood function are additive in logs and can be jointly maximized to yield the parameter estimates.

There is explicit parametric dependence between the different components of the likelihood function. The cut-points to be estimated from the vignettes component are the same as those in the main question component. In addition, $\mu^{s}_i$ is the same for both the main question and all the auxiliary questions. This ensures that the estimated cut-points for both the main question and the auxiliary questions are on the same scale to enable meaningful comparisons.

Tables 5 to 9 in the Annex report the results of the estimation. Figure 12 plots the estimates of the mean level versus truth. The R-squared for the prediction has improved when compared with the simple ordered probit model as well as with the partial credit models with and without covariates. Figure 13 reports the true versus estimated cut-points for the main question. These differ by sociodemographic group in that they are also functions of the same covariates (age, sex, education, and country of residence) as the mean level of mobility. As can be seen, the model is able to recover the cut-point differences quite well. Figures 14 and 15 report the comparison of estimated cut-points to truth for the two auxiliary questions. The recovery here is not quite as good as that for the main question.
This is to be expected since the information in the vignettes is directly driving the main question cut-points, whereas the estimation of the cut-points for the auxiliary questions is more indirect and is not anchored to the cut-points derived from vignette responses. The estimation of the latent variable using Bayes' theorem (Figure 16) improves the R-squared substantially, yielding estimates of mobility that are quite close to the true mobility levels in the underlying simulated data.
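A minimal sketch of one way such a Bayesian estimate could be computed is given below. It rests on assumptions that are not confirmed by the text: the fitted distribution $N(\mu^{s}_i, \sigma^2)$ is treated as the prior for the latent variable, and each observed response (main or auxiliary question) restricts the latent variable to the interval between two estimated cut-points, so that the posterior is a normal distribution truncated to the intersection of those intervals. The function names and numerical values are hypothetical.

```python
import math

def norm_pdf(x):
    """Standard normal density; evaluates to 0.0 at +/- infinity."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    """Standard normal CDF; math.erf also handles +/- infinity."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def posterior_mean(mu, sigma, intervals):
    """Mean of N(mu, sigma^2) truncated to the intersection of (lower, upper) intervals.

    Each interval holds the two cut-points that bracket one observed response.
    """
    lo = max(lower for lower, _ in intervals)
    hi = min(upper for _, upper in intervals)
    if lo >= hi:
        raise ValueError("the responses imply an empty interval")
    a = (lo - mu) / sigma
    b = (hi - mu) / sigma
    mass = norm_cdf(b) - norm_cdf(a)  # prior probability of the interval
    return mu + sigma * (norm_pdf(a) - norm_pdf(b)) / mass

# Hypothetical respondent: prior mean -1.0, standard deviation 1.5, and two
# responses whose bracketing cut-points overlap on (-0.5, 0.2).
estimate = posterior_mean(-1.0, 1.5, [(-0.5, 0.5), (-1.0, 0.2)])
print(estimate)
```

Under these assumptions the estimate always lies inside the interval implied by the responses, which is how the observed answers pull the prediction away from the covariate-based mean.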

[Figure 13: Predicted versus true cut-points: HOPIT main question. Series: first, second, third, and fourth cut-points. Axes: predicted cut-points (horizontal), true cut-points (vertical).]

[Figure 14: Predicted versus true cut-points: HOPIT auxiliary question 1. Series: first, second, third, and fourth cut-points. Axes: predicted cut-points (horizontal), true cut-points (vertical).]

[Figure 15: Predicted versus true cut-points: HOPIT auxiliary question 2. Series: first, second, third, and fourth cut-points. Axes: predicted cut-points (horizontal), true cut-points (vertical).]

[Figure 16: Predicted versus true mobility: HOPIT (Bayesian). Axes: predicted mobility (horizontal), true mobility (vertical). R-squared = .729; RMSE = 11.139.]

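3.2 Hierarchical Partial Credit Model

By analogy with the HOPIT model, we implement the use of vignettes in exactly the same way for the Rasch-based partial credit model. We allow responses to vignettes to set the difficulty levels and estimate differences across sociodemographic groups in the first component of the likelihood function. In the other components of the likelihood, we utilize information from the main and auxiliary questions. The logic is the same as before: we use information on difficulty parameters from responses to vignettes so that covariates can affect both the mean level of the estimated latent variable and the difficulty parameters.

For all the vignette questions, i.e., for $j = 1, \ldots, V$:

$$\Pr(y^{v}_{ij} = k) = \begin{cases}
1/A, & k = 1 \\
\exp(J_i'\alpha - \delta^{1}_i)/A, & k = 2 \\
\exp(2J_i'\alpha - \delta^{1}_i - \delta^{2}_i)/A, & k = 3 \\
\exp(3J_i'\alpha - \delta^{1}_i - \delta^{2}_i - \delta^{3}_i)/A, & k = 4 \\
\exp(4J_i'\alpha - \delta^{1}_i - \delta^{2}_i - \delta^{3}_i - \delta^{4}_i)/A, & k = 5,
\end{cases} \tag{6}$$

where $J_i$ is a vector of indicator variables for each of $V-1$ vignettes, and $A$ is the expression

$$A \equiv 1 + \exp(J_i'\alpha - \delta^{1}_i) + \exp(2J_i'\alpha - \delta^{1}_i - \delta^{2}_i) + \exp(3J_i'\alpha - \delta^{1}_i - \delta^{2}_i - \delta^{3}_i) + \exp(4J_i'\alpha - \delta^{1}_i - \delta^{2}_i - \delta^{3}_i - \delta^{4}_i)$$

and

$$\delta^{k}_i = X_i'\beta^{k}.$$

Similarly, the probabilities for the main question (the one which is tied to the vignettes) are:

$$\Pr(y^{s}_{i} = k) = \begin{cases}
1/A, & k = 1 \\
\exp(Z_i'\beta - \delta^{1}_i)/A, & k = 2 \\
\exp(2Z_i'\beta - \delta^{1}_i - \delta^{2}_i)/A, & k = 3 \\
\exp(3Z_i'\beta - \delta^{1}_i - \delta^{2}_i - \delta^{3}_i)/A, & k = 4 \\
\exp(4Z_i'\beta - \delta^{1}_i - \delta^{2}_i - \delta^{3}_i - \delta^{4}_i)/A, & k = 5,
\end{cases} \tag{7}$$

where $Z_i$ is a vector of individual-level covariates, and $A$ is the expression

$$A \equiv 1 + \exp(Z_i'\beta - \delta^{1}_i) + \exp(2Z_i'\beta - \delta^{1}_i - \delta^{2}_i) + \exp(3Z_i'\beta - \delta^{1}_i - \delta^{2}_i - \delta^{3}_i) + \exp(4Z_i'\beta - \delta^{1}_i - \delta^{2}_i - \delta^{3}_i - \delta^{4}_i).$$

And for the $j$-th auxiliary question:

$$\Pr(y^{s}_{ij} = k) = \begin{cases}
1/A, & k = 1 \\
\exp(Z_i'\beta - \delta^{1}_{ij})/A, & k = 2 \\
\exp(2Z_i'\beta - \delta^{1}_{ij} - \delta^{2}_{ij})/A, & k = 3 \\
\exp(3Z_i'\beta - \delta^{1}_{ij} - \delta^{2}_{ij} - \delta^{3}_{ij})/A, & k = 4 \\
\exp(4Z_i'\beta - \delta^{1}_{ij} - \delta^{2}_{ij} - \delta^{3}_{ij} - \delta^{4}_{ij})/A, & k = 5,
\end{cases} \tag{8}$$

where $Z_i$ is a vector of individual-level covariates, and $A$ is the expression

$$A \equiv 1 + \exp(Z_i'\beta - \delta^{1}_{ij}) + \exp(2Z_i'\beta - \delta^{1}_{ij} - \delta^{2}_{ij}) + \exp(3Z_i'\beta - \delta^{1}_{ij} - \delta^{2}_{ij} - \delta^{3}_{ij}) + \exp(4Z_i'\beta - \delta^{1}_{ij} - \delta^{2}_{ij} - \delta^{3}_{ij} - \delta^{4}_{ij}).$$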
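Equations (6) to (8) share one functional form: the exponent for category $k$ accumulates the latent level minus the difficulty over the $k-1$ steps below it, and $A$ normalizes over the five categories. The Python sketch below computes these probabilities for a given latent level and set of difficulties. The function name and the numerical inputs are hypothetical, and subtracting the largest exponent before exponentiating is a numerical-stability choice that does not change the resulting probabilities.

```python
import math

def partial_credit_probs(theta, deltas):
    """Return [Pr(y = 1), ..., Pr(y = K)] for latent level theta.

    deltas holds the K - 1 difficulty parameters delta^1, ..., delta^{K-1}.
    The exponent for category k is (k - 1) * theta - (delta^1 + ... + delta^{k-1}).
    """
    exponents = [0.0]  # category 1 has numerator exp(0) = 1
    for delta in deltas:
        exponents.append(exponents[-1] + theta - delta)
    largest = max(exponents)
    weights = [math.exp(e - largest) for e in exponents]
    total = sum(weights)  # plays the role of A, up to the common scale factor
    return [w / total for w in weights]

# Hypothetical inputs: theta stands in for J'alpha (vignettes) or Z'beta
# (main and auxiliary questions); deltas stand in for the four difficulties.
probs = partial_credit_probs(0.4, [-1.0, -0.2, 0.5, 1.3])
print(probs)
```

Because the same function serves all three equations, only the latent level and the difficulty vector change between the vignette, main-question, and auxiliary-question components.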

[Figure 17: Predicted versus true mobility: partial credit model. Axes: predicted mobility (horizontal), true mobility (vertical). R-squared = .487; RMSE = 15.287.]

[Figure 18: Predicted versus true mobility: partial credit model (Bayesian). Axes: predicted mobility (horizontal), true mobility (vertical). R-squared = .683; RMSE = 12.19.]

Tables 10 to 14 in the Annex report the results of this estimation. Figures 17 and 18 show the predicted mobility versus the true mobility before and after the Bayesian correction. The R-squared values obtained from the hierarchical partial credit model for predicted mobility are similar in magnitude to the pre-Bayesian estimates obtained using the HOPIT model. The post-Bayesian R-squared appears to be slightly higher for HOPIT than for the hierarchical partial credit model. This may result from the fact that the hierarchical partial credit model, in the way we have formulated it, does not estimate the variance of the stochastic term. This constraint prevents the model from fitting the data as well as it could if the variance were included as a parameter.

4 Goodness-of-Fit

Assessing goodness-of-fit for categorical data is not straightforward. One can compute a simple count-$R^2$, which is a measure of the proportion of correct responses obtained for a given sample. For ordinal data, the predicted categorical response would be the one associated with the maximum predicted probability. Other options include a pseudo-$R^2$ measure, which in software such as Stata is a likelihood-based comparison of the model with all the parameters to one with only the intercept [12]. Rasch-based models use measures of fit such as outfit and infit: outfit is a chi-square test based on the sum of the standardized deviation of observed versus expected values of a response. Infit is also a chi-square test, which utilizes an information-weighted sum by adjusting for extreme responses using weights [32].

In order to assess model fit, a standard likelihood ratio test can be used. This test compares the log-likelihood value of the full model with that of a constrained version of the same model (i.e., a model that is nested within the full model) to assess the contribution of the dropped covariates to the likelihood function. Assume $L_0$ is the log-likelihood value associated with the full model and $L_1$ is the log-likelihood value of the constrained model. Then, under the null hypothesis that the constraints hold, $-2(L_1 - L_0)$ is distributed $\chi^2$ with $d_0 - d_1$ degrees of freedom, where $d_0$ and $d_1$ are the model degrees of freedom associated with the full and the constrained models, respectively [12].

5 Unidimensionality

Both the HOPIT model and the Rasch-based models in IRT assume some form of unidimensionality. In formal terms, unidimensionality can be defined as the assumption that any dependence between different questions tapping into a given domain is solely due to the existence of a single underlying latent trait. Tests of unidimensionality are often based on uncovering this assumed factor that underlies observed responses to multiple questions.
Mathematically, the assumption of unidimensionality can be worked out by assuming responses to all questions on a given domain are tapping this latent trait. In the WHO Multi-Country Study, test-retest data are available from a subsample of respondents who were revisited and administered the survey questionnaire for a second time. This availability of test-retest data can be used to design a test of unidimensionality.

Suppose we get latent variable estimates from two separate questions on any given domain, $Y_1$ and $Y_2$. Each of these estimates of the latent variable represents some measure of truth with error. That is, if truth were denoted by $Y_{\text{true}}$, then:

$$Y_{1,\text{test}} = Y_{\text{true}} + \epsilon_{1,\text{test}}$$

$$Y_{1,\text{retest}} = Y_{\text{true}} + \epsilon_{1,\text{retest}}$$

and

$$Y_{2,\text{test}} = Y_{\text{true}} + \epsilon_{2,\text{test}}$$

$$Y_{2,\text{retest}} = Y_{\text{true}} + \epsilon_{2,\text{retest}}$$

Here, $\epsilon_1$ and $\epsilon_2$ are the question-specific error terms for both test and retest questions, $\epsilon_1 \sim N(0, \sigma^2_{\epsilon_1})$, $\epsilon_2 \sim N(0, \sigma^2_{\epsilon_2})$. The correlation coefficient $\rho$ between the measured $Y$s is:

$$\rho = \frac{\operatorname{cov}(Y_1, Y_2)}{\sigma_{Y_1}\,\sigma_{Y_2}} \tag{9}$$

Rewriting (9),

$$\rho = \frac{\operatorname{cov}(Y_1, Y_2)}{\sqrt{\sigma^2_{Y_{\text{true}}} + \sigma^2_{\epsilon_1}}\;\sqrt{\sigma^2_{Y_{\text{true}}} + \sigma^2_{\epsilon_2}}} \tag{10}$$

Similarly, writing $\rho_{\text{true}}$ for the correlation of the true latent variable with itself,

$$\rho_{\text{true}} = \frac{\operatorname{cov}(Y_{\text{true}}, Y_{\text{true}})}{\sigma_{Y_{\text{true}}}\,\sigma_{Y_{\text{true}}}} \tag{11}$$

Dividing (11) by (10),

$$\frac{\rho_{\text{true}}}{\rho} = \frac{\sqrt{\sigma^2_{Y_{\text{true}}} + \sigma^2_{\epsilon_1}}\;\sqrt{\sigma^2_{Y_{\text{true}}} + \sigma^2_{\epsilon_2}}}{\sigma_{Y_{\text{true}}}\,\sigma_{Y_{\text{true}}}},$$

since $\operatorname{cov}(Y_1, Y_2) = \operatorname{cov}(Y_{\text{true}}, Y_{\text{true}})$ if the error terms are assumed to be uncorrelated. Therefore,

$$\rho\,\sqrt{\frac{\sigma^2_{Y_{\text{true}}} + \sigma^2_{\epsilon_1}}{\sigma^2_{Y_{\text{true}}}}}\;\sqrt{\frac{\sigma^2_{Y_{\text{true}}} + \sigma^2_{\epsilon_2}}{\sigma^2_{Y_{\text{true}}}}} = \rho_{\text{true}} = 1,$$

where $\sigma^2_{\epsilon_i} = \operatorname{var}(Y_{i,\text{test}} - Y_{i,\text{retest}})/2$ for $i = 1, 2$. Given that both $\sigma^2_{Y_{\text{true}}} = \operatorname{cov}(Y_1, Y_2)$ and $\rho$ are observed, the left-hand side of the above expression can be computed from the data and should equal 1. This can form the basis of a test of unidimensionality using information from test-retest data.

6 Discussion

One of the key conclusions of this paper is that adjustments are needed to make survey results comparable across populations. In particular, when categorical variables are involved, analyses must account for differences in response category cut-points. There is considerable evidence that suggests that response category cut-points are different across countries. Therefore, until variation in cut-points is addressed, one must start from a presumption that results are not comparable across populations.

The problem of cross-population comparability also appears to apply within populations across different socio-economic and demographic groups. This has important implications for the measurement of inequality, which may be greater or smaller than measured before taking into account response category cut-point shifts. It also has critical implications for comparisons over time. Cut-points may systematically shift over time (e.g., due to rising income, education, and health norms), so long-term trends may be difficult to assess without correction.
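The test-retest check derived in Section 5 can be illustrated numerically. The Python sketch below simulates two questions that measure a single latent trait with independent errors and evaluates the left-hand side of the final expression in that derivation, which should then be close to 1. Everything in it is an assumption made for illustration: the simulated data, the error standard deviations, the function names, and the choice to compute $\rho$ and the covariance from the first (test) wave only.

```python
import math
import random

def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    """Sample covariance; cov(a, a) is the sample variance."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def unidimensionality_ratio(y1_test, y1_retest, y2_test, y2_retest):
    """rho * sqrt((s2_true + s2_e1) / s2_true) * sqrt((s2_true + s2_e2) / s2_true)."""
    s2_true = cov(y1_test, y2_test)  # covariance of the two measured Ys
    if s2_true <= 0:
        raise ValueError("non-positive covariance between the two questions")
    d1 = [a - b for a, b in zip(y1_test, y1_retest)]
    d2 = [a - b for a, b in zip(y2_test, y2_retest)]
    s2_e1 = cov(d1, d1) / 2.0        # var(test - retest) / 2
    s2_e2 = cov(d2, d2) / 2.0
    rho = s2_true / math.sqrt(cov(y1_test, y1_test) * cov(y2_test, y2_test))
    return (rho * math.sqrt((s2_true + s2_e1) / s2_true)
                * math.sqrt((s2_true + s2_e2) / s2_true))

# Simulated (hypothetical) data: one latent trait, two questions, two waves.
rng = random.Random(0)
truth = [rng.gauss(0.0, 1.0) for _ in range(20000)]
y1_test = [t + rng.gauss(0.0, 0.5) for t in truth]
y1_retest = [t + rng.gauss(0.0, 0.5) for t in truth]
y2_test = [t + rng.gauss(0.0, 0.8) for t in truth]
y2_retest = [t + rng.gauss(0.0, 0.8) for t in truth]
ratio = unidimensionality_ratio(y1_test, y1_retest, y2_test, y2_retest)
print(ratio)
```

With real survey data the ratio would carry sampling error, so a formal test would also need a standard error for it, which this sketch does not provide.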

7 Annex

Table 5: Estimation results: HOPIT

Variable        Coefficient   (Std. Err.)

Vignettes
Ivignette 2     -.35          (.146)
Ivignette 3     -4.33         (.117)
Ivignette 4     -5.116        (.122)
Ivignette 5     -5.341        (.123)
Ivignette 6     -7.458        (.175)
Ivignette 7     -7.643        (.195)

Mean
Age 30-44       -.488         (.85)
Age 45-59       -.715         (.1)
Age 60+         -1.656        (.113)
Male            .174          (.68)
1<Educ≤6        .185          (.115)
6<Educ≤11       .332          (.122)
Educ>11         .521          (.147)
Country B       .996          (.74)
Intercept       -2.985        (.166)

log(s)          .61           (.43)

Table 6: Estimation results: HOPIT $\tau^1$ (coefficients, with standard errors in parentheses)

Variable      Main question     Auxiliary question 1   Auxiliary question 2
Age 30-44     -.54 (.46)        -.384 (.132)           -.99 (.113)
Age 45-59     -.569 (.54)       -.544 (.156)           -.159 (.134)
Age 60+       -1.282 (.62)      -1.172 (.17)           -.768 (.155)
Male          .25 (.37)         .227 (.16)             .464 (.92)
1<Educ≤6      .79 (.61)         .56 (.175)             .249 (.157)
6<Educ≤11     .72 (.65)         .161 (.185)            .242 (.166)
Educ>11       .129 (.79)        .196 (.227)            .395 (.197)
Country B     1.296 (.41)       .928 (.113)            1.26 (.99)
Intercept     -4.662 (.134)     -4.312 (.223)          -3.777 (.22)

Table 7: Estimation results: HOPIT $\tau^2$ (coefficients, with standard errors in parentheses)

Variable      Main question     Auxiliary question 1   Auxiliary question 2
Age 30-44     -.441 (.48)       -.356 (.125)           -.152 (.112)
Age 45-59     -.551 (.56)       -.42 (.145)            -.224 (.133)
Age 60+       -1.283 (.63)      -1.232 (.164)          -.845 (.155)
Male          .25 (.38)         .271 (.1)              .46 (.92)
1<Educ≤6      .53 (.62)         -.117 (.165)           .324 (.157)
6<Educ≤11     .59 (.66)         -.5 (.175)             .346 (.165)
Educ>11       .72 (.81)         .14 (.212)             .458 (.196)
Country B     1.259 (.43)       .839 (.17)             1.258 (.98)
Intercept     -4.399 (.134)     -3.922 (.21)           -3.579 (.21)

Table 8: Estimation results: HOPIT $\tau^3$ (coefficients, with standard errors in parentheses)

Variable      Main question     Auxiliary question 1   Auxiliary question 2
Age 30-44     -.395 (.51)       -.271 (.118)           -.12 (.113)
Age 45-59     -.537 (.59)       -.388 (.138)           -.25 (.136)
Age 60+       -1.16 (.65)       -1.262 (.158)          -.883 (.162)
Male          .227 (.4)         .217 (.95)             .345 (.93)
1<Educ≤6      .89 (.65)         -.93 (.159)            .26 (.16)
6<Educ≤11     .79 (.7)          .31 (.168)             .259 (.168)
Educ>11       .136 (.87)        .73 (.24)              .367 (.2)
Country B     1.252 (.46)       .836 (.11)             1.235 (.1)
Intercept     -4.74 (.135)      -3.611 (.23)           -2.945 (.23)