Impact and adjustment of selection bias in the assessment of measurement equivalence

Thomas Klausch, Joop Hox, & Barry Schouten
Working Paper, Utrecht, December 2012
Corresponding author: Thomas Klausch, L.T.Klausch@uu.nl
Utrecht University, Faculty for Social and Behavioural Sciences, Department of Methods and Statistics, PO Box 80140, 3508 TC Utrecht, The Netherlands.

Abstract

Selection bias threatens valid causal inference in designs with incomplete randomization, e.g. in observational studies and quasi-experiments. When measurement models, such as CFA or IRT, need to be assessed for equivalence, selection bias may lead analysts to draw wrong conclusions. Whether this threat is real, and how to adjust for it, is assessed in the present study by means of a Monte Carlo simulation. Selection bias between a treatment and a control group was simulated, where measurement non-equivalence was introduced on qualitative covariates that were causally related to the assignment mechanism. Our results indicate that unadjusted tests falsely reject the hypothesis of measurement equivalence under both the RMSEA and CFI fit criteria. Inverse propensity score weighting performed best in adjustment, whereas simple ANCOVA adjustment on the latent factor proved insufficient in removing all selectivity in the treatment assignment.
1. Introduction

Latent variables are important quantities in social research, assisting researchers in measuring concepts that cannot be surveyed by single direct questions alone. Measurement models, such as confirmatory factor analysis (CFA) or item response theory (IRT), are used in the estimation of latent variables and additionally help to control for measurement error in the observable indicators (e.g. Bollen, 1989; Alwin, 2007), which are typically available from multiple-item scales. Multiple-group models can be used to assess construct equivalence across study groups about which analysts wish to draw inference or make comparisons, for example population strata defined by gender, age, nationality, or race. The methods necessary to assess such questions have been well developed and documented (Steenkamp & Baumgartner, 1998; Vandenberg & Lance, 2000; Millsap, 2011).

The focus of interest does not always lie in comparing naturally occurring population strata, however. When researchers seek to draw causal inference about latent constructs in experimental designs, or about the effect of an experimentally manipulated factor on other parameters of the measurement model, it needs to be assessed whether measurement instruments are invariant across experimental groups. If randomization was successful, these analyses will be unbiased. In many situations, however, full randomization of subjects to treatment (intervention) and control groups is not possible, in particular in observational and quasi-experimental studies (Rosenbaum, 2002; Morgan & Winship, 2007). In such situations selection bias is known to occur, threatening valid causal inference about equivalence in measurement models. Although selection bias has received a lot of attention in the literature, including methods to adjust for it (cf. Schafer & Kang, 2008), the effect selection bias has on latent variable models and on equivalence assessment has, to the authors' knowledge, not been discussed.
Furthermore, adjusting for selection bias in latent variable models has received little systematic attention in the literature. The present study addresses this research gap. Using a Monte Carlo simulation, we illustrate the effects that selection bias can have in categorical CFA measurement models (also known as polytomous IRT models) when testing measurement equivalence (Muthén & Asparouhov, 2002; Millsap, 2011). This class of models is appropriate when questions using Likert scales with a small number of answer categories need to be scaled, a situation close to research practice. We then compare the performance of methods to adjust for selection bias. In particular, we consider three popular methods used to adjust for selection bias when directly observed outcome variables are studied: ANCOVA adjustment on covariates as suggested by Sörbom (1978), exact stratification (Rosenbaum, 2002), and propensity score weighting (Rosenbaum & Rubin, 1983).
This paper is structured as follows. First, we present the CFA model used in the simulation and discuss how selection bias is introduced. Second, we discuss how to adjust for selection bias. Finally, results are presented and discussed.

2. Simulating Selection Bias in an Ordered Categorical CFA Model

2.1 The CFA Model

In the simulation we consider the ordered categorical factor model with, for simplicity, one factor and four indicators (cf. Muthén & Asparouhov, 2002; Millsap, 2011):

X^*_j = \nu_j + \lambda_j W + \varepsilon_j    (1)

We model j = 1, ..., 4 latent response variables X^*_j by variable-specific intercepts \nu_j, a source W of common variance (the "true" or latent scores), and a random error \varepsilon_j. For identification, however, we fix \nu_j = 0 for all j. We further set W ~ N(0, 1). Unit variance of W leads to the definition of the reliability of measure j:

\rho_j = \lambda_j^2 / (\lambda_j^2 + \theta_j) = \lambda_j^2    (2)

The second equality follows after standardizing by \lambda_j^2 + \theta_j = 1. Accordingly, the error variances depend on \lambda_j:

\theta_j = 1 - \lambda_j^2    (3)

The X^*_j cannot be observed directly, but are mapped without error onto the observed ordered categorical indicators X_j, with C = 4 thresholds defining five categories by the mapping function:

X_j = c  iff  \tau_{j,c} < X^*_j \le \tau_{j,c+1},  c = 0, ..., C,  with \tau_{j,0} = -\infty and \tau_{j,C+1} = \infty    (4)

where the \tau_{j,c} are denoted threshold parameters for latent response variable j.

2.2 Introducing Covariates as Causes of Selection Bias

We introduce selection bias by two multinomial stratification variables, S_1 and S_2, dividing the sample space into 3 x 2 strata (population proportions {.3, .3, .4} and {.5, .5}, respectively). In practice many more variables might be available to the analyst, but the results generalize easily to other situations. Note that assignment to treatment and control groups has not yet taken place. In observational studies, however, the probability of assignment is impacted by the levels of covariates, which are represented here by S_1 and S_2.
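To make the data-generating model concrete, the following minimal Python sketch (the original simulation used R and Mplus; all function and variable names here are ours, not the authors') draws the latent factor W, builds the latent responses per equation (1) with error variance from equation (3), and maps them onto five categories via the thresholds of equation (4):

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_items(n, lambdas, thresholds, rng):
    """Draw ordered categorical indicators from the one-factor model (1)-(4).

    lambdas: one loading per item (zero intercepts, unit-variance factor);
    thresholds: per item, C = 4 increasing cut points -> 5 categories (0..4).
    """
    w = rng.standard_normal(n)                       # W ~ N(0, 1)
    items = np.empty((n, len(lambdas)), dtype=int)
    for j, (lam, tau) in enumerate(zip(lambdas, thresholds)):
        theta = 1.0 - lam ** 2                       # error variance, eq. (3)
        x_star = lam * w + rng.normal(0.0, np.sqrt(theta), n)   # eq. (1)
        items[:, j] = np.searchsorted(tau, x_star)   # category = number of cuts below X*, eq. (4)
    return w, items

# reliability .6 implies loading sqrt(.6); base thresholds as in Table 2
lam = np.sqrt(0.6)
tau = [-1.5, -1.0, 0.0, 1.0]
w, items = simulate_items(3000, [lam] * 4, [tau] * 4, rng)
```

The `searchsorted` call implements the mapping function directly: an observation falling below all cut points receives category 0, one exceeding all cut points category 4.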
In the simulation we vary reliability and threshold parameters across these two stratification variables. By this process we account for the fact that in reality it is not the experimental factor that causes measurement non-equivalence, but rather the underlying characteristics S. For example, S might be the nationality of subjects, and S is not equally distributed across treatment and control due to selectivity. This imbalance is introduced below. First, let membership in stratum combination S_1 = k and S_2 = l have a fixed additive effect on response reliability:

\rho_j^{(k,l)} = \rho_j + \Delta\rho_k + \Delta\rho_l \le 1    (5)

as well as on the threshold parameters:

\tau_{j,c}^{(k,l)} = \tau_{j,c} + \Delta\tau_{k,c} + \Delta\tau_{l,c}    (6)

The order implied by equation (4) must still hold for the \tau_{j,c}^{(k,l)}. Tables 1 and 2 give an overview of the exact parameterization used, which introduces measurement non-equivalence between all strata. Equations (5) and (6) generalize model (1) to a multi-group CFA model:

X^{*(k,l)}_j = \lambda_j^{(k,l)} W + \varepsilon_j^{(k,l)}    (7)

X_j = c \mid (S_1 = k, S_2 = l)  iff  \tau_{j,c}^{(k,l)} < X^{*(k,l)}_j \le \tau_{j,c+1}^{(k,l)}    (8)

where we additionally assume that W is independent of S_1 and S_2.

Table 1: Differential item functioning (reliability shifts) in sub-groups of S_1 and S_2

j | \rho_j | \Delta(S_1=1) | \Delta(S_1=2) | \Delta(S_1=3) | \Delta(S_2=1) | \Delta(S_2=2)
1 |   .6   |       0       |      -.3      |      -.5      |       0       |      .2
2 |   .6   |       0       |      -.3      |      -.5      |       0       |      .2
3 |   .6   |       0       |      -.3      |      -.5      |       0       |      .2
4 |   .6   |       0       |      -.3      |      -.5      |       0       |      .2

Table 2: Differential item functioning (threshold shifts) in sub-groups of S_1 and S_2

c | \tau_c | \Delta(S_1=1) | \Delta(S_1=2) | \Delta(S_1=3) | \Delta(S_2=1) | \Delta(S_2=2)
1 |  -1.5  |       0       |      -.5      |       0       |       0       |       0
2 |  -1.0  |       0       |       0       |      .5       |       0       |       0
3 |    0   |       0       |       0       |       0       |       0       |      .5
4 |   1.0  |       0       |       0       |       0       |      -.5      |       0
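Reading Tables 1 and 2 as additive shifts to the base reliability (.6) and base thresholds (-1.5, -1.0, 0, 1.0), the stratum-specific parameters of equations (5) and (6) can be sketched as follows. This is our reconstruction of the parameterization, not the authors' code; the helper name and data layout are assumptions, and the built-in check enforces the threshold-order requirement stated after equation (6):

```python
import numpy as np

# Base parameters and additive DIF shifts as read from Tables 1 and 2
# (level 1 of S1 and S2 is the reference with zero shift).
RHO_BASE = 0.6
RHO_SHIFT_S1 = {1: 0.0, 2: -0.3, 3: -0.5}
RHO_SHIFT_S2 = {1: 0.0, 2: 0.2}

TAU_BASE = np.array([-1.5, -1.0, 0.0, 1.0])
TAU_SHIFT_S1 = {1: np.zeros(4),
                2: np.array([-0.5, 0.0, 0.0, 0.0]),
                3: np.array([0.0, 0.5, 0.0, 0.0])}
TAU_SHIFT_S2 = {1: np.array([0.0, 0.0, 0.0, -0.5]),
                2: np.array([0.0, 0.0, 0.5, 0.0])}

def stratum_params(k, l):
    """Reliability and thresholds in stratum (S1=k, S2=l), eqs. (5)-(6)."""
    rho = RHO_BASE + RHO_SHIFT_S1[k] + RHO_SHIFT_S2[l]
    tau = TAU_BASE + TAU_SHIFT_S1[k] + TAU_SHIFT_S2[l]
    # the category order of eq. (4) must survive the additive shifts
    assert np.all(np.diff(tau) > 0), "thresholds must stay strictly increasing"
    return rho, tau
```

With these shifts every one of the six strata keeps strictly increasing thresholds and a reliability between .1 (S_1 = 3, S_2 = 1) and .8 (S_1 = 1, S_2 = 2).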
For example, if S_2 is a nationality indicator, we would assume that the reliability of answers given by people with nationality l = 1 is smaller by .20 than the reliability of answers provided by subjects with nationality l = 2. Furthermore, the threshold at which a particular answer is given also differs across groups. In addition, the factor means (and variances) of subjects might differ across strata. This possibility is neglected in order to keep the present simulation straightforward.

2.3 Introducing Selection Bias on S when Randomizing Treatment and Control Groups

Now let population members select into a treatment and a control group through a simple probit selection model. For this purpose we transform S_1 and S_2 into dummy indicators (D_11, D_12, D_13) and (D_21, D_22); the reference dummies D_11 and D_21 are omitted from the model. For individual i we define a latent selection variable (see also Table 3):

T^*_i = \gamma_0 + \gamma_{12} D_{12,i} + \gamma_{13} D_{13,i} + \gamma_{22} D_{22,i} + \gamma_{31} D_{12,i} D_{22,i} + \gamma_{32} D_{13,i} D_{22,i} + \varepsilon_i    (9)

where \varepsilon_i ~ N(0, 1), and define the treatment indicator M as:

M_i = 0 if T^*_i \le 0;  M_i = 1 if T^*_i > 0.    (10)

Model (9)-(10) implies that we do not assume that randomization was perfect. Rather, the background characteristics S are the known causes of selection bias.

Table 3: Parameters of the selection model

Parameter   | Value
\gamma_0    | 0
\gamma_{12} | 1
\gamma_{13} | 2
\gamma_{22} | -1
\gamma_{31} | .5
\gamma_{32} | .5

An analyst interested in assessing measurement equivalence over M can do so by means of a multi-group model (e.g. Millsap & Yun-Tein, 2004; Millsap, 2011):

X^{*(M)}_j = \lambda_j^{(M)} W + \varepsilon_j^{(M)}    (11)

X_j = c \mid M  iff  \tau_{j,c}^{(M)} < X^{*(M)}_j \le \tau_{j,c+1}^{(M)}    (12)
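The selection step of equations (9)-(10) with the Table 3 coefficients can be sketched as follows (a Python illustration; function and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(7)

# Selection-model coefficients, Table 3
G0, G12, G13, G22, G31, G32 = 0.0, 1.0, 2.0, -1.0, 0.5, 0.5

def assign_treatment(s1, s2, rng):
    """Probit selection per eqs. (9)-(10): dummies for S1=2, S1=3 and S2=2
    plus their interactions; M = 1 iff the latent index exceeds zero."""
    d12 = (s1 == 2).astype(float)
    d13 = (s1 == 3).astype(float)
    d22 = (s2 == 2).astype(float)
    t_star = (G0 + G12 * d12 + G13 * d13 + G22 * d22
              + G31 * d12 * d22 + G32 * d13 * d22
              + rng.standard_normal(s1.size))        # epsilon ~ N(0, 1)
    return (t_star > 0).astype(int)

# stratification variables with the population proportions of section 2.2
n = 3000
s1 = rng.choice([1, 2, 3], size=n, p=[0.3, 0.3, 0.4])
s2 = rng.choice([1, 2], size=n, p=[0.5, 0.5])
m = assign_treatment(s1, s2, rng)
```

Because \gamma_{13} = 2, subjects with S_1 = 3 end up almost entirely in M = 1, which is exactly the skewed group distribution that later causes sparseness problems for exact stratification (section 4).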
In doing so, the hypotheses

H_{01}: \lambda_j^{(M=0)} = \lambda_j^{(M=1)} for all j    (13)

H_{02}: \tau_{j,c}^{(M=0)} = \tau_{j,c}^{(M=1)} for all j and c    (14)

are evaluated jointly by imposing equality constraints on the parameters and evaluating global model fit. Model (11)-(12) can be estimated by mean- and variance-adjusted weighted least squares (WLSMV) as described in Muthén (1984) and Muthén, du Toit, & Spisic (1997). The treatment indicator M has no differential impact on the measurement model; therefore H_{01} and H_{02} should not be rejected. However, selection on S_1 and S_2 might introduce apparent measurement non-equivalence, because the distribution of the selection variables is not equal across the groups under selection model (9)-(10). Hence we ask whether the test of measurement equivalence (13)-(14) can be improved (or adjusted) by applying techniques that balance the groups with respect to S_1 and S_2. How to do this is discussed in the next section.

3. Evaluation of Three Adjustment Methods

We evaluate the performance of three possible adjustment methods:

1. ANCOVA adjustment of the latent factor ("covariate adjustment"),
2. exact stratification on S_1 and S_2,
3. inverse propensity score weighting (IPW),

against the case of ignoring selection ("simple model").

Figure 1: Illustration of (a) a path model in the ANCOVA tradition and (b) exact stratification on all levels of S_1 by S_2 by a stratified multi-group model
Situations 1 and 2 are illustrated in Figure 1. ANCOVA adjustment is a classical way to control for group heterogeneity in incompletely randomized groups (e.g. Schafer & Kang, 2008). In the context of CFA models, modeling direct effects of the stratification variables on the latent factor (method 1, case (a) in Figure 1) seeks to balance group heterogeneity, as suggested by Sörbom (1978; see Heerwegh & Loosveldt, 2011, for an application):

W_i = \beta_0 + \beta_{12} D_{12,i} + \beta_{13} D_{13,i} + \beta_{22} D_{22,i} + \beta_{31} D_{12,i} D_{22,i} + \beta_{32} D_{13,i} D_{22,i} + \zeta_i    (15)

Second, stratification on the exact strata defined by S_1 and S_2 (method 2, case (b) in Figure 1) implies conditional estimation of all multiple-group model parameters in every combination of S (e.g. Rosenbaum, 2002; Morgan & Winship, 2007). In the present simulation this implies estimating one multiple-group model for each of the s = 1, ..., 6 strata defined by S_1 and S_2:

X^{*(M, S=s)}_j = \lambda_j^{(M, S=s)} W + \varepsilon_j^{(M, S=s)}    (16)

X_j = c \mid (M, S = s)  iff  \tau_{j,c}^{(M, S=s)} < X^{*(M, S=s)}_j \le \tau_{j,c+1}^{(M, S=s)}    (17)

It is concluded that measurement equivalence holds conditional on S_1 and S_2 if H_{01} and H_{02} cannot be rejected in any of the strata defined by the combinations of both S variables.

Finally, inverse propensity score weighting (method 3) can be used to adjust for unequal selection probabilities of individual i into treatment (or control). Propensity scores are estimated from a probit model that follows the true model (9) (Rosenbaum & Rubin, 1983; Morgan & Winship, 2007; Guo & Fraser, 2010):

P(M_i = 1 \mid S_{1i}, S_{2i}) = \Phi(\gamma_0 + \gamma_{12} D_{12,i} + \gamma_{13} D_{13,i} + \gamma_{22} D_{22,i} + \gamma_{31} D_{12,i} D_{22,i} + \gamma_{32} D_{13,i} D_{22,i})    (18)

where \Phi denotes the standard normal distribution function. Let \hat{p}_i be a propensity score estimate from (18); then individual weights are defined as:

w_i = \hat{p}_i^{-1} if M_i = 1;  w_i = (1 - \hat{p}_i)^{-1} if M_i = 0;  equivalently  w_i = M_i \hat{p}_i^{-1} + (1 - M_i)(1 - \hat{p}_i)^{-1}    (19)

The incorporation of such weights into the estimation of model (11)-(12) with WLSMV is described in Asparouhov (2005).
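Equation (19) is straightforward to compute once propensity scores are available; the following Python sketch uses the exact normal CDF in place of a fitted probit model, and the helper names are hypothetical:

```python
import numpy as np
from math import erf, sqrt

def probit_cdf(x):
    # standard normal CDF: Phi(x) = (1 + erf(x / sqrt(2))) / 2
    return 0.5 * (1.0 + np.vectorize(erf)(x / sqrt(2.0)))

def ipw_weights(m, p_hat, eps=1e-6):
    """Eq. (19): weight 1/p for treated units (M=1), 1/(1-p) for controls."""
    p_hat = np.clip(p_hat, eps, 1.0 - eps)    # guard against extreme scores
    return m / p_hat + (1.0 - m) / (1.0 - p_hat)

# toy check: three units with latent selection indices -1, 0 and 2
index = np.array([-1.0, 0.0, 2.0])
m = np.array([0, 1, 1])
w = ipw_weights(m, probit_cdf(index))
```

In the simulation the scores come from a probit fit of M on the S dummies, and the resulting weights enter the WLSMV estimation as sampling weights (Asparouhov, 2005). The clipping step is our addition: near-zero or near-one scores would otherwise produce extreme weights.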
4. Results from a Monte Carlo Simulation

A Monte Carlo simulation with 1000 replications and a sample size of n = 3000 was conducted. Data were simulated in the statistical programming environment R 2.14. Models were estimated in the statistical software Mplus 6, run from R using the package MplusAutomation. To evaluate H_{01} and H_{02} jointly, the parameters \lambda_j^{(M)}, \tau_{j,c}^{(M)}, and \theta_j^{(M)} were constrained equal across M. Fit was evaluated based on the RMSEA criterion (root mean square error of approximation),

reject H_{01} and H_{02} if RMSEA > .05,    (20)

and the CFI criterion (comparative fit index),

reject H_{01} and H_{02} if CFI < .95.    (21)

We found that all models in the unadjusted ("simple") condition have insufficient fit, leading to false rejection of the measurement equivalence (MI) hypothesis with respect to the groups M in all of the replications (Table 4). Covariate adjustment on the latent factor improves model fit (mean RMSEA = .053, CFI = .916) but still leads to rejection of the MI hypotheses in 71.6% of cases based on RMSEA and in all cases based on the CFI criterion. Exact stratification on all six strata and separate evaluation of multi-group models in each stratum is more effective than covariate adjustment. However, during estimation the conditioning technique posed new problems due to data sparseness in some of the group-by-stratum combinations.

Table 4: MC distribution of CFA model fit statistics with results of hypothesis tests (in %)

                       | Simple      | Covariate   | Stratification | IPW
RMSEA (mean/sd)        | .129 (.008) | .053 (.004) | .014 (.018)*   | .013 (.009)
% MI not rejected      | 0           | 28.4        | 83.8           | 100
CFI (mean/sd)          | .864 (.017) | .916 (.014) | .994 (.031)*   | .993 (.007)
% MI not rejected      | 0           | 0           | 0.5            | 100
Successful estimations | 1000        | 1000        | 3325 of 6000   | 1000

* over all successful replications

To see this, consider Table 5, which illustrates the performance of the hypothesis tests in all six strata. The parameterization of selection model (9) evidently causes the group distribution in stratum S_1 = 3 to be very skewed (\gamma_{13} = 2), i.e. there are few observations with M = 0. While the adjustment method works well in strata with a high number of successful replications (i.e. those with sufficient group sizes), it functions badly in the two strata associated with S_1 = 3. It was, furthermore, postulated (cf.
section 3) that adjustment for selection would only be considered successful if equivalence was produced in all six strata. This was, however, found for only 83.8% of replications based on RMSEA and a mere 0.5% of replications based on CFI (only successful model fits were used in these two statistics). This suggests a multiple testing problem, because taken separately per stratum (Table 5) the conditioning technique works satisfactorily if strata are of sufficient size. In sum, the conditioning technique may suffer from data sparseness and multiple testing problems.

Table 5: Fit statistics per stratum for the stratification adjustment method (in %)

                       | S1=1, S2=1  | S1=2, S2=1  | S1=3, S2=1  | S1=1, S2=2  | S1=2, S2=2  | S1=3, S2=2
RMSEA (mean/sd)        | .014 (.018) | .013 (.018) | .016 (.018) | .015 (.020) | .013 (.018) | n/a
% MI not rejected      | 95.1        | 94.5        | 96.8        | 93.6        | 96.1        | 100
CFI (mean/sd)          | .998 (.003) | .989 (.020) | .898 (.151) | .999 (.001) | .997 (.006) | .822 (n/a)
% MI not rejected      | 100         | 92.7        | 54.8        | 100         | 100         | 0
Successful estimations | 1000        | 439         | 93          | 968         | 824         | 1

Finally, consider the performance of the inverse propensity score weighting (IPW) adjustment (Table 4). Mean RMSEA and CFI suggest good fit. Measurement equivalence is not rejected in any of the replications; that is, RMSEA < .05 and CFI > .95 in all replications after weighting with inverse propensity scores. Note that this finding holds in the face of the small strata sizes discussed for the conditioning adjustment technique. Since IPW, furthermore, requires only one statistical test, it appears superior to stratification in the current simulation.

5. Conclusions and Outlook

Our simulation demonstrated that working with non-adjusted CFA models when testing for measurement equivalence in experimental groups is prone to false conclusions under two conditions. First, observed or unobserved covariates determine individual selection into treatment and control groups. Second, there is measurement non-equivalence across the classes of these variables. In the presence of selection bias in measurement equivalence tests, adjusting for bias on observed covariates is a necessity. Our results demonstrate, however, that not all of the methods available from the literature perform equally well. In particular, ANCOVA adjustment on the latent trait performed very weakly in the present simulation and therefore is not recommended.
The reasons for this weak performance are related to the locus of non-equivalence in the present data. We assumed that the strata of the stratification variables did not differ in the means and variances of the latent factor, but rather in thresholds and item reliabilities. The ANCOVA adjustment, however, works only on the expectation and variance of the latent factor and does not control for the true sources of non-equivalence. These are taken into account by exact stratification and propensity score weighting. Given the sample size of the present simulation (n = 3000) and six strata, cell sparseness in a few strata
coincided with false test results. In situations with even more stratification variables or fewer observations, this problem is prone to become even more serious. Therefore exact stratification is not recommended in these situations. This observation is well known from the literature on adjustment by exact stratification (e.g. Rosenbaum, 2002; Morgan & Winship, 2007). The propensity score combines the information on all stratification variables into a single score, thereby addressing cell sparseness problems effectively. Consequently, inverse propensity weights performed exceedingly well in adjusting for selection bias in the present simulation. From the present results we conclude that IPW is the method of choice.

This conclusion has to be considered against the specific limitations of the present simulation design (e.g. parameterization, sample size, number of stratification variables) as well as further options to adjust for selection bias that were not considered here. These include in particular other methods based on the propensity score, such as propensity matching and propensity stratification. Furthermore, the present study assumed that all bias is overt, that is, full information was available on both stratification covariates. In practical situations there might be bias caused by hidden covariates. Propensity score models have been shown to be misleading in this situation. Doubly robust methods using both covariate and propensity adjustment might prove beneficial. These aspects should be assessed in further simulations.
6. References

Alwin, D. F. (2007). Margins of Error. Hoboken: Wiley.

Asparouhov, T. (2005). Sampling weights in latent variable modeling. Structural Equation Modeling: A Multidisciplinary Journal, 12(3), 411-434. doi:10.1207/s15328007sem1203_4

Bollen, K. A. (1989). Structural Equations with Latent Variables. New York: Wiley.

Guo, S., & Fraser, M. W. (2010). Propensity Score Analysis. Thousand Oaks: Sage.

Heerwegh, D., & Loosveldt, G. (2011). Assessing mode effects in a national crime victimization survey using structural equation models: Social desirability bias and acquiescence. Journal of Official Statistics, 27(1), 49-63.

Millsap, R. E. (2011). Statistical Approaches to Measurement Invariance. New York: Routledge.

Millsap, R. E., & Yun-Tein, J. (2004). Assessing factorial invariance in ordered-categorical measures. Multivariate Behavioral Research, 39(3), 479-515.

Morgan, S. L., & Winship, C. (2007). Counterfactuals and Causal Inference. Cambridge: Cambridge University Press.

Muthén, B. (1984). A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika, 49(1), 115-132.

Muthén, B., du Toit, S. H. C., & Spisic, D. (1997). Robust inference using weighted least squares and quadratic estimating equations in latent variable modeling with categorical and continuous outcomes. Retrieved from http://www.gseis.ucla.edu/faculty/muthen/articles/article_075.pdf
Muthén, B. O., & Asparouhov, T. (2002). Latent variable analysis with categorical outcomes: Multiple-group and growth modeling in Mplus. Muthén & Muthén. Retrieved from http://www.statmodel.com/download/webnotes/catmglong.pdf

Rosenbaum, P. R. (2002). Observational Studies (2nd ed.). New York: Springer.

Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41-55.

Schafer, J. L., & Kang, J. (2008). Average causal effects from nonrandomized studies: A practical guide and simulated example. Psychological Methods, 13(4), 279-313.

Sörbom, D. (1978). An alternative to the methodology for analysis of covariance. Psychometrika, 43(3), 381-396.

Steenkamp, J.-B. E. M., & Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25, 78-90.

Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3(1), 4-70.