
Impact and adjustment of selection bias in the assessment of measurement equivalence

Thomas Klausch, Joop Hox, & Barry Schouten
Working Paper, Utrecht, December 2012

Corresponding author: Thomas Klausch, L.T.Klausch@uu.nl
Utrecht University, Faculty for Social and Behavioural Sciences
Department of Methods and Statistics
PO Box 80140, 3508 TC Utrecht, The Netherlands

Abstract

Selection bias is a threat to valid causal inference in designs with incomplete randomization, e.g. in observational studies and quasi-experiments. When measurement models, such as CFA or IRT, need to be assessed for equivalence, selection bias might lead analysts to draw wrong conclusions. Whether this threat is real, and how to adjust for it, is assessed in the present study by means of a Monte-Carlo simulation. Selection bias between a treatment and a control group was simulated, with measurement non-equivalence introduced on qualitative covariates that were causally related to the assignment mechanism. Our results indicate that unadjusted tests falsely reject the hypothesis of measurement equivalence under the RMSEA and CFI fit criteria. Inverse propensity score weighting performed best as an adjustment, whereas simple ANCOVA adjustment on the latent factor proved insufficient in removing all selectivity in the treatment assignment.

1. Introduction

Latent variables are important quantities in social research, assisting researchers in measuring concepts that cannot be surveyed by single direct questions alone. Measurement models, such as confirmatory factor analysis (CFA) or item response theory (IRT), are used to estimate latent variables and additionally help to control for measurement error in the observable indicators, which are typically available from multiple-item scales (e.g. Bollen, 1989; Alwin, 2007). Multiple-group models can be used to assess construct equivalence (short: equivalence) across the study groups about which analysts wish to draw inferences or make comparisons, for example population strata defined by gender, age, nationality, or race. The methods necessary to assess such questions are well developed and documented (Steenkamp & Baumgartner, 1998; Vandenberg & Lance, 2000; Millsap, 2011).

The focus of interest, however, does not always lie in comparing naturally occurring population strata. When researchers seek to draw causal inferences about latent constructs in experimental designs, or about the effect of an experimentally manipulated factor on other parameters of the measurement model, it must be assessed whether the measurement instruments are invariant across experimental groups. If randomization is successful, these analyses will be unbiased. In many situations, however, full randomization of subjects to treatment ('intervention') and control groups is not possible, in particular in observational and quasi-experimental studies (Rosenbaum, 2002; Morgan & Winship, 2007). In such situations selection bias is known to occur, threatening valid causal inference about equivalence in measurement models. Although selection bias has received considerable attention in the literature, including methods to adjust for it (cf. Schafer & Kang, 2008), its effect on latent variable models and on equivalence assessment has, to the authors' knowledge, not been discussed. Furthermore, adjusting for selection bias in latent variable models has received little systematic attention in the literature. The present study addresses this research gap.

Using a Monte-Carlo simulation, we illustrate the effects that selection bias can have in categorical CFA measurement models (also known as polytomous IRT models) when testing measurement equivalence (Muthén & Asparouhov, 2002; Millsap, 2011). This class of models is appropriate when questions using Likert scales with a small number of answer categories need to be scaled, a situation close to research practice. We then compare the performance of methods to adjust for selection bias. In particular, we consider three popular methods used to adjust for selection bias when directly observed outcome variables are studied: ANCOVA adjustment on covariates as suggested by Sörbom (1978), exact stratification (Rosenbaum, 2002), and propensity score weighting (Rosenbaum & Rubin, 1983).

This paper is structured as follows. First, we present the CFA model used in the simulation and discuss how selection bias is introduced. Second, we discuss how to adjust for selection bias. Finally, results are presented and discussed.

2. Simulating Selection Bias in an Ordered Categorical CFA Model

2.1 The CFA Model

In the simulation we consider the ordered categorical factor model with, for simplicity, one factor and four indicators (cf. Muthén & Asparouhov, 2002; Millsap, 2011):

    X*_j = τ_j + λ_j W + ε_j                                             (1)

We model j = 1, ..., 4 latent response variables X*_j by variable-specific intercepts τ_j, a source W of common variance ('true' or latent scores) with loadings λ_j, and random errors ε_j. For identification, however, we fix τ_j = 0 for all j, and we further set W ~ N(0, 1). Unit variance of W leads to the definition of the reliability of measure j:

    ρ²_j = λ²_j / (λ²_j + θ_j) = λ²_j                                    (2)

The second equality follows after standardizing the latent response variances to λ²_j + θ_j = 1. Accordingly, the error variances depend on ρ²_j:

    θ_j = 1 − ρ²_j                                                       (3)

The X*_j cannot be observed directly, but are mapped without error onto the observed ordered categorical indicators X_j, with C = 4 thresholds defining five categories, by the mapping function:

    X_j = c   if   ν_{j,c} < X*_j ≤ ν_{j,c+1}                            (4)

where the ν_{j,c} are the threshold parameters for latent response variable j (with ν_{j,0} = −∞ and ν_{j,C+1} = +∞).

2.2 Introducing Covariates as Causes of Selection Bias

We introduce selection bias through two multinomial stratification variables, S_1 and S_2, dividing the sample space into 3 × 2 strata (population proportions {.3, .3, .4} and {.5, .5}, respectively). In practice many more variables might be available to the analyst, but the results generalize easily to other situations. Note that assignment to treatment and control groups has not yet taken place. In observational studies, however, the probability of assignment is affected by the levels of covariates, which will here be represented by S_1 and S_2.
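As a minimal illustration (not the authors' original code), data from the baseline model (1)-(4) can be generated in R as follows. The object names, and the use of the baseline parameter values that appear in Tables 1 and 2 below, are assumptions of this sketch.

set.seed(1)
n    <- 3000                      # sample size used in the simulation (section 4)
rho2 <- rep(.6, 4)                # baseline item reliabilities (cf. Table 1 below)
nu   <- c(-1.5, -1.0, 0, 1.0)     # baseline thresholds: C = 4 cut-points, 5 categories

W <- rnorm(n)                                           # latent factor, W ~ N(0, 1)
X <- sapply(rho2, function(r2) {                        # equations (1)-(3): lambda_j = sqrt(rho2_j)
  xstar <- sqrt(r2) * W + rnorm(n, sd = sqrt(1 - r2))   # latent response X*_j, theta_j = 1 - rho2_j
  findInterval(xstar, nu) + 1                           # mapping (4): categories 1, ..., 5
})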

In the simulation we vary the reliability and threshold parameters across these two stratification variables. By this process we account for the fact that in reality it is not the experimental factor that causes measurement non-equivalence, but rather the underlying characteristics S. For example, S might be the nationality of subjects, and S is not equally distributed across treatment and control due to selectivity. This imbalance will be introduced below. First, let membership in the stratum combination S_1 = k and S_2 = l have a fixed effect on response reliability:

    ρ²_j(k,l) = ρ²_j + ρ²_k + ρ²_l                                       (5)

as well as on the threshold parameters:

    ν_{j,c}(k,l) = ν_{j,c} + ν_{k,c} + ν_{l,c}                           (6)

The order implied by equation (4) must still hold for the ν_{j,c}(k,l). Tables 1 and 2 give an overview of the exact parameterization used, which introduces measurement non-equivalence between all strata. Equations (5) and (6) generalize model (1) to a multi-group CFA model:

    X*_j(k,l) = λ_j(k,l) W + ε_j(k,l)                                    (7)

    X_j = c | S_1 = k, S_2 = l   if   ν_{j,c}(k,l) < X*_j(k,l) ≤ ν_{j,c+1}(k,l)      (8)

where we additionally assume that W is independent of S_1 and S_2.

Table 1: Differential item functioning (reliabilities) in sub-groups S_1 and S_2 (k indexes S_1, l indexes S_2)

 j    ρ²_j   ρ²_{k=1}   ρ²_{k=2}   ρ²_{k=3}   ρ²_{l=1}   ρ²_{l=2}
 1    .6     0          -.3        -.5        0          .2
 2    .6     0          -.3        -.5        0          .2
 3    .6     0          -.3        -.5        0          .2
 4    .6     0          -.3        -.5        0          .2

Table 2: Differential item functioning (thresholds) in sub-groups S_1 and S_2

 c    ν_c    ν_{k=1,c}   ν_{k=2,c}   ν_{k=3,c}   ν_{l=1,c}   ν_{l=2,c}
 1    -1.5   0           -.5         0           0           0
 2    -1.0   0           0           .5          0           0
 3     0     0           0           0           0           .5
 4     1.0   0           0           0           -.5         0
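A short sketch of how the stratum-specific parameters in (5)-(6) follow from the deviation values in Tables 1 and 2 (as reconstructed above), continuing the sketch of section 2.1; the object names are illustrative only.

rho2_dev_S1 <- c(0, -.3, -.5)           # reliability shifts for S1 = 1, 2, 3 (Table 1)
rho2_dev_S2 <- c(0, .2)                 # reliability shifts for S2 = 1, 2 (Table 1)
nu_dev_S1 <- rbind(c(  0,  0, 0, 0),    # threshold shifts for S1 = 1 (c = 1, ..., 4; Table 2)
                   c(-.5,  0, 0, 0),    # S1 = 2
                   c(  0, .5, 0, 0))    # S1 = 3
nu_dev_S2 <- rbind(c(0, 0,  0, -.5),    # S2 = 1
                   c(0, 0, .5,   0))    # S2 = 2

rho2_stratum <- function(k, l) rho2 + rho2_dev_S1[k] + rho2_dev_S2[l]   # equation (5)
nu_stratum   <- function(k, l) nu   + nu_dev_S1[k, ] + nu_dev_S2[l, ]   # equation (6)

rho2_stratum(3, 2)   # e.g. item reliabilities .6 - .5 + .2 = .3 in stratum S1 = 3, S2 = 2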

For example, if S_2 is a nationality indicator, we would assume that the reliability of answers given by people with nationality l = a is smaller by .20 than the reliability of answers provided by subjects with nationality l = b. Furthermore, the threshold at which a particular answer is given also differs across groups. In addition, the factor means (and variances) of subjects might differ across strata. This possibility is neglected here in order to keep the present simulation straightforward.

2.3 Introducing Selection Bias on S when Randomizing Treatment and Control Groups

Now let population members select into a treatment and a control group according to a simple probit selection model. For this purpose we transform S_1 and S_2 into dummy indicators (D_11, D_12, D_13) and (D_21, D_22). For individual i we define a latent selection variable (see also Table 3):

    T*_i = γ_0 + γ_12 D_12i + γ_13 D_13i + γ_22 D_22i + γ_31 D_12i D_22i + γ_32 D_13i D_22i + ζ_i      (9)

where ζ_i ~ N(0, 1), and define the treatment indicator M as:

    M_i = 0 if T*_i ≤ 0;   M_i = 1 if T*_i > 0.                          (10)

Model (9)-(10) implies that we do not assume that randomization was perfect. Rather, the background characteristics S are the known causes of selection bias.

Table 3: Parameters of the selection model

 Parameter   Value
 γ_0          0
 γ_12         1
 γ_13         2
 γ_22        -1
 γ_31         .5
 γ_32         .5
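A sketch of the selection step (9)-(10) with the parameters of Table 3, continuing the objects defined in the earlier sketches; the dummy coding and object names are assumptions of this example.

S1  <- sample(1:3, n, replace = TRUE, prob = c(.3, .3, .4))   # stratum proportions from section 2.2
S2  <- sample(1:2, n, replace = TRUE, prob = c(.5, .5))
D12 <- as.numeric(S1 == 2); D13 <- as.numeric(S1 == 3)        # dummy indicators for S1
D22 <- as.numeric(S2 == 2)                                    # dummy indicator for S2

gamma0 <- 0; gamma12 <- 1; gamma13 <- 2                       # Table 3
gamma22 <- -1; gamma31 <- .5; gamma32 <- .5
Tstar <- gamma0 + gamma12 * D12 + gamma13 * D13 + gamma22 * D22 +
         gamma31 * D12 * D22 + gamma32 * D13 * D22 + rnorm(n)  # equation (9), zeta_i ~ N(0, 1)
M <- as.numeric(Tstar > 0)                                     # equation (10): treatment indicator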

An analyst interested in assessing measurement equivalence over M can do so by means of a multi-group model (e.g. Millsap & Yun-Tein, 2004; Millsap, 2011):

    X*_j(M) = λ_j(M) W + ε_j(M)                                          (11)

    X_j = c | M   if   ν_{j,c}(M) < X*_j(M) ≤ ν_{j,c+1}(M)               (12)

In this model, the hypotheses

    H_01: λ_j(M=0) = λ_j(M=1)   for all j                                (13)

    H_02: ν_{j,c}(M=0) = ν_{j,c}(M=1)   for all j and c                  (14)

are evaluated jointly by imposing equality constraints on the parameters and evaluating global model fit. Model (11)-(12) can be estimated by mean- and variance-adjusted weighted least squares (WLSMV) as described in Muthén (1984) and Muthén, du Toit, & Spisic (1997). The treatment indicator M has no differential impact on the measurement model, which is why H_01 and H_02 should not be rejected. However, selection on S_1 and S_2 might introduce measurement non-equivalence, because the distribution of the selection variables is not equal across the groups defined by M under selection model (9)-(10). We therefore ask whether the test of measurement equivalence (13)-(14) can be improved (or adjusted) by applying techniques that balance the groups with respect to S_1 and S_2. How to do this is discussed in the next section.
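Before turning to the adjustment methods, the following sketch runs the unadjusted ('simple') equivalence test of (13)-(14). It uses the lavaan package's WLSMV estimator as a stand-in for the Mplus setup described in the text (the study itself estimated the models in Mplus 6; see section 4) and continues the earlier sketches; the item names x1-x4 are an assumption of this example.

library(lavaan)                              # stand-in here for Mplus/WLSMV

dat <- data.frame(X)                         # items generated in the sketch of section 2.1
names(dat) <- paste0("x", 1:4)
dat$M <- factor(M)                           # selection indicator from section 2.3

model <- 'W =~ x1 + x2 + x3 + x4'
fit <- cfa(model, data = dat, group = "M",
           ordered = paste0("x", 1:4), estimator = "WLSMV",
           group.equal = c("loadings", "thresholds"))   # imposes H01 (13) and H02 (14)
fitMeasures(fit, c("rmsea.scaled", "cfi.scaled"))        # compared against .05 and .95 in section 4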

3. Evaluation of Three Adjustment Methods

We evaluate the performance of three possible adjustment methods against the case of ignoring selection ('simple model'):

1. ANCOVA adjustment of the latent factor ('covariate adjustment'),
2. exact stratification on S_1 and S_2, and
3. inverse propensity score weighting ('IPW').

[Figure 1: Illustration of (a) a path model in the ANCOVA tradition and (b) exact stratification on all levels of S_1 by S_2 in a stratified multi-group model.]

Methods 1 and 2 are illustrated in Figure 1. ANCOVA adjustment is a classical way to control for group heterogeneity in incompletely randomized groups (e.g. Schafer & Kang, 2008). In the context of CFA models, modeling direct effects of the stratification variables on the latent factor (method 1, panel (a) in Figure 1) seeks to balance group heterogeneity, as suggested by Sörbom (1978; see Heerwegh & Loosveldt, 2011, for an application):

    W_i = β_0 + β_12 D_12i + β_13 D_13i + β_22 D_22i + β_31 D_12i D_22i + β_32 D_13i D_22i + η_i       (15)

Second, stratification on the exact strata defined by S_1 and S_2 (method 2, panel (b) in Figure 1) implies estimating all multiple-group model parameters conditional on the combinations of S (e.g. Rosenbaum, 2002; Morgan & Winship, 2007). In the present simulation this means estimating one multiple-group model for each of the s = 1, ..., 6 strata defined by S_1 and S_2:

    X*_j(M, S=s) = λ_j(M, S=s) W + ε_j(M, S=s)                           (16)

    X_j = c | M, S = s   if   ν_{j,c}(M, S=s) < X*_j(M, S=s) ≤ ν_{j,c+1}(M, S=s)     (17)

It is concluded that measurement equivalence holds conditional on S_1 and S_2 if H_01 and H_02 cannot be rejected in any of the strata defined by the combinations of both S variables.

Finally, inverse propensity score weighting (method 3) can be used to adjust for unequal selection probabilities of individual i into treatment (or control). Propensity scores are estimated from a probit model that follows the true model (9) (Rosenbaum & Rubin, 1983; Morgan & Winship, 2007; Guo & Fraser, 2010):

    P(M_i = 1 | S_1i, S_2i) = Φ(γ_0 + γ_12 D_12i + γ_13 D_13i + γ_22 D_22i + γ_31 D_12i D_22i + γ_32 D_13i D_22i)      (18)

where Φ denotes the standard normal distribution function. Let p̂_i be the propensity score estimate from (18); individual weights are then defined as

    w_i = M_i p̂_i^(−1) + (1 − M_i)(1 − p̂_i)^(−1)                        (19)

that is, w_i = 1/p̂_i if M_i = 1 and w_i = 1/(1 − p̂_i) if M_i = 0. The implementation of these selection weights in the estimation of model (11)-(12) with WLSMV is described in Asparouhov (2005).
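A sketch of methods 2 and 3 in R, continuing the objects from the earlier sketches; the probit specification mirrors the true selection model (9), and supplying the resulting weights to the SEM estimation (e.g. via the Mplus WEIGHT option; cf. Asparouhov, 2005) is assumed rather than shown.

## Method 3: probit propensity model as in (18), then inverse propensity weights (19)
ps_fit <- glm(M ~ D12 + D13 + D22 + D12:D22 + D13:D22,
              family = binomial(link = "probit"))
p_hat  <- fitted(ps_fit)                                 # estimated propensity scores
ipw    <- ifelse(M == 1, 1 / p_hat, 1 / (1 - p_hat))     # equation (19)

## Method 2: exact stratification, i.e. the multi-group model fitted within each
## S1-by-S2 cell; sparse cells may fail to estimate, as discussed in section 4
fits_by_stratum <- lapply(split(dat, interaction(S1, S2)), function(d)
  try(cfa(model, data = d, group = "M", ordered = paste0("x", 1:4),
          estimator = "WLSMV", group.equal = c("loadings", "thresholds"))))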

4. Results from a Monte-Carlo Simulation

A Monte-Carlo simulation with 1,000 replications and a sample size of n = 3,000 was conducted. Data were simulated in the statistical programming environment R 2.14. Models were estimated in the statistical software Mplus 6, run from R using the package MplusAutomation. To evaluate H_01 and H_02 jointly, the parameters λ_j(M), ν_{j,c}(M), and θ_j(M) were constrained to be equal across M. Fit was evaluated with the RMSEA criterion (root mean square error of approximation),

    reject H_01 and H_02 if RMSEA > .05,                                 (20)

and the CFI criterion (comparative fit index),

    reject H_01 and H_02 if CFI < .95.                                   (21)

We found that all unadjusted ('simple') models had insufficient fit, leading to false rejection of the measurement equivalence (MI) hypothesis with respect to the groups defined by M in all replications (Table 4). Covariate adjustment on the latent factor improves model fit (mean RMSEA = .053, CFI = .916) but still leads to rejection of the MI hypotheses in 71.6% of cases based on RMSEA and in all cases based on the CFI criterion. Exact stratification on all six strata, with separate evaluation of the multi-group models in each stratum, is more effective than covariate adjustment. During estimation, however, the conditioning technique posed new problems due to data sparseness in some of the group-by-stratum combinations.

Table 4: MC distribution of CFA model fit statistics with results of hypothesis tests (in %)

                          Simple        Covariate     Stratification   IPW
RMSEA (mean/sd)           .129 (.008)   .053 (.004)   .014 (.018)*     .013 (.009)
% MI not rejected         0             28.4          83.8             100
CFI (mean/sd)             .864 (.017)   .916 (.014)   .994 (.031)*     .993 (.007)
% MI not rejected         0             0             0.5              100
Successful estimations    1000          1000          3325 of 6000     1000

* over all successful replications

To see this, consider Table 5, which shows the performance of the hypothesis tests in all six strata. The parameterization of selection model (9) evidently makes the distribution of M in the strata with S_1 = 3 very skewed (γ_13 = 2; i.e., few observations with M = 0). While the adjustment method works well in strata with a high number of successful replications (i.e. those with sufficient group sizes), it functions badly in the two strata associated with S_1 = 3. It was, furthermore, postulated (cf. section 3) that the adjustment for selection would only be considered successful if equivalence was produced in all six strata. This was, however, only found in 83.8% of replications based on RMSEA and a mere 0.5% of replications based on CFI (only successful model fits were used in these two statistics). This suggests a multiple testing problem, because taken separately for each stratum (Table 5) the conditioning technique works satisfactorily if the strata are of sufficient size.

In sum, the conditioning technique may suffer from data sparseness and multiple testing problems.

Table 5: Fit statistics per stratum for the stratification adjustment method (in %)

                          S_1=1,S_2=1   S_1=2,S_2=1   S_1=3,S_2=1   S_1=1,S_2=2   S_1=2,S_2=2   S_1=3,S_2=2
RMSEA (mean/sd)           .014 (.018)   .013 (.018)   .016 (.018)   .015 (.020)   .013 (.018)   n/a
% MI not rejected         95.1          94.5          96.8          93.6          96.1          100
CFI (mean/sd)             .998 (.003)   .989 (.020)   .898 (.151)   .999 (.001)   .997 (.006)   .822 (n/a)
% MI not rejected         100           92.7          54.8          100           100           0
Successful estimations    1000          439           93            968           824           1

Finally, consider the performance of the inverse propensity score weighting (IPW) adjustment (Table 4). Mean RMSEA and CFI suggest good fit, and measurement equivalence is not rejected in any of the replications; that is, RMSEA < .05 and CFI > .95 in all replications after weighting with inverse propensity scores. Note that this finding holds despite the small strata sizes discussed for the conditioning adjustment technique. Since IPW, furthermore, requires only one statistical test, it appears superior to stratification in the current simulation.

5. Conclusions and Outlook

Our simulation demonstrated that working with non-adjusted CFA models when testing for measurement equivalence across experimental groups is prone to false conclusions under two conditions: first, observed or unobserved covariates determine individual selection into treatment and control groups; second, there is measurement non-equivalence across the classes of these variables. In the presence of selection bias in measurement equivalence tests, adjusting for bias on observed covariates is a necessity. Our results demonstrate, however, that not all of the methods available in the literature perform equally well. In particular, ANCOVA adjustment on the latent trait performed very weakly in the present simulation and is therefore not recommended. The reasons for this weak performance are related to the locus of non-equivalence in the present data. We assumed that the strata of the stratification variables did not differ in the means and variances of the latent factor, but rather with respect to thresholds and item reliabilities. The ANCOVA adjustment, however, acts only on the expectation and variance of the latent factor and does not control for the true sources of non-equivalence. These are taken into account by exact stratification and propensity score weighting.

Given the sample size of the present simulation (n = 3,000) and six strata, cell sparseness in a few strata coincided with false test results. In situations with even more stratification variables or fewer observations, this problem is likely to become even more serious. Exact stratification is therefore not recommended in such situations, an observation that is well known from the literature on adjustment by exact stratification (e.g. Rosenbaum, 2002; Morgan & Winship, 2007). The propensity score combines the information on all stratification variables into a single score, thereby addressing cell sparseness problems effectively. Consequently, inverse propensity weights performed exceedingly well in adjusting for selection bias in the present simulation, and from the present results we conclude that IPW is the method of choice. This conclusion has to be weighed against the specific limitations of the present simulation design (e.g. parameterization, sample size, number of stratification variables) as well as further options for adjusting for selection bias that were not considered here. These include, in particular, other methods based on the propensity score, such as propensity score matching and stratification. Furthermore, the present study assumed that all bias is overt, that is, that full information was available on both stratification covariates. In practical situations there might be bias caused by hidden covariates, and propensity score models have been shown to be misleading in that situation. Doubly robust methods using both covariate and propensity adjustment might then prove beneficial. These aspects should be assessed in further simulations.

6. References

Alwin, D. F. (2007). Margins of Error. Hoboken: Wiley.

Asparouhov, T. (2005). Sampling weights in latent variable modeling. Structural Equation Modeling: A Multidisciplinary Journal, 12(3), 411–434. doi:10.1207/s15328007sem1203_4

Bollen, K. A. (1989). Structural Equations with Latent Variables. New York: Wiley.

Guo, S., & Fraser, M. W. (2010). Propensity Score Analysis. Thousand Oaks: Sage.

Heerwegh, D., & Loosveldt, G. (2011). Assessing mode effects in a national crime victimization survey using structural equation models: Social desirability bias and acquiescence. Journal of Official Statistics, 27(1), 49–63.

Millsap, R. E. (2011). Statistical Approaches to Measurement Invariance. New York: Routledge.

Millsap, R. E., & Yun-Tein, J. (2004). Assessing factorial invariance in ordered-categorical measures. Multivariate Behavioral Research, 39(3), 479–515.

Morgan, S. L., & Winship, C. (2007). Counterfactuals and Causal Inference. Cambridge: Cambridge University Press.

Muthén, B. (1984). A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika, 49(1), 115–132.

Muthén, B., du Toit, S. H. C., & Spisic, D. (1997). Robust inference using weighted least squares and quadratic estimating equations in latent variable modeling with categorical and continuous outcomes. Retrieved from http://www.gseis.ucla.edu/faculty/muthen/articles/article_075.pdf

Muthén, B. O., & Asparouhov, T. (2002). Latent variable analysis with categorical outcomes: Multiple-group and growth modeling in Mplus. Muthén & Muthén. Retrieved from http://www.statmodel.com/download/webnotes/catmglong.pdf

Rosenbaum, P. R. (2002). Observational Studies (2nd ed.). New York: Springer.

Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55.

Schafer, J. L., & Kang, J. (2008). Average causal effects from nonrandomized studies: A practical guide and simulated example. Psychological Methods, 13(4), 279–313.

Sörbom, D. (1978). An alternative to the methodology for analysis of covariance. Psychometrika, 43(3), 381–396.

Steenkamp, J.-B. E. M., & Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25(1), 78–90.

Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3(1), 4–70.