Running head: SELECTION OF AUXILIARY VARIABLES

Selection of auxiliary variables in missing data problems: Not all auxiliary variables are created equal

Felix Thoemmes, Cornell University
Norman Rose, University of Tuebingen

Author Note
The authors would like to thank the participants of the colloquium of the Methodology Center at Pennsylvania State University. Inquiries about this article should be addressed to the first author: Felix Thoemmes, MVR G62A, Cornell University, Ithaca, NY 14853, felix.thoemmes@cornell.edu.

Abstract

The treatment of missing data in the social sciences has changed tremendously during the last decade. Modern missing data techniques such as multiple imputation and full-information maximum likelihood are used much more frequently. These methods assume that data are missing at random. One very common approach to increase the likelihood that missing at random is achieved is to include many covariates as so-called auxiliary variables. These variables are included either based on data considerations or in an inclusive fashion, i.e., taking all available auxiliary variables. However, neither approach accounts for the fact that, under a wide range of circumstances, there is a class of variables that, when used as auxiliary variables, will always increase bias in the estimation of parameters from data with missing values. In this paper we show that this bias exists, quantify it in a simulation study, and discuss possible ways to avoid selecting bias-inducing covariates as auxiliary variables.

Keywords: missing data, auxiliary variables, multiple imputation, full information maximum likelihood

Selection of auxiliary variables in missing data problems: Not all auxiliary variables are created equal

Introduction

The presence of missing data is a prevalent problem in social science research (Peugh & Enders, 2004). Given that a large portion of social science studies are conducted outside the confines of a laboratory, the threat of missing data due to non-compliance or attrition is even more pronounced. The pervasiveness of this problem has triggered much research during the last 30 years. Rubin (1976) laid the foundation of modern missing data theory, which has culminated in sophisticated methods to deal with missing values, specifically the use of full-information maximum likelihood (FIML) and multiple imputation (MI). For an overview see, e.g., Enders (2010). Both of these so-called modern missing data techniques are expected to yield unbiased estimates of parameters in the presence of missing data, given that certain assumptions about missingness hold. It should be noted that MI in particular, while conceptually straightforward (Rubin, 1996), can be conducted with various different techniques; see, e.g., Schafer (1999), King, Honaker, Joseph, and Scheve (2001), van Buuren and Groothuis-Oudshoorn (2011), or Raghunathan, Lepkowski, Hoewyk, and Solenberger (2001). However, despite computational differences, all techniques, whether FIML or variants of MI, rely on the same untestable assumptions, notably the missing at random (MAR) assumption (Rubin, 1976), which we will define more formally later in the manuscript. The goal of this paper is to critically examine current recommendations to increase the plausibility of MAR, especially with regard to the selection of auxiliary variables.
We argue that the current recommendations are incomplete and simply ignore the possibility of complex relationships between substantive analysis variables and variables that are used solely to improve the missing data estimation, so-called auxiliary variables. Further, we believe that the complexities of the assumptions are not widely appreciated among social science researchers and quantitative scientists alike, many of whom have long believed that including as many auxiliary variables as possible is a safe strategy

to asymptotically achieve or approximate unbiasedness. We will show in a small example and a larger simulation study that this strategy is not guaranteed to yield unbiased results and that biases due to missing data and the use of auxiliary variables are much more complex than previously thought. As a result, the use of modern missing data techniques, while laudable, often does not guarantee that bias in studies with missing data has been adequately dealt with. We will first review the classic missingness mechanisms, discuss which conditional independencies these conditions imply, and show how these independencies can be encoded in a graph. Further, we demonstrate that there are situations and classes of variables that should not be used as auxiliary variables in FIML or MI, as they tend to increase bias. We will quantify the bias in our simulation studies and suggest possible ways to avoid it. Finally, we will discuss implications for applied research and offer an alternative framework to think about and communicate assumptions of missing data problems.

Missing data mechanisms

We begin by reviewing the classic mechanisms defined by Rubin (1976): missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In our overview we use a slightly modified version of the notation employed by Schafer and Graham (2002). In addition, we also express missingness mechanisms using conditional independence statements. In conjunction with the conditional independence statements, we present graphical displays to illustrate the mechanisms. Using graphs to illustrate how missingness relates to other variables in a model is not a novel approach and has in fact been used in popular texts and articles to aid understanding of the mechanisms (Enders, 2010; Schafer & Graham, 2002). In this paper, however, we do not use graphs simply as illustrations, but also use formal graph theory (Pearl, 2000) to derive certain results.

MCAR

Following the notation of Schafer and Graham (2002), we denote an N × K data matrix Y. The rows of Y represent the cases n = 1, ..., N of the sample and the columns represent the variables i = 1, ..., K. Y can be partitioned into an observed part, labeled Y_obs, and a missing part, Y_mis, which yields Y = (Y_obs, Y_mis). Further, we denote an indicator matrix of missingness, R, whose elements take on values of 0 or 1, for observed or missing values of Y, respectively. Accordingly, R is also an N × K matrix. Each variable in Y can therefore have both observed and unobserved values. Missing completely at random (MCAR) is the most restrictive assumption but, when fulfilled, the least problematic. It states that the unconditional distribution of missingness P(R) is equal to the conditional distribution of missingness given Y_obs and Y_mis, or simply Y:

P(R | Y) = P(R | Y_obs, Y_mis) = P(R).   (1)

These equalities of probabilities imply (can be expressed as) conditional independence statements, here in particular

R ⊥ (Y_obs, Y_mis).   (2)

The MCAR condition is therefore fulfilled when the missingness has no relationship with either the observed or the unobserved part of Y. In an applied research context we could imagine MCAR being fulfilled if the missing data arose from a purely accidental (random) process, like dropping a single sheet from a questionnaire. In other words, the probability of missingness is related only to factors that are completely unrelated to any other variable in the model. MCAR is rare in applied research and usually does not hold, unless it has been planned by the researcher in so-called missingness-by-design studies (Graham, Taylor, Olchowski, & Cumsille, 2006). When MCAR holds, even simple techniques like listwise deletion will yield unbiased estimates (Enders, 2010), even though it might still not be advisable to use these simple methods due to the loss in statistical power.
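The point that listwise deletion is unbiased under MCAR can be illustrated with a small simulation. This sketch is our own; the model (Y = 0.5X + noise) and the deletion probability are illustrative assumptions, not taken from the paper:

```python
import random
import statistics

random.seed(1)

# Generate a complete sample in which X has an effect on Y.
n = 100_000
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]

# MCAR: every Y value is deleted with the same probability (0.3),
# independent of X, Y, and everything else in the model.
observed = [yi for yi in y if random.random() > 0.3]

full_mean = statistics.mean(y)
cc_mean = statistics.mean(observed)  # complete-case (listwise deletion) mean

print(f"full-data mean: {full_mean:.3f}, complete-case mean: {cc_mean:.3f}")
```

Under MCAR the two estimates agree up to sampling error; the only cost of listwise deletion here is the reduced sample size, i.e., a loss of power.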
As Gelman and Hill (2007) described and Raykov (2011) more formally showed, MCAR cannot be tested

empirically, and homogeneity of means, variances, or, more generally, distributions of observed variables across missing data patterns constitutes only necessary, but not sufficient, evidence for MCAR. The inability to directly test MCAR can also be seen from the fact that it posits independence assumptions about quantities that are by definition unobserved, here in particular Y_mis. Before we proceed further, it is necessary to address the graphical displays that we will be using. First, they are constructed as so-called directed acyclic graphs (Pearl, 2000), which we will abbreviate as DAGs. DAGs are widely used in epidemiology (Greenland, Pearl, & Robins, 1999; Hernán, Hernández-Díaz, Werler, & Mitchell, 2002; VanderWeele & Robins, 2007; VanderWeele & Shpitser, 2011), medicine (Merchant & Pitiphat, 2002; Shrier & Platt, 2008), computer science (Koller & Friedman, 2009; Pearl, 1995, 2000; Textor & Liśkiewicz, 2011) and other fields. They have also been used to examine missing data situations (Daniel, Kenward, Cousens, & De Stavola, 2011). Researchers who are familiar with structural equation models (SEM) will find DAGs familiar as well; however, there are some differences (for a complete overview of the differences refer to Shadish and Sullivan (2012)). Briefly explained, in a DAG we use the ε terms, the so-called disturbance terms, to denote all unmeasured variables that may have an effect on the variable that is endowed with this ε term. Note that these disturbance terms are not identical to regression residuals, which are by definition uncorrelated with the variables that were used to predict the variable with the ε term. Further, the DAG is completely non-parametric and encodes conditional independencies among the variables displayed.
Precisely because of this ability to encode conditional independencies, DAGs are well suited to express missing data mechanisms (which can be expressed as such conditional independencies, as we have shown earlier in the example of MCAR). We will use DAGs to express the conditional independencies that are prescribed by the different missingness mechanisms and, in doing so, show how novel insights about missingness problems can be gained. In Figure 1 we present a graphical display of MCAR for the simple case in which a

single variable X has an effect on a unidimensional variable Y. In this simple case, X is completely observed and only Y suffers from missingness. Whether data on Y are missing is encoded by the indicator R_Y in the graph. We use an additional subscript for R here to denote that this missingness indicator pertains only to variable Y. Note that we could have visually partitioned Y in the graph into Y_obs and Y_mis, but for clarity we simply denote it as Y. In this example, equation 2, which expresses the condition that needs to hold for MCAR, can be written as R_Y ⊥ (X, Y). Independence relations in DAGs are expressed as so-called d-separation statements. d-separation is a graphical criterion that can be applied to DAGs to infer independence relations among variables. In short, if two variables are d-separated, there exists no traceable, unblocked path in the diagram between the variables. Conversely, if two variables are d-connected, there exists a traceable and unblocked path between the variables. A traceable path is defined as any path that connects two variables in a graph; it is not important for the definition of a path whether the segments of the path have arrows pointing in one or the other direction. To examine d-separation one examines whether all paths are open or blocked. A path is blocked if one conditions on a variable in the path that acts as a mediator, i.e., takes on the form → X →, or that is an arrow-emanating variable (a common cause), i.e., takes on the form ← X →. Further, a path is blocked if one does not condition on a variable that has two arrows pointing into it, i.e., takes on the form → X ←. Such a variable is usually called a collider variable (Pearl, 2000). If two variables are d-connected, there exists at least one traceable path between them that has not been blocked. Being d-connected implies that the two variables are stochastically dependent on each other.
Pearl (2000) has provided a proof that variables in a graph that are d-separated are stochastically independent of each other, regardless of the functional form of the relationships among the variables in the graph. For a more thorough introduction to d-separation for social scientists, consult Hayduk et al. (2003) or the original text by Pearl (2000).
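The d-separation criterion can also be checked algorithmically. The sketch below is our own illustration (the node names follow the figures, but the implementation is not from the paper); it uses the standard moral-graph criterion: restrict the DAG to the ancestors of the variables involved, "marry" co-parents of a common child, drop edge directions, delete the conditioning set, and test whether the two variables remain connected:

```python
from collections import deque

def ancestors(graph, nodes):
    """All nodes with a directed path into `nodes`, including `nodes` themselves.
    `graph` maps each node to the set of its parents."""
    result, stack = set(nodes), list(nodes)
    while stack:
        for p in graph.get(stack.pop(), set()):
            if p not in result:
                result.add(p)
                stack.append(p)
    return result

def d_separated(graph, x, y, given):
    """True if x and y are d-separated given the set `given` in the DAG."""
    keep = ancestors(graph, {x, y} | set(given))
    # Build the undirected moral graph on the ancestral subgraph.
    adj = {v: set() for v in keep}
    for child in keep:
        parents = graph.get(child, set()) & keep
        for p in parents:                  # parent-child edges
            adj[p].add(child)
            adj[child].add(p)
        for p in parents:                  # "marry" co-parents of a collider
            for q in parents:
                if p != q:
                    adj[p].add(q)
    # Remove conditioned-on nodes, then search for a path from x to y.
    blocked = set(given)
    seen, queue = {x}, deque([x])
    while queue:
        for nb in adj[queue.popleft()] - blocked:
            if nb not in seen:
                if nb == y:
                    return False           # still connected: d-connected
                seen.add(nb)
                queue.append(nb)
    return True

# MCAR graph of Figure 1: X -> Y, with R_Y caused only by its disturbance.
mcar = {"Y": {"X"}, "R_Y": set(), "X": set()}
print(d_separated(mcar, "Y", "R_Y", set()))  # True: unconditionally independent

# MAR graph of Figure 2 (a): X -> Y and X -> R_Y.
mar = {"Y": {"X"}, "R_Y": {"X"}, "X": set()}
print(d_separated(mar, "Y", "R_Y", set()))   # False: d-connected via X
print(d_separated(mar, "Y", "R_Y", {"X"}))   # True: MAR holds given X
```

The same routine reproduces collider behavior: in A → C ← B, A and B are d-separated unconditionally but become d-connected once one conditions on C.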

In the graph in Figure 1 we can see that there is only a single arrow pointing to R_Y, from the disturbance term ε_R, meaning that missingness arises only due to unobserved factors. Further, these unobserved factors have no association with any other variable or disturbance term in the model, as can be seen from the fact that ε_R is unassociated with other parts of the model. In this graph, there is no traceable path between Y and R_Y (or X and R_Y); they are said to be d-separated without having to condition on any other variables, implying unconditional stochastic independence between the variables Y and R_Y (as defined in equation 1), and therefore the missing data mechanism is MCAR. So far we have used the expression "to condition on"; in the context of missing data problems, this corresponds to observing a variable and using it in a FIML or multiple imputation model.

MAR

A somewhat less restrictive condition is missing at random (MAR). MAR states that the conditional distribution of missingness given the observed part Y_obs is equal to the probability of missingness given both the observed and the unobserved part (Y_obs, Y_mis):

P(R | Y) = P(R | Y_obs, Y_mis) = P(R | Y_obs).   (3)

These equalities of probabilities again imply (can be expressed as) conditional independence statements, here in particular

R ⊥ Y_mis | Y_obs.   (4)

In words, MAR states that the missingness is stochastically independent of the unobserved variables, whereas dependencies between observed variables and missingness are allowed. In an applied research context, we could imagine that missingness is caused by certain observed variables that may also have an effect on important analysis variables. For example, missingness on an achievement measure could be caused by motivation (or lack thereof). Further, we can assume that motivation also has an effect on achievement. MAR is an important condition because, when it holds, modern estimation techniques (MI and FIML)

yield unbiased results. Just as with MCAR, MAR cannot be tested empirically, as it also posits conditional stochastic independence assumptions among quantities that are by definition unobserved, specifically Y_mis. Returning to the example with variable X, the unidimensional variable Y, and the respective missingness indicator R_Y, the MAR condition (see equation 4) implies the conditional stochastic independence R_Y ⊥ Y | X. In Figure 2 (a) we show the simple situation in which MAR holds. In this figure, Y and R_Y are d-connected via the path Y ← X → R_Y. However, if one conditions on X, this path becomes blocked and Y and R_Y are now d-separated, implying the conditional stochastic independence R_Y ⊥ Y | X, as similarly defined in equation 4; therefore MAR holds, as long as one has observed X and uses it in the estimation of Y in a FIML framework, or uses it as a predictor variable in an MI framework. Often, researchers use variables to predict missingness that may not be of substantive interest. Such variables are usually called auxiliary variables, because they are not of theoretical interest to the applied researcher but aid in the estimation of the missing data. In the second graphical example, in Figure 2 (b), we explicitly describe an auxiliary variable and how it can help to create conditional independence between the missingness and the variable with missing values, thereby implying MAR. We use the same set of variables as in Figure 2 (a), but introduce a new variable A, which in this example must be used as an auxiliary variable to obtain an unbiased estimate of the relationship of interest between X and Y in the presence of missing data on Y. In Figure 2 (b), Y and R_Y are d-connected via the path Y ← A → R_Y and via the path Y ← X → R_Y.
However, if one conditions on A, the first path becomes blocked, and if one conditions on X, the second path becomes blocked; Y and R_Y are then d-separated, implying the conditional stochastic independence R_Y ⊥ Y | (A, X), and therefore MAR holds. Note that A in the graph could also be a multidimensional set of variables that all exhibit the same structure.
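A small simulation in the spirit of Figure 2 (a) makes the practical consequence of MAR concrete. This is our own sketch with illustrative numbers; for simplicity it conditions on X via a single deterministic regression imputation, whereas real MI would add residual draws and combine several imputed datasets:

```python
import random
import statistics

random.seed(2)
n = 100_000

# X -> Y (the analysis relationship) and X -> R_Y (the missingness).
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]

# MAR: Y is missing with probability 0.8 when X > 0, else 0.2;
# missingness depends on the fully observed X only.
miss = [random.random() < (0.8 if xi > 0 else 0.2) for xi in x]

obs_x = [xi for xi, m in zip(x, miss) if not m]
obs_y = [yi for yi, m in zip(y, miss) if not m]

# Complete-case mean: biased, because high-X (hence high-Y) cases drop out.
cc_mean = statistics.mean(obs_y)

# Conditioning on X: fit Y ~ X on the observed cases, fill in predictions
# for the missing cases, and average. This exploits R_Y ⊥ Y | X.
mx, my = statistics.mean(obs_x), statistics.mean(obs_y)
slope = (sum((a - mx) * (b - my) for a, b in zip(obs_x, obs_y))
         / sum((a - mx) ** 2 for a in obs_x))
intercept = my - slope * mx
filled = [yi if not m else intercept + slope * xi
          for xi, yi, m in zip(x, y, miss)]
imp_mean = statistics.mean(filled)

print(f"true mean ~ 0 | complete-case: {cc_mean:.3f} | conditioned on X: {imp_mean:.3f}")
```

The complete-case estimate is noticeably negative, while the estimate that conditions on X recovers the true mean up to sampling error.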

MNAR

Finally, missing not at random (MNAR) is the least stringent assumption, but the most problematic, as even FIML and MI will typically (though not always for all parameter estimates) yield biased results. MNAR is characterized by the probability of missingness being dependent on both the observed part, Y_obs, and the unobserved part, Y_mis. That is,

P(R | Y_obs, Y_mis) ≠ P(R | Y_obs).   (5)

No conditional independencies are implied by equation 5. In an applied research context, we could consider different ways that MNAR could arise. One situation would be if missingness were caused by the variable with missing data itself, e.g., participants with a very high income are more likely to not report their income. This situation is depicted in Figure 3 (a), in which Y and R_Y are directly connected by a path. Y and R_Y are said to be d-connected through the direct path Y → R_Y. Two adjacent, connected variables in a graph can never be d-separated. Hence, no conditional stochastic independence can arise, and MNAR is present. A similar MNAR situation would arise when an unobserved variable has an effect on both the missingness R_Y and Y. In an applied research context, this could happen whenever a variable that influences missingness also has an effect on analysis variables, but the variable has not been measured and is therefore omitted. This omitted variable can be displayed as a latent, unobserved variable in the graph, or simply as correlated disturbance terms. Figure 3 (b) displays such a situation in which an omitted variable influences both Y and R_Y. Here, Y and R_Y are d-connected via the path Y ← L_1 → R_Y. This path cannot be blocked via conditioning, because no observed variables reside in the middle of the path. Again, no conditional stochastic independence can be achieved through conditioning, and MNAR holds.
Note that the variable L_1 in the graph should not be confused with a modeled latent variable in a SEM; rather, it is a simple depiction of an unobserved variable. To make this clear, we deviate slightly from the regular symbolic language of DAGs and SEM graphs and use a dashed outline for the unobserved variable.
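The income example of Figure 3 (a) can also be put into a short simulation. This is our own sketch with illustrative numbers; it shows the direct path Y → R_Y at work:

```python
import random
import statistics

random.seed(3)
n = 100_000

x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]   # "income", driven partly by X

# MNAR: high values of Y itself are more likely to be missing
# (e.g., high earners decline to report their income).
miss = [random.random() < (0.8 if yi > 1 else 0.1) for yi in y]

obs_y = [yi for yi, m in zip(y, miss) if not m]
true_mean = statistics.mean(y)
cc_mean = statistics.mean(obs_y)

print(f"true mean: {true_mean:.3f}, complete-case mean: {cc_mean:.3f}")
# The complete-case mean is biased downward, and because the direct path
# Y -> R_Y cannot be blocked by any observed variable, conditioning on X
# (e.g., in MI or FIML) cannot fully remove this bias either.
```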

Equivalence of missing data mechanisms and graphs

In the previous section we showed how the classic missingness mechanisms can be expressed via graphs that encode conditional independencies, and we applied the graph-theoretic concept of d-separation. In summary, when a variable Y and its associated missingness indicator R_Y are d-connected, MNAR holds and bias will typically emerge. If Y and R_Y can be d-separated using some set of other observed variables, then MAR holds, and parameters related to Y can be estimated without bias when using methods that rely on MAR (FIML, MI) and including the variables needed to d-separate Y and R_Y in the imputation or analysis model, respectively for MI and FIML. A special case arises when Y and R_Y are d-separated given no other variables (unconditionally independent), which maps onto the classic MCAR condition. As we shall see, relying on the graph-theoretic concept of d-separation will allow us to further determine whether any given auxiliary variable is needed to achieve d-separation of Y and R_Y, or whether a variable would in fact make these two variables d-connected and induce conditional dependencies. We believe that herein lies an important advantage of using graphical models: we can easily spot auxiliary variables that may be bias-reducing or, as we will show, bias-inducing, something that is not apparent when relying on the classic conditional independence notation that has been used to describe the missing data mechanisms.

Current approaches

While all assumptions of the missingness mechanisms are important, insofar as they prescribe which methods will yield biased or unbiased estimates, MAR is the assumption that is necessary for the two missing data approaches considered state-of-the-art, FIML and MI. A pertinent question is therefore how a researcher can achieve MAR, or at least make MAR plausible, in his or her study.
As seen in equations 3 and 4 and in the accompanying graphs, it is necessary to include in the imputation or FIML model all variables that make Y and R_Y independent of each other. In other words, researchers need to

capture all variables that they believe have a direct or indirect effect on the probability of being missing and, at the same time, a direct or indirect effect on the variable with missing data. Some of these variables might already be part of the analytic model; others might not be part of the analytic model but might be needed to satisfy the MAR assumption, i.e., auxiliary variables. We now describe current approaches that aim to achieve MAR and present an example that illustrates potential problems with these approaches.

Inclusive approach

The so-called inclusive approach (Collins, Schafer, & Kam, 2001) to achieving MAR directs researchers to include many auxiliary variables in their imputation model (or in their FIML estimation, following guidelines by Graham (2003)). The reasoning behind the inclusive strategy is as follows: if many variables are included, it becomes less likely that variables that are causes of both the missingness and the analytic variables with missing data are omitted. Such an omission would be harmful, as it would destroy the conditional independence posited in MAR and induce bias. Collins et al. (2001) showed that bias in means, variances, and regression estimates can be substantial if this kind of variable is omitted. A second rationale for adopting an inclusive strategy is that the inclusion of variables that may not be causes of the missingness or causes of the analytic variables with missing data was shown to be "far from being harmful[,]...at worst neutral, and at best extremely beneficial" (Collins et al., 2001, p. 349). In particular, Collins et al. (2001) examined the influence of including variables that are completely uncorrelated with missingness and the analytic variables with missing data (so-called "trash variables"), or related only to analytic variables with missing data but not to the missingness itself.
Completely uncorrelated variables did not have any impact on bias, and variables that were correlated only with Y were shown to be able to attenuate bias in MNAR situations and to reduce standard errors.

Data-driven approach

Even if one fully acknowledges the benefits of an inclusive strategy, such a strategy can reach its limits, especially when applied to large-scale datasets, which may contain hundreds of variables. If analytic models include many variables and many auxiliary variables are added, both MI and FIML will likely encounter problems in the convergence of models. To mitigate this problem, it has been suggested to screen the data when selecting variables as auxiliaries. Schafer (1997) suggests that variables make good candidates for auxiliary variables if they are related to the missingness or to the analytic variable that exhibits missingness. The rationale behind this advice is straightforward: a variable that is completely uncorrelated with the probability of missingness cannot induce any dependencies between R_Y and Y. Likewise, a variable that is completely uncorrelated with the analysis variable with missing values can also not induce any dependencies between R_Y and Y. As a demonstration of this principle, consider Figure 4, in which three auxiliary variables A_1, A_2, and A_3 are added to a model in which X d-connects Y and R_Y via Y ← X → R_Y and A_1 d-connects Y and R_Y via Y ← A_1 → R_Y. The two variables A_2 and A_3 do not d-connect Y and R_Y, and conditioning on them is therefore not needed to render Y and R_Y conditionally independent and hence fulfill the MAR condition. Simply using X and A_1 is sufficient in this example.¹ The data-driven approach advises us to screen our set of potential auxiliary variables as to whether they are related (usually examined using correlations) to any of the analysis variables or to any of the missing value indicator variables. Variables that are related to either or both should be included as auxiliary variables, while variables that fall below a certain correlation threshold for either should not be used.
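The screening step just described can be sketched as follows. This is our own illustration: A1 is assumed to cause both Y and the missingness (like A_1 in Figure 4), A2 is a pure "trash" variable, and the ±.1 cutoff is the threshold discussed below; note that the correlation with Y can only be computed on the observed part of Y:

```python
import random
import statistics

def pearson(a, b):
    """Population Pearson correlation of two equal-length lists."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return cov / (statistics.pstdev(a) * statistics.pstdev(b) * len(a))

random.seed(4)
n = 20_000
a1 = [random.gauss(0, 1) for _ in range(n)]   # causes Y and the missingness
a2 = [random.gauss(0, 1) for _ in range(n)]   # "trash": unrelated to everything
y = [v + random.gauss(0, 1) for v in a1]
r = [1 if random.random() < (0.7 if v > 0 else 0.2) else 0 for v in a1]  # 1 = missing

threshold = 0.1   # the +/- .1 rule discussed in the text

for name, aux in {"A1": a1, "A2": a2}.items():
    r_cor = pearson(aux, r)
    # Correlation with Y is only computable on the observed cases.
    y_obs = [yi for yi, ri in zip(y, r) if ri == 0]
    aux_obs = [ai for ai, ri in zip(aux, r) if ri == 0]
    y_cor = pearson(aux_obs, y_obs)
    keep = abs(r_cor) > threshold or abs(y_cor) > threshold
    print(f"{name}: cor with R_Y = {r_cor:+.2f}, with observed Y = {y_cor:+.2f}"
          f" -> {'include' if keep else 'drop'}")
```

In this toy dataset the screen correctly flags A1 for inclusion and drops A2; the paper's point, developed next, is that such purely correlational screens can also admit variables that should be excluded.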
¹ Note that if the disturbance terms of A_2 and A_3 were correlated (e.g., due to an unobserved variable that has a relationship to both of these variables), an active path Y ← A_2 ← ε_{A_2} ↔ ε_{A_3} → A_3 → R_Y would be present, which could be blocked by conditioning on A_2, on A_3, or on both. Hence at least one of these variables would need to be included in a FIML or imputation model.

Particular guidelines on the inclusion and exclusion of auxiliary variables were formulated by Van Buuren, Boshuizen, Knook, et al.

(1999), who recommend including a variable if its correlation with either the missingness or the variable with missing data exceeds ±.1 (or any other chosen threshold; e.g., Enders (2010) suggests correlations with the analysis variables greater than ±.4). The implicit assumption is that variables correlated below the chosen threshold will have little power to induce any dependencies, while variables correlated above it are assumed capable of inducing such dependencies, and hence bias, in the estimation of parameters in the presence of missing data. Generally, the advice to include auxiliary variables in missing data problems is sound and has been shown to be useful in both simulation studies (Collins et al., 2001) and theoretical work (Schafer, 1997). However, both the inclusive strategy and the data-driven approach ignore the possibility that certain instances and classes of variables should not be used as auxiliary variables, because they induce bias in the estimation of parameters in the presence of missing data by destroying the conditional independence between Y and R_Y, hence violating MAR. We now turn to these situations and variables and show, using illustrative examples and simulations, that this bias can become potentially large if ignored.

Bias-enhancing auxiliary variables

Consider first a simplified illustrative example of a single variable Y with missing data, a missing data indicator R_Y, and two potential auxiliary variables A_1 and A_2 that are at the disposal of the applied researcher. In addition, two unobserved variables L_1 and L_2 are part of the true data-generating model. The full model is displayed in Figure 5. An initial reaction to this model might be that the unobserved variables L_1 and L_2 make this an MNAR situation and that some bias would be expected and unsurprising. However, the situation is more subtle.
Variable A_1 indeed induces conditional dependencies between Y and R_Y via the path Y ← A_1 → R_Y and therefore biases the estimates of Y in the presence of missing data. Accordingly, if one uses A_1 as an auxiliary variable, the bias due to A_1 will be eliminated, as the biasing path is blocked. Variable A_2, on the other hand, even

though spuriously correlated with Y and R_Y, does not induce conditional dependencies via the path Y ← L_1 → A_2 ← L_2 → R_Y and therefore cannot bias the estimates of Y, no matter what values the constituent path coefficients take on. This is because A_2 is a collider variable on this path: as long as it is not conditioned on, the path remains closed and induces no dependencies between Y and R_Y. What happens, however, when A_2 is also used as an auxiliary variable, along with A_1? The inclusion of A_2 will actually destroy the conditional independence that was achieved with the inclusion of A_1 and induce an MNAR situation. The path Y ← L_1 → A_2 ← L_2 → R_Y that was initially blocked becomes open when A_2 is conditioned on (i.e., used as an auxiliary variable). To illustrate this point further using data, we simulated a single dataset based on the model in Figure 5. The data generation is fully described in the first simulation study below. Briefly, we chose a large sample size of n = 1000. All continuous variables were multivariate normally distributed with means of 0 and variances of 1. Path coefficients in the model were completely standardized, and their sizes were chosen so that the total R² (or the respective McKelvey-Zavoina pseudo-R²; McKelvey & Zavoina, 1975) of every single dependent variable in the model (Y, A_2, R_Y) was identical at 50%. We chose the signs of the path coefficients so that the bias due to the omission of A_1 and the bias due to the inclusion of A_2 pointed in the same direction and did not incidentally offset each other. The amount of missing data was set to 50%. We estimated the mean and standard deviation of the variable Y using a listwise deletion approach, and FIML estimation in Mplus (Muthén, 2011) and lavaan (Rosseel, 2012) using only A_1 as the auxiliary variable, using only A_2 as the auxiliary variable, or using both A_1 and A_2 as auxiliary variables.
Auxiliary variables in the FIML estimation were included using the Mplus auxiliary command, which automatically fits a model suggested by Graham (2003). We also used mice (van Buuren & Groothuis-Oudshoorn, 2011) to generate 5 multiple imputations, whose results were pooled following standard recommendations (Rubin, 1976). As expected, and as previously reported by Collins et al. (2001), results of FIML and MI did not differ

substantially when the same set of auxiliary variables was used. We therefore only report results of the FIML estimation in Table 1. In the single simulated dataset, the completely observed data of Y had a mean of .03 and a standard deviation (SD) of 1.00. When using listwise deletion, the mean of Y was .19 and the SD was .98. Not surprisingly, we observed bias in the mean, as would be expected under a MAR situation in which missingness was induced through a linear function of other variables. Using A_1 as an auxiliary variable and estimating the mean of Y with FIML yielded a mean of .06; A_1 thus does a very good job of reducing bias. The relative percent reduction in bias compared to the listwise model was 100 × (.19 − .06)/.19 ≈ 68%. Using A_2 as an auxiliary variable, on the other hand, actually increases bias! The estimated mean of Y was now .30, a bias amplification of 100 × (.30 − .19)/.19 ≈ 58% compared to the listwise results. Finally, when using both A_1 and A_2 as auxiliary variables, the mean of Y was estimated to be .14, a bias reduction of a mere 100 × (.19 − .14)/.19 ≈ 26%. Using both variables as auxiliary variables was thus worse than using A_1 alone. This result may not be obvious when considering the formulas for MAR or MNAR, and in fact it runs counter to the advice that an auxiliary variable can at worst be neutral. Clearly, this auxiliary variable was not neutral, but highly bias-inducing. When one uses a graph to encode the structural relationships between the auxiliary variables and the missingness and analysis variables, however, this result is expected and can be seen directly from the fact that conditioning on A_2 d-connects Y and R_Y by opening a previously blocked path. A single simulated dataset is seldom a convincing argument, but it can serve as a starting point for a more developed one.
First, it shows that an auxiliary variable can increase bias in the estimation of parameters in the presence of missing data. Second, a bias-inducing variable cannot be distinguished from a helpful auxiliary variable by examining correlations with analysis variables and missingness indicators. In fact, in this example, the variable A_2 posed as a perfectly innocent and potentially very helpful auxiliary variable. In the complete dataset, A_2 was significantly correlated both with the analysis variable Y

(r = .26, p < .001) and with the missing data indicator R_Y (point-biserial correlation r_pb = .25, p < .001). Inclusion criteria that rely solely on correlations would therefore incorrectly lead to the inclusion of A_2 in the set of auxiliary variables. In addition, a simple example like this one helps to link what could otherwise seem a mere mathematical curiosity to an applied context. To make this illustrative example more concrete, consider that Y, the variable with missing data, is a measure of mathematical ability with a missingness indicator R_Y. For this example, we assume that MAR holds and that there is no direct path from Y to R_Y. Variable A_1 is a measure of the participant's motivation that has been observed and is used in the analysis as a potential auxiliary variable. Specifically, more motivated participants score higher on the math achievement test and are less likely to have missing data. Consider further that A_2 is the participant's income, another variable that was assessed as part of the study. The two unobserved variables L_1 and L_2 are the participant's IQ and gender, respectively. Note that we are assuming in this model that IQ and gender are in fact uncorrelated (which seems like a tenable assumption). The model further expresses that participants with higher IQ scores also score higher on math achievement, and that a participant's gender has an influence on missingness (perhaps one gender group was more likely to skip certain items). While this example is admittedly somewhat artificial due to its constrained nature, we believe that it is not entirely implausible, and it suggests that variables of the same type as A_2 could in fact be lurking among seemingly benign potential auxiliary variables. Henceforth, we will refer to these variables as collider auxiliary variables.
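The mechanism in this example can be reproduced with a short simulation. The sketch below is not the paper's generating model or its FIML/MI machinery: the coefficients are hypothetical, simple regression imputation stands in for FIML, and signs are chosen (as in the paper) so that the collider's bias does not offset the listwise bias. It shows the same qualitative pattern: conditioning on A_1 removes the listwise bias, and additionally conditioning on the collider A_2 re-introduces part of it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Unobserved common causes (uncorrelated) and the observed auxiliary A1
L1, L2 = rng.standard_normal((2, n))
A1 = rng.standard_normal(n)
res_sd = np.sqrt(1 - 2 * 0.6**2)   # keeps each variable's total variance at 1

Y  = 0.6 * A1 + 0.6 * L1 + res_sd * rng.standard_normal(n)   # true mean is 0
A2 = 0.6 * L1 + 0.6 * L2 + res_sd * rng.standard_normal(n)   # collider: L1 -> A2 <- L2
S  = 0.6 * A1 - 0.6 * L2 + res_sd * rng.standard_normal(n)   # missingness propensity
miss = S > 0                        # ~50% missing; MAR holds given A1 alone

def mean_after_imputation(predictors):
    """Fit OLS on the observed cases, fill in missing Y, return the mean."""
    X = np.column_stack([np.ones(n)] + predictors)
    beta, *_ = np.linalg.lstsq(X[~miss], Y[~miss], rcond=None)
    y_filled = np.where(miss, X @ beta, Y)
    return y_filled.mean()

bias_listwise = Y[~miss].mean()              # clearly biased (true mean is 0)
bias_a1 = mean_after_imputation([A1])        # ~0: conditioning on A1 restores MAR
bias_both = mean_after_imputation([A1, A2])  # nonzero again: the collider is opened
```

With these values the listwise mean is off by roughly .3, conditioning on A_1 essentially removes the bias, and adding A_2 brings roughly half of it back, mirroring the pattern reported above.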
Research questions

Having established in a single example that auxiliary variables can induce bias, we set out to answer several research questions.

1. First, we are interested in the absolute magnitude of bias that can be induced when using collider auxiliary variables, as a function of the magnitude of the constituent paths that

connect a collider auxiliary variable to the missingness and analysis variables. In addition, we want to put this magnitude into context and contrast it with the bias that is induced by the omission of a helpful auxiliary variable. This latter form of bias has been examined before, and we include it only to provide a benchmark for the bias that we expect to observe with the inclusion of a collider auxiliary variable. Earlier research by Greenland (2003) in the area of confounding in causal inference suggests that the magnitude of bias due to conditioning on a collider, especially of the kind presented in our example, is usually smaller than that due to omitting a confounder. We therefore suspect that the bias due to including a collider variable as an auxiliary variable will be noticeable, but smaller in magnitude than that due to omitting a true confounding auxiliary variable (i.e., a variable that directly or indirectly causes both the missingness and the analysis variables with missing data).

2. The second research question examines the behavior of auxiliary variables in data situations that are inherently MNAR. In the MAR cases considered in the first simulation study, conditional independence between the missingness and the analysis variables with missing data can always be created by conditioning on some set of observed variables. Hence there is an expectation that including the collider auxiliary variable will necessarily increase bias by disturbing this conditional independence. In the MNAR case, collider auxiliary variables are expected to behave differently, insofar as the relationship that they induce between the missingness and the variables with missing data can either enhance or reduce the already existing relationship between them. In a similar fashion, we will also explore the behavior of auxiliary variables that are directly related to both the missingness and the analysis variables.
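The graphical criterion invoked in these questions can be checked mechanically. The function below is a minimal, self-contained d-separation test using the standard ancestral-graph-and-moralization reduction; it is an illustrative sketch (in practice a graphical-models library would be used), with node names taken from the Figure 5 example.

```python
from itertools import combinations

def d_separated(edges, xs, ys, zs):
    """Return True if node sets xs and ys are d-separated given zs in the DAG
    defined by `edges` (a list of (parent, child) pairs)."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)
    # 1. Keep only ancestors of xs | ys | zs
    relevant, stack = set(), list(xs | ys | zs)
    while stack:
        v = stack.pop()
        if v not in relevant:
            relevant.add(v)
            stack.extend(parents.get(v, ()))
    # 2. Moralize: link co-parents of each retained node, drop edge directions
    adj = {v: set() for v in relevant}
    for a, b in edges:
        if a in relevant and b in relevant:
            adj[a].add(b)
            adj[b].add(a)
    for child in relevant:
        for p, q in combinations(sorted(parents.get(child, set()) & relevant), 2):
            adj[p].add(q)
            adj[q].add(p)
    # 3. Delete zs and test whether xs can still reach ys
    seen, stack = set(), [v for v in xs if v not in zs]
    while stack:
        v = stack.pop()
        if v in seen or v in zs:
            continue
        if v in ys:
            return False            # connecting path found -> d-connected
        seen.add(v)
        stack.extend(adj[v] - zs)
    return True

# Figure 5: A1 -> {Y, RY}, L1 -> {Y, A2}, L2 -> {A2, RY}
fig5 = [("A1", "Y"), ("A1", "RY"), ("L1", "Y"),
        ("L1", "A2"), ("L2", "A2"), ("L2", "RY")]
print(d_separated(fig5, {"Y"}, {"RY"}, set()))         # False: d-connected via A1
print(d_separated(fig5, {"Y"}, {"RY"}, {"A1"}))        # True: MAR given A1
print(d_separated(fig5, {"Y"}, {"RY"}, {"A1", "A2"}))  # False: collider opened
```

The three checks reproduce the example's logic: Y and R_Y are marginally d-connected through A_1, conditioning on A_1 d-separates them, and additionally conditioning on the collider A_2 d-connects them again.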
Simulation studies

Simulation study 1.1

Our first simulation study explores the absolute magnitude of bias that can be induced when using a collider auxiliary variable in a MAR situation. The simulation study roughly

followed Collins et al. (2001) in terms of data generation and evaluation criteria. Generally speaking, data are first generated under a specific model, missing data are then imposed based on a described mechanism, and parameters are estimated using listwise deletion and FIML with auxiliary variables. Lastly, results of replications are pooled within conditions and performance criteria are assessed. While it is possible to examine bias in many different parameters of interest (means, variances, skew, regression coefficients, factor loadings, etc.), we focus only on estimates of the population mean. The reason behind this choice is that mean responses (potentially across different groups) are still one of the most widely used measures to describe research phenomena in the social sciences. The examination of regression coefficients is left to future studies and is briefly mentioned in the discussion.

Data generation and analysis. The data-generating model for simulation 1.1 is shown in Figure 6. In the model, a single variable Y is generated with missing data, indicated by R_Y. Auxiliary variable A_1 is spuriously correlated with the probability of missingness and with the outcome Y via two unobserved, uncorrelated variables L_1 and L_2. In the model, Y and R_Y are d-separated but become d-connected as soon as A_1, the collider auxiliary variable, is used in MI or FIML. All continuous variables were multivariate normally distributed and completely standardized by fixing the total variance of each variable to 1 and setting means to 0. We did not vary sample size, but chose a single constant sample size of 500. This sample size was also chosen by other authors in similar simulations (Collins et al., 2001; Saris, Satorra, & Van der Veld, 2009) as a somewhat large, but still reasonable, sample size to consider.
Furthermore, changes in sample size usually yield predictable results when other factors are held constant, namely that standard errors decrease as sample size increases. We also did not vary the amount of missing data, but fixed it at a relatively high value of 30%, in between the two values chosen by Collins et al. (2001). Varying the amount of missing data is often not very interesting, as such variation has previously been shown to yield expected results (bias worsens as the amount of missing data increases). All path coefficients in the data-generating model, labeled α, were chosen so that the

uniquely explained variance in the outcome variable that these paths pointed to was set to a particular value. Path coefficients were set at 0, .224, .387, .500, .592, and .671, corresponding to uniquely explained variances of 0%, 5%, 15%, 25%, 35%, and 45%, respectively. See the Appendix for details on how missingness was generated and how explained variance in R_Y was defined. Finally, we varied the sign of the coefficient labeled α (positive or negative). Changing the sign of a single one of the collider auxiliary variable's constituent paths does not alter the magnitude of the induced bias, but it does alter the direction. Note that it does not matter which of the four paths is varied in sign, because the direction of bias is determined by the product of all four constituent paths (Pearl, 2000). Finally, note that the condition in which all paths were set to 0 corresponds to a pure MCAR condition. In this simulation design we varied all paths labeled α simultaneously; our primary interest was overall bias, not bias due to differential changes in constituent paths. The design thus yielded 5 conditions with a positive sign, 5 conditions with a negative sign, and one condition in which all paths were set to 0, for a total of 11 conditions. We replicated each condition 1000 times. All simulations were conducted using R (R Development Core Team, 2011) and the following packages: lavaan (Rosseel, 2012), MASS (Venables & Ripley, 2002), mice (van Buuren & Groothuis-Oudshoorn, 2011), MplusAutomation (Hallquist, 2012), and plyr (Wickham, 2011). For the generation of graphs we used ggplot2 (Wickham, 2009) and tikzDevice (Sharpsteen & Bracken, 2012).

Performance measures. To analyze the results of our simulation study, we assessed a range of standard criteria commonly employed in simulation studies.

1.
We assessed standardized bias in the estimates (mean, variance) of variables with missing data, defined identically to Collins et al. (2001) as raw bias (the average parameter estimate across replications minus the true parameter value) divided by the standard error, defined as the standard deviation of the estimates across all replications. Collins et al. (2001) give a rule of thumb that absolute values of .4 or higher on the standardized bias metric are worrisome.

2. We recorded the precision of the estimates, defined as the average standard error across all replications. In general, it is desirable to have estimates with smaller standard errors, and hence narrower confidence intervals and more precise estimates.

3. We computed the root mean squared error (RMSE), defined as the square root of the average squared difference between a parameter estimate and the true value of the parameter.

4. Lastly, we observed coverage rates, defined as the percentage of replications whose 95% confidence interval included the true parameter value. Ideally, one observes 95% coverage, as this indicates that the confidence intervals of the estimator capture the true parameter in the long run and have the nominal α error rate. Again relying on the rules of thumb of Collins et al. (2001), we regard coverage rates below 90% as worrisome.

Results of simulation study 1.1. The complete results are shown in Table B1 in the Appendix. To communicate the most important findings, we display the amount of standardized bias in the means in Figure 7 and coverage values in Figure 8. Both figures show that the listwise model is unbiased and has perfect coverage across all conditions. The inclusion of A_1 as an auxiliary variable in the FIML estimation, however, induced bias in the mean, as anticipated from the graphical analysis. Bias emerges in all conditions that used FIML, except the one in which all paths labeled α were set to 0 (the MCAR condition). Note that this is true even though variable A_1 is related to both Y and R_Y and would be included as an auxiliary variable under all current recommendations to achieve MAR. The general pattern seen in Figures 7 and 8 is that increases in the amount of explained variance yield monotonic increases in bias.
Little to no bias is observed in conditions with weak path coefficients, and stronger biases are observed in more extreme conditions. The standardized bias (and other performance measures) reach a critical threshold, based on the rules of thumb of Collins et al. (2001), when the path coefficients are strong enough to explain slightly less than 25% of the variance. Bias in conditions with even stronger effects is so large that confidence intervals approach

40% coverage. Also, not surprisingly, the direction of bias changes when the coefficient α changes its sign: in conditions in which the sign is negative, positive bias is induced by the inclusion of the collider auxiliary variable, and negative bias is induced when the path coefficient has a positive sign. The results of this simulation clearly show that an auxiliary variable can increase bias even though it exhibits strong correlations with the missingness and analysis variables. This somewhat surprising result is evident from the graphical model, in which we can see that A_1 is a collider auxiliary variable that, when conditioned on, induces a dependency between Y and R_Y.

Simulation study 1.2

To put the results of the first simulation study into a broader context, we performed a second simulation study that was essentially a replication of earlier findings that an omitted variable affecting both the missingness and the analysis variables with missing data can bias estimates. While this simulation study by itself does not yield new insights, we performed it to provide the benchmark called for by our first research question: is the bias due to the inclusion of a collider auxiliary variable similar in strength to that due to the omission of a helpful auxiliary variable? We replicated the first simulation study using the exact same values of explained variance in the data-generating model, but changed the role of the collider auxiliary variable to that of an auxiliary variable with direct influences on both the missingness and the analysis variables.

Data generation and analysis. The data-generating model for simulation 1.2 is shown in Figure 9. In this model, a single variable Y is again generated with missing data, indicated by R_Y. This time, an auxiliary variable A_2 directly affects both Y and R_Y, thus d-connecting the two variables.
The graphical criterion therefore tells us that A_2 d-connects Y and R_Y and should be included in the FIML estimation; omitting it induces bias. The generation of all variables was identical to simulation study 1.1. The unique explained variance of each effect labeled β was also identical to the previous simulation: 0%, 5%, 15%, 25%, 35%, and

45%. Again, we varied the sign of the path labeled β, for a total of 11 simulation conditions.

Results of simulation study 1.2. Table B2 in the Appendix lists the complete results of the second simulation study. To visualize the main findings, we present standardized bias in the means and coverage rates of the means in Figures 10 and 11 for all conditions. In this simulation we observe a slightly different pattern than in the previous one. Not surprisingly, and as shown previously by other researchers, the listwise model is biased in the parameter estimates of the means, and in the more extreme cases even in the variance of Y (not shown in the figures, but reported in the table). The FIML model that included A_2 is virtually unbiased in all conditions and has perfect coverage, because the true data-generating mechanism of the missingness is captured. Several important observations can be made. First, the bias induced by the omission of a helpful auxiliary variable is larger in magnitude than that induced by the inclusion of a bias-inducing collider auxiliary variable. This can also be observed in the coverage rates, which drop much more dramatically than in the case of an included collider auxiliary variable. For example, in the condition with 25% explained variance, the standardized bias in the previous simulation was .61, whereas in this simulation with an omitted helpful auxiliary variable, the bias is 2.45. A second observation is that the direction of bias is flipped compared to the results of the previous study: a negative sign of the path coefficient labeled β yielded negative bias, and likewise a positive path coefficient yielded positive bias.

Intermediate summary of results of simulation study 1

We have shown that, in cases that are not MNAR, bias can be induced through the inclusion of auxiliary variables in a FIML estimation framework.
The fact that an auxiliary variable can actually worsen bias in parameter estimates in the presence of missing data is a novel point that is not addressed by the currently practiced approaches to including auxiliary variables. It also provides a counter-argument to a claim sometimes brought forth in defense of including many variables, namely that as soon as the explained variance in the

missingness or the outcome variable gets very large, there is no more room for any potential biasing influences. This is clearly wrong, as our simulation examined cases in which the explained variance attributable to the collider auxiliary variable was very large and yet bias increased. In our simulation studies this bias became problematic (as assessed through the rules of thumb for standardized bias and coverage) as soon as the explained variance of the unobserved variables associated with the collider auxiliary variable crossed a threshold of slightly less than 25%. On a correlation metric we would therefore have to observe correlations of approximately .4 to .5. While this may seem very high, it is important to remember that our simulation studies included only a single collider auxiliary variable with only two unobserved variables, whereas in reality there could be a multitude of both colliders and unobserved variables, especially when one considers psychological constructs, which often have multiple causes. Taken together, these might explain more variance and potentially make the inclusion of collider auxiliary variables more problematic. However, the second simulation study also demonstrated that the bias observed due to the inclusion of a collider auxiliary variable is much smaller than the bias observed due to the omission of an auxiliary variable that has directional effects on both the missingness and the analysis variables with missing data. In our simulation setup we observed troublesome levels of bias as soon as the omitted auxiliary variable explained slightly less than 15% of the variance in the related variables, which translates to correlations of approximately .3 to .4.
These intermediate results should not give the impression that listwise deletion is generally preferable to MI or FIML models with auxiliary variables, as might erroneously be concluded from the first simulation study alone. They do show, however, that the inclusion of auxiliary variables does not always mitigate bias but can enhance it, and that researchers should take care to pick good auxiliary variables. We discuss some strategies later in the discussion.