Module 14: Missing Data Concepts

Similar documents
Analysis of TB prevalence surveys

Bayesian approaches to handling missing data: Practical Exercises

How should the propensity score be estimated when some confounders are partially observed?

Exploring the Impact of Missing Data in Multiple Regression

Strategies for handling missing data in randomised trials

Catherine A. Welch 1*, Séverine Sabia 1,2, Eric Brunner 1, Mika Kivimäki 1 and Martin J. Shipley 1

Logistic Regression with Missing Data: A Comparison of Handling Methods, and Effects of Percent Missing Values

Missing Data and Imputation

Bias in regression coefficient estimates when assumptions for handling missing data are violated: a simulation study

What is Multilevel Modelling Vs Fixed Effects. Will Cook Social Statistics

Selected Topics in Biostatistics Seminar Series. Missing Data. Sponsored by: Center For Clinical Investigation and Cleveland CTSC

Help! Statistics! Missing data. An introduction

The analysis of tuberculosis prevalence surveys. Babis Sismanidis with acknowledgements to Sian Floyd Harare, 30 November 2010

Propensity Score Methods for Estimating Causality in the Absence of Random Assignment: Applications for Child Care Policy Research

MS&E 226: Small Data

Advanced Handling of Missing Data

Some General Guidelines for Choosing Missing Data Handling Methods in Educational Research

Citation for published version (APA): Ebbes, P. (2004). Latent instrumental variables: a new approach to solve for endogeneity s.n.

COMMITTEE FOR PROPRIETARY MEDICINAL PRODUCTS (CPMP) POINTS TO CONSIDER ON MISSING DATA

Appendix 1. Sensitivity analysis for ACQ: missing value analysis by multiple imputation

A COMPARISON OF IMPUTATION METHODS FOR MISSING DATA IN A MULTI-CENTER RANDOMIZED CLINICAL TRIAL: THE IMPACT STUDY

Validity and reliability of measurements

PSI Missing Data Expert Group

Section on Survey Research Methods JSM 2009

International Standard on Auditing (UK) 530

IAASB Main Agenda (February 2007) Page Agenda Item PROPOSED INTERNATIONAL STANDARD ON AUDITING 530 (REDRAFTED)

Data and Statistics 101: Key Concepts in the Collection, Analysis, and Application of Child Welfare Data

Challenges of Observational and Retrospective Studies

Inclusive Strategy with Confirmatory Factor Analysis, Multiple Imputation, and. All Incomplete Variables. Jin Eun Yoo, Brian French, Susan Maller

Best Practice in Handling Cases of Missing or Incomplete Values in Data Analysis: A Guide against Eliminating Other Important Data

Multiple Imputation For Missing Data: What Is It And How Can I Use It?

In this module I provide a few illustrations of options within lavaan for handling various situations.

What to do with missing data in clinical registry analysis?

Practice of Epidemiology. Strategies for Multiple Imputation in Longitudinal Studies

Analysis Strategies for Clinical Trials with Treatment Non-Adherence Bohdana Ratitch, PhD

SRI LANKA AUDITING STANDARD 530 AUDIT SAMPLING CONTENTS

The Effects of Maternal Alcohol Use and Smoking on Children s Mental Health: Evidence from the National Longitudinal Survey of Children and Youth

Evaluation Models STUDIES OF DIAGNOSTIC EFFICIENCY

Abstract. Introduction A SIMULATION STUDY OF ESTIMATORS FOR RATES OF CHANGES IN LONGITUDINAL STUDIES WITH ATTRITION

Supplement 2. Use of Directed Acyclic Graphs (DAGs)

Pharmaceutical Statistics Journal Club 15 th October Missing data sensitivity analysis for recurrent event data using controlled imputation

Missing by Design: Planned Missing-Data Designs in Social Science

Missing data in clinical trials: making the best of what we haven t got.

Recent developments for combining evidence within evidence streams: bias-adjusted meta-analysis

Identifying relevant sensitivity analyses for clinical trials

Quantitative Methods. Lonnie Berger. Research Training Policy Practice

An application of a pattern-mixture model with multiple imputation for the analysis of longitudinal trials with protocol deviations

Confounding by indication developments in matching, and instrumental variable methods. Richard Grieve London School of Hygiene and Tropical Medicine

Revised Cochrane risk of bias tool for randomized trials (RoB 2.0) Additional considerations for cross-over trials

Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models, 2nd Ed.

Analysis of Environmental Data Conceptual Foundations: En viro n m e n tal Data

Methods for Addressing Selection Bias in Observational Studies

Missing Data in Longitudinal Studies: Strategies for Bayesian Modeling, Sensitivity Analysis, and Causal Inference

Missing data in medical research is

INTERNATIONAL STANDARD ON AUDITING (NEW ZEALAND) 530

Meta-Analysis. Zifei Liu. Biological and Agricultural Engineering

Validity and reliability of measurements

An Introduction to Multiple Imputation for Missing Items in Complex Surveys

A Strategy for Handling Missing Data in the Longitudinal Study of Young People in England (LSYPE)

Evaluating Social Programs Course: Evaluation Glossary (Sources: 3ie and The World Bank)

Data harmonization tutorial:teaser for FH2019

Chapter 1: Exploring Data

PARTIAL IDENTIFICATION OF PROBABILITY DISTRIBUTIONS. Charles F. Manski. Springer-Verlag, 2003

Instrumental Variables Estimation: An Introduction

Unit 1 Exploring and Understanding Data

The prevention and handling of the missing data

Mostly Harmless Simulations? On the Internal Validity of Empirical Monte Carlo Studies

Further data analysis topics

The RoB 2.0 tool (individually randomized, cross-over trials)

Complier Average Causal Effect (CACE)

Imputation approaches for potential outcomes in causal inference

6. A theory that has been substantially verified is sometimes called a a. law. b. model.

S Imputation of Categorical Missing Data: A comparison of Multivariate Normal and. Multinomial Methods. Holmes Finch.

ethnicity recording in primary care

Master thesis Department of Statistics

Manuscript Presentation: Writing up APIM Results

BLISTER STATISTICAL ANALYSIS PLAN Version 1.0

George B. Ploubidis. The role of sensitivity analysis in the estimation of causal pathways from observational data. Improving health worldwide

research methods & reporting

Assessing the accuracy of response propensities in longitudinal studies

MMI 409 Spring 2009 Final Examination Gordon Bleil. 1. Is there a difference in depression as a function of group and drug?

Chapter 17 Sensitivity Analysis and Model Validation

Lec 02: Estimation & Hypothesis Testing in Animal Ecology

UMbRELLA interim report Preparatory work

Review Statistics review 2: Samples and populations Elise Whitley* and Jonathan Ball

Missing Data: Our View of the State of the Art

Dichotomizing partial compliance and increased participant burden in factorial designs: the performance of four noncompliance methods

Regression Discontinuity Analysis

Missing Data and Institutional Research

Maintenance of weight loss and behaviour. dietary intervention: 1 year follow up

The Relative Performance of Full Information Maximum Likelihood Estimation for Missing Data in Structural Equation Models

Overview of Perspectives on Causal Inference: Campbell and Rubin. Stephen G. West Arizona State University Freie Universität Berlin, Germany

An Instrumental Variable Consistent Estimation Procedure to Overcome the Problem of Endogenous Variables in Multilevel Models

Write your identification number on each paper and cover sheet (the number stated in the upper right hand corner on your exam cover).

Detection of Unknown Confounders. by Bayesian Confirmatory Factor Analysis

Chapter 5: Field experimental designs in agriculture

Designing and Analyzing RCTs. David L. Streiner, Ph.D.

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES

Ordinary Least Squares Regression

Transcription:

Module 14: Missing Data Concepts Jonathan Bartlett & James Carpenter London School of Hygiene & Tropical Medicine Supported by ESRC grant RES 189-25-0103 and MRC grant G0900724 Pre-requisites Module 3 For the section on multilevel data, Modules 4 and 5 Online resources: www.missingdata.org.uk Contents Introduction... 1 Introduction to the Class Size Data... 2 C14.1 The Model of Interest... 3 C14.2 Investigating Missingness... 3 C14.2.1 Missing completely at random... 4 C14.2.2 Missing at random... 4 C14.2.3 Missing not at random... 5 C14.2.4 Investigating missingness... 5 C14.2.5 Missingness in multiple variables and monotone missingness... 7 C14.3 Ad-hoc Methods... 8 C14.3.1 Missing category method... 8 C14.3.2 Mean imputation... 8 C14.3.3 Regression (conditional) mean imputation... 9 C14.3.4 Random or stochastic regression imputation...11 C14.4 Complete Records Analysis... 13

C14.5 Multiple Imputation... 14 C14.5.1 Creating the imputations...15 C14.5.2 Choosing variables to include in the imputation model...16 C14.5.3 Proper imputation...17 C14.5.4 Using the imputed datasets...17 C14.5.5 How many imputations should we generate?...18 C14.5.6 Class size data...18 C14.5.7 Assumptions made by multiple imputation...19 C14.5.8 Software...19 C14.6 Inverse Probability Weighting... 20 C14.6.1 Implementation...20 C14.6.2 Assumptions and validity...20 C14.6.3 Class size data...21 C14.6.4 Inference after IPW...22 C14.6.5 IPW vs MI...22 C14.7 Multilevel and Longitudinal Studies... 23 C14.7.1 Direct likelihood approaches...23 C14.7.2 Multiple imputation...24 C14.7.3 Inverse probability weighting...25 C14.8 Summary and Conclusions... 25 C14.8.1 What we have not covered...26 References... 27 Acknowledgements... 29

Introduction Missing data are ubiquitous in economic, social, and medical research. In this module we aim to introduce the issues raised by missing data and the central assumptions on which any analysis of partially observed data must rest. We then review commonly used ad-hoc approaches for handling missing data, before going on to introduce and contrast two increasingly widely used methods for the analysis of partially observed data: multiple imputation and inverse probability weighting. The aims of this module, and the associated practicals, are to: Give you an understanding of the issues raised by missing data, and how to explore them in a particular data set Introduce the key concepts of Missing Completely At Random, Missing At Random and Missing Not At Random, which describe the assumptions on which analysis of partially observed data rests Critically review commonly used ad-hoc methods for handling missing data Introduce, and give an intuition for, two increasingly widely used methods for the analysis of partially observed data Illustrate the use of these methods using Stata and MLwiN/REALCOM. At the end of this module, the objective is that you will be able to frame the analysis of partially observed data in terms of the key concepts above, apply multiple imputation in relatively straightforward settings, and have the basic tools to critically review the analysis of partially observed data by others. For more information, please refer to our website, www.missingdata.org.uk, where among other things you will find information on forthcoming courses. By missing data, we mean data that we intended to collect on units, which will often be individuals, but for some reason were unable to collect. We stress that the underlying, unobserved value exists; we just did not observe it. Thus we are not concerned here with counterfactual data, such as what an individual s educational attainment would have been if they had attended a different school. Throughout this module, we assume that in the absence of any missing data, we have in mind the statistical model we would fit to address our research question, in other words the response variable and the key covariates whose association with the response we wish to explore. Although this will often involve fitting one model, given the response and set of covariates we term the most general model the model of interest. If we can obtain valid inference for the model of interest, we can simplify it if appropriate. As with all statistical models, the model of interest contains one or more covariates, whose regression parameters we wish to draw inference for. These typically describe the mutually adjusted associations between explanatory variables and the response variable. When some values that we intended to collect are missing, our model of interest does not generally change. However, the presence of missing values makes fitting the model of interest to the full data as originally intended impossible; reducing to the subset of complete records will often result in bias. It is also inefficient to discard valuable (and costly to collect) partial information on the units with one or more missing values. Centre for Multilevel Modelling, 2013 1

The ubiquity of missing data means there has been a vast amount of research into its theoretical and practical implications over the last 35 years. Consistent with the aims above, we seek to summarise (some of) this and present it in an accessible way for the typical quantitative researcher, who has not had advanced statistical training. The plan for the remainder of this module is therefore as follows. We begin by introducing a set of educational data which we use repeatedly to illustrate the discussion. Then, C14.2 outlines the centrality of assumptions, and gives an intuitive introduction to the key concept of missingness mechanisms, and their implications for the analysis. In the light of this, we critically review commonly used ad-hoc methods, and complete records analysis, in C14.3 and C14.4. We then consider two more theoretically well grounded methods for the analysis of partially observed data, namely multiple imputation (C14.5) and inverse probability weighting (0). In C14.7 we outline extensions to multilevel structures, before summarising the module and presenting a strategy for the analysis of partially observed data in C14.8. Introduction to the Class Size Data To illustrate our discussion, we use data from the class size study (Blatchford et al 2002), kindly made available by Peter Blatchford. We use a subset of the data, so that our analyses are merely illustrative; substantive conclusions should not be drawn. Table 14.1 describes the variables in the dataset, which consists of data from 4,873 children before and during their reception (i.e. first full time year of primary education) year. We have full data on each child s numeracy test score before the reception year (nmatpre) and literacy test score at the end of the reception year (nlitpost). These originally quantitative scores have been standardised to have mean 0 and variance 1. We also have a binary indicator of special educational needs (sen). The dataset also contains nmatpre_m, which is the same as nmatpre except that some values have been made artificially missing. Table 14.1. Variables in the class size data Variable nmatpre Description Pre-reception maths score nmatpre_m Pre-reception maths score, with 2,493 values missing nlitpost Sen Post-reception literacy score Special educational needs (1=yes, 0=no) Centre for Multilevel Modelling, 2013 2

C14.1 The Model of Interest We assume throughout this module that our aim is to fit some model of interest, which will enable us to answer our scientific or research question. To motivate our discussion, we focus on the linear regression model for nlitpost with nmatpre and sen as covariates, or explanatory variables. This is our model of interest. Table 14.2 shows the estimated regression coefficients based on the full data (n = 4873). Table 14.2. Estimated regression coefficients and standard errors based on fit to full data Explanatory variable Coefficient (SE) constant 0.045 (0.012) nmatpre 0.585 (0.012) sen -0.432 (0.043) Based on the full data, there is strong evidence (the coefficient divided by its standard error is much larger than 1.96) that pre-reception maths score and whether the child has special educational needs are independently predictive of post-reception literacy score. Given special educational needs, each standardised unit increase in pre-reception maths score increases post-reception literacy by 0.6 of a standard deviation on average. For a given pre-reception maths score, special educational needs reduces post-reception literacy score by just under half a standard deviation on average. We will use these estimates, based on the full data, as our gold standard. We will perform analyses with nmatpre_m, and using various methods for handling missingness in the variable, compare the estimates with those in Table 14.2. While this example is much simpler than any substantive analysis, it nevertheless is sufficient to illustrate the key concepts and methods. C14.2 Investigating Missingness It is almost a truism to say that the validity of statistical analyses when we have missing data depends critically on the reasons why data are missing. More subtly, below we will show that the observed data cannot definitively identify the reason, for, or probabilistic mechanism causing the missing data (although the observed data can rule out some mechanisms). In substantive analyses, we are therefore going to need to make an assumption about the missing data mechanism, consistent with but not definitively verifiable from the observed data. Our goal is then valid statistical inference under this assumption. Since we cannot know if this assumption is true, we should also be interested in exploring the sensitivity of our inferences, or conclusions, to different assumptions about the missingness mechanism. This is beyond the scope of this module, although we return to it briefly at the end. Centre for Multilevel Modelling, 2013 3

Missingness mechanisms can be classified using a typology first proposed by Rubin (Rubin 1976). We now introduce these, and illustrate them with the class size data. Specifically, we consider the partially observed variable nmatpre_m and how the chance of nmatpre_m being missing is related to the other variables and beyond them possibly the underlying (and in this artificial setting known) value of nmatpre itself. C14.2.1 Missing completely at random Data on variables in the model of interest are said to be missing completely at random (MCAR) if missingness (the chance of observations being missing) is independent of the variables in the model of interest. In the context of the class size data and missingness in the nmatpre_m variable, MCAR means the probability that nmatpre_m is missing is independent of nmatpre, and of the other variables sen and nlitpost. In words, the probability that nmatpre_m is missing for a given child is the same regardless of their (possibly unseen) maths score, their literacy score or whether they had special education needs. We note MCAR is always an assumption that cannot be proven to hold, given the observed data. Specifically, in real data, while we can explore whether the chance of observing a particular variable depends on other variables (which violates MCAR), we cannot explore whether it depends on the underlying unseen values of that variable. Thus MCAR may be a very strong assumption. In practice, the data often suggest that units with missing data are systematically different with respect to some of the variables in our analyses compared to those without missing data. In such cases, data are not MCAR. Implications If data are MCAR, the complete records (i.e. those units without missing data) are a random sub-sample from the original, intended, sample. If we perform our statistical analyses using only data with complete records we will therefore obtain unbiased estimates of parameters and valid inferences. However, we have discarded units with partial (yet still valuable and costly to obtain) data. Thus, our estimates will be less precise; the goal of a more sophisticated analysis will (often) be to include the units with partially observed data. C14.2.2 Missing at random Observations on a variable are said to be missing at random (MAR) when the chance of observing the variable on a unit may depend on the underlying value of that variable, but, given the observed data, this association is broken. In the context of the nmatpre_m variable, MAR means that the probability of nmatpre_m being missing may be associated with the underlying (here artificially known) value of nmatpre, but given values of the other fully observed variables (sen and nlitpost), this association is broken. Implications If data are MAR, analyses based on complete records will in general be biased. For example, the sample mean of the partially observed variable will typically be biased. This is because under MAR, the complete records are not a random sample from the target population. However, all is not lost. We shall see that under MAR it is possible to obtain Centre for Multilevel Modelling, 2013 4

This document is only the first few pages of the full version. To see the complete document please go to learning materials and register: http://www.cmm.bris.ac.uk/lemma The course is completely free. We ask for a few details about yourself for our research purposes only. We will not give any details to any other organisation unless it is with your express permission.