Advanced Handling of Missing Data

Similar documents
Logistic Regression with Missing Data: A Comparison of Handling Methods, and Effects of Percent Missing Values

Selected Topics in Biostatistics Seminar Series. Missing Data. Sponsored by: Center For Clinical Investigation and Cleveland CTSC

Help! Statistics! Missing data. An introduction

Missing Data and Imputation

AMELIA II: A Package for Missing Data

Modern Strategies to Handle Missing Data: A Showcase of Research on Foster Children

Validity and reliability of measurements

A COMPARISON OF IMPUTATION METHODS FOR MISSING DATA IN A MULTI-CENTER RANDOMIZED CLINICAL TRIAL: THE IMPACT STUDY

Validity and reliability of measurements

Module 14: Missing Data Concepts

Appendix 1. Sensitivity analysis for ACQ: missing value analysis by multiple imputation

Exploring the Impact of Missing Data in Multiple Regression

Best Practice in Handling Cases of Missing or Incomplete Values in Data Analysis: A Guide against Eliminating Other Important Data

Missing Data and Institutional Research

Propensity Score Methods for Estimating Causality in the Absence of Random Assignment: Applications for Child Care Policy Research

The prevention and handling of the missing data

S Imputation of Categorical Missing Data: A comparison of Multivariate Normal and. Multinomial Methods. Holmes Finch.

Multiple Imputation For Missing Data: What Is It And How Can I Use It?

Inclusive Strategy with Confirmatory Factor Analysis, Multiple Imputation, and. All Incomplete Variables. Jin Eun Yoo, Brian French, Susan Maller

Bias in regression coefficient estimates when assumptions for handling missing data are violated: a simulation study

Master thesis Department of Statistics

Analysis of TB prevalence surveys

The Relative Performance of Full Information Maximum Likelihood Estimation for Missing Data in Structural Equation Models

Dealing with Missing Data: A comparative exploration of approaches utilizing the Integrated City Sustainability Database

Methods for Computing Missing Item Response in Psychometric Scale Construction

Section on Survey Research Methods JSM 2009

Missing by Design: Planned Missing-Data Designs in Social Science

Accuracy of Range Restriction Correction with Multiple Imputation in Small and Moderate Samples: A Simulation Study

In this module I provide a few illustrations of options within lavaan for handling various situations.

Missing Data: Our View of the State of the Art

An Introduction to Multiple Imputation for Missing Items in Complex Surveys

Preliminary Report on Simple Statistical Tests (t-tests and bivariate correlations)

MISSING DATA AND PARAMETERS ESTIMATES IN MULTIDIMENSIONAL ITEM RESPONSE MODELS. Federico Andreis, Pier Alda Ferrari *

Missing data in clinical trials: making the best of what we haven t got.

COMMITTEE FOR PROPRIETARY MEDICINAL PRODUCTS (CPMP) POINTS TO CONSIDER ON MISSING DATA

Maintenance of weight loss and behaviour. dietary intervention: 1 year follow up

Statistical data preparation: management of missing values and outliers

bivariate analysis: The statistical analysis of the relationship between two variables.

Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data

Data Analysis Using Regression and Multilevel/Hierarchical Models

Week 10 Hour 1. Shapiro-Wilks Test (from last time) Cross-Validation. Week 10 Hour 2 Missing Data. Stat 302 Notes. Week 10, Hour 2, Page 1 / 32

Chapter 3 Missing data in a multi-item questionnaire are best handled by multiple imputation at the item score level

Strategies for handling missing data in randomised trials

WELCOME! Lecture 11 Thommy Perlinger

Graphical Representation of Missing Data Problems

Catherine A. Welch 1*, Séverine Sabia 1,2, Eric Brunner 1, Mika Kivimäki 1 and Martin J. Shipley 1

Regression Analysis II

Political Science 15, Winter 2014 Final Review

PEER REVIEW HISTORY ARTICLE DETAILS VERSION 1 - REVIEW. Ball State University

Bayesian approaches to handling missing data: Practical Exercises

Design and Analysis Plan Quantitative Synthesis of Federally-Funded Teen Pregnancy Prevention Programs HHS Contract #HHSP I 5/2/2016

MEASURES OF ASSOCIATION AND REGRESSION

Missing data imputation: focusing on single imputation

Practical Statistical Reasoning in Clinical Trials

Estimands, Missing Data and Sensitivity Analysis: some overview remarks. Roderick Little

Predictive Models for Making Patient Screening Decisions

ASSESSING THE EFFECTS OF MISSING DATA. John D. Hutcheson, Jr. and James E. Prather, Georgia State University

Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models, 2nd Ed.

SESUG Paper SD

Title. Description. Remarks. Motivating example. intro substantive Introduction to multiple-imputation analysis

Sequential nonparametric regression multiple imputations. Irina Bondarenko and Trivellore Raghunathan

Lecture (chapter 1): Introduction

Before we get started:

Running head: SELECTION OF AUXILIARY VARIABLES 1. Selection of auxiliary variables in missing data problems: Not all auxiliary variables are

What to do with missing data in clinical registry analysis?

Some General Guidelines for Choosing Missing Data Handling Methods in Educational Research

Cross-Lagged Panel Analysis

Manuscript Presentation: Writing up APIM Results

A Strategy for Handling Missing Data in the Longitudinal Study of Young People in England (LSYPE)

Designing and Analyzing RCTs. David L. Streiner, Ph.D.

Clinical trials with incomplete daily diary data

The analysis of tuberculosis prevalence surveys. Babis Sismanidis with acknowledgements to Sian Floyd Harare, 30 November 2010

Supplemental Appendix for Beyond Ricardo: The Link Between Intraindustry. Timothy M. Peterson Oklahoma State University

Problem Set 3 ECN Econometrics Professor Oscar Jorda. Name. ESSAY. Write your answer in the space provided.

MODEL SELECTION STRATEGIES. Tony Panzarella

Biostatistics. Donna Kritz-Silverstein, Ph.D. Professor Department of Family & Preventive Medicine University of California, San Diego

Missing data and multiple imputation in clinical epidemiological research

Applied Medical. Statistics Using SAS. Geoff Der. Brian S. Everitt. CRC Press. Taylor Si Francis Croup. Taylor & Francis Croup, an informa business

Chapter Eight: Multivariate Analysis

On Missing Data and Genotyping Errors in Association Studies

On average, about half the respondents to surveys

MISSING DATA IN FAMILY RESEARCH: EXAMINING DIFFERENT LEVELS OF MISSINGNESS

Missing Data in Homicide Research

Introduction to Econometrics

isc ove ring i Statistics sing SPSS

Profile Analysis. Intro and Assumptions Psy 524 Andrew Ainsworth

Business Statistics Probability

Chapter Eight: Multivariate Analysis

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES

Chapter 34 Detecting Artifacts in Panel Studies by Latent Class Analysis

1 Introduction. st0020. The Stata Journal (2002) 2, Number 3, pp

You must answer question 1.

Abstract. Introduction A SIMULATION STUDY OF ESTIMATORS FOR RATES OF CHANGES IN LONGITUDINAL STUDIES WITH ATTRITION

Comparison of imputation and modelling methods in the analysis of a physical activity trial with missing outcomes

STATISTICS & PROBABILITY

Missing Data in Longitudinal Studies: Strategies for Bayesian Modeling, Sensitivity Analysis, and Causal Inference

Chapter 1: Explaining Behavior

What is Multilevel Modelling Vs Fixed Effects. Will Cook Social Statistics

Chapter 1 Review Questions

Transcription:

Advanced Handling of Missing Data One-day Workshop Nicole Janz ssrmcta@hermes.cam.ac.uk

2 Goals Discuss types of missingness Know advantages & disadvantages of missing data methods Learn multiple imputation Practical: diagnose, visualize and handle missing data in R

3 Steps in the research process 1. Identify patterns of missingness for each variable 2. Why are data missing? Could this bias your sample? 3. How do other scholars in your field handle missingness? 4. Decide on method to handle missingness for your particular variables 5. Robustness: try different missing data methods, run your analysis, compare the results

Proportions of missingness per variable in a table variable nmiss n propmiss country 0 5568 0.00000000 year 0 5568 0.00000000 UN_FDI_flow 477 5568 0.08566810 US_fdi_electrical 1896 5568 0.34051724 US_fdi_machinery 1922 5568 0.34518678 US_fdi_transport 1968 5568 0.35344828 US_fdi_mining 3908 5568 0.70186782 US_fdi_services 3955 5568 0.71030891 US_fdi_petrol 4258 5568 0.76472701 US_fdi_utilities 4984 5568 0.89511494 4

Proportions of missingness per variable in a graph Proportion of missingness Petrol/GDP Mining/GDP Other FDI/GDP Deposit./GDP Finance/GDP US FDI/GDP Wh.Trade/GDP Food/GDP Chemical/GDP Metal/GDP Transp./GDP Machinery/GDP Mosley Law Mosley Prac. Mosley Labor Electr./GDP PTS Democracy CIRI Women CIRI Phys. CIRI Emp. CIRI Worker Trade GDP p. capita Population Conflict Fariss Life exp. Inf.mort. 0.0 0.2 0.4 0.6 0.8 1.0 5

Time series: number of years with A SIMPLIFIED BIVARIATE TEST GUIDE existing data 6

Heatmap per country-year and variable yellow=missing 7

Why are my data missing? Due to social/natural processes school graduation, dropout, death a country does not exist anymore e.g. GDR statistics office reclassified variables intentional non-disclosure Skip patterns in surveys E.g. only married respondents are asked certain follow-up questions Respondent refusal income 8

Why are my data missing? variable nmiss n propmiss US_fdi_mining 3908 5568 0.70186782 US_fdi_petrol 4258 5568 0.76472701 US_fdi_utilities 4984 5568 0.89511494 Mining FDI is available until 1999 Petrol FDI is available from 2000 Utilities FDI is a new category was introduced after 2000 9

Three types of missingness 1. MCAR - Missing Completely at Random 2. MAR - Missing at Random 3. MNAR Missing not at Random 10

MCAR: Missing Completely at Random Missing value (y) neither depends on x nor y. Probability of missingness is the same for all units. Survey respondent decides whether to answer the earnings question by rolling a die and refusing to answer if a 6 shows up Some survey questions asked of a simple random sample of original sample What to do: If data are missing completely at random, then throwing out cases with missing data does not bias your inferences -> do listwise deletion, then run analysis 11

MAR: Missing at Random Probability that a variable is missing depends only observed data, but not the missing data itself, or unobserved data. If sex, race, education, and age are recorded for all the people in the survey, then earnings is MAR if the probability of nonresponse depends only on these variables If men are more likely to tell you their weight than women, and we record gender, then weight is MAR. What to do? Some say listwise deletion is fine, but only if regression controls for all variables that affect probability of missingness. More common: use multiple imputation (MI) because listwise 12 deletion introduces bias.

MNAR: Missing not at Random (non-ignorable missingness) Missingness depends at least in part on unobserved factors. Special case: Missingness depends on variable that is missing People with college degrees are less likely to reveal their earnings, we don t have education data for all respondents If a particular treatment causes discomfort, a patient is more likely to drop out of the study. We don t have a measure for discomfort for all patients. Respondents with high income less likely to report income. 13

MNAR: Missing not at Random (non-ignorable missingness) What to do? Most problematic case. Potential lurking variables are often unobserved. MI based on auxiliary, external data e.g. estimate race based on Census data associated with the address of the respondent. Try to include as many predictors as possible in a model to get MNAR closer to MAR. 14

How to distinguish between MNAR and MAR? Think about your variables and use your substantive scientific knowledge of the data and your field. Can you collect more data that explain missingness, or is it very likely that they will remain unobserved? What does the literature say about predictors of that particular missing variable? 15

How to distinguish between MAR and MCAR? Again, think about the data. Some indication (but no definitive answer) can be gained from two tests: 1) Little s test for MCAR (Little 1988) Maximum likelihood chi-square test for missing completely at random. H 0 is that the data is MCAR. If the p value for Little's MCAR test is not significant, then the data may be assumed to be MCAR and missingness is ignorable (do listwise deletion). mcartest in STATA; EM option in SPSS; in R see lab 16

How to distinguish between MAR and MCAR? 2. Dummy variable approach for MCAR create dummy variables for whether a variable is missing: 1 = missing 0 = observed Run t-tests (continuous) and chi-square (categorical) tests between this dummy and other variables to see if the missingness is related to the values of other variables Tests which return a finding of significance indicate MAR rather than MCAR (-> use multiple imputation) (SPSS: MVA option, R see lab) 17

Ad-hoc methods Listwise deletion (complete case analysis) Automatically done in regression in most software; or by hand; assumes MCAR If MAR or MNAR: introduces biased sample reduces sample size Pairwise deletion (available case analysis) different aspects of a problem are studied with different subsets of the data Results between subsets not consistent / comparable if the non-respondents differ systematically from the respondents, this will bias the available-case summaries Potential omitted variable bias if excludes a complete variable because its high missingness 18

Ad-hoc methods Last value carried forward replace missing outcome values with pre-treatment measure would lead to underestimates of the true treatment effect ignores changes over time Mean imputation easiest way to impute is to replace each NA with the mean distorts distribution for this variable, e.g. underestimates sd ignores changes over time Filling in values manually based on case-based knowledge from other sources time-consuming prone to measurement error 19

Single imputation Impute missing values from predicted values results from regression the error in these cases becomes zero. However, random errors are a feature of the real world and one variable treated with single imputation will be fundamentally different from the other variables. leads to overconfidence in our models and biases the coefficients upwards 20

Multiple Imputation Techniques Multiple imputation (MI) is also based on the idea of using predicted values, but it builds in mechanisms to incorporate uncertainty about the predicted values. MI imputes values for each missing data point, but it does so n times (usually 5). It then creates n (5) completed data sets. The observed values remain the same, but the imputed value varies across these 5 data sets, reflecting uncertainty. MI is much closer to reality when calculating new values. MI is a good alternative to listwise deletion because the main assumption is that data are MAR, meaning that some other variables in the data set may (and should) explain why an observation is missing 21

Multiple Imputation Techniques Details on expectation maximization (EM) algorithm, see King et al. (2001). 22 Figure: https://cran.r-project.org/web/packages/amelia/vignettes/amelia.pdf

Combination of results Run each analysis (e.g. regression) on all 5 imputed data sets. Collect all 5 coefficients and standard errors (and other measures of interest), and combine them into one estimate according to Rubin s Rule (1987): Estimates: average of the individual estimates Standard error: combine between-imputation variance and within-imputation variance See King et al. (2001). 23

Multiple Imputation Software {Amelia} in R (by Gary King and collaborators) {mi} in R (by Andrew Gelman and collaborators) {mice} in R (by Stef van Buuren and collaborators) SPSS (Analyze > Multiple Imputation) STATA mi estimate 24

Social Sciences Research Methods Centre Lab

Summarizing and Visualizing Missingness in R % of missingness per variable and subsets of variables Graphical display Using Amelia for diagnosis of missingness 26

MCAR patterns? 1) BaylorEdPsych (Little s Test to diagnose MCAR) https://cran.r-project.org/web/packages/bayloredpsych/ BaylorEdPsych.pdf 2) Creating a dummy variable for missingness 0/1, then running correlations among variables 27

Ad-hoc measures in R 1) Listwise deletion, pairwise deletion 2) Carry last value forward 3) Mean imputation 4) Manually recoding particular variables 5) Replace NAs with predicted values from regression 28

Example 1 Adapted from Schlomer et al. (2010) 60 clients under age 21 years at a large university counseling center were referred for counseling by the dean of students due to underage drinking violations. The counseling center randomly assigned the students to one of two treatment programs (independent variable: Group), one of which uses the harm reduction approach, and the other of which is based on a 12-step model. Participants self-efficacy for sobriety was measured before (covariate) and after (dependent variable) the counseling. 7 variations of the DV: DV with no missing; DV with 10%, 20%, and 50% MCAR, and DV with 10%, 20%, and 50% MAR 29

Example 1 Adapted from Schlomer et al. (2010) Goal: Compare biases in estimates of mean, standard deviation, regression coefficient, and standard error when the DV has 20% missing at random with when the DV has 0% missing using different missing data handling techniques. Step 1: Calculate M, SD, B, and SE with DV0Miss Step 2: Create the target data set with DV20MAR 30

Example 1 Adapted from Schlomer et al. (2010) Describe missing patterns Summarize and visualize missingness Little's (1998) MCAR test Dummy code missingness Ad-hoc methods Delete listwise or pairwise Carry last value forward Substitute with mean Recode manually Predict from regression Multiple imputation Amelia II 31

Multiple Imputation with Amelia II How to run an imputation in R incl diagnostics - run Amelia on a data set - saving an imputed data set - combining several data into an amelia object - how to deal with ordinal, nominal, natural log data - time series cross-section - lags and leads - overimputation - time series plots 32

Reproducibility Set seed (!!!) for yourself and others When you re-run Amelia after diagnostics and want to make changes, it s best to re-use exactly what you had with minimal changes Work in R, not the GUI version Keep your Rscript well commented; make a note of sessioninfo(), especially the Amelia and R version used 33

Reproducibility On 12/4/2012 5:40 AM, Nicole Janz wrote: Dear, I'm a PhD student at Cambridge University, and I work on foreign investment and labor standards. I read your with great interest. I was wondering if you could make the imputation Rcode available to me? I am asking this because I am using Amelia as well, and I would like to try and replicate your imputation with the same specifications. Hi Nicole - Thanks for the note. Unfortunately, we did this in AmeliaView, so we don't have R code available (I assume you've found the replication data and Stata code on my website). 34

More practical tips Set the seed! Include any variable in the analysis model in your imputation model. Maybe use auxiliary variables if they make sens. Include variables in the form they enter the model (lags, logs, leads, transformations). Don t impute things that don t make sense! Don t impute decades of missing data. Check diagnostics 35

Literature and tutorials Amelia mailing list https://lists.gking.harvard.edu/mailman/listinfo/amelia Tutorial for three MI software packages by Thomas Leeper http://thomasleeper.com/rcourse/tutorials/mi.html MISSING VALUES ANALYSIS & DATA IMPUTATION http://www.statisticalassociates.com/missingvaluesanalysis_p.pdf James Honaker and Gary King, What to do About Missing Values in Time Series Cross-Section Data American Journal of Political Science Vol. 54, No. 2 (April, 2010): Pp. 561-581. Gary King, James Honaker, Anne Joseph, and Kenneth Scheve. Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation, American Political Science Review, Vol. 95, No. 1 (March, 2001): Pp. 49-69. 36

Literature and tutorials Andrew Gelman and Jeniffer Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models, CHAPTER 25: Missing-data imputation. Cambridge University Press, Cambridge (2006). Much Ado About Nothing: A Comparison of Missing Data Methods and Software to Fit Incomplete Data Regression Models www.math.smith.edu/~nhorton/muchado.pdf Allison, Paul D. 2001. Missing Data. Sage University Papers Series on Quantitative Applications in the Social Sciences. Thousand Oaks: Sage. Enders, Craig. 2010. Applied Missing Data Analysis. Guilford Press: New York. Little, Roderick J., Donald Rubin. 2002. Statistical Analysis with Missing Data. John Wiley & Sons, Inc: Hoboken. Schafer, Joseph L., John W. Graham. 2002. MissingData: Our View of the State of the Art. Psychological Methods. 37

Thank you! Nicole Janz www.nicolejanz.de