Comparing Multiple Imputation to Single Imputation in the Presence of Large Design Effects: A Case Study and Some Theory


Nathaniel Schenker
Deputy Director, National Center for Health Statistics* (and a former colleague of Rod Little's at UCLA)
Symposium in Celebration of Rod Little's 65th Birthday
University of Michigan, Ann Arbor, MI
October 31, 2015
* The findings and opinions in this presentation are those of the speaker and do not necessarily represent the views of the National Center for Health Statistics, the Centers for Disease Control and Prevention, or the U.S. government.

Outline
- Empirical results based on the 2008 National Ambulatory Medical Care Survey (Lewis et al. 2014)
- Theoretical results based on a one-way, normal, random-effects model (He et al. forthcoming)
- Discussion of a few limitations and areas for future research
- An aside: Documenting one of Rod's less publicized talents

National Ambulatory Medical Care Survey (NAMCS)
- Administered by NCHS since 1973
- Objective: Collect and disseminate nationally representative data on office-based physician care
- Multistage design
  1. Single or grouped counties
  2. Physician practices (stratified by specialty)
  3. Patient visits during selected week
- In 2008, 30,000 visits in sample
- 1997 OMB standards for classifying race/ethnicity data

Missing Data on Race/Ethnicity in NAMCS (from Lewis et al. 2014)
[Figure: item nonresponse rate (%), axis from 0 to 35, by year, 1999 to 2008]

Exploring Multiple Imputation for Missing Race Data in 2008 NAMCS
- NCHS research team developed imputation model
- Predictors: age, sex, urban/rural, physician specialty, reason for visit, log(time spent with physician), sample weights, zip-code level proportions non-Hispanic white and non-Hispanic black from 2000 census
- Variable choices based on previously used cell-based procedure, advice from subject-matter experts, and desire to reflect sample design
- Used sequential regression multivariate imputation (Raghunathan et al. 2001) as implemented in IVEware
- Created five sets of imputations (D = 5)

Estimands Considered in Lewis et al. (2014)
- Race distributions (non-Hispanic white, non-Hispanic black, other)
  - Overall
  - By four regions
  - By five age groups
  - By diabetes status (yes, no)
- For each estimand, estimated ratio of standard errors of estimates: multiple imputation/single imputation

SE Ratios by Missingness Levels (from Lewis et al. 2014)
[Figure: estimated standard error ratio (MI/SI), axis from 1.00 to 1.20, versus percent of observations missing, axis from 20.0% to 60.0%]

SE Ratios by Missingness Levels
- Levels of ratios roughly consistent with those for 2000 NAMCS based on bootstrap re-imputation (Li et al. 2004)
- Ratios mostly rather low
- Why are ratios seemingly not related to missingness levels?

SE Ratios by Estimated Design Effects (from Lewis et al. 2014)
[Figure: estimated standard error ratio (MI/SI), axis from 1.00 to 1.20, versus estimated design effect, axis from 0 to 50]

SE Ratios by Estimated Design Effects
- Strong inverse relationship
- Ratios < 1.04 when DEFFs > 10
- Consistent with simulation result in Reiter et al. (2006), who reasoned as follows:
  The complex design makes the within-imputation variance a dominant factor relative to the between-imputation variance. That is, the fraction of missing information due to missing data is relatively small when compared to the effect of clustering.

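Increase in Estimated Variance Attributable to Missing Data versus Complex Sample Design
Increase attributable to both factors:
$$\left[\bar{U}^{(\mathrm{SRS})} \times \mathrm{DEFF}\right] + \left(1 + \frac{1}{D}\right) B - \bar{U}^{(\mathrm{SRS})}$$
Proportion attributable to missing data:
$$\frac{\left(1 + \frac{1}{D}\right) B}{\left[\bar{U}^{(\mathrm{SRS})} \times \mathrm{DEFF}\right] + \left(1 + \frac{1}{D}\right) B - \bar{U}^{(\mathrm{SRS})}}$$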
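The decomposition above can be sketched numerically. The following is a minimal illustration, not the computation used in Lewis et al. (2014): it assumes that U(SRS) is the within-imputation variance divided by the estimated DEFF, that B is the usual between-imputation variance with a D - 1 divisor, and that the function name, argument names, and input values are hypothetical.

```python
from statistics import mean, variance


def variance_increase_shares(estimates, within_variances, deff):
    """Split the variance increase over an SRS, complete-data baseline.

    estimates        -- point estimate from each of the D completed data sets
    within_variances -- design-based variance estimate from each data set
    deff             -- estimated design effect (assumed > 0)
    """
    d = len(estimates)
    u_bar = mean(within_variances)   # within-imputation variance
    b = variance(estimates)          # between-imputation variance (D - 1 divisor)
    u_srs = u_bar / deff             # assumed meaning of U(SRS): u_bar = u_srs * DEFF
    missing_part = (1 + 1 / d) * b   # variance added by missing data
    increase = u_srs * deff + missing_part - u_srs
    return {
        "total_variance": u_bar + missing_part,
        "increase": increase,
        "share_missing_data": missing_part / increase,
        "share_design": (u_srs * deff - u_srs) / increase,
    }


# Made-up inputs for illustration only (not NAMCS results).
shares = variance_increase_shares(
    estimates=[0.70, 0.72, 0.71, 0.69, 0.73],
    within_variances=[0.0004] * 5,
    deff=10.0,
)
print(shares)
```

With these made-up inputs, the two shares sum to one by construction, and raising `deff` while holding the other inputs fixed shrinks the share attributed to missing data.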

Percents Attributable to Missing Data, by Estimated DEFFs (from Lewis et al. 2014)
[Figure: percent of the variance increase attributable to missing data versus estimated DEFF]

Value of SE Ratio if DEFF Were Equal to 1?
[Figure: four lowess-smoother panels of SE ratio (axis from 1 to 1.15) versus DEFF (axis from 0 to 100), with bandwidths .8, .7, .6, and .5]

Value of SE Ratio if DEFF Were Equal to 1?
- Lowess smoother analysis suggests SE Ratio of 1.08 to 1.1
- Implies fraction of missing information of 14% to 17%
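The step from an SE ratio to a fraction of missing information can be checked numerically. This sketch assumes that the single-imputation standard error reflects only the within-imputation variance U, that D is large so the MI variance is T = U + B, and that FMI is approximated by B/T; under those assumptions (SE ratio)^2 = T/U = 1/(1 - FMI). The function names are illustrative.

```python
import math


def fmi_from_se_ratio(se_ratio):
    """FMI implied by an MI/SI standard-error ratio, assuming T/U = 1/(1 - FMI)."""
    if se_ratio < 1.0:
        raise ValueError("an MI/SI SE ratio below 1 has no FMI under this assumption")
    return 1.0 - 1.0 / se_ratio ** 2


def se_ratio_from_fmi(fmi):
    """Inverse mapping: MI/SI standard-error ratio implied by a given FMI."""
    return 1.0 / math.sqrt(1.0 - fmi)


for ratio in (1.08, 1.10):
    print(f"SE ratio {ratio:.2f} -> FMI of about {fmi_from_se_ratio(ratio):.0%}")
```

Under these assumptions the two ratios map to about 14% and 17%, matching the range stated on the slide.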

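Some Theory for Single-Stage Cluster Sampling (He et al. forthcoming)
- Simple random sample of m out of M clusters, each containing n elements
- Model-based representation: For i = 1, ..., m, j = 1, ..., n,
$$y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \qquad \alpha_i \sim N(0, \tau^2), \qquad \varepsilon_{ij} \sim N(0, \sigma^2)$$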

Some Theory for Single-Stage Cluster Sampling
- Estimand: $\mu$
- If data were complete, would have $\mathrm{DEFF}_{com} = 1 + (n - 1)\rho$, where $\rho = \dfrac{\tau^2}{\tau^2 + \sigma^2}$
- With missing data, assuming MCAR, and r observations per cluster, $\mathrm{DEFF}_{obs} = 1 + (r - 1)\rho$
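As a quick worked illustration of the two design-effect formulas, take hypothetical values that do not come from the talk: n = 20 elements per cluster, r = 15 of them observed, and an intracluster correlation of 0.1.

```latex
\begin{align*}
\mathrm{DEFF}_{com} &= 1 + (n - 1)\rho = 1 + (20 - 1)(0.1) = 2.9, \\
\mathrm{DEFF}_{obs} &= 1 + (r - 1)\rho = 1 + (15 - 1)(0.1) = 2.4.
\end{align*}
```

Because r is smaller than n, the observed-data design effect is the smaller of the two whenever the intracluster correlation is positive.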

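Some Theory for Single-Stage Cluster Sampling
- Approximations for multiple imputation ($D \to \infty$) with missingness rate $P_{mis}$:
$$\mathrm{FMI} \approx \frac{P_{mis}}{\mathrm{DEFF}_{obs}} \qquad \text{and} \qquad \mathrm{FMI} \approx \frac{P_{mis}}{(1 - P_{mis})\,\mathrm{DEFF}_{com} + P_{mis}}$$
- Derivations assume that $\mathrm{DEFF}_{obs} \ll r$ and $\mathrm{DEFF}_{com} \ll n$; if assumption violated, formulas can be used as simple upper bounds
- If $\rho = 0$, then approximations imply $\mathrm{FMI} \approx P_{mis}$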
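The two approximations are easy to compare side by side. The sketch below is illustrative only: the function names and inputs are hypothetical, and it assumes r = (1 - P_mis) n observed elements in every cluster. For comparison it also computes a variance-ratio value, 1 - Var(complete-data mean)/Var(observed-data mean) under the random-effects model with total variance scaled to 1; that comparison is an assumption of this sketch and is not taken from the slides.

```python
def design_effect(units_per_cluster, rho):
    """DEFF = 1 + (units per cluster - 1) * rho for equal-sized clusters."""
    return 1.0 + (units_per_cluster - 1.0) * rho


def fmi_approximations(n, p_mis, rho):
    """Compare the two FMI approximations for n elements per cluster."""
    r = (1.0 - p_mis) * n  # assumed: same number observed in every cluster
    deff_com = design_effect(n, rho)
    deff_obs = design_effect(r, rho)
    # Cluster-mean variances up to a common factor, with tau^2 = rho, sigma^2 = 1 - rho.
    var_complete = rho + (1.0 - rho) / n
    var_observed = rho + (1.0 - rho) / r
    return {
        "deff_com": deff_com,
        "deff_obs": deff_obs,
        "fmi_from_deff_obs": p_mis / deff_obs,
        "fmi_from_deff_com": p_mis / ((1.0 - p_mis) * deff_com + p_mis),
        "fmi_variance_ratio": 1.0 - var_complete / var_observed,
    }


# Hypothetical inputs: 20 elements per cluster, 25% missing, rho = 0.1.
result = fmi_approximations(n=20, p_mis=0.25, rho=0.1)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

With these inputs both approximations sit slightly above the variance-ratio value, in line with their use as upper bounds, and setting `rho=0` makes all three equal to the missingness rate.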

SE Ratios Predicted Using FMI Approximation Based on $\mathrm{DEFF}_{com}$ (from He et al. forthcoming)
[Figure: predicted SE ratios]

How Well Do Approximations Predict Results for 2008 NAMCS? (from He et al. forthcoming)
[Figure: two panels, one for the approximation based on $\mathrm{DEFF}_{com}$ and one for the approximation based on $\mathrm{DEFF}_{obs}$]

Discussion
Case study of 2008 NAMCS
- Considered coarse domains; often, finer domains have smaller DEFFs
- In 2009, awareness among field representatives; nonresponse on race
- Beginning in 2012, no clustering by counties; PSUs are physician offices
- Would be useful to study impacts
Theoretical results
- Can be thought of as extension of Rubin and Schenker (1986)
- Would be useful to go beyond MCAR
- Other factors influence DEFFs; e.g., weights, multiple stages of clustering

References
He, Y., Shimizu, I., Schappert, S., Xu, J., Beresovsky, V., Khan, D., Valverde, R., and Schenker, N. (forthcoming), A Note on the Effect of Data Clustering on the Multiple Imputation Variance Estimator: An Addendum to Taylor et al. (2014), to appear in the Journal of Official Statistics.
Lewis, T., Goldberg, E., Schenker, N., Beresovsky, V., Schappert, S., Decker, S., Sonnenfeld, N., and Shimizu, I. (2014), The Relative Impacts of Design Effects and Multiple Imputation on Variance Estimates: A Case Study with the 2008 National Ambulatory Medical Care Survey, Journal of Official Statistics, 30, 147-161.
Li, Y., Lynch, C., Shimizu, I., and Kaufman, S. (2004), Imputation Variance Estimation by Bootstrap Method for the National Ambulatory Medical Care Survey, American Statistical Association Proceedings of the Survey Research Methods Section.
Raghunathan, T., Lepkowski, J., Van Hoewyk, J., and Solenberger, P. (2001), A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models, Survey Methodology, 27, 85-95.
Reiter, J., Raghunathan, T., and Kinney, S. (2006), The Importance of Modeling the Sampling Design in Multiple Imputation for Missing Survey Data, Survey Methodology, 32, 143-150.
Rubin, D.B., and Schenker, N. (1986), Multiple Imputation for Interval Estimation from Simple Random Samples with Ignorable Nonresponse, Journal of the American Statistical Association, 81, 366-374.


HAPPY BIRTHDAY, ROD!