Comparing Multiple Imputation to Single Imputation in the Presence of Large Design Effects: A Case Study and Some Theory
Nathaniel Schenker, Deputy Director, National Center for Health Statistics* (and a former colleague of Rod Little's at UCLA)
Symposium in Celebration of Rod Little's 65th Birthday
University of Michigan, Ann Arbor, MI, October 31, 2015
* The findings and opinions in this presentation are those of the speaker and do not necessarily represent the views of the National Center for Health Statistics, the Centers for Disease Control and Prevention, or the U.S. government.

Outline
- Empirical results based on the 2008 National Ambulatory Medical Care Survey (Lewis et al. 2014)
- Theoretical results based on a one-way, normal, random-effects model (He et al. forthcoming)
- Discussion of a few limitations and areas for future research
- An aside: Documenting one of Rod's less publicized talents

National Ambulatory Medical Care Survey (NAMCS)
- Administered by NCHS since 1973
- Objective: Collect and disseminate nationally representative data on office-based physician care
- Multistage design
  1. Single or grouped counties
  2. Physician practices (stratified by specialty)
  3. Patient visits during selected week
- In 2008, 30,000 visits in sample
- 1997 OMB standards for classifying race/ethnicity data

Missing Data on Race/Ethnicity in NAMCS (from Lewis et al. 2014)
[Figure: item nonresponse rate (%), axis 0 to 35, by year, 1999 to 2008]

Exploring Multiple Imputation for Missing Race Data in 2008 NAMCS
- NCHS research team developed imputation model
- Predictors: age, sex, urban/rural, physician specialty, reason for visit, log(time spent with physician), sample weights, zip-code-level proportions non-Hispanic white and non-Hispanic black from 2000 census
- Variable choices based on previously used cell-based procedure, advice from subject-matter experts, and desire to reflect sample design
- Used sequential regression multivariate imputation (Raghunathan et al. 2001) as implemented in IVEware
- Created five sets of imputations (D = 5)

Estimands Considered in Lewis et al. (2014)
- Race distributions (non-Hispanic white, non-Hispanic black, other)
  - Overall
  - By four regions
  - By five age groups
  - By diabetes status (yes, no)
- For each estimand, estimated the ratio of standard errors of estimates: multiple imputation/single imputation (MI/SI)

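The SE ratio for one estimand can be sketched with Rubin's combining rules. This is a minimal illustration, not the procedure documented in Lewis et al. (2014): the function names are hypothetical, the input numbers are made up, and the single-imputation variance is assumed to be the completed-data variance from the first imputed data set.

```python
import math
import statistics


def rubin_combine(estimates, variances):
    """Combine D completed-data estimates and variances with Rubin's rules."""
    d = len(estimates)
    q_bar = statistics.mean(estimates)   # MI point estimate
    u_bar = statistics.mean(variances)   # within-imputation variance
    b = statistics.variance(estimates)   # between-imputation variance (divisor D - 1)
    total = u_bar + (1 + 1 / d) * b      # total MI variance
    return q_bar, u_bar, b, total


def se_ratio_mi_to_si(estimates, variances):
    """SE ratio MI/SI; the SI variance is taken from the first imputed data set (assumption)."""
    _, _, _, total = rubin_combine(estimates, variances)
    return math.sqrt(total / variances[0])


# Hypothetical values for one estimand with D = 5 (not NAMCS results).
estimates = [0.700, 0.704, 0.702, 0.698, 0.706]
variances = [0.0004] * 5
ratio = se_ratio_mi_to_si(estimates, variances)
print(round(ratio, 3))
```

A ratio near 1 means that treating the imputed values as observed understates the standard error only slightly.
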
SE Ratios by Missingness Levels (from Lewis et al. 2014)
[Figure: estimated standard error ratio (MI/SI), axis 1.00 to 1.20, versus percent of observations missing, axis 20% to 60%]

SE Ratios by Missingness Levels
- Levels of ratios roughly consistent with those for 2000 NAMCS based on bootstrap re-imputation (Li et al. 2004)
- Ratios mostly rather low
- Why are ratios seemingly not related to missingness levels?

SE Ratios by Estimated Design Effects (from Lewis et al. 2014)
[Figure: estimated standard error ratio (MI/SI), axis 1.00 to 1.20, versus estimated design effect, axis 0 to 50]

SE Ratios by Estimated Design Effects
- Strong inverse relationship
- Ratios < 1.04 when DEFFs > 10
- Consistent with simulation result in Reiter et al. (2006), who reasoned as follows:
  The complex design makes the within-imputation variance a dominant factor relative to the between-imputation variance. That is, the fraction of missing information due to missing data is relatively small when compared to the effect of clustering.

Increase in Estimated Variance Attributable to Missing Data versus Complex Sample Design
- Increase attributable to both factors: $[U_{(\mathrm{SRS})} \times \mathrm{DEFF}] + (1 + 1/D)\,B - U_{(\mathrm{SRS})}$
- Proportion attributable to missing data: $(1 + 1/D)\,B$ divided by the increase attributable to both factors

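The decomposition on this slide can be evaluated numerically. The sketch below reads U_(SRS) as the variance the estimate would have under simple random sampling, B as the between-imputation variance, and D as the number of imputations; the function names and input values are hypothetical and chosen only to show how the missing-data share shrinks as DEFF grows.

```python
def variance_increase(u_srs, deff, b, d):
    """Increase over the SRS variance due to the design and to missing data."""
    return u_srs * deff + (1 + 1 / d) * b - u_srs


def proportion_due_to_missing_data(u_srs, deff, b, d):
    """Share of the increase that comes from the multiple-imputation term."""
    increase = variance_increase(u_srs, deff, b, d)
    if increase <= 0:
        raise ValueError("no variance increase to apportion")
    return (1 + 1 / d) * b / increase


# Hypothetical inputs: U_(SRS) = 1.0, B = 0.5, D = 5.
for deff in (1, 5, 10, 40):
    share = proportion_due_to_missing_data(1.0, deff, 0.5, 5)
    print(deff, round(share, 3))
```

With DEFF = 1 the whole increase comes from missing data; with large DEFFs the design term dominates.
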
Percents Attributable to Missing Data, by Estimated DEFFs (from Lewis et al. 2014)
[Figure: percent of variance increase attributable to missing data versus estimated design effect]

Value of SE Ratio if DEFF Were Equal to 1?
[Figure: four lowess-smoother panels of SE ratio (axis 1 to 1.15) versus DEFF (axis 0 to 100), with bandwidths .8, .7, .6, and .5]

Value of SE Ratio if DEFF Were Equal to 1?
- Lowess smoother analysis suggests SE ratio of 1.08 to 1.10
- Implies fraction of missing information of 14% to 17%

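The step from an SE ratio to a fraction of missing information (FMI) can be checked with a few lines of arithmetic. The sketch assumes a large number of imputations, so that the MI variance is T = U + B, the single-imputation variance is U, and FMI = B/T; under those assumptions (SE ratio)^2 = 1/(1 - FMI). The function name is hypothetical.

```python
def fmi_from_se_ratio(se_ratio):
    """FMI implied by an MI/SI standard error ratio, assuming (SE ratio)^2 = 1 / (1 - FMI)."""
    return 1.0 - 1.0 / se_ratio ** 2


for ratio in (1.08, 1.10):
    # Prints 14 for a ratio of 1.08 and 17 for a ratio of 1.10 (percent, rounded).
    print(ratio, round(100 * fmi_from_se_ratio(ratio)))
```
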
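Some Theory for Single-Stage Cluster Sampling (He et al. forthcoming)
- Simple random sample of m out of M clusters, each containing n elements
- Model-based representation: for i = 1, ..., m and j = 1, ..., n, a one-way, normal, random-effects model with mean μ, between-cluster variance τ², and within-cluster variance σ²
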
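A one-way, normal, random-effects model of this kind is commonly written as below. The symbols y_ij, α_i, and ε_ij are assumed notation for illustration; μ, τ², and σ² match the quantities used on the following slide.

```latex
y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \qquad
\alpha_i \sim N(0, \tau^2), \qquad
\varepsilon_{ij} \sim N(0, \sigma^2), \qquad
i = 1, \ldots, m, \quad j = 1, \ldots, n,
```

with the cluster effects α_i and the element-level errors ε_ij mutually independent.
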
Some Theory for Single-Stage Cluster Sampling
- Estimand: $\mu$
- If data were complete, would have $\mathrm{DEFF}_{com} = 1 + (n - 1)\rho$, where $\rho = \tau^2 / (\tau^2 + \sigma^2)$
- With missing data, assuming MCAR, and $r$ observations per cluster, $\mathrm{DEFF}_{obs} = 1 + (r - 1)\rho$

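The two design effects follow directly from the variance components. A minimal sketch, with hypothetical function names and hypothetical values for τ², σ², n, and r:

```python
def intraclass_correlation(tau2, sigma2):
    """rho = tau^2 / (tau^2 + sigma^2)."""
    return tau2 / (tau2 + sigma2)


def design_effect(cluster_size, rho):
    """DEFF = 1 + (cluster_size - 1) * rho for equal-sized clusters."""
    return 1.0 + (cluster_size - 1.0) * rho


# Hypothetical values: tau^2 = 1, sigma^2 = 9, n = 30 elements per cluster,
# r = 21 observed per cluster (30% missing).
rho = intraclass_correlation(1.0, 9.0)
deff_com = design_effect(30, rho)   # complete data
deff_obs = design_effect(21, rho)   # observed data
print(rho, round(deff_com, 2), round(deff_obs, 2))
```

Because r is no larger than n, the observed-data design effect never exceeds the complete-data one.
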
Some Theory for Single-Stage Cluster Sampling
- Approximations for multiple imputation ($D = \infty$) with missingness rate $P_{mis}$: $\mathrm{FMI} \approx P_{mis} / \mathrm{DEFF}_{obs}$ and $\mathrm{FMI} \approx P_{mis} / [(1 - P_{mis})\,\mathrm{DEFF}_{com} + P_{mis}]$
- Derivations assume that $\mathrm{DEFF}_{obs} \ll r$ and $\mathrm{DEFF}_{com} \ll n$; if assumption violated, formulas can be used as simple upper bounds
- If $\rho = 0$, then approximations imply $\mathrm{FMI} \approx P_{mis}$

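The approximations can be turned into predicted SE ratios. This is a sketch under stated assumptions: the FMI formulas are read as P_mis divided by DEFF_obs and P_mis divided by (1 - P_mis) DEFF_com + P_mis, and the SE ratio is taken to be (1 - FMI)^(-1/2), the same mapping that links ratios of 1.08 to 1.10 with FMIs of 14% to 17% earlier in the talk. Function names and the missingness rate are hypothetical.

```python
import math


def fmi_from_deff_obs(p_mis, deff_obs):
    """Approximate FMI from the observed-data design effect."""
    return p_mis / deff_obs


def fmi_from_deff_com(p_mis, deff_com):
    """Approximate FMI from the complete-data design effect."""
    return p_mis / ((1.0 - p_mis) * deff_com + p_mis)


def predicted_se_ratio(fmi):
    """MI/SI standard error ratio implied by an FMI (assumed mapping)."""
    return 1.0 / math.sqrt(1.0 - fmi)


p_mis = 0.30  # hypothetical missingness rate
for deff_com in (1, 5, 10, 40):
    fmi = fmi_from_deff_com(p_mis, deff_com)
    print(deff_com, round(fmi, 3), round(predicted_se_ratio(fmi), 3))
```

With DEFF_com = 1 the approximation returns P_mis itself, and the predicted ratio falls toward 1 as the design effect grows.
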
SE Ratios Predicted Using FMI Approximation Based on $\mathrm{DEFF}_{com}$ (from He et al. forthcoming)
[Figure: predicted SE ratios]

How Well Do Approximations Predict Results for 2008 NAMCS? (from He et al. forthcoming)
[Figure: two panels, one for the approximation based on $\mathrm{DEFF}_{com}$ and one for the approximation based on $\mathrm{DEFF}_{obs}$]

Discussion
- Case study of 2008 NAMCS
  - Considered coarse domains; finer domains often have smaller DEFFs
  - In 2009, changes in awareness among field representatives and in nonresponse on race
  - Beginning in 2012, no clustering by counties; PSUs are physician offices
  - Would be useful to study impacts
- Theoretical results
  - Can be thought of as extension of Rubin and Schenker (1986)
  - Would be useful to go beyond MCAR
  - Other factors influence DEFFs; e.g., weights, multiple stages of clustering

References
He, Y., Shimizu, I., Schappert, S., Xu, J., Beresovsky, V., Khan, D., Valverde, R., and Schenker, N. (forthcoming), A Note on the Effect of Data Clustering on the Multiple Imputation Variance Estimator: An Addendum to Taylor et al. (2014), to appear in the Journal of Official Statistics.
Lewis, T., Goldberg, E., Schenker, N., Beresovsky, V., Schappert, S., Decker, S., Sonnenfeld, N., and Shimizu, I. (2014), The Relative Impacts of Design Effects and Multiple Imputation on Variance Estimates: A Case Study with the 2008 National Ambulatory Medical Care Survey, Journal of Official Statistics, 30, 147-161.
Li, Y., Lynch, C., Shimizu, I., and Kaufman, S. (2004), Imputation Variance Estimation by Bootstrap Method for the National Ambulatory Medical Care Survey, American Statistical Association Proceedings of the Survey Research Methods Section.
Raghunathan, T., Lepkowski, J., Van Hoewyk, J., and Solenberger, P. (2001), A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models, Survey Methodology, 27, 85-95.
Reiter, J., Raghunathan, T., and Kinney, S. (2006), The Importance of Modeling the Sampling Design in Multiple Imputation for Missing Survey Data, Survey Methodology, 32, 143-150.
Rubin, D.B., and Schenker, N. (1986), Multiple Imputation for Interval Estimation from Simple Random Samples with Ignorable Nonresponse, Journal of the American Statistical Association, 81, 366-374.

HAPPY BIRTHDAY, ROD!