ethnicity recording in primary care

Similar documents
How should the propensity score be estimated when some confounders are partially observed?

Analysis of TB prevalence surveys

Supplementary Online Content

Bayesian methods for combining multiple Individual and Aggregate data Sources in observational studies

Propensity scores and instrumental variables to control for confounding. ISPE mid-year meeting München, 2013 Rolf H.H. Groenwold, MD, PhD

Bias in regression coefficient estimates when assumptions for handling missing data are violated: a simulation study

An Introduction to Multiple Imputation for Missing Items in Complex Surveys

Strategies for handling missing data in randomised trials

Module 14: Missing Data Concepts

Bayesian graphical models for combining multiple data sources, with applications in environmental epidemiology

Catherine A. Welch 1*, Séverine Sabia 1,2, Eric Brunner 1, Mika Kivimäki 1 and Martin J. Shipley 1

Missing Data and Imputation

Treatment effect estimates adjusted for small-study effects via a limit meta-analysis

Missing data in clinical trials: making the best of what we haven t got.

Selected Topics in Biostatistics Seminar Series. Missing Data. Sponsored by: Center For Clinical Investigation and Cleveland CTSC

Least likely observations in regression models for categorical outcomes

Multiple imputation for handling missing outcome data when estimating the relative risk

Master thesis Department of Statistics

Section on Survey Research Methods JSM 2009

SESUG Paper SD

Alternative indicators for the risk of non-response bias

Exploring the Impact of Missing Data in Multiple Regression

Appendix 1. Sensitivity analysis for ACQ: missing value analysis by multiple imputation

Maintenance of weight loss and behaviour. dietary intervention: 1 year follow up

Implementing double-robust estimators of causal effects

What to do with missing data in clinical registry analysis?

UN Handbook Ch. 7 'Managing sources of non-sampling error': recommendations on response rates

ESM1 for Glucose, blood pressure and cholesterol levels and their relationships to clinical outcomes in type 2 diabetes: a retrospective cohort study

Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models, 2nd Ed.

JSM Survey Research Methods Section

Downloaded from:

Kelvin Chan Feb 10, 2015

A COMPARISON OF IMPUTATION METHODS FOR MISSING DATA IN A MULTI-CENTER RANDOMIZED CLINICAL TRIAL: THE IMPACT STUDY

Best Practice in Handling Cases of Missing or Incomplete Values in Data Analysis: A Guide against Eliminating Other Important Data

OBSERVATIONAL MEDICAL OUTCOMES PARTNERSHIP

Performance of the Trim and Fill Method in Adjusting for the Publication Bias in Meta-Analysis of Continuous Data

Individual Participant Data (IPD) Meta-analysis of prediction modelling studies

AMELIA II: A Package for Missing Data

Editor Executive Editor Associate Editors Copyright Statement:

CASE STUDY: Measles Mumps & Rubella vaccination. Health Equity Audit

Sequential nonparametric regression multiple imputations. Irina Bondarenko and Trivellore Raghunathan

S Imputation of Categorical Missing Data: A comparison of Multivariate Normal and. Multinomial Methods. Holmes Finch.

Inclusive Strategy with Confirmatory Factor Analysis, Multiple Imputation, and. All Incomplete Variables. Jin Eun Yoo, Brian French, Susan Maller

MISSING DATA AND PARAMETERS ESTIMATES IN MULTIDIMENSIONAL ITEM RESPONSE MODELS. Federico Andreis, Pier Alda Ferrari *

Subject index. bootstrap...94 National Maternal and Infant Health Study (NMIHS) example

Confounding Bias: Stratification

Comparison of methods for imputing limited-range variables: a simulation study

Evaluators Perspectives on Research on Evaluation

Quantifying the clinical measure of interest in the presence of missing data:

Missing by Design: Planned Missing-Data Designs in Social Science

Logistic Regression with Missing Data: A Comparison of Handling Methods, and Effects of Percent Missing Values

Chapter 3 Missing data in a multi-item questionnaire are best handled by multiple imputation at the item score level

Should a Normal Imputation Model Be Modified to Impute Skewed Variables?

research methods & reporting

Missing Data in Longitudinal Studies: Strategies for Bayesian Modeling, Sensitivity Analysis, and Causal Inference

Missing data and multiple imputation in clinical epidemiological research

Jinhui Ma 1,2,3, Parminder Raina 1,2, Joseph Beyene 1 and Lehana Thabane 1,3,4,5*

Change In Population TB Incidence Trends After The Roll-Out Of ART in Karonga District, North Malawi:

2017 American Medical Association. All rights reserved.

Mostly Harmless Simulations? On the Internal Validity of Empirical Monte Carlo Studies

Empirical evidence on sources of bias in randomised controlled trials: methods of and results from the BRANDO study

In this module I provide a few illustrations of options within lavaan for handling various situations.

Moving beyond regression toward causality:

Genetics, epigenetics, and Mendelian randomization: cuttingedge tools for causal inference in epidemiology

George B. Ploubidis. The role of sensitivity analysis in the estimation of causal pathways from observational data. Improving health worldwide

Modeling unobserved heterogeneity in Stata

Advanced Handling of Missing Data

Identifying relevant sensitivity analyses for clinical trials

Further data analysis topics

Multivariate Regression with Small Samples: A Comparison of Estimation Methods W. Holmes Finch Maria E. Hernández Finch Ball State University

Accuracy of Range Restriction Correction with Multiple Imputation in Small and Moderate Samples: A Simulation Study

Analyzing diastolic and systolic blood pressure individually or jointly?

Propensity Score Methods for Estimating Causality in the Absence of Random Assignment: Applications for Child Care Policy Research

Improving ecological inference using individual-level data

Clinical Epidemiology for the uninitiated

cloglog link function to transform the (population) hazard probability into a continuous

Title. Description. Remarks. Motivating example. intro substantive Introduction to multiple-imputation analysis

Using negative control outcomes to identify biased study design: A self-controlled case series example. James Weaver 1,2.

SCHIATTINO ET AL. Biol Res 38, 2005, Disciplinary Program of Immunology, ICBM, Faculty of Medicine, University of Chile. 3

Bias reduction with an adjustment for participants intent to dropout of a randomized controlled clinical trial

You must answer question 1.

Propensity Score Methods for Causal Inference with the PSMATCH Procedure

Chapter 11: Advanced Remedial Measures. Weighted Least Squares (WLS)

Multiple Imputation For Missing Data: What Is It And How Can I Use It?

Confounding by indication developments in matching, and instrumental variable methods. Richard Grieve London School of Hygiene and Tropical Medicine

Substance Use Among Potential Kidney Transplant Candidates and its Impact on Access to Kidney Transplantation: A Canadian Cohort Study

Dr Elizabeth Williamson. Medical Statistics Department Faculty of Epidemiology & Population Health. CSM Big Data Symposium, 2017

Statistical considerations in indirect comparisons and network meta-analysis

educational assessment and educational measurement

Methods for Addressing Selection Bias in Observational Studies

Modern Strategies to Handle Missing Data: A Showcase of Research on Foster Children

Propensity score methods to adjust for confounding in assessing treatment effects: bias and precision

Sarwar Islam Mozumder 1, Mark J Rutherford 1 & Paul C Lambert 1, Stata London Users Group Meeting

RESEARCH. Predicting cardiovascular risk in England and Wales: prospective derivation and validation of QRISK2

Improved Transparency in Key Operational Decisions in Real World Evidence

SAIL. Rowena Bailey. Farr Institute, Swansea University

(b) empirical power. IV: blinded IV: unblinded Regr: blinded Regr: unblinded α. empirical power

Discussion Meeting for MCP-Mod Qualification Opinion Request. Novartis 10 July 2013 EMA, London, UK

Longitudinal data monitoring for Child Health Indicators

Supplementary Appendix

Transcription:

ethnicity recording in primary care Multiple imputation of missing data in ethnicity recording using The Health Improvement Network database Tra Pham 1 PhD Supervisors: Dr Irene Petersen 1, Prof James Carpenter 2, Dr Tim Morris 3 1 Department of Primary Care & Population Health - UCL 2 Department of Medical Statistics - LSHTM 3 Hub for Trials Methodology Research, MRC Clinical Trials Unit - UCL Research funded by the Farr Institute of Health Informatics, London

outline

outline Background Ethnicity recording in primary care Aims of this project Study I - Impute missing ethnicity in THIN Sample Weighted multiple imputation Comparison of missing data methods Study II - Simulation study Motivation Methods Results Weighted multiple imputation by chained equations Motivation Syntax Summary 2

background

ethnicity recording in the health improvement network THIN is one of UK s largest primary care databases Contains records 12+ million patients in over 550 practices Broadly representative of the UK population Offers many opportunities for research Complete and reliable ethnicity recording is important for health research High level of missing ethnicity in primary care 4

aims of this project 1. Explore the method of weighted multiple imputation to deal with missing ethnicity data 2. Introduce a new Stata command to implement weighted multiple imputation in multivariate missing data settings 5

study i - impute missing ethnicity in thin

sample 7 millions+ individuals in THIN between 2000-2013 Ethnicity information is extracted using Read codes 5 categories of ethnicity: white, mixed, Asian, black, other 60% of individuals had missing ethnicity status 7

how should missing ethnicity data be handled? Studies often handle missing ethnicity data by Excluding ethnicity in the analysis Excluding individuals with missing ethnicity from the analysis Imputing missing ethnicity as white A popular alternative is to use multiple imputation (MI)...! DATASET& 1& RESULTS& 1& DATASET& 2& RESULTS& 2& DATASET& WITH& MISSING& VALUES& COMBINED& RESULTS& Create&new&values&to&replace& missing&values&in&dataset& DATASET& M& RESULTS& M& Analyse&each&new&dataset& separately& 8

how should missing ethnicity data be handled?... However, MI does not give plausible estimates compared with the census Census data can be use to weight MI such that the correct ethnicity distribution is recovered 9

post-stratification (ps) weights Deals with bias from non-response and under-represented groups in analysis of survey data Hypothetical example: Weight ps = 1/(p sample /p census ) Census Proportion Sample Proportion Weight ps White 0.6 0.8 0.75 Asian 0.2 0.1 2 Black 0.15 0.05 3 Other 0.05 0.05 1 White ethnic group: p weighted = 0.8 0.75 = 0.6 = p census 10

weighted mi of ethnicity in thin using ps weights Table 1: weights Ethnic proportions in one imputed dataset using PS Ethnicity (%) Observed Imputed Combined * Census White 67.1 63.2 65.7 59.8 Mixed 2.9 4.3 3.4 5.0 Asian 11.7 16.6 13.5 18.5 Black 11.7 13.2 12.2 13.3 Other 6.6 2.7 5.2 3.4 Total 100 100 100 100 * Combined data = observed data + imputed data 11

adjusted weights for multiple imputation Proportions required in the imputed data to get to the census proportions after imputation Ethnicity Observed Imputed Combined* White p obs N obs p req N mis p census N total.... Total N obs N mis N total * Combined data = observed data + imputed data p req = p census N total p obs N obs N mis Weight MI = 1/(p obs /p req ) 12

weighted mi of ethnicity in thin using adjusted weights Table 2: Ethnic proportions in one imputed dataset using adjusted weights Ethnicity Observed Imputed Combined * Census White 67.1 50.6 61.0 59.8 Mixed 2.9 7.4 4.6 5.0 Asian 11.7 26.6 17.3 18.5 Black 11.7 15.4 13.0 13.3 Other 6.6 0 4.1 3.4 Total 100 100 100 100 * Combined data = observed data + imputed data 13

weighted mi of ethnicity in thin using adjusted weights Figure 1: Ethnic proportions in one imputed dataset using adjusted weights 80 60 Observed Imputed Observed + Imputed Census % 40 20 0 White Mixed Asian Black Other Ethnicity 14

weighted multiple imputation in stata Predictors for imputation: year of birth, sex, deprivation score, indicator if first registration is between 2006-2013 mi impute mlogit command allows for probability weights specification: mi impute mlogit varlist [pweights] [, mi_options ] 15

comparison of missing data methods Figure 2: Ethnic proportions under different missing data methods % 100 90 80 70 60 50 2 1 3 4 White 1 Complete case analysis 2 Impute missing as white 3 Regular MI 4 Weighted MI Census 40 30 20 10 0 Mixed 1 3 4 2 Asian 4 1 3 2 Ethnicity Black 1 3 4 2 1 3 2 4 Other 16

study ii - simulation study

motivation So far the weighted MI gives plausible estimates of ethnicity proportions However, studies often focus on the association between ethnicity and outcome of interest How to deal with missing data when ethnicity is an exposure/covariate? Aim: evaluate the performance of weighted MI using a simulation study 18

design of simulation study COMPLETE CASE IMPUTE AS WHITE REGULAR MI WEIGHTED MI DATA GENERATING MODEL FULL DATA MNAR MODEL FOR ETHNICITY INCOMPLETE DATA COMPARE OUTCOMES MODEL OF INTEREST MODEL OF INTEREST 19

model of interest Motivated by the association between ethnicity and myocardial infarction Statistical model: exponential survival model for the association between ethnicity and time to first diagnosis of myocardial infarction, adjusted for age and sex 20

data generating and mnar mechanisms 1. Data generating mechanism: Data are simulated that closely model a sample derived from THIN Ethnicity is generated as a 3-category variable: white, Asian, black, using proportions from the census 2. MNAR mechanism: Ethnicity is made MNAR depending on follow-up time, sex and ethnicity 21

results - bias Define β = 1 K ˆβ k Estimated bias = β β k Age Female 0 -.001 -.002.06.04 -.003.02 Bias in point estimate -.004 0 -.2 Asian 0 0 -.2 Black -.4 Full data Complete case Impute as white Regular MI Weighted MI -.4 Full data Complete case Impute as white Regular MI Weighted MI Bias Bias ± 2MCSE 22

results - empirical standard errors Empirical standard error = 1 K 1 ( ˆβ k β) 2 k.0012 Age.026 Female.0011.024.001.022.0009.02 Empirical SE.045.04.035 Asian.05.045.04 Black.03.035.025 Full data Complete case Impute as white Regular MI Weighted MI.03 Full data Complete case Impute as white Regular MI Weighted MI Empirical SE Emperial SE ± 2MCSE 23

results - coverage Coverage = 1 K 1( ˆβ k β < z α/2 s k ) k Coverage of nominal 95% CI Age 100 50 0 Asian 100 50 0 Full data Complete case Impute as white Regular MI Weighted MI Female 100 80 60 40 20 Black 100 50 0 Full data Complete case Impute as white Regular MI Weighted MI Coverage Coverage ± 2MCSE 24

summary of results 1. Complete case analysis leads to biased results and loss in efficiency 2. Impute missing data as white biases the association between non-white groups and outcome 3. Regular MI does not correct for bias when ethnicity is MNAR 4. Weighted MI gives closest estimates to true values and substantial coverages 25

weighted mi by chained equations

motivation Weighted MI only implemented in univariate missing data of ethnicity so far Multiple imputation of chained equations (MICE) for multivariate missing data problems To get one set of imputed values, iterate over t = 0, 1,..., T and impute: Y t+1 1 using Y t 2, Y 1 3,..., Y t k Y t+1 2 using Y t+1 1, Y t 3,..., Y t k. Y t+1 k using Y t+1 1, Y t+1 2,..., Y t+1 k 1 mi impute chained only allows specification of global weights, which are applied to all equations 27

weighted multiple imputation by chained equations in stata Weighted imputation by chained equations - wice Allows weights specification for one conditional models while no weights for others: Y t+1 1 using Y t 2, Y 1 3,..., Y t k pw 1 Y t+1 2 using Y t+1 1, Y t 3,..., Y t k. Y t+1 k using Y t+1 1, Y t+1 2,..., Y t+1 k 1 28

weighted multiple imputation by chained equations in stata Stata syntax: wice (method 1 impvar 1 [pweight 1 ])... (method K impvar K [pweight K ]) : indvars [, options] Current features: 1. Univariate imputation methods: regress, logit, ologit, mlogit 2. Global options: add, cycles, seed, noisily, dryrun 29

conclusions Ad-hoc methods can lead to biased results and misleading conclusions Regular MI fails to produce valid inferences when data are MNAR Weighted MI is a simple solution and performs better than regular MI wice offers flexibility for weighted MI by chained equations 30