Selected Topics in Biostatistics Seminar Series
Missing Data
Sponsored by: Center For Clinical Investigation and Cleveland CTSC
Brian Schmotzer, MS
Biostatistician, CCI Statistical Sciences Core
brian.schmotzer@case.edu
June 23, 2010

Outline
- Missing data
  - What is it, what does it look like?
  - How did we get in this mess?
  - What are the consequences?
- Goals for analyzing data in the presence of missingness
- Missing data assumptions
  - What types of missing data are there?
- Traditional approaches
  - What have people typically done in the past?
  - What are the consequences of these approaches?
- Newer approach
  - What is the state of the art now?
  - How is it better than traditional approaches?

Missing Data Warnings
- Missing data is the single most pervasive analytical problem in research studies
- Most medical research papers do not refer to an adequate analysis approach for dealing with missing data
  - Authors unaware/untrained?
  - Journals and/or reviewers not savvy?

What is Missing Data?
- Any value for any variable that you do not have
- Can arise due to:
  - Subject lost to follow-up
  - Missed/skipped visits
  - Instrument errors or failures
  - Misplaced data extraction sheets
  - We just didn't collect that value, etc.

Example: coronary artery bypass grafting

ID   Age   # Diseased Vessels   Previous Surgery   Pump Type   Mortality Status
1    65    3                    No                 Off         Alive
2    77    3                    No                 Off         Alive
3    49    6                    Yes                On          Dead
4    62    3                    No                 Off         Alive
5    80    4                    No                 On          Alive
6    70    2                    No                 Off         Alive
7    83    3                    No                 On          Alive

Example: Some Missing Data

ID   Age         # Diseased Vessels   Previous Surgery   Pump Type   Mortality Status
1    65          (missing)            No                 Off         Alive
2    77          3                    No                 Off         Alive
3    (missing)   6                    Yes                On          Dead
4    62          3                    No                 Off         Alive
5    80          4                    (missing)          (missing)   Alive
6    70          2                    No                 Off         Alive
7    83          3                    No                 On          Alive

Example: More Missing Data

ID   Age         # Diseased Vessels   Previous Surgery   Pump Type   Mortality Status
1    65          (missing)            No                 Off         Alive
2    77          3                    No                 (missing)   Alive
3    (missing)   6                    Yes                On          Dead
4    62          (missing)            No                 Off         Alive
5    80          4                    (missing)          (missing)   Alive
6    70          2                    No                 Off         (missing)
7    83          3                    (missing)          On          Alive

Consequences of Missing Data
- Default for software packages is to throw out observations with any missing data
  - "Complete case" or "completers" analysis
- Reduced sample size (best case)
  - Loss of power
  - Poorer estimates of parameters of interest
- No sample size (worst case)

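The best-case and worst-case outcomes above can be checked directly on the bypass example. The sketch below is only an illustration: it encodes the three example tables as Python tuples, with `None` standing in for a blank cell, and the helper name `complete_cases` is made up for this example.

```python
# Columns: age, # diseased vessels, previous surgery, pump type, mortality status.
# None marks a missing value (a blank cell on the slides).
full = [
    (65, 3, "No", "Off", "Alive"),
    (77, 3, "No", "Off", "Alive"),
    (49, 6, "Yes", "On", "Dead"),
    (62, 3, "No", "Off", "Alive"),
    (80, 4, "No", "On", "Alive"),
    (70, 2, "No", "Off", "Alive"),
    (83, 3, "No", "On", "Alive"),
]
some_missing = [
    (65, None, "No", "Off", "Alive"),
    (77, 3, "No", "Off", "Alive"),
    (None, 6, "Yes", "On", "Dead"),
    (62, 3, "No", "Off", "Alive"),
    (80, 4, None, None, "Alive"),
    (70, 2, "No", "Off", "Alive"),
    (83, 3, "No", "On", "Alive"),
]
more_missing = [
    (65, None, "No", "Off", "Alive"),
    (77, 3, "No", None, "Alive"),
    (None, 6, "Yes", "On", "Dead"),
    (62, None, "No", "Off", "Alive"),
    (80, 4, None, None, "Alive"),
    (70, 2, "No", "Off", None),
    (83, 3, None, "On", "Alive"),
]


def complete_cases(rows):
    """Listwise deletion: keep only the rows that have no missing value."""
    return [row for row in rows if all(value is not None for value in row)]


for name, rows in [("full", full), ("some missing", some_missing),
                   ("more missing", more_missing)]:
    print(f"{name}: {len(complete_cases(rows))} of {len(rows)} cases usable")
```

With only a handful of blank cells the usable sample drops from 7 to 4, and in the third table every subject has at least one blank, so a complete case analysis has nothing left to analyze.
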
More Subtle Consequence
- Bias: a systematic distortion of an estimate away from its true value
- Selection bias: bias due to systematic differences between subjects in the sample compared to the target population

Populations
[Figure: male weight and female weight panels, axis 100 to 250. Male mean = 190, female mean = 160.]

Full samples
[Figure: male weight and female weight panels, axis 100 to 250. Male mean = 188, female mean = 162.]

Missing values
[Figure: male weight and female weight panels, axis 100 to 250.]

Available samples
[Figure: male weight and female weight panels, axis 100 to 250. Male mean = 189, female mean = 152.]

Analysis Goals
- Maintain the relationships among the variables so that we may:
  - Minimize any bias
  - Maximize the utilization of available information
  - Get good estimates of uncertainty

NOT the Goals
- Try to impute values that are close to, plausible replacements for, representative of, or that might mirror the real, unknown, missing data values
- We are not here to recreate the truth

Missing Data Assumptions
- Missing Completely At Random (MCAR)
- Missing At Random (MAR)
- Not Missing At Random (NMAR) or Non-Ignorable Missingness (NIM)

MCAR
- Y is a variable with some values missing
- Assume MCAR if:
  - The probability that Y is missing is unrelated to the value of Y
  - The probability that Y is missing is unrelated to the set of other observed X variables
- $P(Y \text{ is missing} \mid X, Y) = P(Y \text{ is missing})$

MCAR Example
- In a laboratory experiment, a test tube is dropped and the cholesterol level that would have been measured from the blood sample is lost
- Probability that this data would be lost does not depend on the cholesterol level of the blood in the test tube, nor on the age, gender, race, etc. of the subject whose blood it is

MCAR Consequences
- MCAR is the strongest assumption
- In real world situations, MCAR is rare
- Difficult to convince the world of MCAR
- If MCAR, then complete case analysis is unbiased
  - Essentially analyzing a random sub-sample of the original data sample

MAR
- Y is a variable with some values missing
- Assume MAR if:
  - The probability that Y is missing is unrelated to the value of Y after controlling for other observed variables X
- $P(Y \text{ is missing} \mid X, Y) = P(Y \text{ is missing} \mid X)$

MAR Example
- In a survey, the probability of missing income depends on marital status, but within each marital status, the probability of missing income does not depend on income

MAR
[Figure: individual income panels for single and married respondents, axis 0 to 200.]

MAR Example
- One can test if missingness of income depends on marital status (chi-square test)

           Missing Income   Not Missing Income
Single     10               90
Married    50               50

- This evidence refutes MCAR, but does not prove MAR

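For readers who want to see the arithmetic behind the chi-square test on this table, here is a minimal standard-library sketch. It computes the Pearson statistic without a continuity correction (an assumption; the slide does not say which variant was used), and the function name `chi_square_2x2` is made up for this example.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Rows: single, married. Columns: missing income, not missing income.
income_table = [[10, 90], [50, 50]]
stat, p_value = chi_square_2x2(income_table)
print(f"chi-square = {stat:.2f} on 1 df, p = {p_value:.2g}")
```

The statistic is about 38.1 on 1 degree of freedom, so missingness clearly differs by marital status. The test only looks at observed variables, which is why it can refute MCAR but can never confirm MAR.
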
MAR Consequences
- MAR is a weaker assumption than MCAR
- Easier to convince the world that data is MAR
- Complete case analysis is likely to be biased if MAR
- Tractable solutions exist for analyzing data under the MAR assumption

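A small simulation can make the MAR consequences concrete. Everything numeric below is an assumption chosen for illustration (income means of 50 and 80, a common SD of 15, a 50/50 split by marital status); only the missingness rates of 10% and 50% echo the income example. The sketch compares the complete case mean with a mean that reweights the observed stratum means by marital status.

```python
import random

random.seed(2010)

# Hypothetical population values, not taken from the slides.
MEAN_INCOME = {"single": 50.0, "married": 80.0}
# MAR: missingness depends only on marital status, which is always observed.
P_MISSING = {"single": 0.10, "married": 0.50}

people = []
for _ in range(40_000):
    status = random.choice(["single", "married"])
    income = random.gauss(MEAN_INCOME[status], 15.0)
    observed = random.random() >= P_MISSING[status]
    people.append((status, income, observed))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


true_mean = mean(inc for _, inc, _ in people)
complete_case_mean = mean(inc for _, inc, obs in people if obs)

# Within each marital status the observed incomes are a random subsample,
# so weight the observed stratum means by the full-sample status shares.
adjusted_mean = 0.0
for status in MEAN_INCOME:
    share = mean(1.0 if s == status else 0.0 for s, _, _ in people)
    stratum_mean = mean(inc for s, inc, obs in people if s == status and obs)
    adjusted_mean += share * stratum_mean

print(f"true mean          {true_mean:6.1f}")
print(f"complete case mean {complete_case_mean:6.1f}")
print(f"adjusted mean      {adjusted_mean:6.1f}")
```

Because married respondents (who earn more in this made-up population) are under-represented among the complete cases, the complete case mean is pulled down, while the adjusted mean lands close to the truth. This reweighting is just the simplest tractable fix under MAR, shown here for intuition.
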
NMAR
- Y is a variable with some values missing
- Assume NMAR if:
  - The probability that Y is missing is related to the value of Y even after controlling for other observed variables X
- $P(Y \text{ is missing} \mid X, Y)$ cannot be simplified

NMAR Example
- In a study of body self image, it is found that women and men are equally likely to not self-report their weight, but it is suspected that heavier women are even more likely to not report their weight

NMAR
[Figure: male weight and female weight panels, axis 100 to 250.]

NMAR Example
- One can test if missingness of weight depends on gender (chi-square test)

          Missing Weight   Not Missing Weight
Male      9                21
Female    9                21

- This evidence fails to refute MCAR, but could still be NMAR

NMAR Consequences
- NMAR is impossible to prove (relies on unknown data values), but easy to suspect
- No good, canned solutions exist for analyzing data under NMAR
  - Open area of research
  - Some success in specific situations
  - Requires strong, situation-specific assumptions about how the data is missing

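To see why NMAR is easy to suspect but invisible in the data, the simulation below removes the same overall share of values in two ways: completely at random, and with a probability that depends on the unobserved weight itself. The population mean of 160 follows the female weight example; the SD of 25 and the missingness probabilities are assumptions made up for this sketch.

```python
import random

random.seed(23)

N = 20_000
weights = [random.gauss(160.0, 25.0) for _ in range(N)]  # SD is an assumption

# MCAR: every value has the same 35% chance of being missing.
mcar_observed = [w for w in weights if random.random() >= 0.35]


def p_missing_nmar(weight):
    """NMAR: heavier subjects are more likely to withhold their weight.

    60% missing above 160 and 10% otherwise, which is about 35% overall.
    """
    return 0.60 if weight > 160.0 else 0.10


nmar_observed = [w for w in weights if random.random() >= p_missing_nmar(w)]


def mean(values):
    return sum(values) / len(values)


full_mean = mean(weights)
mcar_mean = mean(mcar_observed)
nmar_mean = mean(nmar_observed)

print(f"full sample mean      {full_mean:6.1f}")
print(f"observed mean (MCAR)  {mcar_mean:6.1f}  ({len(mcar_observed) / N:.0%} observed)")
print(f"observed mean (NMAR)  {nmar_mean:6.1f}  ({len(nmar_observed) / N:.0%} observed)")
```

Both scenarios leave roughly the same fraction of values observed, so a table of missingness counts cannot tell them apart, yet only the NMAR mean is noticeably too low.
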
Assumptions Summary
- Most important missing data assumptions are untestable
- You will almost never have real data that is MCAR
- MAR is a common assumption to make
  - Leads to tractable analysis solutions
  - Can usually be defended to the world
  - Note: defense is logical and subject-knowledge based rather than statistical in nature

Analysis Approaches
Traditional:
- Listwise deletion (complete case analysis)
- Replacement with means
- Dummy variable adjustment
- Replacement with conditional means
- Hot Deck imputation
- Last observation carried forward (longitudinal)
Modern:
- Multiple Imputation (MI)

Listwise Deletion
- Delete any case with missing data
Strengths:
- Easy to implement (default for most software)
- Works for all types of analyses
- Unbiased if MCAR
  - Data is a simple random sample of original data
- Standard error estimates are usually conservative

Listwise Deletion
Weaknesses:
- Likely to introduce bias if MAR instead of MCAR
- Loss of power due to deleting observations
- Doesn't utilize all the information that is available

Replacement with Means
- Replace all missing values of variable X with the sample mean of X from available cases

BMI (original)   BMI (after replacement)
33               33
27               27
(missing)        31.5
38               38
(missing)        31.5
28               28
(missing)        31.5

Sample mean = 31.5

Replacement with Means
Strengths:
- Easy to implement
- Comforting use of statistics
Weaknesses:
- Inclusion of many repeated constant values at the mean guarantees a crippling bias towards a too-low estimate of variability
  - Variable is now useless for any future analysis you may have planned for it
- In general, a biased approach under MAR

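The variability problem is easy to verify on the seven-row BMI example. This sketch compares the sample standard deviation of the four observed values with that of the mean-filled column; `None` stands in for a blank cell.

```python
import statistics

bmi = [33, 27, None, 38, None, 28, None]

observed = [x for x in bmi if x is not None]
sample_mean = statistics.mean(observed)  # 31.5, as on the slide

# Replace every missing value with the observed sample mean.
imputed = [sample_mean if x is None else x for x in bmi]

sd_observed = statistics.stdev(observed)
sd_imputed = statistics.stdev(imputed)

print(f"SD of observed values:     {sd_observed:.2f}")
print(f"SD after mean replacement: {sd_imputed:.2f}")
```

The filled-in values add nothing to the sum of squared deviations but do increase the sample size, so the standard deviation falls from about 5.07 to about 3.58 even though no new information was collected.
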
Dummy Variable Adjustment
- In a regression predicting Y, suppose there are missing values of predictor X
- Create a new variable:
  - D=1 if X is missing
  - D=0 if X is present
- When X is missing, set X=c
  - c is some constant (usually the sample mean of X)
- Regress Y on both X and D

Dummy Variable Adjustment

Original data:
Serum Vitamin D   BMI
20                33
18                27
22                (missing)
17                38
23                (missing)
19                28
26                (missing)

After adjustment:
Serum Vitamin D   BMI    D
20                33     0
18                27     0
22                31.5   1
17                38     0
23                31.5   1
19                28     0
26                31.5   1

$\text{VitD} = b_0 + b_1\,\text{BMI} + b_2 D$

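As a rough sketch of the bookkeeping involved, the code below builds the indicator D and the filled-in BMI column from the example data, using the observed sample mean as the constant c. It stops at the design matrix and does not fit the regression; the variable names are made up for this example.

```python
vit_d = [20, 18, 22, 17, 23, 19, 26]
bmi = [33, 27, None, 38, None, 28, None]  # None marks a missing BMI

# c: the constant substituted for missing X (here the observed sample mean).
observed = [x for x in bmi if x is not None]
c = sum(observed) / len(observed)

# D = 1 where BMI is missing, 0 where it is present.
d = [1 if x is None else 0 for x in bmi]
bmi_filled = [c if x is None else x for x in bmi]

# Rows of the design matrix for VitD = b0 + b1 * BMI + b2 * D.
design = [(1.0, x, flag) for x, flag in zip(bmi_filled, d)]

for y, row in zip(vit_d, design):
    print(y, row)
```
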
Dummy Variable Adjustment
Strengths:
- Adjusts for using the mean as the imputation value
- May be OK for "not applicable" (skip pattern) type of missing data (Allison, 1999)
Weaknesses:
- Still biased under MAR
- Produces biased coefficient estimates (Jones, JASA, 1996)

Replacement with Conditional Means
- Replace missing values with predictions from an estimated regression equation

All cases:
Serum Vitamin D   BMI
20                33
18                27
22                (missing)
17                38
23                (missing)
19                28
26                (missing)

Complete cases:
Serum Vitamin D   BMI
20                33
18                27
17                38
19                28

$\text{BMI} = a_0 + a_1\,\text{VitD}$

Replacement with Conditional Means
- Use full dataset to estimate the regression model of interest

Serum Vitamin D   BMI
20                33
18                27
22                31.8
17                38
23                31.0
19                28
26                29.6

$\text{VitD} = b_0 + b_1\,\text{BMI}$

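The two steps (fit the imputation model on complete cases, then fill in predictions) might look like the following sketch. It fits ordinary least squares to only the four complete rows shown in the example, so its predictions will not match the imputed values on the slide, which presumably come from a model fitted to a larger sample. The helper `fit_line` is a made-up name.

```python
vit_d = [20, 18, 22, 17, 23, 19, 26]
bmi = [33, 27, None, 38, None, 28, None]  # None marks a missing BMI


def fit_line(x, y):
    """Ordinary least squares for y = a0 + a1 * x; returns (a0, a1)."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    a1 = sxy / sxx
    return y_bar - a1 * x_bar, a1


# Step 1: estimate BMI = a0 + a1 * VitD from the complete cases only.
pairs = [(v, b) for v, b in zip(vit_d, bmi) if b is not None]
a0, a1 = fit_line([v for v, _ in pairs], [b for _, b in pairs])

# Step 2: replace each missing BMI with its predicted (conditional mean) value.
bmi_imputed = [a0 + a1 * v if b is None else b for v, b in zip(vit_d, bmi)]

print(f"BMI = {a0:.1f} + ({a1:.1f}) * VitD")
print([round(b, 1) for b in bmi_imputed])
```

Every imputed point falls exactly on the fitted line, which is why the filled-in data look more strongly correlated than the real data would.
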
Sample size 100, missingness 30%
[Figure: scatterplot of serum vitamin D (15 to 35) against BMI (20 to 40).]

Sample size 100, missingness 30%
- Complete data correlation: -0.50
- Imputed data correlation: -0.62
[Figure: scatterplot of serum vitamin D (15 to 35) against BMI (20 to 40).]

Replacement with Conditional Means
Strengths:
- Better than replacement with means
- Can utilize auxiliary information from other covariates
Weaknesses:
- Ruins the relationships among the variables
- Still produces biased estimates

Conditional Means Plus Error
- Same as before except randomly wiggle the estimates away from a straight line
- How much wiggle?

Conditional Means Plus Error

All cases:
Serum Vitamin D   BMI
20                33
18                27
22                (missing)
17                38
23                (missing)
19                28
26                (missing)

Complete cases:
Serum Vitamin D   BMI
20                33
18                27
17                38
19                28

$\text{BMI} = a_0 + a_1\,\text{VitD}$

Conditional Means Plus Error
- Wiggle for each imputed BMI is chosen randomly based on the residual standard error for the BMI prediction model

Serum Vitamin D   BMI
20                33
18                27
22                28.2
17                38
23                31.5
19                28
26                33.7

$\text{VitD} = b_0 + b_1\,\text{BMI}$

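One way to read "chosen randomly based on the residual standard error" is to add a normal draw with mean zero and SD equal to the residual standard error of the BMI model. The sketch below does that on the four complete rows of the example; the normal distribution, the seed, and the helper name are assumptions, and the resulting values will differ from those shown on the slide.

```python
import math
import random

random.seed(46)

vit_d = [20, 18, 22, 17, 23, 19, 26]
bmi = [33, 27, None, 38, None, 28, None]  # None marks a missing BMI


def fit_line(x, y):
    """Ordinary least squares for y = a0 + a1 * x; returns (a0, a1)."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    a1 = sxy / sxx
    return y_bar - a1 * x_bar, a1


pairs = [(v, b) for v, b in zip(vit_d, bmi) if b is not None]
a0, a1 = fit_line([v for v, _ in pairs], [b for _, b in pairs])

# Residual standard error: sqrt(SSE / (n - 2)) for a two-parameter line.
sse = sum((b - (a0 + a1 * v)) ** 2 for v, b in pairs)
sigma = math.sqrt(sse / (len(pairs) - 2))

# Conditional mean plus a random "wiggle" drawn from Normal(0, sigma).
bmi_imputed = [
    a0 + a1 * v + random.gauss(0.0, sigma) if b is None else b
    for v, b in zip(vit_d, bmi)
]

print(f"residual standard error = {sigma:.2f}")
print([round(b, 1) for b in bmi_imputed])
```
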
Sample size 100, missingness 30%
[Figure: scatterplot of serum vitamin D (15 to 35) against BMI (20 to 40).]

Sample size 100, missingness 30%
- Complete data correlation: -0.50
- Imputed data correlation: -0.54
[Figure: scatterplot of serum vitamin D (15 to 35) against BMI (20 to 40).]

Conditional Means Plus Error
Strengths:
- Better than conditional means
- An attempt is made to adjust the variability upwards
Weaknesses:
- The attempt is insufficient
- Still produces biased estimates
- Method is inefficient because of introduced variability (i.e., the random "wiggles")

Multiple Imputation
- Do single imputation (previous example) several times and combine the results
- Combining several results increases efficiency
- The size of the wiggle needs to be purposely inflated
- There are many flavors of MI where the details differ (areas of open research)

Imputation 1

Serum Vitamin D   BMI
20                33
18                27
22                33.0
17                38
23                30.7
19                28
26                37.5

Imputed data correlation: -0.47
[Figure: scatterplot of serum vitamin D (15 to 35) against BMI (20 to 40).]

Imputation 2

Serum Vitamin D   BMI
20                33
18                27
22                29.9
17                38
23                31.8
19                28
26                31.5

Imputed data correlation: -0.52
[Figure: scatterplot of serum vitamin D (15 to 35) against BMI (20 to 40).]

Imputation 3

Serum Vitamin D   BMI
20                33
18                27
22                29.9
17                38
23                31.2
19                28
26                27.9

Imputed data correlation: -0.55
[Figure: scatterplot of serum vitamin D (15 to 35) against BMI (20 to 40).]

Imputation 4

Serum Vitamin D   BMI
20                33
18                27
22                29.1
17                38
23                30.9
19                28
26                32.9

Imputed data correlation: -0.37
[Figure: scatterplot of serum vitamin D (15 to 35) against BMI (20 to 40).]

Combine Results

Correlation 1   -0.47
Correlation 2   -0.52
Correlation 3   -0.55
Correlation 4   -0.37

Ave correlation -0.48
[Figure: scatterplot of serum vitamin D (15 to 35) against BMI (20 to 40).]

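The combining step for the point estimate is a plain average, sketched below with the four correlations from the imputations. The slides do not give a formula for the combined uncertainty; the function `rubin_total_variance` shows the usual choice (Rubin's rules), which is an addition here and needs within-imputation variances that the slides do not report.

```python
import statistics

# Correlations from the four imputed datasets.
correlations = [-0.47, -0.52, -0.55, -0.37]
m = len(correlations)

# Pooled point estimate: the average across imputations.
pooled = statistics.mean(correlations)

# Between-imputation variance: how much the estimate moves across imputations.
between_var = statistics.variance(correlations)


def rubin_total_variance(within_var, between_var, m):
    """Rubin's rules: total variance = W + (1 + 1/m) * B.

    within_var is the average of the per-imputation squared standard errors.
    """
    return within_var + (1 + 1 / m) * between_var


print(f"pooled correlation = {pooled:.2f}")
print(f"between-imputation variance = {between_var:.6f}")
```

The pooled value rounds to -0.48, matching the average on the slide. The between-imputation term is how the analysis "adjusts for the fact that we have made up data": disagreement among imputations widens the final standard error.
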
Multiple Imputation
Strengths:
- Unbiased for MAR
- Available as a canned procedure
Weaknesses:
- Specialized software
- Complicated

Example: Compare Methods
- Simulate the truth: after bypass surgery, mortality depends on:
  - Age
  - Number of diseased vessels
  - Previous surgery
  - Pump type *** of primary interest ***
- Force missing values (MAR)
- Compare analysis methods

Example: Compare Methods
Table 1: Summary Statistics

Variable                % missing data   Mean ± SD or %
Mortality               0.0%             10.7%
Age                     37.8%            69.8 ± 6.7
# of Diseased Vessels   22.9%            3.4 ± 1.5
Previous Surgery        22.8%            67.6%
On-pump                 8.4%             68.9%

Example: Compare Methods
Table 2: Results

Method                           Odds Ratio of On-Pump   Relative Difference
Full dataset                     1.85                    --
Complete Case Analysis           2.27                    22.5%
Replace with Means               2.56                    38.0%
Dummy Variable Adjustment        2.27                    22.4%
Replace with Conditional Means   2.64                    42.3%
Multiple Imputation              1.78                    -4.2%

Remaining Issues with MI
- Assumptions: multivariate normality
  - Harmless assumption for variables with no missing data
  - Robust method, works well even if assumption is violated
- Software
  - SAS PROC MI and MIANALYZE
  - Stata
  - R (MICE or RMS packages)

Remaining Issues
Consult an expert for more about:
- How much missingness can MI handle?
- Should we use the response (dependent variable) for multiple imputation?
- Should we impute the response itself?
- What about dichotomous, nominal, ordinal variables?
- How to impute when the model includes interactions and other non-linearities?
- What to do with non-ignorable missing?

Conclusions
- You will encounter missing data in your research
- Inappropriate methods will make a bad situation worse
- Good methods will maximize the information you can get from your data
- Your data is not MCAR
- Traditional methods are insufficient for MAR
- Multiple imputation has optimal properties for MAR (unbiased and efficient)

Conclusions
- The goal is not to recreate the truth
- The goal is to maintain relationships and
  - Minimize bias
  - Maximize utilization of information
  - Get good estimates of uncertainty
- "You statisticians are making up data!"
  - Yes, and we are adjusting for the fact that we have made up data.