Selected Topics in Biostatistics Seminar Series. Missing Data. Sponsored by: Center For Clinical Investigation and Cleveland CTSC


Selected Topics in Biostatistics Seminar Series
Missing Data
Sponsored by: Center For Clinical Investigation and Cleveland CTSC
Brian Schmotzer, MS
Biostatistician, CCI Statistical Sciences Core
brian.schmotzer@case.edu
June 23, 2010

Outline
- Missing data
  - What is it, what does it look like?
  - How did we get in this mess?
  - What are the consequences?
- Goals for analyzing data in the presence of missingness
- Missing data assumptions
  - What types of missing data are there?
- Traditional approaches
  - What have people typically done in the past?
  - What are the consequences of these approaches?
- Newer approach
  - What is the state of the art now?
  - How is it better than traditional approaches?

Missing Data Warnings
- Missing data is the single most pervasive analytical problem in research studies
- Most medical research papers do not refer to an adequate analysis approach for dealing with missing data
  - Authors unaware/untrained?
  - Journals and/or reviewers not savvy?

What is Missing Data?
- Any value for any variable that you do not have
- Can arise due to:
  - Subject lost to follow-up
  - Missed/skipped visits
  - Instrument errors or failures
  - Misplaced data extraction sheets
  - "We just didn't collect that value"
  - etc.

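Example: coronary artery bypass grafting

| ID | Age | # Diseased Vessels | Previous Surgery | Pump Type | Mortality Status |
|---|---|---|---|---|---|
| 1 | 65 | 3 | No | Off | Alive |
| 2 | 77 | 3 | No | Off | Alive |
| 3 | 49 | 6 | Yes | On | Dead |
| 4 | 62 | 3 | No | Off | Alive |
| 5 | 80 | 4 | No | On | Alive |
| 6 | 70 | 2 | No | Off | Alive |
| 7 | 83 | 3 | No | On | Alive |

Example: Some Missing Data

| ID | Age | # Diseased Vessels | Previous Surgery | Pump Type | Mortality Status |
|---|---|---|---|---|---|
| 1 | 65 |  | No | Off | Alive |
| 2 | 77 | 3 | No | Off | Alive |
| 3 |  | 6 | Yes | On | Dead |
| 4 | 62 | 3 | No | Off | Alive |
| 5 | 80 | 4 |  |  | Alive |
| 6 | 70 | 2 | No | Off | Alive |
| 7 | 83 | 3 | No | On | Alive |

Example: More Missing Data

| ID | Age | # Diseased Vessels | Previous Surgery | Pump Type | Mortality Status |
|---|---|---|---|---|---|
| 1 | 65 |  | No | Off | Alive |
| 2 | 77 | 3 | No |  | Alive |
| 3 |  | 6 | Yes | On | Dead |
| 4 | 62 |  | No | Off | Alive |
| 5 | 80 | 4 |  |  | Alive |
| 6 | 70 | 2 | No | Off |  |
| 7 | 83 | 3 |  | On | Alive |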

Consequences of Missing Data
- Default for software packages is to throw out observations with any missing data
  - "Complete case" or "completers" analysis
- Reduced sample size (best case)
  - Loss of power
  - Poorer estimates of parameters of interest
- No sample size (worst case)
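The best case and worst case can be seen on the two bypass example tables. The sketch below applies listwise deletion in plain Python; the positions of the blank cells are an assumption, inferred by comparing each table with the complete one, and `complete_cases` is a hypothetical helper name.

```python
# Rows from the two "missing data" example tables; None marks a missing value.
# Columns: id, age, diseased vessels, previous surgery, pump type, mortality.
# The positions of the None entries are inferred from the complete table.
some_missing = [
    (1, 65, None, "No", "Off", "Alive"),
    (2, 77, 3, "No", "Off", "Alive"),
    (3, None, 6, "Yes", "On", "Dead"),
    (4, 62, 3, "No", "Off", "Alive"),
    (5, 80, 4, None, None, "Alive"),
    (6, 70, 2, "No", "Off", "Alive"),
    (7, 83, 3, "No", "On", "Alive"),
]
more_missing = [
    (1, 65, None, "No", "Off", "Alive"),
    (2, 77, 3, "No", None, "Alive"),
    (3, None, 6, "Yes", "On", "Dead"),
    (4, 62, None, "No", "Off", "Alive"),
    (5, 80, 4, None, None, "Alive"),
    (6, 70, 2, "No", "Off", None),
    (7, 83, 3, None, "On", "Alive"),
]


def complete_cases(rows):
    """Listwise deletion: keep only rows with no missing value."""
    return [row for row in rows if all(value is not None for value in row)]


kept_some = complete_cases(some_missing)
kept_more = complete_cases(more_missing)
print(len(kept_some), "of", len(some_missing), "rows kept")
print(len(kept_more), "of", len(more_missing), "rows kept")
```

With these inputs the first table keeps four of seven subjects, and the second keeps none, even though most individual cells are still observed.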

More Subtle Consequence
- Bias: a systematic distortion of an estimate away from its true value
- Selection bias: bias due to systematic differences between subjects in the sample compared to the target population

Populations
[Figure: histograms of male weight (mean = 190) and female weight (mean = 160); horizontal axes run from 100 to 250]

Full samples
[Figure: histograms of male weight (mean = 188) and female weight (mean = 162); horizontal axes run from 100 to 250]

Missing values
[Figure: histograms of male weight and female weight; horizontal axes run from 100 to 250]

Available samples
[Figure: histograms of male weight (mean = 189) and female weight (mean = 152); horizontal axes run from 100 to 250]

Analysis Goals
- Maintain the relationships among the variables so that we may:
  - Minimize any bias
  - Maximize the utilization of available information
  - Get good estimates of uncertainty

NOT the Goals
- Try to impute values that are close to / plausible replacements for / representative of / that might mirror the real, unknown, missing data values
- We are not here to recreate the truth

Missing Data Assumptions
- Missing Completely At Random (MCAR)
- Missing At Random (MAR)
- Not Missing At Random (NMAR) or Non-Ignorable Missingness (NIM)

MCAR
- Y is a variable with some values missing
- Assume MCAR if:
  - The probability that Y is missing is unrelated to the value of Y
  - The probability that Y is missing is unrelated to the set of other observed X variables
- P(Y is missing | X, Y) = P(Y is missing)

MCAR Example
- In a laboratory experiment, a test tube is dropped and the cholesterol level that would have been measured from the blood sample is lost
- Probability that this data would be lost does not depend on the cholesterol level of the blood in the test tube, nor on the age, gender, race, etc. of the subject whose blood it is

MCAR Consequences
- MCAR is the strongest assumption
- In real world situations, MCAR is rare
  - Difficult to convince the world of MCAR
- If MCAR, then complete case analysis is unbiased
  - Essentially analyzing a random sub-sample of the original data sample

MAR
- Y is a variable with some values missing
- Assume MAR if:
  - The probability that Y is missing is unrelated to the value of Y after controlling for other observed variables X
- P(Y is missing | X, Y) = P(Y is missing | X)

MAR Example
- In a survey, the probability of missing income depends on marital status, but within each marital status, the probability of missing income does not depend on income

MAR
[Figure: histograms of individual income for single respondents and for married respondents; horizontal axes run from 0 to 200]

MAR Example
- One can test if missingness of income depends on marital status (chi-square test)

|  | Missing Income | Not Missing Income |
|---|---|---|
| Single | 10 | 90 |
| Married | 50 | 50 |

- This evidence refutes MCAR, but does not prove MAR
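As a minimal sketch of the chi-square test mentioned on this slide, the code below computes the Pearson statistic for the 2x2 income table by hand. It assumes no continuity correction, and `chi_square_2x2` is a hypothetical helper name, not something from the talk.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic (no continuity correction) and p-value."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Rows: single, married. Columns: missing income, not missing income.
income_table = [[10, 90], [50, 50]]
stat, p_value = chi_square_2x2(income_table)
print(round(stat, 2), p_value)
```

A very small p-value says only that missingness is associated with marital status. It cannot say whether missingness also depends on the unseen incomes, which is why the test can refute MCAR but not establish MAR.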

MAR Consequences
- MAR is a weaker assumption than MCAR
  - Easier to convince the world that data is MAR
- Complete case analysis is likely to be biased if MAR
- Tractable solutions exist for analyzing data under the MAR assumption
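A small simulation can make the complete case bias under MAR concrete. This is only a sketch: the 10% and 50% missingness rates come from the marital status table, while the income distributions (means 50 and 80, standard deviation 15), the sample size, and the seed are made-up assumptions.

```python
import random
import statistics

random.seed(0)

people = []
for _ in range(20000):
    married = random.random() < 0.5
    # Hypothetical income model: married respondents earn more on average.
    income = random.gauss(80 if married else 50, 15)
    # MAR: the chance of a missing income depends only on marital status.
    missing = random.random() < (0.5 if married else 0.1)
    people.append((married, income, missing))


def mean_income(rows):
    return statistics.fmean([income for _, income, _ in rows])


observed = [p for p in people if not p[2]]
full_mean = mean_income(people)
cc_mean = mean_income(observed)  # complete case estimate of the overall mean

# Within each marital status the observed incomes are a random subsample.
single_full = mean_income([p for p in people if not p[0]])
single_cc = mean_income([p for p in observed if not p[0]])
married_full = mean_income([p for p in people if p[0]])
married_cc = mean_income([p for p in observed if p[0]])

# Marital status is fully observed, so reweighting the group means by the
# group shares recovers the overall mean.
share_married = sum(1 for p in people if p[0]) / len(people)
adjusted_mean = (1 - share_married) * single_cc + share_married * married_cc

print(round(full_mean, 1), round(cc_mean, 1), round(adjusted_mean, 1))
```

The complete case mean is pulled toward the single respondents because they are over-represented among the observed incomes, while the estimate that conditions on marital status lands close to the full-data mean.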

NMAR
- Y is a variable with some values missing
- Assume NMAR if:
  - The probability that Y is missing is related to the value of Y even after controlling for other observed variables X
- P(Y is missing | X, Y) cannot be simplified

NMAR Example
- In a study of body self image, it is found that women and men are equally likely to not self-report their weight, but it is suspected that heavier women are even more likely to not report their weight

NMAR
[Figure: histograms of male weight and female weight; horizontal axes run from 100 to 250]

NMAR Example
- One can test if missingness of weight depends on gender (chi-square test)

|  | Missing Weight | Not Missing Weight |
|---|---|---|
| Male | 9 | 21 |
| Female | 9 | 21 |

- This evidence fails to refute MCAR, but could still be NMAR
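The weight example can be mimicked with a short simulation. This is a sketch under made-up assumptions: the population means (190 and 160) are taken from the earlier weight slides, but the standard deviation of 25, the 30% overall missingness, the rule that women above 160 are missing half the time, and the seed are all hypothetical.

```python
import random
import statistics

random.seed(1)


def simulate(n, mean_weight, missing_prob):
    """Return (all weights, reported weights).

    missing_prob maps a weight to the probability that it goes unreported.
    """
    weights = [random.gauss(mean_weight, 25) for _ in range(n)]
    reported = [w for w in weights if random.random() >= missing_prob(w)]
    return weights, reported


# Men: 30% missing regardless of weight.
men_all, men_reported = simulate(10000, 190, lambda w: 0.3)
# Women: also about 30% missing overall, but heavier women report less often.
women_all, women_reported = simulate(
    10000, 160, lambda w: 0.5 if w > 160 else 0.1
)

men_missing_rate = 1 - len(men_reported) / len(men_all)
women_missing_rate = 1 - len(women_reported) / len(women_all)

men_bias = statistics.fmean(men_reported) - statistics.fmean(men_all)
women_bias = statistics.fmean(women_reported) - statistics.fmean(women_all)

print(round(men_missing_rate, 2), round(women_missing_rate, 2))
print(round(men_bias, 1), round(women_bias, 1))
```

The two missingness rates are nearly identical, so a gender-by-missingness table looks unremarkable, yet the mean of the reported female weights sits clearly below the true mean. Nothing in the observed data reveals this.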

NMAR Consequences
- NMAR is impossible to prove (relies on unknown data values), but easy to suspect
- No good, canned solutions exist for analyzing data under NMAR
  - Open area of research
  - Some success in specific situations
  - Requires strong, situation-specific assumptions about how the data is missing

Assumptions Summary
- Most important missing data assumptions are untestable
- You will almost never have real data that is MCAR
- MAR is a common assumption to make
  - Leads to tractable analysis solutions
  - Can usually be defended to the world
  - Note: defense is logical and subject-knowledge based rather than statistical in nature

Analysis Approaches
Traditional:
- Listwise deletion (complete case analysis)
- Replacement with means
- Dummy variable adjustment
- Replacement with conditional means
- Hot Deck imputation
- Last observation carried forward (longitudinal)
Modern:
- Multiple Imputation (MI)

Listwise Deletion
- Delete any case with missing data
Strengths:
- Easy to implement (default for most software)
- Works for all types of analyses
- Unbiased if MCAR
  - Data is a simple random sample of original data
- Standard error estimates are usually conservative

Listwise Deletion
Weaknesses:
- Likely to introduce bias if MAR instead of MCAR
- Loss of power due to deleting observations
- Doesn't utilize all the information that is available

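Replacement with Means
- Replace all missing values of variable X with the sample mean of X from available cases

| BMI (original) | BMI (after replacement) |
|---|---|
| 33 | 33 |
| 27 | 27 |
|  | 31.5 |
| 38 | 38 |
|  | 31.5 |
| 28 | 28 |
|  | 31.5 |

Sample mean of available cases: 31.5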

Replacement with Means
Strengths:
- Easy to implement
- Comforting use of statistics
Weaknesses:
- Inclusion of many repeated constant values at the mean guarantees a crippling bias towards a too low estimate of variability
  - Variable is now useless for any future analysis you may have planned for it
- In general, a biased approach under MAR
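The loss of variability can be checked directly on the seven-row BMI example. This is a minimal sketch using only the values shown on the slide; the variable names are illustrative.

```python
import statistics

# BMI column from the slide; None marks a missing value.
bmi = [33, 27, None, 38, None, 28, None]

observed = [x for x in bmi if x is not None]
sample_mean = statistics.mean(observed)  # mean of the available cases
imputed = [sample_mean if x is None else x for x in bmi]

sd_observed = statistics.stdev(observed)
sd_imputed = statistics.stdev(imputed)

print(sample_mean)
print(round(sd_observed, 2), round(sd_imputed, 2))
```

The imputed values add nothing to the sum of squared deviations but do increase the sample size, so the standard deviation can only shrink.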

Dummy Variable Adjustment
- In a regression predicting Y, suppose there are missing values of predictor X
- Create a new variable:
  - D=1 if X is missing
  - D=0 if X is present
- When X is missing, set X=c
  - c is some constant (usually the sample mean of X)
- Regress Y on both X and D

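Dummy Variable Adjustment

Original data:

| Serum Vitamin D | BMI |
|---|---|
| 20 | 33 |
| 18 | 27 |
| 22 |  |
| 17 | 38 |
| 23 |  |
| 19 | 28 |
| 26 |  |

With the dummy variable:

| Serum Vitamin D | BMI | D |
|---|---|---|
| 20 | 33 | 0 |
| 18 | 27 | 0 |
| 22 | 31.5 | 1 |
| 17 | 38 | 0 |
| 23 | 31.5 | 1 |
| 19 | 28 | 0 |
| 26 | 31.5 | 1 |

VitD = b_0 + b_1 BMI + b_2 D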

Dummy Variable Adjustment
Strengths:
- Adjusts for using the mean as the imputation value
- May be OK for "not applicable" (skip pattern) type of missing data (Allison, 1999)
Weaknesses:
- Still biased under MAR
- Produces biased coefficient estimates (Jones, JASA, 1996)
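For readers who want to see the mechanics, the construction of D and the filled-in predictor might look like the sketch below. It only builds the design rows from the slide's seven observations; the regression itself is left to whatever routine is at hand, and the variable names are illustrative.

```python
import statistics

vit_d = [20, 18, 22, 17, 23, 19, 26]       # outcome Y (Serum Vitamin D)
bmi = [33, 27, None, 38, None, 28, None]   # predictor X with missing values

# c: the constant used to fill in X, here the sample mean of observed BMI.
c = statistics.mean([x for x in bmi if x is not None])

d = [1 if x is None else 0 for x in bmi]           # D = 1 if X is missing
bmi_filled = [c if x is None else x for x in bmi]  # X = c when missing

# Rows (Y, X, D) ready to be passed to a regression of Y on X and D.
design_rows = list(zip(vit_d, bmi_filled, d))
for row in design_rows:
    print(row)
```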

Replacement with Conditional Means
- Replace missing values with predictions from an estimated regression equation

Original data:

| Serum Vitamin D | BMI |
|---|---|
| 20 | 33 |
| 18 | 27 |
| 22 |  |
| 17 | 38 |
| 23 |  |
| 19 | 28 |
| 26 |  |

Complete cases:

| Serum Vitamin D | BMI |
|---|---|
| 20 | 33 |
| 18 | 27 |
| 17 | 38 |
| 19 | 28 |

BMI = a_0 + a_1 VitD

Replacement with Conditional Means
- Use full dataset to estimate the regression model of interest

| Serum Vitamin D | BMI |
|---|---|
| 20 | 33 |
| 18 | 27 |
| 22 | 31.8 |
| 17 | 38 |
| 23 | 31.0 |
| 19 | 28 |
| 26 | 29.6 |

VitD = b_0 + b_1 BMI

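Sample size 100, missingness 30%
[Figure: scatter plot of Serum Vitamin D (vertical axis, 15 to 35) against BMI (horizontal axis, 20 to 40)]

Sample size 100, missingness 30%
- Complete data correlation: -0.50
- Imputed data correlation: -0.62
[Figure: scatter plot of Serum Vitamin D (vertical axis, 15 to 35) against BMI (horizontal axis, 20 to 40)]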

Replacement with Conditional Means
Strengths:
- Better than replacement with means
- Can utilize auxiliary information from other covariates
Weaknesses:
- Ruins the relationships among the variables
- Still produces biased estimates
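To see how conditional mean imputation distorts relationships, the sketch below fits BMI = a_0 + a_1 VitD on the four complete rows of the seven-row example and fills in the predictions. The fitted numbers will not match the imputed values printed on the slide, which cannot be reproduced from the seven rows shown; `fit_line` and `correlation` are hypothetical helper names.

```python
import math

vit_d = [20, 18, 22, 17, 23, 19, 26]
bmi = [33, 27, None, 38, None, 28, None]


def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a0 + a1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    return mean_y - a1 * mean_x, a1


def correlation(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Fit the imputation model on the complete cases only.
complete_x = [x for x, y in zip(vit_d, bmi) if y is not None]
complete_y = [y for y in bmi if y is not None]
a0, a1 = fit_line(complete_x, complete_y)

# Replace each missing BMI with its predicted (conditional mean) value.
bmi_imputed = [a0 + a1 * x if y is None else y for x, y in zip(vit_d, bmi)]

r_complete = correlation(complete_x, complete_y)
r_imputed = correlation(vit_d, bmi_imputed)
print(round(a0, 1), round(a1, 1))
print(round(r_complete, 2), round(r_imputed, 2))
```

Every imputed point lies exactly on the fitted line, so the correlation in the filled-in data is stronger than in the complete cases. This is the same direction of distortion as the -0.50 versus -0.62 comparison on the sample of 100.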

Conditional Means Plus Error
- Same as before except randomly wiggle the estimates away from a straight line
- How much wiggle?

Conditional Means Plus Error

Original data:

| Serum Vitamin D | BMI |
|---|---|
| 20 | 33 |
| 18 | 27 |
| 22 |  |
| 17 | 38 |
| 23 |  |
| 19 | 28 |
| 26 |  |

Complete cases:

| Serum Vitamin D | BMI |
|---|---|
| 20 | 33 |
| 18 | 27 |
| 17 | 38 |
| 19 | 28 |

BMI = a_0 + a_1 VitD

Conditional Means Plus Error
- Wiggle for each imputed BMI is chosen randomly based on the residual standard error for the BMI prediction model

| Serum Vitamin D | BMI |
|---|---|
| 20 | 33 |
| 18 | 27 |
| 22 | 28.2 |
| 17 | 38 |
| 23 | 31.5 |
| 19 | 28 |
| 26 | 33.7 |

VitD = b_0 + b_1 BMI

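Sample size 100, missingness 30%
[Figure: scatter plot of Serum Vitamin D (vertical axis, 15 to 35) against BMI (horizontal axis, 20 to 40)]

Sample size 100, missingness 30%
- Complete data correlation: -0.50
- Imputed data correlation: -0.54
[Figure: scatter plot of Serum Vitamin D (vertical axis, 15 to 35) against BMI (horizontal axis, 20 to 40)]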

Conditional Means Plus Error
Strengths:
- Better than conditional means
- An attempt is made to adjust the variability upwards
Weaknesses:
- The attempt is insufficient
- Still produces biased estimates
- Method is inefficient because of introduced variability (i.e., the random "wiggles")
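One way to implement the "wiggle" is to add a normal draw, scaled by the residual standard error of the imputation model, to each prediction. The sketch below does this on the seven-row example; the normal error distribution, the seed, and the helper name `fit_line` are assumptions, and the random draws will not reproduce the values on the slide.

```python
import math
import random

random.seed(2010)

vit_d = [20, 18, 22, 17, 23, 19, 26]
bmi = [33, 27, None, 38, None, 28, None]


def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a0 + a1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    return mean_y - a1 * mean_x, a1


complete_x = [x for x, y in zip(vit_d, bmi) if y is not None]
complete_y = [y for y in bmi if y is not None]
a0, a1 = fit_line(complete_x, complete_y)

# Residual standard error of the BMI prediction model (n - 2 degrees of freedom).
sse = sum((y - (a0 + a1 * x)) ** 2 for x, y in zip(complete_x, complete_y))
rse = math.sqrt(sse / (len(complete_x) - 2))

# Conditional mean plus a random error with standard deviation rse.
bmi_imputed = [
    a0 + a1 * x + random.gauss(0, rse) if y is None else y
    for x, y in zip(vit_d, bmi)
]
print(round(rse, 2))
print([round(v, 1) for v in bmi_imputed])
```

Observed BMI values are left untouched; only the three missing entries receive a prediction plus noise, so they no longer sit exactly on the fitted line.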

Multiple Imputation
- Do single imputation (previous example) several times and combine the results
  - Combining several results increases efficiency
  - The size of the wiggle needs to be purposely inflated
- There are many flavors of MI where the details differ (areas of open research)

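Imputation 1

| Serum Vitamin D | BMI |
|---|---|
| 20 | 33 |
| 18 | 27 |
| 22 | 33.0 |
| 17 | 38 |
| 23 | 30.7 |
| 19 | 28 |
| 26 | 37.5 |

Imputed data correlation: -0.47
[Figure: scatter plot of Serum Vitamin D (vertical axis, 15 to 35) against BMI (horizontal axis, 20 to 40)]

Imputation 2

| Serum Vitamin D | BMI |
|---|---|
| 20 | 33 |
| 18 | 27 |
| 22 | 29.9 |
| 17 | 38 |
| 23 | 31.8 |
| 19 | 28 |
| 26 | 31.5 |

Imputed data correlation: -0.52
[Figure: scatter plot of Serum Vitamin D (vertical axis, 15 to 35) against BMI (horizontal axis, 20 to 40)]

Imputation 3

| Serum Vitamin D | BMI |
|---|---|
| 20 | 33 |
| 18 | 27 |
| 22 | 29.9 |
| 17 | 38 |
| 23 | 31.2 |
| 19 | 28 |
| 26 | 27.9 |

Imputed data correlation: -0.55
[Figure: scatter plot of Serum Vitamin D (vertical axis, 15 to 35) against BMI (horizontal axis, 20 to 40)]

Imputation 4

| Serum Vitamin D | BMI |
|---|---|
| 20 | 33 |
| 18 | 27 |
| 22 | 29.1 |
| 17 | 38 |
| 23 | 30.9 |
| 19 | 28 |
| 26 | 32.9 |

Imputed data correlation: -0.37
[Figure: scatter plot of Serum Vitamin D (vertical axis, 15 to 35) against BMI (horizontal axis, 20 to 40)]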

Combine Results
- Correlation 1: -0.47
- Correlation 2: -0.52
- Correlation 3: -0.55
- Correlation 4: -0.37
- Average correlation: -0.48
[Figure: scatter plot of Serum Vitamin D (vertical axis, 15 to 35) against BMI (horizontal axis, 20 to 40)]

Multiple Imputation
Strengths:
- Unbiased for MAR
- Available as a canned procedure
Weaknesses:
- Specialized software
- Complicated
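The combining step is usually done with Rubin's rules, which the slides do not spell out: the pooled estimate is the average across imputations, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The sketch below applies this to the four correlations from the imputation slides. The within-imputation variances are made-up placeholders, since the slides do not report them, and `pool` is a hypothetical helper name.

```python
import statistics


def pool(estimates, within_variances):
    """Pool m imputation results with Rubin's rules.

    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    pooled_estimate = statistics.fmean(estimates)
    within = statistics.fmean(within_variances)   # average sampling variance
    between = statistics.variance(estimates)      # spread across imputations
    total = within + (1 + 1 / m) * between
    return pooled_estimate, total


correlations = [-0.47, -0.52, -0.55, -0.37]  # from the four imputations
hypothetical_variances = [0.006] * 4         # placeholders, not from the talk

pooled, total_variance = pool(correlations, hypothetical_variances)
print(round(pooled, 2), round(total_variance, 4))
```

The between-imputation term is what "adjusts for the fact that we have made up data": the more the imputations disagree, the larger the reported uncertainty. In practice correlations are often pooled on a transformed scale; the plain average here follows the slide.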

Example: Compare Methods
- Simulate the truth: after bypass surgery, mortality depends on:
  - Age
  - Number of diseased vessels
  - Previous surgery
  - Pump type *** of primary interest ***
- Force missing values (MAR)
- Compare analysis methods

Example: Compare Methods

Table 1: Summary Statistics

| Variable | % missing data | Mean ± SD or % |
|---|---|---|
| Mortality | 0.0% | 10.7% |
| Age | 37.8% | 69.8 ± 6.7 |
| # of Diseased Vessels | 22.9% | 3.4 ± 1.5 |
| Previous Surgery | 22.8% | 67.6% |
| On-pump | 8.4% | 68.9% |

Example: Compare Methods

Table 2: Results

| Method | Odds Ratio of On-Pump | Relative Difference |
|---|---|---|
| Full dataset | 1.85 | -- |
| Complete Case Analysis | 2.27 | 22.5% |
| Replace with Means | 2.56 | 38.0% |
| Dummy Variable Adjustment | 2.27 | 22.4% |
| Replace with Conditional Means | 2.64 | 42.3% |
| Multiple Imputation | 1.78 | -4.2% |
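The relative difference column appears to be each method's odds ratio compared with the full-dataset odds ratio. That definition is an assumption, not stated on the slide. The sketch below recomputes it from the rounded odds ratios in Table 2, so the results agree with the reported percentages only up to rounding.

```python
full_dataset_or = 1.85

# Odds ratios of on-pump from Table 2.
method_or = {
    "Complete Case Analysis": 2.27,
    "Replace with Means": 2.56,
    "Dummy Variable Adjustment": 2.27,
    "Replace with Conditional Means": 2.64,
    "Multiple Imputation": 1.78,
}

# Assumed definition: percent difference relative to the full-dataset estimate.
relative_difference = {
    name: 100 * (value - full_dataset_or) / full_dataset_or
    for name, value in method_or.items()
}

for name, pct in relative_difference.items():
    print(f"{name}: {pct:+.1f}%")
```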

Remaining Issues with MI
- Assumptions: multivariate normality
  - Harmless assumption for variables with no missing data
  - Robust method, works well even if assumption is violated
- Software
  - SAS PROC MI and MIANALYZE
  - Stata
  - R (MICE or RMS packages)

Remaining Issues
Consult an expert for more about:
- How much missingness can MI handle?
- Should we use the response (dependent variable) for multiple imputation?
- Should we impute the response itself?
- What about dichotomous, nominal, ordinal variables?
- How to impute when the model includes interactions and other non-linearities?
- What to do with non-ignorable missing?

Conclusions
- You will encounter missing data in your research
- Inappropriate methods will make a bad situation worse
- Good methods will maximize the information you can get from your data
- Your data is not MCAR
- Traditional methods are insufficient for MAR
- Multiple imputation has optimal properties for MAR (unbiased and efficient)

Conclusions
- The goal is not to recreate the truth
- The goal is to maintain relationships and
  - Minimize bias
  - Maximize utilization of information
  - Get good estimates of uncertainty
- "You statisticians are making up data!"
  - "Yes, and we are adjusting for the fact that we have made up data."