Bayesian approaches to handling missing data: Practical Exercises


1 Practical A

Thanks to James Carpenter and Jonathan Bartlett, who developed the exercise on which this practical is based (funded by ESRC).

In this practical you will explore the three types of missingness: MCAR, MAR and MNAR. It would be helpful to have a calculator handy for answering some of the questions.

Suppose that you wish to investigate the cognitive function of a group of 100 males aged 90. You devise a way of classifying their cognitive function into good, moderate or poor based on the scores attained in some standard test of cognitive function (Cog90). Further, you have available a classification of the same individuals into high and low cognitive function for a different test undertaken when they were aged 85 (Cog85). No other data are available to you. Table 1 provides the data that you would observe if all the individuals take the test.

1. Use the data in Table 1 to answer the following questions, which are about marginal and conditional distributions.
(a) What is the overall mean cognitive function score at age 90 (Cog90)?
(b) The marginal distribution of Cog85 is 72% in category high and 28% in category low. What is the marginal distribution of Cog90 (i.e. the percentage in each category of Cog90)?
(c) What is the conditional distribution of Cog90 given high Cog85? What is the conditional distribution of Cog90 given low Cog85? Are they different? Does this make sense?

Now suppose that not all the individuals take the test. Tables 2-4 provide three different realisations (A, B and C) of the data that you actually observe. Each realisation corresponds to a different type of missing data mechanism.

2. What does it mean to say that data are
(a) Missing Completely At Random (MCAR)?
(b) Missing At Random (MAR)?
(c) Missing Not At Random (MNAR)?

3. For each of the three observed data tables (A, B, C):
(a) Estimate the mean cognitive function score at age 90 from the observed data. Is this an unbiased estimate?
(b) Is the marginal estimate of the distribution of cognitive function at age 90, calculated from the observed data, unbiased?
(c) Is the conditional estimate of the cognitive function score at age 90 given high cognitive function at age 85, calculated from the observed data, unbiased?
(d) Is the conditional estimate of the cognitive function score at age 85 given poor cognitive function at age 90, calculated from the observed data, unbiased?
(e) Which type of missing data mechanism do you think produced the missing values (MCAR, MAR or MNAR)? Hint: using the table of true values, calculate the proportion of missing data in each cell (i.e. the probability that an observation in that cell is missing).

4. For each of the three observed data tables, decide whether you can obtain an unbiased estimate of the overall mean cognitive function score at age 90 from the observed data. If so, write down the formula. Hint: this may be a weighted average of the scores in different cells.
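The weighted average mentioned in the hint can be written out explicitly. The block below is a sketch of one such estimator, assuming that missingness depends only on Cog85, so that the observed Cog90 proportions within each Cog85 row remain unbiased. The notation is introduced here and is not part of the original handout: $s_k$ is whatever numeric score is attached to Cog90 category $k$, $n^{\text{obs}}_{jk}$ is the observed count in row $j$ and column $k$, and $N_j$ is the full number of individuals with Cog85 level $j$ (72 and 28, known because Cog85 is available for everyone).

```latex
\hat{\mu}_{\text{Cog90}}
  = \sum_{j \in \{\text{high},\,\text{low}\}} \frac{N_j}{N}
    \sum_{k \in \{\text{good},\,\text{moderate},\,\text{poor}\}}
      s_k \, \frac{n^{\text{obs}}_{jk}}{n^{\text{obs}}_{j\cdot}},
\qquad
n^{\text{obs}}_{j\cdot} = \sum_{k} n^{\text{obs}}_{jk}, \quad N = \sum_j N_j = 100 .
```

When the missingness does not depend on Cog85 either, the row weights $N_j/N$ and the observed row shares coincide on average, and this reduces to the ordinary observed-data mean.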

Table 1: Cognitive function of 100 males (true data)

                   Cog90 (age 90)
Cog85 (age 85)     good   moderate   poor   margin
high                 44         24      4       72
low                   4         12     12       28
margin               48         36     16      100

Table 2: Cognitive function of 100 males (observed data A)

                   Cog90 (age 90)
Cog85 (age 85)     good   moderate   poor   margin
high                 33         12      1       46
low                   3          6      3       12
margin               36         18      4       58

Table 3: Cognitive function of 100 males (observed data B)

                   Cog90 (age 90)
Cog85 (age 85)     good   moderate   poor   margin
high                 22         12      2       36
low                   2          6      6       14
margin               24         18      8       50

Table 4: Cognitive function of 100 males (observed data C)

                   Cog90 (age 90)
Cog85 (age 85)     good   moderate   poor   margin
high                 33         18      3       54
low                   1          3      3        7
margin               34         21      6       61
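The calculator work for Practical A can also be done in a few lines of code. The following is a minimal sketch, assuming the table columns are ordered good, moderate, poor; the function names are illustrative and not part of the handout. It computes the marginal and conditional distributions of Cog90 and the per-cell proportion missing suggested in the hint to question 3(e).

```python
# Counts from Tables 1-4. Rows are Cog85 (high, low); columns are Cog90
# (assumed order: good, moderate, poor).
TRUE = [[44, 24, 4], [4, 12, 12]]
OBSERVED = {
    "A": [[33, 12, 1], [3, 6, 3]],
    "B": [[22, 12, 2], [2, 6, 6]],
    "C": [[33, 18, 3], [1, 3, 3]],
}


def marginal_cog90(table):
    """Proportion of individuals in each Cog90 category (column margins)."""
    total = sum(sum(row) for row in table)
    return [sum(row[k] for row in table) / total for k in range(3)]


def conditional_cog90(table, row):
    """Distribution of Cog90 within one Cog85 row (0 = high, 1 = low)."""
    n = sum(table[row])
    return [count / n for count in table[row]]


def proportion_missing(observed, true=TRUE):
    """Per-cell fraction of the true count that is absent from the observed table."""
    return [
        [1 - o / t for o, t in zip(obs_row, true_row)]
        for obs_row, true_row in zip(observed, true)
    ]


if __name__ == "__main__":
    print("True marginal Cog90:", marginal_cog90(TRUE))
    print("True Cog90 | high Cog85:", conditional_cog90(TRUE, 0))
    print("True Cog90 | low Cog85:", conditional_cog90(TRUE, 1))
    for name, table in OBSERVED.items():
        print(name, "observed marginal Cog90:", marginal_cog90(table))
        print(name, "proportion missing per cell:", proportion_missing(table))
```

Checking whether the missing proportions are constant, vary only by row, or vary by column is the comparison the hint is pointing towards.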

2 Practical B

In this practical you will construct a series of graphical models that represent the Bayesian models that you would build to answer the questions below.

Materials: You should have been given a graphical model toolkit consisting of:
- a magnetic white board
- yellow magnetic circles: use these to represent random variables
- orange magnetic circles: use these to represent random variables with missing values
- blue magnetic squares: use these to represent quantities that are regarded as constants, for example fully observed covariates
- coloured marker pens: use these for labelling your nodes (you can write on the coloured circles/squares) and drawing arrows to link nodes together. Use the black pen to draw stochastic links to which a probability relationship is attached, i.e. any relationship between two variables specified by ~ in an equation; use the red pen to draw deterministic relationships, i.e. any relationship between two variables specified by = in an equation
- a dry-wipe eraser

Research question: Suppose you are planning an analysis to investigate the question: what is the effect of ethnicity on an individual's income?

Available data: Taken from a cross-sectional survey of 1000 individuals. Available variables are:
- lpay = a continuous measure of annual income, transformed to the log scale
- age = age, a continuous variable measured in years
- eth = ethnicity, a 2-level categorical variable (0 = white; 1 = non-white)

Use your graphical models toolkit to construct DAGs to represent suitable Bayesian models to address the questions below.

(a) You decide to fit a simple Bayesian linear regression model to investigate the effect of ethnicity on income, adjusting for age, as follows:

$\text{lpay}_i \sim \text{Normal}(\mu_i, \sigma^2)$
$\mu_i = \beta_0 + \beta_{\text{eth}}\,\text{eth}_i + \beta_{\text{age}}\,\text{age}_i$
$\beta_0, \beta_{\text{eth}}, \beta_{\text{age}}, \sigma^2 \sim$ fully specified vague priors

(i) Construct a DAG to represent this model (which we will call the analysis model of interest). (Leave some space on your white board, as you will be extending this DAG in later parts of the question.)

(b) When the data arrive from the survey company, you discover that 20% of the values for lpay are missing. The survey company tells you that pay was originally collected from all the individuals in the study, but the answers for the first 200 individuals were subsequently lost due to a data handling error (embarrassingly, no backups had been taken).
(i) What sort of missing data mechanism does this correspond to?
(ii) Let m be a missingness indicator for lpay (set to 0 when pay is observed and 1 when pay is missing). Construct a DAG for m to represent a suitable model of response missingness. Does this DAG share any links with the DAG for your analysis model of interest?

(c) Suppose that the survey company contacts you again to say that they had made a mistake, and in fact they had never had income data for these 200 individuals because they had refused to answer the question.
(i) Discuss in your group how this new information might change your assumptions about the missing data mechanism.
(ii) Change your DAG to reflect any new assumptions you decide to make about the missing response mechanism.

(d) Suppose you decide that your analysis could be improved by adding education to your analysis model. The survey company tells you that they can provide you with a binary education variable indicating whether or not each subject had any formal educational qualifications.
(i) Modify your DAG to include education (denoted edu) as an additional covariate in your analysis model.

(e) When the education data arrive from the survey company, you find that 35% of the values are missing. You do some exploratory analysis of the data and find that the majority of missing values are for young, non-white individuals.
(i) Discuss in your group what might be a reasonable assumption about the missing data mechanism for edu.
(ii) Change your DAG to reflect the fact that edu contains missing values, and how you plan to model this.

(f) Optional question if you have time! You then notice that the 200 individuals with missing income data are a subset of the 350 with missing education data. You further notice that, amongst the subjects with observed income, lpay tends to be lower for those with missing values for edu.
(i) In your group, discuss the implications of this information in terms of the assumptions you have made about the missing data mechanism for edu.
(ii) If necessary, elaborate your DAG to take account of any new modelling assumptions you make.
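To see why the link between m and lpay matters, it can help to simulate it. The sketch below is illustrative only: it generates data from the analysis model using made-up parameter values, then generates the missingness indicator m from a hypothetical logistic model, logit(p_i) = theta0 + delta * (lpay_i - 9). The names theta0 and delta, the centring constant 9 and all numeric values are assumptions introduced here, not taken from the practical. With delta = 0 the missingness model shares no link with the analysis model; with delta != 0 the probability of missingness depends on the (possibly unobserved) value of lpay itself.

```python
import math
import random


def simulate(n=1000, seed=1):
    """Simulate (age, eth, lpay) from the analysis model with hypothetical parameters."""
    rng = random.Random(seed)
    beta0, beta_eth, beta_age, sigma = 9.0, -0.2, 0.01, 0.5  # made-up values
    data = []
    for _ in range(n):
        age = rng.uniform(18, 65)
        eth = 1 if rng.random() < 0.3 else 0
        mu = beta0 + beta_eth * eth + beta_age * age
        data.append((age, eth, rng.gauss(mu, sigma)))
    return data


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def missing_indicator(data, theta0, delta=0.0, seed=2):
    """Draw m_i ~ Bernoulli(p_i) with logit(p_i) = theta0 + delta * (lpay_i - 9).

    m_i = 1 means lpay_i is missing, matching the coding in the practical.
    """
    rng = random.Random(seed)
    return [
        1 if rng.random() < inv_logit(theta0 + delta * (lpay - 9.0)) else 0
        for _, _, lpay in data
    ]


def observed_mean(data, m):
    """Mean of lpay over the records with m_i = 0 (observed)."""
    obs = [lpay for (_, _, lpay), mi in zip(data, m) if mi == 0]
    return sum(obs) / len(obs)


if __name__ == "__main__":
    data = simulate()
    full_mean = sum(lpay for _, _, lpay in data) / len(data)
    theta0 = math.log(0.2 / 0.8)  # about 20% missing when delta = 0
    no_link = missing_indicator(data, theta0, delta=0.0)
    link_to_lpay = missing_indicator(data, theta0, delta=-2.0)
    print("full-data mean of lpay:   ", full_mean)
    print("observed mean, delta = 0: ", observed_mean(data, no_link))
    print("observed mean, delta = -2:", observed_mean(data, link_to_lpay))
```

A negative delta makes low earners more likely to be missing, which is one way of encoding the refusal scenario in part (c); the direction and size of delta are exactly the kind of assumption the group discussion is meant to examine.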

3 Practical C

In this exercise you will discuss a fictitious analysis of some data relating to HIV status from a demographic and health survey for an African country.

Research aim: to estimate HIV prevalence for men and women among the adult population of country X.

Available data: in addition to HIV status (h), the survey includes information on age (a), gender (g), region of residence (r), socioeconomic (s) and behavioral (b) variables. There are no results of HIV testing for 30% of eligible men and 25% of eligible women, and among those with an HIV test result, 10% have missing values for one or more of the socioeconomic or behavioral variables. In a recent study in another African country, it was found that among people who already know their HIV status, those who are HIV-positive are four times more likely to refuse a test than those who are HIV-negative.

Proposed analysis: Your colleague proposes to carry out the following analysis in order to estimate HIV prevalence using these data:
- Discard any individuals with missing covariates.
- Using data from the remaining individuals (which includes those with missing HIV status but fully observed covariates), fit the following Bayesian logistic regression model:

$h_i \sim \text{Bernoulli}(p_i)$
$\text{logit}(p_i) = \alpha_r + \beta Z$

where $Z = \{a, g, s, b\}$ and $\alpha_r$ are region-specific intercepts.
- Then calculate prevalence from the observed and imputed values of h.

In your group, discuss the following questions:

(a) What assumption is being made about the type of missingness for the response? Do you consider this assumption to be reasonable?

(b) What assumption is being made about the type of missingness for the covariates? Do you consider this assumption to be reasonable?

(c) You persuade your colleague that it is not necessary to discard the records with missing covariates, and that these could be imputed within a joint Bayesian model, along with the missing HIV test results. You plan to use the general strategy for Bayesian analysis with missing data, discussed in Lecture 4.
(i) What model would you use as a base case?
(iii) What further analyses would you carry out to explore the robustness of the results?

(d) How might you make use of the extra information about refusal rates for HIV testing in the other African country?
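One crude way to get a feel for how much the refusal information could matter is a back-of-envelope sensitivity calculation. This is a sketch under strong assumptions and is not the joint Bayesian model the practical has in mind: it assumes every missing test result is a refusal, that the fourfold ratio is a ratio of refusal probabilities that transfers to country X and to the whole population, and it ignores covariates. The function name and the example prevalence of 10% among the tested are hypothetical. Writing pi for the overall prevalence, q for the fraction refusing and k for the ratio P(refuse | HIV+) / P(refuse | HIV-), the prevalence among the tested is p = pi(1 - k r) / (1 - q) with r = q / (1 + (k - 1) pi), which gives a quadratic in pi.

```python
import math


def adjusted_prevalence(p_tested, frac_refused, risk_ratio):
    """Overall prevalence implied by the prevalence among those tested.

    p_tested     : prevalence among people with a test result
    frac_refused : fraction of eligible people with no test result (all
                   assumed to be refusals)
    risk_ratio   : P(refuse | HIV+) / P(refuse | HIV-), assumed >= 1
    """
    k, q = risk_ratio, frac_refused
    if k == 1:
        # Refusal unrelated to HIV status: the tested are representative.
        return p_tested
    # Quadratic a*pi^2 + b*pi + c = 0 from the relations in the lead-in.
    a = k - 1
    b = 1 - k * q - p_tested * (1 - q) * (k - 1)
    c = -p_tested * (1 - q)
    # a > 0 and c < 0, so exactly one root is positive.
    return (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)


if __name__ == "__main__":
    # 30% of men untested (from the practical); 10% prevalence among the
    # tested is a hypothetical value chosen purely for illustration.
    for k in (1, 2, 4):
        print("ratio", k, "-> implied prevalence", adjusted_prevalence(0.10, 0.30, k))
```

In a full Bayesian treatment the same external information would more naturally enter as an informative prior on the parameter linking h to the missingness indicator, with the ratio varied across sensitivity analyses.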