SCHIATTINO ET AL. Biol Res 38, 2005, Disciplinary Program of Immunology, ICBM, Faculty of Medicine, University of Chile. 3

Similar documents
A COMPARISON OF IMPUTATION METHODS FOR MISSING DATA IN A MULTI-CENTER RANDOMIZED CLINICAL TRIAL: THE IMPACT STUDY

Kelvin Chan Feb 10, 2015

Bias in regression coefficient estimates when assumptions for handling missing data are violated: a simulation study

S Imputation of Categorical Missing Data: A comparison of Multivariate Normal and. Multinomial Methods. Holmes Finch.

Section on Survey Research Methods JSM 2009

Appendix 1. Sensitivity analysis for ACQ: missing value analysis by multiple imputation

Methods for Computing Missing Item Response in Psychometric Scale Construction

Statistical Analysis Using Machine Learning Approach for Multiple Imputation of Missing Data

Using Test Databases to Evaluate Record Linkage Models and Train Linkage Practitioners

An application of a pattern-mixture model with multiple imputation for the analysis of longitudinal trials with protocol deviations

The prevention and handling of the missing data

An Introduction to Multiple Imputation for Missing Items in Complex Surveys

Logistic Regression with Missing Data: A Comparison of Handling Methods, and Effects of Percent Missing Values

Catherine A. Welch 1*, Séverine Sabia 1,2, Eric Brunner 1, Mika Kivimäki 1 and Martin J. Shipley 1

Multiple imputation for handling missing outcome data when estimating the relative risk

SESUG Paper SD

Abstract. Introduction A SIMULATION STUDY OF ESTIMATORS FOR RATES OF CHANGES IN LONGITUDINAL STUDIES WITH ATTRITION

Statistical Audit. Summary. Conceptual and. framework. MICHAELA SAISANA and ANDREA SALTELLI European Commission Joint Research Centre (Ispra, Italy)

Best Practice in Handling Cases of Missing or Incomplete Values in Data Analysis: A Guide against Eliminating Other Important Data

Combining Risks from Several Tumors Using Markov Chain Monte Carlo

Jinhui Ma 1,2,3, Parminder Raina 1,2, Joseph Beyene 1 and Lehana Thabane 1,3,4,5*

Bayesian Estimation of a Meta-analysis model using Gibbs sampler

Application of Multinomial-Dirichlet Conjugate in MCMC Estimation : A Breast Cancer Study

Missing Data and Institutional Research

Multiple Imputation For Missing Data: What Is It And How Can I Use It?

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison

Estimating treatment effects from longitudinal clinical trial data with missing values: comparative analyses using different methods

Mediation Analysis With Principal Stratification

Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data

ethnicity recording in primary care

SLAUGHTER PIG MARKETING MANAGEMENT: UTILIZATION OF HIGHLY BIASED HERD SPECIFIC DATA. Henrik Kure

Multilevel analysis quantifies variation in the experimental effect while optimizing power and preventing false positives

bivariate analysis: The statistical analysis of the relationship between two variables.

Statistical data preparation: management of missing values and outliers

Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination

Accuracy of Range Restriction Correction with Multiple Imputation in Small and Moderate Samples: A Simulation Study

Bayesian Inference Bayes Laplace

Chapter 3 Missing data in a multi-item questionnaire are best handled by multiple imputation at the item score level

NIH Public Access Author Manuscript Med Sci Sports Exerc. Author manuscript; available in PMC 2008 June 20.

Multiple imputation for multivariate missing-data. problems: a data analyst s perspective. Joseph L. Schafer and Maren K. Olsen

Applied Medical. Statistics Using SAS. Geoff Der. Brian S. Everitt. CRC Press. Taylor Si Francis Croup. Taylor & Francis Croup, an informa business

Evaluators Perspectives on Research on Evaluation

Missing Data and Imputation

BMC Medical Research Methodology

Missing by Design: Planned Missing-Data Designs in Social Science

Analysis of Vaccine Effects on Post-Infection Endpoints Biostat 578A Lecture 3

Advanced Bayesian Models for the Social Sciences

Proof. Revised. Chapter 12 General and Specific Factors in Selection Modeling Introduction. Bengt Muthén

Review of Pre-crash Behaviour in Fatal Road Collisions Report 1: Alcohol

Statistical Primer for Cardiovascular Research

Analysis Strategies for Clinical Trials with Treatment Non-Adherence Bohdana Ratitch, PhD

Inclusive Strategy with Confirmatory Factor Analysis, Multiple Imputation, and. All Incomplete Variables. Jin Eun Yoo, Brian French, Susan Maller

Advanced Bayesian Models for the Social Sciences. TA: Elizabeth Menninga (University of North Carolina, Chapel Hill)

BIOSTATISTICAL METHODS AND RESEARCH DESIGNS. Xihong Lin Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA

Exploring the Impact of Missing Data in Multiple Regression

WinBUGS : part 1. Bruno Boulanger Jonathan Jaeger Astrid Jullion Philippe Lambert. Gabriele, living with rheumatoid arthritis

Nonresponse Rates and Nonresponse Bias In Household Surveys

Applications with Bayesian Approach

Methods for treating bias in ISTAT mixed mode social surveys

MEA DISCUSSION PAPERS

Monte Carlo Analysis of Univariate Statistical Outlier Techniques Mark W. Lukens

Selected Topics in Biostatistics Seminar Series. Missing Data. Sponsored by: Center For Clinical Investigation and Cleveland CTSC

OLS Regression with Clustered Data

SCHOOL OF MATHEMATICS AND STATISTICS

Imputation classes as a framework for inferences from non-random samples. 1

ST440/550: Applied Bayesian Statistics. (10) Frequentist Properties of Bayesian Methods

Modelling heterogeneity variances in multiple treatment comparison meta-analysis Are informative priors the better solution?

Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm

Reporting and dealing with missing quality of life data in RCTs: has the picture changed in the last decade?

Using SAS to Calculate Tests of Cliff s Delta. Kristine Y. Hogarty and Jeffrey D. Kromrey

Introduction to Meta-Analysis

Title. Description. Remarks. Motivating example. intro substantive Introduction to multiple-imputation analysis

On Missing Data and Genotyping Errors in Association Studies

Sequential nonparametric regression multiple imputations. Irina Bondarenko and Trivellore Raghunathan

ABSTRACT THE INDEPENDENT MEANS T-TEST AND ALTERNATIVES SESUG Paper PO-10

Meta-analysis of two studies in the presence of heterogeneity with applications in rare diseases

Introduction to Survival Analysis Procedures (Chapter)

1 Missing Data. Geert Molenberghs and Emmanuel Lesaffre. 1.1 Introduction

COMMITTEE FOR PROPRIETARY MEDICINAL PRODUCTS (CPMP) POINTS TO CONSIDER ON MISSING DATA

Technical Appendix: Methods and Results of Growth Mixture Modelling

Missing data in medical research is

Score Tests of Normality in Bivariate Probit Models

Introduction ORIGINAL ARTICLE

Many studies conducted in practice settings collect patient-level. Multilevel Modeling and Practice-Based Research

Advanced Handling of Missing Data

Retrospective Genetic Analysis of Efficacy and Adverse Events in a Rheumatoid Arthritis Population Treated with Methotrexate and Anti-TNF-α

Missing data in clinical trials: making the best of what we haven t got.

Imputation approaches for potential outcomes in causal inference

Russian Journal of Agricultural and Socio-Economic Sciences, 3(15)

Practice of Epidemiology. Strategies for Multiple Imputation in Longitudinal Studies

Bayesian and Frequentist Approaches

Impact and adjustment of selection bias. in the assessment of measurement equivalence

UK Liver Transplant Audit

Detection of Unknown Confounders. by Bayesian Confirmatory Factor Analysis

Technical Whitepaper

Genomics Research. May 31, Malvika Pillai

A Bayesian Account of Reconstructive Memory

Tilburg University. Published in: Methodology: European Journal of Research Methods for the Behavioral and Social Sciences

A Strategy for Handling Missing Data in the Longitudinal Study of Young People in England (LSYPE)

MISSING DATA AND PARAMETERS ESTIMATES IN MULTIDIMENSIONAL ITEM RESPONSE MODELS. Federico Andreis, Pier Alda Ferrari *

Transcription:

Biol Res 38: 7-12, 2005 BR 7 Multiple imputation procedures allow the rescue of missing data: An application to determine serum tumor necrosis factor (TNF) concentration values during the treatment of rheumatoid arthritis patients with anti-tnf therapy IRENE SCHIATTINO 1, RODRIGO VILLEGAS 1, ANDREA CRUZAT 2, JIMENA CUENCA 2, LORENA SALAZAR 2, OCTAVIO ARAVENA 2, BÁRBARA PESCE 2, DIEGO CATALÁN 2, CAROLINA LLANOS 2,3, MIGUEL CUCHACOVICH 3 and JUAN C. AGUILLÓN 2 1 School of Public Health, Faculty of Medicine, University of Chile. 2 Disciplinary Program of Immunology, ICBM, Faculty of Medicine, University of Chile. 3 Rheumatology Section, Department of Medicine, University of Chile Clinical Hospital. ABSTRACT Longitudinal studies aimed at evaluating patients clinical response to specific therapeutic treatments are frequently summarized in incomplete datasets due to missing data. Multivariate statistical procedures use only complete cases, deleting any case with missing data. MI and MIANALYZE procedures of the SAS software perform multiple imputations based on the Markov Chain Monte Carlo method to replace each missing value with a plausible value and to evaluate the efficiency of such missing data treatment. The objective of this work was to compare the evaluation of differences in the increase of serum TNF concentrations depending on the 308 TNF promoter genotype of rheumatoid arthritis (RA) patients receiving anti-tnf therapy with and without multiple imputations of missing data based on mixed models for repeated measures. Our results indicate that the relative efficiency of our multiple imputation model is greater than 98% and that the related inference was significant (p-value < 0.001). We established that under both approaches serum TNF levels in RA patients bearing the G/A 308 TNF promoter genotype displayed a significantly (p-value < 0.0001) increased ability to produce TNF over time than the G/G patient group, as they received successively doses of anti-tnf therapy. Key terms: multiple imputation, mixed model, TNF polymorphism. INTRODUCTION Longitudinal studies oriented toward evaluating patients clinical responses to specific therapeutic treatments are frequently affected by a lack of significant elements of laboratory data. The primary cause that generates incomplete data in a clinical study is the scientists inability to obtain patient blood samples at defined times of the protocol. Unlike the conventional statistic procedures that use only complete cases, the inferential analysis allows statistical evaluations in spite of partial loss of scientific information (missing data) (Lavory et al., 1995). When a variable is studied over time, an imputation procedure allows us to predict plausible values for those unavailable figures. The new analysis will be based on those known values for the variable and the statistical evaluation will proceed as if the Corresponding author: Dr. Juan C. Aguillón, Programa Disciplinario de Inmunología, ICBM, Facultad de Medicina, Universidad de Chile, Independencia 1027, Santiago. Casilla 13898, Correo 21. Tel: (56 2) 678-6724, Fax: (56 2) 735-3346. E-mail: jaguillo@med.uchile.cl Recieved: March 2, 2004. In Revised Form: November 2, 2004. Accepted: April 22, 2005

8 information were complete. However, the application of the simple imputation analysis could generate slanted estimations, underestimated standard errors, and distortional hypothesis tests (Rubin, 1976). Multiple imputation methods developed in recent years combine the simple imputation benefits with the incorporation of a multiple component to capture the uncertainty of absent data (Rubin, 1987). This component is obtained by the creation of k (typically 5 to 10) databases that contain plausible values for the incomplete cases. Each imputed file is analyzed separately by a standard statistical procedure, and the results obtained are combined in a single group with the estimated parameters that reflect the uncertainty provided for the missing data in the original database. The multiple imputation (MI) and multiple imputation analysis (MIANALYZE) procedures of the SAS software offer an interesting possibility to implement this strategy (Darmawan, 2002). This study aims to apply this statistical methodology to a study with repeated measures and missing data. We evaluated differences in the increase of serum TNF concentrations depending on the 308 TNF promoter genotype of rheumatoid arthritis (RA) patients receiving anti-tnf therapy. The successful rescue of seven missing serum TNF levels allows us to establish that RA patients with G/A genotype displayed higher serum amounts of the cytokine than those of the G/G patient group during the treatment. MATERIAL AND METHODS Patients and Laboratory Determinations Twenty RA patients, as defined by the American College of Rheumatology criteria of diagnosis, were treated with anti-tnf therapy (Infliximab ) to determine the influence of the 308 (G/A) TNF promoter polymorphism on the responsiveness to the treatment. For this purpose 10 G/A and 10 G/G RA patients received Infliximab (3 mg/kg of weight) at the beginning of the treatment (week 0) and at weeks 2, 6 and 14 thereafter (Cruzat et al., 2003; Cuchacovich et al., 2004). Before each Infliximab infusion serum TNF concentrations were measured, with 4 dependent variables (y1, y2, y3 and y4) per individual. In this setup we had four individual dependent observations; from a statistical point of view we had a repeated measure problem. 308 TNF genotypes were performed by polymerize chain reaction followed by a restriction fragment length polymorphism (Cuenca et al., 2001; 2003). Statistical Methodology The statistical treatment of repeated measures implies a multivariate dependence structure. We also had a treatment-fixed effect and a patient random factor, therefore suggesting that a mixed model analysis would be a valid alternative. The procedures, options, and sentences used in this work are documented at http:// support.sas.com. The Mixed Models method was considered to be the standard procedure, and it is assumed that missing data would not only form an arbitrary pattern, but would occur in any variable without any order restriction and depending only on the same variable (Darmawan, 2002; Der and Everitt, 2002). The MI procedure uses Markov Chain Monte Carlo Method (MCMC) to generate random variables in multiple chains, producing k sets of complete data. Parameter estimates are computed k times by the MIXED procedure, generating valid statistic inferences. The MIANALYZE procedure reads the estimated parameters and the associated covariance matrix. Missing data are assumed to be partially or completely randomly distributed, and the procedures for this application are summarized in Figure 1 (Littell et al., 1996; Chantala and Suchindran, 2003). For the multiple imputation procedure with the k imputed and analyzed sets, we define: ^ Q i = Estimation for the i th analyzed group (i ^ = 1, 2, 3,, k). U i Variance for the i th analyzed group. Q = ^ Σ Q, i corresponds l k k i=1

9 Complete Analysis of the datasets imputed datasets Figure 1: Schematic representation for the application of MI, Mixed, and MIANALYSE procedures. The analysis is performed under the assumption that missing data are partially or completely distributed at random. to the multiple imputation s estimated point and represents the estimated average for each analyzed group. The total estimated variance associated with Q is (Rubin, 1987): T = 1 k ^ Σ U i + k +1 l [ Σ k (Q i - Q) ] 2 k i=1 k k - 1 i=1 The first term corresponding to the within imputations variance and the second term to the between imputations variance. RESULTS AND DISCUSSION Means and standard deviations of age for G/A and G/G RA patients were 47.0 ± 10.6 and 53.0 ± 12.6 years, respectively. All G/A patients and 90% of G/G patients were females. No significant differences on age or sex were observed between G/A and G/G groups. Table I shows descriptive statistics (minimum, maximum, mean, and standard deviation) for each of the four serum TNF concentrations in the G/A and the G/G patient groups. The third column displays the corresponding number of missing data. In this case the variables correspond to TNF measured immediately before patients began the treatment (y1) and before patients received the second (y2), the third (y3) and fourth (y4) Infliximab doses. Differences between G/A and G/G groups for serum TNF concentrations at time 1 and 2 were not significant. Since most of missing data for TNF measurements were concentrated at time 3 and 4, it is difficult to evaluate differences between groups for those times. Considering a total of seven missing values (35%) for the TNF evaluation on time, the MI procedure was applied. MI procedure uses Markov Chain Monte Carlo Method (MCMC) to generate random

10 variables in multiple chains for the imputation analysis performed (Li, 1988). In this application, the procedure completed 50 iterations before the imputed dataset was created. Since there is no a priori information about the media and covariance estimation, by default a non-informative a priori Jeffrey s distribution was used to estimate the media and covariances that will be used in the next step (Schafer, 1997). In our model, complete data were available in 75% of the observations, and the last measurement was missing in four out of twenty patients. Table II shows the three missing data patterns observed in our original data with the corresponding absolute and relative frequencies. The fraction of missing information (λ) and the relative efficiencies (1+λ/k) -1 for k = 5 imputations for each of the variables with missing data are shown in Table III. The multiple-imputation parameter estimates for each variable are displayed in Table IV. The inference for the estimated parameters are based on t distribution, and in this case, all three are significantly different than zero (p<0.001). Next the MIXED procedure with the affirmation BY _IMPUTATION_ was applied for each of the five complete data sets created. The MIANALYZE procedure was applied using the previous results. The estimations and the multiple imputation inference of estimated parameters proved to be significant to the variable time. A similar result was obtained when a patient from the original database belonging to the G/G genotype group and whose serum TNF concentration was only determined at the beginning of treatment was excluded (Table V). The missing information in our original database was greater than 5% of total data. In this situation, Roth (1994) and SAS Institute Inc. (1999) recommend the use of multiple imputation methods instead of simple imputation methods. TABLE I Descriptive statistics and number of missing data in both groups of patients for each variable Patient Groups and N N of Variable (TNF Levels at Missing Serum TNF Concentration (pg/ml) Time of Determination) Data Min Max Mean Standard deviation G/G: y1 10 0 4.0 92.6 18.3 28.8 y2 9 1 6.9 159.0 68.7 44.8 y3 9 1 10.0 316.0 133.1 100.3 y4 8 2 7.3 409.0 202.5 171.9 G/A: y1 10 0 4.0 52.5 12.8 16.5 y2 10 0 4.0 196.0 76.3 66.3 y3 10 0 4.2 148.0 72.8 44.7 y4 7 3 4.0 242.0 98.9 90.9 TABLE II Observed missing data patterns (.), frequency and percentage of the groups Pattern y1 y2 y3 y4 Frequency Percentage (%) 1 15 75.00 2. 4 20.00 3.. 1 5.00

11 TABLE III The fraction of missing information (λ) and the relative efficiencies for k=5 imputations for each one of the variables with missing data Variable Fraction of missing information (λ) RelativeEfficiency = (1+λ/k) -1 y 2 0.05 0.989 y 3 0.04 0.992 y 4 0.12 0.975 TABLE IV Multiple-Imputation Parameter Estimates. Estimated mean and standard error for each variable. The inferences are based on the t distribution Variable Mean Std Error Mean 95% Confidence Limits DF t for H0: Mean=Mu0 y2 73.195 12.674 46.327 100.063 16 5.775* y3 100.529 18.070 62.222 138.838 16 5.563* y4 156.798 33.726 84.463 229.132 14 4.649* * Significant (p-value<0.05) TABLE V Final information after the procedure was applied The Mixed Procedure without Multiple-Imputation Variable Mean Std ErrorMean DF t value Pr > t Intercept -30.09 11.52 18-2.61 0.02 Group -0.39 8.34 18-0.05 0.96 Time 46.09 9.04 18 5.10 <0.001 The MIANALYZE Procedure: Multiple-Imputation Parameter Estimates of our original database (20 patients). Intercept -28.84 11.07 4985-2.61 0.009 Group 1.46 8.06 14426 0.18 0.857 Time 45.11 8.64 10635 5.22 <.001 The MIANALYZE Procedure: Multiple-Imputation Parameter Estimates of the database without the patient whose serum TNF concentration was only determined at the beginning of treatment (19 patients). Intercept -28.99 11.74 5176-2.47 0.01 Group -0.86 8.85 1076-0.09 0.92 Time 45.75 8.92 101448 5.12 <0.01 The assumption of normality for applying this methodology could be managed by the use of normalizing transformations available in the MI procedure or alternatively by using the multiple imputation for non-normal distributions model, which assumes that the slant produced is minimum (SAS Institute Inc., 1999). Our results indicate that the relative efficiency of the multiple imputation model is greater than 99% for y2 and y3 variable values and 98% for the y4 value. Interestingly, the related inference was significant (p-value < 0.001). Based on Mixed Models, we demonstrated that serum TNF levels of RA patients bearing the G/A 308 TNF promoter genotype significantly (p-value < 0.001) increase over time as they received successive doses of anti-tnf therapy. This observation is qualitatively concordant with other results obtained without multiple imputation analysis.

12 Finally, we can conclude that the presence of missing data from a database does not create an unsolved problematic situation if they are treated with Mixed Models. The application of this methodology allows the construction of a complete database, avoiding the statistical analysis problem derived from repeating measures of missing data. REFERENCES CRUZAT A, CUCHACOVICH M, SALAZAR L, CATALÁN D, SCHIATTINO I, AGUILLÓN JC (2003) Treatment with anti-tnf monoclonal antibodies in rheumatoid arthritis and the 308 TNF promoter polymorphism influence. Biol Res 36(3-4): R62 CUCHACOVICH M, FERREIRA L, ALISTE M, SOTO L, CUENCA J, CRUZAT A, GATICA H, SCHIATTINO I, PÉREZ C, AGUIRRE A, SALAZAR-ONFRAY F, AGUILLÓN JC (2004) TNF-α levels and influence of 308 TNF-a promoter polymorphism on the responsiveness to infliximab in patients with rheumatoid arthritis. Scand J Rheumatol 33: 228-232 CUENCA J, PÉREZ C, AGUIRRE A, SCHIATTINO I, AGUILLÓN JC (2001) Genetic polymorphism at position 308 in the promoter region of the tumor necrosis factor (TNF): Implications of its allelic distribution on susceptibility or resistance to diseases in the Chilean population. Biol Res 34: 237-241 CUENCA J, CUCHACOVICH M, PÉREZ C, FERREIRA L, AGUIRRE A, SCHIATTINO I, SOTO L, CRUZAT A, SALAZAR-ONFRAY F, AGUILLÓN JC (2003) The 308 polymorphism in the tumor necrosis factor gene promoter region and ex - vivo lipopolysaccharideinduced TNF expression and cytotoxic activity in chilean patients with rheumatoid arthritis. Rheumatology (Oxford) 42: 308-313 CHANTALA K, SUCHINDRAN C (2003) Multiple Imputation for Missing Data. SAS OnlineDoc TM : Version 8. www.cpc.unc.edu/services/computer/ presentations/ mi_presentation2.pdf DARMAWAN IGN (2002) NORM software review: handling missing values with multiple imputation methods. Evaluation Journal of Australasia 2 (1): 20-24 DER G, EVERITT B (2002) A Handbook of Statistical Analyses using SAS. New York: Chapman & Hall LAVORI PW, DAWSON R, SHERA D (1995) A Multiple Imputation Strategy for Clinical Trials with Truncation of Patient Data. Statistics in Medicine, 14: 1913-1925 LI KH (1988) Imputation using Markov Chains. J Stat Comput Simul 30: 57-79 LITTELL RC, MILLIKEN GA, STROUP WW, WOLFINGER R (1996) SAS System for Mixed Models. SAS Institute Inc., Cary, NC, USA ROTH P (1994) Missing data: A conceptual review for applied psychologist. Personnel Psychology, 47: 537-560 RUBIN DB (1976) Inference and missing data. Biometrika 63: 581-592 RUBIN DB (1987) Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons, Inc. SCHAFER JL (1997) Analysis of incomplete multivariate data. New York: Chapman & Hall SAS Institute Inc. (1999) SAS Procedures Guide, Version 8, Cary, NC: SAS Institute Inc.