
Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data

Karl Bang Christensen, National Institute of Occupational Health, Denmark
Helene Feveille, National Institute of Occupational Health, Denmark
Svend Kreiner, Department of Biostatistics, University of Copenhagen
Jakob Bue Bjorner, QualityMetric, Inc.

July 3, 2006

Abstract

Psychometric scales and ordinal categorical items are often used in surveys with different modes of administration, such as mailed questionnaires and telephone interviews. If items are perceived differently over the telephone than on paper, this can be a source of bias. We describe how such bias can be tested using item response theory and propose a method that corrects for mode differences using multiple imputation combined with item response theory. The method is motivated and illustrated by analyzing data from an occupational health study that used mailed questionnaire and telephone interview data to evaluate job influence.

Keywords: Item response theory, survey, multiple imputation, mode of administration, telephone interview.

Introduction

Many surveys use different modes of administration, e.g. mail, telephone, or in-person administration, and the effect that the mode of administration has on responses can affect the results (McHorney, Kosinski, & Ware, 1994). While mixed modes of administration can be used to improve response rates, differences in response behavior can lead to biased results. A study comparing respondents to mailed questionnaires with nonrespondents subsequently interviewed by telephone (Brambilla & McKinlay, 1987) found socioeconomic differences between the two groups; controlling for these differences did not, however, remove differences in reported health outcomes between mail and telephone respondents.

When responses are not comparable across modes of administration, the situation can be regarded as a missing data problem in the sense that, e.g., persons responding by mailed questionnaire have a missing telephone interview response. Multiple imputation (Rubin, 1976, 1987) replaces each missing value with a number of plausible values, say five, that represent the uncertainty about the right value to impute. Each of the resulting complete data sets is analyzed, and the results are then combined, generating valid statistical inferences that properly reflect the uncertainty due to missing values. In a recent study, Powers and colleagues estimated the differences in self-rated health by mode of administration and used multiple imputation to make self-rated health comparable for telephone and mail administered questionnaires (Powers, Mishra, & Young, 2005). The effect of mode of administration on changes in mental health scores was of a magnitude considered clinically meaningful, and multiple imputation yielded effect estimates that were similar for telephone and mail respondents.
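For reference, the combining step works as follows (Rubin, 1987); this is a brief sketch, with $\hat{Q}_j$ denoting the estimate and $U_j$ its variance from imputed data set $j = 1, \ldots, m$, notation introduced here only for this illustration:

$$\bar{Q} = \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j, \qquad \bar{U} = \frac{1}{m}\sum_{j=1}^{m}U_j, \qquad B = \frac{1}{m-1}\sum_{j=1}^{m}\bigl(\hat{Q}_j-\bar{Q}\bigr)^2, \qquad T = \bar{U} + \Bigl(1+\frac{1}{m}\Bigr)B.$$

Inference for the quantity of interest is then based on $\bar{Q}$ with total variance $T$, which combines within- and between-imputation variability.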

Multiple imputation is thus an attractive method for adjusting scale scores for mode of administration bias. For survey purposes, imputing data on the sum score scale is of particular interest, because underlying latent variables are seldom used when reporting results. This paper describes how translation based on the joint distribution of sum scores can be carried out using item response theory models. This yields a framework in which differences in response behavior can be tested, interpreted, and taken into account.

Statistical model

In this section the model is described for the situation where subjects are randomized to either mailed questionnaire or telephone interview. Persons who are randomized to one mode of administration but for some reason respond using the other are excluded, as are persons with incomplete item responses. Translation from observed mailed questionnaire responses to telephone interview responses is used as an example.

Let $Y_i$ denote the response to item $i$, for $i = 1, \ldots, I$, and let $X$ denote an indicator taking the value one if the person responded by telephone interview and zero otherwise. The probability of each of the ordinal categorical item responses $h = 1, \ldots, m_i$ is modeled using a generalized partial credit model (Muraki, 1992) in which the item parameters depend on mode of administration:

$$P(Y_i = h \mid \theta, X = x) = \frac{\exp\left(\sum_{v=1}^{h} \alpha_i^{(x)} (\theta - \beta_{iv}^{(x)})\right)}{\sum_{c=1}^{m_i} \exp\left(\sum_{v=1}^{c} \alpha_i^{(x)} (\theta - \beta_{iv}^{(x)})\right)} \qquad (1)$$

where $(\alpha_i^{(0)}, \beta_{i1}^{(0)}, \ldots, \beta_{im_i}^{(0)})$ are the item parameters for the questionnaire sample, $(\alpha_i^{(1)}, \beta_{i1}^{(1)}, \ldots, \beta_{im_i}^{(1)})$ are the item parameters for the telephone sample, and $\theta$ is the person parameter. The $\alpha$'s are discrimination parameters and the $\beta$'s are threshold parameters (the locations on the $\theta$-axis where adjacent response categories are equally probable).
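To make Equation 1 concrete, the category probabilities can be computed directly from a set of parameter values. Below is a minimal Python sketch; the function name gpcm_probs is introduced only for this illustration, and the parameter values used are the estimates for the first influence item reported in Table 1 of the Example section.

```python
import numpy as np

def gpcm_probs(theta, alpha, beta):
    """Category probabilities under Equation 1 for one item and one mode.

    theta : person parameter
    alpha : discrimination parameter for this item and mode
    beta  : thresholds beta_{i1}, ..., beta_{im_i}, with beta_{i1} = 0 for
            identification (see below)
    """
    beta = np.asarray(beta, dtype=float)
    # Numerator exponents: sum_{v=1}^{h} alpha * (theta - beta_v), h = 1, ..., m_i
    exponents = np.cumsum(alpha * (theta - beta))
    exponents -= exponents.max()           # numerical stability
    w = np.exp(exponents)
    return w / w.sum()                     # probabilities for categories 1, ..., m_i

# Estimates for the first influence item, taken from Table 1:
probs_mail = gpcm_probs(theta=0.0, alpha=1.40,
                        beta=[0.0, -1.92, -1.29, -0.54, 1.23])   # x = 0
probs_tel = gpcm_probs(theta=0.0, alpha=0.92,
                       beta=[0.0, -0.94, -1.33, -1.03, 0.71])    # x = 1
print(probs_mail.round(3), probs_tel.round(3))
```

At $\theta = 0$, for example, the telephone parameters put more probability on the two extreme categories than the questionnaire parameters, which is the pattern discussed in the Example section.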

Assuming that item responses are conditionally independent given the person parameter (local independence), the probability of a response vector $y = (y_1, \ldots, y_I)$ is given by the product

$$P(Y_1 = y_1, \ldots, Y_I = y_I \mid \theta, X = x) = \prod_{i=1}^{I} P(Y_i = y_i \mid \theta, X = x). \qquad (2)$$

The model is identified if $\beta_{i1}^{(x)} = 0$ for $x = 0, 1$ and all $i$, and the distribution of the latent variable $\theta$ in the population is assumed to be standard normal. This implies that there are no differences between the mailed questionnaire sample and the telephone interview sample with respect to the trait being measured, an assumption that is realistic when subjects are randomized to either mailed questionnaire or telephone interview. The marginal distribution is then

$$P(Y_1 = y_1, \ldots, Y_I = y_I \mid X = x) = \int \prod_{i=1}^{I} P(Y_i = y_i \mid \theta, X = x)\,\varphi(\theta)\,d\theta \qquad (3)$$

where $\varphi(\theta) = \frac{1}{\sqrt{2\pi}} \exp(-\tfrac{1}{2}\theta^2)$. A likelihood approach based on these probabilities yields marginal maximum likelihood estimates of the item parameters, and a likelihood ratio test of the null hypothesis

$$H_0: (\alpha_i^{(0)}, \beta_{i1}^{(0)}, \ldots, \beta_{im_i}^{(0)}) = (\alpha_i^{(1)}, \beta_{i1}^{(1)}, \ldots, \beta_{im_i}^{(1)}) \qquad (4)$$

can be computed. Other hypotheses could also be of interest, e.g. equal

thresholds but different discriminations.

For the sum score $S = \sum_{i=1}^{I} Y_i$, the distribution conditional on the mode of administration $X$ can be computed by the summation

$$P(S = s \mid \theta, X = x) = \sum_{(y_1, \ldots, y_I)} P(Y_1 = y_1, \ldots, Y_I = y_I \mid \theta, X = x) \qquad (5)$$

over the set of response vectors $(y_1, \ldots, y_I)$ with $\sum_{i=1}^{I} y_i = s$. If a person were to respond to both sets of items independently, the joint probabilities would be $P(S = s_1 \mid \theta, X = 1)\,P(S = s_0 \mid \theta, X = 0)$, and this can be used to compute the joint marginal probabilities

$$Q(s_1, s_0) = \int P(S = s_1 \mid \theta, X = 1)\,P(S = s_0 \mid \theta, X = 0)\,\varphi(\theta)\,d\theta. \qquad (6)$$

Based on these joint marginal probabilities, the conditional distribution of telephone responses given questionnaire responses,

$$Q(s_1 \mid s_0) = \frac{Q(s_1, s_0)}{\sum_{s} Q(s, s_0)}, \qquad (7)$$

can be computed.
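As a computational aside, the summation in Equation 5 can be carried out by convolving the item category distributions instead of enumerating response vectors, and the integral in Equation 6 can be approximated by Gauss-Hermite quadrature. A rough Python sketch under these assumptions follows; items are scored 0, ..., m_i - 1 here so that the sum score starts at zero, and all function names are introduced only for this illustration.

```python
import numpy as np

def gpcm_probs(theta, alpha, beta):
    """Category probabilities for one item, Equation 1 (repeated from the
    previous sketch so this block runs on its own); beta[0] = 0."""
    exponents = np.cumsum(alpha * (theta - np.asarray(beta, dtype=float)))
    exponents -= exponents.max()
    w = np.exp(exponents)
    return w / w.sum()

def sum_score_dist(theta, alphas, betas):
    """P(S = s | theta, X), Equation 5, by convolving the item category
    distributions; items are scored 0, ..., m_i - 1 in this sketch."""
    dist = np.array([1.0])
    for alpha, beta in zip(alphas, betas):
        dist = np.convolve(dist, gpcm_probs(theta, alpha, beta))
    return dist  # dist[s] = P(S = s | theta, X)

def joint_score_dist(alphas_tel, betas_tel, alphas_mail, betas_mail, n_nodes=41):
    """Q(s1, s0), Equation 6, using Gauss-Hermite quadrature for the integral
    over the standard normal latent variable."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    weights = weights / weights.sum()            # quadrature weights for N(0, 1)
    Q = 0.0
    for theta, w in zip(nodes, weights):
        p_tel = sum_score_dist(theta, alphas_tel, betas_tel)     # X = 1
        p_mail = sum_score_dist(theta, alphas_mail, betas_mail)  # X = 0
        Q = Q + w * np.outer(p_tel, p_mail)
    return Q  # Q[s1, s0]

def conditional_dist(Q):
    """Q(s1 | s0), Equation 7: each column of Q normalised over s1."""
    return Q / Q.sum(axis=0, keepdims=True)
```

The matrix returned by conditional_dist corresponds to Equation 7 and is what the imputation step in the next section draws from.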

Translation of scores between modes of administration

A model based on the probabilities in Equation 1 enables translation of observed sum scores from mailed questionnaires to telephone interview sum scores. This can be done using the joint probabilities in Equation 6. Given a mailed questionnaire scale score $s_0$, that is, observing $(S, X) = (s_0, 0)$, an estimate of the sum score on the telephone interview scale can be given in several ways, e.g. as the conditional mean $\sum_{s} s\,Q(s \mid s_0)$ or as the most likely telephone interview sum score, that is, the $s_1$ for which $Q(s_1 \mid s_0) = \max_{s} Q(s \mid s_0)$. All that is required is computation of the joint probabilities in Equation 6. These are reasonable translations for individual responses, but they ignore the uncertainty. When the focus is on groups rather than single persons, it is preferable to use multiple imputation (Rubin, 1976, 1987), which replaces each missing value with a set of plausible values that represent the uncertainty about the right value to impute; multiple imputation has been implemented in version 9 of SAS. Based on item response theory and plausible values, the approach is as follows: (i) after estimation of the item parameters, the conditional probabilities in Equation 7 are computed; (ii) for the persons in the sample who responded by questionnaire, and thus have a missing telephone interview score, a number of plausible telephone interview scores, say five, are drawn from the conditional distribution; and (iii) the imputed data sets are analyzed and the results are combined, generating valid statistical inferences that properly reflect the uncertainty due to missing values.
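The translation and imputation steps (ii) and (iii) can be illustrated as follows, assuming a matrix Q_cond with entries $Q(s_1 \mid s_0)$ (for example from the previous sketch) is available; the function names, the zero-based scores, and the use of NumPy's random generator are choices made for this sketch rather than part of the SAS implementation referred to above.

```python
import numpy as np

rng = np.random.default_rng(2006)  # seed chosen arbitrarily, for reproducibility

def translate_score(s0, Q_cond, how="mean"):
    """Translate a mailed questionnaire sum score s0 to the telephone scale.

    Q_cond[s1, s0] = Q(s1 | s0) from Equation 7; scores are indexed from 0,
    as in the previous sketch.
    """
    col = Q_cond[:, s0]
    if how == "mean":
        return float(np.arange(len(col)) @ col)   # conditional mean
    return int(col.argmax())                      # most likely telephone score

def impute_phone_scores(mail_scores, Q_cond, n_imputations=5):
    """Step (ii): draw plausible telephone scores for questionnaire respondents.

    Returns an array of shape (n_imputations, len(mail_scores)); each row is
    the imputed telephone-scale scores for one completed data set.
    """
    s1_values = np.arange(Q_cond.shape[0])
    draws = np.empty((n_imputations, len(mail_scores)), dtype=int)
    for j, s0 in enumerate(mail_scores):
        draws[:, j] = rng.choice(s1_values, size=n_imputations, p=Q_cond[:, s0])
    return draws

# Step (iii): analyze each completed data set and pool the results with
# Rubin's combining rules (see the Introduction).
```

Drawing from $Q(s_1 \mid s_0)$, rather than substituting the conditional mean, is what preserves the between-imputation variability that the combined inference relies on.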

Example

In 1997 the Danish National Institute of Occupational Health conducted a study of the psychosocial work environment in a general population sample of people between 20 and 60 years of age (Kristensen, Hannerz, Høgh, & Borg, 2005). Persons were randomized to mailed questionnaires (2/3 of the sample) or telephone interviews (1/3 of the sample). Data from 1609 employees are considered here, specifically the responses to four items about influence: "Do you have a large degree of influence concerning your work?", "Do you have a say in choosing who you work with?", "Can you influence the amount of work assigned to you?", and "Do you have any influence on WHAT you do?" (response options: Always, Often, Sometimes, Seldom, Never/hardly ever). Disregarding persons with incomplete item responses, the simple sum score for these four items with five response categories has 17 categories. This sum score is often transformed linearly to a zero to 100 scale when reporting results. For the questionnaire sample the mean score was 54.9 and the standard deviation was 23.1; for the telephone interview sample the mean score was 55.8 and the standard deviation was 25.8. Using the folded F statistic ($\max(s_1^2, s_0^2)/\min(s_1^2, s_0^2)$, where $s_1^2$ and $s_0^2$ are the sample variances in the two samples), the variances were found to differ significantly (p = 0.0017). The means did not differ significantly.

The generalized partial credit model, Equation 1, was fitted to these four items. Inspection of observed and expected mean item scores showed an excellent fit to the model for the mailed questionnaire sample and an acceptable fit for the telephone interview sample (results not shown). The estimated item parameters are shown in Table 1. For all items the thresholds

are closer together for the telephone interview sample, and the discriminations are lower. This indicates that extreme categories are used more often in the telephone interview sample. A possible explanation is difficulty in remembering and distinguishing between five response categories over the telephone, which would also explain why the fit of the model was worse for the telephone interview sample. The dominance of the extreme categories also has the consequence that the thresholds are disordered. An example of item characteristic curves is shown in Figure 1. This difference in response behavior explains why the variance is larger in the telephone interview sample. The likelihood ratio test of the hypothesis of equal item parameters across modes of administration, Equation 4, is highly significant ($\chi^2(20) = 180.5$, p < 0.001). This means that the telephone interview and questionnaire samples should not be collapsed uncritically. Based on these item parameters, the probabilities in the joint distribution of sum scores, Equation 6, were computed; an illustration is shown in Figure 2, where darker shades of grey indicate higher probability.

Two of the main purposes for collecting data like these are surveillance, e.g. comparison of job groups, and epidemiology, e.g. evaluation of the association between the work environment and sickness absence. Table 2 shows the differences between job groups (means, standard deviations, and pairwise t tests) in the two samples. The same trend is seen in both samples, but the differences are only significant in the telephone sample, and this is not due to differences in sample size. It is clear from this that results can depend crucially on mode of administration. Table 3 shows the results of a Poisson regression analysis of the effect of influence on the number of sickness absence days. For the telephone interview sample the result is borderline significant.

Imputation of scores on the telephone interview sample scale for the 1012 persons with a missing telephone interview score (the mailed questionnaire sample) resulted in data sets with 597 + 1012 = 1609 subjects, and the combined results of the analyses of these data sets yielded a significant effect of the same magnitude as the small-sample analysis.
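An analysis of the Table 3 type can be sketched as follows; this is an illustrative sketch, not the SAS analysis used in the study. It assumes the imputed data sets are available as pandas DataFrames with hypothetical column names (absence_days, influence on the 0 to 100 scale, gender, age), fits the Poisson regression with statsmodels, and pools the influence coefficient on the log scale with Rubin's rules.

```python
import numpy as np
import statsmodels.api as sm

def pooled_rate_ratio(imputed_datasets, points=10):
    """Fit the Poisson regression on each imputed data set and pool the
    influence coefficient on the log scale with Rubin's rules.

    Each element of imputed_datasets is assumed to be a pandas DataFrame with
    hypothetical columns: absence_days, influence (0-100 scale), gender, age.
    """
    coefs, variances = [], []
    for data in imputed_datasets:
        exog = sm.add_constant(data[["influence", "gender", "age"]])
        fit = sm.GLM(data["absence_days"], exog,
                     family=sm.families.Poisson()).fit()
        coefs.append(fit.params["influence"])
        variances.append(fit.bse["influence"] ** 2)

    m = len(coefs)
    q_bar = np.mean(coefs)               # pooled log rate ratio per scale point
    u_bar = np.mean(variances)           # within-imputation variance
    b = np.var(coefs, ddof=1)            # between-imputation variance
    t = u_bar + (1 + 1 / m) * b          # total variance (Rubin, 1987)

    # Table 3 reports the effect of a ten point increase in influence.
    # Normal approximation; exact MI inference uses a t reference distribution.
    rr = np.exp(points * q_bar)
    ci = np.exp(points * (q_bar + np.array([-1.96, 1.96]) * np.sqrt(t)))
    return rr, tuple(ci)
```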

Discussion

Although a mixed mode approach may reduce nonresponse bias, more research is required concerning the reasons for response differences between modes and how to eliminate differences caused by problems in data quality. Multiple imputation methods have been proposed as a solution to this problem (Powers et al., 2005). Here, multiple imputation using an item response theory model was discussed and implemented.

Item response theory models are useful for equating between scales using information from persons who have responded to both sets of items. This paper described a situation where no persons responded using both modes of administration, and the population was used as the link between the two measurement scales. The assumption that the $\theta$'s are independent and identically distributed, i.e. that the true levels of the persons are independent of the method used, is crucial. This assumption is reasonable here because of the randomization. The normality assumption could be relaxed. The method of equating sum scores is quite general in the sense that other item response theory models, such as the graded response model, can also be used, and the approach can also be applied to single ordinal items. Multiple imputation methods have been used in item response theory in sampling designs that provide information about population characteristics but administer too few items to estimate the latent trait for individuals (Mislevy, 1991; Mislevy, Beaton, Kaplan, & Sheehan, 1992). In these applications, marginal estimation procedures are approximated by constructing plausible values of the individual latent traits.

The example showed that results can depend critically on mode of administration, and that using multiple imputation to include information can increase the statistical power of the analyses. Using item response theory

models, the differences in response behavior could be tested and interpreted. The multiple imputation based method described and implemented here is mainly of interest when the purpose is statistical analysis rather than the study of individual levels. If the focus is on individuals, e.g. in a clinical setting or in educational testing, the translation based on the joint distribution of sum scores, Equation 6, can still be used, but the conditional mean is likely to be a better prediction than imputation. In this case the illustration of the joint probabilities in Figure 2 can yield information about the specificity of the prediction. Importantly, the method is only feasible for complete case analysis. One solution in the presence of incomplete item responses is to impute item by item; this could, however, cause the variability to be underestimated because the uncertainty in the item parameter estimates is ignored. Further research on the performance and properties of this method is thus warranted.

References

Brambilla, D. J., & McKinlay, S. M. (1987). A comparison of responses to mailed questionnaires and telephone interviews in a mixed mode health survey. American Journal of Epidemiology, 126(5), 962-971.

Kristensen, T. S., Hannerz, H., Høgh, A., & Borg, V. (2005). The Copenhagen Psychosocial Questionnaire: A tool for the assessment and improvement of the psychosocial work environment. Scandinavian Journal of Work, Environment & Health, 31, 438-449.

McHorney, C. A., Kosinski, M., & Ware, J. E. (1994). Comparisons of the costs and quality of norms for the SF-36 Health Survey collected by mail versus telephone interview: Results from a national survey. Medical Care, 32, 551-567.

Mislevy, R. J. (1991). Randomization-based inference about latent variables from complex samples. Psychometrika, 56, 177-196.

Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992). Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement, 29, 133-161.

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16(2), 159-176.

Powers, J. R., Mishra, G., & Young, A. F. (2005). Differences in mail and telephone responses to self-rated health: Use of multiple imputation in correcting for response bias. Australian and New Zealand Journal of Public Health, 29, 149-154.

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581-592.

Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.

Figure 1: An example of item characteristic curves for the item "Do you have a large degree of influence on decisions about your work?" Two panels (mailed questionnaire, telephone interview) show category probability against the latent variable.

Table 1: Estimated parameters for the four items about influence (standard errors in parentheses).

                         Mailed questionnaire          Telephone interview
Item                     Par.        Est. (S.E.)       Par.        Est. (S.E.)
... decisions            α_i^(0)     1.40 (0.13)       α_i^(1)     0.92 (0.12)
                         β_i2^(0)   -1.92 (0.14)       β_i2^(1)   -0.94 (0.23)
                         β_i3^(0)   -1.29 (0.10)       β_i3^(1)   -1.33 (0.21)
                         β_i4^(0)   -0.54 (0.08)       β_i4^(1)   -1.03 (0.18)
                         β_i5^(0)    1.23 (0.09)       β_i5^(1)    0.71 (0.13)
... who you work with    α_i^(0)     0.83 (0.08)       α_i^(1)     0.71 (0.10)
                         β_i2^(0)   -0.51 (0.12)       β_i2^(1)    0.49 (0.24)
                         β_i3^(0)   -0.02 (0.12)       β_i3^(1)   -0.10 (0.21)
                         β_i4^(0)    0.38 (0.12)       β_i4^(1)    0.30 (0.21)
                         β_i5^(0)    1.55 (0.15)       β_i5^(1)    0.68 (0.21)
... amount of work       α_i^(0)     0.54 (0.05)       α_i^(1)     0.53 (0.07)
                         β_i2^(0)   -1.13 (0.19)       β_i2^(1)    0.25 (0.29)
                         β_i3^(0)   -0.19 (0.17)       β_i3^(1)   -0.34 (0.28)
                         β_i4^(0)    0.40 (0.17)       β_i4^(1)   -0.31 (0.26)
                         β_i5^(0)    1.27 (0.20)       β_i5^(1)    0.45 (0.24)
... what you do          α_i^(0)     1.35 (0.14)       α_i^(1)     1.04 (0.16)
                         β_i2^(0)   -1.45 (0.11)       β_i2^(1)   -0.18 (0.24)
                         β_i3^(0)   -0.85 (0.09)       β_i3^(1)   -1.37 (0.22)
                         β_i4^(0)   -0.40 (0.08)       β_i4^(1)   -0.59 (0.15)
                         β_i5^(0)    0.66 (0.07)       β_i5^(1)    0.39 (0.12)

Figure 2: Illustration of the probabilities in the joint sum score distribution (darker shades of grey indicate higher probability). Both axes show the sum score on the 0 to 100 scale.

Table 2: Differences between job groups: means, standard deviations, and pairwise t tests in the two samples.

                    Telephone sample                  Questionnaire sample
                    N    Mean (S.D.)   t test         N     Mean (S.D.)   t test
Trade workers       44   62.5 (25.3)   -              51    54.7 (21.3)   -
Industry workers    62   48.1 (24.2)   p = 0.0037     71    50.1 (21.6)   p = 0.2487
Office workers      82   44.0 (26.4)   p = 0.0002     170   47.3 (21.4)   p = 0.0324

Table 3: The effect of influence on sickness absence: Poisson regression controlled for gender and age. Multiple imputation imputes scores on the telephone sample scale for the questionnaire sample.

                                      N      RR*    95% CI
Influence, telephone sample           597    0.96   (0.92, 1.00)
Influence, multiple imputation        1609   0.96   (0.93, 0.98)

*Rate ratios show the effect of a ten point increase in influence.