Interchangeability of the EQ-5D and the SF-6D in Long-Lasting Low Back Pain

Volume 12 Number 4 2009 VALUE IN HEALTH

Interchangeability of the EQ-5D and the SF-6D in Long-Lasting Low Back Pain

Rikke Søgaard, MSc, MPH, PhD,1 Finn Bjarke Christensen, MD, PhD, DMSc,2 Tina Senholt Videbæk, MD,3 Cody Bünger, MD, DMSc,3 Terkel Christiansen, MSc4

1 CAST (Centre for Applied Health Services Research and Technology Assessment), Institute for Public Health, University of Southern Denmark, Odense, Denmark; 2 Orthopaedic Division, Aalborg Hospital, Aarhus University Hospital, Aarhus, Denmark; 3 Spine Unit, Aarhus Hospital, Aarhus University Hospital, Aarhus, Denmark; 4 Health Economics Unit, Institute for Public Health, University of Southern Denmark, Odense, Denmark

ABSTRACT

Objectives: The objective of this study was to investigate the interchangeability of the EuroQol 5D (EQ-5D) and the Short Form 6D (SF-6D) in individuals with long-lasting low back pain, to guide the optimal choice of instrument and to inform decision-makers about any between-measure discrepancy, which requires careful interpretation of the results of cost-utility evaluations.

Methods: A cross-sectional study was conducted across 275 individuals who had undergone spinal surgery on indication of chronic low back pain. EQ-5D and SF-6D were mailed to respondents for self-completion. Statistical analysis of between-measure agreement (using English weights) was based on Bland and Altman's limits of agreement and a series of linear regressions.

Results: A moderate mean difference of 0.085 (SD 0.241) was found, but because it masked more severe bidirectional variation, the expected variation between observations of EQ-5D and SF-6D in future studies was estimated at 0.546. The EQ-5D's N3 term alone explained a factor of 0.79 of the variation in between-measure differences, while the explanatory value of adding variables of age, sex, diagnosis, previous surgery, and occupational status was basically zero.
A final model including only dummy variables for the N3 term and five identified framing effects explained a factor of 0.86 of the variation in between-measure differences.

Conclusions: Although the EQ-5D and the SF-6D are both psychometrically valid for generic outcome assessment in long-lasting low back pain, it appears that they cannot generally be used interchangeably for measurement of preference values. Sensitivity analysis examining the impact of between-measure discrepancy thus remains a necessary condition for the interpretation of the results of cost-utility evaluations.

Keywords: EQ-5D, health-related quality of life, low back pain, SF-6D.

Introduction

Preference-based, generic outcome measurement has become a fundamental part of cost-utility evaluation, ultimately informing priority setting in health care. Under the objective of quantifying individual preferences, a preference-based measure is usually defined by two components: a descriptive instrument for the classification of health and an ancillary scoring algorithm for the assignment of a preference value. Alternative measures have become available in recent years, and a necessary condition for their claimed interchangeability is that they produce similar guidance. Several studies have demonstrated a case for concern, with different instruments producing different preference values [1–10].

One of the most widely used measures is the EuroQol 5D (EQ-5D), which was proposed by the EuroQol group in 1990 [11]. The current version includes dimensions of mobility, self-care, usual activity, pain/discomfort, and anxiety/depression, each with three levels of function and thus producing a total of 243 health states (245 when the states "unconscious" and "immediate death" are added for completeness). The first valuation study was conducted by the Measurement and Valuation of Health group in York, using the time-trade-off (TTO) technique in a representative sample of the UK general population [12].
Preference values of 42 health states were elicited in the valuation study, and these were then used to model the remaining health states [13]. More recently, similar studies were conducted in Spain [14], Denmark [15], Japan [16], Zimbabwe [17], Germany [18], United States [19], and The Netherlands [20].

Address correspondence to: Rikke Søgaard, CAST (Centre for Applied Health Services Research and Technology Assessment), Institute for Public Health, University of Southern Denmark, J.B. Winsløwsvej 9B, DK-5000 Odense, Denmark. E-mail: ris@cast.sdu.dk. doi: 10.1111/j.1524-4733.2008.00466.x

The Short Form 6D (SF-6D) is an alternative measure that came about in 1998 as a result of Brazier et al.'s conceptual restructuring of the SF-36 [21]. The current version includes dimensions of physical, role limitations, social, pain, mental health, and vitality, each with 4 to 6 levels of function and altogether producing a total of 18,000 health states. The SF-6D has been valued in a representative sample of the UK general population, using the standard gamble (SG) technique, where preference values for a sample of 249 health states were elicited and then modeled into the final scoring algorithm [22].

Psychometric performances of the two measures have been examined specifically for low back pain. Properties of construct validity, reliability, and practicality were established in separate studies for each of the two measures upon comparison with widely used and validated disease-specific measures [23,24]. For the EQ-5D, the property of responsiveness was concluded as well. Head-to-head comparisons of the two measures have also been reported, demonstrating that the EQ-5D is perhaps better at discriminating between individuals in poor health due to a wider scale [9].

While psychometric properties are main concerns of classical measurement theory, economic theory is, in addition, concerned with the extent to which individual preferences are quantified [25].
A gold standard for individual preferences would be revealed preferences, but these are not viable, as one usually cannot choose what health state to be in. Therefore, the investigation of economic validity is, to a wide extent, reduced to one of theoretical validity and empirical agreement between measures.

© 2008, International Society for Pharmacoeconomics and Outcomes Research (ISPOR) 1098-3015/09/606 606–612

Specific to populations suffering low back pain, there is some evidence for a discrepancy between the EQ-5D and the SF-6D. In acute pain, a mean difference of 0.022 was demonstrated, but the investigators concluded that it masked more severe discrepancy relating to different distributions of values [1]. In chronic pain, a mean difference of 0.180 was demonstrated in a large-scale study (n = 2097) among patients referred for surgery [9]. The agreement between measures in a post-treatment population remains unknown.

The etiology of a difference between measures remains uncertain despite the fact that many studies have investigated empirical and theoretical aspects. Not only is it uncertain whether the instruments' descriptive components measure identical constructs; their valuation components have also been derived using different methodologies. For example, a crossover effect has been demonstrated between the TTO technique, which is used for valuing EQ-5D health states, and the SG technique, which is used for valuing SF-6D health states, with the former underestimating values for more severe health states and the latter overestimating values for milder health states [26]. In addition, the modeling of health states not directly valued has been conducted using different econometric models and, in turn, has resulted in different scoring algorithms. For both measures, the scoring model includes an interaction term, which is a simple dummy variable taking on the value of 1 if a respondent scores the worst level in any dimension. Thus, the EQ-5D model includes an N3 term with a coefficient of -0.269 (using English weights), and the SF-6D model includes a MOST term with a coefficient of -0.032.
The obvious difference between coefficients has been suggested to account for some discrepancy between measures [1,27].

In the light of current practice, where measures are used somewhat interchangeably despite conceptual and methodological differences, the contribution of the present work is to examine application-specific interchangeability, which has been pointed out by leading outcome researchers as an important next step [28]. This has been undertaken already in a population suffering acute pain [1] and in a preoperative population suffering chronic pain [9]; this work adds information about interchangeability in a postoperative population, which is important because this population is not restored to full health and thus represents a disease-specific population different from a preoperative or a healthy population.

The hypotheses underlying this study, overall, follow the idea that this population, who have had at least one event of spine surgery on indication of long-lasting low back pain, is not a special case, i.e., existing evidence on the measurement of preference values will hold in this population, too. It is expected from the literature that EQ-5D will demonstrate a ceiling effect, while SF-6D will demonstrate a floor effect as shown by, among others, McDonough et al. [9]. It is also expected that SF-6D will overestimate preference values relative to the EQ-5D, while the latter will show a wider dispersion, as has been shown by several studies, though only in different diseases or disease severities [1–10]. Perhaps a more innovative hypothesis is that the difference between the instruments' preference values is a function of framing effects, i.e., that preference values of individual measures cannot take on a value higher or lower than some ceiling or floor level.
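The interaction terms described in the Introduction can be illustrated with a minimal sketch. Only the two coefficients (-0.269 for N3, -0.032 for MOST) come from the text; the helper name, the example profiles, and the restriction to the interaction term alone (no other tariff coefficients) are assumptions made for illustration.

```python
# Minimal sketch of the "worst level in any dimension" dummy.
# Only the two coefficients are taken from the article text;
# the function name and the example profiles are hypothetical.
N3_COEF = -0.269    # EQ-5D N3 term (English weights), as quoted in the text
MOST_COEF = -0.032  # SF-6D MOST term, as quoted in the text

# Worst level per EQ-5D dimension (three levels each).
EQ5D_WORST = (3, 3, 3, 3, 3)


def worst_level_dummy(profile, worst_levels):
    """Return 1 if any dimension sits at its worst level, else 0."""
    return int(any(lvl == worst for lvl, worst in zip(profile, worst_levels)))


# Hypothetical EQ-5D profiles: (mobility, self-care, usual activities,
# pain/discomfort, anxiety/depression).
mild = (1, 2, 2, 2, 1)
severe = (1, 2, 2, 3, 1)  # differs only by level 3 in pain/discomfort

for profile in (mild, severe):
    n3 = worst_level_dummy(profile, EQ5D_WORST)
    contribution = N3_COEF if n3 else 0.0
    print(profile, "N3 =", n3, "interaction contribution =", contribution)

# Gap between the two interaction coefficients.
print("N3 minus MOST coefficient:", round(N3_COEF - MOST_COEF, 3))
```

The sketch shows only why a single level-3 response can move the EQ-5D index far more than the corresponding SF-6D index; it does not reproduce either scoring algorithm.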
The objective of this study was to investigate the interchangeability of the EQ-5D and the SF-6D in individuals with long-lasting low back pain, to guide the optimal choice of instrument and to inform decision-makers about any between-measure discrepancy, which requires careful interpretation of the results of cost-utility evaluations.

Methods

Material and Study Design

A sample of 275 individuals was identified from two randomized, controlled trials, with each trial investigating components of the optimal surgical technique for lumbar spinal fusion [29,30]. Participants in the first trial (n = 129) had surgery from 1992 through 1994, and patients in the second trial (n = 146) had surgery from 1996 through 2000. Long-term follow-up was conducted in 2005 as a postal survey using the SF-36 (Danish version 1.1) and the EQ-5D (Danish version 1, including the visual analog scale valuation component). Each participant received up to two mailings, including a cover letter describing the clinical long-term follow-up but not the present study. In case of no response to mailings, individuals were phoned and asked to participate in a telephone interview. For the present study, the long-term follow-up represents a cross-sectional study, i.e., population characteristics of age, sex, and occupation, and health status refer to the point of measurement in 2005.

Statistical Analysis

Distribution of responses across dimensions and levels and possible framing effects were examined using simple frequency tables with valid percentages. Differences between single indices were examined using conventional summary statistics and intraclass correlation based on a two-way random effects model. Agreement between measures was investigated using Bland and Altman plots with the difference between single indices on the y-axis and their average on the x-axis. The expected variation for any future pair of observations was estimated by 95% limits of agreement (mean difference ± 1.96 SD) [31].
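The limits-of-agreement calculation can be sketched as follows. This is the plain, unadjusted form (mean difference ± 1.96 SD); the function name and the synthetic data are assumptions for illustration and do not reproduce the study data or the trend adjustment discussed in the text.

```python
import numpy as np


def limits_of_agreement(a, b):
    """Bland-Altman mean difference and 95% limits of agreement for a - b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diff = a - b
    mean_diff = diff.mean()
    sd = diff.std(ddof=1)  # sample standard deviation of the differences
    return mean_diff, mean_diff - 1.96 * sd, mean_diff + 1.96 * sd


# Synthetic paired indices (illustrative only, not the study data).
rng = np.random.default_rng(0)
index_a = rng.uniform(0.3, 1.0, size=200)
index_b = index_a - 0.05 + rng.normal(0.0, 0.1, size=200)

mean_diff, lower, upper = limits_of_agreement(index_a, index_b)
print(f"mean difference: {mean_diff:.3f}")
print(f"95% limits of agreement: {lower:.3f} to {upper:.3f}")
print(f"width of the interval: {upper - lower:.3f}")

# Points for a Bland-Altman plot: averages on x, differences on y.
averages = (index_a + index_b) / 2.0
differences = index_a - index_b
```

The width of the interval is the quantity the article refers to as the expected variation for a future pair of observations.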
A key assumption for this approach is that differences are unrelated to averages, whereas in these data, differences tended to be positive for low preference values and negative for high preference values. Conventional log transformation was ineffective in removing heteroskedasticity, and other simple transformations (e.g., reciprocals or squares) would not allow results to be interpreted in relation to the original data. Simple ordinary least squares (OLS) regression was therefore used to adjust for trends before estimating limits of agreement. This is in accordance with Bland and Altman's more recent recommendations [32].

To explore the nature of between-measure discrepancy, a series of linear OLS regressions using White's heteroskedasticity-robust standard errors was conducted. Two models were established, mainly to be able to relate this material to that of other studies and, in particular, to that of Brazier et al. [1], from whom we adapted the following specifications:

$$\mathrm{SF6D} = \alpha + \beta_1\,\mathrm{EQ5D} + \mu \tag{1}$$

$$\mathrm{SF6D} = \alpha + \beta_1\,\mathrm{EQ5D} + \beta_2\,\mathrm{N3} + \mu \tag{2}$$

where SF6D and EQ5D are the two single indices, and N3 is the dummy variable taking on the value 1 if a respondent scores the worst level in any dimension of the EQ-5D. These models are informative in relating this material to that of others, but they are not optimal for examining the nature of between-measure discrepancy, because the coefficients change depending on which measure is chosen as the dependent variable. To give equal status to the two single indices when examining the nature of discrepancy, an index model of differences regressed on the N3 term was established:

$$(\mathrm{SF6D} - \mathrm{EQ5D}) = \alpha + \beta_2\,\mathrm{N3} + \mu \tag{3}$$

This model was extended from two further ideas: to explore whether any disagreement would be more or less severe in some clinically relevant subgroups and to explore the extent to which the framing effects of the two measures would explain discrepancy. The first idea was examined in the following model:

$$(\mathrm{SF6D} - \mathrm{EQ5D}) = \alpha + \beta_2\,\mathrm{N3} + \beta_3\,\mathrm{AGE} + \beta_4\,\mathrm{FML} + \beta_5\,\mathrm{OLIST} + \beta_6\,\mathrm{PREV} + \beta_7\,\mathrm{LABOR} + \mu \tag{4}$$

where AGE is a continuous variable, FML is a dummy for female (vs. male), OLIST is a dummy for spondylolisthesis (vs. degenerative disc disease), PREV is a dummy for previous spine surgery (vs. no previous spine surgery), and LABOR is a dummy for participation in the labor market (vs. retired or pensioned). The selection of variables was based on their association with outcome after spine surgery (evidence from the clinical literature) and data availability. There is no gold standard for the prediction of outcome after spine surgery, but it is well established that the variables put forward here play key roles.

The second idea was examined in the following models:

$$(\mathrm{SF6D} - \mathrm{EQ5D}) = \alpha + \beta_2\,\mathrm{N3} + \beta_8\,\mathrm{CMOB} + \beta_9\,\mathrm{CSELF} + \beta_{10}\,\mathrm{CANX} + \mu \tag{5}$$

$$(\mathrm{SF6D} - \mathrm{EQ5D}) = \alpha + \beta_2\,\mathrm{N3} + \beta_8\,\mathrm{CMOB} + \beta_9\,\mathrm{CSELF} + \beta_{10}\,\mathrm{CANX} + \beta_{11}\,\mathrm{CSOC} + \mu \tag{6}$$

$$(\mathrm{SF6D} - \mathrm{EQ5D}) = \alpha + \beta_2\,\mathrm{N3} + \beta_8\,\mathrm{CMOB} + \beta_9\,\mathrm{CSELF} + \beta_{10}\,\mathrm{CANX} + \beta_{11}\,\mathrm{CSOC} + \beta_{12}\,\mathrm{FROL} + \mu \tag{7}$$

where CMOB, CSELF, CANX, and CSOC are dummies for ceiling effects, i.e., 1 indicates a respondent scoring the best level in the respective EQ-5D dimensions of mobility, self-care, or anxiety/depression, or the SF-6D dimension of social. Similarly, FROL is a dummy for a floor effect in the SF-6D dimension of role limitation. The selection of variables for models (5) to (7) was based on the framing effects identified in the first part of the present empirical analysis. It follows for models (3) to (7) that dummy variables with positive coefficients are, on average, associated with the SF-6D overestimating preference values relative to the EQ-5D, and vice versa. A significance level of 0.05 and two-sided tests were used.
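A regression of the form of model (3) with heteroskedasticity-robust standard errors can be sketched as below. The data are synthetic, the function name is hypothetical, and the covariance estimator is the basic White (HC0) sandwich form; statistical packages may apply a small-sample correction, so their reported standard errors can differ slightly from this sketch.

```python
import numpy as np


def ols_white(X, y):
    """OLS coefficients with White (HC0) heteroskedasticity-robust SEs."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    # Sandwich estimator: (X'X)^-1 X' diag(e^2) X (X'X)^-1
    meat = X.T @ (X * resid[:, None] ** 2)
    cov = xtx_inv @ meat @ xtx_inv
    return beta, np.sqrt(np.diag(cov))


# Synthetic data (illustrative only): a dummy regressor and a difference
# whose noise is larger when the dummy equals 1.
rng = np.random.default_rng(1)
n = 200
n3 = (rng.random(n) < 0.3).astype(float)
diff = 0.03 + 0.44 * n3 + rng.normal(0.0, 1.0, n) * (0.05 + 0.10 * n3)

X = np.column_stack([np.ones(n), n3])  # intercept and N3 dummy
beta, se = ols_white(X, diff)
print("alpha  =", round(beta[0], 3), "SE", round(se[0], 3))
print("beta_2 =", round(beta[1], 3), "SE", round(se[1], 3))
```

With a single dummy regressor the intercept equals the mean difference in the N3 = 0 group and the slope equals the gap between the two group means, which is why a positive coefficient reads directly as SF-6D overestimation relative to the EQ-5D.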
Analyses were conducted using STATA version 9.0 (StataCorp, College Station, TX). The Danish Data Protection Agency approved this study, and no other ethical approvals were relevant.

Results

Characteristics of the Study Population

Table 1 details some characteristics of the study population: mean age was 54 years (SD 9), 54% were female, 51% were pensioned, 33% were working, 13% were without a job, and 4% were on sick leave. The study population expressed a poorer health status across all dimensions of the SF-36, as well as in the two standardized summary scales (physical component summary and mental component summary), as compared with the general population.

Table 1 Sociodemographics and health status of the study population as compared with the general population

| | Study population (n = 275) | Participants with complete data (n = 198) | General adult population* |
|---|---|---|---|
| Age, mean (SD) [min; max] | 54 (10) [24; 79] | 54 (10) [33; 74] | NA |
| Females, number of cases (%) | 148 (54) | 41 (53) | NA |
| Indication for spinal fusion, number of cases (%) | | | |
| Isthmic spondylolisthesis | 86 (31) | 24 (31) | NA |
| Disk degeneration | 189 (69) | 53 (69) | NA |
| Previous surgery (before index), number of cases (%) | 93 (33) | 24 (31) | NA |
| Occupational status, number of cases (%) | | | |
| Working | 73 (33) | 3 (10) | NA |
| Without job | 28 (13) | 3 (10) | NA |
| On sick leave | 9 (4) | 2 (7) | NA |
| Pensioned | 114 (51) | 22 (73) | NA |
| Single dimensions of Short Form 36, mean [median] (SD) | | | |
| Physical | 65 [65] (25) | 64 [65] (25) | 88 [95] (20) |
| Role physical | 42 [25] (43) | 42 [25] (43) | 83 [100] (31) |
| Bodily pain | 51 [51] (21) | 50 [51] (28) | 79 [84] (23) |
| General health | 57 [57] (25) | 56 [55] (26) | 76 [82] (20) |
| Vitality | 54 [55] (26) | 53 [55] (26) | 70 [75] (20) |
| Social | 77 [88] (27) | 76 [88] (27) | 91 [100] (17) |
| Role emotional | 64 [100] (44) | 64 [100] (43) | 86 [100] (28) |
| Mental health | 74 [80] (21) | 74 [80] (22) | 82 [85] (16) |
| Standardized scales of Short Form 36, mean [median] (SD) | | | |
| Physical component summary | 38 [37] (11) | 43 [45] (10) | 51 [54] (9) |
| Mental component summary | 51 [54] (12) | 51 [55] (11) | 54 [56] (8) |

*Values are summary statistics adapted from Bjorner et al. (1997) [35]. Ware et al. (1993) [36]. Ware et al. (1995) [37]. NA, not applicable.

Practicality

The rates of response (unit response) and completion (item response) are typically considered when determining the practicality of a measure. In the present study, unit nonresponses were 16.7% and 17.1% for the EQ-5D and the SF-36, respectively, while item nonresponses were 4.8% and 20.6%. Nevertheless, when the item nonresponse for the SF-36 was restricted to the sample of items sufficient to derive the SF-6D, the item nonresponse was reduced to 12.7%. Altogether, unit and item nonresponses led to response rates in single indices of 79.3% for the EQ-5D and 74.2% for the SF-6D.

One hundred ninety-eight individuals (72%) had complete data on both EQ-5D and SF-6D. This group was compared with the group of nonrespondents as well as respondents with missing data. There were no statistical differences in age, sex, diagnosis, or events of previous surgery (full information was available on both respondents and nonrespondents). In terms of occupational status, there was an indication of a difference between groups, i.e., individuals with full information were less likely to be working and more likely to be pensioned than those with sporadic item nonresponse. There seemed to be no difference across the eight SF-36 dimensions, but that was testable for only about half of the (item) nonrespondents.

Descriptive Classifications

Both measures produced good dispersions of responses across levels, although moderate ceiling effects were indicated. This is detailed in Tables 2 and 3, with 54.6%, 74.4%, and 65.9% of responses clustering in level 1 of three EQ-5D dimensions (mobility, self-care, anxiety/depression), and 47.9% of responses clustering in level 1 of one SF-6D dimension (social).

Table 2 Distribution of EuroQol 5D responses (n = 218) across levels of single dimensions (%)

| Level | Mobility | Self-care | Usual activities | Pain/discomfort | Anxiety/depression |
|---|---|---|---|---|---|
| 1 | 54.6 | 74.4 | 29.5 | 16.1 | 65.9 |
| 2 | 45.0 | 24.7 | 54.2 | 65.0 | 30.0 |
| 3 | 0.4 | 0.9 | 16.3 | 18.8 | 4.0 |

Table 3 Distribution of Short Form 6D responses (n = 204) across levels of single dimensions (%)

| Level | Physical | Role limitation | Social | Pain | Mental health | Vitality |
|---|---|---|---|---|---|---|
| 1 | 8.0 | 29.8 | 47.9 | 10.4 | 37.3 | 7.3 |
| 2 | 16.5 | 26.6 | 20.5 | 13.5 | 30.9 | 33.8 |
| 3 | 33.0 | 6.0 | 21.5 | 24.8 | 18.6 | 24.7 |
| 4 | 7.6 | 37.6 | 8.2 | 20.7 | 9.5 | 19.2 |
| 5 | 30.4 | NA | 1.8 | 18.0 | 3.6 | 15.1 |
| 6 | 4.5 | NA | NA | 12.6 | NA | NA |

NA, not applicable.

The dimensions traditionally considered most important in low back pain (pain and physical for the SF-6D, and pain/discomfort and usual activities for the EQ-5D) showed modal responses in middle levels and dispersions across the full ranges of levels.

A ceiling effect in the EQ-5D was further examined as detailed in Table 4. In case of perfect interchangeability, respondents in full health according to the EQ-5D would score full health according to the SF-6D as well. This was observed for more than 86% of the respondents (n = 28) when responses distributed in level 1 or level 2 were taken to indicate agreement (SF-6D has up to twice as many levels as the EQ-5D; hence, level 2 was accepted in addition to level 1). None of the respondents reported the poorest function in any of the first four dimensions, whereas for the dimensions of mental health and vitality, responses were spread in a peculiar right-skewed distribution. It is thus possible that the EQ-5D suffers from a ceiling effect in dimensions analogous to mental health and vitality, i.e., anxiety/depression.

Table 4 Distribution of respondents scoring full health according to the EuroQol 5D (n = 28) across levels of the Short Form 6D (%)

| Level | Physical | Role limitation | Social | Pain | Mental health | Vitality |
|---|---|---|---|---|---|---|
| 1 | 42.9 | 85.7 | 82.8 | 65.5 | 79.3 | 36.7 |
| 2 | 42.9 | 10.7 | 10.3 | 20.7 | 13.8 | 50.0 |
| 3 | 7.1 | 3.6 | 6.9 | 13.8 | 0.0 | 3.3 |
| 4 | 7.1 | 0.0 | 0.0 | 0.0 | 0.0 | 3.3 |
| 5 | 0.0 | NA | 0.0 | 0.0 | 6.9 | 6.7 |
| 6 | 0.0 | NA | NA | 0.0 | NA | NA |

NA, not applicable.

A possible ceiling effect in the SF-6D was weakly indicated in one dimension (social), but, overall, only five respondents scored full health, and they all scored full health in the EQ-5D as well. A possible floor effect in the SF-6D was also weakly indicated in one dimension (role limitation), but, overall, zero respondents scored the poorest health status.
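The level structure shown in Tables 2 and 3 reproduces the health-state counts quoted in the Introduction (243 for the EQ-5D, 18,000 for the SF-6D). The short check below is a sketch; the dictionary names are hypothetical, and the number of levels per SF-6D dimension is read off the pattern of NA cells in Table 3.

```python
from math import prod

# Levels per dimension, read from the tables in the text:
# EQ-5D has three levels in each of five dimensions (Table 2);
# SF-6D levels follow the NA pattern of Table 3.
eq5d_levels = {
    "mobility": 3,
    "self-care": 3,
    "usual activities": 3,
    "pain/discomfort": 3,
    "anxiety/depression": 3,
}
sf6d_levels = {
    "physical": 6,
    "role limitation": 4,
    "social": 5,
    "pain": 6,
    "mental health": 5,
    "vitality": 5,
}

# The number of distinct health states is the product of levels.
eq5d_states = prod(eq5d_levels.values())
sf6d_states = prod(sf6d_levels.values())

print("EQ-5D health states:", eq5d_states)
print("SF-6D health states:", sf6d_states)
```

The much finer SF-6D classification is one reason the text accepts level 1 or level 2 of an SF-6D dimension as agreement with level 1 of the EQ-5D.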
Agreement between EQ-5D and SF-6D Single Indices

The distribution of EQ-5D values was bimodal, with one concentration of values in the region of zero and another in the region between 0.5 and full health, whereas the distribution of SF-6D values was approximately normal. For these reasons, the two measures resulted in significantly different summary statistics, as listed in Table 5: the mean value of the SF-6D was significantly higher than that of the EQ-5D, whereas the variation across observations was significantly greater for the latter. The intraclass correlation coefficient was 0.553, which is usually taken to indicate moderate agreement. Figure 1 graphically presents the discrepancy between measures in a Bland and Altman plot. Clearly, the difference between measures was associated with the average value of health, i.e., the poorer the health, the larger the SF-6D value relative to the EQ-5D value. The limits of agreement represent a 95% prediction interval; hence, the interpretation is that, if the range between the limits is not relevant to decision-making, the two measures can be used interchangeably. The limits of agreement were -0.188 to 0.358 for a mean difference of 0.085, which corresponds to an expected variation of 0.546 for any pair of future observations.

Table 5 Summary statistics and intraclass correlation between EuroQol 5D and Short Form 6D single indices

                   EQ-5D (n = 218)  SF-6D (n = 204)  Difference (n = 198)
Mean               0.583            0.677            0.085
Median             0.691            0.667            0.011
SD                 0.346            0.152            0.241
Min                -0.59            0.33             -0.33
Max                1.00             1.00             0.94
10th percentile    -0.016           0.471            -0.156
90th percentile    1.000            0.852            0.495

Intraclass correlation coefficient (95% CI): 0.553 (0.421; 0.658)
CI, confidence interval.

Figure 1 Agreement between Short Form 6D and EuroQol 5D single indices in 198 individuals with long-lasting low back pain. Bland and Altman plot with limits of agreement (mean difference ± 1.96 SD) showing an expected between-measure variation of 0.546. Axes: average (x) and difference, SF-6D - EQ-5D (y).

Relationship between Single Indices

Two models of the relationship between measures are listed in Table 6. As expected, these simple models show a positive, proportional relationship, where the inclusion of the N3 term markedly shifts b1 up (and the intercept down). The following thus assumes that the N3 term is a confounder, and since our interest is in interchangeability rather than in predicting values, the models are specified with the difference between single indices as the dependent variable rather than giving unequal status to either of the two indices.

Differences between Single Indices as a Function of Framing Effects

Five models of the difference between single indices are listed in Table 7. Interestingly, the index model including only the N3 term explained 79% of the variation in the differences (R² = 0.79) and further demonstrated how the average difference in relatively poor health (N3 = 1) was about 0.47, whereas across respondents in relatively good health (N3 = 0), it was only about 0.03. Model (4) examined whether this discrepancy could be more or less severe in subgroups with characteristics that are well known to be associated with clinical outcomes. In contrast to our hypothesis, the additional value of including the explanatory variables of age, sex, diagnosis, previous surgery, and participation in the labor force amounted to practically zero, i.e., there was no indication of a differentiated (lack of) interchangeability across subgroups on top of what was already captured by the N3 term. Model (5) illustrates the relationship between differences and the ceiling effects of the EQ-5D that were suggested by the first part of our analysis.
The ceiling effects were specified as dummy variables taking on the value of 1 if a respondent had scored the best level in the respective dimension. The coefficient of determination (R²) increased from 0.79 in model (3) to 0.82 in model (5), although only one of the explanatory variables was significant: a ceiling effect in the self-care dimension, with a coefficient of, on average, -0.10. Model (6) additionally included a dummy for the anticipated ceiling effect of the SF-6D dimension social. That raised the explanatory power from 0.82 to 0.85, and the full set of explanatory variables became significant. The ceiling effects in the EQ-5D dimensions all had negative coefficients, with averages of -0.06, -0.11, and -0.03, whereas the anticipated ceiling effect of the SF-6D in social had a positive coefficient of 0.10. The final model (7) further increased the explanatory power to 0.86 by including the possible floor effect of the SF-6D in role limitation, which was significant, with a coefficient of, on average, -0.06. Each of models (4) to (7) was also run with interaction terms, but because none of these were significant, they were excluded again.

Table 6 Relationship between EuroQol 5D and Short Form 6D single indices: coefficients of ordinary least squares regressions with White's standard errors in brackets

SF-6D       Model (1)     Model (2)
a           0.47 (0.01)   0.30 (0.04)
b1 EQ-5D    0.35 (0.02)   0.56 (0.05)
b2 N3                     0.19 (0.04)
R²          0.60          0.66
Root MSE    0.10          0.09

All coefficients are significant at the 0.05 level. N3 is a dummy variable taking on the value 1 if a respondent scores the worst level in any of the EQ-5D dimensions. MSE, mean squared error.
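With a single dummy regressor, the ordinary least squares intercept equals the mean of the dependent variable in the reference group (N3 = 0), and the intercept plus the slope equals the mean in the N3 = 1 group. This is how the model (3) coefficients in Table 7 (a = -0.03, b2 = 0.50) translate into an average difference of about 0.47 when N3 = 1. The sketch below illustrates the identity in plain Python; the helper `simple_ols` and the data are made up for illustration and are not the study data or the authors' estimation code (which also used White's standard errors, not shown here).

```python
def simple_ols(x, y):
    """Closed-form OLS for y = intercept + slope * x (one regressor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up differences (SF-6D minus EQ-5D) and N3 dummy, for illustration only.
diff = [0.02, -0.05, 0.00, -0.09, 0.45, 0.50, 0.46]
n3 = [0, 0, 0, 0, 1, 1, 1]

a, b2 = simple_ols(n3, diff)

# Group means for comparison with the regression coefficients.
mean_n3_0 = sum(d for d, flag in zip(diff, n3) if flag == 0) / n3.count(0)
mean_n3_1 = sum(d for d, flag in zip(diff, n3) if flag == 1) / n3.count(1)

print(round(a, 2), round(a + b2, 2))
```

The illustrative data were chosen so that the fitted intercept and slope match the rounded model (3) coefficients; any other data would show the same identity between coefficients and group means.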
Conventional regression diagnostics were conducted to examine the validity of the models; in particular, the extent of under-specification was examined using the Ramsey RESET test (Regression Equation Specification Error Test), which did not reject the null hypothesis that the models were consistent with the data (F = 2.01 for model [4] and F ≤ 1.20 for models [5] to [7]). Normality of residuals was assessed by inspection of scatter plots of residuals versus each of the determinants, scatter plots of residuals versus fitted values, and histograms of residuals. Except for one participant, who demonstrated somewhat extreme values (participant no. 248, female, 43 years, SF-6D = 0.35, EQ-5D = -0.594), all of the diagnostic plots showed symmetric distributions. A secondary analysis without participant no. 248 did not alter the primary findings of the expected variation between measures for future observations or the significance of differences being a function of framing effects.

Discussion

The present study investigated the interchangeability of the EQ-5D and the SF-6D for the assessment of preference values in long-lasting low back pain. The two measures produced significantly different mean values, with an average difference of 0.085. A differential of this size need not be vital for decision-making, but because it masks more severe bidirectional variation, it could be larger in other applications. The expected variation for observations in future studies was estimated at 0.546, and unless such a differential is irrelevant to decision-making, the SF-6D and the EQ-5D cannot generally be used interchangeably in individuals suffering from long-lasting low back pain.

We found moderate evidence for a ceiling effect of the EQ-5D in the dimension analogous to the SF-6D's dimensions of mental health and vitality. These findings are consistent with those of Brazier et al. [1], except that they additionally identified a floor effect of the SF-6D, which we did not.
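The expected variation of 0.546 quoted in this discussion is the width of the Bland and Altman limits of agreement, i.e., the mean difference plus and minus 1.96 standard deviations of the paired differences. The following sketch shows the calculation in plain Python on hypothetical paired values (not the study data), and then checks that the reported limits of -0.188 and 0.358 are centered on the reported mean difference of 0.085. The function name `limits_of_agreement` is an assumption made for this illustration.

```python
from statistics import mean, stdev


def limits_of_agreement(first, second, z=1.96):
    """Return (mean difference, lower limit, upper limit) for paired values."""
    diffs = [x - y for x, y in zip(first, second)]
    d = mean(diffs)
    s = stdev(diffs)  # sample standard deviation of the differences
    return d, d - z * s, d + z * s


# Hypothetical paired single-index values (not study data).
sf6d = [0.60, 0.70, 0.55, 0.80, 0.65]
eq5d = [0.50, 0.75, 0.20, 0.85, 0.60]
mean_diff, lower, upper = limits_of_agreement(sf6d, eq5d)

# Consistency check on the limits reported in the text.
reported_lower, reported_upper = -0.188, 0.358
width = reported_upper - reported_lower           # expected variation
midpoint = (reported_upper + reported_lower) / 2  # implied mean difference
print(round(width, 3), round(midpoint, 3))
```

The sketch uses the simple constant-limits form; it does not model the dependence of the difference on the average that Figure 1 displays.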
Our results are also comparable to those of McDonough et al. [9], except for a minor discrepancy concerning the extent of a ceiling effect; in the present study, it was present in the dimensions of mobility, self-care, and anxiety/depression, whereas in the study by McDonough et al., it was limited to the two latter dimensions. We consider these discrepancies attributable to the different levels of disease severity.

Underlying the differential between single indices, the SF-6D produced higher values for poorer health states and the EQ-5D tended to produce higher values for better health states. This is consistent with previous studies in low back pain, where distributions of the EQ-5D single index have been characterized by a negative skew with two spikes, whereas SF-6D distributions have appeared approximately normal [1,6,9]. The bimodal distribution of EQ-5D values may be related to the N3 term. McDonough et al. [9] commented in their discussion that the question remains as to whether the N3 term appropriately describes or inappropriately exaggerates severe health states. Brazier et al. [1] raised the same question and showed how respondents to the left of the gap between spikes all had N3 = 1.

Table 7 Difference between the Short Form 6D and EuroQol 5D single indices (SF-6D - EQ-5D) as a function of patient characteristics or framing effects: coefficients of ordinary least squares regressions with White's standard errors in brackets

DIFF        Model (3)     Model (4)     Model (5)     Model (6)     Model (7)
a           -0.03 (0.01)  0.02 (0.05)   0.07 (0.02)   0.06 (0.02)   0.10 (0.02)
b2 N3       0.50 (0.02)   0.50 (0.02)   0.44 (0.02)   0.44 (0.02)   0.46 (0.02)
b3 AGE                    -0.00 (0.02)
b4 FML                    -0.00 (0.02)
b5 OLIST                  0.00 (0.02)
b6 PREV                   0.00 (0.02)
b7 LABOR                  -0.00 (0.02)
b8 CMOB                                 -0.02 (0.02)  -0.06 (0.02)  -0.07 (0.02)
b9 CSELF                                -0.10 (0.02)  -0.11 (0.02)  -0.11 (0.02)
b10 CANX                                0.00 (0.02)   -0.03 (0.01)  -0.05 (0.02)
b11 CSOC                                              0.10 (0.02)   0.09 (0.02)
b12 FROLE                                                           -0.06 (0.02)
R²          0.79          0.79          0.82          0.85          0.86
Root MSE    0.11          0.11          0.10          0.07          0.09

Significant coefficients at the 0.05 significance level are bolded. Except for AGE (continuous), all variables were included as dummies: FML is female/male; OLIST is spondylolisthesis/degenerative disk disease; PREV is previous surgery yes/no; LABOR is participation in the labor market yes/no; CMOB is ceiling effect in mobility yes/no; CSELF is ceiling effect in self-care yes/no; CANX is ceiling effect in anxiety/depression yes/no; CSOC is ceiling effect in social yes/no; and FROLE is floor effect in role limitation yes/no. MSE, mean squared error.
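The N3 term and the ceiling dummies used in the models are simple functions of the EQ-5D profile: N3 takes the value 1 if any dimension is at the worst level (level 3 of the three-level EQ-5D), and a ceiling dummy takes the value 1 if the dimension is at the best level (level 1). A minimal sketch, assuming a profile is given as five integer levels in the order mobility, self-care, usual activities, pain/discomfort, anxiety/depression; the function names are hypothetical:

```python
def n3_term(profile):
    """Return 1 if any EQ-5D dimension is at level 3 (worst), else 0."""
    return int(any(level == 3 for level in profile))


def ceiling_dummy(level):
    """Return 1 if a dimension is at level 1 (best), else 0."""
    return int(level == 1)


# Example profiles (illustrative, not study data).
moderate = (2, 1, 2, 2, 1)  # no dimension at the worst level
severe = (2, 1, 2, 3, 1)    # pain/discomfort at the worst level

print(n3_term(moderate), n3_term(severe))
print([ceiling_dummy(level) for level in severe])
```

Because each EQ-5D dimension has only three levels, the ceiling dummy is close to a dichotomized version of the dimension itself, which is the caveat the discussion raises about interpreting the framing-effect coefficients.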
Although not reported here, we tested the influence of the N3 term by applying alternative scoring sets that do not have an N3 term (Denmark, Japan, and Zimbabwe) to our data. As expected, scoring sets without an N3 term generated less pronounced bimodal distributions, which in turn supports the view of others that the N3 term inappropriately exaggerates poor health states.

We specified two models of the relationship between measures, mainly for the purpose of relating this material to that of others. Our models adopted specifications similar to those of Brazier et al. [1] and, except for the magnitude of coefficients, the results were comparable. The upward shift in the EQ-5D coefficient (and downward shift of the intercept) when including the N3 term was not as marked in these data as in Brazier et al.'s; moreover, this model seemed to fit our data marginally better, with R² = 0.66 versus R² = 0.50 in Brazier et al.'s study population. This could suggest that the (negative) effect of the N3 term is not as severe in chronic pain as it is in acute pain.

An important question is whether (lack of) interchangeability would be more or less pronounced in some clinical subgroups. We specified models of between-measure discrepancy as a function of some basic characteristics to examine this. Unexpectedly, patient characteristics had no explanatory power at all, i.e., the discrepancy is, on average, equal in males versus females, in older versus younger respondents, in those working versus those not working, etc. At first sight, this seems unlikely, and skepticism toward the model specification is warranted. Although we tested for misspecification and found no evidence of it, our model could still be too simple. It is well established that patients suffering from long-lasting low back pain are a heterogeneous population irrespective of detailed eligibility criteria.
A final attempt to inform the etiology of a between-measure discrepancy was made by modeling the discrepancy as a function of framing effects. The selection of framing effects was informed by the first part of our analysis, and thus the final model included dummies for the N3 term, three ceiling effects of the EQ-5D (in the dimensions of mobility, self-care, and anxiety/depression), and one ceiling effect and one floor effect of the SF-6D (in the dimensions of social and role limitation). This model demonstrated an R² of 0.86, which is about 0.07 higher than that of the index model including only the N3 term. One might question whether this is a result of the explanatory power of the framing effects per se or rather of the underlying dimensions. In particular for the EQ-5D, there are only three levels in each dimension, and thus a dummy for a framing effect in a dimension is very much the same as a dichotomized version of the dimension itself. Nevertheless, these findings do have a rationale in demonstrating how the limited number of levels in the EQ-5D plays a role in relation to between-measure discrepancy.

One limitation of the present study is that no repeated measurements were conducted. According to Bland and Altman, comparative analysis should take between-measurements variation into account when examining between-measure agreement [32]. There is one study in the literature examining properties of repeatability for the two measures: Boonen et al. [33] found that the EQ-5D performed significantly more poorly than the SF-6D when using the suggested Bland and Altman approach. They estimated the smallest detectable difference at 0.36 for the EQ-5D and 0.17 for the SF-6D. Although not strictly related to the present scope, this is noteworthy given that minimally important differences have been found to be two to three times smaller than what was identified as detectable [34].
Ultimately, the real interest of this research is whether or not between-measure discrepancy might influence decision-making. This would be the case if the cost per quality-adjusted life year (QALY) is significantly altered when using one measure over its alternative. This was investigated by Conner-Spady and Suarez-Almazor in a setting of 161 patients visiting a rheumatologic clinic, who were followed for 12 months [6]. The authors reported an effect size (health gain in units of standard deviations) of 0.42 for the SF-6D and 0.52 for the EQ-5D, which is equivalent to mean improvements of 0.05 and 0.15, respectively. Using retrospective ratings by patients on whether they were in better, the same, or poorer health, QALY differences between better and poorer health were 0.09 for the SF-6D and 0.23 for the EQ-5D. An intuitive interpretation of these results would be that, for a hypothetical comparison of an effective and an ineffective intervention, using the EQ-5D over the SF-6D will, ceteris paribus, generate nearly three times as large a QALY gain.
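To make the decision-making implication concrete, the QALY differences quoted above (0.09 for the SF-6D and 0.23 for the EQ-5D) can be plugged into a cost-per-QALY calculation. The sketch below does so in Python; the incremental cost of 10,000 is a purely hypothetical figure chosen for illustration and does not come from the cited study, and the function name `cost_per_qaly` is likewise an assumption.

```python
def cost_per_qaly(incremental_cost, incremental_qalys):
    """Return the incremental cost per QALY gained."""
    if incremental_qalys <= 0:
        raise ValueError("incremental QALYs must be positive")
    return incremental_cost / incremental_qalys


# Hypothetical incremental cost (arbitrary currency units, not from the source).
incremental_cost = 10_000.0

# QALY differences between better and poorer health, as quoted in the text.
qaly_gain = {"SF-6D": 0.09, "EQ-5D": 0.23}

icers = {name: cost_per_qaly(incremental_cost, gain)
         for name, gain in qaly_gain.items()}
ratio = qaly_gain["EQ-5D"] / qaly_gain["SF-6D"]

for name, value in icers.items():
    print(f"{name}: {value:,.0f} per QALY")
print(f"EQ-5D gain relative to SF-6D gain: {ratio:.2f}")
```

Under the same cost, the measure yielding the larger QALY gain yields a proportionally lower cost per QALY, so the choice of instrument alone could move an intervention across a funding threshold.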

Conclusions

Although the EQ-5D and the SF-6D are both psychometrically valid for generic outcome assessment in long-lasting low back pain, it appears that they cannot generally be used interchangeably for measurement of preference values. Sensitivity analysis examining the impact of between-measure discrepancy thus remains a necessary condition for the interpretation of the results of cost-utility evaluations.

Source of financial support: No funding was received for the conduct of the present study.

References

1 Brazier J, Roberts J, Tsuchiya A, Busschbach J. A comparison of the EQ-5D and SF-6D across seven patient groups. Health Econ 2004;13:873-84.
2 Barton GR, Sach TH, Avery AJ, et al. A comparison of the performance of the EQ-5D and SF-6D for individuals aged ≥45 years. Health Econ 2008;17:815-32.
3 Marra CA, Marion SA, Guh DP, et al. Not all quality-adjusted life years are equal. J Clin Epidemiol 2007;60:616-24.
4 Wee HL, Machin D, Loke WC, et al. Assessing differences in utility scores: a comparison of four widely used preference-based instruments. Value Health 2007;10:256-65.
5 Moock J, Kohlmann T. Comparing preference-based quality-of-life measures: results from rehabilitation patients with musculoskeletal, cardiovascular, or psychosomatic disorders. Qual Life Res 2008;17:485-95.
6 Conner-Spady B, Suarez-Almazor ME. Variation in the estimation of quality-adjusted life-years by different preference-based instruments. Med Care 2003;41:791-801.
7 Grieve R, Grishchenko M, Cairns J. SF-6D versus EQ-5D: reasons for differences in utility scores and impact on reported cost-utility. Eur J Health Econ 2008 [online first on Mar 9, 2008].
8 Kopec JA, Willison KD. A comparative review of four preference-weighted measures of health-related quality of life. J Clin Epidemiol 2003;56:317-25.
9 McDonough CM, Grove MR, Tosteson TD, et al. Comparison of EQ-5D, HUI, and SF-36-derived societal health state values among spine patient outcomes research trial (SPORT) participants. Qual Life Res 2005;14:1321-32.
10 Petrou S, Hockley C. An investigation into the empirical validity of the EQ-5D and SF-6D based on hypothetical preferences in a general population. Health Econ 2005;14:1169-89.
11 EuroQol Group. EuroQol: a new facility for the measurement of health-related quality of life. Health Policy 1990;16:199-208.
12 The MVH Group. The Measurement and Valuation of Health: Final Report on the Modelling of Valuation Tariffs. York: University of York, 1995.
13 Dolan P, Gudex C, Kind P, Williams A. The time trade-off method: results from a general population study. Health Econ 1996;5:141-54.
14 Badia X, Roset M, Herdman M, Kind P. A comparison of United Kingdom and Spanish general population time trade-off values for EQ-5D health states. Med Decis Making 2001;21:7-16.
15 Wittrup-Jensen KU, Lauridsen JT, Gudex C, et al. Estimating Danish EQ-5D tariffs using TTO and VAS. In: Norinder A, Pedersen K, Roos P, eds, Proceedings of the 18th Plenary Meeting of the EuroQol Group (1st ed.). Copenhagen: IHE, The Swedish Institute for Health Economics, 2002.
16 Tsuchiya A, Ikeda S, Ikegami N, et al. Estimating an EQ-5D population value set: the case of Japan. Health Econ 2002;11:341-53.
17 Jelsma J, Hansen K, De Weerdt W, et al. How do Zimbabweans value health states? Popul Health Metr 2003;1:1-11.
18 Greiner W. Health economic evaluation of disease management programs: the German example. Eur J Health Econ 2005;6:191-6.
19 Shaw JW, Johnson JA, Coons SJ. US valuation of the EQ-5D health states: development and testing of the D1 valuation model. Med Care 2005;43:203-20.
20 Lamers LM, McDonnell J, Stalmeier PF, et al. The Dutch tariff: results and arguments for an effective design for national EQ-5D valuation studies. Health Econ 2006;15:1121-32.
21 Brazier J, Usherwood T, Harper R, Thomas K. Deriving a preference-based single index from the UK SF-36 Health Survey. J Clin Epidemiol 1998;51:1115-28.
22 Brazier J, Roberts J, Deverill M. The estimation of a preference-based measure of health from the SF-36. J Health Econ 2002;21:271-92.
23 Hollingworth W, Deyo RA, Sullivan SD, et al. The practicality and validity of directly elicited and SF-36 derived health state preferences in patients with low back pain. Health Econ 2002;11:71-85.
24 Solberg TK, Olsen JA, Ingebrigtsen T, et al. Health-related quality of life assessment by the EuroQol-5D can provide cost-utility data in the field of low-back surgery. Eur Spine J 2005;14:1000-7.
25 Brazier J, Deverill M. A checklist for judging preference-based measures of health related quality of life: learning from psychometrics. Health Econ 1999;8:41-51.
26 Tsuchiya A, Brazier J, Roberts J. Comparison of valuation methods used to generate the EQ-5D and the SF-6D value sets. J Health Econ 2006;25:334-46.
27 McDonough CM, Tosteson AN. Measuring preferences for cost-utility analysis: how choice of method may influence decision-making. Pharmacoeconomics 2007;25:93-106.
28 Brazier J, Deverill M, Green C. A review of the use of health status measures in economic evaluation. J Health Serv Res Policy 1999;4:174-84.
29 Christensen FB, Hansen ES, Eiskjaer SP, et al. Circumferential lumbar spinal fusion with Brantigan cage versus posterolateral fusion with titanium Cotrel-Dubousset instrumentation: a prospective, randomized clinical study of 146 patients. Spine 2002;27:2674-83.
30 Christensen FB, Hansen ES, Laursen M, et al. Long-term functional outcome of pedicle screw instrumentation as a support for posterolateral spinal fusion: randomized clinical study with a 5-year follow-up. Spine 2002;27:1269-77.
31 Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;1:307-10.
32 Bland JM, Altman DG. Measuring agreement in method comparison studies. Stat Methods Med Res 1999;8:135-60.
33 Boonen A, van der Heijde D, Landewe R, et al. How do the EQ-5D, SF-6D and the well-being rating scale compare in patients with ankylosing spondylitis? Ann Rheum Dis 2007;66:771-7.
34 Walters SJ, Brazier JE. Comparison of the minimally important difference for two health state utility measures: EQ-5D and SF-6D. Qual Life Res 2005;14:1523-32.
35 Bjorner JB, Damsgaard MT, Watt T, et al. Dansk Manual til SF-36 [Danish Manual to the SF-36]. København: Lægemidelforeningen (LIF), 1997.
36 Ware JE. SF-36 Health Survey: Manual and Interpretation Guide. Boston: Nimrod Press, 1993.
37 Ware JE, Kosinski M, Bayliss MS, et al. Comparison of methods for the scoring and statistical analysis of SF-36 health profile and summary measures: summary of results from the Medical Outcomes Study. Med Care 1995;33(Suppl. 4):S264-79.