Issues for selection of outcome measures in stroke rehabilitation: ICF Participation

Disability and Rehabilitation, 2005; 27(9): 507–528

CLINICAL COMMENTARY

Issues for selection of outcome measures in stroke rehabilitation: ICF Participation

K. SALTER 1, J.W. JUTAI 1,3, R. TEASELL 1,2, N.C. FOLEY 1, J. BITENSKY 1 & M. BAYLEY 3

1 Department of Physical Medicine and Rehabilitation, St. Joseph's Health Care London, UK, 2 University of Western Ontario, London, Ontario, Canada, and 3 Neurorehabilitation Program, Toronto Rehabilitation Institute, Toronto, Ontario, Canada

(Accepted date August 2004)

Abstract

Purpose. To evaluate the psychometric and administrative properties of outcome measures in the ICF Participation category, which are used in stroke rehabilitation research and reported in the published literature.

Method. Critical review and synthesis of measurement properties for six commonly reported instruments in the stroke rehabilitation literature. Each instrument was rated using the eight evaluation criteria proposed by the UK Health Technology Assessment (HTA) programme. The instruments were also assessed for the rigour with which their reliability, validity and responsiveness were reported in the published literature.

Results. Validity has been well reported for at least half of the measures reviewed. However, methods for reporting specific measurement qualities of outcome instruments were inconsistent. Responsiveness of measures has not been well documented. Of the three ICF categories, Participation seems to be most problematic with respect to: (a) lack of consensus on the range of domains required for measurement in stroke; (b) much greater emphasis on health-related quality of life, relative to subjective quality of life in general; (c) the inclusion of a mixture of measurements from all three ICF categories.

Conclusions. The reader is encouraged to examine carefully the nature and scope of outcome measurement used in reporting the strength of evidence for improved participation associated with stroke rehabilitation.
There is no consensus regarding the most important indicators of successful involvement in a life situation and which ones best represent the societal perspective of functioning. In particular, quality of life outcomes lack adequate conceptual frameworks to guide the process of development and validation of measures.

Correspondence: Department of Physical Medicine and Rehabilitation, St. Joseph's Health Care London and University of Western Ontario, 801 Commissioners Road East, London (Ontario) N6C 5J1, Canada. E-mail: Katherine.Salter@sjhc.london.on.ca

ISSN 0963-8288 print/ISSN 1464-5165 online © 2005 Taylor & Francis Group Ltd. DOI: 10.1080/0963828040008552

Introduction

Measuring the effectiveness of rehabilitation interventions is accepted as essential to good practice. Van der Putten et al. [1] point out that measuring the outcome of health care is a central component of determining therapeutic effectiveness and, therefore, the provision of evidence-based healthcare. Reliability, validity, and administrative burden are properties of measurement instruments that affect the credibility of the measurement process [2, 3] and the reporting of research findings [4–6]. Remarkably few published rehabilitation outcome studies appear to report these properties adequately in defending their research designs and interpreting their results [2].

Recently, there have been important advances in compiling and publishing the best-available scientific evidence examining the effectiveness of stroke rehabilitation [7, 8]. However, there are limitations to the successful transfer of the research results to clinical practice and service delivery, in part due to a lack of consensus on the selection of measures to best address and balance the needs and values of stakeholders in stroke rehabilitation, including patients and their caregivers, practitioners, and healthcare decision makers. Ultimately, the comparison of size and direction of treatment effects across areas of stroke rehabilitation will be most meaningfully interpreted when it is clear that comparable approaches to outcome measurement have been used [9].

To enhance the clinical meaningfulness of the current evidence, this paper presents the best available information on how outcome measures might be classified and selected for use, based upon their measurement qualities. For this purpose, we have selected for review only some of the more commonly used measures in stroke rehabilitation. This paper is not intended to be a comprehensive compendium of stroke outcome measures.

This paper attempts to describe how the ICF [10, 11] conceptual framework can be used for classifying outcome measures in stroke rehabilitation, and summarizes aspects of measurement theory that are pertinent for evaluating measures. It also gives a template presentation on the characteristics, application, reliability, validity, and other clinimetric qualities of commonly used measures in a format for easy reference. For a more extensive discussion of outcome measurement theory and properties in physical rehabilitation, the reader is referred to Finch et al. [12]. This paper will present only the information most relevant for the rehabilitation of stroke patients.

Classification of stroke rehabilitation outcomes

To be effective, outcomes research requires a systematic approach to describing outcomes and classifying them meaningfully. The study and assessment of stroke rehabilitation has sparked the development of numerous outcome measures applicable to one or more of its dimensions. In attempting to discuss some of the commonly used measures available for use within the field of stroke rehabilitation, it is useful to have guidelines available for classifying these tools. The WHO International Classification of Functioning, Disability and Health (ICF) [10, 11] provides a multi-dimensional framework for health and disability suited to the classification of outcome instruments. Originally published in 1980, the WHO framework has undergone several revisions.
In the most recent version, the ICF framework [10, 11] identifies three primary levels of human functioning: the body or body part, the whole person, and the whole person in relation to his/her social context. Outcomes may be measured at any of these levels: Body functions/structure (impairment); Activities (refers to the whole person; formerly conceived as disability in the old ICIDH framework); and Participation (formerly referred to as handicap) (Table I). Activity and Participation are affected by environmental and personal factors (referred to as contextual factors within the ICF, Table I). Outcome measures can also be conceived of as falling along a continuum, moving from measurements at the level of body function or structure to those focused on participation and life satisfaction. It becomes more difficult to attribute outcomes to particular rehabilitation interventions as one moves away from body structure toward participation, since many variables other than the interventions might account for changes observed [13, 14].

We reviewed the findings from a number of recent studies that have examined the patterns of scale use in various settings, both clinical [15–17] and research [14, 18–21]. In the absence of an authoritative and comprehensive published list of recommended stroke rehabilitation measures, we focused our review on scales with which most stroke specialists would be familiar. Table II presents 20 of the most popular instruments from the stroke rehabilitation literature, classified by ICF category by the primary author, after consideration of the study author's stated purpose for the tool and the content of the instrument's items. This subjective component was introduced because there is no published consensus on how this kind of classification should proceed. The classification was reviewed independently by the co-authors. Table II reflects the consensus among the paper's authors.
If a classification is to be useful for scientific research, the basic categories and concepts within it need to be measurable, and their boundaries clear and distinct. It is not yet clear from the research evidence whether the three ICF categories completely fulfill these criteria. Nonetheless, when applied to outcome assessment in stroke rehabilitation the ICF conceptual framework can be used to place outcome measures into one of the three categories depending upon what it is they purport to measure.

Table I. ICF Definitions.

| Old terminology | New terminology | Definition |
|---|---|---|
| Impairment | Body function/structure | Physiological functions of body systems including psychological. Structures are anatomical parts or regions of their bodies and their components. Impairments are problems in body function or structure. |
| Disability | Activity | The execution of a task by an individual. Limitations in activity are defined as difficulties an individual might experience in completing a given activity. |
| Handicap | Participation | Involvement of an individual in a life situation. Restrictions to participation describe difficulties experienced by the individual in a life situation or role. |

Table II. Classification of outcome measures.

| Body structure (impairments) | Activities (limitations to activity: disability) | Participation (barriers to participation: handicap) |
|---|---|---|
| 1. Beck Depression Inventory | 6. Barthel Index | 15. Euroqol-5D |
| 2. Fugl-Meyer Assessment | 7. Berg Balance Scale | 16. Medical Outcomes Study Short Form 36 |
| 3. Mini Mental State Examination | 8. Chedoke McMaster Stroke Assessment Scale | 17. Nottingham Health Profile |
| 4. Modified Ashworth | 9. Functional Independence Measure (FIM) | 18. Sickness Impact Profile (stroke adapted version) |
| 5. Motor-free Visual Perception Test | 10. Frenchay Activities Index | 19. Stroke Impact Scale |
| | 11. Modified Rankin Handicap Scale | 20. Stroke Specific Quality of Life |
| | 12. Rivermead Motor Assessment | |
| | 13. Rivermead Mobility Index | |
| | 14. Timed-Up-and-Go (TUG) | |

It should be noted that linking existing measures to the ICF is not a straightforward process [22–24]. Many existing measures include items that fall into several ICF dimensions in addition to items that may not be included in the ICF at all. Instruments appearing in the Participation domain, for instance, assess participation in life situations such as social functioning or roles, but include the assessment of elements of one or both of the Body Structure/Function and Activities categories. While these measures have been used to assess health-related quality of life, it is not the intent of this paper to define this construct or its assessment. The present study was not intended as an attempt to provide item by item mapping for each of the identified measures. The ICF was used as a framework within which measures were classified according to the level of assessment they include furthest along a continuum from body function, through activity to participation.
However, the process of developing systematic approaches to establishing linkages between existing measures and the ICF is an important one in the ongoing attempt to create an international language and standard for measurement.

Evaluation criteria for outcome measures

While it is useful to have the ICF framework within which to classify levels of outcomes measures, it is necessary to have a set of criteria to guide the selection of outcomes measures. Reliability, validity and responsiveness have widespread usage and are essential to the evaluation of outcome measures [1, 14, 19, 25]. Finch et al. [12] provide a good tutorial on the general issues for outcome measure selection. The Health Technology Assessment (HTA) programme [26] examined 413 articles that focused on methodological aspects of the use and development of patient-based outcome measures. In their report, they recommended the use of eight evaluation criteria. Table III lists the criteria and gives a definition for each one. It also identifies a recommended standard for quantifying (rating) each criterion, where applicable, and how the ratings should be interpreted. The criteria, including some additional considerations described below, were applied to each of the outcome measures reviewed in this paper.

Each measure reviewed in this paper was also assessed for the thoroughness with which its reliability, validity and responsiveness have been reported in the literature. Standards for evaluation of rigour were adapted from McDowell & Newell [27] and Andresen [4]. The authors assessed rigour in the manner described above for the other ratings, and scored each instrument on each of the three properties as follows:

+ + +  Excellent: most major forms of testing reported;
+ +  Adequate: several studies and/or several types of testing reported;
+  Poor: minimal information is reported and/or few studies (other than author's);
N/a  No information available.
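The numeric standards that Table III attaches to reliability, validity and floor/ceiling effects can be read as simple threshold rules. The following is a minimal sketch under stated assumptions, not the authors' procedure: the function names are hypothetical, values falling between the published bands (for example 0.745) are assigned by treating the bands as continuous, and a coefficient of exactly 0.40, which the table lists under both "adequate" and "poor", is assigned to "poor".

```python
from typing import Sequence


def rate_reproducibility(coef: float) -> str:
    """Rate a test-retest or inter-observer coefficient (ICC or kappa)."""
    if coef >= 0.75:
        return "excellent"
    # Assumption: exactly 0.40 is treated as "poor" to resolve the overlap
    # between the "adequate" (0.4-0.74) and "poor" (<=0.40) bands.
    return "adequate" if coef > 0.40 else "poor"


def rate_internal_consistency(alpha: float) -> str:
    """Rate a split-half or Cronbach's alpha statistic."""
    if alpha >= 0.80:
        return "excellent"
    return "adequate" if alpha >= 0.70 else "poor"


def alpha_may_be_redundant(alpha: float) -> bool:
    """Flag alpha values above 0.90, which may indicate item redundancy."""
    return alpha > 0.90


def rate_validity_correlation(r: float) -> str:
    """Rate a construct/convergent or concurrent correlation.

    Assumption: the caller passes the magnitude of the correlation.
    """
    if r >= 0.60:
        return "excellent"
    return "adequate" if r > 0.30 else "poor"


def rate_auc(auc: float) -> str:
    """Rate the area under the curve from an ROC analysis."""
    if auc >= 0.90:
        return "excellent"
    return "adequate" if auc >= 0.70 else "poor"


def rate_floor_ceiling(scores: Sequence[float], minimum: float, maximum: float) -> str:
    """Rate floor/ceiling effects from the share of patients at either extreme.

    Assumption: the larger of the floor share and the ceiling share is rated.
    """
    if not scores:
        raise ValueError("at least one score is required")
    n = len(scores)
    floor_share = sum(s == minimum for s in scores) / n
    ceiling_share = sum(s == maximum for s in scores) / n
    worst = max(floor_share, ceiling_share)
    if worst == 0:
        return "excellent"
    return "adequate" if worst <= 0.20 else "poor"
```

For instance, `rate_reproducibility(0.82)` returns `"excellent"`, while a sample in which two of five patients reach the maximum score would be rated `"poor"` for ceiling effects. Sensitivity to change is left out of the sketch because the table describes its standard qualitatively as well as numerically.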
For example, a rigour rating of + + + (or excellent) for validity meant that evidence has been presented demonstrating excellent construct validity based on the standards provided and in various forms including convergent and discriminant validity.

In addition to the criteria outlined above, three additional questions were considered. Has the measure been used in a stroke population? Has the measure been tested for use with proxy assessment? What is the recommended time frame for measurement?

The primary author reviewed and rated each instrument using these evaluative criteria. The results were reviewed independently by the co-authors. There were very few instances of disagreement among raters, and they were never more than one level apart in their evaluations. The results presented in this paper reflect the authors' consensus on ratings after discussing all discordant ratings.

Table III. Evaluation criteria and standards.

1. Appropriateness
   Definition: The match of the instrument to the purpose/question under study. One must determine what information is required and what use will be made of the information gathered [85].
   Standard: Depends upon the specific purpose for which the measurement is intended.

2. Reliability
   Definition: Refers to the reproducibility and internal consistency of the instrument. Reproducibility addresses the degree to which the score is free from random error. Test-retest and inter-observer reliability both focus on this aspect of reliability and are commonly evaluated using correlation statistics including ICC, Pearson's or Spearman's coefficients and kappa coefficients (weighted or unweighted). Internal consistency assesses the homogeneity of the scale items. It is generally examined using split-half reliability or Cronbach's alpha statistics. Item-to-item and item-to-scale correlations are also accepted methods.
   Standard: Test-retest or inter-observer reliability (ICC; kappa statistics) [4, 86, 87]: Excellent: ≥ 0.75; Adequate: 0.4–0.74; Poor: ≤ 0.40. Note: Fitzpatrick et al. [26] recommend a minimum test-retest reliability of 0.90 if the measure is to be used to evaluate the ongoing progress of an individual in a treatment situation.
   Internal consistency (split-half or Cronbach's α statistics): Excellent: ≥ 0.80; Adequate: 0.70–0.79; Poor: < 0.70. Note: Fitzpatrick et al. [26] caution α values in excess of 0.90 may indicate redundancy.
   Inter-item and item-to-scale correlation coefficients: Adequate levels: inter-item between 0.3 and 0.9; item-to-scale between 0.2 and 0.9 [26, 59].

3. Validity
   Definition: Does the instrument measure what it purports to measure? Forms of validity include face, content, construct, and criterion. Concurrent, convergent or discriminative, and predictive validity are all considered to be forms of criterion validity. However, concurrent, convergent and discriminative validity all depend on the existence of a gold standard to provide a basis for comparison. If no gold standard exists, they represent a form of construct validity in which the relationship to another measure is hypothesized [12].
   Standard: Construct/convergent and concurrent correlations: Excellent: ≥ 0.60; Adequate: 0.31–0.59; Poor: ≤ 0.30 [4, 26, 27, 88].
   ROC analysis (AUC): Excellent: ≥ 0.90; Adequate: 0.70–0.89; Poor: < 0.70 [27].
   There are no agreed-on standards by which to judge sensitivity and specificity as a validity index [89].

4. Responsiveness
   Definition: Sensitivity to changes within patients over time (which might be indicative of therapeutic effects). Assessment of possible floor and ceiling effects is included as they indicate limits to the range of detectable change beyond which no further improvement or deterioration can be noted. Responsiveness is most commonly evaluated through correlation with other change scores, effect sizes, standardized response means, relative efficiency, sensitivity and specificity of change scores and ROC analysis.
   Standard: Sensitivity to change: Excellent: evidence of change in expected direction using methods such as standardized effect sizes (< 0.5 = small; 0.5–0.8 = moderate; ≥ 0.8 = large), or by way of standardized response means, ROC analysis of change scores (area under the curve, see above) or relative efficiency. Adequate: evidence of moderate/less change than expected; conflicting evidence. Poor: weak evidence based solely on p-values (statistical significance) [4, 26, 27, 88].
   Floor/ceiling effects: Excellent: no floor or ceiling effects. Adequate: floor and ceiling effects ≤ 20% of patients who attain either the minimum (floor) or maximum (ceiling) score. Poor: > 20% [59].

5. Precision
   Definition: Number of gradations or distinctions within the measurement, e.g. yes/no response vs. a 7-point Likert response set.
   Standard: Depends on the precision required for the purpose of the measurement (e.g., classification, evaluation, prediction).

6. Interpretability
   Definition: How meaningful are the scores? Are there consistent definitions and classifications for results? Are there norms available for comparison?

7. Acceptability
   Definition: How acceptable the scale is in terms of completion by the patient; does it represent a burden? Can the assessment be completed by proxy, if necessary?

8. Feasibility
   Definition: Extent of effort, burden, expense and disruption to staff/clinical care arising from the administration of the instrument.

   Standard (criteria 6–8): Jutai & Teasell [9] point out these practical issues should not be separated from consideration of the values that underscore the selection of outcome measures. A brief assessment of practicality will accompany each summary evaluation.

Has the measure been used in a stroke population?

Reliability and validity are not fixed qualities of measures. They should be regarded as relative indicators of how well the instrument might function within a given sample or for a given purpose [26, 28]. Responsiveness, too, may be condition or purpose specific. Van der Putten et al. [1], for example, in an evaluation of the Barthel Index and Functional Independence Measure, found both measures to be equally responsive in terms of effect sizes when used among stroke patients and patients with multiple sclerosis. Within the stroke group, floor and ceiling effects were within acceptable limits on both measures. However, the authors point out that within the MS patient group, there were larger ceiling effects associated with the BI scores and the scores from the FIM cognitive subscale. This, coupled with the much smaller effect sizes noted among MS patients, leads the authors to suggest that these two instruments are better suited for use among stroke patients. Therefore, it would seem important for a measure to have been tested for use in the population within which it will be applied.

Has the measure been tested for use with proxy assessment?

When assessment is conducted in such a way as to require a form of self-report (e.g.
interview or questionnaire in person, by telephone or by mail), stroke survivors who have experienced significant cognitive or speech and language deficits may be excluded from assessment because of their inability to complete it. In such cases, the use of a proxy respondent becomes an important alternative source of information. However, the use of proxy respondents should be approached with caution. Studies of proxy assessments report a tendency for significant others, including family members, to assess patients as more disabled than they appear on other measures of functional disability, including self-reported methods. This discrepancy becomes more pronounced for patients with more impaired levels of functioning [29–31]. Hachisuka et al. [31] suggested that this discrepancy could be explained by a difference in interpretation. Proxy respondents may be rating actual, observable performance, while patients may rate their perceived capability, that is, what they think they are capable of doing rather than what they actually do.

Unfortunately, using a healthcare professional as a substitute for the family member or significant other as proxy does not solve this problem. A similar discrepancy has been noted in ratings when using healthcare professionals as proxy respondents, though in the opposite direction. They may tend to rate patients higher than the patients themselves would [30, 32]. It has been suggested that, in this case, the discrepancy is due to a difference in frame of reference. A healthcare professional may use a different, more disabled group as a reference norm, whereas the patient would only compare him/herself to pre-stroke conditions [32].

What is the recommended timeframe for measurement?

The natural history of stroke presents problems in assessment in that the rate and extent of change in outcomes varies across the different levels of ICF classification [19].
The further one moves along the outcome continuum from body structure toward participation, the more time it may take to reach a measurement end point; that is, participation within a defined social context may take longer to stabilize than the impaired body structure [33]. Jorgensen et al. [34] demonstrated that maximal recovery in Activities of Daily Living (ADL) occurred, in most patients, within the first 13 weeks following a stroke even though the time course of both neurological and functional recovery was strongly related to initial stroke severity. They suggested that a valid prognosis of functional recovery might be made within the first 6 months. According to Mayo et al. [35], by 6 months post-stroke, physical recovery is complete, for the most part, with additional gains being a function of

learning, practice and confidence. Duncan et al. [19] support this time frame for assessment of neurological impairment and disability outcomes but suggest that participation outcomes should not be measured sooner than 6 months post-stroke, to provide the opportunity for the patient's social situation to stabilize. They also suggest that assessments at the time of discharge not be used as endpoint measurements. They argue that variability in treatment interventions and length of stay practices decreases the comparative usefulness of this information.

Review of participation outcome measures

This paper is the final in a series of three, and deals with the third level or category of the ICF classification system: Participation. The necessity for clearly defined boundaries between categories of classification is most apparent when one considers the ICF dimensions of Activity and Participation. Given that the domains associated with activity and participation are presented as a single, neutral list with several classification options [10], it is not surprising that questions have arisen with regard to the validity of separating them into distinct dimensions. While exploration of this issue is ongoing, it is worth noting that Jette et al. [36] recently provided empirical evidence of distinctly differing concepts conforming to the dimensions of activity and participation as defined within the ICF and suggested that the participation domain may represent more complex categories of life behaviours. Granlund et al. (2004) suggest that the participation dimension of the ICF can be used effectively to link items on existing measures to the ICF, although they do not attempt to define its relationship to activity. Keeping in mind that the fit of a given instrument within a single category is rarely perfect, measures appearing in this section focus on the assessment of Participation.
As defined by the ICF [10, 11], Participation is involvement in a life situation and represents the societal perspective of functioning. Participation restrictions, therefore, are problems an individual may experience in involvement in life situations or roles. If an activity limitation prevents a person from attending school or being employed, this is a participation restriction (handicap). In contrast with the ICF Activities category, tasks subsumed within the Participation level are relatively complex, related to quality of life, performed with others, more dependent upon environmental influences, assessed in the community by self or proxy report, and the focus of patients and caregivers [37]. According to Perenboom and Chorus [38], involvement in life situations includes the concept of autonomy even if one is not actually doing things themselves, and therefore, the assessment of participation should include the fulfilment of personal goals and societal roles rather than performance-based indicators alone.

The EuroQol Quality of Life Scale (EQ-5D)

The EuroQol scale (EQ-5D) is a generic index instrument, which was developed by a multi-country, multi-disciplinary team and is used to value and describe health states [39]. The EQ-5D was intended to be brief and simple to administer, representing little or no burden to the patient. It focuses on a core set of generic, health-related quality of life items to provide a broad, generic assessment. The EQ-5D was intended to promote the collection of a common data set for reference purposes or as a complement to other, more comprehensive measures [27, 39–41]. The EQ-5D is a self-administered questionnaire in two parts. The first part contains a simple descriptive profile of health in five dimensions (mobility, self-care, usual activities, pain/discomfort and anxiety/depression).
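As a rough illustration of how a five-dimension profile of this kind can be handled, the sketch below assumes that each dimension is coded at one of three levels (1, 2 or 3) and joins the codes into a 5-digit health state. The function names and the example respondent are hypothetical. Converting a health state to a utility value requires a published, country-specific value set, which is deliberately not reproduced here.

```python
from itertools import product

# Dimension order follows the descriptive profile described in the text.
DIMENSIONS = (
    "mobility",
    "self-care",
    "usual activities",
    "pain/discomfort",
    "anxiety/depression",
)
LEVELS = (1, 2, 3)  # assumed coding: one of three levels per dimension


def health_state(profile):
    """Join the five level codes into a 5-digit health-state string."""
    codes = []
    for dim in DIMENSIONS:
        level = profile[dim]
        if level not in LEVELS:
            raise ValueError(f"{dim}: level must be 1, 2 or 3, got {level!r}")
        codes.append(str(level))
    return "".join(codes)


def all_health_states():
    """Enumerate every combination of levels across the five dimensions."""
    return ["".join(map(str, combo))
            for combo in product(LEVELS, repeat=len(DIMENSIONS))]


# Hypothetical respondent, for illustration only.
example = {
    "mobility": 2,
    "self-care": 1,
    "usual activities": 2,
    "pain/discomfort": 3,
    "anxiety/depression": 1,
}
print(health_state(example))     # 21231
print(len(all_health_states()))  # 243, i.e. 3 ** 5
```

Keeping the profile as a string preserves the response on each dimension, which a single summed score would not.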
Each dimension is represented by three statements corresponding to three levels of difficulty (some or no problem, moderate problems and extreme problems) within that dimension. The respondent chooses the statement within each dimension that is most applicable to himself or herself at the time of assessment. Each dimension statement selected receives a numerical rating of 1 (some or no problem), 2 (moderate problems) or 3 (extreme problems). These ratings are combined such that each combination of choices creates a 5-digit expression of a health state. Theoretically, there are 243 such representations possible. By applying scores from a standard set of values, each of these health states can be transformed into a utility value ranging from 0 (worst possible) to 1 (best possible). Standard weights or preferences were derived from population data obtained using time trade-off techniques [12, 42]. Values have been elicited for health states in Canada, Denmark, Finland, Germany, Japan, Netherlands, New Zealand, Slovenia, Spain, Sweden, UK, US and Zimbabwe. Part 2 of the EQ-5D consists of a visual analogue scale (VAS) on which respondents rate their current state of health from 0 (worst imaginable) to 100 (best possible). While the EQ-5D was originally designed for self-administration, it can be administered by interview. The EQ-5D takes approximately 2–3 min to complete and yields three types of information: a profile indicating the extent of problems experienced on each of five dimensions, a population-weighted health index and a self-rated assessment of current perceived health [41]. The scale is in the public domain and may be used without cost for the most part. Restrictions on the use of the scale as well as current information and references regarding

the EQ-5D are available from the website www.euroqol.org. The measurement properties of the EQ-5D are summarized in Table IV.

Advantages

The EQ-5D is very short and simple. High response rates have been reported (80% [43] and 80–86% [44]). Reports of missing data are mixed, although rates are relatively low overall [43, 45]. The scale also provides considerable flexibility. Though designed as a self-completed instrument to be administered by post, it can be administered in face-to-face interviews and has been evaluated for use with proxy respondents. In addition, the data can be presented and used in three distinct forms: a patient profile in five domains based on unweighted responses, a health utility or index and an overall rating of perceived health.

Limitations

The level of validity reported would suggest that the instrument may not be suitable for use in serial assessments of individual patients. It would be more appropriate for use in the study and comparison of groups [44, 45]. Brazier et al. [46] reported missing data rates of 10% when using the EQ-5D in an elderly population (mean age 80.1 years). This observation is supported by Coast et al. [47], who demonstrated that the ability to self-complete the EQ-5D is directly related to age and cognitive function (p < 0.0001). The authors also report that the probability of requiring interview administration to complete the scale increases from 11% at age 65 to 73% at age 85. This would increase the costs associated with using the EQ-5D with elderly populations. While the scale has been assessed for use with proxy respondents post-stroke, Dorman et al. [44] observed that reliability was consistently lower when a proxy respondent completed the questionnaire on the patient's behalf. Levels of agreement between proxy respondents and patients were acceptable for mobility and self-care.
However, the more subjective the domain, the lower were the levels of agreement. In the case of depression/anxiety, agreement was no better than chance among the more severely affected stroke survivors [48]. The health state valuations used in the EQ-5D utility were derived from time trade-off techniques. These techniques may be prone to biases and have been shown to elicit lower values for minor and major stroke than standard gamble techniques [49].

Summary

The ratings of methodological rigour associated with evaluation of the measurement properties of the EQ-5D are presented in Table X.

Practicality

Interpretability: EQ-5D uses population-based utility weights (a set of empirically derived valuations) to provide a standard set of utility values for the 5-digit health state derived from the 5-domain index. These weights are available for a large number of countries and cultures. The health profile may also be considered as an unweighted profile in 5 dimensions and is accompanied by a rating of perceived health status.

Acceptability: Although designed to be short and simple, reports of missing data are mixed. Essink-Bot et al. [45] report higher rates of missing data for the EQ-5D than for the NHP or SF-36. However, its simplicity and brevity remain an advantage for use with stroke survivors. It has been evaluated for use with proxy respondents, although only the mobility and self-care domains remain reliable.

Feasibility: The EQ-5D is designed as a self-completion questionnaire that may be administered as a postal survey or face-to-face interview. It requires no special training to administer and both the scale itself and supporting information are readily available.

Medical Outcomes Study Short Form 36 (SF-36)

The Medical Outcomes Study Short Form 36 (SF-36) is a generic health survey created as part of the Medical Outcomes Study to assess health status in the general population [50].
It comprises 36 items drawn from the original 245 items generated by that study [50, 51]. Items are organized into eight dimensions or subscales: physical functioning, role limitations-physical, bodily pain, social functioning, general mental health, role limitations-emotional, vitality, and general health perceptions. It also includes two questions intended to estimate change in health status over the past year. These two questions remain separate from the eight subscales and are not scored. With the exception of the general change in health status questions, subjects are asked to respond with reference to the past 4 weeks. An acute version of the SF-36 refers to problems in the past week only [27]. The recommended scoring system uses a weighted Likert system for each item. Item scores within each subscale are summed to give a score for that subscale or dimension. Each of the eight summed scores is linearly transformed onto a scale from 0–100 to provide a score out of 100 for each subscale. In addition, a physical component score (PCS) and mental component score (MCS) can be derived from the scale items. Standardized population data for several

Table IV. Measurement properties of the EuroQol-5D.

Reliability

Test-retest: Hurst et al. [90] reported the EQ-profile showed no significant change in any of 5 domains in self-reported stable patients (p ≥ 0.02), EQ-utility and EQ-VAS ICC = 0.73 and 0.70 respectively, over 3 months and 0.78 and 0.85 over 2-week retest interval; Dorman et al. [44] reported κ = 0.66 (usual activities) to 0.85 (mobility) for the index, ICC = 0.86 for the VAS self-rating of health status and ICC = 0.83 for the health utility; Brazier et al. [46] reported r = 0.53 (VAS) and r = 0.67 (utility index) over a 6-month retest period.

Validity

Construct validity: Patients reporting problems on EuroQol domains reported dysfunction on a standardized instrument in that domain: OPCS locomotion related to mobility (r = 0.61), BI to self-care (r = −0.64), FAI to usual activities (r = −0.60), VAS pain scale to pain (r = 0.71), HADS mood to anxiety (r = 0.56) and depression (r = 0.35); median scores on standard instruments were ordered appropriately when compared with EuroQol levels (p ≤ 0.0002) [91]; Hurst et al. [90] reported EQ-5D levels of self-care, pain and anxiety/depression scores related to corresponding mean and median scores on standardized assessments (HAQ, Pain-VAS and HAD-mood; p < 0.001), EQ-utility and EQ-VAS correlated with measures of disease activity (r = 0.32 to 0.57) as well as subjective measures of mood (HAD, r = −0.56 and −0.59) and pain (r = −0.73 and −0.63); Cup et al. [78] reported EQ-5D correlated with standardized functional measures BI (0.7) and FAI (0.65); Johnson & Coons [92] reported increasing age related to 4/5 EQ dimensions (all except anxiety/depression; p < 0.01) as hypothesized; employment status, education, household income, marital status and presence of chronic medical problems all significantly related to EQ-5D dimension scores in the expected direction (p < 0.05).

Construct validity (known groups): EuroQol scores on all dimensions able to discriminate migraine sufferers from controls (p ≤ 0.03, ROC/AUC = 0.50–0.59) and between groups of migraine sufferers based on absence from work (0 vs. ≥ 0.5 days; p < 0.0, ROC/AUC = 0.54–0.70) [45]; EuroQol profiles distinguished between major stroke syndrome groups and between groups based on baseline stroke severity; the EuroQol VAS ratings of overall health could also discriminate groups based on severity (p < 0.05) [91]; Hurst et al. [90] reported increasing problems on all 5 EQ-5D domains associated with increasing functional class in RA patients stratified by function (p < 0.001), EQ utilities discriminated between all RA functional classes (p < 0.001), EQ-VAS discriminated between functional classes 1, 2 and 3 (p < 0.001) but not between 3 and 4 (more severe); Brazier et al. [46] reported EuroQol utility and VAS scores distinguished groups based on recent visits to GP, hospital inpatient stays and longstanding illness (p < 0.05).

Concurrent validity: Essink-Bot et al. [45] reported EQ dimensions correlated with corresponding COOP/WONCA chart items: anxiety/depression with COOP feelings (r = 0.83), usual activities with COOP daily activities (r = 0.75) and COOP social activities (r = 0.61); EQ-5D dimension scores most closely related to corresponding COOP-WONCA charts: physical chart to EQ mobility and self-care (r = 0.39 and 0.34), feelings to EQ anxiety/depression (r = 0.70), daily activities with usual activities (r = 0.59), pain to EQ pain (r = 0.74); COOP overall health related to EQ-5D utility score (r = −0.53) and VAS rating (r = −0.65) as was COOP change in health (r = −0.35) [47]; Cup et al. [78] reported EQ-5D correlated with SA-SIP30 (0.48); Dorman et al. [56] demonstrated mobility, self-care and usual activities correlated most strongly with SF-36 physical functioning (r = 0.57, 0.65 and 0.63, respectively), pain correlated with SF-36 bodily pain (r = 0.66) but anxiety/depression domain of EuroQol was moderately correlated with all SF-36 subscales, r = 0.21 (mental health) to r = 0.44 (general health); VAS rating correlated most strongly to SF-36 general health (r = 0.66); EQ-VAS rating correlated with SF-12 PCS and MCS (r = 0.55 and 0.41, respectively) [92]; Bosch & Hunink [93] found EQ-5D score (utility) correlated with HUI3 before and after treatment for intermittent claudication (ICC = 0.49–0.78), change in EQ-5D and HUI3 scores over time was significantly correlated (ICC = 0.30, p < 0.01); change in EQ-5D scores correlated with change in SF-36 scores on all dimensions, ICC = 0.22 (energy) to 0.43 (pain); EQ-5D index correlated with SF-12 PCS (r = 0.64) and MCS (r = 0.52) and HUI3 (r = 0.69); the EQ-VAS rating correlated with SF-12 PCS (r = 0.61), MCS (r = 0.41) and HUI3 (r = 0.56) as well as with SF-12 self-perceived health (r = 0.61) and the HUI overall health rating (r = 0.70) [94].

Construct validity (convergent/divergent): Johnson & Coons [92] reported EQ mobility, self-care, usual activities and pain related to SF-12 PCS (0.12–0.41) but not to SF-12 MCS (0.02–0.04) while EQ anxiety and depression related to SF-12 MCS (0.40) but not to SF-12 PCS (0.03); Lubetkin & Gold [94] reported EQ-5D mobility correlated more with HUI ambulation (0.59) and SF-12 PCS (0.49) than to HUI emotion (0.23) or MCS (0.24); the EQ-5D pain/discomfort dimension showed a similar pattern of correlation and EQ-5D anxiety/depression more strongly correlated with MCS (0.48) and HUI emotion (0.55) than SF-12 PCS (0.23) or HUI ambulation (0.22).

Responsiveness

Examination of distribution of VAS ratings and EuroQol utility scores did not suggest problems with ceiling or floor effects [56]; examination of distribution of baseline scores prior to admission to early discharge programme revealed minimum (no problem) scores in excess of 20% of patients on all dimensions but usual activities; maximum scoring occurred in 47% of patients on the usual activities dimension (all other dimensions < 20%); distribution of EQ utility scores and VAS ratings revealed no floor or ceiling effects [47]; in general population, 85% ceiling effect reported (rating of no problem) for self-care, mobility and usual activity dimensions and 58.5% for pain dimensions; VAS ceiling effect less pronounced [92]; Brazier et al. [46] reported no floor or ceiling effects. Hurst et al. [90] reported significant change on EQ-profile among RA patients reporting improvement in all domains (p < 0.05) except anxiety/depression over a 3-month period, SRM for EQ-utility and EQ-VAS = 0.71 and 0.70 respectively; EQ-5D scores showed significant improvement at 1, 3 and 12 months post-treatment (p < 0.01) [93].

Tested for stroke patients?

Yes [43, 44, 48, 56, 78, 91].

(continued)

Table IV. (continued)

Other formats

N/a

Use by proxy?

Test-retest reliability for EuroQol when completed by proxy was κ = 0.31 (mobility) to 0.61 (pain and usual activities), VAS rating of overall health, ICC = 0.74 and the assigned utility, ICC = 0.8 [44].

Inter-rater reliability: Dorman et al. [48] reported agreement between scores on self-completed EQ vs. proxy-completed EQ (ICC = 0.53) for the VAS rating and for interview-completed EQ vs. proxy-completed (ICC = 0.32). Agreement on domains of the EuroQol was reported as κ = 0.38 (anxiety/depression) to κ = 0.57 (mobility and self-care) for self-completed vs. proxy-completed EQ; when EQ was completed by interview, κ = 0.05 (depression/anxiety) to κ = 0.62 (self-care). Overall agreement between patient-completed EQ (self-completed or interview) vs. proxy, κ = 0.30 (depression/anxiety) to κ = 0.64 (self-care); agreement between subject and proxy responses on 5 domains ranged from κ = 0.099 (pain/discomfort) to 0.601 (mobility) at baseline, κ = 0.439 (anxiety/depression) to 0.529 (self-care) at one-month follow-up and κ = 0.264 (anxiety/depression) to 0.598 (usual activities) at 4 months; agreement on the index ranged from ICC = 0.447 at baseline to 0.581 at 4 months; VAS rating agreements ranged from ICC = 0.221 to 0.498. Comparison of index change scores based on proxy vs. subject responses yielded ICC = 0.287 (1 month vs. baseline) to 0.504 (4 month vs. baseline); similar comparison of VAS rating change scores yielded ICC = 0 to 0.044 [95].

countries are available for the SF-36 [27]. The component scores have also been standardized with a mean of 50 and standard deviation of 10 [12]. The SF-36 questionnaire can be self-completed or administered either in person or over the telephone by a trained interviewer. It is considered simple to administer and takes less than 10 min to complete [52].
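The two rescaling steps mentioned for the SF-36 (the linear transformation of a summed subscale score onto a 0–100 scale, and the standardization of component scores to a mean of 50 and standard deviation of 10) can be sketched as follows. This is a minimal illustration under stated assumptions, not the published scoring algorithm: the function names, the item range and the reference mean and standard deviation are all hypothetical, and the weighting used to derive the component scores from the subscales is not shown.

```python
def transform_0_100(raw_sum, lowest_possible, highest_possible):
    """Linearly rescale a summed subscale score onto a 0-100 scale."""
    if highest_possible <= lowest_possible:
        raise ValueError("highest_possible must exceed lowest_possible")
    if not lowest_possible <= raw_sum <= highest_possible:
        raise ValueError("raw_sum lies outside the possible range")
    score_range = highest_possible - lowest_possible
    return (raw_sum - lowest_possible) / score_range * 100.0


def standardize(score, ref_mean, ref_sd, target_mean=50.0, target_sd=10.0):
    """Re-express a score relative to a reference population so that the
    reference population has mean 50 and standard deviation 10."""
    if ref_sd <= 0:
        raise ValueError("ref_sd must be positive")
    return target_mean + target_sd * (score - ref_mean) / ref_sd


# Hypothetical subscale: 10 items, each scored 1 to 3, so sums run 10 to 30.
print(transform_0_100(25, 10, 30))    # 75.0

# Hypothetical reference population with mean 70 and SD 20.
print(standardize(60.0, 70.0, 20.0))  # 45.0
```

A standardized score below 50 then reads directly as below the reference population mean, which is the practical appeal of this form of reporting.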
Permission to use the instrument should be obtained from the Medical Outcomes Trust, which oversees the standardized administration of the SF-36 and will provide updates on administration and scoring [27]. The measurement properties of the SF-36 are summarized in Table V.

Advantages

The SF-36 is simple to administer. Either form of administration takes less than 10 min to complete [53]. As a self-completed, mailed questionnaire, it has been shown to have reasonably high response rates (83% [54, 55]; 75%–83% [44]; 85% [56]; 82% overall and 69% for those over age 85 [57]).

Limitations

Higher rates of missing data have been reported among older patients when using a self-completed form of administration [46, 53, 54]. O'Mahoney et al. [55] found item completion rates to range from 66% to 96%. At the scale level, complete data collection (amount required to compute a scale score) ranged from 67% (role limitations-emotional) to 97% (social functioning). Walters et al. [57] reported scale completion rates among community-dwelling older adults ranging from 86.4% to 97.7%, with all eight scales being calculable for 72% of respondents. Dorman et al. [56] reported a proportion of missing data on the scale level ranging from 2% (social functioning) to 16% (role functioning-emotional). Given the lack of data completeness found, postal administration of the SF-36 may not be appropriate for use among older adults. O'Mahoney et al. [55] suggested that data completeness may be indicative of respondent acceptance and understanding of the survey. Hayes et al. [53] noted that the most common items missing on the self-completed questionnaire referred to work or to vigorous activity. Older respondents identified these questions as pertinent for much younger people and not relevant to their own situation.
In a qualitative assessment of the physical functioning and general health perceptions dimensions of the SF-36, Mallinson [58] noted that the participants, who were all over the age of 65, tended to display signs of disengagement from the interview process and some participants expressed concern relating to the relevance of the questions. There was also considerable variation noted in subjective interpretation of items and most subjects used qualifying, contextual information to clarify their responses to the interviewer. As Mallinson points out, such individual issues of subjective meaning and context are lost when the questionnaire is scored. The SF-36 does not lend itself to the generation of an overall summary score. In scales using summed Likert scale scores, information contained within individual responses is lost in the total score; that is, any given total score can be achieved in a variety of ways from individual item responses [56]. Hobart et al. [59] examined the use of the two-dimensional model, which consists of a mental health component (MCS) and physical health component (PCS). These two component scales could account for only 60% of the variance in SF-36 scores, suggesting a significant loss of information when the 2-component model is used. The level of test-retest reliability reported in stroke populations indicates that the SF-36 may not be adequate for serial comparisons of individual patients, but rather should be used for large group comparisons only [44]. Weinberger et al. [60] also

Table V. Measurement properties of the Medical Outcomes Study Short Form 36.

Reliability

Test-retest reliability: Brazier et al. [54] calculated correlation coefficients ranging from 0.6 (social functioning) to 0.81 (physical functioning). Mean differences ranged from 0.15 (social functioning) to 0.71 (mental health) with 91–98% of cases falling into the 95% CI (constructed as per Bland & Altman) [96]; lower values reported in stroke population, 0.28 (mental health) to 0.80 (social functioning), with substantial variability reported in individual responses, particularly for role limitations-emotional [44]; Brazier et al. [46] reported r = 0.28 (social functioning) to 0.70 (vitality) over a retest period of 6 months.

Internal consistency: Brazier et al. [54] reported α ≥ 0.80 for all subscales but social functioning (α = 0.73). Reliability coefficients = 0.74 (social functioning) to 0.93 (physical functioning); Anderson et al. [97] reported α of 0.6 (vitality) to 0.9 (physical functioning, bodily pain and role limitations-emotional). Four scales fell below 0.80; Brazier et al. [46] reported α ≥ 0.80 for all subscales but social functioning (0.56) and general health (0.66); inter-item correlations ≥ 0.73 with the exception of social functioning (0.56) and general health (0.66); Essink-Bot et al. [45] reported α = 0.76 (general health) to 0.91 (physical functioning); Hobart et al. [98] found α of 0.68 (general health) and 0.70 (social functioning) to 0.90 (physical functioning). Correlations between 8 scales were lower than the reported alpha coefficients; Hobart et al. [98] found item-own exceeded item-other correlations by > 2.5 SE for 6 of 8 scales; social functioning scale and general health scale did not (i.e. limited ability to distinguish constructs); Walters et al. [57] reported α ≥ 0.80 for all scales but social functioning (α = 0.79).

Validity

Construct validity (known groups): Patients diagnosed with ≥ 1 chronic physical problems had lower scores on all dimensions of the SF-36 except mental health than healthy age-matched controls (p < 0.001). SF-36 scores distributed as expected for sex, age, social class and use of health services [54]; SF-36 distinguished between groups based on functional dependence vs. independence based on BI scores (p < 0.05 on all scales) and between groups based on mental health vs. ill-health defined by GHQ-28 scores (p < 0.05 on all scales) [97]; Mayo et al. [99] reported SF-36 scores discriminated stroke survivors from age and gender-matched controls; Williams et al. [84] found the SF-36 unable to discriminate between groups based on patient self-report ratings of overall HRQOL (same, a little worse or a lot worse than prestroke). SF-36 discriminated between age groups (< 75 years vs. 75+) on physical functioning, vitality and change in health subscales (p ≤ 0.006) and between groups based on setting (general practice vs. hospital outpatients) on the physical function and role functioning-physical subscales (p = 0.16) [53]; Essink-Bot et al. [45] reported SF-36 able to discriminate between migraine sufferers and controls on all subscales (p < 0.01; ROC/AUC = 0.54–0.67) and between groups of migraine sufferers based on absence from work (0 vs. ≥ 0.5 days; p < 0.01, ROC/AUC = 0.61–0.79); Brazier et al. [46] reported SF-36 scores distinguished groups based on recent visits to GP, hospital inpatient stays and longstanding illness (p < 0.05).

Construct validity: Walters et al. [57] reported significant relationships in hypothesized directions to support construct validity among older adults: scores in all scales were reported to decrease as age increased (p < 0.001); women reported worse health than men on all scales even after adjusting for age (p < 0.001); respondents who had recently visited their physician reported poorer health on all scales (p < 0.001) and people living alone had lower scores (p < 0.001) except on general health (p = 0.02).

Convergent and discriminant validity: Correlations of −0.41 (social functioning vs. social isolation) to −0.68 (vitality vs. energy) between similar scales on the SF-36 and the Nottingham Health Profile were reported. Correlations between dimensions less clearly related ranged from −0.18 (physical functioning vs. emotional reaction) to −0.53 (social functioning vs. emotional reactions) [54]; Anderson et al. [97] reported BI scores (in stroke survivors) strongly associated (p < 0.001) with physical functioning and general health; mental health on the GHQ-28 most strongly associated (p < 0.001) with the social functioning, role limitations-emotional and mental health scales of the SF-36; Dorman et al. [56] reported SF-36 physical functioning subscale correlated most closely with mobility, self-care and activities domains of EuroQol (r = 0.57, 0.65 and 0.63) and less strongly with the EuroQol psychological domain (0.34); SF-36 bodily pain correlated with EuroQol pain domain (r = 0.66) and moderately with all EuroQol domains; role functioning-emotional correlated most closely with EuroQol psychological domain (r = 0.43) and least with EuroQol self-care (r = 0.24); SF-36 mental health was not closely related to the psychological domain (r = 0.21) or to physical EuroQol domains (r = 0.06–0.10); SF-36 general health correlated with EuroQol overall HRQOL rating (r = 0.66); Lai et al. [100] reported r = 0.55 between SF-36 physical functioning scale and BI.

Predictive validity: McHorney [101] examined data from the Medical Outcomes Study and reported the general health perceptions scale to be most predictive of death (death rate of patients in lowest quartile for SF-36 general health scale was three times greater than for patients with SF-36 scores in the highest quartile), followed by scores in physical functioning. Baseline physical functioning, role functioning-physical and pain scales were most predictive of hospitalizations, and pain, general health and vitality were most predictive of physician visits.

Responsiveness

Via item mapping, social functioning subscale provided limited assessment of number and difficulty of activities and demonstrated marked ceiling effects (up to 60% for MRS grade 0); SF-36 physical function scale reported to have floor effects of 37% and 100% for patients with MRS grades 4 and 5 [100]; large ceiling effects reported for the role limitations-physical (53%), bodily pain (43%), social functioning (67%) and role limitations-emotional scales (72%); no floor effects over 7% were reported; scores for SF-36 physical functioning scale more uniformly distributed than BI scores, suggesting lower floor and ceiling effects than the BI [97]; Brazier et al. [46] reported floor effects in excess of 25% for role limitations-physical and emotional and ceiling effects > 25% for social functioning and role limitations-emotional and physical.

(continued)