RATING SCALES FOR NEUROLOGISTS


J Hobart
J Neurol Neurosurg Psychiatry 2003;74(Suppl IV):iv22-iv26

Correspondence to: Dr Jeremy Hobart, Department of Clinical Neurosciences, Peninsula Medical School, Derriford Hospital, Plymouth PL6 8DH, UK; Jeremy.Hobart@phnt.swest.nhs.uk

A neurologist once told me that he found the subject of rating scales exceedingly dull, while another found the area abstruse. I have therefore attempted to produce an overview that is helpful and conveys some of the basic principles underlying outcomes measurement and rating scales. Clinicians must realise that because this is an alien and somewhat dry area, they may need to invest some time to appreciate the issues. Instead of discussing specific scales or rating scales for rehabilitation, which will only be relevant to a limited audience, I have chosen to discuss the importance of rating scales and how to achieve high quality measurement. I hope this makes the text more widely applicable to the neurological community. The take home message is simple: neurologists need to take their rating scales very seriously.

WHY ARE RATING SCALES IMPORTANT?
Rating scales are important because they are a method of measurement. Measurement is important because inferences are based on it [1]. For example, in clinical trials we measure variables (for example, disability), perform statistical tests on the numbers generated by scales, and base conclusions on the results. These conclusions influence patient care, prescribing, policy making, and the expenditure of public funds. Thus, the validity of inferences from clinical trials is directly dependent on the quality of the measurement instruments used. Some measurements are clear cut, for example, mortality rates. However, measurement becomes complex for more abstract, ill defined, soft outcomes such as patients' perspectives of the impact of disease and their quality of life.
If we are serious about using these abstract variables to evaluate clinical practice we must be serious about our attempts to measure them as rigorously as possible.

Consider clinical trials of interferons and glatiramer acetate in multiple sclerosis (MS). These trials have produced interesting results: an incontrovertible reduction in relapse rate and accumulation of magnetic resonance image (MRI) lesions over time, but a debatable reduction in the progression of physical disability. These findings have prompted major developments including: research to understand the seemingly complex relation between pathology and disability; calls for definitive studies that are free from pharmaceutical conflicts of interest; a controversial review by the National Institute for Clinical Excellence; and the UK risk sharing scheme for prescribing disease modifying therapies. Despite these major developments, few have questioned seriously how the choice of rating scale may have influenced this course of events. This is highly relevant because these major developments in MS are effectively based on the assumption that Kurtzke's expanded disability status scale (EDSS), the rating scale used in most MS treatment trials, is adequate for the task of measuring disability and detecting clinically significant change when it occurs.

Unfortunately, data concerning the measurement properties of the EDSS do not give us that confidence. First, the EDSS mixes the measurement of different health domains: impairment in the early part of the scale, mobility in the mid range of the scale, and bulbar function in the upper part of the scale. As such, the EDSS is not a pure disability measure. While this may not seem such a big deal, it is akin to having a scale that measures length at one end, weight in the middle, and volume at the other end. Second, the EDSS generates ordinal scores rather than interval measures. More about that later.
Third, the EDSS has been proven less able than other disability scales to differentiate between individuals at one point in time and detect change in disability over time. These facts undermine the validity of inferences made on the basis of the analysis of EDSS scores. Consequently, we are at risk of making inaccurate inferences about disability in MS every time we use the EDSS. The use of different statistical methods to analyse results acquired from rating scales cannot overcome flaws within the rating scale itself.

J Neurol Neurosurg Psychiatry: first published as 10.1136/jnnp.74.suppl_4.iv22 on 1 December 2003. Downloaded from http://jnnp.bmj.com/ on 1 October 2018 by guest. Protected by copyright.

DOES THE CHOICE OF RATING SCALE REALLY MAKE THAT MUCH DIFFERENCE?
The above discussion labours the point that inferences from studies are dependent on the quality of the rating scales used. However, surprisingly few studies have taken the next step to determine the implications for clinical trials of the choice of rating scale. This supports the suggestion that clinicians poorly appreciate the limitations of rating scales, perhaps because measurement in laboratory disciplines presents few inherent difficulties.

Treatment studies in MS provide illustrations that the validity of inferences made from all clinical studies is dependent on the quality of the measurement instruments used. Cohen et al [2] compared the EDSS with the MS Functional Composite in a pivotal study of interferons. We compared six disability measures, including the EDSS, in steroid treatment for MS relapses [3]. Both studies demonstrated that the statistical and clinical significance of the results and, therefore, inferences made about the treatment effectiveness, depended on which scale was used.

Some authors have played down the importance of rating scales in clinical trials, suggesting that trial design, in particular randomisation and blinding, is more important. Maximising trial design will not overcome the problems caused by weak scales, and vice versa. Attention to rigor is needed in both arenas.

WHAT TYPES OF SCALES ARE THERE?
Tables 1 and 2 show two rating scales, the Ashworth scale for measuring spasticity, and the MS walking scale for measuring patients' perceptions of the impact of MS on walking ability (MSWS-12).

Table 1 Ashworth scale of spasticity
0      No increase in tone
1      Slight increase in tone giving a catch when the limb is moved in flexion or extension
[1+    Slight increase in tone, manifested by a catch, followed by minimal resistance throughout the remainder (less than half) of the range of movement]
2      More pronounced increase in tone but the limb easily flexed
3      Considerable increase in tone, passive movement difficult

Table 2 The multiple sclerosis walking scale (MSWS-12)
These questions ask about limitations to your walking due to MS during the past two weeks.
For each statement, please circle the one number that best describes your degree of limitation.
Please answer all questions even if some seem rather similar to others, or seem irrelevant to you.
If you cannot walk at all, please tick this box [ ]

Response options: 1 = Not at all, 2 = A little, 3 = Moderately, 4 = Quite a bit, 5 = Extremely
In the past two weeks, how much has your MS:
 1. Limited your ability to walk?  1 2 3 4 5
 2. Limited your ability to run?  1 2 3 4 5
 3. Limited your ability to climb up and down stairs?  1 2 3 4 5
 4. Made standing when doing things more difficult?  1 2 3 4 5
 5. Limited your balance when standing or walking?  1 2 3 4 5
 6. Limited how far you are able to walk?  1 2 3 4 5
 7. Increased the effort needed for you to walk?  1 2 3 4 5
 8. Made it necessary for you to use support when walking indoors (e.g. holding on to furniture, using a stick, etc)?  1 2 3 4 5
 9. Made it necessary for you to use support when walking outdoors (e.g. using a stick, a frame, etc)?  1 2 3 4 5
10. Slowed down your walking?  1 2 3 4 5
11. Affected how smoothly you walk?  1 2 3 4 5
12. Made you concentrate on your walking?  1 2 3 4 5
(c) 2000 Neurological Outcome Measures Unit, Institute of Neurology, University College Hospital.

The two scales are very different. The Ashworth is an example of a single item scale. It considers spasticity as a continuum on which each of the five defined levels has a specific meaning (for example, 1 = "catch"). Other examples of single item scales are the EDSS and Rankin scale. In contrast, the MSWS-12 is an example of a multi-item scale. It has 12 questions each with a range of response options, and scores are summed across items to generate a summed or total score. Walking ability is therefore measured on a continuum with 49 possible levels (total scores of 12 to 60). The theory underpinning multi-item scales is that when we are attempting to measure complex clinically relevant domains (for example, disability and quality of life) a single item is unlikely to represent well the broad scope of that domain. In addition, each level of the Ashworth scale is open to individual variation of interpretation (that is, random error).

While each item of a multi-item scale contributes unique information, it is impractical clinically and analytically to allow each item to act as a rating scale. Consequently we seek to combine items to allow what they share in common to dominate the ways in which they differ. Furthermore, combining across items cancels out the unavoidable random error associated with each single item, hence reliability is often high. It goes without saying that we must prove it is appropriate to combine a set of items to generate a total score. This is rarely done. More about that later.

Single and multi-item scales contrast in their interpretability and scientific rigor. Single item measures are easy to interpret as each level determines a specific meaning. This is very meaningful clinically. For example, an EDSS of 6.5 means a person can walk about 20 m using bilateral assistance. In contrast, multi-item measures are less interpretable clinically as any person's score represents the sum of the item scores, and any sum (except the minimum and maximum) can be achieved by a variety of permutations. From a clinical perspective this creates problems with interpretation as a value of, say, 54 is somewhat intangible. It is, therefore, entirely understandable why clinicians find single item measures more meaningful and therefore lean towards them.

Single item scales are weak measures. They have poor reliability, poor validity, and limited ability to detect differences between individuals at one point in time and detect change over time. A couple of analogies may help to explain this situation that some clinicians find paradoxical. Consider the introduction of a compulsory multiple choice examination for neurology trainees! The aim is to measure the examinee's level of neurological knowledge. If the exam has one question the results will be heavily influenced by the examinee's knowledge of that specific topic area rather than their overall neurology knowledge. The more questions asked, and aggregated, the better measure we get of that person's knowledge, provided the questions have a reasonable coverage of the subject matter. Another analogy is the Barclaycard Premiership. The league seeks to determine the best football team in the land, so it is a measure of ability. This season Manchester United (finished top) drew 1-1 with Sunderland (finished bottom). That single game was not a reliable or valid indicator of the relative difference in footballing ability of the two teams. From these analogies it is hopefully easier to appreciate why single item scales are likely to be unreliable (subject to random error) and poorly valid (a limited indicator of neurology knowledge). Can we afford these scientific weaknesses in our clinical trials?

HOW DO I CHOOSE THE BEST SCALE FOR MY PURPOSE?
Clinicians often have to choose one scale from among many potential candidates. Unfortunately, no one scale exhibits all desirable qualities, different scales have different virtues, and scales that are useful for one situation may not be useful for others. Therefore, a scale must be selected for a particular purpose. To do this, scale users must be able to choose measures intelligently based on their needs. Rating scales must be clinically useful and scientifically sound. Clinical usefulness refers to the successful incorporation of an instrument into clinical practice and its appropriateness to the study sample. Scientific soundness refers to the demonstration of reliable, valid, and responsive measurement of the outcome of interest.
Clinical usefulness does not guarantee scientific soundness, and vice versa.

I will concede that diatribes on reliability and validity testing are dull. There are also many publications on evaluating psychometric properties and these are regularly updated as the field moves forward [4]. Here, then, I simply make a few key statements.

Explore beyond the title of a scale. For example, consider the Rankin scale which is called a handicap measure. It seems curious that the six levels mention symptoms (0 = no symptoms) and disability (1 = slight disability; 2 = mild disability; 3 = moderate disability; 4 = moderately severe disability; 5 = severe disability) but not handicap.

Be very clear about what you want to measure. There is a current vogue to use quality of life as a primary outcome for clinical trials. But there are many definitions of quality of life. Also, quality of life may not be the most appropriate variable to measure. The more distant the outcome chosen is in relation to the aim of the intervention, the greater the chance of confounding. For example, hip replacement is often undertaken to relieve pain. Should we be disappointed, or critical, if the effect on psychosocial functioning is far less dramatic?

Studying the distribution of scale scores in samples is simple and a very valuable basic test to determine whether a scale will be useful in that sample. Although this does not provide evidence for reliability and validity per se, targeting of scales to samples is important as ceiling and floor effects (percent scoring maximum and minimum possible scores) represent sub-samples whose scores may be unable to change regardless of the effects of the intervention. This simplest of analyses is rarely undertaken.

Reliability, validity, and responsiveness are, to a large extent, independent psychometric properties. Therefore, they must all be evaluated. There is little value in studying a single property alone even though this is more common than full psychometric evaluations.
Reliability, validity, and responsiveness are sample dependent properties. Hence it is important to study scales in different samples. This is particularly important for generic scales; these are scales that can be used in a wide range of disorders. For example, the medical outcomes study short form 36-item health survey (SF-36) is the most widely used health status measure across the world. It is therefore tempting to use it. However, evidence demonstrates important limitations as an outcome measure for clinical trials in MS, stroke, and amyotrophic lateral sclerosis/motor neurone disease.

One of the best tests of validity is the development method of a scale. If recognised techniques of rating scale construction were used the chances of good reliability and validity are high.

Using a scale in clinical practice or a study will usually provide enough information to make statements about its reliability and validity even though this may not be, or was not intended to be, a psychometric study. Although, obviously, psychometric properties should be tested and demonstrated before a scale is used, this retrospective approach, which enables clinicians to support or refute some of the inferences they make, is better than nothing.

HOW ARE SCALES DEVELOPED?
Developing rating scales is a labour intensive process requiring considerable expertise in health measurement. Therefore, it is advisable to carefully evaluate existing measures before abandoning them. The psychometric properties of available measures can be determined more quickly. Here is an overview of instrument development. Fuller accounts can be found elsewhere.

Multi-item scale development can be considered to have four stages. First, define what you want to measure, which in measurement speak is the construct, and any potential subdivisions of it (the sub-constructs). Second, generate a pool of items so that all important issues are considered for inclusion in the final scale.
Third, administer the item pool to a sample of patients and, from the analysis of the resulting data, develop a scale or scales that are reliable and valid representations of the construct. Finally, examine the full properties of the scales in independent samples.

ARE TOTAL SCORES GENERATED BY SUMMING ITEM SCORES REALLY GOOD MEASURES?
The answer here is yes and no. It depends on the definition of measurement being used, and the goals that we are trying to achieve. This issue is becoming very important, and therefore it is appropriate to consider it. However, things do start to get a bit more complex from here on in.

If we make the assumption that measurement is quantification of a variable, and that variables can go from "less of" to "more of", then it is reasonable to consider that the total score generated by adding up a set of items is a measure of that variable, provided that we have some way of demonstrating that the items address the same underlying construct. This is the basic theory that underpins multi-item rating scales and was discussed earlier.
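To make the idea of a summed score concrete, here is a minimal Python sketch of how a total score for a 12-item scale with response options 1 to 5 (the MSWS-12 format) could be computed, together with the simple floor and ceiling check recommended earlier. The function names and the four-person sample are invented for illustration; this is not the published scoring procedure, only a sketch under those assumptions.

```python
from typing import Sequence, Tuple

N_ITEMS = 12               # the MSWS-12 has 12 items
MIN_RESP, MAX_RESP = 1, 5  # each item is scored 1 (not at all) to 5 (extremely)


def total_score(responses: Sequence[int]) -> int:
    """Sum the 12 item responses into a raw total (possible range 12 to 60)."""
    if len(responses) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} responses, got {len(responses)}")
    if any(r < MIN_RESP or r > MAX_RESP for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")
    return sum(responses)


def floor_ceiling(totals: Sequence[int]) -> Tuple[float, float]:
    """Return the percentage of a sample at the minimum and at the maximum possible total."""
    if not totals:
        raise ValueError("sample is empty")
    lowest, highest = N_ITEMS * MIN_RESP, N_ITEMS * MAX_RESP
    pct_min = 100.0 * sum(t == lowest for t in totals) / len(totals)
    pct_max = 100.0 * sum(t == highest for t in totals) / len(totals)
    return pct_min, pct_max


# Hypothetical sample of four respondents (made-up data, for illustration only)
sample = [
    [1] * 12,                               # no limitation on any item
    [5] * 12,                               # extreme limitation on every item
    [3, 4, 4, 2, 3, 3, 4, 2, 3, 4, 3, 3],
    [5] * 12,
]
totals = [total_score(r) for r in sample]
print(totals)                 # [12, 60, 38, 60]
print(floor_ceiling(totals))  # (25.0, 50.0)
```

In this invented sample half of the respondents already sit at the maximum possible score, so the scale could not register any further change in that direction for them, which is exactly the targeting problem that a look at the score distribution is meant to reveal.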

Consider the MSWS-12. Our aim was to measure the impact of MS on walking ability. The variable (construct) we wished to measure was walking ability. By interviewing patients and clinicians we got a set of statements on how MS affected walking ability. When redundant statements were removed we were left with n = 12. A response option was written so that the impact of MS on each item could be graded. This potential scale was sent to a large group of people with MS and the resulting data analysed to determine if it was appropriate to combine the scores of the 12 items to generate a total score and if the total score was reliable (reproducible) and valid (evidence that it was an indicator of walking ability). Evidence for this is presented in the development paper.

While this all seems reasonable, we cannot get away from the fact that summed scores make a series of assumptions that do not hold. First, the response categories for each item are given sequential integers (1, 2, 3, 4, 5). This assumes equal differences between the different levels. This is not the case, logically or empirically. Second, "quite a bit" is assumed to be more than "moderately". There is evidence that a substantial proportion of the population think "moderately" is more than "quite a bit". Third, the use of total scores assumes that given differences have equal meaning. That is, a score difference of 10 points has the same meaning across the scale range. There are clear demonstrations that this is not true.

Over the last few hundred years mathematicians, physicists, psychologists, measurement theorists, philosophers, and others have articulated what they mean by measurement. Measurement in the physical sciences, termed fundamental measurement by the physicist Norman Campbell, has been defined as having five main characteristics: unidimensionality, linearity, sample independence, scale independence, and invariance. Consider a ruler for measuring height.
The ruler describes only one attribute (unidimensionality), which it measures on a linear continuum (that is, the differences between the calibrations are equal). The ability of a ruler to measure height is not seriously affected by the people being measured (sample independent). It does not matter which ruler is used to measure height (scale independent). The process of measurement remains the same at different areas of the continuum (invariance). Campbell suggested that measurement in the social sciences (effectively anybody using rating scales as measurement instruments) could not be called a science until it achieved these characteristics.

Now when we use a rating scale, and sum the item scores to get a total score, it is difficult to be certain that we are measuring a single construct. We have not proven that the distance between units is stable. We know the properties of scales are sample dependent and the measurement of people is scale dependent. In short, we have not achieved, and cannot achieve, measurement as defined by others, and certainly not the type of measurement achieved in the physical sciences. When we think about it further, rating scales are merely counts of discrete events. But this is the only format in which we can get such data and thus it is what we have to work with. It is clear then that something must be done to rating scale data before we can consider total scores as measures that satisfy the characteristics stated by measurement theorists across the years.

HOW DO WE ACHIEVE LINEAR MEASURES FROM SUMMED SCORES?
This brings us into the domain of new psychometric methods: Rasch analysis (RA) [5] and item response theory (IRT) [6]. These are statistical techniques that can be applied to rating scale data. They attempt to transform ordinal scores, which are scale dependent and of limited accuracy, into interval measures that are scale independent and suitably accurate for individual patient assessment. In essence, these methods model the probability of an individual's response to an item.
They are based on a logical assumption: individuals with high levels of whatever is being measured (for example, physical function) should have an increased probability, relative to individuals with low levels, of getting a better score on any item (for example, dressing). Technically this gets very complex but it is important to consider the clinical benefits.

There is a huge potential for new psychometric methods to change the face of health outcomes measurement. Using linear measures instead of non-linear raw scores would give a true reflection of disease impact, differences between individuals and groups, and treatment effects. The value of this is highlighted by studies demonstrating that raw score changes underestimate interval level change by up to ninefold [7]. Improved accuracy would enable individual patient assessment.

The ability to generate interval measures that are independent of the rating scale used enables scales measuring the same health construct to be equated on the same linear ruler. This is the basis for comparisons of studies, meta-analyses, and systematic reviews. Moreover, the process of scale equating generates a pool of commonly calibrated items, an item bank. Item banks are flexible measurement methods because any subset of items can be selected from the bank to generate an accurate score. Therefore, investigators are no longer wedded to defined scales and can simply select the most appropriate group of items for their study. Alternatively, of course, they could choose a defined scale if they wish.

The availability of item banks opens the way for the most exciting development in health measurement, computerised administration of rating scales (computer adaptive testing). Here, a computer uses the response to an item to determine the next item presented to the respondent. As a result, the optimum items for any individual are identified, thus providing rapid, efficient, user friendly, and precise individual person measurement.
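The two ideas above, a probability model for item responses and a computer choosing the next item, can be sketched in a few lines of Python. This is a simplified illustration, not the procedure used in any cited study: it assumes the dichotomous Rasch model (each item is simply passed or failed), whereas scales with several response options need a polytomous extension, and the item names and difficulty values in the bank are invented. Choosing the unasked item with the greatest statistical information at the current ability estimate is one common selection rule, assumed here for simplicity.

```python
import math
from typing import Dict, Set


def rasch_probability(ability: float, difficulty: float) -> float:
    """Dichotomous Rasch model: P(success) = exp(a - d) / (1 + exp(a - d)), in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def item_information(ability: float, difficulty: float) -> float:
    """Information a dichotomous Rasch item carries at a given ability: p * (1 - p)."""
    p = rasch_probability(ability, difficulty)
    return p * (1.0 - p)


def next_item(ability: float, bank: Dict[str, float], asked: Set[str]) -> str:
    """Pick the unasked item that is most informative at the current ability estimate."""
    candidates = {name: d for name, d in bank.items() if name not in asked}
    if not candidates:
        raise ValueError("no items left in the bank")
    return max(candidates, key=lambda name: item_information(ability, candidates[name]))


# Hypothetical item bank: names and difficulties (in logits) are invented.
bank = {"dress": -2.0, "climb_stairs": 0.0, "run": 2.5}

# A person one logit above an item's difficulty succeeds about 73% of the time.
print(round(rasch_probability(1.0, 0.0), 3))          # 0.731

# With an ability estimate of 0.3 logits, the best-targeted item is asked first.
print(next_item(0.3, bank, asked=set()))              # climb_stairs
print(next_item(0.3, bank, asked={"climb_stairs"}))   # run
```

Because the information function peaks where item difficulty matches the person's level, the selection rule keeps presenting well-targeted items, which is one way to see how a handful of adaptively chosen items might measure as precisely as a much longer fixed questionnaire.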
Computer adaptive testing offers the opportunity to bring patient based outcome measurement into routine clinical practice and influence decision making for individual patients. Currently this does not happen.

The last few years have seen the application of new psychometric methods. Most studies have analysed existing scales. However, there is evidence that health measures can be successfully equated. Computer adaptive administration of a calibrated item pool for the impact of headache has been shown to generate rapid (five items or fewer) person measurement. These measurements are as precise as those generated by the entire item pool (54 items) and suitable for individual patient clinical decision making.

Given the clinical potential of new psychometric methods it is curious that they are not more widely available. There are a number of possible explanations for this. First, the area is complex, which naturally attracts scepticism and is off-putting. Complexity can lead to confusion and misunderstandings (see later). Second, PC based software for undertaking Rasch and IRT analyses has only become available in the last few years.

Perhaps the most important factor impeding progress in the field of new psychometric methods is misunderstanding of the similarities and differences between RA and IRT. The two statistical methods are consistently considered as members of the same family, and usually termed IRT. This is probably because of their theoretical and mathematical similarities. However, RA and IRT differ at the most fundamental level: the philosophy underpinning their development [8]. The Rasch model is a definition of measurement, a mathematical derivation from the requirement that stable linear measures be constructed from the ordered qualities of rating scale data. Therefore, the aim of a Rasch analysis is to determine the extent to which observed rating scale data satisfy this stringent definition. Stable linear measures can be constructed only when the data satisfy the model. Therefore, we seek data that fit the model. In stark contrast, IRT models were developed to explain data. Therefore, the aim of an IRT analysis is to seek a model that fits the data.

In his recent article, Massof compares and contrasts RA and IRT, explaining this fundamental difference in detail, demonstrating the limitations of IRT, and the importance of Rasch [9]. Massof demonstrates, empirically, that only the Rasch model enables investigators to achieve measurement, as described by measurement theorists, from rating scale data. He demonstrates that IRT models are not valid measurement models. It seems surprising that this fundamental difference between RA and IRT is not highlighted in any of the articles in a recent supplement of Medical Care devoted to IRT. The two methods represent different paradigms with different research agendas. They are, therefore, incompatible [8].

CONCLUSION
Developments in basic neuroscience are generating new treatments that need to be evaluated and compared. The emphasis is that these evaluations be done from the patients' perspectives.
Unless high quality rating scales are available we run the risk of making inaccurate inferences from clinical trials. However, this challenge is not as daunting as it may appear because techniques are available to achieve, from rating scale data, the type of measurement taken for granted in the basic sciences. It is time that clinicians recognised that fact, insisted on better measures, and encouraged investment in measurement research.

ACKNOWLEDGEMENTS
I am grateful to Professors Alan Thompson (Institute of Neurology) and Benjamin Wright (University of Chicago) for their input to my work in this area. Some of this work was supported by the NHS Health Technology Assessment Programme, but the views and opinions expressed do not necessarily reflect those of the NHS Executive.

REFERENCES
1 Bond T, Fox C. Applying the Rasch model: fundamental measurement for the human sciences. New York: Lawrence Erlbaum Associates, 2001. An important text for those interested in Rasch technology, although it is from the perspective of educational and psychological measurement.
2 Cohen JA, Cutter GR, Fischer JS, et al for the IMPACT Investigators. Use of the multiple sclerosis functional composite as an outcome measure in a phase 3 trial. Clinical Trial Arch Neurol 2001;58:961–7.
3 Hobart JC, Riazi A, Lamping DL, et al. Measuring the impact of MS on walking ability: the 12-item MS walking scale (MSWS-12). Neurology 2003;60:31–6. The paper outlining the development of the MSWS-12.
4 Scientific Advisory Committee of the Medical Outcomes Trust. Assessing health status and quality of life instruments: attributes and review criteria. Quality of Life Research 2002;11:193–205. The most up to date guidelines for evaluating rating scales.
5 Rasch G. Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press, 1960.
6 Lord FM, Novick MR. Statistical theories of mental test scores. Reading, Massachusetts: Addison-Wesley, 1968. References 38 and 39 in this paper are the primary texts for Rasch analysis and item response theory: not for the faint hearted.
7 Wright BD, Linacre JM. Observations are always ordinal: measurements, however, must be interval. Arch Phys Med Rehabil 1989;70:857–60. A clear explanation of why summed scores will not do.
8 Andrich D. Controversy and the Rasch model: a clash of two paradigms. Medical Care (in press). Essential reading for anyone interested in Rasch analysis and item response theory. It explains the similarities and differences and the basis of a long standing ongoing controversy.
9 Massof R. The measurement of vision disability. Optometry and Vision Science 2002;79:516–52. An exceptional documentation of the history of rating scales and a fine demonstration of the limitations of traditional psychometric methods, advantages of new psychometric methods, and differences between Rasch technology and item response theory.

J Neurol Neurosurg Psychiatry: first published as 10.1136/jnnp.74.suppl_4.iv22 on 1 December 2003. Downloaded from http://jnnp.bmj.com/ on 1 October 2018 by guest. Protected by copyright. www.jnnp.com