
A Computer Adaptive Testing Simulation Applied to the FIM Instrument Motor Component

Marcel P. Dijkers, PhD

From the Department of Rehabilitation Medicine, Mount Sinai School of Medicine, New York, NY.
Supported by the National Institute on Disability and Rehabilitation Research, Office of Special Education and Rehabilitative Services, US Department of Education (grant nos. H133N50008, H133N5000027).
No commercial party having a direct financial interest in the results of the research supporting this article has or will confer a benefit upon the author(s) or upon any organization with which the author(s) is/are associated.
Reprint requests to Marcel P. Dijkers, PhD, Mount Sinai School of Medicine, Dept of Rehabilitation Medicine, Box 1240, One Gustave Levy Pl, New York, NY 10029-6574, e-mail: marcel.dijkers@mountsinai.org.
0003-9993/03/8403-6254$30.00/0
doi:10.1053/apmr.2003.50006

ABSTRACT. Dijkers MP. A computer adaptive testing simulation applied to the FIM instrument motor component. Arch Phys Med Rehabil 2003;84:384-93.

Objective: To determine whether computer adaptive testing (CAT) can be used to decrease the number of FIM instrument motor component items administered in assessing persons with spinal cord injury (SCI).

Design: For a CAT simulation, a 3-step algorithm was used to select 6 FIM items for each individual; items were selected according to the subject's motor ability as estimated by 2 initial items. Separate estimates of motor ability for admission, discharge, and follow-up data (plus combined time points) derived from 6 items were compared statistically with estimates derived from 14 items (walking and wheelchair mobility were split).

Setting: Records from the Spinal Cord Injury Model Systems (SCIMS).

Participants: Patients served by the SCIMS, for whom complete motor FIM information was available for rehabilitation admission (N=5969), discharge (N=5964), or follow-up at a first or later anniversary (N=5176).

Interventions: Not applicable.
Main Outcome Measures: Similarity of mean, standard deviation, skewness, kurtosis, and Rasch reliability and separation of persons and items based on 6 and 13 items; intraclass correlation coefficient (ICC) for parallel estimates.

Results: Calibrations for FIM items and FIM steps differed for the 3 time points, but showed sufficient agreement (ICC > .90) that combined calibration was feasible. Means and other distribution characteristics differed minimally between the 6- and 13-item estimates. The person and item separations and reliabilities were somewhat lower and the mean measurement errors somewhat higher for the 6-item estimates, but only marginally so. ICCs between 6- and 13-item estimates were .95 or higher.

Conclusion: CAT can be used to reduce data collection time; the level of precision of estimates is minimally less than that provided by traditional assessment approaches.

Key Words: Outcome assessment (health care); Questionnaires; Rehabilitation; Reproducibility of results; Spinal cord injuries.

© 2003 by the American Congress of Rehabilitation Medicine and the American Academy of Physical Medicine and Rehabilitation

REHABILITATION RESEARCHERS and clinicians are increasingly confronted with the problem that fairly good measures of many patient or subject characteristics exist, but the time available to administer them is limited. A possible solution is individually customized tests containing carefully selected items that provide maximum information on the person to whom they are administered. New methods in fundamental measurement, and the wide availability of computers, facilitate such adaptive testing. This article shows the overall approach of computer adaptive testing (CAT). I used the FIM instrument to show how an algorithm for item selection and administration is developed.
If an instructor were to administer to a class of college students taking a calculus course a midterm test that consisted entirely of third-grade arithmetic problems, 2 predictable results would ensue: all students would get an A, and the instructor would not obtain any useful information about their knowledge of mathematics, either absolutely or relative to each other. Conversely, if a teacher used a college calculus examination as the year-end test for third-grade arithmetic, we would not be surprised to get a class full of disappointed (if not bewildered) students, all with a grade of F, while the teacher would gain no information about how much mathematics knowledge the year's instruction had produced. In situations like these, it is easy to see that the difficulty of a test should be matched to the ability of the test takers, or no useful information will result.

What holds true at the group level applies at the individual level. If the test contains no items uniquely attuned to the performance abilities of each level of test taker, the test takers cannot be differentiated. The arithmetic genius in third grade will get the same grade as the very good students (an A) unless the test contains items that only a mathematical prodigy can handle. The best and most efficient testing, of knowledge as well as of all other characteristics (eg, personality traits, functional ability), is done when the items have a level of difficulty that is approximately at the level of ability (strength of the characteristic, construct, or "latent trait") of the person being measured. 1,2 In that way, each additional item on the test or measure provides the maximum amount of information. If we know each item's difficulty level and we know which items a person passed and which ones he/she failed, a close estimate of the strength of the characteristic of interest is possible.
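The claim that an item is most informative when its difficulty matches the person's ability can be illustrated numerically. The following is a minimal sketch, assuming the dichotomous (pass/fail) Rasch model, under which the information an item contributes is p(1 - p), where p is the probability of passing. The function names and the example difficulty values are illustrative assumptions, not part of the study.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of passing a pass/fail item under the dichotomous Rasch model.

    Both arguments are in logits; only their difference matters.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def item_information(ability, difficulty):
    """Fisher information of one dichotomous Rasch item: p * (1 - p)."""
    p = rasch_probability(ability, difficulty)
    return p * (1.0 - p)


# Hypothetical person of average ability (0 logits) and items of varying difficulty.
ability = 0.0
for difficulty in (-3.0, -1.0, 0.0, 1.0, 3.0):
    info = item_information(ability, difficulty)
    print(f"difficulty {difficulty:+.1f} logits -> information {info:.3f}")
```

Under these assumptions, information peaks at 0.25 when difficulty equals ability and falls off symmetrically for items that are too easy or too hard, which is the calculus-versus-arithmetic example in quantitative form.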
This approach calls for individualized tests, rather than the one-size-fits-all approach typical of classical test theory, in which all subjects are administered exactly the same (lengthy) test or battery of tests. 2 We need an approach in which test difficulty is tailored to the (anticipated) skill level of the test takers. If no a priori basis exists for estimating that level, an algorithm can be used in which everyone starts off with an item of intermediate difficulty. Depending on a subject's success or failure on the first item, an easier or more difficult item is provided, until the appropriate zone of trait strength (eg, knowledge, functional ability, attitude) is reached. Next, the subject is given a short series of additional items or questions at that level. This eliminates the chance of error and establishes that the subject's exact knowledge or ability level has been identified. 3,4

Theoretically, individualized testing can be done with paper-and-pencil tests (1 item administered at a time, the next item selected from a pool with known difficulty level). In practice that approach is impossible, especially when groups of subjects are measured at the same time (eg, school end-of-term examinations). The alternative is CAT, which has a 30-year history in education. 5-9 In the CAT approach, banks of true-false and multiple choice items are developed. Their difficulty level is determined by administering them to large numbers of students of the same grade level(s) as the eventual CAT test takers. Computer programs present these questions 1 at a time on the test takers' computer screens; they respond by using the keyboard. Basing its logic pathway on the result for each item (pass or fail), an algorithm selects for each individual student easier or more difficult questions from the bank, until 1 of 2 conditions is met: (1) a prespecified number of questions has been administered (fixed test-length CAT), or (2) the student's competence level has been determined with a preset level of reliability. In the latter condition, the number of items administered may vary considerably between test takers, depending on the consistency of the test taker and other factors. 10

CAT has several advantages, the 2 most important of which are that total test length can be shortened (typically by 50%, sometimes by more), and that competence levels are known with much greater precision, not just for the middle band of students, but also for students with very low or very high skill (equiprecision).
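The general CAT loop described above, with its 2 stopping conditions, can be sketched as follows. This is an illustration only, assuming pass/fail items scored under the dichotomous Rasch model and a simple "closest difficulty first" selection rule; it is not the algorithm used later in this study. The item bank, the `answer` callback, the clamping bounds, and all function names are hypothetical.

```python
import math


def rasch_probability(ability, difficulty):
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def estimate_ability(responses, iterations=25, bound=6.0):
    """Maximum-likelihood ability estimate (logits) and its standard error.

    responses: list of (item difficulty, 0 or 1). Steps are clamped and the
    estimate is bounded so that all-pass or all-fail records stay finite.
    """
    ability = 0.0
    for _ in range(iterations):
        probs = [rasch_probability(ability, b) for b, _ in responses]
        information = sum(p * (1.0 - p) for p in probs)
        gradient = sum(x - p for (_, x), p in zip(responses, probs))
        step = max(-1.0, min(1.0, gradient / information))
        ability = max(-bound, min(bound, ability + step))
    probs = [rasch_probability(ability, b) for b, _ in responses]
    information = sum(p * (1.0 - p) for p in probs)
    return ability, 1.0 / math.sqrt(information)


def run_cat(item_bank, answer, max_items=None, target_se=None):
    """Administer items until a fixed length or a target SE is reached.

    item_bank: dict mapping item name -> difficulty in logits.
    answer: callable taking an item name and returning 1 (pass) or 0 (fail).
    """
    remaining = dict(item_bank)
    responses, administered = [], []
    ability, se = 0.0, float("inf")  # start at intermediate difficulty
    while remaining:
        if max_items is not None and len(administered) >= max_items:
            break  # condition 1: fixed test length
        if target_se is not None and se <= target_se:
            break  # condition 2: preset precision reached
        # Pick the unused item whose difficulty is closest to the current estimate.
        name = min(remaining, key=lambda n: abs(remaining[n] - ability))
        difficulty = remaining.pop(name)
        responses.append((difficulty, answer(name)))
        administered.append(name)
        ability, se = estimate_ability(responses)
    return administered, ability, se


# Hypothetical bank of 13 items from -3 to +3 logits, and a perfectly
# consistent test taker who passes every item at or below 1.0 logit.
bank = {f"q{i}": -3.0 + 0.5 * i for i in range(13)}
answer = lambda name: 1 if bank[name] <= 1.0 else 0

administered, ability, se = run_cat(bank, answer, max_items=6)
short_items, short_ability, short_se = run_cat(bank, answer, target_se=1.2)
print(administered, round(ability, 2), round(se, 2))
print(short_items, round(short_ability, 2), round(short_se, 2))
```

With the fixed-length rule every test taker answers the same number of items; with the precision rule the number administered depends on how quickly the standard error shrinks, which is why test length can vary between test takers.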
11 The higher test security, instant scoring, and immediate feedback to the student that computers make possible are considered to be part of the secondary benefits of CAT. 3,10,12 CAT is now used outside of education, for instance, in credentialing examinations of health professionals. 13

CAT is only possible because of another innovation that started in education: Rasch analysis of tests. Rasch theory and analysis focus on the individual test item, whereas in classical test theory analysis focuses on the entire test. The classical approach has given us a multitude of mechanisms to evaluate the quality of a test as a whole, such as the Cronbach α. Rasch analysis, in contrast, concentrates on the relation between the latent trait to be measured and the specific items assembled as presumptive indicators of that trait. The goal of the analysis is to determine whether the items all reflect the same dimension, and to find for each item the likelihood of endorsement by people with varying amounts of the trait. The analysis typically results in estimates of item difficulties (termed calibrations), and estimates of the abilities of the persons to whom the test was administered. Rasch analysis also provides diagnostic indices that indicate the degree to which individual items do not belong in the test (eg, a geography question in a math test), and the degree to which individual persons do not belong in the group being tested (eg, the bright child who aces the arithmetic test, except for all questions involving fractions, because he/she was sick when that topic was covered). In recent years, a number of introductory expositions, all nontechnical, of Rasch theory and analysis have been published in the medical rehabilitation literature. 16 For those who have grounding in statistics, there are more advanced accounts. 17,18

Rasch analysis has been used in rehabilitation research, but CAT has not.
The likely reason is that applications of Rasch analysis in rehabilitation have focused on functional assessment. Investigators have used it to analyze the output from functional assessment instruments that were already in wide use before the introduction of these new analytical techniques. These studies 19-21 have shown that the activities of daily living (ADLs) included in the 3 most commonly used measures, the FIM instrument, the Level of Rehabilitation Scale (LRS), and the Patient Evaluation and Conference System (PECS), constitute a single dimension, and that the individual ADLs differ in difficulty level, from the easiest (generally: feeding) to the most difficult (always: climbing stairs). The number of ADLs included in these instruments is generally 15 or fewer, seemingly not enough to make it worthwhile to do complicated analyses to determine whether targeting item selection at the patient's ability level may lead to test shortening. Furthermore, FIM, LRS, and PECS data commonly are the product of a comprehensive assessment at the time of rehabilitation admission or discharge. In the production of such data, issues of clinical relevance, completeness of assessment, and turf division between disciplines are all interwoven, making it unlikely that CAT or a paper-and-pencil analog would be introduced.

However, functional assessment instruments are also used in other situations, specifically in research. The only identified medical rehabilitation application of CAT 22 was a simulation intended to determine whether the motor FIM, as administered by means of a standardized questionnaire as part of research follow-up, can be shortened. The results of that study indicated that with spinal cord injury (SCI) subjects, a motor FIM consisting of 7 items can be used, instead of the standard 13, with only a minuscule loss of reliability.
The time savings from a typical 15- to 20-minute FIM interview may not be in proportion to the reduction in number of items, but even a 5-minute decrease is a gain. The same efficiency improvement may be achieved in studies in which the FIM is administered as a performance test: the subject is required to perform all items, at least in simulated fashion (eg, toileting). A recent study 23 indicated that in an SCI sample such testing took, on average, 60 minutes.

The purpose of the present study was to replicate the earlier simulation study, 22 but using rehabilitation admission and discharge data in addition to follow-up information, and with other modifications to improve on the results and show the possibility of CAT in rehabilitation.

METHODS

Source of Data

The National Spinal Cord Injury Database (NSCID) is a data set developed by the Spinal Cord Injury Model Systems (SCIMS) funded by the National Institute on Disability and Rehabilitation Research. 24,25 Currently, the NSCID has 16 contributing systems; the data set used in the present study includes cases submitted by some Model Systems that are no longer funded. NSCID Form I is used to collect epidemiologic, demographic, medical, and functional information as well as data on initial hospitalization in SCIMS hospital(s) before the patient is definitively discharged to the community. It includes the motor component of the FIM, as rated by therapy staff at admission to and discharge from the rehabilitation hospital or unit. NSCID Form II is completed annually, approximately on the anniversary of injury. It is used to collect information on the person's status as of the anniversary and events occurring in the preceding year. The motor FIM 26 is part of the data collected; the information is self-reported or proxy-reported in an interview by a data collector who uses the FIM Guide 27 or some other standard approach to ask questions that produce the information needed to determine FIM item scores.

Table 1: Results of the Primary (Full-Length) Motor FIM Rasch Analysis: Item Weights and Step Weights

Item Measures
                                Single Time-Point Calibration               Combined Time-Points
                                Admission     Discharge     Follow-Up       Calibration
FIM Item                 Code   Logit  SE     Logit  SE     Logit  SE       Logit  SE
Feeding                  A      1.80   0.01   1.53   0.02   1.41   0.02     1.66   0.01
Grooming                 B      1.35   0.01   1.13   0.01   1.05   0.01     1.23   0.01
Bathing                  C      0.06   0.01   0.15   0.01   0.19   0.02     0.02   0.01
Dressing: upper body     D      0.45   0.01   0.52   0.01   0.56   0.02     0.51   0.01
Dressing: lower body     E      0.37   0.01   0.03   0.01   0.13   0.02     0.15   0.01
Toileting                F      0.38   0.01   0.30   0.01   0.06   0.02     0.25   0.01
Bladder management       G      0.14   0.01   0.10   0.01   0.37   0.02     0.17   0.01
Bowel management         H      0.37   0.01   0.42   0.01   0.47   0.02     0.41   0.01
Bed/chair/WC transfer    I      0.00   0.01   0.20   0.01   0.33   0.02     0.17   0.01
Toilet transfer          J      0.38   0.01   0.17   0.01   0.16   0.02     0.22   0.01
Tub/shower transfer      K      0.54   0.02   0.35   0.01   0.28   0.02     0.37   0.01
Walking                  L1     0.25   0.03   0.51   0.03   0.73   0.04     0.45   0.02
Wheelchair mobility      L2     0.42   0.01   0.76   0.01   1.23   0.02     0.76   0.01
Stairs                   M      1.54   0.03   2.13   0.01   2.58   0.01     2.29   0.01

Step Average Measures
FIM Step                 Score  Admission     Discharge     Follow-Up       Combined
                                Logit         Logit         Logit           Logit
Total assistance         1      1.82          1.76          1.60            1.81
Maximal assistance       2      1.10          0.97          0.71            1.01
Moderate assistance      3      0.65          0.42          0.37            0.54
Minimal assistance       4      0.29          0.06          0.10            0.09
Supervision or set-up    5      0.12          0.45          0.36            0.30
Modified independence    6      0.17          1.16          1.24            1.09
Complete independence    7      0.90          2.19          2.41            2.17

Abbreviation: WC, wheelchair.

Case Selection

The data set available in September 1999 was used. Case selection was performed separately for admission, discharge, and follow-up FIM analyses. Each case included for the admission data analysis had complete (nonmissing) and valid information on all 13 FIM items, as well as valid information on the common mode of locomotion. (The SCIMS database does not adhere to the Uniform Data System rule that a FIM item is scored 1 if the person is not tested; a separate missing code is used.
However, because the admission and discharge data in the present study were derived from clinical files, it is possible that some values of 1 indicate not tested, rather than total assistance.) I followed the same rules for the discharge and follow-up analyses. Motor FIM information was available for 5968 admissions and 5964 discharges. Follow-up information was available for 5176 anniversaries; some persons contributed multiple anniversaries. The analysis for all time points combined included 17,108 records.

Creation of 2 Locomotion Variables

The FIM records, in variable L (locomotion), the person's ability to move about by the mode that he/she most commonly uses (walking, wheelchair, or both). An auxiliary variable indicates for which specific mode information is recorded in the L variable. In the current analysis, item L was split into L1 (walking) and L2 (wheelchair use) based on the value of the auxiliary variable; persons who used both means about equally were assigned the score for walking, because that is the more difficult item. Because walking and wheelchair use are of necessity alternatives to one another, a person who has a valid code for walking is coded missing for wheelchair use, and vice versa.

Primary Rasch Analysis

The rating scale model is a Rasch analysis technique for investigating (1) the difficulty level of a scale's ordinal-level items, (2) the discriminating qualities of the individual items and of the scale as a whole, (3) the fit of each item within the scale, and (4) the spread of items along the dimension they define. 15,18,28,29 Rasch analysis calibrates the scale items according to difficulty level, and the individual calibrations can be combined to form meaningful interval-level measures for precise statistical analysis. Besides providing item difficulty estimates and their standard error (SE), Rasch analysis also produces estimates of each person's ability level, along with an SE for that estimate.
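To make the rating scale model concrete, the sketch below computes the probability of each of the 7 FIM rating categories for one person on one item, using the standard (Andrich) formulation in which all items share one set of step thresholds. The threshold values are hypothetical and chosen only for illustration; they are not the step average measures of table 1, and the function names are assumptions.

```python
import math


def rating_scale_probabilities(ability, item_difficulty, thresholds):
    """Category probabilities under the Andrich rating scale model.

    thresholds: step parameters shared by all items (6 thresholds for the
    7 FIM rating categories). All quantities are in logits.
    Returns one probability per category, lowest category first.
    """
    log_numerators = [0.0]  # lowest category has an empty sum
    running = 0.0
    for tau in thresholds:
        running += ability - item_difficulty - tau
        log_numerators.append(running)
    peak = max(log_numerators)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]


def expected_rating(ability, item_difficulty, thresholds):
    """Expected FIM rating (1 to 7) implied by the category probabilities."""
    probs = rating_scale_probabilities(ability, item_difficulty, thresholds)
    return sum((k + 1) * p for k, p in enumerate(probs))


# Hypothetical, evenly spaced thresholds (NOT estimates from this study).
THRESHOLDS = [-1.5, -0.9, -0.3, 0.3, 0.9, 1.5]

probs = rating_scale_probabilities(0.0, 0.0, THRESHOLDS)
print([round(p, 3) for p in probs])
print(round(expected_rating(1.0, 0.0, THRESHOLDS), 2))
```

The expected rating rises as ability increases relative to item difficulty, which is what allows ratings on a few well-targeted items to be converted into an ability estimate with an SE.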
All parameters are scaled and reported in logits (log odds), but can be translated into any desirable units. For purposes of simplicity, the logit was retained as the unit in all analyses reported here.

I performed a first Rasch analysis on the entire motor FIM for all subjects by using the rating scale model. Basing my decisions on the results of this primary analysis (table 1), I developed an algorithm to select, for each case, appropriate items to be included in the shortened motor FIM.

Treatment of Data Collected at Separate Times

If separate Rasch analyses are performed for data collected at different times for the same test and sample, discrepant calibrations may result. In effect, the ruler changes and it is not clear whether the different scores for the same person reflect real change. In a reasonable proposal for a simple algorithm, Chang and Chan 30 described 4 possible approaches to Rasch analysis in situations in which data for multiple time points are available and investigators are interested in over-time comparisons: (1) perform separate analyses of the data obtained on different occasions; (2) pool the data and perform a single Rasch analysis; (3) pool the data and perform 1 Rasch analysis that considers the parallel items for different occasions to be distinct items (eg, when there are admission and discharge FIM data, there would be 2 × 13 = 26 motor FIM items); and (4) consider time a facet, and perform a 3-facet analysis (of item, person, occasion). Chang and Chan suggest that analysts start with the first alternative, and move to the second if the item calibrations for the various time points are reasonably similar, as indicated, for example, by an intraclass correlation coefficient (ICC) of .90 or more. For more discrepant calibrations, the third approach is suggested, with the fourth reserved for special situations. This strategy should result in a parsimonious approach that gives a reasonable chance that a constant ruler is used. In the present study, I observed a reasonable similarity among calibrations (table 1, fig 1), so only the first 2 approaches were necessary. I call calibrations based on data for 1 time point single time-point calibrations. The calibration based on data for the 3 time points combined I call the combined time-points calibration.

Fig 1. Item measures and step average measures (in logits) for admission, discharge, and follow-up, based on single time-point calibrations.

Item Selection Based on Estimated Ability by Using an Algorithm

Because previous research 22 found that a 7-item motor FIM is adequate, and even a 5-item motor scale may be satisfactory, I decided to use a 6-item FIM instrument in the present investigation. Because walking and wheelchair locomotion have such different levels of difficulty (see table 1), especially at follow-up, separate algorithms were developed for walkers and wheelchair users (table 2).
Even though the calibrations for the 4 primary analyses (admission, discharge, follow-up, combined) differed somewhat, the overall similarity of the profiles (fig 1) suggests that there is no need for separate item sequences for each time point in the algorithm used to select items for a shortened motor FIM.

My general procedure (a variation on what Weiss 6 termed the pyramidal adaptive test algorithm) was the same in both instances. It is shown here for walkers (table 2). Bladder management (item G), an item of average difficulty (as determined in the primary Rasch analyses), was selected as the starting point. If a subject had high ability (raw score of 5 or higher) on bladder management, then bowel management (item H), a fairly difficult item, was selected next. If he/she again showed high ability (supervision/set-up or better), the 4 most difficult items (J, K, L, M) were selected in addition to the initial 2. On the other hand, if on bladder management a person had a score of 4 or lower, indicating low motor ability, the algorithm next selected a fairly easy item, dressing upper body (item D). Based on her/his performance on this item, either an easy or a somewhat more difficult set of items was selected. Thus, 4 ability groups were distinguished by the algorithm, and for each walker and each wheelchair user an optimal set of items was selected in this manner. The item selection for the medium low and medium high groups was very similar, differing mostly because of the second item selected.

Table 2: Algorithm for Selecting a Set of 6 Motor FIM Items

Group      Ability Subgroup   Item 1   Item 2   Additional Items (3-6)
WC users   Low                G        L2       A   B   D   I
           Medium low         G        L2       C   D   E   I
           Medium high        G        H        C   F   J   K
           High               G        H        F   J   K   M
Walkers    Low                G        D        A   B   C   I
           Medium low         G        D        C   E   F   I
           Medium high        G        H        E   F   J   K
           High               G        H        J   K   L   M

Abbreviations: FIM items, see table 1.

Table 3: Mean and SD for All Motor FIM Items (Raw Scores), by Time Period

                                          Admission       Discharge       Follow-Up
Item                             Letter   Mean    SD      Mean    SD      Mean    SD
Feeding                          A        4.1     2.5     5.9     1.8     6.0     1.8
Grooming                         B        3.6     2.4     5.6     2.0     5.7     2.2
Bathing                          C        1.9     1.3     4.5     2.1     5.0     2.5
Dressing: upper body             D        2.5     2.0     5.1     2.2     5.3     2.5
Dressing: lower body             E        1.7     1.3     4.6     2.4     4.7     2.7
Toileting                        F        1.6     1.4     4.3     2.4     4.8     2.7
Bladder management               G        1.8     1.8     4.5     2.4     4.5     2.4
Bowel management                 H        1.7     1.6     4.2     2.4     4.5     2.5
Bed/chair/WC transfer            I        2.0     1.4     4.8     2.2     5.1     2.5
Toilet transfer                  J        1.7     1.3     4.5     2.2     4.7     2.6
Tub/shower transfer              K        1.5     1.2     4.3     2.2     4.6     2.5
Walking                          L1       2.6     1.9     5.6     1.3     6.3     0.9
Wheelchair mobility              L2       2.3     1.9     5.0     1.7     5.6     1.4
Stairs                           M        1.1     0.7     2.3     1.9     2.4     2.3
Motor FIM raw total*                      27.6    14.9    59.8    23.3    63.0    26.0
No. of cases (except for L1, L2)          5968            5964            5176
No. of cases for walking (%)              681 (11.4)      1357 (22.8)     1263 (24.4)
No. of cases for WC mobility (%)          5287 (88.6)     4607 (77.2)     3913 (75.6)

* Includes walking or wheelchair mobility.

Secondary Rasch Analysis

The set of 6 items selected in this way was next analyzed with program Bigsteps a by using the rating scale model. In these secondary analyses (1 for each time point, 1 for the combined time-points calibration), each item included in the 6-item FIM motor scales was anchored by using as anchoring values the results obtained from the primary analysis of the full set of 14 items for the corresponding time point, as displayed in table 1. Anchoring imposes a set of parameters on the calibrations produced by Rasch analysis. In the present case, the program was forced to take, for each FIM item, the difficulty level (averaged across the 7 rating scale steps) estimated in the full-set (14 motor FIM items) analysis, and for the difficulty of the item steps relative to each other, the estimates produced in the same analysis (see table 1).
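The item-selection rules summarized in table 2 can be sketched as a 2-step branch followed by a table lookup. The item sets below are copied from table 2; the rest is an illustrative assumption rather than the code used in the study. In particular, the text gives the cutoff (a score of 5 or higher) explicitly only for items G and H, so applying the same cutoff to the second item in the low-ability branch (D for walkers, L2 for wheelchair users) is an assumption, as are the function and variable names. Table 2 lists locomotion for high-ability walkers as L; it is written here as L1 (walking).

```python
# Item sets from table 2, keyed by (locomotion mode, ability subgroup).
ITEM_SETS = {
    ("wheelchair", "low"): ["G", "L2", "A", "B", "D", "I"],
    ("wheelchair", "medium low"): ["G", "L2", "C", "D", "E", "I"],
    ("wheelchair", "medium high"): ["G", "H", "C", "F", "J", "K"],
    ("wheelchair", "high"): ["G", "H", "F", "J", "K", "M"],
    ("walker", "low"): ["G", "D", "A", "B", "C", "I"],
    ("walker", "medium low"): ["G", "D", "C", "E", "F", "I"],
    ("walker", "medium high"): ["G", "H", "E", "F", "J", "K"],
    ("walker", "high"): ["G", "H", "J", "K", "L1", "M"],
}

HIGH_SCORE = 5  # supervision/set-up or better; assumed for every branching item


def select_items(scores, mode):
    """Return (ability subgroup, list of 6 item letters) for one record.

    scores: dict mapping FIM item letter -> raw rating (1 to 7).
    mode: "walker" or "wheelchair".
    """
    if scores["G"] >= HIGH_SCORE:
        # High on bladder management: probe with the harder item H.
        subgroup = "high" if scores["H"] >= HIGH_SCORE else "medium high"
    else:
        # Low on bladder management: probe with the second item from table 2.
        second = "L2" if mode == "wheelchair" else "D"
        subgroup = "medium low" if scores[second] >= HIGH_SCORE else "low"
    return subgroup, ITEM_SETS[(mode, subgroup)]


# Hypothetical records.
print(select_items({"G": 6, "H": 7}, "walker"))
print(select_items({"G": 2, "L2": 5}, "wheelchair"))
```

Only the 2 branching items need to be known before the remaining 4 are chosen, so in an interview setting the selection could be made after the second question.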
Analysis of Agreement Between Full-Length and 6-Item Motor FIM

The item and person reliability estimates (based on the real root mean square error), item and person separation, and mean error of person ability estimates, as calculated by the Bigsteps program for the secondary analysis, were used as indices of the 6-item FIM's potential to reproduce the full-length FIM's estimate of the person's motor ability. The person ability estimates produced by the secondary analysis were also compared with the estimates produced in the analysis of the full set, using the following 2 sets of statistics.

Mean, standard deviation, kurtosis, and skewness. These characteristics of the distribution of motor ability should be minimally affected by substituting a subset of FIM items for the full-length motor FIM.

Intraclass correlation coefficient. ICC measures the agreement between 2 estimates of a characteristic of subjects (as opposed to the correlation coefficient, which measures congruence only). ICC(3,1) was the most appropriate of the 6 ICC variations for use here; it assumes that the individual subset of motor FIM items is the unit of analysis (rather than a mean of multiple subsets), and that there is no generalization from this subset to others that potentially could be used. The ICC was calculated for all subjects combined and for 4 subgroups. The latter were created by dividing the total group of subjects into quartiles of about equal size, based on their full-length motor FIM ability measures. A measure based on a good subset of FIM items is expected to have a high agreement with the full-length measure, not only overall but also in each of the quartiles distinguished.

As indicated earlier, the reason I used combined time-points calibration was to make sure that over-time comparisons were valid, that is, the same measuring stick was used.
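For readers who want to see how ICC(3,1) is obtained, the following sketch computes it from a 2-way layout (subjects by measures) using the Shrout and Fleiss mean-square formula, (BMS - EMS) / (BMS + (k - 1) EMS). The study itself used SPSS and a macro; this function, its name, and the toy data are illustrative assumptions only.

```python
def icc_3_1(ratings):
    """ICC(3,1): two-way mixed model, single measure, per Shrout and Fleiss.

    ratings: list of rows, one per subject; each row holds that subject's
    k estimates (here k = 2: full-length and 6-item ability estimates).
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between measures
    ss_error = ss_total - ss_rows - ss_cols

    bms = ss_rows / (n - 1)                 # between-subjects mean square
    ems = ss_error / ((n - 1) * (k - 1))    # residual mean square
    return (bms - ems) / (bms + (k - 1) * ems)


# Toy data (hypothetical): column 1 = full-length estimate, column 2 = 6-item estimate.
same_order = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
reversed_order = [[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]]
print(icc_3_1(same_order), icc_3_1(reversed_order))
```

In this formulation the measures are treated as fixed, so a constant offset between the 2 columns (as in the first toy data set) does not lower the coefficient; only disagreement in the relative standing of subjects does.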
To determine whether estimates of change were affected by substituting 6 FIM items for the standard 13 (14 with the mobility item split), I calculated over-time changes (for individuals with multiple records in the database) in 2 ways: as a difference between full-length motor FIM estimates based on the combined time-points calibration and as a difference between 6-item estimates. Depending on the available data, I calculated for each case the score differences between admission and follow-up, between follow-up and first anniversary, and between first and second anniversary of the injury.

Bigsteps was used for the primary and secondary Rasch analyses. All other statistics were calculated by using SPSS, version 10.0, b and a macro written for SPSS.

RESULTS

The means and standard deviations (SDs) of the subjects' scores on the 14 motor FIM items are listed in table 3. These raw scores are provided so that readers have some basis for linking logit values to the more familiar FIM categories. Considering the differences in item difficulty calibrations (see table 1), one should be cautious about comparing means across time. Also, the cases are not necessarily the same individuals from 1 time point to the next. However, on almost every item, the mean score was higher at discharge than at admission, and modest additional gains were reflected in the follow-up scores. At admission, 681 persons (11.4% of the total) used walking as their primary means of locomotion. They scored 2.6 on average, whereas the average score for wheelchair users was 2.3. The percentage of subjects who used walking for locomotion increased to 22.8% at discharge.

Primary Rasch analysis results are in table 1. Feeding was the easiest item, and stairs the most difficult, by far. Walking was relatively difficult (typically, only persons with a complete SCI at a very low level or with motor-functional incomplete injury regain walking), and wheelchair locomotion was relatively easy. The divergence between the 2 grew over time. Because of the large sample size, all items had small SEs. The difficulty calibrations of the various FIM rating scale categories, which had the expected increase from the easiest (total assistance) to the most difficult (complete independence), also had a small SE.

The single time-point calibrations are in figure 1. Because of the separate calibrations, items can only be compared with each other at the same time point, and not across time points. Despite differences for specific items, especially at the time of follow-up, the overall trends are very similar, for both item and step calibrations. The ICC (model 2,1) for admission, discharge, and follow-up calibrations was .91 for the items and .93 for the steps. This similarity of calibrations was the reason for developing just 1 item-selection algorithm (table 2) instead of 4 separate algorithms (for admission, discharge, follow-up, combined time-points data). However, I used both the single time-point and the combined calibrations (see table 1) to calculate a motor ability estimate based on the 6 items selected for each case.

Mean ability measures for the 4 subgroups of persons who used a wheelchair and of those who walked are in figure 2. For walkers, not much difference existed between the 2 intermediate groups, as was expected given the algorithm. In all instances, however, the low group differed clearly from the 2 medium groups, which in turn, on average, had lower motor ability than individuals in the high group. Results were very similar when means for the full-length FIM were plotted.

Fig 2. Mean measures (based on 6-item FIM) for the 4 ability subgroups among wheelchair users (left) and walkers (right). Error bars (for admission and follow-up only) indicate the SD for each subgroup.
Table 4 shows how the 6-item motor FIM subset performed on the specified criteria for adequacy of estimating person ability. When using single time-point calibrations, the mean and SD as well as skewness and kurtosis of the original (14-item) distribution of person ability were well reproduced (columns 2, 4, 6 vs 1, 3, 5, respectively). Person reliability as estimated by the Rasch analysis program was almost as good for the 6-item FIM as for the full-length instrument. The mean (across cases) error in estimating motor ability was somewhat higher when only 6 items were used, but not dramatically so. Item reliability in both instances was perfect (1.00), although item separation decreased. The agreement between the original measure and the 6-item estimate as quantified by the ICC was .95 or higher. The agreement between the 2 was lowest within the third quartile, at all 3 time points. This was because the third quartile was the narrowest, at 1.02 logits or less between the upper and lower cutoff points, compared with at least 2 logits for the first and fourth quartiles.

When the data sets for the 3 time points were joined for a single calibration, the 6-item FIM estimate of motor ability also reproduced the full-length motor FIM very well (see columns 7 and 8, table 4). Person reliability and ICC were marginally higher because of a wider spread of abilities being included (compare the row of SDs).

Because what is optimal for all records combined may not be optimal for the records from a single time point (table 1 and fig 1 indicate discrepancies in calibrations), the Rasch analysis of the data for 3 separate time points was repeated by using anchoring at the values produced by the combined time-points calibration. (Thus, values in table 4, columns 10, 12, and 14 result from an analysis that used anchoring in the calibration underlying column 7.)
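The separation and reliability rows of table 4 carry the same information in 2 forms. As a rough check, and assuming the conventional Rasch relation between the separation index G and reliability R, namely R = G² / (1 + G²), the reliabilities can be approximately recovered from the separations. The function name is an assumption; the separation values are taken from the persons rows of table 4.

```python
def reliability_from_separation(separation):
    """Reliability implied by a Rasch separation index: G^2 / (1 + G^2)."""
    return separation ** 2 / (1.0 + separation ** 2)


# Person separations from table 4 (full-length instrument, single time-point
# calibrations): admission, discharge, follow-up.
for separation in (2.31, 3.59, 3.52):
    print(separation, round(reliability_from_separation(separation), 2))
```

Because reliability is bounded at 1.00, it saturates quickly; this is why the item reliabilities in table 4 all round to 1.00 even though item separation drops noticeably for the 6-item subsets.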
The ICC and other analyses were repeated for the 3 time points separately but with the ability estimates included based on the combined calibration. These results are in columns 9 through 14 of table 4. The estimates for the mean, SD, kurtosis, and skewness were very similar to those produced with the single time-point calibrations; the same can be said for the ICC values. For reliability for persons and items, relatively small decreases were observed. (The small increase in item separation for the admission data is not easily explainable.) In fact, the basis for calibration selected did not make much difference at all.

Table 4: Indicators of the Quality of the 6-Item Motor FIM Instruments, by Calibration Basis

                          Single Time-Point Calibrations                  Combined Time-Points Calibration
                          Admission       Discharge       Follow-Up       Combined        Admission       Discharge       Follow-Up
Indicator                 14-Item 6-Item  14-Item 6-Item  14-Item 6-Item  14-Item 6-Item  14-Item 6-Item  14-Item 6-Item  14-Item 6-Item
                          (1)     (2)     (3)     (4)     (5)     (6)     (7)     (8)     (9)     (10)    (11)    (12)    (13)    (14)
Mean                      1.82    1.64    0.41    0.48    0.89    0.90    0.25    0.15    1.89    1.67    0.40    0.49    0.88    0.88
SD                        1.46    1.43    1.70    1.63    2.02    1.88    2.10    1.93    1.40    1.30    1.64    1.55    2.09    1.84
Skewness                  0.19    0.11    0.36    0.07    0.13    0.27    0.13    0.17    0.14    0.04    0.32    0.13    0.07    0.07
Kurtosis                  1.22    1.21    0.32    0.34    0.51    0.41    0.37    0.28    1.01    0.76    0.23    0.24    0.38    0.14
Rasch reliability: persons
  Separation              2.31    2.17    3.59    2.82    3.52    2.75    3.87    3.14    2.23    2.12    3.49    2.87    3.12    2.50
  Reliability             0.84    0.83    0.93    0.89    0.93    0.88    0.94    0.91    0.83    0.82    0.92    0.89    0.91    0.86
  Mean measurement error  0.30    0.34    0.32    0.39    0.37    0.44    0.33    0.38    0.29    0.33    0.32    0.40    0.36    0.42
Rasch reliability: items
  Separation              39.07   29.94   47.19   37.80   39.06   32.29   78.18   67.32   20.49   23.15   47.49   37.67   36.87   28.75
  Reliability             1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
ICC
  All cases combined      .98             .96             .95             .98             .98             .96             .97
  1st quartile            *               .94             .92             .95             *               .93             .92
  2nd quartile            .90             .70             .76             .73             .90             .68             .82
  3rd quartile            .50             .36             .49             .73             .48             .28             .35
  4th quartile            .80             .81             .94             .92             .81             .82             .94
No. of cases              5968            5964            5176            17,108          5968            5964            5176

* Statistic cannot be calculated. All cases have the same value for the ability estimate based on the 14-item FIM and the 6-item FIM.
Six-item estimates are based on anchoring in item calibrations; see text.

The last 6 columns of table 4 are based on the step and item calibrations produced by an analysis to which the admission, discharge, and follow-up data, respectively, contributed. If the ICC is calculated for the correspondence between the motor ability estimates at admission based on 6 versus 14 FIM items, but the calibrations resulting from the analysis of the discharge data are used to anchor calculation of the estimates, the ICC is .98.
In fact, the lowest ICC value observed when using another time point's calibration results was .96 (which is higher than the ICC for follow-up data based on follow-up calibrations). Thus, the calibration divergences in table 1 and figure 1 are not very important: the overall hierarchy of items is largely the same, and minor differences affected motor ability estimates minimally.

The agreements (ICC) between the over-time changes, calculated by using calibration estimates from 14-item and 6-item combined time points, were quite high: .93 for the change from admission to discharge (n=5962 cases), .94 for the change from discharge to the first anniversary (n=1412), and .89 for the change from the first to the second anniversary (n=722).

DISCUSSION

In a previous study, 22 I evaluated 5 methods of selecting from the existing 13 FIM motor items a subset of items that can be used to estimate SCI patients' motor ability. That research appraised subsets of 5, 6, and 7 items by using only follow-up data from the NSCID, which at the time contained a somewhat smaller number of cases. Of the 5 item-selection methods, the algorithm approach (similar to that used in the present study) produced the best results overall. This method generally generated the least change in mean, SD, kurtosis, and skewness of the distribution of motor ability estimates. It offered the best reliability, whether ICC, the concordance correlation coefficient, or the limits of agreement approach was used to assess it. All 3 indicated an almost perfect agreement with the 13-item estimate of motor ability, especially for the 7-item set. The superior performance of the algorithm method (CAT) was the reason why I used CAT in the present study.
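The 3 agreement indices named above (ICC, concordance correlation coefficient, and limits of agreement) can be computed for paired full-length and short-form ability estimates along the following lines. This is a sketch only: it assumes the Shrout-Fleiss ICC(2,1) formula, Lin's concordance correlation coefficient, and Bland-Altman 95% limits of agreement, and it runs on simulated logit values (the variables `full` and `short` are hypothetical, not study data).

```python
import numpy as np


def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    `ratings` has shape (n_subjects, k_methods).
    """
    n, k = ratings.shape
    grand = ratings.mean()
    ss_rows = k * np.sum((ratings.mean(axis=1) - grand) ** 2)  # between subjects
    ss_cols = n * np.sum((ratings.mean(axis=0) - grand) ** 2)  # between methods
    ss_err = np.sum((ratings - grand) ** 2) - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def concordance_cc(x, y) -> float:
    """Lin's concordance correlation coefficient (population variances)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    s_xy = np.mean((x - x.mean()) * (y - y.mean()))
    return 2 * s_xy / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)


def limits_of_agreement(x, y) -> tuple[float, float]:
    """Bland-Altman 95% limits: mean difference +/- 1.96 SD of the differences."""
    d = np.asarray(x, float) - np.asarray(y, float)
    half_width = 1.96 * d.std(ddof=1)
    return d.mean() - half_width, d.mean() + half_width


# Simulated example: a "full-length" estimate and a noisier "short-form" estimate.
rng = np.random.default_rng(0)
full = rng.normal(0.0, 1.7, size=500)           # hypothetical 14-item estimates (logits)
short = full + rng.normal(0.0, 0.3, size=500)   # hypothetical 6-item estimates (logits)

print("ICC(2,1):", round(icc_2_1(np.column_stack([full, short])), 3))
print("CCC:     ", round(concordance_cc(full, short), 3))
print("LoA:     ", tuple(round(v, 2) for v in limits_of_agreement(full, short)))
```

The ICC and the concordance coefficient both penalize systematic shifts between the 2 estimates as well as random scatter, whereas the limits of agreement report the scatter in the original logit units.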
The present study differed from the earlier one in 3 basic ways: (1) it featured a simplified selection algorithm that used only 4 ability groups instead of 7 as in the earlier study; (2) it distinguished between mobility types (splitting of the walking/wheelchair item) to avoid a mismatch between item difficulty and person ability, which can result from combining these 2 into a single FIM item; and (3) it offered analysis of admission and discharge data in addition to analysis of follow-up information. The use of a simplified algorithm certainly seems feasible: the ICC value for follow-up (.98) found in the present study is the same as that for the previous study (.98). The Rasch person and item reliability measures (.89 and 1.00, respectively) were also similar in the 2 studies.

The improvement resulting from the splitting of the walking/wheelchair item may have counteracted any loss of precision of estimates resulting from a simpler algorithm. Use of information on each subject's typical mode of mobility made more precise estimation possible, especially for follow-up. In the follow-up data, the difference between walking and wheelchair in difficulty level (average over all item steps) was 1.96 logits, versus only .67 and 1.27 logits for admission and discharge, respectively. Further improvement in creating short versions of the motor FIM (and in Rasch analysis of the full-length FIM) may be possible by introducing similar splits for the transfer items. Although there is no auxiliary variable for these to indicate how a transfer is performed, it is reasonable to assume that persons who walk make a transfer by using a standing pivot or a similar easy method. Persons who get around by wheelchair, on the other hand, transfer exclusively by means of their arms, supplemented by a sliding board or similar aid if they are not completely dependent on a helper.
The problem lies in the term "exclusively": the FIM auxiliary variable refers only to the common form of locomotion, and there is no claim that the persons marked as wheelchair users cannot walk or vice versa.

The algorithm approach to selecting a subset of motor FIM items worked about as well for admission and discharge data as for follow-up data (table 4), even though the relevant information was collected by completely different mechanisms: observer rating versus self-report. Generally, admission and discharge FIM data available in the NSCID (and other national rehabilitation databases, eg, those of the Traumatic Brain Injury Model Systems and the Burn Injury Model Systems) are a byproduct of the clinical process: they are collected for program evaluation purposes and borrowed from the institutional files for the research. In some institutions, the FIM is used for additional clinical purposes: patients' rehabilitation treatment outcome goals may be set in terms of FIM motor items. This adds to the institutional value of admission and discharge detail information. As long as clinicians need and administrators expect complete FIM data, shortening of the item set in clinical settings is not likely.

Another issue arguing against using an algorithm approach in current clinical practice is the fact that completion of the FIM items is commonly split over the most knowledgeable disciplines: for instance, occupational therapists score feeding and dressing, and nurses score bowel and bladder management. Even with an integrated computer system, the timing needed to produce a 6-item set, with specific item selection based on prior results, may not be feasible in a typical clinical setting. However, instances exist in which the FIM is administered (as a performance measure of ADL ability) by research staff 31,32; in these cases, the algorithm approach as a way to reduce the total time required makes sense. A compromise may be needed between the optimal sequence as dictated by an algorithm and the logical order of testing (eg, toileting tested in conjunction with toilet transfer), but the potential for shorter testing time is there.
The time saved can be used to reduce costs, to collect more information in other areas, or to more carefully collect FIM information on the items selected. As in the previous simulation research on the application of an algorithm approach, 22 the present results lead to the recommendation that any research or program evaluation effort that uses the full motor FIM, collected by means of subject or proxy interview, should consider using the algorithm approach to reduce the data collection effort.

Rasch analysis has been claimed to produce results that are sample-independent and item-independent. 33 That is to say, if a different sample is used, even one with a significantly higher or lower average ability, item calibrations should come out about the same. Conversely, if a different set of items (measuring the same concept or latent trait) is used for the same group, person calibrations (ability estimates) should be unchanged. However, in the present case, item calibrations and step calibrations differed from 1 time point to another. The discrepancy was the reason for providing the 3 separate analyses in addition to the combined time-points analysis. Given the large sample sizes, sampling error is an unlikely explanation for the discrepancies shown in figure 1. It may be that unequal percentages of "not tested" masquerading as "total assistance" play a role. These likely are nonexistent for the follow-up data; however, as indicated, some may have found their way into the admission and discharge data. Alternatively, the discrepancies in item calibration at the 3 time points may be attributable to different behavior of the therapists (becoming easier raters for some items, tougher for others, from admission to discharge). The step calibration divergence for admission and discharge data offers an apparently simple explanation: it appears that therapists, on average, change their standards from admission to discharge.
Rehabilitation professionals have suggested that the pressures on programs, especially clinical staff, to show results would create this situation, even though the FIM training and mastery testing process is aimed at standardizing how people evaluate patient performance. 34 Rumors circulate that bonus systems in place in (for-profit) institutions almost guarantee that staff will engage in down-coding at admission and up-coding at discharge. The present data suggest the opposite. Every FIM step had a lower logit value (average over 14 items) for admission than for discharge (see table 1, fig 1). The item difficulties (average over scale steps) in table 1 and figure 1 indicate that for 6 of 14 items raters were less lenient at discharge than at admission. The difference was especially large (.30 logits) for stairs and dressing lower body. Previous studies have produced data that are in line with the doomsayers' predictions, 30,35 and it is unclear why the situation is different in the current data set. It may be that SCI rehabilitation constitutes a special case. Alternative explanations for the divergent calibrations may be that patients had changes in the quality of their performance or differences in scoring methodology (professional rating vs self-report by persons with SCI). More research is needed to determine the cause(s) of these step- and item-calibration divergences. However, the order of steps vis-à-vis each other was robust, the order of items fairly so, and interchanging item calibrations did not affect the relative or even absolute estimates of subject motor ability: ICCs were in the .95 to .99 range, whatever basis for calibrations was used. If additional research indicates that combined time-points calibration of FIM items is not feasible, hard problems in assessing change in rehabilitation patients over time will need to be addressed.
Chang and Chan's third and fourth approaches 30 to Rasch analysis of data for various time points do not truly offer an alternative. They constitute at best a patch to fix a calibration divergence. If analyses of other data sets, preferably with other diagnostic groups, indicate that calibration discrepancies are indeed more than a marginal issue, much additional research will be needed.

Previous authors 36-38 who applied Rasch analysis to the FIM have noted how the bladder and bowel items are relatively misfitting. However, most of them did not see this as a reason to eliminate those 2 from their set of items used to estimate motor ability. The reason for the poor fit likely is the fact that the FIM as used in these studies combines into a single bladder (or bowel) score 2 discrepant characteristics: status of the sphincter (as indicated by the frequency of incontinence) and self-care ability (capacity to perform activities needed for hygienic elimination of waste). Although these were assessed separately (on 7-point scales), only the lowest (most impaired) score was recorded and available for analysis. The bowel and bladder items had a somewhat marginal fit in the current analysis (using traditional Rasch infit and outfit criteria), but they were not eliminated because of continuity with the earlier research. Fit is always relative, and traditionally, in both research and clinical applications, bladder and bowel scores are added to the scores for the other motor FIM items to obtain a total score. The dilemma of how to deal with these 2 FIM items has been resolved by the federal Centers for Medicare and Medicaid Services (CMS; the former Health Care Financing Administration). In the prospective payment system for inpatient medical rehabilitation, introduced in January 2002, CMS requires sphincter status and self-care ability to be separately assessed and recorded on claims, although the ultimate total FIM score calculation still uses the lower of the 2.
Unfortunately, the definitions for the various levels of incontinence have been changed from the old FIM, somewhat handicapping analysis of pre-2002 FIM data jointly with more recent FIM data.

In the present study, all subjects had been administered all 13 (14) FIM items previously. The investigation focused on what would have happened if adaptive testing had been used instead, and as such is a simulation. The version of CAT simulated here is fairly simple, with fixed entry (all cases received item G first) and fixed length (all received 6 items in total). By using the 3-step algorithm in table 2, the FIM can easily be administered in paper-and-pencil format. More sophisticated procedures are possible, however, 6 and may be implemented when a computer is used to select items. For instance, the starting item may be selected based on known characteristics 4 (eg, neurologic status). For a person with complete tetraplegia, item D (dressing upper) may be a good starting point, and for someone with incomplete paraplegia, item F (toileting). The next items to be picked could be targeted based on a finer differentiation than scoring either "5 or more" or "4 or less" on the initial FIM item. In the previous CAT simulation study, 22 3 groups were used: 1 or 2; 3, 4, or 5; and 6 or 7. In fact, the optimal strategy is to calculate the subject's ability after each new item has been administered, and to select as the next item the one that provides the most information about the subject's true ability level, given the estimated level. 4,6

Last, there is no need to have a fixed number of items. One common CAT strategy is to calculate for each person being tested the SE of measurement after administration of each new item and to terminate the test when the error drops below a prespecified level. In the case of the FIM, administration of items might in some instances go beyond 6. More items might be used for individuals who behave inconsistently: for instance, those who have unusual spinal injury syndromes, or who get confused when answering interviewer questions.
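The more sophisticated procedure outlined above (re-estimate ability after every item, administer the most informative remaining item, and stop once the SE falls below a preset level) can be sketched as follows. This is an illustration under stated assumptions, not the algorithm of table 2: it assumes an Andrich rating scale model with 7 categories, made-up item difficulties and step thresholds (`deltas`, `taus`), a grid-based Bayesian (EAP) ability estimate with an assumed normal prior in place of the estimation used by a Rasch analysis program, and a simulated respondent.

```python
import numpy as np


def category_probs(theta, delta, taus):
    """Rating scale model: probabilities of categories 0..m for one item."""
    exponents = np.concatenate(([0.0], np.cumsum(theta - delta - np.asarray(taus))))
    p = np.exp(exponents - exponents.max())  # subtract max for numerical stability
    return p / p.sum()


def item_information(theta, delta, taus):
    """Fisher information of an item at theta = variance of its score."""
    p = category_probs(theta, delta, taus)
    scores = np.arange(len(p))
    return float(np.sum((scores - np.sum(scores * p)) ** 2 * p))


def estimate_ability(responses, deltas, taus, grid):
    """EAP estimate and posterior SD on a grid; N(0, 2^2) prior is an assumption."""
    log_post = -0.5 * (grid / 2.0) ** 2
    for item, score in responses.items():
        log_post = log_post + np.log(
            [category_probs(t, deltas[item], taus)[score] for t in grid]
        )
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    mean = float(np.sum(grid * post))
    return mean, float(np.sqrt(np.sum((grid - mean) ** 2 * post)))


def run_cat(answer, deltas, taus, start_item, se_target=0.5, max_items=None):
    """Variable-length CAT: stop when SE <= se_target or the items run out."""
    grid = np.linspace(-6.0, 6.0, 241)
    max_items = max_items or len(deltas)
    responses = {start_item: answer(start_item)}
    theta, se = estimate_ability(responses, deltas, taus, grid)
    while se > se_target and len(responses) < max_items:
        remaining = [i for i in deltas if i not in responses]
        best = max(remaining, key=lambda i: item_information(theta, deltas[i], taus))
        responses[best] = answer(best)
        theta, se = estimate_ability(responses, deltas, taus, grid)
    return theta, se, list(responses)


# Hypothetical calibrations (logits); NOT the values in table 1.
deltas = {f"item{i}": d for i, d in enumerate(
    [-2.0, -1.2, -0.6, 0.0, 0.5, 1.1, 1.8, 2.5], start=1)}
taus = [-2.0, -1.2, -0.4, 0.4, 1.2, 2.0]  # 6 thresholds -> 7 categories (0..6)

rng = np.random.default_rng(1)
true_theta = 1.0  # simulated respondent


def simulated_answer(item):
    p = category_probs(true_theta, deltas[item], taus)
    return int(rng.choice(len(p), p=p))


theta, se, administered = run_cat(simulated_answer, deltas, taus, start_item="item4")
print(f"estimate = {theta:.2f}, SE = {se:.2f}, items used = {administered}")
```

Raising `se_target` shortens the test at the cost of precision; setting `max_items=6` with a very small `se_target` reproduces a fixed-length test while keeping the information-based item choice.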
Programming the computers used for data collection may be worthwhile for larger research projects, in which the FIM and other time-consuming instruments are administered to large numbers of subjects.

Researchers in the health-related quality of life (health status) area of research in recent years have discovered Rasch analysis, and have called for the development of item banks that make it possible to use tests that are customized to the person whose quality of life is being measured. 3,4 Their primary interest is in having measures that do not have floor or ceiling effects for patient groups with either very high or very low health status, and that offer equiprecision of measurement across the spectrum. Time savings have not been offered as the only or even primary benefit of customized tests, but all 3 benefits of Rasch analysis and CAT are related. Rasch analysis enables us to determine whether our instrument or item bank includes items that are suitable for the cases at the extremes of the continuum. 2 It also provides information by which an optimal set of items can be selected for all subjects, wherever along that continuum they may be located. Because there is no need to administer items that are too easy or too difficult, testing time can be reduced. Better, more efficient measurement is in our future, if rehabilitation researchers and clinicians accept the challenge of developing the item banks and software needed to implement CAT. The short version of the FIM is just the beginning.

Acknowledgments: Thanks to Gwyn Kropp, MS, for preparing a data analysis file, and to Wayne Gordon, PhD, Ralph Marino, MD, MS, and several anonymous reviewers for comments on earlier versions of this article.

REFERENCES

1. Nunnally JC. Psychometric theory. New York: McGraw Hill; 1967.
2. Bond TG, Fox CM. Applying the Rasch model. Fundamental measurement in the human sciences. Mahwah (NJ): Lawrence Erlbaum Associates; 2001.
3. McHorney CA. Generic health measurement: past accomplishments and a measurement paradigm for the 21st century. Ann Intern Med 1997;127:743-50.
4. Revicki DA, Cella DF. Health status assessment for the twenty-first century: item response theory, item banking and computer adaptive testing. Qual Life Res 1997;6:595-600.
5. Drasgow F, Olson-Buchanan JB, editors. Innovations in computerized assessment. Mahwah (NJ): Lawrence Erlbaum Associates; 1999.
6. Weiss DJ. Adaptive testing by computer. J Consult Clin Psychol 1985;53:774-89.
7. Wainer H, editor. Computerized adaptive testing: a primer. 2nd ed. Mahwah (NJ): Lawrence Erlbaum Associates; 2000.
8. Luecht RM, Nungester RJ. Some practical examples of computer-adaptive sequential testing. J Educ Meas 1998;35:229-49.
9. Green BF. Computer-based adaptive testing in 1991. Psychol Marketing 1991;8:243-57.
10. Waller NG, Reise SP. Computerized adaptive personality assessment: an illustration with the Absorption scale. J Pers Soc Psychol 1989;57:1051-8.
11. Weiss DJ. Improving measurement quality and efficiency with adaptive theory. Appl Psychol Meas 1982;6:473-92.
12. Kreiter CD, Ferguson K, Gruppen LD. Evaluating the usefulness of computerized adaptive testing for medical in-course assessment. Acad Med 1999;74:1125-8.
13. Bergstrom BA, Lunz ME. CAT for certification and licensure. In: Drasgow F, Olson-Buchanan JB, editors. Innovations in computerized assessment. Mahwah (NJ): Lawrence Erlbaum Associates; 1999. p 67-91.
14. McArthur DL, Cohen MJ, Schandler SL. Rasch analysis of functional assessment scales: an example using pain behaviors. Arch Phys Med Rehabil 1991;72:296-304.
15. Velozo CA, Kielhofner G, Lai JS. The use of Rasch analysis to produce scale-free measurement of functional ability. Am J Occup Ther 1999;53:83-90.
16. Andiel C. Rasch analysis: a description of the model and related issues. Can J Rehabil 1995;9:17-25.
17. Hambleton RK, Swaminathan H, Rogers HJ. Fundamentals of item response theory. Newbury Park (CA): Sage; 1991.
18. Andrich D. Rasch models for measurement. Vol 68. Newbury Park (CA): Sage; 1988.
19. Linacre JM, Heinemann AW, Wright BD, Granger CV, Hamilton BB. The structure and stability of the Functional Independence Measure. Arch Phys Med Rehabil 1994;75:127-32.
20. Velozo CA, Magalhaes LC, Pan AW, Leiter P. Functional scale discrimination at admission and discharge: Rasch analysis of the Level of Rehabilitation Scale-III. Arch Phys Med Rehabil 1995;76:705-12.
21. Kilgore KM, Fisher WP, Silverstein B, Harley JP, Harvey RF. Application of Rasch analysis to the Patient Evaluation and Conference System. Phys Med Rehabil Clin North Am 1993;4:493-515.
22. Dijkers MP, Yavuzer G. Short versions of the telephone motor Functional Independence Measure for use with persons with spinal cord injury. Arch Phys Med Rehabil 1999;80:1477-84.
23. Karamehmetoglu SS, Karacan I, Elbasi N, Demirel G, Koyuncu H, Dosoglu M. The functional independence measure in spinal cord injured patients: comparison of questioning with observational rating. Spinal Cord 1997;35:22-5.
24. Richards JS, Go BK, Rutt RD, Lazarus PB. The national spinal cord injury collaborative database. In: Stover SL, DeLisa J, Whiteneck GG, editors. Spinal cord injury: clinical outcomes from the Model Systems. Rockville (MD): Aspen; 1995. p 10-20.
25. Stover SL, DeVivo MJ, Go BK. History, implementation, and current status of the National Spinal Cord Injury Database. Arch Phys Med Rehabil 1999;80:1365-71.
26. Hamilton BB, Granger CV, Sherwin FF, Zielezny M, Tashman JS. A uniform national data system for medical rehabilitation. In: Fuhrer M, editor. Rehabilitation outcomes: analysis and measurement. Baltimore: Brookes; 1987. p 137-47.