A Comparison of Item Exposure Control Methods in Computerized Adaptive Testing


Journal of Educational Measurement
Winter 1998, Vol. 35, No. 4, pp. 311-327

A Comparison of Item Exposure Control Methods in Computerized Adaptive Testing

Javier Revuelta and Vicente Ponsoda
Universidad Autónoma de Madrid

Two new methods for item exposure control were proposed. In the Progressive method, as the test progresses, the influence of a random component on item selection is reduced and the importance of item information becomes increasingly more prominent. In the Restricted Maximum Information method, no item is allowed to be exposed in more than a predetermined proportion of tests. Both methods were compared with six other item-selection methods (Maximum Information, One Parameter, McBride and Martin, Randomesque, Sympson and Hetter, and Random Item Selection) with regard to test precision and item exposure variables. Results showed that the Restricted method was useful for reducing maximum exposure rates and that the Progressive method reduced the number of unused items. Both did well regarding precision. Thus, a combined Progressive-Restricted method may be useful to control item exposure without a serious decrease in test precision.

One of the main goals of computerized adaptive testing (CAT) is to obtain precise ability estimates with a small number of items. To achieve this goal, items are selected specifically for each examinee from a large bank. Selection is based on characteristics of examinees (their provisional estimated ability) and items (their difficulty and discrimination parameters). Thus, a different subset of items may be administered to each person (Hambleton, Swaminathan, & Rogers, 1991).

The Maximum Information method (MI) is widely used to select items during the testing session (Wainer, 1990, p. 111). It selects the unused item of the bank that provides the most information at the last estimated ability. Item information is greater as the item difficulty approaches the ability level of the examinee, as the discrimination parameter increases, and as the pseudochance level approaches zero (Hambleton & Swaminathan, 1985). Each item bank contains "good" and "poor" candidates for MI selection at a particular ability level; items with high discrimination (a) values are extremely good candidates if their difficulties lie close to the test taker's proficiency. In practice, some items are used in most test administrations, but others are rarely (if ever) selected (Wainer, 1990; Mills & Stocking, 1995). For instance, Hulin, Drasgow, and Parsons (1983), in a simulation study, administered 4600 tests and found that 141 of the 260 items were never administered with the MI method.

Author note: The authors wish to thank two anonymous reviewers for their extensive and thoughtful comments and Carmen Ximénez for her help in the preparation of this manuscript. This research was partially supported by three DGICYT grants (PS , PS and PS ).
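The item banks analyzed below are calibrated under the three-parameter logistic (3PL) model, with discrimination a_i, difficulty b_i, and pseudochance c_i for item i. For reference, a standard statement of the model and of the item information function it implies (the usual textbook formulas with scaling constant D = 1.7, supplied here for orientation rather than reproduced from the original article):

P_i(\theta) = c_i + \frac{1 - c_i}{1 + \exp[-D a_i (\theta - b_i)]}, \qquad D = 1.7

I_i(\theta) = D^2 a_i^2 \, \frac{1 - P_i(\theta)}{P_i(\theta)} \left[ \frac{P_i(\theta) - c_i}{1 - c_i} \right]^2

The MI method evaluates I_i(\theta) at the provisional ability estimate and administers the unused item with the largest value; the formula makes visible why items with high a, low c, and b near \theta dominate selection.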

The proportion of times an item is used (its Item Exposure Rate) depends, then, on (a) its psychometric properties, (b) what other items are available in the pool, and (c) the distribution of ability of the examinees.

Items with a high exposure rate may produce some undesirable effects. For example, if an examinee is evaluated several times with adaptive tests from the same item pool, he or she could be asked the same questions, and may have learned the correct response. The most frequently used items may soon become popular and lose their original psychometric properties, causing a decrease in the test's validity. From the test developer's perspective, it is also undesirable to create a large item pool and use an item selection method that leaves a large percentage of the items unused. It seems sensible to demand that all the items be administered at some time. This also guarantees more variety in the items the examinees receive. In short, test developers want to be sure that all items are used (for economic reasons), but that no item is overused (for security reasons).

Item exposure control strategies have two main aims: (a) to prevent overexposure of some items, and (b) to increase the use rate of seldom- or never-selected items. These strategies achieve better control of item exposure by introducing several changes in the MI method, but not without cost: they also produce a loss of precision in ability estimation. In general, item exposure control strategies can be classified in two groups (Stocking, 1993):

(a) Methods adding a random component to the MI item selection method. One such method selects the first item at random from the optimal five items, the second from the optimal four, the third from a group of three, the fourth from a group of two, and subsequently the optimal item is chosen (McBride & Martin, 1983).

(b) Methods based on assigning a parameter to each item to control its maximum exposure. Sympson and Hetter (1985) developed a probabilistic method to achieve control over maximum exposure rates. The method acts in the same way as the MI method and selects the optimal item. However, this item will actually be administered not on 100% of the occasions when it is selected as optimal, as in the MI method, but only in 100k_i% of the tests, where k_i is the probability that item i is administered, given that it has been selected. Items with low exposure rates under the MI method will have k_i values close to one. Items with extremely high exposure rates will have smaller k_i values. The set of k_i values is determined through a series of simulations. Several refinements have recently been added to this method. Davey and Parshall (1995) extended the method to prevent not only individual item overuse, but also overexposure of item clusters. Stocking (1993) and Stocking and Lewis (1995) adapted Sympson and Hetter's method to structured item banks.

Strategies derived from Sympson and Hetter's method present the important advantage of direct control of unconditional and conditional exposure rates. For a bank of n items, one exposure-rate parameter per item has to be determined by simulation in the unconditional case. In the conditional-on-ability case (Stocking & Lewis, 1995), r exposure-rate parameters per item are needed, where r is the number of ability levels. Moreover, parameter values depend on CAT and item bank characteristics, such as number of items, stopping rule, and so on. Replacing or adding items to the bank, modifying test length, and so on, may cause changes in item functioning (Stocking & Lewis, 1995). Although this method is theoretically interesting, it would be useful to find more manageable strategies that provide similar precision in ability estimation.

The current report introduces two new methods for exposure control and compares them with several existing ones. The methods considered in this paper are described below.

Methods

Maximum Information Method (MI)
The information provided by each unused item (I_i) at the last estimated ability is computed. The most informative item is administered.

One Parameter Method (1P)
This method is a modified MI method in which the item discrimination and pseudochance parameters have no role: the unused item with the difficulty (b) parameter closest to the last estimated ability is selected and presented. The aim of this method is to increase the exposure of the items with a low a parameter and to reduce that of very discriminating items. One example of the application of this method may be found in Dodd (1990). She compared the 1P to the MI method in a computerized adaptive test measuring attitudes, based on the rating scale model.

McBride and Martin Method (MM)
As described above, a random component is added to the selection of the initial items. The first item is selected at random among the 5 most informative items, the second among the 4, and so on. The fifth and subsequent items are selected to be optimal. Other randomization schemes similar to the MM method have been proposed (Stocking & Swanson, 1993; Lewis, Subhiyah, & Morrison, 1995; Morrison, Subhiyah, & Nungester, 1995). In the Morrison et al. scheme, the first five items are selected at random from the optimal ten. For the sixth and following items, the most informative item is administered.

Randomesque Method (RA)
Selection is always made at random among the 5 most informative items (Kingsbury & Zara, 1989; Morrison et al., 1995).

Sympson and Hetter Method (SH)
As mentioned above, in the unconditional case, this method assigns to each item a k_i parameter ranging from 0 to 1. Once the most informative item has been selected, a random number from the uniform (0, 1) distribution is generated. If this random number is lower than k_i, the item is actually administered; if not, the item is set aside, the next-most-informative item is identified, a new random value is generated, and so on. Items are always selected from the set of items that have been neither administered nor set aside. The k_i parameters are assigned by an iterative process of repeated simulations, until a maximum-exposure target is attained (0.40 in the simulations to be reported). An extensive description of this method can be found in Sympson and Hetter (1985), Stocking (1993), and Hetter and Sympson (1997).
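The SH administration step is simple to state in code. Below is a minimal Python sketch, not the operational implementation: the function names and array layout are assumptions, and the k_i values are taken as already calibrated.

```python
import numpy as np

D = 1.7  # logistic scaling constant

def info(theta, a, b, c):
    """Fisher information of 3PL items at ability theta (vectorized)."""
    p = c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

def sympson_hetter_select(theta, a, b, c, k, unavailable, rng):
    """One SH step: walk the items in decreasing order of information,
    administering each candidate with probability k[i]; a candidate that
    fails its lottery is set aside for the remainder of this test."""
    for i in np.argsort(-info(theta, a, b, c)):
        if unavailable[i]:
            continue
        unavailable[i] = True        # administered or set aside, either way
        if rng.random() < k[i]:      # the exposure-control lottery
            return int(i)
    raise RuntimeError("item pool exhausted")
```

Because an item that fails its lottery is set aside, each test considers every item at most once, matching the description above.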

Restricted Maximum Information Method (Rk)
This was proposed as a practical alternative to the SH method. It avoids the complexities involved in the assignment of the k_i parameters. Items are selected by the MI method, but no item is allowed to be exposed in more than 100k% of the tests. When an item attains this limit it cannot be administered in the current test (Revuelta, 1995; Revuelta & Ponsoda, 1996). Suppose that a test has been administered t times, and let a_i be the number of times item i has been administered in those t tests. The exposure rate for item i is then a_i/t. The set of available items for the next test is composed only of the items with exposure rates below k. Items are then selected from this reduced item pool by the MI method. The set of items that may be administered thus changes from test to test. A particular item will be available for some tests, then will not be available; but after being unavailable for one or more tests, its quotient a_i/t will decrease and the item will again be available (when its exposure rate falls below k again). The parameter k is the maximum allowed exposure rate. The only restriction on k is that, as some items will be unable to be administered in some tests, k's value must be greater than the reciprocal of the integer quotient of bank size and test length (maximum test length, in variable-length tests), to ensure that there will be enough available items for any test application. The SH method has a similar restriction (Stocking, 1993).

Progressive Method (PR)
This method was proposed by Revuelta (1995) and Revuelta and Ponsoda (1996). It also adds to the MI method a random component, whose contribution is important at the beginning of the test and increasingly less influential as the test progresses. These are the steps to administer a new item (let h be the number of items already administered to this particular examinee) in a test with a maximum length of m items:

1. For each unused item, the information I_i at the ability estimated from the previous h items is computed. Let H be the highest information value obtained.
2. A random value R_i from the uniform (0, H) distribution is drawn for each item.
3. The relative serial position of the item is defined as s = h/m. A weight is computed for each unused item as a linear combination of the random and information components, according to the formula w_i = (1 - s)R_i + sI_i.
4. The item with the highest weight is administered.

Serial position values (s) increase linearly from 0 (for the first item to be administered) to a value close to 1 (for the last item). Since item information is multiplied by s and the random component by 1 - s, the importance of the two components changes in opposite directions as the testing session progresses: the information component gains the importance that the random component loses.

The formula was motivated by the following considerations: when applying the MI method, the contribution of the initial items to test precision is seldom great, since these items are very informative, but at ability estimates that very often differ markedly from the final estimates. It was therefore supposed that the PR method would reduce differences among items in item exposure rate, without producing a serious loss in precision, if the random component affected mainly the initial item selections.
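The Rk restriction and the PR weighting combine naturally into the PRk method introduced later in the paper. The following Python sketch is illustrative only: the bookkeeping names are assumptions (counts holds per-item administration counts over the t completed tests, updated between tests), and the item informations at the current ability are passed in precomputed, for example by the info() function in the SH sketch above. Setting k_max = 1.0 disables the restriction and recovers the pure Progressive method.

```python
import numpy as np

def progressive_restricted_select(I, h, m, counts, t, k_max, used, rng):
    """PRk step: among unused items whose running exposure rate counts[i]/t
    is below k_max, administer the item maximizing
        w_i = (1 - s) * R_i + s * I_i,  with s = h / m,
    where R_i ~ Uniform(0, H) and H is the largest available information."""
    rate = counts / max(t, 1)              # exposure rates over past tests
    available = (~used) & (rate < k_max)   # the Rk restriction
    s = h / m                              # relative serial position
    H = I[available].max()
    w = (1 - s) * rng.uniform(0.0, H, size=I.shape) + s * I
    w[~available] = -np.inf                # mask restricted or used items
    return int(np.argmax(w))
```

In the studies reported below, k_max is 0.40 (the R40 and PR40 variants) or 0.15 (PR15).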

Two different studies are reported. Study One uses a real item bank and gives an initial impression of the methods in two conditions: fixed-length and variable-length tests. In Study Two, the methods MI, SH, PR, and Rk are explored in more detail. Simulated item banks were used in this case, making possible the comparison of conditions differing in test length and in the discrimination parameters of the items.

Study One

As mentioned above, in this study the described methods are compared using a real item pool (Ponsoda, Olea, & Revuelta, 1994) designed to evaluate examinees' English vocabulary. Due to the exploratory nature of the study, no strong predictions about results can be advanced. However, it was expected that the MI method would produce the best precision and the greatest differences among items in item exposure rates, since it aims at optimal precision and does not attempt any kind of item exposure control.

Method

Item Selection
The six item selection methods described above (MM, RA, PR, R40, SH, and 1P) were compared with the MI method. A method in which items are strictly selected at random from the bank was also included as a control (CO) condition. The k_i parameters of the SH method were estimated for each condition using 0.4 as the maximum desirable exposure rate. This same value was also used for the k rate of the Rk method, so this method will be called R40 hereafter. The MI and CO methods were expected to produce the maximum and minimum ability estimation precision, respectively. With regard to item exposure control, these two methods were also expected to be the best (CO) and the poorest (MI) of the entire set.

Conditions
The eight methods were tested in two different conditions: (a) fixed test length (35 items), and (b) variable test length (the stopping rule was a standard error of ability lower than 0.22, or a maximum length of 50 items).

Procedure
The same 2000 simulees received a CAT under the two conditions and eight methods. Examinees' true ability parameters were normally distributed, N(0, 1). The initial ability of all simulated subjects at the beginning of the test was taken as zero. The program ADTEST (Ponsoda et al., 1994) was used for running the simulations. Abilities were estimated by the maximum-likelihood procedure.
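One simple way to realize the maximum-likelihood ability estimator for a 3PL response pattern is a grid search, sketched below in Python; the grid bounds and function name are assumptions, and ADTEST's actual optimizer may well differ.

```python
import numpy as np

def ml_ability(responses, a, b, c, grid=np.linspace(-4, 4, 321)):
    """Maximum-likelihood 3PL ability estimate by grid search.
    responses: 0/1 array for the administered items, whose parameters
    are in a, b, c. All-correct or all-wrong patterns have no finite
    MLE; the grid bounds cap the estimate in that case."""
    theta = grid[:, None]                                  # grid x items
    p = c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))
    loglik = (responses * np.log(p)
              + (1 - responses) * np.log(1 - p)).sum(axis=1)
    return float(grid[np.argmax(loglik)])
```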

A real item bank of 221 items was employed. The item bank was calibrated using the ASCAL program (Assessment Systems Corporation, 1988). The descriptive statistics of the bank (mean, standard deviation, minimum, and maximum) were as follows: a parameter (1.10, 0.36, 0.40, 2.06); b parameter (-0.11, 1.50, -3, 3); and c parameter (0.21, 0.06, 0.05, 0.39). Further details can be found in Ponsoda, Wise, Olea, and Revuelta (1997).

Three indices were computed to compare ability estimate precision among the methods: (a) overall bias between estimated and true ability, (b) the standard deviation of the difference between estimated and true ability (Se), and (c) the mean test length (only for variable-length tests). The simulation provided, for each simulee, the true and estimated ability. The variable D, error in ability estimation, was defined as the signed difference between the estimated and the true ability. Overall bias and Se were computed as the mean and the standard deviation of D, respectively, over the 2000 simulees.

When the 2000 simulated subjects had completed the test, the number of times each item was administered could be computed. As mentioned above, the variable "Exposure Rate", ranging from 0 to 1, is the proportion of times an item had been administered across the 2000 tests. The following indices were used to compare the methods: (a) the percentage of items never administered in the 2000 simulations, (b) the coefficient of variation (CV) of the variable "Exposure Rate", and (c) the minimum and maximum values of this variable. The exposure rate distribution, grouped in ten intervals, was also computed for each item selection method.

The CV is computed as the standard deviation of the variable exposure rate, multiplied by 100 and divided by its mean. As a measure of dispersion, it will be large when some items have high exposure rates while others are seldom used. The CV may be preferable to the variance when the exposure control methods and/or conditions to be compared differ markedly in the mean exposure rate they produce. This would be the case, for example, when comparing different test lengths.

The simulation corresponding to each condition and method was repeated five times. The same 2000 simulated subjects were used at each combination of condition and method. The variables described above were computed for each repetition. The results shown below represent the means of the five repetitions.
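These exposure indices are direct functions of the per-item administration counts; a minimal Python sketch (the function name and output layout are assumptions):

```python
import numpy as np

def exposure_summary(counts, n_tests):
    """Exposure indices for one method/condition: CV of the exposure
    rates, min and max rates (x100), and % of items never administered."""
    rates = counts / n_tests
    return {
        "CV": 100 * rates.std() / rates.mean(),
        "min_x100": 100 * rates.min(),
        "max_x100": 100 * rates.max(),
        "never_pct": 100 * np.mean(counts == 0),
    }
```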

Results

Precision
In the fixed-length condition, the MI, MM, and RA methods yielded the highest precision in ability estimation. The poorest precision was produced by the CO and 1P methods. The remaining three methods (SH, R40, and PR) yielded a precision slightly inferior to that of the best three methods (see Table 1). In the variable-length condition, the variable "test length" ("Items" column in Table 1) made clear the differences in precision among the methods. The MI, MM, and RA methods produced the shortest tests. The R40 and PR methods needed approximately three more items to reach the same degree of precision, and SH needed four. The longest tests were administered in the 1P and CO conditions. The 1P method needed on average 12 more items than the MI method. In the CO method, the standard error stopping rule was never reached, and all the simulated subjects thus received 50 items. Bias values were very small, and no clear differences in bias emerged among the methods.

Table 1
Ability Estimation Precision in the Fixed- and Variable-Length Conditions, by Method
(Columns: Bias, Se, and Items for the fixed- and variable-length conditions; rows: the methods MI, MM, RA, PR, R40, SH, 1P, and CO. The numeric entries are not recoverable from the source text.)
Note. "Bias" means the overall difference between estimated and true ability; "Se" means the standard deviation of the differences between estimated and true abilities; "Items" means the mean number of administered items. "Bias" and "Se" are multiplied by 1000. The stopping rules were two: fixed-length (35 items) and variable-length (50 items or standard error <= 0.22).

The information provided by the first item given, second item given, and so on, evaluated at the final ability estimate for each of the 2000 simulees in the fixed-length test, was computed. Figure 1 shows the mean across simulees of these information values. The results showed that the MI method administered the best items at the beginning of the test. The methods R40, SH, and RA provided nearly identical mean information values, slightly below those of the MI method, throughout the test. The Progressive method provided poor items at the beginning, but they were more informative than those provided by the MI method from the eleventh item until the end of the test. The MM method produced a curve similar to MI. Finally, the 1P method produced a curve lying midway between the MI and CO methods.

Exposure Control
As seen above, in the variable-length condition, different test lengths were obtained. Item exposure depended on test length because, as length increased, more items had to be included in the test, so that the percentage of unused items and the coefficient of variation should decrease. This circumstance would make a fair comparison among the methods with regard to item exposure rate more difficult in the variable-length condition. For this reason, only the results for the fixed-length condition are given.

[Figure 1. Mean Information by Item Position. The results for the methods RA, 1P, and MM are not shown, as they could not be easily distinguished from other curves. The plotted curves themselves are not recoverable from the source text.]

Table 2
Item Exposure Rate Distributions, by Method and Statistic, in the Fixed-Length Condition
(Columns: the methods MI, MM, RA, PR, R40, SH, 1P, and CO. Rows: the ten exposure rate intervals (x100), followed by the CV, the minimum and maximum rates (x100), and the percentage of items never administered. The numeric entries are largely not recoverable from the source text.)
Note. The entries of the table are the percentage of items in the bank at each exposure rate interval. Item pool size is 221 items. "CV" stands for coefficient of variation. Exposure rate, minimum, and maximum values are multiplied by 100.

As Table 2 shows, the MI, MM, RA, and PR methods provided the highest coefficients of variation (over 100). The R40 and SH methods showed values close to 100, and the CV of the 1P method was clearly lower. As expected, when items were strictly selected at random (CO method), item exposure rates were quite homogeneous, and the coefficient of variation was small. The methods MI, MM, RA, R40, and SH provided percentages of items never administered ranging from 15 to 25; thus an important part of the bank was never used. However, this was not the case for the other three methods in Table 2 (PR, 1P, and CO). They had a zero percentage of items never administered, indicating that all the items were used at least once. With the MI and 1P methods, the maximum values are 100: since all the simulated subjects start with the same initial estimated ability, all the tests begin with the same first item. As expected, the maximum exposure rate (x100) for the method R40 was 40. The CO method provided the highest minimum and the lowest maximum. For all the methods except two (1P and CO), around 50% of the items were grouped in the first interval, meaning that half the items of the bank had an extremely low exposure rate: they were administered in less than 10% of the tests.

Item exposure rates provided by the MI method are expected to be related to item characteristics.

Correlations of item exposure rates with item parameters were computed. Exposure rates correlated 0.80 with discrimination (p < .001) and 0.02 with difficulty (p > .05), and correlated negatively with the absolute value of difficulty (p < .01) and with pseudochance (p < .01). These results confirmed that the most popular items have high discrimination, medium difficulty, and low pseudochance levels.

Taking the results for precision and exposure rate together, we can draw some conclusions. First, the greater the precision, the greater the differences in exposure rate among items. The best methods with regard to precision are the poorest with regard to exposure rate control; inversely, low-precision methods are superior to high-precision methods in exposure rate control. Second, if test administrators are concerned only with precision, they should apply the MI or MM method; if their main concern is item exposure rate, the CO or 1P method would be the best choice. As test administrators are expected to be concerned with precision, but at the same time interested in a more or less tight control of exposure rates, some other methods may be a better choice. The main advantage of the PR method is that it increased the minimum exposure rate and reduced the number of unused items without a serious decrease in test precision. However, it produced a maximum rate that was too high. R40 and SH kept the maximum rate under control and showed adequate precision, but the minimum rate remained equal to that of the MI method.

These results suggested that a new method, resulting from combining the Progressive and Restricted methods, would perform well in precision and exposure (both in maximum and minimum rate). This method (called PRk) is a progressive one, but no item is allowed to be exposed in more than 100k% of the tests. As in the Restricted method, the set of available items is determined before each test, but the Progressive method is then applied for item selection instead of the MI method. Thus, the methods MI, PR, SH, Rk, and PRk were chosen for more detailed scrutiny.

Study Two

The results of the study just described may have been produced by specific characteristics of the simulation, such as the psychometric properties of the items, the stopping rules, the distribution of ability, and so on. This second study again compared the methods MI, PR, Rk, PRk, and SH, and attempted to find out whether the conclusions of Study One held for different item banks.

Precision and exposure rates depend on item parameters and test length. As seen above, item discrimination parameters correlated with item exposure rates when the MI method was used. Test length should also affect item exposure rates; longer tests may produce fewer exposure rate differences among items, since the less popular items may have to be administered when the bank is running out of items. No exposure rate differences among items would, of course, emerge in the extreme case of a test length equaling bank size.

Method

Conditions
Three discrimination parameter distributions, three test lengths, and two item pool sizes were considered. Test lengths were 20, 40, or 60 items. Item pool sizes were 500 or 1000 items. Discrimination parameter distributions were as follows: (a) Lognormal (-0.25, 0.5), which produced a mean of 0.93 in the item bank; (b) Lognormal (0, 0.5), mean of 1.07; and (c) Lognormal (0.25, 0.5), mean of 1.30. The methods MI, PR, Rk, PRk, SH, and CO were applied in each of these 18 (3 x 3 x 2) conditions. For the SH, Rk, and PRk methods the desired maximum rate was again fixed at 0.4; thus these last two methods will again be called R40 and PR40. Finally, the PRk method was also considered with k = 0.15. The PR15 method was added to explore its efficiency when a tight control of item exposure is needed. The mean exposure rate in the previous study's CO method was 0.15, below most typical maximum exposure rates in CAT (e.g., the rate applied by Potenza & Stocking, 1997).

Procedure
In each condition a simulated CAT was administered to the same 2000 simulated subjects, as in Study One. Initial estimated abilities were again fixed at 0. A simulated item bank was created for each of the six combinations of bank size x discrimination distribution. For all six item banks, the difficulty parameters were drawn from a normal (0, 1) distribution and the pseudochance parameters from a Beta (5, 17) distribution (which resulted in a mean of 0.23 in the bank). See Baker (1992) for a description of these parameter distributions.
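The Study Two banks can be generated directly from the stated distributions. A Python sketch (the helper name and seed are assumptions; the first lognormal argument is the mean of log a):

```python
import numpy as np

def simulate_bank(n_items, mu_log_a, rng):
    """One simulated bank: a ~ Lognormal(mu_log_a, 0.5),
    b ~ Normal(0, 1), c ~ Beta(5, 17) (mean 5/22, about 0.23)."""
    a = rng.lognormal(mean=mu_log_a, sigma=0.5, size=n_items)
    b = rng.normal(0.0, 1.0, size=n_items)
    c = rng.beta(5.0, 17.0, size=n_items)
    return a, b, c

# For example, the 500-item bank with the middle a distribution:
a, b, c = simulate_bank(500, 0.0, np.random.default_rng(1))
```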

The indices of precision and exposure rate applied in the previous study were also used here. Exposure rate distributions will not be reported. As in the first study, the simulation corresponding to each condition was repeated five times, using the same 2000 true abilities. The results shown are the means of the five repetitions.

Results

Precision was related to the method applied. The MI method was the most precise, followed by the PR, R40, PR40, and SH methods. The PR15 method was less precise, and CO was the poorest in terms of precision. It should be noted that PR40 produced Se values similar to those of R40. Increases in the a parameters and in item pool size were also related to improved precision. Finally, as expected, longer tests gave lower bias and Se values. Bias values were quite low (see Table 3), and the methods did not show noticeable differences in bias. The results seem to suggest, however, that bias is slightly higher for the control and PR15 methods.

As can be seen in Table 3 (bank size 500), with the PR15 method, test precision is lower in the 60-item than in the 40-item condition. This result was unexpected, and we do not know the reason for it. It suggests that the increase in the number of items had both a positive and a negative effect. All things being equal, longer tests have more precision. In this case, however, in some specific conditions it seemed as if longer tests were made up of poorer items than shorter tests, and the precision increase generated by their extra items could not compensate for their lower item quality. This possibility requires further exploration.

As shown in Tables 4 and 5, test length also affected the exposure rates. Longer tests gave more homogeneous rates (i.e., smaller coefficients of variation) and smaller percentages of items never administered. In almost all conditions the minimum exposure rate was 0, due to the large item pool size compared to test length. With regard to the methods, marked differences were observed. The MI method gave the highest coefficients of variation and the highest maximum rates. The PR method provided a low number of unused items, but its maximum rate increased with test length, as there was no direct control over it. The methods SH and R40 had control over the maximum rate, but the number of items never used was similar to that found with the MI method. PRk had control over the maximum rate, and its number of unused items and CVs were as low as those provided by the PR method.

By combining results on precision and exposure rates, some conclusions may be reached. The MI method gave the highest precision but the poorest exposure control. The remaining methods were less precise, while their results in exposure varied. The PR method produced the lowest number of unused items, but the maximum rate was not under control. The methods R40 and SH permitted this control but provided an unacceptably high percentage of unused items. The combined PRk method gave good results in both variables, and for some maximum exposure rates (0.40 in our case) it yielded good precision when compared to the other exposure control methods.

Table 3
Ability Estimation Precision by a Distribution, Test Length, Method, and Bank Size
(Rows: the methods MI, PR, R40, SH, PR40, PR15, and CO, crossed with test length and a distribution; columns: Bias and Se, for bank sizes 500 and 1000. The numeric entries are not recoverable from the source text.)
Note. The "Bias" and "Se" indices are multiplied by 1000. The means of the a parameters in the three distributions are 0.93 (1), 1.07 (2), and 1.30 (3).

Discussion

The current work compared the precision and item exposure rates of eight item-selection methods. Results showed that the MI method may compromise item bank security, although it would be a good choice for item selection if test peculiarities allowed us to disregard the exposure control problem. This would be the case, for example, if no examinee were to receive the test more than once, or if no examinee were expected to pass on information about the test to another examinee.

None of the considered exposure control methods is fully satisfactory. Decisions about the particular method to apply should depend on the level of exposure control required. To prevent most examinees from answering the same items at the beginning of the test, the methods MM and RA are good candidates.

Table 4
Coefficient of Variation, Minimum Rate, Maximum Rate, and Percentage of Items Never Used, by a Distribution, Test Length, and Method, in the 500 Bank Size Condition
(Rows: CV, minimum rate, maximum rate, and percentage of items never used, for the methods MI, PR, R40, SH, PR40, PR15, and CO, crossed with test length and a distribution. The numeric entries are largely not recoverable from the source text; the MI maximum rate was 100.0 in every condition.)
Note. Minimum and maximum values are multiplied by 100. The means of the a parameters in the three distributions are 0.93 (1), 1.07 (2), and 1.30 (3).

If the test administrator is mainly concerned with reducing extremely high exposure rates, then the SH or Rk methods are good choices. The SH method is widely used in operational CATs and has also been extended to structured item banks (Stocking, 1993). One difficulty of the SH method is the assignment of the k parameters. In this research, the set of k parameters in both studies had to be obtained via simulation, for each different condition, before running the reported simulations.

In the Rk method, when the first examinee receives the test, all the items are available. Items administered to the first examinee will become unavailable for the second and following examinees, until they again reach an exposure rate below k. In consequence, a possible shortcoming of the method is that ability estimation precision may be better for some examinees than for others, depending on when they take the test. However, as the SH and Rk methods did not differ in precision, it seems safe to conclude that this property of the Rk method had no consequences in the current studies.

As two reviewers pointed out, the use of the Rk method may face some difficulties if testing takes place at different CAT sites, since simultaneous control of item exposures across the different sites would be required.

Table 5
Coefficient of Variation, Minimum Rate, Maximum Rate, and Percentage of Items Never Used, by a Distribution, Test Length, and Method, in the 1000 Bank Size Condition
(Same layout as Table 4. The numeric entries are largely not recoverable from the source text; the MI maximum rate was again 100.0 in every condition.)
Note. Minimum and maximum values are multiplied by 100. The means of the a parameters in the three distributions are 0.93 (1), 1.07 (2), and 1.30 (3).

In this case several possibilities may be considered. If the whole item bank is stored in each computer administering the test, the Rk method may be applied independently on each computer. If the test is stored on a single server and transmitted to the terminals over a computer network, this server may control the exposure rate across all the terminals, without the need to count how many times each particular item is exposed at each single terminal.

The PR method produced greater control over exposure rates (except for the maximum rate) at a small cost in precision, when compared to the methods MI, MM, and RA. One advantage of the method is the flexibility offered by the possibility of manipulating its formula. The efficiency of the PR method for controlling item exposure rates may be regulated by the magnitude of the random component, which could be related to variables affecting test precision (such as the variability in the item discrimination parameters or the maximum test length). The random and information components are linearly combined to produce the weights governing item selection.

Other formulas could, of course, have been proposed. More research is needed to explore the efficiency of alternative expressions for guiding item selection.

The combined method PRk reduced the maximum exposure rates and also the number of unused items. Moreover, its precision was similar to that of the Rk method. The combined method, then, provided the best overall results.

The precision of the 1P method was clearly lower than that of the MI method, though with regard to item exposure rates it was superior to the methods discussed above. Its maximum rate was 1 since, as with the MI method, every test administered the same first item. The influence of the a parameters on item selection differed across the methods applied: they had no role in the 1P method, but they correlated (r = 0.8) with exposure rates in the MI method. As one referee pointed out, this unequal influence of the discrimination parameter may also affect the amount of bias the methods produce when (as is the case here) maximum likelihood procedures are used to estimate ability. Kim and Nicewander (1993) found that maximum likelihood estimates became more biased as the a parameter was increased from 0.5 to 1 and 2. Our data did not reveal any noticeable differences in bias among methods (the highest absolute bias found in both studies was negligible). However, a slight increase in bias was registered for the control and PR15 methods in the shortest tests, indicating that differences in bias among the methods may be discovered when other conditions are explored.

The influence of the item selection method on test validity also needs more research. Green, Bock, Humphreys, Linn, and Reckase (1984) suggested that item discrimination parameters are related to the dimensionality of the bank. Items with low discrimination parameters are less related to the trait measured by the entire bank. Methods using items with lower a (such as the 1P method) may then compromise the unidimensionality of the CAT, and therefore its validity and predictive power.

The current study did not take test overlap into account. Methods revealed as efficient for exposure rate control may not be so for exposure rate control conditional on examinee ability. Methods not including a random component, such as MI and 1P, should produce a high degree of item overlap, as they may administer the same items to same-ability examinees. The Rk and PRk methods may be easily extended to prevent overlap by computing the exposure rates in the preceding tests conditional on ability levels. Overlap should not be so problematic for those methods in which a random component is present. Further research should pay attention to this problem (Davey & Parshall, 1995; Stocking, 1993).

Non-psychometric constraints on item selection, such as item content, item type, and so on, were not considered, but they should be taken into account before applying these methods to more realistic contexts in which administered items must also conform to a test plan. For the SH method this extension has been accomplished by Stocking (1993). Other characteristics of the evaluation, such as the relationship between examinees' abilities and item difficulties, may also affect exposure rates. If the mean ability of examinees is well above (or below) the mean of the item difficulties, controlling exposure rates may be quite costly in terms of test precision, as the few informative items will be unavailable for most tests. The relationship between the two distributions, and the effects of this relationship on precision and exposure rates, need further research.

In conclusion, none of the methods proposed may yet be considered a final solution to the exposure control problem. Methods for controlling item exposure rates apply suboptimal item selection strategies in order to diminish the exposure rate differences among items produced by the MI method. Simulations with real and simulated item banks showed that the precision loss is not important for most of the methods tried. Moreover, this loss in precision may, of course, be compensated by a small increase in test length. More visible differences among methods were found regarding exposure control. The Progressive Restricted method seems to perform well on precision and exposure control, and no parameters have to be determined by previous simulations. This method is therefore a good choice for keeping maximum exposure rates under control.

References

Assessment Systems Corporation. (1988). User's manual for the MicroCAT testing system, version 3. St. Paul, MN: Author.
Baker, F. B. (1992). Item response theory: Parameter estimation techniques. New York: Marcel Dekker.
Davey, T., & Parshall, C. G. (1995, April). New algorithms for item selection and exposure control with computerized adaptive testing. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.
Dodd, B. G. (1990). The effect of item selection procedure and stepsize on computerized adaptive measurement using the rating scale model. Applied Psychological Measurement, 14.
Green, B. F., Bock, R. D., Humphreys, L. G., Linn, R. L., & Reckase, M. D. (1984). Technical guidelines for assessing computerized adaptive tests. Journal of Educational Measurement, 21.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer-Nijhoff.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Hetter, R. D., & Sympson, J. B. (1997). Item exposure control in CAT-ASVAB. In W. A. Sands, B. K. Waters, & J. R. McBride (Eds.), Computerized adaptive testing: From inquiry to operation. Washington, DC: American Psychological Association.
Hulin, C. L., Drasgow, F., & Parsons, C. K. (1983). Item response theory: Applications to psychological measurement. Homewood, IL: Dow Jones-Irwin.
Kim, J. K., & Nicewander, W. A. (1993). Ability estimation for conventional tests. Psychometrika, 58.
Kingsbury, G. G., & Zara, A. R. (1989). Procedures for selecting items for computerized adaptive tests. Applied Measurement in Education, 2.
Lewis, M. J., Subhiyah, R. G., & Morrison, C. A. (1995, April). A comparison of classification agreement between adaptive and full-length tests under the 1-PL and 2-PL models. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.
McBride, J. R., & Martin, J. T. (1983). Reliability and validity of adaptive ability tests in a military setting. In D. J. Weiss (Ed.), New horizons in testing. New York: Academic Press.

Mills, C. N., & Stocking, M. L. (1995). Practical issues in large-scale high-stakes computerized adaptive testing (Technical Report RR-95-23). Princeton, NJ: Educational Testing Service.
Morrison, C., Subhiyah, R., & Nungester, R. (1995, April). Item exposure rates for unconstrained and content-balanced computerized adaptive tests. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.
Ponsoda, V., Olea, J., & Revuelta, J. (1994). ADTEST: A computer adaptive test based on the maximum information principle. Educational and Psychological Measurement, 54.
Ponsoda, V., Wise, S. L., Olea, J., & Revuelta, J. (1997). An investigation of self-adapted testing in a Spanish high school population. Educational and Psychological Measurement, 57.
Potenza, M. T., & Stocking, M. L. (1997). Flawed items in computerized adaptive testing. Journal of Educational Measurement, 34.
Revuelta, J. (1995). El control de la exposición de los ítems en tests adaptativos informatizados [Item exposure control in computerized adaptive tests]. Unpublished master's dissertation, Universidad Autónoma de Madrid, Spain.
Revuelta, J., & Ponsoda, V. (1996). Métodos sencillos para el control de las tasas de exposición en tests adaptativos informatizados [Simple methods for item exposure control in CATs]. Psicológica, 17.
Stocking, M. L. (1993). Controlling item exposure rates in a realistic adaptive testing paradigm (Technical Report RR-93-2). Princeton, NJ: Educational Testing Service.
Stocking, M. L., & Lewis, C. (1995). A new method of controlling item exposure in computerized adaptive testing (Technical Report RR-95-25). Princeton, NJ: Educational Testing Service.
Stocking, M. L., & Swanson, L. (1993). A method for severely constrained item selection in adaptive testing. Applied Psychological Measurement, 17.
Sympson, J. B., & Hetter, R. D. (1985). Controlling item-exposure rates in computerized adaptive testing. In Proceedings of the 27th Annual Meeting of the Military Testing Association. San Diego, CA: Navy Personnel Research and Development Center.
Wainer, H. (1990). Computerized adaptive testing: A primer. Hillsdale, NJ: Lawrence Erlbaum Associates.

Authors

JAVIER REVUELTA is Associate Professor, Facultad de Psicología, Universidad Autónoma de Madrid, Canto Blanco 28049, Madrid, Spain; javier.revuelta@uam.es. Degree: PhD, University of Madrid. Specializations: computerized testing, IRT methods.

VICENTE PONSODA is Professor, Facultad de Psicología, Universidad Autónoma de Madrid, Canto Blanco 28049, Madrid, Spain; vicente.ponsoda@uam.es. Degree: PhD, University of Madrid. Specialization: computerized testing.


More information

ANXIETY A brief guide to the PROMIS Anxiety instruments:

ANXIETY A brief guide to the PROMIS Anxiety instruments: ANXIETY A brief guide to the PROMIS Anxiety instruments: ADULT PEDIATRIC PARENT PROXY PROMIS Pediatric Bank v1.0 Anxiety PROMIS Pediatric Short Form v1.0 - Anxiety 8a PROMIS Item Bank v1.0 Anxiety PROMIS

More information

Jason L. Meyers. Ahmet Turhan. Steven J. Fitzpatrick. Pearson. Paper presented at the annual meeting of the

Jason L. Meyers. Ahmet Turhan. Steven J. Fitzpatrick. Pearson. Paper presented at the annual meeting of the Performance of Ability Estimation Methods for Writing Assessments under Conditio ns of Multidime nsionality Jason L. Meyers Ahmet Turhan Steven J. Fitzpatrick Pearson Paper presented at the annual meeting

More information

Computerized Adaptive Testing With the Bifactor Model

Computerized Adaptive Testing With the Bifactor Model Computerized Adaptive Testing With the Bifactor Model David J. Weiss University of Minnesota and Robert D. Gibbons Center for Health Statistics University of Illinois at Chicago Presented at the New CAT

More information

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria

Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Comparability Study of Online and Paper and Pencil Tests Using Modified Internally and Externally Matched Criteria Thakur Karkee Measurement Incorporated Dong-In Kim CTB/McGraw-Hill Kevin Fatica CTB/McGraw-Hill

More information

PAIN INTERFERENCE. ADULT ADULT CANCER PEDIATRIC PARENT PROXY PROMIS-Ca Bank v1.1 Pain Interference PROMIS-Ca Bank v1.0 Pain Interference*

PAIN INTERFERENCE. ADULT ADULT CANCER PEDIATRIC PARENT PROXY PROMIS-Ca Bank v1.1 Pain Interference PROMIS-Ca Bank v1.0 Pain Interference* PROMIS Item Bank v1.1 Pain Interference PROMIS Item Bank v1.0 Pain Interference* PROMIS Short Form v1.0 Pain Interference 4a PROMIS Short Form v1.0 Pain Interference 6a PROMIS Short Form v1.0 Pain Interference

More information

Adaptive Estimation When

Adaptive Estimation When Adaptive Estimation When the Unidimensionality Assumption of IRT is Violated Valerie Greaud Folk Syracuse University Bert F. Green Johns Hopkins University This study examined some effects of using a unidimensional

More information

linking in educational measurement: Taking differential motivation into account 1

linking in educational measurement: Taking differential motivation into account 1 Selecting a data collection design for linking in educational measurement: Taking differential motivation into account 1 Abstract In educational measurement, multiple test forms are often constructed to

More information

INTRODUCTION TO ASSESSMENT OPTIONS

INTRODUCTION TO ASSESSMENT OPTIONS DEPRESSION A brief guide to the PROMIS Depression instruments: ADULT ADULT CANCER PEDIATRIC PARENT PROXY PROMIS-Ca Bank v1.0 Depression PROMIS Pediatric Item Bank v2.0 Depressive Symptoms PROMIS Pediatric

More information

Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD

Contents. What is item analysis in general? Psy 427 Cal State Northridge Andrew Ainsworth, PhD Psy 427 Cal State Northridge Andrew Ainsworth, PhD Contents Item Analysis in General Classical Test Theory Item Response Theory Basics Item Response Functions Item Information Functions Invariance IRT

More information

accuracy (see, e.g., Mislevy & Stocking, 1989; Qualls & Ansley, 1985; Yen, 1987). A general finding of this research is that MML and Bayesian

accuracy (see, e.g., Mislevy & Stocking, 1989; Qualls & Ansley, 1985; Yen, 1987). A general finding of this research is that MML and Bayesian Recovery of Marginal Maximum Likelihood Estimates in the Two-Parameter Logistic Response Model: An Evaluation of MULTILOG Clement A. Stone University of Pittsburgh Marginal maximum likelihood (MML) estimation

More information

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments

Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments Using Analytical and Psychometric Tools in Medium- and High-Stakes Environments Greg Pope, Analytics and Psychometrics Manager 2008 Users Conference San Antonio Introduction and purpose of this session

More information

Nonparametric DIF. Bruno D. Zumbo and Petronilla M. Witarsa University of British Columbia

Nonparametric DIF. Bruno D. Zumbo and Petronilla M. Witarsa University of British Columbia Nonparametric DIF Nonparametric IRT Methodology For Detecting DIF In Moderate-To-Small Scale Measurement: Operating Characteristics And A Comparison With The Mantel Haenszel Bruno D. Zumbo and Petronilla

More information

COMPUTING READER AGREEMENT FOR THE GRE

COMPUTING READER AGREEMENT FOR THE GRE RM-00-8 R E S E A R C H M E M O R A N D U M COMPUTING READER AGREEMENT FOR THE GRE WRITING ASSESSMENT Donald E. Powers Princeton, New Jersey 08541 October 2000 Computing Reader Agreement for the GRE Writing

More information

Adaptive EAP Estimation of Ability

Adaptive EAP Estimation of Ability Adaptive EAP Estimation of Ability in a Microcomputer Environment R. Darrell Bock University of Chicago Robert J. Mislevy National Opinion Research Center Expected a posteriori (EAP) estimation of ability,

More information

Item-Level Examiner Agreement. A. J. Massey and Nicholas Raikes*

Item-Level Examiner Agreement. A. J. Massey and Nicholas Raikes* Item-Level Examiner Agreement A. J. Massey and Nicholas Raikes* Cambridge Assessment, 1 Hills Road, Cambridge CB1 2EU, United Kingdom *Corresponding author Cambridge Assessment is the brand name of the

More information

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida

Research and Evaluation Methodology Program, School of Human Development and Organizational Studies in Education, University of Florida Vol. 2 (1), pp. 22-39, Jan, 2015 http://www.ijate.net e-issn: 2148-7456 IJATE A Comparison of Logistic Regression Models for Dif Detection in Polytomous Items: The Effect of Small Sample Sizes and Non-Normality

More information

Item Response Theory. Steven P. Reise University of California, U.S.A. Unidimensional IRT Models for Dichotomous Item Responses

Item Response Theory. Steven P. Reise University of California, U.S.A. Unidimensional IRT Models for Dichotomous Item Responses Item Response Theory Steven P. Reise University of California, U.S.A. Item response theory (IRT), or modern measurement theory, provides alternatives to classical test theory (CTT) methods for the construction,

More information

Constrained Multidimensional Adaptive Testing without intermixing items from different dimensions

Constrained Multidimensional Adaptive Testing without intermixing items from different dimensions Psychological Test and Assessment Modeling, Volume 56, 2014 (4), 348-367 Constrained Multidimensional Adaptive Testing without intermixing items from different dimensions Ulf Kroehne 1, Frank Goldhammer

More information

ANXIETY. A brief guide to the PROMIS Anxiety instruments:

ANXIETY. A brief guide to the PROMIS Anxiety instruments: ANXIETY A brief guide to the PROMIS Anxiety instruments: ADULT ADULT CANCER PEDIATRIC PARENT PROXY PROMIS Bank v1.0 Anxiety PROMIS Short Form v1.0 Anxiety 4a PROMIS Short Form v1.0 Anxiety 6a PROMIS Short

More information

CHAPTER 3 DATA ANALYSIS: DESCRIBING DATA

CHAPTER 3 DATA ANALYSIS: DESCRIBING DATA Data Analysis: Describing Data CHAPTER 3 DATA ANALYSIS: DESCRIBING DATA In the analysis process, the researcher tries to evaluate the data collected both from written documents and from other sources such

More information

Bruno D. Zumbo, Ph.D. University of Northern British Columbia

Bruno D. Zumbo, Ph.D. University of Northern British Columbia Bruno Zumbo 1 The Effect of DIF and Impact on Classical Test Statistics: Undetected DIF and Impact, and the Reliability and Interpretability of Scores from a Language Proficiency Test Bruno D. Zumbo, Ph.D.

More information

An Introduction to Missing Data in the Context of Differential Item Functioning

An Introduction to Missing Data in the Context of Differential Item Functioning A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to Practical Assessment, Research & Evaluation. Permission is granted to distribute

More information

MEANING AND PURPOSE. ADULT PEDIATRIC PARENT PROXY PROMIS Item Bank v1.0 Meaning and Purpose PROMIS Short Form v1.0 Meaning and Purpose 4a

MEANING AND PURPOSE. ADULT PEDIATRIC PARENT PROXY PROMIS Item Bank v1.0 Meaning and Purpose PROMIS Short Form v1.0 Meaning and Purpose 4a MEANING AND PURPOSE A brief guide to the PROMIS Meaning and Purpose instruments: ADULT PEDIATRIC PARENT PROXY PROMIS Item Bank v1.0 Meaning and Purpose PROMIS Short Form v1.0 Meaning and Purpose 4a PROMIS

More information

Section 5. Field Test Analyses

Section 5. Field Test Analyses Section 5. Field Test Analyses Following the receipt of the final scored file from Measurement Incorporated (MI), the field test analyses were completed. The analysis of the field test data can be broken

More information

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report. Assessing IRT Model-Data Fit for Mixed Format Tests

Center for Advanced Studies in Measurement and Assessment. CASMA Research Report. Assessing IRT Model-Data Fit for Mixed Format Tests Center for Advanced Studies in Measurement and Assessment CASMA Research Report Number 26 for Mixed Format Tests Kyong Hee Chon Won-Chan Lee Timothy N. Ansley November 2007 The authors are grateful to

More information

Determining Differential Item Functioning in Mathematics Word Problems Using Item Response Theory

Determining Differential Item Functioning in Mathematics Word Problems Using Item Response Theory Determining Differential Item Functioning in Mathematics Word Problems Using Item Response Theory Teodora M. Salubayba St. Scholastica s College-Manila dory41@yahoo.com Abstract Mathematics word-problem

More information

Empirical Formula for Creating Error Bars for the Method of Paired Comparison

Empirical Formula for Creating Error Bars for the Method of Paired Comparison Empirical Formula for Creating Error Bars for the Method of Paired Comparison Ethan D. Montag Rochester Institute of Technology Munsell Color Science Laboratory Chester F. Carlson Center for Imaging Science

More information

An Alternative to the Trend Scoring Method for Adjusting Scoring Shifts. in Mixed-Format Tests. Xuan Tan. Sooyeon Kim. Insu Paek.

An Alternative to the Trend Scoring Method for Adjusting Scoring Shifts. in Mixed-Format Tests. Xuan Tan. Sooyeon Kim. Insu Paek. An Alternative to the Trend Scoring Method for Adjusting Scoring Shifts in Mixed-Format Tests Xuan Tan Sooyeon Kim Insu Paek Bihua Xiang ETS, Princeton, NJ Paper presented at the annual meeting of the

More information

Nearest-Integer Response from Normally-Distributed Opinion Model for Likert Scale

Nearest-Integer Response from Normally-Distributed Opinion Model for Likert Scale Nearest-Integer Response from Normally-Distributed Opinion Model for Likert Scale Jonny B. Pornel, Vicente T. Balinas and Giabelle A. Saldaña University of the Philippines Visayas This paper proposes that

More information

3 CONCEPTUAL FOUNDATIONS OF STATISTICS

3 CONCEPTUAL FOUNDATIONS OF STATISTICS 3 CONCEPTUAL FOUNDATIONS OF STATISTICS In this chapter, we examine the conceptual foundations of statistics. The goal is to give you an appreciation and conceptual understanding of some basic statistical

More information

PHYSICAL FUNCTION A brief guide to the PROMIS Physical Function instruments:

PHYSICAL FUNCTION A brief guide to the PROMIS Physical Function instruments: PROMIS Bank v1.0 - Physical Function* PROMIS Short Form v1.0 Physical Function 4a* PROMIS Short Form v1.0-physical Function 6a* PROMIS Short Form v1.0-physical Function 8a* PROMIS Short Form v1.0 Physical

More information

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note

Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1. John M. Clark III. Pearson. Author Note Running head: NESTED FACTOR ANALYTIC MODEL COMPARISON 1 Nested Factor Analytic Model Comparison as a Means to Detect Aberrant Response Patterns John M. Clark III Pearson Author Note John M. Clark III,

More information

The Matching Criterion Purification for Differential Item Functioning Analyses in a Large-Scale Assessment

The Matching Criterion Purification for Differential Item Functioning Analyses in a Large-Scale Assessment University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Educational Psychology Papers and Publications Educational Psychology, Department of 1-2016 The Matching Criterion Purification

More information

International Journal of Education and Research Vol. 5 No. 5 May 2017

International Journal of Education and Research Vol. 5 No. 5 May 2017 International Journal of Education and Research Vol. 5 No. 5 May 2017 EFFECT OF SAMPLE SIZE, ABILITY DISTRIBUTION AND TEST LENGTH ON DETECTION OF DIFFERENTIAL ITEM FUNCTIONING USING MANTEL-HAENSZEL STATISTIC

More information

Influences of IRT Item Attributes on Angoff Rater Judgments

Influences of IRT Item Attributes on Angoff Rater Judgments Influences of IRT Item Attributes on Angoff Rater Judgments Christian Jones, M.A. CPS Human Resource Services Greg Hurt!, Ph.D. CSUS, Sacramento Angoff Method Assemble a panel of subject matter experts

More information

Reliability & Validity Dr. Sudip Chaudhuri

Reliability & Validity Dr. Sudip Chaudhuri Reliability & Validity Dr. Sudip Chaudhuri M. Sc., M. Tech., Ph.D., M. Ed. Assistant Professor, G.C.B.T. College, Habra, India, Honorary Researcher, Saha Institute of Nuclear Physics, Life Member, Indian

More information

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods James Madison University JMU Scholarly Commons Department of Graduate Psychology - Faculty Scholarship Department of Graduate Psychology 3-008 Scoring Multiple Choice Items: A Comparison of IRT and Classical

More information

Development, Standardization and Application of

Development, Standardization and Application of American Journal of Educational Research, 2018, Vol. 6, No. 3, 238-257 Available online at http://pubs.sciepub.com/education/6/3/11 Science and Education Publishing DOI:10.12691/education-6-3-11 Development,

More information

PHYSICAL STRESS EXPERIENCES

PHYSICAL STRESS EXPERIENCES PHYSICAL STRESS EXPERIENCES A brief guide to the PROMIS Physical Stress Experiences instruments: PEDIATRIC PROMIS Pediatric Bank v1.0 - Physical Stress Experiences PROMIS Pediatric Short Form v1.0 - Physical

More information

CHAPTER 3 METHOD AND PROCEDURE

CHAPTER 3 METHOD AND PROCEDURE CHAPTER 3 METHOD AND PROCEDURE Previous chapter namely Review of the Literature was concerned with the review of the research studies conducted in the field of teacher education, with special reference

More information

Identifying Non- Effortful Student Behavior on Adaptive Tests: Steven L. Wise, Lingling Ma, and Robert A. Theaker. Northwest Evaluation Association

Identifying Non- Effortful Student Behavior on Adaptive Tests: Steven L. Wise, Lingling Ma, and Robert A. Theaker. Northwest Evaluation Association 1 Identifying Non- Effortful Student Behavior on Adaptive Tests: Implications for Test Fraud Detection Steven L. Wise, Lingling Ma, and Robert A. Theaker Northwest Evaluation Association Paper presented

More information

EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN

EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN EFFECTS OF OUTLIER ITEM PARAMETERS ON IRT CHARACTERISTIC CURVE LINKING METHODS UNDER THE COMMON-ITEM NONEQUIVALENT GROUPS DESIGN By FRANCISCO ANDRES JIMENEZ A THESIS PRESENTED TO THE GRADUATE SCHOOL OF

More information

Research Prospectus. Your major writing assignment for the quarter is to prepare a twelve-page research prospectus.

Research Prospectus. Your major writing assignment for the quarter is to prepare a twelve-page research prospectus. Department of Political Science UNIVERSITY OF CALIFORNIA, SAN DIEGO Philip G. Roeder Research Prospectus Your major writing assignment for the quarter is to prepare a twelve-page research prospectus. A

More information

Biserial Weights: A New Approach

Biserial Weights: A New Approach Biserial Weights: A New Approach to Test Item Option Weighting John G. Claudy American Institutes for Research Option weighting is an alternative to increasing test length as a means of improving the reliability

More information

Using the Testlet Model to Mitigate Test Speededness Effects. James A. Wollack Youngsuk Suh Daniel M. Bolt. University of Wisconsin Madison

Using the Testlet Model to Mitigate Test Speededness Effects. James A. Wollack Youngsuk Suh Daniel M. Bolt. University of Wisconsin Madison Using the Testlet Model to Mitigate Test Speededness Effects James A. Wollack Youngsuk Suh Daniel M. Bolt University of Wisconsin Madison April 12, 2007 Paper presented at the annual meeting of the National

More information

Fundamental Clinical Trial Design

Fundamental Clinical Trial Design Design, Monitoring, and Analysis of Clinical Trials Session 1 Overview and Introduction Overview Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics, University of Washington February 17-19, 2003

More information

Likert Scaling: A how to do it guide As quoted from

Likert Scaling: A how to do it guide As quoted from Likert Scaling: A how to do it guide As quoted from www.drweedman.com/likert.doc Likert scaling is a process which relies heavily on computer processing of results and as a consequence is my favorite method

More information

A Comparison of Four Test Equating Methods

A Comparison of Four Test Equating Methods A Comparison of Four Test Equating Methods Report Prepared for the Education Quality and Accountability Office (EQAO) by Xiao Pang, Ph.D. Psychometrician, EQAO Ebby Madera, Ph.D. Psychometrician, EQAO

More information

Noncompensatory. A Comparison Study of the Unidimensional IRT Estimation of Compensatory and. Multidimensional Item Response Data

Noncompensatory. A Comparison Study of the Unidimensional IRT Estimation of Compensatory and. Multidimensional Item Response Data A C T Research Report Series 87-12 A Comparison Study of the Unidimensional IRT Estimation of Compensatory and Noncompensatory Multidimensional Item Response Data Terry Ackerman September 1987 For additional

More information

Item Response Theory: Methods for the Analysis of Discrete Survey Response Data

Item Response Theory: Methods for the Analysis of Discrete Survey Response Data Item Response Theory: Methods for the Analysis of Discrete Survey Response Data ICPSR Summer Workshop at the University of Michigan June 29, 2015 July 3, 2015 Presented by: Dr. Jonathan Templin Department

More information

Designing small-scale tests: A simulation study of parameter recovery with the 1-PL

Designing small-scale tests: A simulation study of parameter recovery with the 1-PL Psychological Test and Assessment Modeling, Volume 55, 2013 (4), 335-360 Designing small-scale tests: A simulation study of parameter recovery with the 1-PL Dubravka Svetina 1, Aron V. Crawford 2, Roy

More information

Examining Factors Affecting Language Performance: A Comparison of Three Measurement Approaches

Examining Factors Affecting Language Performance: A Comparison of Three Measurement Approaches Pertanika J. Soc. Sci. & Hum. 21 (3): 1149-1162 (2013) SOCIAL SCIENCES & HUMANITIES Journal homepage: http://www.pertanika.upm.edu.my/ Examining Factors Affecting Language Performance: A Comparison of

More information

SESUG '98 Proceedings

SESUG '98 Proceedings Generating Item Responses Based on Multidimensional Item Response Theory Jeffrey D. Kromrey, Cynthia G. Parshall, Walter M. Chason, and Qing Yi University of South Florida ABSTRACT The purpose of this

More information

Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data

Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data Karl Bang Christensen National Institute of Occupational Health, Denmark Helene Feveille National

More information

Journal of Political Economy, Vol. 93, No. 2 (Apr., 1985)

Journal of Political Economy, Vol. 93, No. 2 (Apr., 1985) Confirmations and Contradictions Journal of Political Economy, Vol. 93, No. 2 (Apr., 1985) Estimates of the Deterrent Effect of Capital Punishment: The Importance of the Researcher's Prior Beliefs Walter

More information

Smoking Social Motivations

Smoking Social Motivations Smoking Social Motivations A brief guide to the PROMIS Smoking Social Motivations instruments: ADULT PROMIS Item Bank v1.0 Smoking Social Motivations for All Smokers PROMIS Item Bank v1.0 Smoking Social

More information

Saville Consulting Wave Professional Styles Handbook

Saville Consulting Wave Professional Styles Handbook Saville Consulting Wave Professional Styles Handbook PART 4: TECHNICAL Chapter 19: Reliability This manual has been generated electronically. Saville Consulting do not guarantee that it has not been changed

More information

Applying the Minimax Principle to Sequential Mastery Testing

Applying the Minimax Principle to Sequential Mastery Testing Developments in Social Science Methodology Anuška Ferligoj and Andrej Mrvar (Editors) Metodološki zvezki, 18, Ljubljana: FDV, 2002 Applying the Minimax Principle to Sequential Mastery Testing Hans J. Vos

More information